Clinging Way Harder to Bash than Anyone Ever Should


what I learned doing things the hard way when doing them the easy way
Jan Hensel
2024-02-25

How it started

I've been working on a paper about spam on GitHub.

When it came to gathering some data I figured, as you do in academia, it doesn't have to be pretty: I'll just bang something together in Bash with curl and jq and all the assorted Unix tooling, and I'll get my data out of it just fine.

Normally when I have a thought like this, it turns out, uh oh, things are way more complicated than they seem and, uh oh, the complexity demons invade my peaceful existence. But this is not actually what happened this time! This time I had it right, gathering the data was kind of a breeze and all it really took was gluing some scripts together and playing around with the gathered data a bit and hitting the 5K-request rate limit a few times (predictably).

But when the complexity demons keep their distance it is not because they have lost, but because they know they have already won the battle in your mind, long ago.

Perhaps because of SWE conditioning, perhaps banal perfectionism and an irresistible urge to tinker with any problem I'm working on until it breaks, I felt the need to make my code more seamless (less seamful?), more integrated. I tell myself that I already did 90% of the work by getting it to work and the last few steps of integration would be disproportionately valuable when having to re-use this in the future. Valuable to whom, and in what future will they use my hacky scripts to make GitHub REST API calls? 🤷 Oh well...

So I set out to make it all work smoothly. But other demons have invaded my mind as well, and they tell me that I cannot justify rewriting my GitHub API scraping project in a proper language™ like Go. While they are sometimes defeated by the "it would be so much easier to rewrite this from scratch" demons, this time they persevered and so I set out to just try to gently mould the existing state of my scripts into something that I can just set off and forget and it will reliably get me the data from the API without making GitHub ban my account for spamming them with a trillion requests or something.

How it's going

GH_REQ_LOG="${GH_REQ_LOG:-false}"
GH_REQ_LOG_LOCK="/tmp/gh_req_log.lock"
ghreqlog() {
  if [ "${GH_REQ_LOG}" = "false" ]; then
    return
  fi
  MSG="${1}" flock "${GH_REQ_LOG_LOCK}" \
    --command "echo \"\$MSG\" >> ${GH_REQ_LOG}"
}

I think the compulsion to write a function like this in a Bash script is a pretty surefire way to know you should probably have switched to a proper language™ a while back. Just to be clear, I don't think this is some mighty-impressive Bash-wizardry, just a misguided attempt by a misguided young man to reconcile all his poor decisions. And it worked!

So what's happening here, exactly? This is just a logging function for a bit of debugging, so you call it like ghreqlog "it is working" and that's it, but it doesn't log to STDOUT but instead to the specified log file $GH_REQ_LOG, assuming that one was specified (the ${FOO:-bar} syntax means that if $FOO is unset or empty, use the literal bar, so we have the logging disabled by default).
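
Just to make that concrete, usage ends up looking something like this (the script name, log path, and variables here are made up for illustration):

# enable request logging for one run of the (hypothetical) gathering script
GH_REQ_LOG="/tmp/gh_req.log" ./gather_userinfo.sh

# ...and inside the request-making functions, sprinkle lines like
ghreqlog "GET /users/${user} -> ${status}"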

But that's not all, is it...? What's this flock business?

Ah well, you see, that's where my genius comes in. If I made GitHub API calls sequentially, one after the other, each taking, say 0.2 seconds, it would take minutes to get 5000 API calls done, over sixteen minutes in fact! Instead, it is much more efficient to spend at least an hour crafting the scripts such that they leverage the power of parallel and dive into the realm of concurrency (or is it parallelism?). That way, my requests will be done in, well, also minutes, but fewer minutes most likely. Plus it enables us to hit the secondary rate limit on the GitHub REST API, something normally reserved for DDoSers and those who wrote their code so poorly they accidentally seemed as such... well, that's probably been me at some point in this endeavour.

So, of course, concurrency is where all the trouble begins, not necessarily with doing things at the same time, but with having to synchronize between them after all! At first it always seems like you can set it all off and forget it but then there is that one pesky little edge case, always.

parallel is a neat tool, and one of its less practical but very cute features is --bar, which simply displays a progress bar across the width of your terminal. Realistically, my goal here was less some lofty, pristine, error-free data-gathering framework that could be used for generations to come and more just to see the little bar make it smoothly all the way across my screen without manual intervention.
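
(If you just want to watch the bar, completely divorced from any GitHub business, a toy invocation like this does the trick; every fake job just sleeps for a moment.)

# 50 jobs that do nothing but sleep, so the bar can crawl across the terminal
seq 1 50 | parallel --bar 'sleep 0.2; echo {} > /dev/null'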

To make that happen, first let's look at what would happen if we just let it run. To be clear, the central part of my scripts that ends up invoking requests is essentially this:

# (get_userinfo_from_github is a shell function, so it needs to be
#  export -f'd for parallel to be able to call it)
cat "$list_of_usernames" \
  | parallel --bar --halt soon,fail=1 get_userinfo_from_github

By default parallel actually just runs through all your data, even if some commands fail, but for talking to the API I felt it best to abort mission once something truly went wrong, to avoid getting API-banned because I made the same request 10K times in 8 seconds or something. So we --halt soon,fail=1, meaning we halt when 1 job fails, but only "soon": any already-running jobs get to finish (to avoid half-written response objects, I suppose).
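
(A toy illustration of the flag, nothing GitHub-specific: run two jobs at a time, let job 3 fail, and watch parallel finish the jobs already in flight, start no new ones, and exit non-zero.)

# the failure of job 3 stops any further jobs from being started
seq 1 10 | parallel -j 2 --halt soon,fail=1 'sleep 1; [ {} -ne 3 ] && echo ok {}'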

So what can go wrong? Of course, we could simply not have any connectivity to GitHub but let's set that aside for a second and assume we do. The user we are requesting could certainly have removed their account at this point; these are people targeted in spam attacks (that GitHub does woefully little to effectively prevent) so for that reason or any other they may just call it quits or get banned or something. In that case we get a 404 back from the API. Besides that, we could exceed a rate limit, in which case we get back a 403 (or, according to the docs, a 429, which I have never gotten).

Our interest, foolishly, is to catch these error cases correctly and handle them appropriately. Of course, there is not much to do when a user removes their account, and simply tolerating this case, with a warning back to us that there was a 404, is probably the best we can do. But when we get rate-limited, we have a couple of competing interests.
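
To sketch what that might look like inside the request-making function (a rough reconstruction for illustration, not my verbatim script: GITHUB_TOKEN, the users/ output directory, and the exit-code convention are all assumptions):

get_userinfo_from_github() {
  local user="${1}"
  local headers body status
  headers="$(mktemp)"
  body="$(mktemp)"
  # -w '%{http_code}' hands us the status code; the headers land in a temp file
  status="$(curl -sS -o "${body}" -D "${headers}" -w '%{http_code}' \
    -H "Authorization: Bearer ${GITHUB_TOKEN}" \
    "https://api.github.com/users/${user}")"
  case "${status}" in
    200) cat "${body}" > "users/${user}.json" ;;
    404) echo "warning: ${user} seems to be gone (404)" >&2 ;;  # tolerate
    403|429)
      # retry-after / x-ratelimit-reset in "${headers}" say how long to wait
      echo "rate limited while fetching ${user} (${status})" >&2
      rm -f "${headers}" "${body}"
      return 75 ;;  # arbitrary "try again later" exit code
    *)
      echo "unexpected ${status} for ${user}" >&2
      rm -f "${headers}" "${body}"
      return 1 ;;
  esac
  rm -f "${headers}" "${body}"
}

A non-zero return here is what parallel's --halt ends up reacting to.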

If we lived in a sequential world (while read user ; do get_info $user ; done) we could just communicate the problem via the exit code and, in the case of a rate limit, take the necessary wait right there and retry, hardly so bad. But we are parallelized, and there's really no way to communicate to parallel that "if this command fails with this exit state wait with any future commands until this long, if it fails in this other way, wait for that long, if it fails like this, that's not even a failure so just keep going, and if it fails any other way then abort". (At least I think there isn't any way to do this, but at this point nothing surprises me about these tools anymore.)
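
In that sequential fantasy, sticking with the made-up exit-code convention from the sketch above, the handling really is about this simple:

# exit code 75 means "rate limited": wait it out and retry the same user
while read -r user; do
  until get_userinfo_from_github "${user}"; do
    if [ "$?" -eq 75 ]; then
      sleep 60  # or however long the reset headers say
    else
      break     # some other failure: give up on this user and move on
    fi
  done
done < "${list_of_usernames}"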

So what we ought to do is handle this gracefully in our request-making function. On the level of an individual request there's nothing much more to do than if it_went_wrong_and_you_should_wait ; then wait ; fi. But the individual instance of the request-making function is not, in fact, unrelated to all the other instances of it, due to the rate limit. So there is this pesky hidden fact that says we actually cannot consider these otherwise unrelated actions as unrelated. If we tried to see them this way, one instance might hit the rate limit and realize it should wait (just fine), and the next one, started at the same time, also hits it and waits (just fine), but eventually parallel starts a fresh instance that makes yet another request, subject to the same rate limit, without ever having been told that it should actually wait rather than fire off yet another request that might get us banned.

So what is to be done? Of course, to synchronize a bit between these requests, i.e., give those later requests access to the information that we're currently on hold so as not to anger the API overlords. We'll want some sort of indicator that tells future requests to cool it until $later[1].

The way we tell ourselves this could just be a file, couldn't it? After all, even though we are a bunch of programs executing in parallel, we do share a file system. This is true, and it would probably even work in practice (as far as I can see) because we wouldn't expect the GitHub API to tell us significantly different reset-times in quick succession. But it would also be oh so very wrong, because we are using no mechanism to synchronize. Technically there are plenty of mechanisms that synchronize things when we use the file system on Linux, just not on the level of our program.
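
Concretely, the naive version would look roughly like this (made-up file name, reset_time standing in for whatever the rate-limit response told us, the three pieces living at different points of the request function, and deliberately no locking anywhere, which is exactly what's about to bite us):

COOLDOWN_FILE="/tmp/gh_cooldown"  # holds a unix timestamp: "stay quiet until then"

# before a request: honour an existing warning, if there is one
if [ -f "${COOLDOWN_FILE}" ]; then
  now="$(date +%s)"
  until_ts="$(cat "${COOLDOWN_FILE}")"
  if [ "${until_ts}" -gt "${now}" ]; then
    sleep "$(( until_ts - now ))"
  fi
fi

# on getting rate limited: leave a warning for everybody else
echo "${reset_time}" > "${COOLDOWN_FILE}"

# on the next success: declare the all-clear -- and this unconditional
# removal is exactly where the trouble starts
rm -f "${COOLDOWN_FILE}"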

Let's say we get told by the API "too fast, don't talk to me again until 2PM" and note this down in a file so the other programs know, and we wait, and by 2PM we talk again and it works, so we tell all the other programs it's all clear again (by removing the file), and it should all be just fine, right? Wrong, of course, because if, after we wake up and make our request but before we remove the file, the API tells another instance of our program to wait until 3PM, that program stores this to the file, and then we immediately remove it! Now we might spam another 20-30 API calls and get that much closer to being banned!

And if you say "ah, but you just need to read the file to see if the value is still 2PM, so you know you're only removing the relevant warning", the situation is actually no different at all, as this shows:

  A                    B                    C
-------------------  ------------------  -----------------

make request ❌
hear 2PM
no warning exists
write warning 2PM
sleep til 2PM
make request ✅
read warning         make request ❌
                     hear 3PM
(stalls for          read warning
 whatever reason)    warning says 2PM
                     write 3PM warning
remove warning now
thinking it was 2PM
                                         sees no warning
                                         gets us banned T.T

It's textbook, and the textbook also has a solution: Atomicity! We need to know that whatever we read has not been changed by the time we write. If somebody tries to write after we read but before we wrote, we want them to be told to wait for a moment until we did our write.

And that's what flock gives us! With flock, we can designate a file in the file system as a lock and then do any operation under it; however long we take, anybody else who tries to do something that also has to claim the lock will be made to wait until we are done with our thing. Of course, this only works when you remember to actually take the (right) lock in all the places where it matters, which is the reason that concurrency is so hard.
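
Applied to our cooldown file, a flock'd version might look something like this (again a sketch with made-up names, reusing the hypothetical cooldown file from above):

COOLDOWN_LOCK="/tmp/gh_cooldown.lock"
export COOLDOWN_FILE="/tmp/gh_cooldown"  # exported so the flock'd shells see it

# record a warning, but only ever push the deadline further into the future
set_cooldown() {
  NEW_TS="${1}" flock "${COOLDOWN_LOCK}" --command '
    old=0
    if [ -f "${COOLDOWN_FILE}" ]; then old="$(cat "${COOLDOWN_FILE}")"; fi
    if [ "${NEW_TS}" -gt "${old}" ]; then
      echo "${NEW_TS}" > "${COOLDOWN_FILE}"
    fi
  '
}

# clear the warning only if it is still the exact one we waited out
clear_cooldown() {
  EXPECTED_TS="${1}" flock "${COOLDOWN_LOCK}" --command '
    if [ -f "${COOLDOWN_FILE}" ] \
       && [ "$(cat "${COOLDOWN_FILE}")" = "${EXPECTED_TS}" ]; then
      rm -f "${COOLDOWN_FILE}"
    fi
  '
}

In the timeline above, B's "write 3PM warning" and A's "remove warning" would both happen under the lock, and A's removal is conditional on the file still saying 2PM, so C never gets to sneak past a warning that was cleared by mistake.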

Author's note:

I might extend this post in the future, or write a follow-up; for now, I'm happy to get it off my TODO list.


  1. With the GitHub API, when you hit the secondary rate limit you get a retry-after HTTP header which you should adhere to (I think it always says 60, meaning sixty seconds), while for the primary rate limit you should listen to the x-ratelimit-reset header. ↩