September 14, 2020
Fun Bugs: Episode I
Introduction
I’ve been meaning to start a series about the kind of bugs we find in production on the CloudBuild Cache. For a little bit of context, our team runs over 40k machines across the US, with over 85 PB of total storage capacity, and serves as a caching layer for lots of things: source code, NuGet packages, build artifacts, and so on. We basically provide a P2P cache built specifically for builds within CloudBuild, though the current implementation is nothing like the paper.
The scale and sheer number of operations we run means that every nook and cranny of the code ends up running in parallel at different times, which usually leads to really interesting bugs. The aim of this “series” is to show those bugs to others, so that they may hopefully get a laugh, maybe even learn something (and also so I get more fluent at writing).
For this first one, I’m choosing an entertaining bug we found last week. Admittedly, it’s not the craziest thing ever, but it’s still a fun one. In any case, enjoy!
The Disposal Race Condition
We use a Redis cluster as a low-bandwidth fast path for some of our data. This Redis instance is rather crucial: not being able to fetch data from it means we have to wait for data to sync via the high-bandwidth slow path (which can take up to 5 minutes, versus the few milliseconds it takes to talk to Redis). To communicate with Redis, we use the marvelous StackExchange.Redis client.
The way you are supposed to use the library looks something like this:
```csharp
// Create a single ConnectionMultiplexer and reuse it for the
// lifetime of the application.
var multiplexer = ConnectionMultiplexer.Connect("localhost:6379");
IDatabase db = multiplexer.GetDatabase();
await db.StringSetAsync("greeting", "hello");
```
The problem is that sometimes this `ConnectionMultiplexer` starts acting up, and the instances it returns end up being unusable. We tried hunting down exactly what was going on, but it’s relatively hard to catch a machine while it’s happening, because the problem often resolves itself out of nowhere after a few minutes of wreaking havoc.
For us, not having a connection to Redis can mean a failed build, or worse, multiple failed builds (and we strive to keep a 99.9% build success rate, so we really can’t afford those!). Since the connection to Redis is critical, we decided to implement two mechanisms:
- If we start seeing the error signatures that have empirically correlated with a broken `ConnectionMultiplexer` in the past, re-create it after some number of failures.
- After re-creating it more than X times in a short period of time, just force the service to restart.
Essentially, this:
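Sketched out (the thresholds, helper names, and counters here are illustrative, not our actual production code), the two mechanisms look roughly like this:

```csharp
// Hypothetical sketch of the two mitigation mechanisms. FailureThreshold,
// RecreationThreshold, LooksLikeBrokenMultiplexer and RecreateMultiplexer
// are stand-ins for illustration.
private int _consecutiveFailures;
private int _recentRecreations;

private void OnRedisError(Exception e)
{
    if (!LooksLikeBrokenMultiplexer(e)) { return; }

    // Mechanism 1: after enough failures with the known-bad signature,
    // tear down and re-create the ConnectionMultiplexer.
    if (Interlocked.Increment(ref _consecutiveFailures) >= FailureThreshold)
    {
        RecreateMultiplexer();
        Interlocked.Exchange(ref _consecutiveFailures, 0);

        // Mechanism 2: if we keep having to re-create it within a short
        // window, assume something is deeply wrong and restart the service,
        // letting the service manager bring us back up.
        if (Interlocked.Increment(ref _recentRecreations) >= RecreationThreshold)
        {
            Environment.Exit(1);
        }
    }
}
```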
This “worked”: those errors stopped happening, service restarts were relatively rare (which meant that mechanism (1) was doing its job), and we were all happy. Alas, it was not to last!
All of a sudden, we started seeing a lot of `ObjectDisposedException`s in places where we were using the `ConnectionMultiplexer` instance. Enough experience with race conditions made us naturally wary of our “innovative” mitigation strategy. As it turns out, this code fragment is essentially all you need to understand what’s going wrong:
```csharp
 1:  private ConnectionMultiplexer GetMultiplexer() {
 2:      if (_resetMultiplexerCts.IsCancellationRequested) {
 3:          lock (_resetLock) {
 4:              if (_resetMultiplexerCts.IsCancellationRequested) {
 5:                  _resetMultiplexerCts = new CancellationTokenSource();
 6:                  _multiplexer.Close();
 7:                  _multiplexer = ConnectionMultiplexer.Connect(_configuration);
 8:              }
 9:          }
10:      }
11:      return _multiplexer;
12:  }
```

(Code simplified for the post.)
In our code, we’d call `GetDatabase()` to fetch a pooled connection to Redis. When the need arose to reset the `ConnectionMultiplexer` instances, we basically did it by just adding a `CancellationTokenSource`:
- Whenever an operation finds that the current instance is broken, it calls `_resetMultiplexerCts.Cancel()`.
- Any new operation that needs to use the multiplexer will realize this (as per the code above), and re-create it before moving on.
The pattern on lines 2-4 is a double-checked lock. The intention is that only the first operation that realizes the multiplexer is broken does the reset. All operations that come after will wait for it to complete before actually attempting to use it.
The big issue here is ordering: we re-create the `CancellationTokenSource` on line 5. The new source is not cancelled, which means that any new operation that starts after that line will evaluate `_resetMultiplexerCts.IsCancellationRequested` as false, even though we haven’t actually created the new multiplexer yet (that happens on line 7).
Hence, we can get the following interleaving:
- Thread 1 triggers a reset by calling `_resetMultiplexerCts.Cancel()`.
- Thread 2 starts an operation, takes the lock, and runs up to line 5 inclusive.
- Thread 3 starts an operation, but sees the new `_resetMultiplexerCts`, skips the entire if statement in line 2, and takes a reference to the old multiplexer without actually using it yet.
- Thread 2 continues execution by closing the old connection multiplexer and creating the replacement.
- Thread 3 attempts to use the old multiplexer, triggering an `ObjectDisposedException`.
Since we do A TON of Redis operations per second, the Thread 3 scenario played out more frequently than you’d expect. The fix was, essentially, this:
```csharp
private ConnectionMultiplexer GetMultiplexer() {
    if (_resetMultiplexerCts.IsCancellationRequested) {
        lock (_resetLock) {
            if (_resetMultiplexerCts.IsCancellationRequested) {
                _multiplexer.Close();
                _multiplexer = ConnectionMultiplexer.Connect(_configuration);
                // Publish the fresh (un-cancelled) source only after the new
                // multiplexer is in place, so concurrent operations can never
                // observe a new token alongside the old, disposed multiplexer.
                Volatile.Write(ref _resetMultiplexerCts, new CancellationTokenSource());
            }
        }
    }
    return _multiplexer;
}
```
So, essentially, we avoid re-creating `_resetMultiplexerCts` until we actually have a working instance. The memory barrier introduced by `Volatile.Write` keeps the assignment of `_resetMultiplexerCts` from being reordered with the creation or assignment of the new multiplexer.
One could argue that this code is suboptimal, in the sense that we wait for the old multiplexer to close instead of creating the new one first and doing the shutdown after we have unblocked all concurrent operations. But keep in mind this is a stateful third-party library we’re dealing with, and we certainly didn’t want to fall victim to “oh, we’re caching some random thing under the hood, so BAZINGA!” kinds of bugs.