September 14, 2020
Fun Bugs: Episode I
Introduction
I’ve been meaning to start a series about the kind of bugs we find in production on the CloudBuild Cache. For a little bit of context, our team runs over 40k machines across the US, with over 85 PB of total storage capacity, and serves as a caching layer for lots of things: source code, NuGet packages, build artifacts, and so on. We basically provide a P2P cache built specifically for builds within CloudBuild, though the current implementation is nothing like the paper.
The scale and sheer number of operations we run means that every nook and cranny of the code ends up running in parallel at different times, which usually leads to really interesting bugs. The aim of this “series” is to show those bugs to others, so that they may hopefully get a laugh, maybe even learn something (and also so I get more fluent at writing).
For this first one, I’m choosing an entertaining bug we found last week. Admittedly, it’s not the craziest thing ever, but it’s still a fun one. In any case, enjoy!
The Disposal Race Condition
We use a Redis cluster as a low-bandwidth fast path for some of our data. This Redis instance is rather crucial: not being able to fetch data from it means we have to wait for data to sync via the high-bandwidth slow path (which can take up to 5 minutes, versus the few milliseconds it takes to talk to Redis). To communicate with Redis, we use the marvelous StackExchange.Redis client.
The way you are supposed to use the library looks something like this:
```csharp
// Create a single ConnectionMultiplexer and reuse it for the
// lifetime of the application.
var multiplexer = ConnectionMultiplexer.Connect("localhost:6379");
IDatabase db = multiplexer.GetDatabase();
await db.StringSetAsync("greeting", "hello");
```
The problem is that sometimes this `ConnectionMultiplexer` starts acting up, and the instances it returns end up being unusable. We tried hunting down exactly what was going on, but it’s relatively hard to catch a machine while it’s happening, because the problem often resolves itself out of nowhere after a few minutes of wreaking havoc.
For us, not having a connection to Redis can mean a failed build, or worse, multiple failed builds (and we strive to keep a 99.9% build success rate, so we really can’t afford those!). Since the connection to Redis is critical, we decided to implement two mechanisms:
- If we start seeing the error signatures that have empirically correlated with a broken `ConnectionMultiplexer` in the past, re-create it after some number of failures.
- After re-creating it more than X times in a short period of time, just force the service to restart.
Essentially, this:
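Sketched out (the thresholds, helper names, and counters here are illustrative, not our actual production code), the two mechanisms look roughly like this:

```csharp
// Hypothetical sketch of the two mitigation mechanisms. FailureThreshold,
// RecreationThreshold, LooksLikeBrokenMultiplexer and RecreateMultiplexer
// are stand-ins for illustration.
private int _consecutiveFailures;
private int _recentRecreations;

private void OnRedisError(Exception e)
{
    if (!LooksLikeBrokenMultiplexer(e)) { return; }

    // Mechanism 1: after enough failures with the known-bad signature,
    // tear down and re-create the ConnectionMultiplexer.
    if (Interlocked.Increment(ref _consecutiveFailures) >= FailureThreshold)
    {
        RecreateMultiplexer();
        Interlocked.Exchange(ref _consecutiveFailures, 0);

        // Mechanism 2: if we keep having to re-create it within a short
        // window, assume something is deeply wrong and restart the service,
        // letting the service manager bring us back up.
        if (Interlocked.Increment(ref _recentRecreations) >= RecreationThreshold)
        {
            Environment.Exit(1);
        }
    }
}
```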
This “worked”: those errors stopped happening, service restarts were relatively rare (which meant that mechanism (1) was doing its job), and we were all happy. Alas, it was not to last!
All of a sudden, we started seeing a lot of `ObjectDisposedException`s in places where we were using the `ConnectionMultiplexer` instance. Enough experience with race conditions made us naturally wary of our “innovative” mitigation strategy. As it turns out, this code fragment is essentially all you need to understand what’s going wrong:
```csharp
 1:  private ConnectionMultiplexer GetMultiplexer() {
 2:      if (_resetMultiplexerCts.IsCancellationRequested) {
 3:          lock (_resetLock) {
 4:              if (_resetMultiplexerCts.IsCancellationRequested) {
 5:                  _resetMultiplexerCts = new CancellationTokenSource();
 6:                  _multiplexer.Close();
 7:                  _multiplexer = ConnectionMultiplexer.Connect(_configuration);
 8:              }
 9:          }
10:      }
11:      return _multiplexer;
12:  }
```

(Code simplified for the post.)
In our code, we’d call `GetDatabase()` to fetch a pooled connection to Redis. When the need arose to reset the `ConnectionMultiplexer` instances, we basically did it by just adding a `CancellationTokenSource`:
- Whenever an operation finds that the current instance is broken, it calls `_resetMultiplexerCts.Cancel()`.
- Any new operation that needs to use the multiplexer will realize this (as per the code above), and re-create it before moving on.
The pattern on lines 2-4 is a double-checked lock. The intention is that only the first operation that realizes the multiplexer is broken does the reset. All operations that come after will wait for it to complete before actually attempting to use it.
The big issue here is ordering: we re-create the `CancellationTokenSource` on line 5. The new source is not cancelled, which means that any new operation that starts after that line will evaluate `_resetMultiplexerCts.IsCancellationRequested` as false, even though we haven’t actually created the new multiplexer yet (that happens on line 7).
Hence, we can get the following interleaving:
- Thread 1 triggers a reset by calling `_resetMultiplexerCts.Cancel()`.
- Thread 2 starts an operation, takes the lock, and runs up to line 5 inclusive.
- Thread 3 starts an operation, but sees the new `_resetMultiplexerCts`, skips the entire if statement in line 2, and takes a reference to the old multiplexer without actually using it yet.
- Thread 2 continues execution by closing the old connection multiplexer and creating the replacement.
- Thread 3 attempts to use the old multiplexer, triggering an `ObjectDisposedException`.
Since we do A TON of Redis operations per second, the Thread 3 scenario played out more frequently than you’d expect. The fix was, essentially, this:
```csharp
private ConnectionMultiplexer GetMultiplexer() {
    if (_resetMultiplexerCts.IsCancellationRequested) {
        lock (_resetLock) {
            if (_resetMultiplexerCts.IsCancellationRequested) {
                _multiplexer.Close();
                _multiplexer = ConnectionMultiplexer.Connect(_configuration);
                // Publish the fresh (un-cancelled) source only after the new
                // multiplexer is in place, so concurrent operations can never
                // observe a new token alongside the old, disposed multiplexer.
                Volatile.Write(ref _resetMultiplexerCts, new CancellationTokenSource());
            }
        }
    }
    return _multiplexer;
}
```
So, essentially, we avoid re-creating `_resetMultiplexerCts` until we actually have a working instance. The memory barrier introduced by `Volatile.Write` keeps the assignment of `_resetMultiplexerCts` from being reordered with the creation or assignment of the new multiplexer.
One could argue that this code is suboptimal, in the sense that we wait for the old multiplexer to close instead of creating the new one first and doing the shutdown after we have unblocked all concurrent operations. But keep in mind this is a stateful third-party library we’re dealing with, and we certainly didn’t want to fall victim to “oh, we’re caching some random thing under the hood, so BAZINGA!” kinds of bugs.