The effect of barriers on completion times

Table of contents

Introduction

A lot of systems fan out work to many sub-tasks and then wait for all of them to finish. A build that compiles a few thousand .cc files in parallel and then links them. A scatter-gather query that hits 64 shards and joins their responses. A distributed training step that waits for every worker's gradient before applying the update.

That looks something like:

await Task.WhenAll(files.Select(DoSomethingReallyCoolAsync));

The completion time of the whole thing is the completion time of the slowest sub-task. This post works out how badly that hurts as grows, and what we can do about it.

Assumptions

We'll assume for most of this post that there are sub-tasks with i.i.d. completion times . This isn't always true in practice, but it's a good starting point for understanding the problem. In real systems, correlated sub-tasks are one reason why things will turn out to be much worse than in this idealized world; mutexes are perhaps one of the easiest ways to see how this can bite you.

We'll also assume (i.e., the completion time of the sub-tasks is log-normal ). Two things make this a reasonable starting point:

  1. Measured task and request latencies are almost always positively skewed: most finish quickly, a few run arbitrarily long ("ugh, this random thing hung again!"). A symmetric distribution like the normal can't reproduce that long right tail.
  2. There's a mechanism behind the skew. A sub-task's latency is the product of many independent multiplicative effects: queueing, cache misses, network jitter, lock contention. Taking logs turns that product into a sum, so is a sum of many independent terms and is approximately normal by the central limit theorem . That is exactly the statement that is log-normal, a multiplicative CLT sometimes called Gibrat's law .

Log-normal sits between the light-tailed exponential and genuinely heavy-tailed power laws: heavy enough to throw the occasional brutal straggler, light enough that the maximum keeps all its moments and obeys the clean scaling derived below. The scaling does depend on that tail. If real sub-task latencies are power-law in the tail (they sometimes are), the maximum grows much faster than this model predicts, so read the log-normal numbers as an optimistic case; bounded or lighter-tailed sub-tasks make it grow slower. The qualitative picture (barriers hurt as grows, dominates at scale, partial barriers help a lot) carries over to any reasonable choice.

The difference is easiest to see with both densities drawn together. Below, the normal is matched to the log-normal's mean and variance, so only the shape differs: the normal is symmetric and puts real probability on negative completion times, while the log-normal stays positive and trails off into a long right tail.

The two share a mean and a standard deviation, yet the log-normal's mean, median, and mode all disagree, the signature of a right-skewed distribution:

That skew carries into the CDF: the log-normal crawls toward 1 through its long right tail, while the normal has already spent real probability on negative completion times before the clock starts.

The effect of maximums

The maximum of i.i.d. random variables is also a random variable , and its distribution is given by:

Where is the cumulative distribution function (CDF) of , so is the CDF of . Already from this we can see why the maximum is painful: at every fixed with , as grows, so concentrates on larger and larger values. The more sub-tasks we have, the more likely at least one of them takes too long and drags the maximum upwards.

This equality means we can get a closed-form expression for the CDF of the maximum of log-normal random variables:

Where is the CDF of the standard normal distribution. Here's what that looks like:

We can use this to get the PDF of the maximum of log-normal random variables by differentiating, but it's a bit of a mess and doesn't help for things we'd like to do like computing expected values. Looking into this led me down the extreme value theory  rabbit hole, as you'll see below.

Expected value of the maximum

As it turns out, the PDF is messy and a closed-form expected value isn't pretty either, but we can simulate it. Below is a Monte Carlo estimate of along with the 68% and 95% central ranges of 's distribution (the 16th/84th and 2.5th/97.5th percentiles of for each ), for ranging from 1 to :

Two things to notice:

  1. The expected value grows monotonically with , at a decreasing rate.
  2. The spread also grows with .

Extreme value theory  tells us the log-normal distribution sits in the Gumbel domain of attraction , and after suitable centering and scaling, the maximum of i.i.d. samples converges to a Gumbel distribution as (yeah, I know, just wtf?).

The underlying normal has a well-known leading-order asymptotic "Leading order" pins down how fast the maximum grows with , not its value at the you actually run: the relative error vanishes as , but the absolute gap need not be small. So basically: it's a ballpark estimate rather than an approximation. for its maximum:

Exponentiating, the typical value of scales like

to leading order. is the level a single one of standard normals is expected to exceed: set and use the Gaussian tail , which gives to leading order. It is the centering sequence in the Fisher-Tippett-Gnedenko theorem  for the normal; the full expansion subtracts the terms mentioned above. There are corrections involving (visible in the Monte Carlo plot above as a meaningful gap at finite ), but the scaling is what tells the story.

The exponent grows like , so the maximum grows unboundedly in , but quite slowly. There are a few practical takeaways from this:

  1. The variance is way more important than the mean: dominates completion time as soon as is large enough that . In that regime, the same absolute reduction in shaves times as much off the exponent as the same reduction in , and the gap grows with .
  2. The maximum only ever moves up, and adding more parallel sub-tasks has diminishing returns on latency.
  3. A single rare slow sub-task ruins everything. The completion time is extremely sensitive to the tail of the per-task distribution.

What if we don't need all of them?

The full barrier (wait for all sub-tasks) is the worst case. In practice, we often don't need all sub-tasks to complete: we can tolerate some number of stragglers, and use only the fastest results. This is the order statistic  generalization of the maximum.

If we sort the sub-task completion times , the time to get the first of them done is , whose CDF is:

When this collapses back to (the maximum). The case is where it gets interesting:

The plot above shows the expected wait time as a function of for three "barrier fractions": waiting for 100% of sub-tasks (the full barrier), waiting for the fastest 95%, and waiting for the fastest 50% (the median):

  • The full barrier grows roughly like , as discussed above.
  • Tolerating a 5% straggler tail caps the wait at a much lower value: each additional sub-task gives a few more "outs", which together absorb the tail.
  • Tolerating a 50% straggler tail makes the wait time roughly constant in . As , the sample median converges to the population median , which doesn't depend on at all.

This idea shows up all over the place, with one famous case being in distributed training. Synchronous SGD waits for all workers' gradients before stepping, so the step time is the max of and one slow worker stalls everyone. Chen et al. (2016)  add backup workers and step as soon as the fastest of the report, dropping the slowest . The step time becomes the -th order statistic of instead of the max of , and because they still aggregate exactly gradients the effective batch (and its variance) is unchanged (at the cost of the gradients computed and discarded each step).

What can we do about it?

If the system has a hard barrier with no to spare, the previous trick is unavailable. There are still a few things that can be done.

Reduce σ

Since the maximum grows in , shaving has outsized impact. At we have , so every trimmed from divides the typical maximum by , while the same off only divides it by .

Common causes of variance in sub-task completion times include GC pauses, cold caches, head-of-line blocking, mutex contention, network jitter, and noisy neighbors. Hunting them down is usually the highest-leverage thing you can do And a great tool for this is Statistical Process Control . I really enjoy anything written by Donald J. Wheeler to learn about this. .

Reduce k

Either through batching (process many items per sub-task) or by hierarchical fan-out (splitting a -way fan-out into a -way fan-out of -way fan-outs). Batching is the one that actually moves the max, because it cuts directly:

// One task per file becomes one task per batch of 16: k drops 16x,
// at the cost of each task now doing 16x the work.
await Task.WhenAll(files.Chunk(16).Select(CompileBatchAsync));

Hierarchy mainly buys you bookkeeping (fewer children per node, easier failure handling); for pure latency it only helps when each intermediate does its own non-trivial work, since the global max in a routing-only tree is still the max of all leaves regardless of tree shape.

Hedge requests

If a sub-task can be served by multiple replicas, send the request to of them and take the first response:

// Race r replicas, take the first answer, cancel the losers.
async Task<Response> HedgedAsync(IReadOnlyList<Replica> replicas)
{
    using var cts = new CancellationTokenSource();
    var inFlight = replicas.Select(r => r.QueryAsync(cts.Token)).ToList();
    var winner = await Task.WhenAny(inFlight);
    cts.Cancel();
    return await winner;
}

Assuming the per-replica latencies are independent, this replaces the per-replica distribution with , which squares (for ) the tail probability. The downside is sending the load to the underlying system, which is rarely free. A common compromise is to wait until some quantile (e.g., p95) before issuing the hedged request, which dramatically shortens the tail while sending only ~5% extra load:

// Cheaper: only hedge the requests that have already gone slow.
async Task<Response> HedgeAfterAsync(Replica primary, Replica backup, TimeSpan p95)
{
    var first = primary.QueryAsync();
    if (await Task.WhenAny(first, Task.Delay(p95)) == first)
        return await first;                 // primary beat the p95 deadline
    using var cts = new CancellationTokenSource();
    var second = backup.QueryAsync(cts.Token);
    var winner = await Task.WhenAny(first, second);
    cts.Cancel();
    return await winner;
}

Dean and Barroso's The Tail at Scale is the canonical reference for this. Simulating all three strategies shows how much the barrier moves: full hedging () flattens it almost completely, and the p95 compromise recovers most of that benefit for a fraction of the extra load.

Speculatively re-issue stragglers

A variant of hedging for when the sub-tasks aren't naturally replicated. If a sub-task is taking longer than expected (e.g., past some quantile of the distribution), re-issue it to a different worker and take whichever response comes first. It's the same race as HedgeAfterAsync above, except the backup goes to a fresh worker rather than a replica, so the hedge plot describes it too. This is the trick MapReduce uses to deal with straggler tasks.

Conclusion

Barriers turn the per-task tail latency into the dominant cost of fan-out, and the cost grows with the number of sub-tasks. The math says the growth is slow ( in the exponent for log-normal), but in practice that's still painful: doubling pushes the maximum up, and any additional variance shows up amplified at the barrier.

The two best mitigations I've found in practice are (a) hunting down sources of variance in the per-task distribution, and (b) avoiding the full barrier when at all possible. " of " is often sufficient and dramatically cheaper; everything else (hedging, speculative re-execution, hierarchical fan-out) is downstream of those two.