Approximating LRU eviction with a quantile sketch

The problem

Suppose we have a disk-resident key-value store with a few million entries and a fixed disk budget of a few GB. Each entry knows when it was last touched. We want LRU eviction: when the store gets close to the budget, throw out the entries that haven't been touched in the longest time, until we're back under the limit.

LRU is easy when the data lives in memory. You keep a doubly linked list of keys ordered by access time, move a key to the front on every access, and pop from the back when you need space. The standard recipe assumes three things:

Duplicating the storage of keys is cheap (the list pointers are tiny next to the value).
Moving a key around is cheap (a couple of pointer writes).
The data structure that holds the order can be kept consistent with the data itself, because the underlying abstraction supports atomic updates.
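For reference, the in-memory recipe fits in a few lines. This is only an illustrative sketch: `LRUCache` is a hypothetical name, and Python's `OrderedDict` stands in for the doubly linked list plus hash map.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """In-memory LRU; the OrderedDict keeps keys in access order (oldest first)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> value

    def get(self, key: str) -> Optional[bytes]:
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # a couple of pointer writes
        return self.entries[key]

    def put(self, key: str, value: bytes) -> None:
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used key
```

Every `get` mutates the ordering structure, which is exactly the cost that becomes a problem once the data lives on disk.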

None of those hold in our scenario:

The data is disk-resident, so duplicating each key into a separate access-order index would cost a part of the disk budget we're trying to enforce in the first place.
Moving keys around means disk writes, which means write amplification and SSD wear. And atomicity across related keys requires expensive locks that any reasonable design wants off the hot path.

So we need an LRU that doesn't require maintaining an LRU order.

First attempt: keep the top K

A natural first attempt is an entry-count-bounded GC. The contract is: "after we're done, the DB has at most $K$ entries, and the ones we kept are the $K$ most recently accessed". One pass over the live DB, a min-heap of size $K$ keyed by access time, and at the end evict everything that wasn't in the heap. That's $O(N \log K)$ work and $O(K)$ memory, which is fine.
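A minimal sketch of that pass, assuming the DB can be iterated as `(key, last_access_time)` pairs; `most_recent_keys` is a hypothetical helper name, not part of any real store's API.

```python
import heapq


def most_recent_keys(entries, k):
    """Return the keys of the k most recently accessed entries.

    entries: iterable of (key, last_access_time) pairs.
    """
    heap = []  # min-heap of (last_access_time, key); the root is the oldest survivor
    for key, last_access in entries:
        item = (last_access, key)
        if len(heap) < k:
            heapq.heappush(heap, item)
        else:
            # Push the new entry and drop whichever entry is now the oldest.
            heapq.heappushpop(heap, item)
    return {key for _, key in heap}
```

Anything not in the returned set is evicted in a follow-up pass. The heap never holds more than `k` items, which is where the $O(K)$ memory comes from.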

This solves a real version of the problem, but it answers the wrong question. The disk budget is in bytes, not in entries. Two databases with the same number of entries can differ in size by a factor of 10 if one of them has a few hugely fat values. To translate "we want to be under $B$ bytes" into "we want at most $K$ entries", we'd have to know the average entry size, which the heap doesn't tell us. We could compute it on the fly, but then $K$ becomes a moving target and the heap stops giving the right answer.

The other issue (perhaps less obvious) is that $O (K)$ memory can be many GBs, which we may or may not have available to spare.

Quantile sketches in 60 seconds

A quantile sketch is a small data structure that consumes a stream of numbers and answers "what value is at the $q$-th quantile of everything I've seen?" with bounded error and bounded memory. It's a sibling of the more famous Bloom filter, Count-Min, and HyperLogLog, but where those sketch set membership, frequency, and cardinality respectively, a quantile sketch summarizes the distribution of the values themselves.

The well-known data structures for quantile sketching are t-digest, Greenwald-Khanna, and DDSketch. The first two give what are called rank-error guarantees: the returned value is within $ε$ of the true quantile in rank space. So at $q = 0.99$ with $ε = 0.01$, the answer is somewhere between the true 98th and 100th percentile. That's a great guarantee for symmetric distributions and a bad one for heavy-tailed distributions: near the tail, a small rank error can be a huge value error.

DDSketch gives a relative value error: the returned $\tilde{x}_{q}$ satisfies $\lvert \tilde{x}_{q} - x_{q} \rvert \leq α \cdot x_{q}$ for a configurable $α$. It does this by putting samples into exponentially spaced buckets: bucket $i$ covers $(γ^{i - 1}, γ^{i}]$ for $γ = (1 + α) / (1 - α)$, and the quantile is read off by counting through the buckets. That property is exactly what we want for ages that span several orders of magnitude.

Here's what that looks like with a small handful of samples so you can see individual values mapping into buckets. Each dot is one sample, stacked vertically inside its bucket; the dotted vertical lines are the bucket boundaries $γ^{i}$, evenly spaced on the log axis. To read the $q$-th quantile, the sketch walks buckets left-to-right and accumulates counts until the running total reaches $⌈ q N ⌉$; that bucket (highlighted) is the answer, and the returned value (the solid green line) is a fixed point inside it. The dashed red line shows the true quantile of the underlying distribution; the relative-error guarantee says the green line is within a factor of $γ$ of the red one.

Smaller $α$ means narrower buckets (more of them, more memory) and a tighter band around the true value. The whole sketch is just a sparse map from bucket index to count, so its memory grows with the number of occupied buckets, not with $N$ .
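To make the bucket mechanics concrete, here is a minimal sketch of the idea in Python. It is not the reference implementation: it assumes strictly positive inputs, never collapses buckets, and the class and method names are illustrative only.

```python
import math
from collections import Counter


class DDSketch:
    """Minimal DDSketch for positive values (no bucket collapsing, no zero handling)."""

    def __init__(self, alpha: float = 0.01) -> None:
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()  # sparse map: bucket index -> count
        self.count = 0

    def add(self, value: float) -> None:
        # Bucket i covers (gamma^(i-1), gamma^i].
        index = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[index] += 1
        self.count += 1

    def quantile(self, q: float) -> float:
        if self.count == 0:
            raise ValueError("empty sketch")
        target = max(1, math.ceil(q * self.count))
        seen = 0
        for index in sorted(self.buckets):  # walk buckets left to right
            seen += self.buckets[index]
            if seen >= target:
                break
        # This point is within relative error alpha of every value in the bucket.
        return 2 * self.gamma ** index / (self.gamma + 1)
```

The returned point $2γ^{i} / (γ + 1)$ is chosen so that its relative distance to both ends of the bucket is exactly $α$.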

The algorithm

The age of an entry is just age = now - lastAccessTime, in some unit. If we had a sorted list of ages, evicting the oldest $ρ$ fraction would be trivial: read the list backwards, evict until we'd freed enough bytes. The sorted list is exactly the thing we can't afford. But all we really need from it is a single number: the age cut-off $T^{*}$ above which entries should be deleted.

So the algorithm is two full passes over the DB:

Fit the sketch. Walk the DB, and for each entry insert its age into a DDSketch. Also keep a running sum of the total bytes in the DB, $B_{total}$; the bytes we want to remove this round are then $B_{remove} = B_{total} - B$, where $B$ is the byte budget.
Read the cut-off and evict. Compute the fraction to keep: $f_{keep} = 1 - B_{remove} / B_{total}$. Ask the sketch for the corresponding age quantile: $T^{*} = \text{sketch.Quantile}(f_{keep})$. Walk the DB again, and remove every entry whose age exceeds $T^{*}$.

That's it. The sketch is the bridge between "we want to free X bytes" and "we should evict every entry older than Y minutes". The whole thing is $O (N)$ work and $O (sketch)$ memory, which in practice works out to be very little (think KBs to MBs, depending on how much accuracy you want).
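The two passes can be sketched end to end as follows. Everything here is an assumption made for illustration: the store is a plain dict mapping key to `(last_access_time, size_bytes)`, `gc_round` is a hypothetical name, and ages are clamped to a small positive minimum because the log-bucketed sketch cannot take zero.

```python
import math
from collections import Counter


class DDSketch:
    """Minimal DDSketch for positive values (illustrative, no bucket collapsing)."""

    def __init__(self, alpha=0.01):
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()
        self.count = 0

    def add(self, value):
        self.buckets[math.ceil(math.log(value) / self.log_gamma)] += 1
        self.count += 1

    def quantile(self, q):
        target = max(1, math.ceil(q * self.count))
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen >= target:
                break
        return 2 * self.gamma ** index / (self.gamma + 1)


def gc_round(store, now, budget_bytes, alpha=0.01, min_age=1e-3):
    """Evict old entries until roughly budget_bytes remain; returns bytes freed."""
    # Pass 1: fit the sketch on ages and total up the bytes.
    sketch = DDSketch(alpha)
    total_bytes = 0
    for last_access, size in store.values():
        sketch.add(max(now - last_access, min_age))
        total_bytes += size
    remove_bytes = total_bytes - budget_bytes
    if remove_bytes <= 0:
        return 0  # already under budget

    # Pass 2: read the cut-off age and evict everything older than it.
    f_keep = 1 - remove_bytes / total_bytes
    cutoff = sketch.quantile(f_keep)
    # The victim list exists only because a dict cannot be mutated while iterating;
    # a real store would delete in place during the scan.
    victims = [k for k, (last_access, _) in store.items() if now - last_access > cutoff]
    return sum(store.pop(k)[1] for k in victims)
```

Note that pass 2 recomputes each age against the same `now` used in pass 1, so the cut-off and the ages it is compared with share a reference point.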

Here's what the sketch state looks like in practice for a log-normal age distribution with shape $σ$ . The blue bars are the DDSketch bucket counts (note the constant spacing on the log x-axis, since bucket $i$ covers $(γ^{i - 1}, γ^{i}]$ ); the line is the underlying density. The dashed red line is the true cut-off $T^{*}$ ; the solid green line is the sketch estimate $\tilde{T}^{*}$ , read off by walking the buckets from the top until the cumulative count reaches $ρ \cdot N$ .

The relative error printed above $\tilde{T}^{*}$ combines two effects: the DDSketch bucket grid (bounded by $α$ per the next section) and finite-sample noise on which bucket the cut-off lands in. Shrinking $α$ tightens the grid; cranking $N$ up reduces the noise.

Why does this work?

There are three things that can go wrong: the sketch can give the wrong $T^{*}$, the count-to-bytes translation in step 2 can be off because the sketch is count-weighted but the budget is byte-weighted, and the sketch can be too big to be worth it. Let's bound each.

Sketch error

Proposition 3 of the DDSketch paper is the key here: for any $q \in [0, 1]$ , the returned $\tilde{T}^{*}$ satisfies

$$\lvert \tilde{T}^{*} - T^{*} \rvert \leq α \cdot T^{*}$$

deterministically, where $α$ is the relative error parameter of the sketch.

That gives an age error. What we care about is the resulting error in eviction fraction: we aim at fraction $ρ = 1 - f_{keep}$ of entries above the cut-off, and we get $\tilde{ρ} = 1 - F (\tilde{T}^{*})$ instead, where $F$ is the true CDF of ages. To first order:

$$\lvert \tilde{ρ} - ρ \rvert \approx f(T^{*}) \cdot \lvert \tilde{T}^{*} - T^{*} \rvert \leq α \cdot T^{*} f(T^{*})$$

where $f$ is the density. The quantity $T^{*} f (T^{*})$ is the density of $ln T$ at the cut-off, which measures how fast the CDF moves per unit relative change in age. That's exactly what DDSketch's relative-error guarantee translates into.

For ages that are roughly log-normal with shape $σ$, which is the right model when most things get touched in the last few minutes and a long tail goes weeks without a hit, the standard normal tail approximation gives

$$T^{*} f(T^{*}) \approx \frac{ρ \sqrt{2 \ln(1/ρ)}}{σ}$$

(The density of $\ln T$ at the cut-off is $φ(z) / σ$ with $z = Φ^{-1}(1 - ρ)$, and for small $ρ$ the normal tail satisfies $φ(z) \approx ρ z$ with $z \approx \sqrt{2 \ln(1/ρ)}$.)

So the relative error in the eviction fraction, $\lvert \tilde{ρ} - ρ \rvert / ρ$, is bounded by roughly $α \sqrt{2 \ln(1/ρ)} / σ$. For $α = 0.01$, $ρ = 0.1$, $σ = 2$, that's about $1\%$. With $α = 0.01$ we're asking for the cut-off age to be off by no more than 1% in relative terms, and we get an eviction fraction that is off by no more than about 1% of its target back.
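As a sanity check on that arithmetic, the first-order error can be evaluated both with the exact normal density and with the closed-form tail approximation. The function name is hypothetical; `statistics.NormalDist` is part of the Python standard library.

```python
import math
from statistics import NormalDist


def eviction_fraction_rel_error(alpha, rho, sigma):
    """First-order relative error in the eviction fraction for log-normal ages.

    Returns (exact, approx): exact uses the normal density at the cut-off,
    approx uses the tail approximation sqrt(2 ln(1/rho)).
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - rho)  # cut-off in standard-normal units
    exact = alpha * std_normal.pdf(z) / (sigma * rho)
    approx = alpha * math.sqrt(2 * math.log(1 / rho)) / sigma
    return exact, approx


exact, approx = eviction_fraction_rel_error(alpha=0.01, rho=0.1, sigma=2.0)
print(f"exact {exact:.2%}, approx {approx:.2%}")
```

For these parameters the exact first-order value comes out a little under 0.9% and the approximation a little over 1%, so the closed form errs on the conservative side here.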

Count-to-bytes error

The sketch sees one observation per entry, regardless of size, so it answers in count terms: $ρ$ is the fraction of entries above the cut-off. But the budget is $B_{remove}$ bytes. The translation only holds if entry sizes are statistically independent of age, an assumption we'll come back to in the next section.

Granting independence: let $S_{1}, \dots, S_{K}$ be the sizes of the $K \approx ρN$ evicted entries, with mean $μ_{S}$ and standard deviation $σ_{S}$. By the central limit theorem,

$$\sum_{i=1}^{K} S_{i} \sim \mathcal{N}(K μ_{S}, K σ_{S}^{2})$$

so the bytes freed have relative standard deviation $σ_{S} \sqrt{K} / (μ_{S} K) = (σ_{S} / μ_{S}) / \sqrt{ρN}$.

If sizes are heavy-tailed enough that $σ_{S} / μ_{S}$ is around 1, with $N = 10^{6}$ and $ρ = 0.1$ that's $1/\sqrt{10^{5}} \approx 0.3\%$ at one sigma, so about $1\%$ at three sigmas (i.e., comparable to the sketch error!).
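The same numbers can be reproduced in a few lines, together with a small Monte Carlo check. The simulation assumes exponentially distributed sizes (coefficient of variation exactly 1), which is only a convenient stand-in for a real size distribution; both function names are hypothetical.

```python
import math
import random
import statistics


def predicted_rel_std(size_cv, n_entries, rho):
    """CLT prediction: (sigma_S / mu_S) / sqrt(rho * N)."""
    return size_cv / math.sqrt(rho * n_entries)


def simulated_rel_std(k, trials=300, seed=0):
    """Relative std of the sum of k i.i.d. exponential sizes, estimated by simulation."""
    rng = random.Random(seed)
    totals = [sum(rng.expovariate(1.0) for _ in range(k)) for _ in range(trials)]
    return statistics.pstdev(totals) / statistics.mean(totals)


# The article's numbers: N = 10^6, rho = 0.1, sigma_S / mu_S = 1.
print(f"{predicted_rel_std(1.0, 10**6, 0.1):.2%}")  # prints 0.32%

# A smaller case that is cheap to simulate: K = 1000 evicted entries.
print(predicted_rel_std(1.0, 10**4, 0.1), simulated_rel_std(1000))
```

The simulated value should sit close to the prediction of about 0.032 for $K = 1000$, up to the noise of a 300-trial estimate.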

Memory bound

DDSketch uses one bucket per multiplicative step of size $γ = (1 + α) / (1 - α)$. For $α = 0.01$, $γ \approx 1.0202$ and $\ln γ \approx 0.02$. If ages span from a millisecond up to a few days, that's about 9 orders of magnitude, so

$$\text{buckets} \approx \frac{\ln(10^{9})}{\ln γ} \approx \frac{20.7}{0.02} \approx 10^{3}$$

Around a thousand buckets, each storing an integer count, comes to maybe 20 KB (so free, yay!).
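The bucket count is easy to recompute for other accuracy targets or age ranges. The helper below is a hypothetical convenience, and the 16 bytes per bucket (an 8-byte index plus an 8-byte count, ignoring map overhead) is an assumption rather than a measured figure.

```python
import math


def ddsketch_bucket_bound(alpha, min_value, max_value):
    """Upper bound on occupied buckets for values in [min_value, max_value]."""
    gamma = (1 + alpha) / (1 - alpha)
    return math.ceil(math.log(max_value / min_value) / math.log(gamma))


# Ages in seconds, from 1 ms up to 10^6 s: nine orders of magnitude, as in the text.
n_buckets = ddsketch_bucket_bound(alpha=0.01, min_value=1e-3, max_value=1e6)
approx_kib = n_buckets * 16 / 1024
print(n_buckets, f"{approx_kib:.1f} KiB")
```

Halving $α$ roughly doubles the bucket count, so even a 0.1% sketch over the same range stays in the low hundreds of KB under this assumption.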

Putting it together

Combine the three: the sketch error gives $\approx 1\%$ relative error on eviction count, the CLT gives another $\approx 1\%$ on the count-to-bytes translation at three sigmas, and the sketch itself costs next to nothing in memory. Total: with probability $\geq 99\%$ a single round frees within a few percent of the target byte count. That's good enough for a GC that runs periodically, because any miss this round gets corrected by the next round.

A few issues along the way

A few things deserve to be called out:

Snapshot drift. Both passes iterate over the DB, but the live DB keeps moving in between. The cut-off computed in pass 1 may be slightly stale by the time pass 2 evicts on it, and pass 2 may attempt to remove entries that have been freshly touched in between. If the GC window is short relative to the access cadence this doesn't matter much; otherwise the analysis above stops applying because the distribution isn't stationary across the GC run. In practice, I've seen this neat trick work even with a gap of several hours between the two passes.

Size independent of age. The CLT bound relies on sizes being uncorrelated with age. That's roughly true when the relevant cause of "this is old" is something like a dormant access pattern rather than a property of the entry itself. This gets thorny for, say, a cache where larger files stick around longer because they're more expensive to refetch. If size and age correlate, the right fix is to insert into the sketch with weight equal to entry size: then the sketch's quantiles are byte-quantiles directly, and the count-to-bytes translation goes away.
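Here is a minimal sketch of the size-weighted variant, on toy data where the oldest entries are also the largest. `WeightedDDSketch` is a hypothetical name, and the implementation makes the same simplifications as before (positive values only, no bucket collapsing).

```python
import math
from collections import Counter


class WeightedDDSketch:
    """Minimal DDSketch where each sample carries a weight (here: entry size in bytes)."""

    def __init__(self, alpha=0.01):
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()  # bucket index -> total weight in that bucket
        self.total = 0

    def add(self, value, weight=1):
        # Assumes value > 0; bucket i covers (gamma^(i-1), gamma^i].
        index = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[index] += weight
        self.total += weight

    def quantile(self, q):
        if not self.buckets:
            raise ValueError("empty sketch")
        target = q * self.total
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen >= target:
                break
        return 2 * self.gamma ** index / (self.gamma + 1)


# Toy data: ages 1..100, and the ten oldest entries are ten times larger.
entries = [(age, 1000 if age > 90 else 100) for age in range(1, 101)]
total_bytes = sum(size for _, size in entries)
f_keep = 1 - 4_500 / total_bytes  # we want to free about 4,500 bytes

by_bytes, by_count = WeightedDDSketch(), WeightedDDSketch()
for age, size in entries:
    by_bytes.add(age, weight=size)  # byte-weighted: quantiles are byte-quantiles
    by_count.add(age)               # count-weighted: one observation per entry

byte_cutoff = by_bytes.quantile(f_keep)    # lands near age 96
count_cutoff = by_count.quantile(f_keep)   # lands near age 77
```

On this data the byte-weighted cut-off evicts only the last few large entries, close to the 4,500-byte target, while the count-weighted cut-off would evict everything older than about 77 and free more than 11,000 bytes.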

But can we do this faster!?

The algorithm pays for two full DB scans per GC cycle. There are a few tricks that can save some of that:

The first pass can be skipped if you keep the sketch warm between cycles (insert on every access, optionally with a decay). Whether that's worth it depends on whether the disk read cost dominates the per-access maintenance cost (usually yes), and on how disciplined you can be about updating the sketch on every code path that touches the DB (usually less than you'd hope).
You can fit the sketch on less than the whole DB. For example, you might use reservoir sampling to keep a random sample of keys in memory, and fit the sketch on that sample instead of the whole DB. The sketch error analysis still applies, but the count-to-bytes translation gets an extra layer of sampling noise on top. With a big enough sample, that noise is small enough that it doesn't change the overall error bounds much.
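For the sampling option, the classic single-pass reservoir algorithm (Algorithm R) is enough. The sketch below assumes keys arrive as an iterable of unknown length; `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Algorithm R: a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace a random slot with probability k / (i + 1)
    return sample
```

The sketch would then be fitted on the ages of the sampled keys only, with the sample size chosen so that the added sampling noise stays below the error budget from the analysis above.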

Conclusion

The takeaway is the title: when the obvious data structure is a sorted index and you can't afford to maintain it, ask what you actually need from it. If you happen to need a single threshold, a quantile sketch will give you that threshold with bounded error in bounded memory.

This same trick applies anywhere a global sort would feed a cut-off (rate limiting by latency budget, capacity planning by request-size percentile, alerting on a tail metric without ranking everything). DDSketch turns out to be a great default whenever the value of interest spans several orders of magnitude.