May 2, 2018

Understanding Modulo Bias

It is often said that this code:

unsigned int randomNumber = rand() % k;

is a bad idea, at least if you are expecting a uniform distribution. I’m going to try and explore this topic in a more formal fashion than I have seen so far.

The reason why it is bad is pretty elementary and easy to understand: imagine you have a random generator that outputs values between \(0\) and \(9\) (i.e. RAND_MAX is \(9\), so there are \(10\) possible outputs), and you wanted values between \(0\) and \(2\) (so you would set \(k = 3\)). Then, we have the following mapping:

rand() output      randomNumber
0, 3, 6, 9         0
1, 4, 7            1
2, 5, 8            2

So, effectively, \(P(\text{randomNumber} = 2) = \frac{3}{10}\); in contrast, \(P(\text{randomNumber} = 0) = \frac{4}{10}\). Essentially, the problem lies in the fact that since \(k\) does not evenly divide the number of possible outputs, the first few numbers in the range each receive up to \(1\) extra preimage. To get a feeling for this, you can play around with the plot below, which shows the precise probabilities for different values of RAND_MAX and \(k\):
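As a quick sanity check, here is a small C sketch that enumerates every output of the toy generator and counts how often each result appears; the constants M and k are just the example values from the table above:

#include <stdio.h>

int main(void) {
    const unsigned int M = 10;      // number of distinct generator outputs: 0..M-1
    const unsigned int k = 3;       // desired range: 0..k-1
    unsigned int counts[3] = {0};   // counts[i] = how many outputs map to i

    for (unsigned int x = 0; x < M; x++)
        counts[x % k]++;

    for (unsigned int i = 0; i < k; i++)
        printf("P(randomNumber = %u) = %u/%u\n", i, counts[i], M);
    return 0;
}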

Effectively, the bias becomes more pronounced as \(M \% k\) gets bigger; but once \(k\) itself gets close to \(M\) (e.g. \(k = M - 1\), where \(M \% k = 1\)), it is almost as if you didn’t have the problem. More formally, we can analyze this by setting \(X\) to be a discrete random variable such that \(R_X = \{0, …, M-1\}\), with \(M-1 \in \mathbb{N}\) being the highest achievable number for the random generator and \(P(X = i) = \frac{1}{M}\).

Let \(Y = X \% k\). What we want is to find out what the probability distribution function of \(Y\) is, since that will give us a way to properly understand what is happening. First of all, notice that \(R_Y = \{0, …, k-1\}\). Hence, the probability is:

\[ P(Y = i) = P(X \in J_i) = \frac{\#J_i}{M} \]

Where \(J_i = \{x \in R_X : x \equiv i \pmod{k}\}\). Concretely, the probability that \(Y\) lands in any particular equivalence class is the number of elements in the class over the total number of elements we get to pick from. So the next thing to do is to calculate \(\#J_i\).

Elements belonging to \(J_i\) all have the same form: \(qk + i\), with \(q \in \mathbb{Z}\). However, they are bounded above by \(M-1\), and below by \(i\):

\[ i \leq qk + i \leq M - 1 \iff 0 \leq q \leq \frac{M - 1 - i}{k} \]

Since \(q\) is an integer, it can take exactly the values \(0, 1, …, \lfloor\frac{M - 1 - i}{k}\rfloor\).

Thus, \(\#J_i = \lfloor\frac{M - 1 - i}{k}\rfloor + 1\). Meaning that our probability distribution function is

\[ P(Y = i) = \frac{\lfloor\frac{M - 1 - i}{k}\rfloor + 1}{M}, \quad i \in \{0, …, k-1\} \]

We want to see that this sums to \(1\), since this would confirm that it is a probability distribution. I will use the identities \(\lceil \frac{a}{b} \rceil = \lfloor \frac{a - 1}{b} \rfloor + 1\) and \(a = \sum_{i=0}^{b-1} \lceil \frac{a - i}{b} \rceil\); both proofs can be found in {1}:

\[ \sum_{i=0}^{k-1} P(Y = i) = \frac{1}{M}\sum_{i=0}^{k-1}\left(\left\lfloor\frac{M - 1 - i}{k}\right\rfloor + 1\right) = \frac{1}{M}\sum_{i=0}^{k-1}\left\lceil\frac{M - i}{k}\right\rceil = \frac{M}{M} = 1 \]

The fact that the distribution depends on the value of \(i\) is already a problem: we were seeking a uniform distribution, and hence expecting that \(P(Y=i) = \frac{1}{k}\). If you plot the function, you will see exactly the same plot as you were playing with above.
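If you prefer to check the formula numerically rather than visually, here is a minimal sketch (the values of M and k are arbitrary toy choices) that compares the closed form against a brute-force count of the residues:

#include <stdio.h>

int main(void) {
    const unsigned int M = 10, k = 3;   // toy parameters; change them freely

    for (unsigned int i = 0; i < k; i++) {
        // Brute force: count how many x in {0,...,M-1} satisfy x % k == i.
        unsigned int count = 0;
        for (unsigned int x = 0; x < M; x++)
            if (x % k == i) count++;

        // Closed form: floor((M - 1 - i) / k) + 1 (unsigned division floors).
        unsigned int formula = (M - 1 - i) / k + 1;
        printf("i = %u: brute force = %u/%u, formula = %u/%u\n",
               i, count, M, formula, M);
    }
    return 0;
}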

How bad is it?

How bad this is depends on what you are doing. A more interesting question is how to quantify how bad it can be. One way to go about quantifying it is to measure how different the distribution of \(Y\) is from the distribution we actually wanted (call it \(Z\)); there are myriad ways to do this, but I am going to use the Kullback–Leibler divergence.

Thus, suppose \(Z\) was a uniformly distributed random variable such that \(R_Z = \{0, …, k - 1\}\) (i.e. the actual distribution we wanted to achieve). Given some \(M\), we know that \(k\) can range over \(1 \leq k \leq M - 1\); for each of these values, we have one probability distribution, call it \(Y\), and we can compute \(\text{KL}(Z||Y)\):
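Here is a sketch of that computation, using the distribution of \(Y\) derived above; the choice of \(M = 32\) is purely illustrative:

#include <stdio.h>
#include <math.h>

int main(void) {
    const unsigned int M = 32;  // illustrative generator size; outputs are 0..M-1

    for (unsigned int k = 1; k < M; k++) {
        double kl = 0.0;
        for (unsigned int i = 0; i < k; i++) {
            double pz = 1.0 / k;                            // target: uniform over 0..k-1
            double py = ((M - 1 - i) / k + 1) / (double)M;  // P(Y = i); unsigned division floors
            kl += pz * log(pz / py);
        }
        printf("k = %2u: KL(Z||Y) = %f\n", k, kl);
    }
    return 0;
}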

If you look at different values of \(M\), it will become clear that \(\text{KL}(Z||Y)\) only reaches \(0\) when \(k\) evenly divides \(M\); thus some values of \(M\) never get a point that reaches \(0\) beyond the trivial \(k = 1\): specifically, the prime numbers.

Furthermore, the space between the zeros becomes bigger and bigger as \(M\) increases. This is basically because the divisors of \(M\) are more spaced out as they get larger; this is easy to see by looking at the prime decomposition of \(M = \prod_{i=1}^b p_i^{r_i}\): every divisor looks just like the decomposition, only with a lesser or equal exponent for each \(p_i\), which means that jumps from one divisor to the next are multiplicative.

The last key to understanding this is the parabolas. Why do we have an inverted parabola between consecutive divisors? That is simply because of how modulo works:

  • When increasing \(k\), one more value gets biased. This means that the generated distribution drifts away from the target distribution \(Z\).
  • When the number of biased values (\(M \% k\)) is roughly \(\frac{k}{2}\), we reach the highest point of the parabola.
  • After this we can only decrease: as more values become biased, the distribution becomes more and more like a uniform distribution again.

How can I fix this?

Perhaps the easiest fix is to do:

// Rejection sampling: keep drawing until the value falls in [0, k).
// Initializing to k forces the loop body to run at least once.
unsigned int randomNumber = k;
while (randomNumber >= k) randomNumber = rand();

This works just as intended: the produced number is always between \(0\) and \(k-1\), and it is distributed uniformly in that range. To see this, set \(T_1, T_2, …\) to be independent, identically distributed random variables, each uniform between \(0\) and \(M-1\), and let \(T\) be the first one of them that is smaller than \(k\) (i.e. the value the loop returns); then we can compute the PDF of \(T\):

\[ P(T = j) = \sum_{n=1}^{\infty} \left(\frac{M - k}{M}\right)^{n-1} \frac{1}{M} = \frac{1}{M} \cdot \frac{1}{1 - \frac{M - k}{M}} = \frac{1}{k}, \quad j \in \{0, …, k-1\} \]

Which now looks exactly as we wanted it to! Perhaps the next interesting question is how many times we will call rand() until we find a number that satisfies the constraint.

This can be easily answered: the probability that we receive a random number in the range is \(\frac{k}{M}\), and the random numbers generated are independent of each other; which means that the number of iterations \(N\) is a geometric random variable. Hence, we have that \(N \sim \text{Geometric}(\frac{k}{M})\), and thus we can expect to take \(\mathbb{E}[N] = \frac{M}{k}\) iterations until we generate a random number.
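A quick empirical check of that expectation (the value of k and the number of samples below are arbitrary illustration choices): run the naive loop a few times and compare the average number of rand() calls per output to \(\frac{M}{k}\), with \(M\) being RAND_MAX + 1:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const unsigned int k = 10000;             // desired range: 0..k-1
    const unsigned int samples = 50;          // how many outputs to generate
    const double M = (double)RAND_MAX + 1.0;  // number of values rand() can return

    srand((unsigned int)time(NULL));

    unsigned long long calls = 0;
    for (unsigned int s = 0; s < samples; s++) {
        unsigned int r;
        do {
            r = (unsigned int)rand();
            calls++;
        } while (r >= k);
    }

    printf("average rand() calls per output: %.1f (expected about %.1f)\n",
           (double)calls / samples, M / k);
    return 0;
}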

The expectation points out a problem in our code: if \(M \gg k\), which often happens in practical implementations where \(M\) is the maximum representable number, then the number of iterations is going to be high.

What now?

Well, the problem with the last idea is that we discard most of the numbers that we generate, while the problem with the first idea is that there is a set of numbers that biases our generator. The next idea is to try to discard just the numbers that bias the generator.

We can write \(M\) as \(\lfloor \frac{M}{k} \rfloor k + (M \% k)\); if we did the modulo technique on a generator with output between \(0\) and \(\lfloor \frac{M}{k} \rfloor k - 1\), we would have a uniform distribution, as we have seen before (because the range would be evenly divisible). Indeed, all we need to do is to discard the \(M \% k\) values at either the end or the beginning of the range, and then do modulo over the result; so this is practically a “merge” of the last two ideas. It would go something like this:

// M is the number of values rand() can return, i.e. RAND_MAX + 1.
// Reject the top M % k values so that the remaining range splits evenly into k buckets.
unsigned int threshold = M - (M % k);
unsigned int randomNumber;
do {
    randomNumber = rand();
} while (randomNumber >= threshold);
randomNumber = randomNumber % k;

We don’t really need to do anything to prove that this is correct: it is clear that the rejection loop samples a number from the range \(\{0, …, \lfloor \frac{M}{k} \rfloor k - 1\}\) uniformly at random (from the proof of the previous scheme), and then using modulo on that changes it to a uniform distribution with the desired range (due to the first scheme’s proof).
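For completeness, here is one way the scheme could be wrapped into a reusable helper; the name bounded_rand and the widening to unsigned long long (to avoid overflow when computing RAND_MAX + 1) are my own choices, not part of the original snippet:

#include <stdlib.h>

// Returns a value uniformly distributed in {0, ..., k-1}, assuming 0 < k <= RAND_MAX + 1.
unsigned int bounded_rand(unsigned int k) {
    const unsigned long long M = (unsigned long long)RAND_MAX + 1; // values rand() can return
    const unsigned long long threshold = M - (M % k);              // reject the top M % k values

    unsigned long long r;
    do {
        r = (unsigned long long)rand();
    } while (r >= threshold);

    return (unsigned int)(r % k);
}

With something like this in place, a call such as bounded_rand(k) can stand in for rand() % k wherever an unbiased result matters.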

The only interesting change here is the number of calls to rand(). Now, the probability that a number is in the desired range for the loop is \( \frac{M - (M \% k)}{M} = 1 - \frac{M \% k}{M}\), and thus the number of iterations is \(N \sim \text{Geometric}(1 - \frac{M \% k}{M})\). We can now look at the expected number of iterations for different values of \(k\) and \(M\):

Indeed, this is pretty much what we would reasonably expect to happen: the bigger \(M \% k\) is, the more wastage we incur, and thus the more iterations are needed. We can prove, however, that the expected number of iterations is bounded by \(2\):

\[ \mathbb{E}[N] = \frac{M}{M - (M \% k)} = \frac{\lfloor \frac{M}{k} \rfloor k + (M \% k)}{\lfloor \frac{M}{k} \rfloor k} = 1 + \frac{M \% k}{\lfloor \frac{M}{k} \rfloor k} < 2 \]

Where we have repeatedly used the fact that \(M = \lfloor \frac{M}{k} \rfloor k + (M \% k)\). Notice that \(\frac{M \% k}{\lfloor \frac{M}{k} \rfloor k}\) is strictly less than \(1\), because if it were \(1\) or more, we could increase the value of \(\lfloor \frac{M}{k} \rfloor\), which contradicts the definition of floor division itself. Thus, the entire expression is less than \(2\). This means that our algorithm has an expected running time of \(\mathcal{O}(1)\), assuming the rand operation is \(\mathcal{O}(1)\).

Perhaps the last interesting question with regards to this algorithm is what the probability is that we will do more than \(t\) iterations before outputting a random number. Working with the CDF of the geometric distribution gets you:

\[ P(N > t) = \left(\frac{M \% k}{M}\right)^t = \left(1 - \left\lfloor\frac{M}{k}\right\rfloor\frac{k}{M}\right)^t \]

The closer \( \lfloor \frac{M}{k} \rfloor \frac{k}{M} \) is to \(1\), the less probable it is that we spend many iterations. Of course, this term is just the inverse of the expected number of iterations:

\[ \left\lfloor\frac{M}{k}\right\rfloor\frac{k}{M} = \frac{M - (M \% k)}{M} = \frac{1}{\mathbb{E}[N]} \]

Since we showed that \(\mathbb{E}[N] < 2\), this term is always greater than \(\frac{1}{2}\). Plugging this into the previous formula, we get that \(P(N > t) \leq (\frac{1}{2})^t\).

{1} Graham, Ronald L., Donald E. Knuth, and Oren Patashnik. 1994. Concrete Mathematics (2nd ed.). Addison-Wesley.


For comments and discussion, feel free to reach out via email.

© Julian Bayardo Spadafora 2015-2020
