Saturday, January 31, 2026

Blending the Birthday Problem and the Coupon Collector Problem

The Birthday Problem and Coupon Collector problem are well known problems in probability that are like duals of each other. While the former deals with 'first collision', the latter deals with the 'complete collection'. I believe they are too famous even warrant further introduction here.

In this post, we attempt to solve a blend of these two problems. We start with the following question: If we are collecting coupons from among $n$ different types of coupons, what is the expected numbers of coupons that are needed to complete the collection given that the first collision occurred at the $c$'th coupon.

Let $K$ be the random variable indicating the number of coupons required to complete the collection and $C$ be the random variable indicated the number of coupons for the first collision. Then, we are looking for $\mathbb{E}(K|C=c)$.

This is not very hard. First, we already have collected $c$ coupons and because the first collision occurred at the $c$'th coupon, we have '$c-1$' distinct coupons. From here, we could proceed just like in the case of the original coupon collector problem. Therefore, we have,

$\begin{align}\mathbb{E}(K|C=c)&=c+\frac{n}{n-(c-1)}+\frac{n}{n-(c-2)}+\cdots+\frac{n}{1}\\&=c+n\text{ }H_{n-c+1} \\\end{align}$

where $H_n$ is the $n$'th Harmonic number. Note that the formula is valid for $1<c\leq n$.

We now deal with the (more challenging) 'reverse' question. That is, given we have completed the collection in $K=k$ coupons, how many cards were average would the first collision have occurred. That is, we look for $\mathbb{E}(C|K=k)$.

After trying out different approaches, I settled on the following approach. Let's first count the number of valid sequences. Let $N(n,j)$ be the number of sequences which collects $n$ different coupons exactly on the $j$'th draw. Obviously, $N(n,j)=0$ for $j<n$ and $N(n,n)=n!$.

With some thought, we could see that

$N(n,j)=n\text{ }N(n-1,j-1)+(n-1)\text{ }N(n,j-1)$

This works because, to collect $n$ exactly at the $j$'th draw, we could've (1) collected $n-1$ exactly at the $j-1$'th draw and the final coupon (for which we have $n$ choices) becomes the new coupon to complete the set (or) taken a sequence which completed the set of $n$ coupons in $j-1$ draws, choose any of the $n-1$ coupons except the last and insert it the penultimate position.

The $N$ function also has 'direct' formula via Stirling numbers of the second kind but we prefer this recurrence. We'll see why in a moment. Now, the expected value we are after can be written as,

$\displaystyle\mathbb{E}(C|K=k)=\frac{T(n,k)}{N(n,k)}$

where $T(n,j)$ is the sum of the first-collision-index of all the valid $N(n,j)$ sequences.

We don't have to look further to solve $T(n,j)$. Note that the neither of the two operations we described above for deriving the recurrence of $N(n,j)$ affects the first-collision-index of that sequence because the last coupon cannot be part of first collision (by definition, it has to be a new coupon to complete the set). Therefore, the same recurrence works for $T(n,j)$ as well!

The only tricky part is the boundary. Consider $T(n,n)$ which is the sum of all first collision indices where the set of $n$ coupons is collected in exactly $n$ draws. But those sequences must contain all distinct elements and therefore cannot have first collision. However, staying true to the process, we attribute a first collision index of $n$ to these sequences as we have collected $n$ coupons and the process have concluded. Therefore, we define, $T(n,n)=n \cdot n!$.

For example, using these recurrences, I get $\displaystyle\mathbb{E}(C|K=8,n=5)=612/175\approx 3.497$.

Now that we are on this subject, we can also check for $N(n,c,k)$, the number of sequences of coupons (from $n$ different coupons) that is 'complete' in the $k$'th draw and has its first collision on the $c$'th draw.

To compute $N(n,c,k)$, we must (1) choose '$c-1$' numbers in order (2) choose one among the '$c-1$' numbers for the $c$'th draw (3) choose one of the remaining '$n-c+1$' numbers to the last that completes the collection which means by now $c$ are already present in the sequence (4) choose $i$ of the remaining '$k-c-1$' positions for the '$n-c$' unseen numbers (5) partition the '$n-c$' numbers in $i$ non-empty subsets (6) use the '$c-1$' numbers at the start in the remaining positions without any constraint. Then,

$\displaystyle N(n,c,k)=\frac{n!}{(n-c+1)!}\cdot (c-1)\cdot(n-c+1)\cdot\sum_{i=0}^{k-c-1}\binom{k-c-1}{i}\cdot (n-c)!\left\{ \begin{matrix} i \\ n-c \end{matrix} \right\} \cdot (c-1)^{k-c-1-i}$

This can be simplified to finally give,

$\displaystyle N(n,c,k)=n!\cdot (c-1)\cdot\sum_{i=0}^{k-c-1}\binom{k-c-1}{i}\cdot \left\{ \begin{matrix} i \\ n-c \end{matrix} \right\} \cdot (c-1)^{k-c-1-i}$

I tried hard to find a recurrence or a generating function but failed miserably. I then delegated to CoPilot, Gemini and ChatGPT, and had a great time pointing one's error to the other. Finally, CoPilot (though I used it more than the others) figured out

$\displaystyle \sum_{n \geq 0,c > 0,k > c}N(n,c,k)\frac{x^n}{n!}\frac{y^{k-c-1}}{(k-c-1)!}z^{n-c}=\frac{x^2}{(1-xe^y)^2}\exp(y+xz(e^y-1))$

Gemini gave me the following python code that can be used to check the validity of the formula.

(code starts here..)
import itertools

def count_sequences(n, c, k):
    """
    n: total number of distinct digits available
    c: position of the first repetition (1-indexed)
    k: total length of the sequence
    """
    digits = list(range(n))
    valid_count = 0
    
    # Brute force through all n^k possibilities
    for seq in itertools.product(digits, repeat=k):
        # 1. Find the first repetition
        first_repeat_pos = 0
        seen = set()
        for i, val in enumerate(seq):
            if val in seen:
                first_repeat_pos = i + 1
                break
            seen.add(val)
        
        if first_repeat_pos != c:
            continue
            
        # 2. Check if the k-th digit is the one that completes the n-set
        if len(set(seq)) == n and len(set(seq[:-1])) == n - 1:
            valid_count += 1
            
    return valid_count

# Specific Example
n, c, k = 6, 4, 9
result = count_sequences(n, c, k)
print(f"count_sequences(n={n}, c={c}, k={k}) = {result}")
(... and ends here.)


Hope you enjoyed the discussion.


Until then
Yours Aye
Me