In this post, we attempt to solve a blend of these two problems. We start with the following question: If we are collecting coupons from among $n$ different types of coupons, what is the expected numbers of coupons that are needed to complete the collection given that the first collision occurred at the $c$'th coupon.
Let $K$ be the random variable indicating the number of coupons required to complete the collection and $C$ be the random variable indicated the number of coupons for the first collision. Then, we are looking for $\mathbb{E}(K|C=c)$.
This is not very hard. First, we already have collected $c$ coupons and because the first collision occurred at the $c$'th coupon, we have '$c-1$' distinct coupons. From here, we could proceed just like in the case of the original coupon collector problem. Therefore, we have,
$\begin{align}\mathbb{E}(K|C=c)&=c+\frac{n}{n-(c-1)}+\frac{n}{n-(c-2)}+\cdots+\frac{n}{1}\\&=c+n\text{ }H_{n-c+1} \\\end{align}$
where $H_n$ is the $n$'th Harmonic number. Note that the formula is valid for $1<c\leq n$.
We now deal with the (more challenging) 'reverse' question. That is, given we have completed the collection in $K=k$ coupons, how many cards were average would the first collision have occurred. That is, we look for $\mathbb{E}(C|K=k)$.
After trying out different approaches, I settled on the following approach. Let's first count the number of valid sequences. Let $N(n,j)$ be the number of sequences which collects $n$ different coupons exactly on the $j$'th draw. Obviously, $N(n,j)=0$ for $j<n$ and $N(n,n)=n!$.
With some thought, we could see that
$N(n,j)=n\text{ }N(n-1,j-1)+(n-1)\text{ }N(n,j-1)$
This works because, to collect $n$ exactly at the $j$'th draw, we could've (1) collected $n-1$ exactly at the $j-1$'th draw and the final coupon (for which we have $n$ choices) becomes the new coupon to complete the set (or) taken a sequence which completed the set of $n$ coupons in $j-1$ draws, choose any of the $n-1$ coupons except the last and insert it the penultimate position.
The $N$ function also has 'direct' formula via Stirling numbers of the second kind but we prefer this recurrence. We'll see why in a moment. Now, the expected value we are after can be written as,
$\displaystyle\mathbb{E}(C|K=k)=\frac{T(n,k)}{N(n,k)}$
where $T(n,j)$ is the sum of the first-collision-index of all the valid $N(n,j)$ sequences.
We don't have to look further to solve $T(n,j)$. Note that the neither of the two operations we described above for deriving the recurrence of $N(n,j)$ affects the first-collision-index of that sequence because the last coupon cannot be part of first collision (by definition, it has to be a new coupon to complete the set). Therefore, the same recurrence works for $T(n,j)$ as well!
The only tricky part is the boundary. Consider $T(n,n)$ which is the sum of all first collision indices where the set of $n$ coupons is collected in exactly $n$ draws. But those sequences must contain all distinct elements and therefore cannot have first collision. However, staying true to the process, we attribute a first collision index of $n$ to these sequences as we have collected $n$ coupons and the process have concluded. Therefore, we define, $T(n,n)=n \cdot n!$.
For example, using these recurrences, I get $\displaystyle\mathbb{E}(C|K=8,n=5)=612/175\approx 3.497$.
Now that we are on this subject, we can also check for $N(n,c,k)$, the number of sequences of coupons (from $n$ different coupons) that is 'complete' in the $k$'th draw and has its first collision on the $c$'th draw.
To compute $N(n,c,k)$, we must (1) choose '$c-1$' numbers in order (2) choose one among the '$c-1$' numbers for the $c$'th draw (3) choose one of the remaining '$n-c+1$' numbers to the last that completes the collection which means by now $c$ are already present in the sequence (4) choose $i$ of the remaining '$k-c-1$' positions for the '$n-c$' unseen numbers (5) partition the '$n-c$' numbers in $i$ non-empty subsets (6) use the '$c-1$' numbers at the start in the remaining positions without any constraint. Then,
$\displaystyle N(n,c,k)=\frac{n!}{(n-c+1)!}\cdot (c-1)\cdot(n-c+1)\cdot\sum_{i=0}^{k-c-1}\binom{k-c-1}{i}\cdot (n-c)!\left\{ \begin{matrix} i \\ n-c \end{matrix} \right\} \cdot (c-1)^{k-c-1-i}$
This can be simplified to finally give,
$\displaystyle N(n,c,k)=n!\cdot (c-1)\cdot\sum_{i=0}^{k-c-1}\binom{k-c-1}{i}\cdot \left\{ \begin{matrix} i \\ n-c \end{matrix} \right\} \cdot (c-1)^{k-c-1-i}$
I tried hard to find a recurrence or a generating function but failed miserably. I then delegated to CoPilot, Gemini and ChatGPT, and had a great time pointing one's error to the other. Finally, CoPilot (though I used it more than the others) figured out
$\displaystyle \sum_{n \geq 0,c > 0,k > c}N(n,c,k)\frac{x^n}{n!}\frac{y^{k-c-1}}{(k-c-1)!}z^{n-c}=\frac{x^2}{(1-xe^y)^2}\exp(y+xz(e^y-1))$
Gemini gave me the following python code that can be used to check the validity of the formula.
(code starts here..)
import itertools
def count_sequences(n, c, k):
"""
n: total number of distinct digits available
c: position of the first repetition (1-indexed)
k: total length of the sequence
"""
digits = list(range(n))
valid_count = 0
# Brute force through all n^k possibilities
for seq in itertools.product(digits, repeat=k):
# 1. Find the first repetition
first_repeat_pos = 0
seen = set()
for i, val in enumerate(seq):
if val in seen:
first_repeat_pos = i + 1
break
seen.add(val)
if first_repeat_pos != c:
continue
# 2. Check if the k-th digit is the one that completes the n-set
if len(set(seq)) == n and len(set(seq[:-1])) == n - 1:
valid_count += 1
return valid_count
# Specific Example
n, c, k = 6, 4, 9
result = count_sequences(n, c, k)
print(f"count_sequences(n={n}, c={c}, k={k}) = {result}")
(... and ends here.)
Hope you enjoyed the discussion.
Until then
Yours Aye
Me