## Saturday, October 27, 2018

### German Tank problem

I was recently watching How to estimate a population using statisticians in Youtube and this reminded of the German tank problem. The problem has a very nice history and I started reading about it in Wikipedia after I watched the video.

I found it very interesting that the problem can be achieved using either frequentist inference or Bayesian inference, achieving different results. Estimating the size of a Population by gives the proof of the famous result that the number of tanks.

Let the random variables $N$ denote the unknown number of tanks, $M$ be the maximum serial number observed and $K$ be the sample size. Then given $M=m$ and $K=k$,

$\displaystyle N \approx \frac{k+1}{k}m-1$ by frequentist analysis and
$\displaystyle \mathbb{E}(N|M=m,K=k)=(m-1)\frac{k-1}{k-2}$ by Bayesian analysis.

Both of these estimates are based on the probability $M=m$ is the maximal among $N$ variables with a sample size of $K$. That is,
$\displaystyle \mathbb{P}(M=m|N,K)=\frac{\binom{m-1}{K-1}}{\binom{N}{K}}$

Now let's think differently. The history of the German tank problem meant that the serial number of the tanks were noted after they were captured or destroyed. But what if they weren't? Assume that the serial number were noted without the tanks being captured/destroyed. In other words, there is a possibility of the observed numbers being repeated. Therefore,

$\displaystyle \mathbb{P}(M=m|N,K)=\frac{\binom{K+m-1}{m-1}-\binom{K+m-2}{m-2}}{\binom{K+N-1}{N-1}}=\frac{K}{m-1}\frac{\binom{K+m-2}{m-2}}{\binom{K+N-1}{N-1}}$

Surprisingly, even in this case the total likelihood function remains the same as in the original German tank problem. That is,

$\displaystyle \sum_{n}\mathcal{L}(n)=\frac{k}{k-1}$ using the sum $\displaystyle \sum_{n=m}^\infty \frac{1}{\binom{k+n-1}{n-1}}=\frac{k}{m}\frac{1}{\binom{k+m-2}{k-2}}$

Proceeding like in the wiki article, this we have,

$\displaystyle N \approx \frac{k+1}{k}m-\frac{1}{m}$ by frequentist analysis and
$\displaystyle \mathbb{E}(N|M=m,K=k)=m\frac{k-1}{k-2}$ by Bayesian analysis.

Now coming back to the video, I quoted at the start of this post, it is clear that the population estimation is done with the famous capture-recapture method. This is basically the setting usually described in terms of Hypergeometric distribution. The estimate of the total population in this case is given by,

$\displaystyle N \approx \frac{nK}{k}$

where $K$ is the initial sample size, $n$ is the second sample size and $k$ is the number of elements that are recaptured.

But this is based on frequentist inference. What about the Bayseian inference? Turns out we would need the following sum to come up with the total likelihood function.

$\displaystyle \sum_{N \geq \text{max}(n,K)}\frac{\binom{N-K}{n-K}}{\binom{N-m}{n-m}}=\frac{n-m}{K-m-1}\frac{1}{\binom{K-m-2}{k-m-2}}$

Using we have, $\displaystyle \sum_{n}\mathcal{L}(n)=\frac{nK}{k(k-1)}$ and therefore $\displaystyle \mathbb{E}(N)=\frac{(n-1)(K-1)}{k-2}$ as given here.

Now, the final part. Matt does something interesting in the video. He uses the mark-and-recapture method to estimate the number of candies in a bowl by repeated sampling. How do we go about that?

Let $n_0$ be the total number of candies in the bowl that we are trying to estimate. To this end, we capture a bunch of them each time and mark how many times those candies were captured. Let's say we do this $k$ times and $n_1,n_2,\cdots,n_k$ be sample sizes. Let $X_j$ denote the number of candies that were captured $j$ times. Now we have,

$\displaystyle \mathbb{E}(X_j)=n_0[t^j]\prod_{i=1}^k \left(1-\frac{n_i}{n_0}+ t\frac{n_i}{n_0}t\right)$

Using the data given by Matt in the video, we observe that $2$ candies were capture $4$ times. Using this as an estimate for $\mathbb{E}(X_4)$ and solving resulting equation for $n_0$ and picking the maximal root, we have $n_0 \approx 164.3$ whereas the bowl originally had about $169$ candies.

The above formulae is based on frequentist inference. It's not so hard to arrive at the $\mathbb{P}(X_j=m)$ with recurrence but finding a closed form, which I think will be needed in Bayesian inference, remains a challenge. I'll update if I'm successful in that front.

Until then,
Yours Aye
Me