Saturday, January 16, 2021

Probability on a Contingency Table

Contingency tables come up often when analysing classification problems, whether the classifier is an ML model or a new drug tested against a disease. Given that we are just recovering from a pandemic, let's stick to the case of a Machine Learning model. In the context of ML models, the table is called a Confusion matrix, and we'll use the two terms interchangeably in this post.

A 2x2 contingency table usually has two columns for the two classes of a binary classification (Win vs. Lose, Apple vs. Orange, White vs. Black, etc.) and two rows for the model's prediction. Let's treat the true classification as a 'Hypothesis' and the model's prediction as 'Evidence' supporting it.

Here is what our contingency table would look like.

Table 1

        H            ⌐H
E       n(H∩E)       n(⌐H∩E)
⌐E      n(H∩⌐E)      n(⌐H∩⌐E)


where $n(A)$ denotes the number of elements in set $A$.

We can normalize this table by dividing each of the four entries by the total, thereby creating a new table.

Table 2

        H                             ⌐H
E       $\mathbb{P}(H\cap E)$         $\mathbb{P}(\neg H\cap E)$
⌐E      $\mathbb{P}(H \cap \neg E)$   $\mathbb{P}(\neg H \cap \neg E)$


where we can view each entry as the probability of a classification falling in that cell, and $\mathbb{P}(A)$ denotes the probability of event $A$. Note that the sum of all the entries of Table 2 is 1.
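To make this concrete, here's a quick sketch in Mathematica with made-up counts (the numbers are purely illustrative):

counts = {{40, 10}, {20, 30}};  (* rows: E, ⌐E; columns: H, ⌐H; illustrative counts *)
probs = counts/Total[counts, 2] (* {{2/5, 1/10}, {1/5, 3/10}}; entries sum to 1 *)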

The Wikipedia page on the Confusion matrix gives a huge list of metrics that can be derived from this table. In this post, we visit a couple of probability problems built from them.

Three of the important metrics I learned from these matrices in the context of ML are

Precision, $\displaystyle p=\frac{\mathbb{P}(H\cap E)}{\mathbb{P}(H\cap E)+\mathbb{P}(\neg H\cap E)}=\frac{\mathbb{P}(H\cap E)}{\mathbb{P}(E)}=\mathbb{P}(H|E)$

Recall, $\displaystyle r=\frac{\mathbb{P}(H\cap E)}{\mathbb{P}(H\cap E)+\mathbb{P}(H\cap \neg E)}=\frac{\mathbb{P}(H\cap E)}{\mathbb{P}(H)}=\mathbb{P}(E|H)$

and Accuracy, $a=\mathbb{P}(H \cap E)+\mathbb{P}(\neg H \cap \neg E)$
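Continuing the illustrative table from above, a minimal sketch of the three metrics (the tp/fp/fn/tn names are just for readability):

{{tp, fp}, {fn, tn}} = {{2/5, 1/10}, {1/5, 3/10}}; (* P(H∩E), P(⌐H∩E), P(H∩⌐E), P(⌐H∩⌐E) *)
tp/(tp + fp)  (* precision, P(H|E): 4/5 *)
tp/(tp + fn)  (* recall, P(E|H): 2/3 *)
tp + tn       (* accuracy: 7/10 *)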

Suppose we want to create a random (normalized) confusion matrix. One way to do this would be to create four random variables, each between 0 and 1, that also sum to 1. We can use the Dirichlet distribution with four parameters to achieve this.
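In Mathematica, this is nearly a one-liner (note that DirichletDistribution with four parameters is three-dimensional; the fourth coordinate is one minus the sum of the other three):

xyz = RandomVariate[DirichletDistribution[{1, 1, 1, 1}]];
Partition[Append[xyz, 1 - Total[xyz]], 2] (* a random 2x2 table with entries summing to 1 *)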

But there may be instances where we want to create a confusion matrix with a given precision, recall and accuracy. There are four entries in the table, so three given metrics, together with the fact that the entries should add up to 1, would seem to suggest that these completely determine the table.

But not all such values yield a valid table. For example, it's impossible to create a valid table with a precision of 66.7%, recall of 90.0% and accuracy of 60%. So our first question is: if precision, recall and accuracy are all uniformly distributed random variables, what is the probability that we end up with a valid table?

To produce a valid table, all four entries must be non-negative. Writing the entries in terms of $p$, $r$ and $a$, the true-negative entry $\mathbb{P}(\neg H \cap \neg E)$ turns out to be non-negative precisely when the three variables satisfy

$\displaystyle a \geq \frac{pr}{1-(1-p)(1-r)}$
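A quick check of the numbers quoted above:

p r/(1 - (1 - p) (1 - r)) /. {p -> 0.667, r -> 0.9}
(* ≈ 0.621: any valid table needs a >= 0.621, so a = 0.6 is impossible *)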

Let $T$ be the event of getting a valid table. Using the above, and writing $[\cdot]$ for the Iverson bracket, we have

$\displaystyle \mathbb{P}(T)=\int\limits_0^1\int\limits_0^1\int\limits_0^1\left[ a \geq \frac{pr}{1-(1-p)(1-r)} \right]\,dp\,dr\,da$

For a moment, let's assume that $a$ is given. Then we first solve

$\displaystyle F(a)=\int\limits_0^1\int\limits_0^1\left[a \geq \frac{pr}{1-(1-p)(1-r)}\right]\,dp\,dr$

The expression inside the Iverson bracket holds with equality along a curve in the $r$-$p$ plane, whose equation is given by

$\displaystyle r=\frac{ap}{(1+a)p-a}$
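We can let Mathematica do the rearranging (the output may come back in an equivalent form):

Solve[a == p r/(1 - (1 - p) (1 - r)), r]
(* {{r -> a p/(a p - a + p)}}, i.e. r = a p/((1 + a) p - a) *)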

Plotting the region for $a=1/4$, we get the following graph.
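[Figure: the region satisfying $a \geq \frac{pr}{1-(1-p)(1-r)}$ in the $p$-$r$ plane, for $a=1/4$]

To reproduce the plot, a minimal sketch:

RegionPlot[1/4 >= p r/(1 - (1 - p) (1 - r)), {p, 0, 1}, {r, 0, 1},
 FrameLabel -> {"p", "r"}]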




The region to be integrated lies between the curve and the two axes. We can divide this region along the $p=r$ line. This line intersects the graph at $\left(\frac{2a}{1+a},\frac{2a}{1+a}\right)$. Therefore,

$\displaystyle F(a)=\frac{4a^2}{(1+a)^2}+2\int\limits_{\frac{2a}{1+a}}^1\frac{a p}{(1+a)p-a}\,dp$

Solving the second integral with Wolfram Alpha, we get,

$\displaystyle F(a)=\frac{4a^2}{(1+a)^2}+2\frac{a(1-a)}{(1+a)^2}-2\frac{a^2\log{a}}{(1+a)^2}=\frac{2a}{1+a}-\frac{2a^2\log{a}}{(1+a)^2}$
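Mathematica's Integrate gives the same thing (modulo how it chooses to arrange the result):

Integrate[a p/((1 + a) p - a), {p, 2 a/(1 + a), 1}, Assumptions -> 0 < a < 1]
(* equals a (1 - a)/(1 + a)^2 - a^2 Log[a]/(1 + a)^2; twice this is the curved part of F(a) *)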

Plugging this back into our original equation and integrating, we see that

$\displaystyle \mathbb{P}(T)=\int\limits_0^1\left(\frac{2a}{1+a}-\frac{2a^2\log{a}}{(1+a)^2}\right)\,da=4-\frac{\pi^2}{3}\approx 0.710132$
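And Mathematica agrees on the final integral as well:

Integrate[2 a/(1 + a) - 2 a^2 Log[a]/(1 + a)^2, {a, 0, 1}]
(* 4 - Pi^2/3 *)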

Thus we see that just about 29% of such tables will not be valid. Something that truly surprised me is that $\pi$ makes an appearance here. There are no circles (nor anything even close to one) in this problem!!
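If the closed form feels too magical, a Monte Carlo sketch agrees with it:

With[{n = 10^6},
 N[Count[RandomReal[1, {n, 3}], {p_, r_, a_} /; a >= p r/(p + r - p r)]/n]]
(* ≈ 0.7101 *)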

The second problem we consider is quite similar. Now assume that we create tables (like Table 2) whose values are uniformly distributed and sum to 1. If we want the precision and recall of our random tables to be greater than some thresholds, what would be the expected accuracy of the table?

For clarity, let $\mathbb{P}(H \cap E)=X$, $\mathbb{P}(\neg H \cap E)=Y$, $\mathbb{P}(H \cap \neg E)=Z$ and $\mathbb{P}(\neg H \cap \neg E)=W$, so that $(X,Y,Z,W)\sim \text{Dir}(1,1,1,1)$. We then want

$\displaystyle \mathbb{E}(1-Y-Z\,|\,\mathbb{P}(E|H)\geq r,\mathbb{P}(H|E)\geq p)=\frac{1}{V}\int\limits_Q (1-y-z) \,dx\,dy\,dz$

where $Q$ is the region such that

$Q=\{(x,y,z):(1-p)x\geq py \land (1-r)x\geq rz \land x+y+z \leq 1 \land x,y,z\geq 0\}$

and $V$ is the volume enclosed by $Q$.

Evaluating this integral is not so easy. The region of integration depends on the values of $p$ and $r$, and it kind of ends in a mess of equations. But with some luck and a lot of Mathematica, we can see that

$\displaystyle \mathbb{E}(\mathbb{P}(H \cap E)+\mathbb{P}(\neg H \cap \neg E)\text{  }|\text{  }\mathbb{P}(E|H)\geq r,\mathbb{P}(H|E)\geq p)=\frac{2(p+r)^2+p^3+r^3-pr(p^2+r^2)}{4(p+r)(1-(1-p)(1-r))}$

I have no way of making sense of that expression but, hey, we have an expectation on probabilities!
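Still, we can sanity-check it numerically. A sketch using NIntegrate with an indicator function, at $p=1/5$, $r=1/4$ (the same thresholds used in the code below):

p = 1/5; r = 1/4;
ind = Boole[(1 - p) x >= p y && (1 - r) x >= r z];
num = NIntegrate[ind (1 - y - z), {x, 0, 1}, {y, 0, 1 - x}, {z, 0, 1 - x - y}];
den = NIntegrate[ind, {x, 0, 1}, {y, 0, 1 - x}, {z, 0, 1 - x - y}];
num/den  (* ≈ 0.5882, which is what the closed form gives: 847/1440 *)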

Hope you enjoyed this discussion.
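For completeness, the Mathematica code below collects everything for the second problem: the closed form ExpecAcc, a few ImplicitRegion helpers for the different pairs of constraints, and a Monte Carlo cross-check of the expected accuracy.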

Clear["Global`*"];
p = 200/1000; r = 250/1000;
ImpReg[a_, p_, r_] := 
  ImplicitRegion[
   y + z <= 1 - a && (1 - p) x >= p y && (1 - r) x >= r z && 0 <= x &&
     0 <= y && 0 <= z && x + y + z <= 1, {x, y, z}];
ImpRegap[a_, p_] := 
  ImplicitRegion[
   y + z <= 1 - a && (1 - p) x >= p y && 0 <= x && 0 <= y && 0 <= z &&
     x + y + z <= 1, {x, y, z}];
ImpRegpr[p_, r_] := 
  ImplicitRegion[(1 - p) x >= p y && (1 - r) x >= r z && 0 <= x && 
    0 <= y && 0 <= z && x + y + z <= 1, {x, y, z}];
ImpRegra[r_, a_] := 
  ImplicitRegion[
   y + z <= 1 - a && (1 - r) x >= r z && 0 <= x && 0 <= y && 0 <= z &&
     x + y + z <= 1, {x, y, z}];
ExpecAcc[p_, r_] := (-2 p^2 - p^3 - 4 p r + p^3 r - 2 r^2 - r^3 + 
   p r^3)/(4 (p + r) (-p - r + p r));
ExpecAcc[p, r]
N[%]
lim = 1000000;
cnt = 0;
val = 0;
Do[
  x = -Log[RandomReal[]]; y = -Log[RandomReal[]];
  z = -Log[RandomReal[]]; w = -Log[RandomReal[]];
  t = w + x + y + z;
  x /= t; w /= t; y /= t; z /= t;
  If[And[x/(x + y) >= p, x/(x + z) >= r],
   cnt += 1; val += 1 - y - z;
   ];
  , {i, lim}];
N[val/cnt]


Until then
Yours Aye
Me
