The Ghost of Statistics Past

Flip a coin 20 times.

  1. Suppose we see heads each time. This would cause me to strongly believe the coin is unfair.

  2. Suppose we get THHTHHHTTTHHTHTHHTHH. Let us call this particular sequence \(S\). Seeing \(S\) would not cause me to strongly believe the coin is unfair. We count 12 heads in \(S\), so the coin may favour heads, but there’s too little evidence to say. Indeed, I’d believe there’s still a decent chance that the coin may even favour tails slightly.

How can we mathematically formalize my beliefs? According to my undergrad textbook:

  1. For a fair coin, the probability of getting 20 heads in 20 flips is \(2^{-20}\), which is less than 1 in a million. This is small, so the coin is likely unfair.

  2. For a fair coin, the probability of seeing at least 12 heads is approximately 0.25. As 0.25 is not small, we lack significant evidence that the coin is unfair.

The authors later state in a carefully indented paragraph: "…​the smaller the probability of a result as unusual as (or even more unusual than) the observed one, the stronger our feeling that the coin is a trick coin" (emphasis theirs). Or in more general terms: for a hypothesis \(H\) and data \(D\), if \(P(D|H)\) is small, then \(H\) is likely false.

At the time, I accepted this principle without question. My university professors surely knew what they were doing, right?

A Likely Story

So that we can discuss it, let’s name the above principle. We call it quasi-contraposition, because one specious argument for it proceeds as follows. Suppose \(A\) implies \(B\). By the law of contraposition, if \(B\) is false, then it follows that \(A\) is false:

\[ A \rightarrow B \implies \neg B \rightarrow \neg A \]

Replace \(A\) with \(H\) and \(B\) with “\(\neg D\) is likely”:

\[ H \rightarrow \neg D \mbox{ is likely } \implies D \mbox{ is likely } \rightarrow \neg H \]

We could ostensibly move the "is likely" on the right-hand side:

\[ H \rightarrow \neg D \mbox{ is likely } \implies D \rightarrow \neg H \mbox { is likely } \]

We have the law of quasi-contraposition. The last manoeuvre seems shady, but perhaps we could fool someone with enough bluster: "You see, since \(D\) happened, it suggests \(D\) was likely to happen, which as we know implies \(\neg H\). Of course, one does not simply conclude \(\neg H\), because there was a small chance that \(D\) is unlikely but it happened anyway. So we conclude \(\neg H\) is likely."

We work through the procedure given in my textbook for the sequence \(S\). We’re instructed to use the binomial distribution for this problem, so we count 12 heads in \(S\), and compute \(P(X\ge 12)\), that is, the probability of seeing at least 12 heads in 20 flips of a fair coin:

\[ \sum_k {20 \choose k} 2^{-20} [12 \le k \le 20] \]

We can compute this with a little Haskell:

> let { ch n 0 = 1; ch n k = n*(ch (n - 1) (k - 1)) / fromIntegral k }
> sum[ch 20 k | k <- [12..20]] / 2^20
0.2517223358154297

The probability is indeed a bit larger than 0.25, and since this is greater than 0.05, we deem our finding insignificant, that is, we conclude we lack strong evidence to dispute the hypothesis that the coin is fair.

What if we replace 12 with 20?

> sum[ch 20 k | k <- [20..20]] / 2^20
9.5367431640625e-7

We get a probability far smaller than 0.05, so for a coin that shows heads each time, we conclude there is strong evidence that our fair-coin hypothesis is false.

Lies of Omission

Although the end result matches our intuition, there are peculiarities in the procedure:

  1. The only thing we remember about the sequence \(S\) is the number 12. Why should all sequences containing exactly 12 heads be treated the same?

  2. Why do we compute \(P(X \ge 12)\)? Where did this inequality come from? We know there are exactly 12 heads and no more!

In other words, we deliberately throw away information. Twice.

Why do we wilfully neglect some of our data? If a ghost could whisk me back to my undergrad days, perhaps I’d witness my younger self say: "The probability of seeing any particular sequence, such as \(S\), is always \(2^{-20}\), so focusing on a particular sequence obviously fails. Since each coin flip is independent of the others, it makes sense just to count the number of heads instead."

"As for the inequality: the probability of seeing exactly \(k\) heads is:

\[ P(X = k) = {20 \choose k} 2^{-20} \]

which is always too small to work with. If we replace it with \(P(X \ge k)\) for large \(k\) and \(P(X \le k)\) for small \(k\) then we get a probability that is tiny for extreme values of \(k\), but huge for reasonable values of \(k\). In other words, we get a number that can distinguish between likely and unlikely \(k\)."

In short, my past self would say we do what we do because it works. We play around until we find a quantity that almost disappears when we want it to. It’s practical (we need only compare against 0.05) and convincing (because it involves fancy mathematics).
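
To see the contrast concretely, here is a quick illustration of my own (not from the textbook), reusing `ch` from the session above:

> [ch 20 k / 2^20 | k <- [10, 12, 16, 20]]                        -- P(X = k): about 0.18, 0.12, 0.005, 1e-6
> [sum [ch 20 j / 2^20 | j <- [k..20]] | k <- [10, 12, 16, 20]]   -- P(X >= k): about 0.59, 0.25, 0.006, 1e-6

The point probability is modest even at its peak, while the tail probability swings from large for run-of-the-mill counts to tiny for extreme ones.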

Isn’t this intellectually unsatisfying? On the one hand, it certainly sounds better to say "following standard procedure, the P-value is less than 0.05; therefore we have significant evidence the hypothesis is false" instead of "\(k\) seems kind of extreme so the hypothesis is probably false". On the other hand, if we’re going to all this trouble to quantify how strongly we believe a hypothesis is true, why not do a proper job and justify each step, rather than settle on some ad hoc procedure?

Perhaps the procedure only appears ad hoc because the derivation is omitted to avoid scaring students fresh out of high school. Let’s suppose this is the case and try to derive probability theory from first principles, one of which the authors insist is quasi-contraposition.

We have a coin. Our hypothesis is that it is fair. The probability of seeing any particular sequence of 20 flips such as \(S\) is \(2^{-20}\), which is tiny. By quasi-contraposition, seeing such an "unusual" outcome means our hypothesis is likely wrong. So no matter what, we should always believe the coin is unfair!

By the same token, consider rolling a \(2^{20}\)-sided die that we believe to be fair. After a single roll, we see a number that has a \(2^{-20}\) chance of showing up. Wow, this is much less than 0.05! The die must be loaded!

The inescapable conclusion: quasi-contraposition is wrong.

Master Probability With This One Weird Trick

If quasi-contraposition is wrong, then what is right?

Whatever it is, it must capture our intuition. If we flip a coin 20 times and see 20 heads, we suspect the coin is unfair. If we see the sequence \(S\), we are much less suspicious. Either event occurs with probability \(2^{-20}\) so there must be other information that affects our beliefs. What could it be?

The answer is that we are aware that trick coins exist, and willing to entertain the possibility that the coin in question is such a coin. For a fair coin, the probability of seeing 20 heads in a row is \(2^{-20}\), but for certain trick coins the probability is much higher. Indeed, an extremely unfair coin might show heads every time. We think: "Is this a fair coin that just happened to come up heads every time, or is this a trick coin that heavily favours heads? Surely the latter is likelier!"
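
For instance, here is a quick comparison in GHCi (my own aside, not the textbook's) of the probability of 20 straight heads under a few head-probabilities:

> [p^20 | p <- [0.5, 0.75, 0.9, 1]]   -- roughly 9.5e-7, 3.2e-3, 0.12, and 1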

How about the sequence \(S\)? For a fair coin, the probability of seeing the sequence \(S\) is also \(2^{-20}\). But this time, we feel:

  • Unlike the previous case, the probability of seeing \(S\) ought to be minuscule for any coin, fair or not. (Exercise: Show the probability of seeing \(S\) maxes out for a coin that shows heads with probability \(12/20\), but only at a value less than double \(2^{-20}\); see the numerical check after this list.)

  • The coin is unlikely to be heavily biased one way or the other.

  • The coin is most likely biased \(12/20\) in favour of heads, but we’d need to flip a lot more times to tell.
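
As a numerical check on the exercise (a sketch of my own; `likeS` is just my name for the probability of seeing \(S\) given head-probability \(p\)):

> let likeS p = p^12 * (1 - p)^8
> maximum [likeS p | p <- [0, 0.001 .. 1]]   -- about 1.4e-6, attained near p = 0.6
> 2 / 2^20                                   -- about 1.9e-6, so the maximum is indeed less than double 2^-20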

We can mathematically formalize these thoughts with one simple trick. Rather than \(P(D|H)\), we flip it around and ask for \(P(H|D)\). In other words, given the data, we find a number that represents how strongly we believe the hypothesis is true.

The probability \(P(H|D)\) is the one true principle we’ve been seeking. It’s the truth, the whole truth, and nothing but the truth. It’s the number that represents how strongly we should believe \(H\), given what we’ve seen so far. With \(P(H|D)\), the difficulties we encountered melt away.

Worked Example

We can compute \(P(H|D)\) with Bayes' Theorem:

\[ P(H|D) = P(H) P(D|H) / P(D) \]

Thus our previous work has not been in vain. Computing \(P(D|H)\) is useful; it’s just not our final answer.

What about \(P(D)\)? This is the probability that \(D\) occurs, but without assuming any hypothesis in particular. Or, more accurately, with default degrees of belief in each possible hypothesis; degrees of belief held prior to examining the evidence \(D\). Similarly, \(P(H)\) is how strongly we believe \(H\) to be true in the absence of the data \(D\).

Let us say we are willing to consider the following 11 hypotheses: the coin shows heads with probability 0, 0.1, 0.2, …​, or 1. Furthermore, we believe each possibility is equally likely.

First suppose our data \(D\) is 20 heads in 20 coin flips. As before, let \(H\) be the hypothesis that the coin is fair. We find:

\[P(D) = \frac{1}{11} \sum_p p^{20} [p \in \{0, 0.1, ..., 1\}]\]

which is:

> sum[p^20 | p <- [0,0.1..1]] / 11
0.1030855744171205

We have \(P(D|H) = 2^{-20}\), and \(P(H) = 1/11\), hence:

\[P(H|D) = (1/11) \times 2^{-20} / 0.103... = 8.41... \times 10^{-7}\]

In other words, our belief that the coin is fair has dropped from \(1/11\) to less than one in a million.
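
A quick check in GHCi, plugging in the value just computed:

> (1/11) / 2^20 / 0.1030855744171205   -- about 8.41e-7, matching the value above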

Now suppose our data \(D\) is the sequence \(S\). This time:

\[P(D) = \frac{1}{11} \sum_p p^{12} (1-p)^8 [p \in \{0, 0.1, ..., 1\}]\]

which is:

> sum[p^12*(1 - p)^8 | p <- [0,0.1..1]] / 11
3.4365821298193906e-7

Even though \(P(D|H)\) is again \(2^{-20}\), we find:

\[P(H|D) = (1/11) \times 2^{-20} / (3.43... \times 10^{-7}) = 0.252...\]

Thus our belief that the coin is fair has increased from \(1/11\) to over \(1/4\).

The Bayesian approach has outdone my textbook. We get meaningful results without throwing away any information. We used the entire sequence, not just the number of heads. No inequalities were needed.
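
To consolidate, here is a small sketch of the whole calculation (the names `ps`, `like`, and `post` are my own); it computes \(P(H|D)\) for any of the 11 hypotheses directly from a flip sequence:

> let ps = [0, 0.1 .. 1]                                           -- the 11 hypothesised head-probabilities
> let like p s = product [if c == 'H' then p else 1 - p | c <- s]  -- P(D|p) for a flip sequence s
> let post p s = like p s / sum [like q s | q <- ps]               -- P(p|D); the uniform prior cancels
> post 0.5 (replicate 20 'H')                                      -- about 8.41e-7, as above
> post 0.5 "THHTHHHTTTHHTHTHHTHH"                                  -- about 0.252, as above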

Wilful Negligence

What if we discard information anyway, and only use the fact that exactly 12 heads were flipped? In this case, we find:

\[P(D) = \frac{1}{11} \sum_p {20 \choose 12} p^{12} (1-p)^8 [p \in \{0, 0.1, ..., 1\}]\]

and \(P(D|H) = {20 \choose 12} 2^{-20}\). When computing \(P(H|D)\), the factor \({20 \choose 12}\) cancels out, and we arrive at the same answer. In other words, we’ve shown it’s fine to forget the particular sequence and only count the number of heads after all. What is not fine is doing so without justification.
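
A quick check in the same spirit (reusing `ch` from earlier and `ps` from the sketch above; `likeN` is my name for the probability of exactly 12 heads given head-probability \(p\)):

> let likeN p = ch 20 12 * p^12 * (1 - p)^8
> likeN 0.5 / sum [likeN q | q <- ps]   -- about 0.252 again; the binomial coefficient cancels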

It is also reassuring that using all available information gives an answer that is at least as good as using only partial information (in this case, they agree). Contrast this with quasi-contraposition, which leads to nonsense if we focus on a particular sequence of flips.

What if we go further and introduce inequalities as before? The probability that we see at least 12 heads over all possible coins is:

\[P(D) = \sum_k \frac{1}{11} \sum_p {20 \choose k} p^k (1-p)^{20-k} [p \in \{0, 0.1, ..., 1\}, 12 \le k \le 20]\]

And for a fair coin:

\[ P(D|H) = \sum_k {20 \choose k} 2^{-20} [12 \le k \le 20] \]

We find:

> let pd = sum[(ch 20 k)*p^k*(1-p)^(20-k) | p <- [0,0.1..1], k <- [12..20]] / 11
> pd
0.43506493353472037
> let pdh = sum[(ch 20 k) / 2^20 | k <- [12..20]]
> pdh
0.2517223358154297

and hence \(P(D|H)/P(D) = 0.578…​\), which means \(P(H|D)\) is smaller than \(P(H)\). That is, the evidence weakens our belief that the coin is fair. Recall that seeing exactly 12 heads strengthens our belief that the coin is fair, so by introducing an inequality, we discard so much information that our conclusion runs contrary to the truth.

The above is enough for me to shun my frequentist textbook and join the "Bayesian revolution":

  • It is natural to ask if given evidence strengthens or weakens a hypothesis, and by how much, rather than merely decide if a result is "significant". All else being equal, I’d choose the method that can handle this over the one that can’t.

  • We saw that discarding information can hurt our results. In our example, frequentism preserved enough data to lead to an acceptable conclusion, but do we trust it to work for other problems? How do we know it hasn’t thrown away too much data?

  • The frequentist approach fails to mirror the way I think. Frequentism is like doing taxes: a bunch of arbitrary laws and procedures which we follow to get some number that we hope is right.

  • The Bayesian approach matches my intuition, and feels like a generalization of logical reasoning.

  • The Bayesian approach forces us to be explicit about our assumptions, such as 11 equally likely hypotheses. With frequentism, somebody assumed something long ago, figured some stuff out, and handed us a distribution and a procedure. Who knows what the implicit assumptions are?

Further Reading

Probability Theory: The Logic of Science by E. T. Jaynes. Laplace wrote: "Probability theory is nothing but common sense reduced to calculation". Jaynes explains how and why, though Cox’s Theorem, a vital step in his argument, turns out to require more axioms.

Information Theory, Inference and Learning Algorithms by David MacKay presents compelling applications of Bayesian inference: information theory, data compression, Monte Carlo methods, and neural networks.

Ten Great Ideas about Chance by Persi Diaconis and Brian Skyrms. (Not free.) Historical, and less technical. If I didn’t already believe the frequentist emperor is wearing no clothes, this book would have convinced me. It makes me wish I could go back and Socratically question my old professors. After they’ve defined probability in terms of frequencies, I could ask "stupid" questions like: "What does it mean to run the trial again? If I roll the dice the same way, won’t I get the same result?" They also coin a better term for what I called quasi-contraposition: Bernoulli’s swindle.

Statistical Rituals: The Replication Delusion and How We Got There by Gerd Gigerenzer. We’re actually even worse off than if we had simply stuck with Fisher. Statisticians like Neyman and Pearson opposed Fisher so fiercely that hapless textbook authors felt compelled to compromise on a procedure that made no sense to anyone! For example, none of these statisticians stipulated a 5% threshold, and all of them stressed the importance of using personal judgement (which can be made rigorous by updating a Bayesian prior; just saying) rather than following a recipe mechanically. I also learned another cute name for Bernoulli’s swindle, Bayesian wishful thinking, as well as more formal names for it, such as the inverse probability error.


Ben Lynn blynn@cs.stanford.edu