The Ghost of Statistics Past

Flip a coin 20 times.

  1. Suppose we see heads each time. This would cause me to strongly believe the coin is unfair.

  2. Suppose we get THHTHHHTTTHHTHTHHTHH. Let us call this particular sequence \(S\). Seeing \(S\) would not cause me to strongly believe the coin is unfair. We count 12 heads in \(S\), so the coin may favour heads, but there’s too little evidence to say. Indeed, I’d believe there’s still a decent chance that the coin may even favour tails slightly.

How can we mathematically formalize my beliefs? According to my undergrad textbook (M.C. Phipps and M.P. Quine, A Primer of Statistics, second edition):

  1. For a fair coin, the probability of getting 20 heads in 20 flips is \(2^{-20}\), which is less than 1 in a million. This is small, so the coin is likely unfair.

  2. For a fair coin, the probability of seeing at least 12 heads is approximately 0.25. As 0.25 is not small, we lack significant evidence that the coin is unfair.

The authors later state in a carefully indented paragraph:

…the smaller the probability of a result as unusual as (or even more unusual than) the observed one, the stronger our feeling that the coin is a trick coin

(emphasis theirs). Or in more general terms: for a hypothesis \(H\) and data \(D\), if \(P(D|H)\) is small, then \(H\) is likely false.

At the time, I accepted this principle without question. My university professors surely knew what they were doing, right?

A Likely Story

So that we can discuss it, let’s name the above principle. We call it quasi-contraposition, because one specious argument for it proceeds as follows. Suppose \(A\) implies \(B\). By the law of contraposition, if \(B\) is false, then it follows that \(A\) is false:

\[ A \rightarrow B \implies \neg B \rightarrow \neg A \]

Replace \(A\) with \(H\) and \(B\) with “\(\neg D\) is likely”:

\[ H \rightarrow \neg D \mbox{ is likely } \implies D \mbox{ is likely } \rightarrow \neg H \]

We could ostensibly move the "is likely" to the right-hand side:

\[ H \rightarrow \neg D \mbox{ is likely } \implies D \rightarrow \neg H \mbox{ is likely } \]

We have the law of quasi-contraposition. The last manoeuvre seems shady, but perhaps we could fool someone with enough bluster: "You see, since \(D\) happened, it suggests \(D\) was likely to happen, which as we know implies \(\neg H\). Of course, one does not simply conclude \(\neg H\), because there was a small chance that \(D\) is unlikely but it happened anyway. So we conclude \(\neg H\) is likely."

We work through the procedure given in my textbook for the sequence \(S\). We’re instructed to use the binomial distribution for this problem, so we count 12 heads in \(S\), and compute \(P(X\ge 12)\), that is, the probability of seeing at least 12 heads in 20 flips of a fair coin:

\[ \sum_k {20 \choose k} 2^{-20} [12 \le k \le 20] \]

We can compute this with a little Haskell:

-- ch n k is the binomial coefficient C(n, k), computed via the
-- recurrence C(n, k) = (n / k) C(n - 1, k - 1), with C(n, 0) = 1.
ch n 0 = 1.0
ch n k = n*(ch (n - 1) (k - 1)) / k

sum[ch 20 k | k <- [12..20]] / 2^20

The probability is indeed a bit larger than 0.25, and since this is greater than 0.05, we deem our finding insignificant, that is, we conclude we lack strong evidence to dispute the hypothesis that the coin is fair.

What if we replace 12 with 20?

sum[ch 20 k | k <- [20..20]] / 2^20

We get a probability far smaller than 0.05, so for a coin that shows heads each time, we conclude there is strong evidence that our fair-coin hypothesis is false.

Lies of Omission

Although the end result matches our intuition, there are peculiarities in the procedure:

  1. The only thing we remember about the sequence \(S\) is the number 12. Why should all sequences containing exactly 12 heads be treated the same?

  2. Why do we compute \(P(X \ge 12)\)? Where did this inequality come from? We know there are exactly 12 heads and no more!

In other words, we deliberately throw away information. Twice.

Why do we wilfully neglect some of our data? If a ghost could whisk me back to my undergrad days, perhaps I’d witness my younger self say: "The probability of seeing any particular sequence, such as \(S\), is always \(2^{-20}\), so focusing on a particular sequence obviously fails. Since each coin flip is independent of the others, it makes sense just to count the number of heads instead."

"As for the inequality: the probability of seeing exactly \(k\) heads is:

\[ P(X = k) = {20 \choose k} 2^{-20} \]

which is always too small to work with. If we replace it with \(P(X \ge k)\) for large \(k\) and \(P(X \le k)\) for small \(k\) then we get a probability that is tiny for extreme values of \(k\), but huge for reasonable values of \(k\). In other words, we get a number that can distinguish between likely and unlikely \(k\)."
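
To see the two quantities side by side, here is a quick GHCi sketch comparing the point probability \(P(X = k)\) with the tail probability \(P(X \ge k)\) for a few values of \(k\). It reuses the ch function defined above; the names exactP and tailP are ad hoc:

-- Point probability P(X = k) versus tail probability P(X >= k)
-- for 20 flips of a fair coin, reusing ch from above.
exactP k = ch 20 k / 2^20
tailP k = sum [ch 20 j | j <- [k..20]] / 2^20
[(k, exactP k, tailP k) | k <- [10, 12, 15, 18, 20]]

For middling counts the tail probability is sizeable, while for extreme counts it all but vanishes: the distinguishing behaviour my younger self describes.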

In short, my past self would say we do what we do because it works. We play around until we find a quantity that almost disappears when we want it to. It’s practical (we need only compare against 0.05) and convincing (because it involves fancy mathematics).

Isn’t this intellectually unsatisfying? On the one hand, it certainly sounds better to say "following standard procedure, the P-value is less than 0.05; therefore we have significant evidence the hypothesis is false" instead of "\(k\) seems kind of extreme so the hypothesis is probably false". On the other hand, if we’re going to all this trouble to quantify how strongly we believe a hypothesis is true, why not do a proper job and justify each step, rather than settle on some ad hoc procedure?

Perhaps the procedure only appears ad hoc because the derivation is omitted to avoid scaring students fresh out of high school. Let’s suppose this is the case and try to derive probability theory from first principles, one of which the authors insist is quasi-contraposition.

We have a coin. Our hypothesis is that it is fair. The probability of seeing any particular sequence of 20 flips such as \(S\) is \(2^{-20}\), which is tiny. By quasi-contraposition, seeing such an "unusual" outcome means our hypothesis is likely wrong. So no matter what, we should always believe the coin is unfair!

By the same token, consider rolling a \(2^{20}\)-sided die that we believe to be fair. After a single roll, we see a number that has a \(2^{-20}\) chance of showing up. Wow, this is much less than 0.05! The die must be loaded!

The inescapable conclusion: quasi-contraposition is wrong.

Master Probability With This One Weird Trick

If quasi-contraposition is wrong, then what is right?

Whatever it is, it must capture our intuition. If we flip a coin 20 times and see 20 heads, we suspect the coin is unfair. If we see the sequence \(S\), we are much less suspicious. Either event occurs with probability \(2^{-20}\) so there must be other information that affects our beliefs. What could it be?

The answer is that we are aware that trick coins exist, and willing to entertain the possibility that the coin in question is such a coin. For a fair coin, the probability of seeing 20 heads in a row is \(2^{-20}\), but for certain trick coins the probability is much higher. Indeed, an extremely unfair coin might show heads every time. We think: "Is this a fair coin that just happened to come up heads every time, or is this a trick coin that heavily favours heads? Surely the latter is likelier!"

How about the sequence \(S\)? For a fair coin, the probability of seeing the sequence \(S\) is also \(2^{-20}\). But this time, we feel:

  • Unlike the previous case, the probability of seeing \(S\) ought to be minuscule for any coin, fair or not. (Exercise: Show the probability of seeing \(S\) maxes out for a coin that shows heads with probability \(12/20\), but only at a value less than double \(2^{-20}\). A numerical check appears after this list.)

  • The coin is unlikely to be heavily biased one way or the other.

  • The coin is most likely biased \(12/20\) in favour of heads, but we’d need to flip a lot more times to tell.
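
As a quick numerical check of the exercise (a grid search in GHCi, a sketch rather than a proof), the likelihood of \(S\) peaks near a bias of 0.6 and stays below \(2 \times 2^{-20}\):

-- Likelihood of the sequence S as a function of the bias p, maximized
-- over a fine grid, next to the bound 2 * 2^-20.
maximum [p^12 * (1 - p)^8 | p <- [0,0.001..1]]
2 / 2^20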

We can mathematically formalize these thoughts with one simple trick. Rather than \(P(D|H)\), we flip it around and ask for \(P(H|D)\). In other words, given the data, we find a number that represents how strongly we believe the hypothesis is true.

The probability \(P(H|D)\) is the one true principle we’ve been seeking. It’s the truth, the whole truth, and nothing but the truth. It’s the number that represents how strongly we should believe \(H\), given what we’ve seen so far. With \(P(H|D)\), the difficulties we encountered melt away.

Worked Example

We can compute \(P(H|D)\) with Bayes' Theorem:

\[ P(H|D) = P(H) P(D|H) / P(D) \]

Thus our previous work has not been in vain. Computing \(P(D|H)\) is useful; it’s just not our final answer.

What about \(P(D)\)? This is the probability that \(D\) occurs, but without assuming any hypothesis in particular. Or, more accurately, with default degrees of belief in each possible hypothesis; degrees of belief held prior to examining the evidence \(D\). Similarly, \(P(H)\) is how strongly we believe \(H\) to be true in the absence of the data \(D\).
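
Spelled out, with competing hypotheses \(H_1, \dots, H_n\) and prior beliefs \(P(H_i)\), the law of total probability gives \(P(D)\) as the prior-weighted average of the likelihoods:

\[ P(D) = \sum_i P(H_i) P(D|H_i) \]

This is the sum we compute below.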

Let us say we are willing to consider the following 11 hypotheses: the coin shows heads with one of the probabilities 0, 0.1, 0.2, …, 1. Furthermore, we believe each possibility is equally likely.

First suppose our data \(D\) is 20 heads in 20 coin flips. As before, let \(H\) be the hypothesis that the coin is fair. We find:

\[P(D) = \frac{1}{11} \sum_p p^{20} [p \in \{0, 0.1, ..., 1\}]\]

which is:

sum[p^20 | p <- [0,0.1..1]] / 11

We have \(P(D|H) = 2^{-20}\), and \(P(H) = 1/11\), hence:

\[P(H|D) = (1/11) \times 2^{-20} / 0.103... = 8.41... \times 10^{-7}\]

In other words, our belief that the coin is fair has dropped from \(1/11\) to less than one in a million.

Now suppose our data \(D\) is the sequence \(S\). This time:

\[P(D) = \frac{1}{11} \sum_p p^{12} (1-p)^8 [p \in \{0, 0.1, ..., 1\}]\]

which is:

sum[p^12*(1 - p)^8 | p <- [0,0.1..1]] / 11

Even though \(P(D|H)\) is again \(2^{-20}\), we find:

\[P(H|D) = (1/11) \times 2^{-20} / (3.43... \times 10^{-7}) = 0.252...\]

Thus our belief that the coin is fair has increased from \(1/11\) to over \(1/4\).

The Bayesian approach has outdone my textbook. We get meaningful results without throwing away any information. We used the entire sequence, not just the number of heads. No inequalities were needed.
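
As a sketch of how little extra work the full picture costs, we can lay out the posterior over all 11 candidate biases at once; the uniform prior \(1/11\) cancels when we normalize (the name ws is ad hoc):

-- Posterior over the 11 candidate biases given the sequence S
-- (12 heads, 8 tails); the uniform prior cancels on normalization.
ws = [p^12 * (1 - p)^8 | p <- [0,0.1..1]]
zip [0,0.1..1] [w / sum ws | w <- ws]

The entry for a bias of 0.5 reproduces the 0.252 above, and the largest share of belief, a bit under 0.4, goes to a bias of 0.6, matching the intuition that the coin most likely favours heads \(12/20\) of the time.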

Willful Negligence

What if we discard information anyway, and only use the fact that exactly 12 heads were flipped? In this case, we find:

\[P(D) = \frac{1}{11} \sum_p {20 \choose 12} p^{12} (1-p)^8 [p \in \{0, 0.1, ..., 1\}]\]

and \(P(D|H) = {20 \choose 12} 2^{-20}\). When computing \(P(H|D)\), the factor \({20 \choose 12}\) cancels out, and we arrive at the same answer. In other words, we’ve shown it’s fine to forget the particular sequence and only count the number of heads after all. What is not fine is doing so without justification.
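
As a quick GHCi sanity check of this cancellation (a sketch reusing ch from above; the name pd' is ad hoc), carrying the \({20 \choose 12}\) factor through both numerator and denominator reproduces the 0.252 found earlier:

-- Using only the head count: the choose(20,12) factor appears in both
-- P(D|H) and P(D), so it cancels when we divide.
pd' = sum [ch 20 12 * p^12 * (1 - p)^8 | p <- [0,0.1..1]] / 11
(1/11) * ch 20 12 / 2^20 / pd'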

It is also reassuring that using all available information gives an answer that is at least as good as using only partial information (in this case, they agree). Contrast this with quasi-contraposition, which leads to nonsense if we focus on a particular sequence of flips.

What if we go further and introduce inequalities as before? The probability that we see at least 12 heads over all possible coins is:

\[P(D) = \sum_k \frac{1}{11} \sum_p {20 \choose k} p^k (1-p)^{20-k} [p \in \{0, 0.1, ..., 1\}, 12 \le k \le 20]\]

And for a fair coin:

\[ P(D|H) = \sum_k {20 \choose k} 2^{-20} [12 \le k \le 20] \]

We find:

pd = sum[(ch 20 $ fromIntegral k)*p^k*(1-p)^(20-k)
  | p <- [0,0.1..1], k <- [12..20]] / 11
pdh = sum[(ch 20 $ fromIntegral k) / 2^20 | k <- [12..20]]
putStrLn $ "P(D) = " ++ show pd
putStrLn $ "P(D|H) = " ++ show pdh
putStrLn $ "P(D|H)/P(D) = " ++ show (pdh/pd)

and hence \(P(H|D) < P(H)\). That is, the evidence weakens our belief that the coin is fair. Recall that seeing exactly 12 heads strengthened our belief that the coin is fair; by introducing an inequality, we have discarded so much information that our conclusion runs contrary to the truth.

The above is enough for me to shun my frequentist textbook and join the "Bayesian revolution":

  • It is natural to ask how new evidence strengthens or weakens my hypotheses, and by how much, rather than merely decide if a result is "significant". All else being equal, I’d choose the method that can handle this over the one that can’t.

  • We saw that discarding information can hurt our results. In our example, frequentism preserved enough data to lead to an acceptable conclusion, but do we trust it to work for other problems? How do we know it hasn’t thrown away too much data?

  • The frequentist approach fails to mirror the way I think. Frequentism is like doing taxes: a bunch of arbitrary laws and procedures which we follow to get some number that we hope is right; we can also reinterpret the rules to nudge this number in a desired direction!

  • Bayesian reasoning matches my intuition, and feels like a generalization of logical reasoning.

  • Bayesian reasoning forces us to be explicit about our assumptions, such as 11 equally likely hypotheses. With frequentism, somebody assumed something long ago, figured some stuff out, and handed us a distribution and a procedure. Why don’t they reveal their assumptions? (Donald Rumsfeld might have asked: what are the unknown knowns?)

Further Reading

E. T. Jaynes, Probability Theory: The Logic of Science. Laplace wrote: "Probability theory is nothing but common sense reduced to calculation". Jaynes explains how and why, though a vital step in his argument, Cox’s Theorem, turns out to require more axioms.

David Mackay, Information Theory, Inference and Learning Algorithms. This free book presents compelling applications of Bayesian inference: information theory, data compression, Monte Carlo methods, and neural networks.

David Mackay, Sustainable Energy: Without the Hot Air. Although unrelated to probability, I recommend it here because Mackay again clearly explains how to navigate an area infested by influential charlatans who attempt to mislead us with labyrinthine arguments, obscuring their speciousness with complexity.

Persi Diaconis and Brian Skyrms, Ten Great Ideas about Chance (not free). Historical, and less technical. If I didn’t already believe the frequentist emperor is wearing no clothes, this book would have convinced me. It makes me wish I could go back and Socratically question my old professors. After they’ve defined probability in terms of frequencies, I could ask "stupid" questions like: "What does it mean to run the trial again? If I roll the dice the same way, won’t I get the same result?" They also coin a better term for what I called quasi-contraposition: Bernoulli’s swindle.

John Ioannidis, Why Most Published Research Findings Are False: Does what it says on the tin, in just a few pages. While frequentism is not the only culprit, it’s responsible for much of the damage.

Gerd Gigerenzer, Statistical Rituals: The Replication Delusion and How We Got There. What did scientists do before frequentist nonsense? How did frequentism become nonsense? (Did you know it’s partly due to commercialization of textbooks? One might even call it predatory publishing.) What can we do about it? Gigerenzer uses another cute name for Bernoulli’s swindle: Bayesian wishful thinking, as well as the more formal-sounding inverse probability error.

A Ghost of Statistics Present

In 2015, Fred Ross kindly informed me that this page had been posted to Hacker News. I’m also grateful he took the time to present an alternative view:

The underlying theory that justifies most inference (Bayesian, minimax, etc.) is decision theory, which is a subset of the theory of games. Savage’s book on the foundations of statistics has a very nice discussion of why this should be. I learned it from Kiefer’s book, which is the only book I know of that starts there. Lehmann or Casella both get to it later in their books.

The justification for p-value is actually the Neyman-Pearson theory of hypothesis testing. The p-value is the critical value of alpha in that framework. I wrote a couple of expository articles for clinicians going through this if you’re interested.

Jaynes was a wonderful thinker, but be aware that a lot of the rational actor theory breaks down when you don’t have a single utility function. That is true of using classes of prior (see the material towards the end of Berger), or in sequential decision problems (look at prospect theory in psychology, where the overall strategy may have a single utility function, but local decisions along the way can’t be described with one). So the claims in the middle of the 20th century for naturalness of Bayesian reasoning haven’t held up well.

I consider these statements outmoded, and predict that one day, they will be widely seen to be false.

Of course, I do agree with part of one sentence: "Jaynes was a wonderful thinker". This is indisputable, as Jaynes realized the principle of maximum entropy goes far beyond physics and is in fact a general principle of reasoning. But not only did he take something from physics to change reasoning; he took something from reasoning to change physics! Since a probability is a degree of belief, it follows that entropy is also subjective.

Cox’s theorem is the underlying justification for Bayesian reasoning. In particular, if there are multiple ways to solve a problem, naturally we desire confluence: all roads ought to lead to the same solution. Non-Bayesians view this as unnatural!

As for decision theory, see Chapter 36 of Mackay for a one-equation summary. Decision theory builds on Bayesian reasoning; to justify the latter with the former is to put the cart before the horse.

Appealing to psychology is dubious. There exist many humans, known as sampling theorists or frequentists, who somehow reason without a sound mathematical foundation. Why then should a utility function be universally appropriate?

Less facetiously, Gigerenzer suggests that psychologists share the blame for frequentist nonsense:

the great majority fused the two antagonistic theories into a hybrid theory, of which neither Fisher nor Neyman and Pearson would have approved.

Props to Ross for mentioning Neyman and Pearson, as psychologists seem embarrassed about their influence:

the 1965 edition of Guilford’s best-selling Fundamental Statistics in Psychology and Education cites some 100 authors in its index, the names of Neyman and Pearson are left out.

Chapter 37 of Mackay demolishes frequentism/sampling theory with flair and proposes an ingenious compromise. Mackay first observes that "from a selection of statistical methods," sampling theorists pick "whichever has the 'best' long-run properties". Thus to sneak Bayesian reasoning past them, simply state you’re choosing the method with the 'best' long-run properties, while taking care to avoid the word "Bayesian". I propose we use the phrase "Mackay’s correction"; for example, "the chi-squared significance test with Mackay’s correction" might mollify reviewers suffering from frequentism.

Mackay’s favourite reading on this topic includes: Jaynes, 1983; Gull, 1988; Loredo, 1990; Berger, 1985; Jaynes, 2003. Mackay also mentions treatises on Bayesian statistics from the statistics community: Box and Tiao, 1973; O’Hagan, 1994.

The phrase "in the middle of the 20th century" calls to mind World War II, when Allied codebreakers applied Bayesian reasoning to break Germany’s Enigma cipher. Their methods "haven’t held up well"? Which side won?

One might cut Ross some slack by claiming ignorance: Turing’s papers were only declassified in 2012. But GCHQ allowed declassification only because by 2012, everyone (except frequentists!) was well-aware that Bayesian reasoning is a formidable tool for codebreakers.

If we want to talk about 20th-century claims that haven’t held up well, how about Fisher’s eugenicist views on race and "miscegenation"? Or Fisher’s insistent denials that smoking causes lung cancer? They seem to have fallen out of fashion, and at last, Fisher’s sampling theory is also en route to oblivion.


Ben Lynn blynn@cs.stanford.edu