Information Theory

What is information? How much information content is in, say, your birthday? If you viewed it as a sort of game, how many points would you award someone guessing your birthday right away? With apologies to those born on February 29th, a reasonable answer might be 365 points.

How about guessing just the month? We might say this is worth 12 points, but this is slightly unfair since some months are longer than others. Thus guessing December correctly, for example, ought to be worth a touch less: 365/31 points.

Suppose we know the birthday is in December. How many points should be awarded for guessing the day? Surely 31 points. So if someone guesses December correctly and then guesses the day correctly, they get 365/31 points and then 31 points. The product is 365 points, which is exactly the reward for guessing the birthday in one shot.

What if their first guess is wrong? Then if they get it right on the next go, it seems they should only get 364 points, because they already knew one guess to avoid. Thus we could say the value of knowing one unbirthday is 365/364, so that we get 365 when we multiply by 364.
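
The bookkeeping is multiplicative: the points for a partial guess times the points for finishing the job recover the one-shot score.

365 / 31 * 31    -- Month, then day.
365 / 364 * 364  -- Unbirthday, then birthday.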

For Your Information

If we continue this line of thought, our wish list for some kind of measure of information content might be:

  • The information content of a fact X determines the information content of the negation of X.

  • If there is more than one way to calculate the information content, then they must all lead to the same result.

  • The information content of the conjunction of A and B given background knowledge X is determined by the information content of A given X, and the information content of B given we know A and X.

…and so on, leading to Cox’s Theorem, from which we conclude:

information content = probability

However, information theorists take logarithms of probabilities before working with them. In his seminal paper A Mathematical Theory of Communication, Claude Shannon cites Hartley’s Transmission of Information and observes that, in the context of communication, logarithms dovetail better with:

  1. Practice: "time, bandwidth, number of relays, etc. tend to vary linearly with the logarithm".

  2. Intuition: "two punched cards should have twice the capacity of one".

  3. Mathematics: calculations often involve multiplying by the number of possibilities; following Napier, we use logarithms to simplify computations.

We make a couple more aesthetic choices. Firstly, the information content ought to go up when a fact is harder to guess. Thus we start with reciprocals of probabilities, or equivalently, discard the minus signs after taking logarithms of probabilities.

Secondly, as Shannon had computers in mind, he chose binary logarithms (base 2 logarithms), and called the unit of information a bit, a word coined by John Tukey.

The Shannon information content of an outcome \(x\) is:

\[ h(x) = \lg \frac{1}{P(x)} \]

where \(P(x)\) is the probability of \(x\) occurring.

Computing binary logarithms

One reason binary logarithms suit computers is that they can be calculated with an algorithm built around binary arithmetic:

lgBin :: Double -> Double
lgBin x
  | x == 1 = 0
  | x < 1  = lgBin (x*2) - 1  -- Double until we reach [1, 2), subtracting 1 each time.
  | x >= 2 = lgBin (x/2) + 1  -- Halve until we reach [1, 2), adding 1 each time.
  | otherwise = go 0 1 x      -- 1 < x < 2: compute the fractional bits.
  where
  go k m z
    | k > 64 = 0                         -- Enough bits; truncate the rest.
    | z >= 2 = m*(1 + go (k+1) 1 (z/2))  -- Record a 1 bit at weight m.
    | otherwise = go (k+1) (m/2) (z*z)   -- Square to decide the next bit.

lgBin 16
lgBin 3
lgBin 0.3

The first phase finds the whole number part of the logarithm, and amounts to reading off the floating-point exponent, or scanning for the position of the leading 1 bit. In the second phase, apart from one squaring operation per step, the multiplications and divisions are bit shifts.
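
If a standard Haskell environment with the Prelude’s logBase is handy, we can sanity-check lgBin against it:

lgBin 100 - logBase 2 100    -- Should be very nearly zero.
lgBin 0.001 - logBase 2 0.001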

However, this article avoids bit twiddling and instead uses a method based on the Taylor series expansion of \(\coth^{-1}\).
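
For \(|k| > 1\),

\[ \coth^{-1} k = \frac{1}{k} + \frac{1}{3k^3} + \frac{1}{5k^5} + \cdots = \sum_{n \ge 0} \frac{1}{(2n+1)k^{2n+1}} \]

and for positive \(x \neq 1\),

\[ \ln x = 2 \coth^{-1} \frac{x+1}{x-1} \]

The constant log2 below is \(\ln 2\), assembled from a few \(\coth^{-1}\) values by a Machin-style identity; dividing by it converts natural logarithms to binary ones.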

acoth k = sum $ take 16 $ zipWith (*) denoms $ iterate ((r*r)*) r
  where
  r = 1/k
  denoms = recip <$> iterate (2+) 1.0

log2 = 18*acoth 26 - 2*acoth 4801 + 8*acoth 8749

lg x
  | x == 1 = 0
  | x < 1  = lg (x*2) - 1
  | x >= 2 = lg (x/2) + 1
  | otherwise = 2 * acoth ((x + 1)/(x - 1)) / log2

lg 16
lg 3
lg 0.3

Thanks to the logarithm law \(\lg(1/p) = -\lg p\), we compute the Shannon information content with:

shannon p = -(lg p)

shannon (1/2)  -- Fair coin.
shannon (1/6)  -- Fair die.

See David MacKay, Information Theory, Inference, and Learning Algorithms for more motivating examples, such as the guessing game Battleship.

Here, take a sample

Let \(X\) be a random variable and \(\mathcal{A}_X\) the set of possible values of \(X\). We call \(\mathcal{A}_X\) the alphabet of \(X\). (This term can be a tad confusing when \(\mathcal{A}_X\) is something like the set of all 5-letter words!)

-- Association list of symbols and their probabilities.
type RandomVariable a = [(a, Double)]

alphabet :: RandomVariable a -> [a]
alphabet = map fst

fairCoin = [('H', 1/2), ('T', 1/2)]
fairDie = zip [1..6] $ repeat (1/6)

alphabet fairCoin
alphabet fairDie

We write a function that samples from a distribution, given a random number in [0, 1). We do the obvious thing and use the cumulative sums of the probabilities. Less obvious is why we bother sorting them first; this will turn out to be useful when we look at lossy compression.

sortOn f = \case
  [] -> []
  x:xt -> sortOn f (filter ((<= fx) . f) xt)
    ++ [x] ++ sortOn f (filter ((> fx) . f) xt) where fx = f x

cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv
sample rv r = fst $ head $ dropWhile ((< r) . snd) $ cumu rv
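
For example, with the fair coin the cumulative table splits [0, 1) into two halves, and each half maps to one of the faces:

cumu fairCoin
sample fairCoin 0.25
sample fairCoin 0.75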

We wrap JavaScript’s Math.random(), though there are a few yaks to shave: my compiler can only talk to JavaScript via strings, so I wrote a one-off parser.

parseFrac = uncurry (/) . foldl (\(a, b) d
  -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)
jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"
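
For instance, parseFrac turns the digits after the decimal point back into a fraction:

parseFrac "25"    -- 0.25
parseFrac "0625"  -- 0.0625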

We can now flip coins and roll dice. Let the games begin!

print =<< replicateM 5 (sample fairCoin <$> jsrandom)
print =<< replicateM 5 (sample fairDie <$> jsrandom)

I know that I know nothing

The entropy of \(X\) is the average information content:

\[ H(X) = \sum_x P(x) h(x) [x \in \mathcal{A}_X] \]

mean :: (Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f p | (_, p) <- rv]
variance :: (Double -> Double) -> RandomVariable a -> Double
variance f rv = sum [p*(f p - mu)^2 | (_, p) <- rv] where mu = mean f rv
entropy :: RandomVariable a -> Double
entropy rv = mean shannon rv

entropy fairCoin
entropy fairDie

Entropy can be interpreted as a measure of "uncertainty" or "ignorance": the higher it is, the more uniform the probability distribution and the harder it is to predict an outcome. For example, most of the time, we can correctly predict the result of flipping a heavily biased coin, while we must be wrong about half the time when predicting a fair coin. The latter distribution has higher entropy than the former.
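
The variance function defined above goes a step further: it measures how much the information content of individual outcomes spreads around the entropy. Every outcome of a fair coin carries exactly one bit, so the spread is zero, whereas a biased coin mixes a rare, highly informative outcome with a common, uninformative one:

variance shannon fairCoin
variance shannon [('H', 0.9), ('T', 0.1)]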

Shannon proves that \(H(X)\) is, up to a positive constant factor, the only function satisfying the following conditions:

  1. The function \(H\) is continuous in each of the probabilities \(p_i\).

  2. For uniform distributions, as the number \(n\) of possible outcomes increases, so does \(H\), because it is a measure of how much "choice" we have; that is, \(H(\frac{1}{n}, \ldots, \frac{1}{n})\) is monotonically increasing in \(n\).

  3. We can compute \(H\) recursively as a weighted sum. For example, imagine flipping a coin and, if it comes up tails, also rolling a 6-sided die to see whether it shows 5 or 6:

\[ H\left(\frac{1}{2}, \frac{1}{3}, \frac{1}{6}\right) = H\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2} H\left(\frac{2}{3}, \frac{1}{3}\right) \]

entropy [("heads", 1/2), ("tails1234", 1/3), ("tails56", 1/6)]
entropy [("heads", 1/2), ("tails", 1/2)]
entropy [("1234", 2/3), ("56", 1/3)] / 2

See also Chapter 11 of Jaynes, and John Carlos Baez, What is Entropy?

The principle of maximum entropy guides us when groping blindly in the dark. If there are \(n\) possible outcomes and we are completely ignorant, then we have no reason to prefer any outcome over any other. Accordingly, we assign each outcome the same probability \(1/n\), corresponding to maximum entropy. If, after some exploration, we become less ignorant and are aware of constraints on the probability distribution, then we should assign probabilities that simultaneously satisfy the constraints and maximize the entropy.

For example, suppose we have a normal-looking die with faces labeled 1 to 6. Initially we maximize entropy by guessing each face shows up with probability 1/6. Suppose we somehow learn the mean value of rolling our die is 4.5. This is greater than 3.5, the mean value of rolling a fair die, thus our die is unfair. Then how should we revise the probabilities?

We could, say, assign 0.7 probability to 6 and 0.3 to 1, and claim no other face ever shows up, but is this sensible? While an expected value of 4.5 implies higher numbers are more likely than lower numbers, it’s strange to rule out the faces from 2 to 5. Intuitively, we ought to instead modify the uniform distribution as gently as possible, nudging up the probabilities of larger faces showing and nudging down the smaller ones. Zealously zeroing some probabilities is too extreme.
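
Indeed, the lopsided assignment meets the constraint:

0.3 * 1 + 0.7 * 6  -- Mean is 4.5.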

It turns out we maximize entropy with a Boltzmann distribution:

boltzmann :: RandomVariable Int
boltzmann = zip [1..] $ (/sum bs) <$> bs where
  bs = take 6 $ iterate (b*) b  -- Probability of face i is proportional to b^i.
  -- Newton's method solves for b so that the mean comes out to 4.5.
  b = (!!32) $ newton (eval f) (eval $ derive f) 1
  -- Coefficients of p(x) = sum of (i - 4.5) x^(i-1) for i = 1..6; we need p(b) = 0.
  f = (+(-4.5)) <$> [1..6]
  newton f f' = iterate $ \x -> x - f x / f' x
  eval f x = sum $ zipWith (*) f $ iterate (x*) 1
  derive (_:t) = zipWith (*) t [1..]

boltzmann
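
As a sanity check, the mean of this distribution should come out very close to 4.5:

sum [fromIntegral a * p | (a, p) <- boltzmann]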

If our code is right, then the sum of 100 rolls should usually be close to 450, instead of 350, which would result from a fair die:

putStrLn "Boltzmann:"
replicateM 5 $ print =<< sum <$> replicateM 100 (sample boltzmann <$> jsrandom)
putStrLn "fair:"
replicateM 5 $ print =<< sum <$> replicateM 100 (sample fairDie <$> jsrandom)

We compute the entropies of:

  • the 1-or-6 die with expected value 4.5

  • the Boltzmann distribution die with expected value 4.5

  • a fair die

entropy [(1, 0.3), (6, 0.7)]
entropy boltzmann
entropy fairDie

Uncompressed data

Information theory is crucial for understanding data compression: lossy compression; symbol codes; stream codes. The entropy of a distribution pops up again and again in theoretical limits. But for now, we just say a few words about uncompressed data to establish a baseline.

The raw bit content of \(X\) is:

\[ H_0(X) = \lg |\mathcal{A}_X| \]

size = fromIntegral . length
rawBitContent rv = lg $ size rv

This measures the number of bits needed to record one symbol of \(X\) when every symbol is given a code of the same length.
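
For example, a fair die has a 6-symbol alphabet, so recording one roll takes \(\lg 6 \approx 2.58\) bits:

rawBitContent fairDie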

Exercise: Show \( 0 \le H(X) \le H_0(X) \). When do we have equality on either side?

The answers follow directly from the mathematics, but we can interpret them qualitatively. Consider a lopsided distribution where one symbol almost always appears while the others are rare. Then the average symbol will be the common one, which gives us almost no information. In the extreme case, the other symbols are impossible so we learn nothing from the only symbol that exists. In contrast, if the distribution is uniform, we have no idea what symbol to expect next (assuming at least two symbols).

rawBitContent [("H", undefined), ("T", undefined)]

entropy [("H", 0.5), ("T", 0.5)]
entropy [("H", 0.9), ("T", 0.1)]
entropy [("H", 0.999), ("T", 0.001)]
entropy [("H", 1)]  -- Our code doesn't handle probability 0 symbols.

In English, please

Let’s explore these ideas with the relative letter frequencies in English:

englishSrc = unlines
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]

We turn this into a probability distribution with a primitive parser:

parse :: String -> RandomVariable Char
parse s = second divTotal <$> tab where
  tab = [(c, fromIntegral $ readInteger n) | [[c], n] <- words <$> lines s]
  divTotal = (/ sum (snd <$> tab))

english = parse englishSrc
english

Without spaces, English text takes about 4.7 bits per character, though its information content is less than 4.18 bits per character.

rawBitContent english
entropy english

We emphasize that these numbers are derived solely from single-letter frequencies. In reality, English text has far less information content, as many dependencies exist between the constituent letters.

This is clear if we generate some 5-letter "words":

putStr . unlines =<< replicateM 4 (replicateM 5 $ sample english <$> jsrandom)

Even though the frequencies of individual letters are about right, these words almost certainly never occur in English; the probability of sampling such words should really be zero.

Evidence suggests humans send information at 39 bits per second when speaking. If we assume humans read English at 238 words per minute and the average English word length is 4.8 letters, then we read roughly 19 letters per second:

238 * 4.8 / 60

If the rate we read information is about the same as we speak it, then English transmits about 2 bits of information per character:

39 / (238 * 4.8 / 60)

When Shannon studied the entropy of English, he arrived at 2.62 bits per character, an estimate based on word frequencies.


Ben Lynn blynn@cs.stanford.edu 💡