Information Theory

What is information?

One example might be knowing your birthday. If you viewed it as a sort of game, how many points would you award someone guessing your birthday right away? With apologies to those born on February 29th, a reasonable answer might be 365 points.

How about guessing just the month? We might say this is worth 12 points, but this is slightly unfair since some months are longer than others. Thus guessing December correctly, for example, ought to be worth a touch less: 365/31 points.

Suppose we know the birthday is in December. How many points should be awarded for guessing the day? Surely 31 points. So if someone guesses December correctly and then guesses the day correctly, they get 365/31 points and then 31 points. The product is 365 points, which is exactly the reward for guessing the birthday in one shot.

What if their first guess is wrong? Then if they get it right on the next go, it seems they should only get 364 points, because they already knew one guess to avoid. Thus we could say the value of knowing one unbirthday is 365/364, so that we get 365 when we multiply by 364.

If we continue this line of thought, our wish list for some kind of measure of information content might be:

  • The information content of a fact X determines the information content of the negation of X.

  • If there is more than one way to calculate the information content, then they must all lead to the same result.

  • The information content of the conjunction of A and B given background knowledge X is determined by the information content of A given X, and the information content of B given we know A and X.

…and so on, leading to Cox’s Theorem, from which we conclude:

information content = probability

However, information theorists take logarithms of probabilities before working with them. In his seminal paper A Mathematical Theory of Communication, Claude Shannon cites Hartley (Transmission of Information) and observes that, in the context of communication, logarithms dovetail better with:

  1. Practice: "time, bandwidth, number of relays, etc. tend to vary linearly with the logarithm".

  2. Intuition: "two punched cards should have twice the capacity of one".

  3. Mathematics: calculations often involve multiplying by the number of possibilities; following Napier, we use logarithms to simplify computations.

We make a couple more aesthetic choices. Firstly, the information content ought to go up when a fact is harder to guess. Thus we start with reciprocals of probabilities, or equivalently, discard the minus signs after taking logarithms of probabilities.

Secondly, as Shannon had computers in mind, he chose base 2 logarithms, and called the unit of information a bit, a word coined by John Tukey.

The Shannon information content of an outcome \(x\) is:

\[ h(x) = \lg \frac{1}{P(x)} \]

{-# LANGUAGE CPP #-}
#ifdef __HASTE__
import Haste.DOM
import Haste.Events
import Control.Monad (void)
#endif
import Control.Monad (replicateM)
import Data.Maybe (catMaybes)
import Data.List (sortOn, genericLength)
import System.Random

lg = logBase 2
shannon p = -(lg p)

where \(P(x)\) is the probability of \(x\) occurring.

Let \(X\) be a random variable and \(\mathcal{A}_X\) the set of possible values of \(X\). We call \(\mathcal{A}_X\) the alphabet of \(X\), though in general this term may be a bit misleading as \(\mathcal{A}_X\) could be, for example, the set of all 5-letter strings.

The entropy of \(X\) is the average information content:

\[ H(X) = \sum P(x) h(x) [x \in \mathcal{A}_X] \]

-- Association list of symbols and their probabilities.
type RandomVariable a = [(a, Double)]
-- Expected value of f applied to the probability of each outcome.
mean :: (Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f p | (_, p) <- rv]
-- Variance of the same quantity.
variance :: (Double -> Double) -> RandomVariable a -> Double
variance f rv = sum [p*(f p - mu)^2 | (_, p) <- rv] where mu = mean f rv
entropy :: RandomVariable a -> Double
entropy rv = mean shannon rv
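
As a quick sanity check of these helpers (the coin biases and the name coinEntropyCheck are my own, not part of the demo):

-- A fair coin carries one bit of entropy; a heavily biased coin carries much less.
coinEntropyCheck :: (Double, Double)
coinEntropyCheck = ( entropy [('h', 0.5), ('t', 0.5)]   -- 1.0 bit
                   , entropy [('h', 0.9), ('t', 0.1)] ) -- about 0.469 bits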

See Chapter 11 of Jaynes, and also John Carlos Baez, What is Entropy?. Entropy can be interpreted as a measure of "uncertainty" or "ignorance": the higher it is, the more uniform the probability distribution and the harder it is to predict an outcome. For example, we can correctly predict the result of flipping a heavily biased coin most of the time, while we’re wrong about half the time when predicting a fair coin flip. The former has less entropy than the latter.

Entropy teaches us how best to grope blindly in the dark. If there are \(n\) possible outcomes, and we are completely ignorant, then we have no reason to prefer any one over any other, and we should assign each outcome the same probability \(1/n\). This corresponds to maximum entropy. More generally, if we are less ignorant and know something about their probability distribution, then we should assign probabilities so the entropy is maximized subject to the constraints we are aware of.
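
We can see the first claim concretely with the code above (the helper uniformOver is my own addition):

-- The uniform distribution over n symbols has entropy lg n,
-- the largest possible for an alphabet of that size.
uniformOver :: Int -> RandomVariable Int
uniformOver n = [(i, 1 / fromIntegral n) | i <- [1..n]]
-- entropy (uniformOver 6) ≈ 2.585 = lg 6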

For example, suppose we have a normal-looking die with faces labeled 1 to 6, but the mean value of rolling it is 4.5, which is higher than 3.5, the mean value of a fair die. This is consistent with many probability distributions, such as a die that shows 6 with probability 0.7 and 1 with probability 0.3.

Is this choice sensible? An expected value of 4.5 means we do want higher numbers to be more likely than lower numbers, but it’s strange to completely rule out the faces from 2 to 5 without any additional evidence. Intuitively, we ought to modify the uniform distribution as gently as possible, nudging up the probabilities of the larger faces and nudging down those of the smaller faces. Zealously zeroing most of them seems extreme.

It turns out maximizing entropy does what we want, yielding roughly the probabilities 0.05435, 0.07877, 0.11416, 0.16545, 0.23977, 0.34749 for 1 to 6 respectively.
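
The following sketch (my own addition, not part of the demo) reproduces these numbers: the maximum-entropy distribution under a mean constraint takes the form \(p_i \propto r^i\), and bisection on the mean pins down \(r\).

-- Maximum-entropy probabilities for a die whose mean roll is 4.5.
maxEntDie :: [Double]
maxEntDie = [r^i / z | i <- faces]
  where
  faces = [1..6] :: [Int]
  -- Mean roll when face i has probability proportional to t^i.
  mu t = sum [fromIntegral i * t^i | i <- faces] / sum [t^i | i <- faces]
  -- The mean is 3.5 at t = 1 and above 5 at t = 2, so bisect in between.
  bisect lo hi | hi - lo < 1e-12 = m
               | mu m < 4.5      = bisect m hi
               | otherwise       = bisect lo m
               where m = (lo + hi) / 2
  r = bisect 1 2
  z = sum [r^i | i <- faces]
-- maxEntDie recovers the probabilities quoted above.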

The raw bit content of \(X\) is:

\[ H_0(X) = \lg |\mathcal{A}_X| \]

size = genericLength
rawBitContent rv = lg $ size rv

This measures the room we need to record one symbol of \(X\).

Exercise: Show \( 0 \le H(X) \le H_0(X) \). When do we have equality on either side?

The answers follow directly from the mathematics, but we can interpret them qualitatively. Consider a lopsided distribution where one symbol almost always appears while the others are rare. Then the average symbol will be the common one, which gives us almost no information. In the extreme case, the other symbols are impossible so we learn nothing from the only symbol that exists. In contrast, if the distribution is uniform, we have no idea what symbol to expect next (assuming at least two symbols).
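
A quick numerical illustration with the helpers above (the two distributions are my own examples):

-- A uniform four-symbol alphabet attains the raw bit content;
-- a lopsided one falls far short of it.
uniform4, lopsided :: RandomVariable Char
uniform4 = [(c, 0.25) | c <- "abcd"]            -- entropy = 2.0 = rawBitContent
lopsided = zip "abcd" [0.97, 0.01, 0.01, 0.01]  -- entropy ≈ 0.24, rawBitContent = 2.0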

Lossy Compression

Ever filled out a form asking you to write one letter per box? This is perhaps the simplest lossy compression scheme. It works when long names are rare. In a fixed amount of space, the form accurately records everything except for some improbable cases.

What if we reject any improbable message, long or short? Then a small amount of space may hold a long message, provided it is not too improbable. For example, suppose a one-million-bit message is unlikely to contain more than a few 1s; perhaps it describes the inhabitants of a city of population one million who have won a lottery jackpot. Then by storing the positions of any 1s, perhaps five 20-bit numbers may suffice for describing almost any one-million-bit message.
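
Here is a toy sketch of that scheme (my own code, with made-up names encodeOnes and decodeOnes; it is not part of the demo):

-- Store only the positions of the 1 bits; to decode we must also know the length.
encodeOnes :: [Bool] -> [Int]
encodeOnes bs = [i | (i, True) <- zip [0..] bs]

decodeOnes :: Int -> [Int] -> [Bool]
decodeOnes len ones = [i `elem` ones | i <- [0 .. len - 1]]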

We generalize this idea. The smallest \(\delta\)-sufficient subset \(S_\delta\) is the smallest subset of \(\mathcal{A}_X\) such that:

\[ P(x \in S_\delta) \ge 1 - \delta \]

We can compute this by sorting the symbols in order of increasing probability, then discarding symbols for as long as the total probability discarded remains at most \(\delta\). In our lottery example, if we consider each possible one-million-bit string to be a symbol, then the only strings left standing are those containing a few 1s at most.

-- Sort symbols by increasing probability; pair each with the cumulative
-- probability of itself and all rarer symbols.
cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv
sufficient delta rv = map fst $ dropWhile ((<= delta) . snd) $ cumu rv

The essential bit content of \(X\) is:

\[ H_\delta(X) = \lg |S_\delta| \]

essential delta rv = lg $ size $ sufficient delta rv

That is, if we are willing to risk failure with probability \(\delta\), then we only need \(H_\delta(X)\) bits.
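
For example, with a toy distribution of my own (not from the text):

-- The two rarest symbols have total probability 0.2, so with delta = 0.25
-- they may both be discarded, halving the alphabet:
-- sufficient 0.25 [('a',0.5), ('b',0.3), ('c',0.15), ('d',0.05)]  ==  "ba"
-- essential  0.25 [('a',0.5), ('b',0.3), ('c',0.15), ('d',0.05)]  ==  1.0 bit, versus a raw bit content of 2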

Typicality

Let \(H = H(X)\) be the entropy of \(X\). Let \(X^N\) be the random variable whose outcome is \(N\) independent outcomes of \(X\). How much can we compress the messages of \(X^N\) with at most a probability \(\delta\) of failure?

By definition, the best answer is \(H_\delta(X^N)\), but reasoning about this quantity is difficult, so we sneak up on it by showing a message is most likely a member of a certain typical set, and studying the asymptotic limits of the size of this typical set.

Define the typical set \(T_{N\beta}\) by:

\[ T_{N \beta} = \left\{ x \in \mathcal{A}_{X^N} : \left| \frac{h(x)}{N} - H \right| < \beta \right\} \]

-- All length-n strings over the alphabet whose per-symbol information content
-- lies within beta of the entropy; exponential in n, so for tiny examples only.
typical rv n beta = [s | s <- replicateM n $ map fst rv,
  abs (info rv s/fromIntegral n - h) < beta]
  where h = entropy rv

We compute \(h(x)\) for an outcome \(x\) of \(X^N\) by summing the Shannon information contents of each of the \(N\) independent outcomes of \(X\):

pr rv sym = maybe (error "bad symbol") id $ lookup sym rv
info rv s = sum $ shannon . pr rv <$> s

We may think of the typical set as the set of all strings whose information content is close to the expected information content \(N H\).
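
For a small concrete example (the 3:1 coin and the parameters are my own choices): take a coin with \(P(h) = 0.75\) and \(P(t) = 0.25\), so \(H \approx 0.81\). With \(N = 4\) and \(\beta = 0.2\), only the strings containing exactly one t are typical:

-- typical [('h',0.75),('t',0.25)] 4 0.2  ==  ["hhht","hhth","hthh","thhh"]
-- Each such string has h(x)/4 = (3 * shannon 0.75 + shannon 0.25) / 4 = H exactly,
-- while "hhhh" is too predictable and strings with two or more t's are too surprising.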

It turns out the typical set is a Goldilocks set. There may be messages which are more likely than those in the typical set, but they are so few in number that they are seldom seen. There may be many messages less likely than those in the typical set, but they are so unlikely that they too are seldom seen.

We now see why it pays to design definitions to fit with standard probability theory. For an outcome \(x\) of \(X^N\), the weak law of large numbers gives:

\[ P(x \in T_{N\beta}) \ge 1 - \frac{\sigma^2 }{\beta^2 N} \]

where \(\sigma^2\) is the variance of the Shannon information content of \(X\). Indeed, \(h(x)/N\) is the average of \(N\) independent samples of the information content of \(X\), each with mean \(H\) and variance \(\sigma^2\), so Chebyshev’s inequality yields the bound. As claimed, for large \(N\) a string most likely lies in the typical set.

We undo logarithms to express the typical set’s defining inequality in terms of probabilities:

\[ 2^{-N(H+\beta)} < P(x) < 2^{-N(H-\beta)} \]

The sum of all probabilities must be 1, so the left inequality implies:

\[ |T_{N\beta}| 2^{-N(H+\beta)} < 1 \]

so:

\[ |T_{N\beta}| < 2^{N(H+\beta)} \]

Asymptotic equipartition principle: For sufficiently large \(N\), the outcome \(x\) of \(N\) i.i.d. random variables \(X^N = (X_1, \ldots, X_N)\) almost certainly belongs to a certain set of size \(2^{N H}\), each of whose members occurs with a probability "close" to \(2^{-N H}\).

More formally, given \(\epsilon > 0\) and a probability \(\delta > 0\), choose \(\beta = \epsilon\) and \(N_0\) so that \(P(T_{N\beta}) \ge 1 - \delta\) when \(N \ge N_0\), that is \(\frac{\sigma^2}{\epsilon^2 N_0} \le \delta\). Then:

\[ \lg |T_{N\beta}| < N(H + \epsilon) \]
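
For a sense of scale (the numbers here are my own, purely illustrative): if \(\sigma^2 = 1\), \(\epsilon = 0.1\) and \(\delta = 0.05\), then any \(N_0 \ge \sigma^2/(\epsilon^2 \delta) = 2000\) works.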

We place scare quotes around "close" because the probabilities of two elements in the typical set can differ by a factor that is exponential in our parameters.

Shannon’s source coding theorem

The bound for \(|T_{N\beta}|\) immediately yields an upper bound for the size of the best possible set for lossy compression:

\[ H_\delta(X^N) \le \lg |T_{N\beta}| < N(H + \epsilon) \]

How about the other side? Can \(S_\delta\), or any other set, be significantly smaller than the typical set but still compress just as well?

Let \(S\) be a set such that \(|S| \le 2^{N(H - 2\beta)}\). Then

\[\begin{align} P(x\in S) &= P(x\in S \cap T_{N\beta}) + P(x\in S\cap \overline{T_{N\beta}}) \\ &\le |S| \max \{ P(x) : x \in T_{N\beta} \} + P(x \notin T_{N\beta}) \\ &\le 2^{N(H-2\beta)} 2^{-N(H-\beta)} + \frac{\sigma^2}{\beta^2 N} \\ &= 2^{-N\beta} + \frac{\sigma^2}{\beta^2 N} \end{align}\]

Hence for any given \(\delta < 1\), once \(N\) is large enough the right-hand side drops below \(1 - \delta\), so \(P(x \in S) < 1 - \delta\) and \(S\) fails to compress as well as \(S_\delta\) or the typical set. Setting \(\epsilon = 2\beta\), this shows \(H_\delta(X^N) > N(H - \epsilon)\) for sufficiently large \(N\).

Summing up, we have Shannon’s source coding theorem: Given \(\epsilon > 0\) and \(0 < \delta < 1\) there exists a positive integer \(N_0\) such that for all \(N > N_0\):

\[ \left| \frac{H_\delta (X^N)}{N} - H \right| < \epsilon \]

In other words, with a negligible risk of information loss, we can compress a string of \(N\) symbols to about \(N H\) bits, and no better.

Examples

See David MacKay, Information Theory, Inference, and Learning Algorithms, which explains all the above in detail.

Our interactive demo above gives concrete numbers.

We expect symbols (single characters) and their frequencies, which we read with a primitive parser:

parse :: String -> RandomVariable Char
parse s = (\(c, f) -> (c, fromIntegral f / fromIntegral total)) <$> tab
  where
  tab = catMaybes $ charFreq . filter (/= ' ') <$> lines s
  total = sum $ snd <$> tab
  charFreq "" = Nothing
  charFreq (c:n) = Just (c, read n)
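-- For example (sample input of my own): parse "h 1\nt 3\n"
-- yields [('h',0.25),('t',0.75)].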
english = unlines
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]

Another example is the sum of two six-sided dice, where the single characters a, b, c stand for the sums 10, 11, 12:

twoDice = unlines
  [ "2 1"
  , "3 2"
  , "4 3"
  , "5 4"
  , "6 5"
  , "7 6"
  , "8 5"
  , "9 4"
  , "a 3"
  , "b 2"
  , "c 1"
  ]

We show the entropy and related statistics of the corresponding random variable. Next, we randomly generate a message \(x\) of length 100 according to the table, and compute how far its Shannon information content per symbol deviates from the entropy, that is, the criterion used to define the typical set. We call this quantity \(t\).

We also compute a bound that \(t\) exceeds with at most 20% probability in a message of length 100. Experimentally, the bound seems loose, which is perhaps unsurprising as the bound derives from the Weak Law of Large Numbers. Our main goal is to establish a theoretical limit for lossy compression, rather than produce a practical algorithm. Indeed, it’s unclear to me how to efficiently encode or decode a member of a typical set \(T\) so it takes about \(\lg |T|\) bits.

-- Sample n symbols by inverting the cumulative distribution of rv.
roll rv n g
  | n == 0 = []
  | otherwise = x:roll rv (n - 1) g'
  where
  (r, g') = random g
  x = fst $ head $ dropWhile ((< r) . snd) $ cumu rv

go g s = let
  h = entropy rv
  sigma2 = variance shannon rv
  x = roll rv 100 g
  hx100 = info rv x / 100
  rv = parse s in
    showStat "raw bit content" (rawBitContent rv)
  . showStat "H = entropy" h
  . showStat "variance" sigma2
  . showStat "x <- sample X^100" x
  . showStat "h(x) / 100" hx100
  . showStat "t = |h(x) / 100 - H|" (abs $ hx100 - h)
  . showStat "with at most 20% probability, t exceeds" (sqrt $ sigma2 / (0.2 * 100))
  $ ""

showStat s f = (s++) . (": "++) . shows f . ('\n':)

#ifdef __HASTE__
main :: IO ()
main = withElems ["in", "out", "go"] $ \[iEl, oEl, goB] -> do
  let
    setInp s = setProp iEl "value" s >> setProp oEl "value" ""
    handle buttonId s = do
      Just b <- elemById buttonId
      void $ b `onEvent` Click $ const $ setInp s
  setInp english
  handle "english" english
  handle "fair" "h 1\nt 1\n"
  handle "foul" "h 9\nt 1\n"
  handle "2d6" twoDice
  void $ goB `onEvent` Click $ const $ do
    g <- newStdGen
    setProp oEl "value" . go g =<< getProp iEl "value"
#endif

Without spaces, English text takes about 4.7 bits per character, that is, if we interpret an N-letter string as a big base-26 number and convert it to binary, we wind up with about 4.7N bits.

Going by just the letter frequencies, we find the entropy is around 4.2 bits, so given enough text, we can save about 0.5 bits per character (over 10%) with negligible probability of data loss.
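
Both figures can be reproduced with the code above:

-- rawBitContent (parse english)  =  lg 26  ≈  4.70
-- entropy (parse english)        ≈  4.2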


Ben Lynn blynn@cs.stanford.edu 💡