acoth k = sum $ take 16 $ zipWith (*) denoms $ iterate ((r*r)*) r where
  r = 1/k
  denoms = recip <$> iterate (2+) 1.0
log2 = 18*acoth 26 - 2*acoth 4801 + 8*acoth 8749
lg x
  | x == 1 = 0
  | x < 1 = lg (x*2) - 1
  | x >= 2 = lg (x/2) + 1
  | otherwise = 2 * acoth ((x + 1)/(x - 1)) / log2
shannon p = -(lg p)
type RandomVariable a = [(a, Double)]
sortOn f = \case
  [] -> []
  x:xt -> sortOn f (filter ((<= fx) . f) xt) ++ [x] ++ sortOn f (filter ((> fx) . f) xt)
    where fx = f x
cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv
sample rv r = fst $ head $ dropWhile ((< r) . snd) $ cumu rv
size = fromIntegral . length
rawBitContent rv = lg $ size rv
englishSrc = unlines
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]
parse :: String -> RandomVariable Char
parse s = second divTotal <$> tab where
  tab = [(c, fromIntegral $ readInteger s) | [[c], s] <- words <$> lines s]
  divTotal = (/ sum (snd <$> tab))
english = parse englishSrc
mean :: (Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f p | (_, p) <- rv]
variance :: (Double -> Double) -> RandomVariable a -> Double
variance f rv = sum [p*(f p - mu)^2 | (_, p) <- rv] where mu = mean f rv
entropy :: RandomVariable a -> Double
entropy rv = mean shannon rv
parseFrac = uncurry (/) . foldl (\(a, b) d -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)
jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"
Lossy Compression
Ever filled out a form asking you to write one letter per box? This is perhaps the simplest lossy compression scheme. It works when long names are rare. In a fixed amount of space, the form accurately records everything except for some improbable cases.
What if we reject any improbable message, long or short? Then a small amount of space may hold a long message, provided it is not too improbable. For example, suppose a one-million-bit message is unlikely to contain more than a few 1s; perhaps it describes the inhabitants of a city of population one million who have won a lottery jackpot. Then, by storing the positions of the 1s, perhaps five 20-bit numbers suffice to describe almost any one-million-bit message.
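To make this concrete, here is a small sketch in the style of the code blocks below; the names compress and restore, and the scaled-down population of 32, are illustrative choices rather than anything prescribed by the scheme.

-- Record only the indices of the 1s; reject (Nothing) any message with more
-- 1s than the budget allows.
compress budget msg = if length ones <= budget then Just ones else Nothing
  where ones = [i | (i, '1') <- zip [0..] msg]
restore n ones = [if i `elem` ones then '1' else '0' | i <- [0..n-1]]
-- A city of 32 inhabitants with winners at positions 3, 17, and 30: each
-- position fits in 5 bits, so three winners cost 15 bits instead of 32.
compress 5 (restore 32 [3, 17, 30])

Scaling up, each position in a one-million-bit message fits in 20 bits, so five winners cost about 100 bits.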
We generalize this idea, closely following Chapter 4 of David MacKay, Information Theory, Inference, and Learning Algorithms.
The smallest \(\delta\)-sufficient subset \(S_\delta\) is the smallest subset of \(\mathcal{A}_X\) such that:
\[ P(x \in S_\delta) \ge 1 - \delta \]
We can compute this by sorting the symbols in order of increasing probability, then discarding the least likely symbols for as long as the total discarded probability stays at most \(\delta\).
In our lottery example, if we consider each possible one-million-bit string to be a symbol, then the only strings left standing are those containing a few 1s at most.
We reuse some code from the previous page (reproduced in the block at the top of this page). Then we can compute the sufficient subset with brute force:
sufficient delta rv = map fst $ dropWhile ((<= delta) . snd) $ cumu rv
The essential bit content of \(X\) up to error \(\delta\) is the raw bit content of this subset:
\[ H_\delta(X) = \lg |S_\delta| \]
essential delta rv = lg $ size $ sufficient delta rv
That is, if failure with probability \(\delta\) is acceptable, then we only need \(H_\delta(X)\) bits.
We perform a sanity check on the english distribution. The letter "z" is last and least, appearing under 0.1% of the time. Discarding it alone costs less than 0.001 of the probability mass, but discarding "q" as well would cost more than that. In other words, the other 25 letters form a 0.001-sufficient subset of the alphabet:
sufficient 0.001 english
length $ sufficient 0.001 english
essential 0.001 english
lg 25
Let \(X^N\) be the random variable whose outcome is \(N\) independent outcomes of \(X\). We now have enough definitions to talk about the best possible lossy compression method for repeated independent samples:
How many bits do we need to represent a sample of \(X^N\) if we’re allowed to fail with up to probability \(\delta\)?
By definition, the best answer is \(H_\delta(X^N)\), which we can compute with brute force. The following computes the distribution of \(X^3\) where \(X\) is based on the relative letter frequencies of English.
brute n rv = foldr1 (liftA2 go) $ replicate n (first (:[]) <$> rv) where
  go (x, p) (y, q) = (x <> y, p*q)
english3 = brute 3 english
The ten least likely values of \(X^3\) are:
rare10 = take 10 $ sortOn snd $ english3
rare10
These are the first to be dropped when computing a sufficient subset. The tiny probabilities look better when ported to information theory, thanks to logarithms:
second shannon <$> rare10
And logarithms mean we can add instead of multiply:
pr rv sym = maybe (error "bad symbol") id $ lookup sym rv
info rv s = sum $ shannon . pr rv <$> s
info english "zzz"
info english "zzq"
How many bits per letter do we need for a 3-letter message with a 0.1% chance of failure?
essential 0.001 english3 / 3
This is smaller than the bits needed for a single-letter message with the same probability of failure (which compounds if we send 3 such messages). Does this mean that as \(N\) increases, so does the proportion of low-probability messages? If so, how many bits do we save by encoding large chunks of a message at a time?
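As a quick empirical peek at the first question (an illustrative check; brute force limits us to tiny \(N\)), the essential bits per letter for \(N = 1, 2, 3\) are:

[essential 0.001 (brute n english) / fromIntegral n | n <- [1..3]]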
Typicality
These questions are hard to answer exactly because of the exponential complexity of brute force algorithms, so we sneak up on approximate solutions with information theory. Define the typical set \(T_{N\beta}\) by:
\[ T_{N \beta} = \left\{ x \in \mathcal{A}_{X^N} : \left| \frac{h(x)}{N} - H \right| < \beta \right\} \]
where \(H = H(X)\) is the entropy of \(X\). The idea is to construct the set of all strings whose information content is close to the expected information content \(N H\).
Typical sets are easier to reason with than sufficient sets because the laws of probability let us decompose thorny problems into more tractable subproblems.
We enumerate the members of a typical set for small \(N\) and various values of \(\beta\) to get our bearings:
typical rv n beta = [(s, h) | s <- replicateM n $ map fst rv,
  let h = info rv s, abs (h/fromIntegral n - entropy rv) < beta]
entropy english
typical english 1 1
2*entropy english
typical english 2 0.05
3*entropy english
typical english 3 0.001
As the term suggests, the members of a typical set are neither common nor rare: they all have a middling probability of being sampled.
The typical set turns out to be a Goldilocks set of sorts. There may be messages more likely than those in the typical set, but they are so few in number that they are seldom seen. There may be many messages less likely than those in the typical set, but they are so unlikely that they too are seldom seen. This is because the weak law of large numbers implies that for a random sample \(x\) from \(X^N\):
\[ P(x \in T_{N\beta}) \ge 1 - \frac{\sigma^2 }{\beta^2 N} \]
where \(\sigma^2\) is the variance of the Shannon information content of \(X\).
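To spell out the step: since \(P(x) = \prod_i P(x_i)\), the information content \(h(x) = \sum_i h(x_i)\) is a sum of \(N\) i.i.d. terms, so
\[ E\left[\frac{h(x)}{N}\right] = H, \qquad \mathrm{Var}\left[\frac{h(x)}{N}\right] = \frac{\sigma^2}{N}, \]
and Chebyshev's inequality bounds the probability that this sample mean strays at least \(\beta\) from \(H\) by \(\frac{\sigma^2}{\beta^2 N}\), giving the inequality above.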
We now bound the size of a typical set. Since \(h(x) = -\lg P(x)\), we can undo the logarithms and express the typical set's defining inequality \(|h(x)/N - H| < \beta\) in terms of probabilities:
\[ 2^{-N(H+\beta)} < P(x) < 2^{-N(H-\beta)} \]
The probabilities of the members of \(T_{N\beta}\) sum to at most 1, so the left inequality implies:
\[ |T_{N\beta}| 2^{-N(H+\beta)} < 1 \]
so:
\[ |T_{N\beta}| < 2^{N(H+\beta)} \]
We rephrase as follows.
Asymptotic equipartition principle: For sufficiently large \(N\), the outcome \(x\) of \(N\) i.i.d. random variables \(X^N = (X_1, \ldots, X_N)\) almost certainly belongs to a certain set of size \(2^{N H}\), each of whose members occurs with a probability "close" to \(2^{-N H}\).
More formally, given \(\epsilon > 0\) and a probability \(\delta > 0\), choose \(\beta = \epsilon\) and \(N_0\) so that \(\frac{\sigma^2}{\epsilon^2 N_0} \le \delta\). Then for \(N \ge N_0,\)
\[P(x \in T_{N\beta}) \ge 1 - \delta\]
and:
\[ \lg |T_{N\beta}| < N(H + \epsilon) \]
We place scare quotes around "close" because the probabilities of two elements in the typical set can differ by a factor that is exponential in our parameters.
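For a sense of scale, a back-of-the-envelope aside: with \(N = 50\) and our english distribution, the following computes how many bits separate \(\lg 26^{50}\), the raw bit content of all 50-letter strings, from \(50 H\), the log-size of the set singled out by the principle:

50 * (rawBitContent english - entropy english)

That set is therefore a tiny fraction of all 50-letter strings, yet a random 50-letter sample almost certainly lands in it.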
Let’s compute some concrete bounds:
sqrt . (variance shannon english/) . (50*) <$> [0.1, 0.01, 0.001]
In other words, for a random sample \(x \overset{R}{\leftarrow} X^{50}\), where
\(X\) is our english
distribution:
\[ \begin{eqnarray} P(t < 0.42…) &\ge& 0.9 \\ P(t < 1.35…) &\ge& 0.99 \\ P(t < 4.28…) &\ge& 0.999 \end{eqnarray} \]
where \(t = | h(x) / 50 - H |\). Recall \(H\) is the entropy of \(X\), which in this example is:
entropy english
Generating a few samples suggests these bounds are rather loose. I’ve rerun the code many times and have yet to see \(t\) exceed 1. The weak law of large numbers is indeed weak.
typicalSample n rv = do
  x <- replicateM n $ sample rv <$> jsrandom
  let hx = info rv x
      h = entropy rv
  putStrLn $ x ++ " " ++ show (abs (hx/fromIntegral n - h))
replicateM 12 $ typicalSample 50 english
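To put a number on how loose these bounds are, we can estimate the probability that \(t\) stays below 0.42 with a quick Monte Carlo; the helper inTypical and the trial count of 100 are ad hoc choices, and the first bound above only promises roughly 0.9:

inTypical n beta rv = do
  x <- replicateM n $ sample rv <$> jsrandom
  return $ abs (info rv x / fromIntegral n - entropy rv) < beta
-- Estimate the fraction of 100 samples of X^50 that land within 0.42 of H.
(/ 100) . size . filter id <$> replicateM 100 (inTypical 50 english 0.42)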
Shannon’s source coding theorem
The bound for \(|T_{N\beta}|\) immediately yields an upper bound for the size of the best possible set for lossy compression:
\[ H_\delta(X^N) \le \lg |T_{N\beta}| < N(H + \epsilon) \]
How about the other side? Can \(S_\delta\) be significantly smaller than the typical set?
Let \(S\) be a set such that \(|S| \le 2^{N(H - 2\beta)}\). Then
\[ P(x \in S) \le P(x \in S \cap T_{N\beta}) + P(x \notin T_{N\beta}) \le 2^{N(H - 2\beta)} 2^{-N(H - \beta)} + \frac{\sigma^2}{\beta^2 N} = 2^{-N\beta} + \frac{\sigma^2}{\beta^2 N} \]
which tends to 0 as \(N\) grows. Hence, given \(\epsilon\) and \(\delta\), we can take \(\beta = \epsilon/2\) and choose \(N_0\) so that for all \(N \ge N_0\), \(P(x \in S) < 1 - \delta\); that is, the probability of seeing an element of \(S\) is so small that we must have \(S \ne S_\delta\). Since this holds for every set of size at most \(2^{N(H - \epsilon)}\), it follows that \(H_\delta(X^N) > N(H - \epsilon)\). Therefore:
Shannon’s source coding theorem: Given \(\epsilon > 0\) and \(0 < \delta < 1\) there exists a positive integer \(N_0\) such that for all \(N > N_0\):
\[ \left| \frac{H_\delta (X^N)}{N} - H \right| < \epsilon \]
In other words, with a negligible risk of information loss, we can compress \(N\) independent samples to about \(N H\) bits, and no better.
For example, for a sufficiently long string from our english
distribution,
compression saves over 10%:
entropy english / rawBitContent english
Unfortunately, there is no known practical algorithm that attains Shannon’s bound. It’s infeasible in general to rank all strings of length \(N\) by their probability of occurrence. All the same, Shannon’s theorem is valuable because it establishes limits on what we can do, and serves as a measure of the effectiveness of a given lossy compression method.