Lossy Compression

Ever filled out a form asking you to write one letter per box? This is perhaps the simplest lossy compression scheme. It works when long names are rare. In a fixed amount of space, the form accurately records everything except for some improbable cases.

What if we reject any improbable message, long or short? Then a small amount of space may hold a long message, provided it is not too improbable. For example, suppose a one-million-bit message is unlikely to contain more than a few 1s; perhaps it records which of a city's one million inhabitants have won a lottery jackpot. Then by storing the positions of the 1s, perhaps five 20-bit numbers suffice to describe almost any one-million-bit message.
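
A minimal sketch of such a scheme (the helper names and the five-winner cutoff are illustrative assumptions, not anything standard):

-- Encode a million-bit message as the positions of its 1s, rejecting the
-- improbable case of more than five 1s. Each position fits in 20 bits,
-- since 2^20 > 10^6.
encodePositions :: [Bool] -> Maybe [Int]
encodePositions bits
  | length ones <= 5 = Just ones   -- at most five 20-bit numbers
  | otherwise        = Nothing     -- improbable message: give up
  where ones = [i | (i, True) <- zip [0..] bits]

decodePositions :: Int -> [Int] -> [Bool]
decodePositions n ones = [i `elem` ones | i <- [0 .. n - 1]]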

We generalize this idea, closely following Chapter 4 of David MacKay's Information Theory, Inference, and Learning Algorithms.

The smallest \(\delta\)-sufficient subset \(S_\delta\) is the smallest subset of \(\mathcal{A}_X\) such that:

\[ P(x \in S_\delta) \ge 1 - \delta \]

We can compute this by sorting the symbols in order of increasing probability, then discarding symbols for as long as the total probability discarded stays at most \(\delta\).

In our lottery example, if we consider each possible one-million-bit string to be a symbol, then the only strings left standing are those containing a few 1s at most.

We reuse some code from the previous page:


acoth k = sum $ take 16 $ zipWith (*) denoms $ iterate ((r*r)*) r
  where
  r = 1/k
  denoms = recip <$> iterate (2+) 1.0

log2 = 18*acoth 26 - 2*acoth 4801 + 8*acoth 8749

lg x
  | x == 1 = 0
  | x < 1  = lg (x*2) - 1
  | x >= 2 = lg (x/2) + 1
  | otherwise = 2 * acoth ((x + 1)/(x - 1)) / log2

shannon p = -(lg p)

type RandomVariable a = [(a, Double)]

sortOn f = \case
  [] -> []
  x:xt -> sortOn f (filter ((<= fx) . f) xt)
    ++ [x] ++ sortOn f (filter ((> fx) . f) xt) where fx = f x

cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv
sample rv r = fst $ head $ dropWhile ((< r) . snd) $ cumu rv

size = fromIntegral . length
rawBitContent rv = lg $ size rv

englishSrc = unlines
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]

parse :: String -> RandomVariable Char
parse s = second divTotal <$> tab where
  tab = [(c, fromIntegral $ readInteger s) | [[c], s] <- words <$> lines s]
  divTotal = (/ sum (snd <$> tab))

english = parse englishSrc

mean :: (Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f p | (_, p) <- rv]
variance :: (Double -> Double) -> RandomVariable a -> Double
variance f rv = sum [p*(f p - mu)^2 | (_, p) <- rv] where mu = mean f rv
entropy :: RandomVariable a -> Double
entropy rv = mean shannon rv

parseFrac = uncurry (/) . foldl (\(a, b) d
  -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)
jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"

jsEval "hideshow('reuse');"

Then we can compute the sufficient subset with brute force:

sufficient delta rv = map fst $ dropWhile ((<= delta) . snd) $ cumu rv

The essential bit content of \(X\) up to error \(\delta\) is the raw bit content of this subset:

\[ H_\delta(X) = \lg |S_\delta| \]

essential delta rv = lg $ size $ sufficient delta rv

That is, if failure with probability \(\delta\) is acceptable, then we only need \(H_\delta(X)\) bits.

We perform a sanity check on the english distribution. The letter "z" is last and least, appearing under 0.1% of the time, so discarding it alone loses under 0.1% of the probability; discarding "q" as well would lose more. In other words, the first 25 letters are a 0.001-sufficient subset of the alphabet:

sufficient 0.001 english
length $ sufficient 0.001 english
essential 0.001 english
lg 25

Let \(X^N\) be the random variable whose outcome is \(N\) independent outcomes of \(X\). We now have enough definitions to talk about the best possible lossy compression method for repeated independent samples:

How many bits do we need to represent a sample of \(X^N\) if we’re allowed to fail with up to probability \(\delta\)?

By definition, the best answer is \(H_\delta(X^N)\), which we can compute with brute force. The following computes the distribution of \(X^3\) where \(X\) is based on the relative letter frequencies of English.

brute n rv = foldr1 (liftA2 go) $ replicate n (first (:[]) <$> rv)
  where go (x, p) (y, q) = (x <> y, p*q)

english3 = brute 3 english

The ten least likely values of \(X^3\) are:

rare10 = take 10 $ sortOn snd $ english3
rare10

These are the first to be dropped when computing a sufficient subset. The tiny probabilities look better when ported to information theory, thanks to logarithms:

second shannon <$> rare10

And logarithms mean we can add instead of multiply:

pr rv sym = maybe (error "bad symbol") id $ lookup sym rv
info rv s = sum $ shannon . pr rv <$> s

info english "zzz"
info english "zzq"

How many bits per letter do we need for a 3-letter message with a 0.1% chance of failure?

essential 0.001 english3 / 3
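
For comparison, the cost of a single-letter message with the same failure tolerance, and the failure probability that results from sending three such messages independently:

essential 0.001 english
1 - (1 - 0.001)^3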

This is smaller than the bits needed for a single-letter message with the same probability of failure (a probability that compounds if we send 3 such messages). Does this mean that as \(N\) increases, a growing proportion of messages becomes improbable enough to discard? If so, how many bits do we save by encoding large chunks of a message at a time?

Typicality

These questions are hard to answer exactly because of the exponential complexity of brute force algorithms, so we sneak up on approximate solutions with information theory. Define the typical set \(T_{N\beta}\) by:

\[ T_{N \beta} = \left\{ x \in \mathcal{A}_{X^N} : \left| \frac{h(x)}{N} - H \right| < \beta \right\} \]

where \(H = H(X)\) is the entropy of \(X\). The idea is to construct the set of all strings whose information content is close to the expected information content \(N H\).

Typical sets are easier to reason with than sufficient sets because the laws of probability let us decompose thorny problems into more tractable subproblems.

We enumerate the members of a typical set for small \(N\) and various values of \(\beta\) to get our bearings:

typical rv n beta = [(s, h) | s <- replicateM n $ map fst rv,
  let h = info rv s,
  abs (h/fromIntegral n - entropy rv) < beta]

entropy english
typical english 1 1

2*entropy english
typical english 2 0.05

3*entropy english
typical english 3 0.001

As the term suggests, the members of a typical set are neither common nor rare: they all have a middling probability of being sampled.
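
For instance, "eee" is the single most likely three-letter string, yet its information content sits well below \(3H\), so a small \(\beta\) excludes it from the typical set:

info english "eee"
3 * entropy english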

The typical set turns out to be a Goldilocks set of sorts. There may be messages more likely than those in the typical set, but they are so few in number that they are seldom seen. There may be many messages less likely than those in the typical set, but they are so unlikely that they too are seldom seen. This is because the weak law of large numbers implies that for a random sample \(x\) of \(X^N\):

\[ P(x \in T_{N\beta}) \ge 1 - \frac{\sigma^2 }{\beta^2 N} \]

where \(\sigma^2\) is the variance of the Shannon information content of \(X\).
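
This is Chebyshev's inequality applied to the sample mean of the information contents: since \(h(x) = h(x_1) + \cdots + h(x_N)\) for independent samples, \(h(x)/N\) has mean \(H\) and variance \(\sigma^2/N\), so Chebyshev's inequality gives:

\[ P\left( \left| \frac{h(x)}{N} - H \right| \ge \beta \right) \le \frac{\sigma^2}{\beta^2 N} \]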

We now bound the size of a typical set. We undo logarithms to express the typical set’s defining inequality \(|h(x)/N - H| < \beta\) in terms of probabilities:

\[ 2^{-N(H+\beta)} < P(x) < 2^{-N(H-\beta)} \]

The sum of all probabilities must be 1, so the left inequality implies:

\[ |T_{N\beta}| 2^{-N(H+\beta)} < 1 \]

so:

\[ |T_{N\beta}| < 2^{N(H+\beta)} \]

We rephrase as follows.

Asymptotic equipartition principle: For sufficiently large \(N\), the outcome \(x\) of \(N\) i.i.d. random variables \(X^N = (X_1, \ldots, X_N)\) almost certainly belongs to a certain set of size \(2^{N H}\), each of whose members occurs with probability "close" to \(2^{-N H}\).

More formally, given \(\epsilon > 0\) and a probability \(\delta > 0\), choose \(\beta = \epsilon\) and \(N_0\) so that \(\frac{\sigma^2}{\epsilon^2 N_0} \le \delta\). Then for \(N \ge N_0,\)

\[P(x \in T_{N\beta}) \ge 1 - \delta\]

and:

\[ \lg |T_{N\beta}| < N(H + \epsilon) \]
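
For a sense of scale, the \(N_0\) this argument demands for, say, \(\epsilon = 0.1\) and \(\delta = 0.01\) with our english distribution is \(\sigma^2 / (\epsilon^2 \delta)\):

variance shannon english / (0.1^2 * 0.01)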

We place scare quotes around "close" because the probabilities of two elements in the typical set can differ by a factor that is exponential in our parameters.
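
Concretely, the bounds on \(P(x)\) above allow two members of \(T_{N\beta}\) to have probabilities that differ by a factor of up to:

\[ \frac{2^{-N(H-\beta)}}{2^{-N(H+\beta)}} = 2^{2N\beta} \]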

Let’s compute some concrete bounds:

sqrt . (variance shannon english/) . (50*) <$> [0.1, 0.01, 0.001]

In other words, for a random sample \(x \overset{R}{\leftarrow} X^{50}\), where \(X\) is our english distribution:

\[ \begin{eqnarray} P(t < 0.42\ldots) &\ge& 0.9 \\ P(t < 1.35\ldots) &\ge& 0.99 \\ P(t < 4.28\ldots) &\ge& 0.999 \end{eqnarray} \]

where \(t = | h(x) / 50 - H |\). Recall \(H\) is the entropy of \(X\), which in this example is:

entropy english

Generating a few samples suggests these bounds are rather loose. I’ve rerun the code many times and have yet to see \(t\) exceed 1. The weak law of large numbers is indeed weak.

typicalSample n rv = do
  x <- replicateM n $ sample rv <$> jsrandom
  let
    hx = info rv x
    h = entropy rv
  putStrLn $ x ++ " " ++ show (abs (hx/fromIntegral n - h))

replicateM 12 $ typicalSample 50 english

Shannon’s source coding theorem

For \(N \ge N_0\), the typical set is itself \(\delta\)-sufficient, so the bound on \(|T_{N\beta}|\) immediately yields an upper bound for the size of the best possible set for lossy compression:

\[ H_\delta(X^N) \le \lg |T_{N\beta}| < N(H + \epsilon) \]

How about the other side? Can \(S_\delta\) be significantly smaller than the typical set?

Let \(S\) be a set such that \(|S| \le 2^{N(H - 2\beta)}\). Then

\[\begin{align} P(x\in S) &= P(x\in S \cap T_{N\beta}) + P(x\in S\cap \overline{T_{N\beta}}) \\ &\le |S| \max \{ P(x) : x \in T_{N\beta} \} + P(x \notin T_{N\beta}) \\ &\le 2^{N(H-2\beta)} 2^{-N(H-\beta)} + \frac{\sigma^2}{\beta^2 N} \\ &= 2^{-N\beta} + \frac{\sigma^2}{\beta^2 N} \end{align}\]

Hence for any given \(0 < \delta < 1\), we can choose \(N_0\) so that \(P(x \in S) < 1 - \delta\) for all \(N \ge N_0\): the probability of seeing an element of \(S\) is too small for \(S\) to be \(\delta\)-sufficient, so in particular \(S \ne S_\delta\) and \(H_\delta(X^N) > N(H - 2\beta)\). Taking \(\beta = \epsilon/2\) gives the matching lower bound \(H_\delta(X^N) > N(H - \epsilon)\). Therefore:

Shannon’s source coding theorem: Given \(\epsilon > 0\) and \(0 < \delta < 1\) there exists a positive integer \(N_0\) such that for all \(N > N_0\):

\[ \left| \frac{H_\delta (X^N)}{N} - H \right| < \epsilon \]

In other words, with a negligible risk of information loss, we can compress \(N\) independent samples to about \(N H\) bits, and no better.

For example, for a sufficiently long string from our english distribution, compression saves over 10%:

entropy english / rawBitContent english

Unfortunately, there is no known practical algorithm that attains Shannon’s bound. It’s infeasible in general to rank all strings of length \(N\) by their probability of occurrence. All the same, Shannon’s theorem is valuable because it establishes limits on what we can do, and serves as a measure of the effectiveness of a given lossy compression method.


Ben Lynn blynn@cs.stanford.edu 💡