Three Theorems for Reducing my Lambda

In 2017, David Turner gave a talk: "Some History of Functional Programming Languages". One slide listed "three really important theorems about lambda calculus":

Write \(A \rightarrow B\) to mean we choose some redex in \(A\) and reduce it to get \(B\). Let \(\twoheadrightarrow\) be the reflexive transitive closure of \(\rightarrow\).

Church-Rosser theorem: the relation \(\twoheadrightarrow\) satisfies the diamond property:

If \(A \twoheadrightarrow B, A \twoheadrightarrow B'\) then \(B \twoheadrightarrow C, B' \twoheadrightarrow C\) for some \(C\).
Corollary: normal forms are unique.
Normalization theorem: repeatedly reducing the leftmost redex is normalizing.
Böhm’s theorem: if \(A, B\) have distinct \(\beta\eta\)-normal forms then we can construct a term \(X\) such that \(XA \twoheadrightarrow K, XB \twoheadrightarrow KI\).

Why are these theorems important?

Church-Rosser Theorem: Programmers liberally employ equational reasoning based on beta-conversion, but why does it work? To run a program is to perform a series of beta-reductions in a specific order, yet we beta-reduce and beta-expand arbitrary subexpressions at will. An equation is a two-way street, while execution is a one-way journey. How do we get away with blithely conflating \(=\) with \(\twoheadrightarrow\)?

The Church-Rosser theorem implies all roads lead to Rome, that is, starting from a normalizing term, no matter where our beta-conversions take us, there always remains a path to the one true normal form.

Normalization theorem: My view is that normalizing evaluation strategies are worth their practical drawbacks, so this theorem is foundational. Others advocate alternative evaluation strategies, though I expect even though they may not rely the theorem in practice, they still appreciate its existence.

Böhm’s Theorem: Lucky programmers also rely on equational reasoning based on eta-conversion, for example, replacing (\x -> length x) with length. I’m grateful Haskell supports this, as for years I had no way to exploit this equivalence.

A question naturally arises. Are there other amazing tricks? Is there more to equational reasoning than beta and eta?

No! Böhm’s theorem tells us that introducing even a single equivalence beyond those based on beta- and eta-conversion brings the entire theory crashing down, for it would imply all lambda terms are equal.

Proofs

See Hendrik Barendregt, 1984, The Lambda Calculus: Its Syntax and Semantics.

We might think it’s easy to prove these fundamental theorems, as they justify what programmers intuitively know in their bones, and they assume nothing but the basic definitions of lambda calculus. Alas, each of these theorems have devils in their details.

We only give rough sketches, as I felt the various cases in structural induction add more clutter to already hairy proofs. Also, I harbour dreams to formally verify parts of these proofs, and in particular these inductions; if this ever happens, I’ll supply the missing pieces.

[I felt compelled to post this article now (December 2023) rather than leave it on a back burner because David Turner passed away recently. This page is a kind of tribute to him.]

From now on, we implictly rename any free variables so they are distinct from bound variables in any terms involved in the same setting, which makes our life slightly easier by avoiding variable capture due to unfortunate name choices. (I imagine if I manage to figure out theorem provers I’d use De Bruijn indices instead.)

Some of the proofs rely on the substitution lemma:

\[ M[x:=N][y:=L] = M[y:=L][x:=N[y:=L]] \]

where \(x \ne y\) and \(x\) is not a free variable of \(L\). This can be shown via structural induction.

Church-Rosser Theorem

The main difficulty is that reducing a redex can create new redexes. This is desirable because we want lambda calculus to be Turing-complete, but it also means taking different reduction paths may result in wildly distinct terms. How do we get a good grip on something so slippery?

We might start with baby steps: let’s study \(\rightarrow\) instead of \(\twoheadrightarrow\). Consider the term:

\[ Ix(Kab) \]

where \(I = \lambda x . x\) and \(K = \lambda x y . x\).

We get the same result \(xa\) whether we first reduce \(Ix \rightarrow x\) and then \(Kab \rightarrow a\), or the other way around.

This inspires us to see if the reflexive closure of \(\rightarrow\) (namely, reducing at most once) has the diamond property. If so, we could prove the Church-Rosser theorem. In general, if a relation has the diamond property, then by induction so does its transitive closure: with enough patches, you can stitch together a quilt of any size.

So let \(M\) be a lambda term containing redexes \(\Delta_1, \Delta_2\). Let \(M_1\) be the result of reducing \(\Delta_1\) and let \(M_2\) be the result of reducing \(\Delta_2\). Can we always find a lambda term \(N\) that can be reached from both \(M_1\) and \(M_2\) by reducing at most one redex?

When \(\Delta_1\) and \(\Delta_2\) are disjoint like above, we can easily find \(N\). When \(\Delta_1 = \Delta_2\), the problem degenerates and the solution is trivial.

However, if \(\Delta_1\) and \(\Delta_2\) are distinct, but overlap, we run into trouble. In the following example, reducing \(\Delta_1\) results in two copies of \(\Delta_2\):

\[ M = (\lambda z . f z (g z)) (I x) \]

Let \(\Delta_1 = M, \Delta_2 = I x\). (The former strictly contains the latter.)

Reducing the outer redex \(\Delta_1\) yields:

\[ M_1 = f (I x) (g (I x)) \]

While reducing the inner redex \(\Delta_2\) yields:

\[ M_2 = (\lambda z . f z (g z)) x \]

No lambda term is at most one beta-reduction away from both \(M_1\) and \(M_2.\)

Perhaps we can console ourselves by proving something weaker. In our example, we can reach:

\[ N = f x (g x) \]

from both \(M_1\) and \(M_2\) after multiple beta-reductions.

So perhaps we can show the relation \(\rightarrow\) satisfies the weak diamond property:

A rather mechanical induction shows this to be true for the non-overlapping cases and the above overlapping case, as well as the one remaining case, where one redex lies in the function of the other, rather than the argument, such as:

\[ M = (\lambda z . f (I z) (g z)) x \]

Reducing the outer redex yields:

\[ M_1 = f (I x) (g x) \]

While reducing the inner redex yields:

\[ M_2 = (\lambda z . f z (g z)) x \]

Both reduce to:

\[ N = f x (g x) \]

At last, we have a diamond of some kind: beta-reduction satisfies the weak diamond property. However, as the name suggests, this property is strictly weaker than the diamond property, as this well-known pathological example demonstrates.

We can step from \(B\) to \(C\) and vice versa. Only \(B\) can step to the dead end \(A,\) and only \(C\) can step to the dead end \(D.\)

The weak diamond property holds. For example, starting from \(B\), we can step to \(A\) or \(C\), and in both cases, we can eventually reduce to \(A\).

However, the transitive closure of this reduction lacks the diamond property. From \(B\), we can reach \(A\) or \(D\), but there is no way to reach the same term from both \(A\) and \(D.\)

In sum, the \(\rightarrow\) reduction plods too slowly to get the diamond property, while \(\twoheadrightarrow\) behaves too chaotically to reason with directly.

Let’s try again, with a compromise between \(\rightarrow\) and \(\twoheadrightarrow\). We inductively define \(\underset{l}{\twoheadrightarrow}\) as follows:

\(M \underset{l}{\twoheadrightarrow} M\)
If \(M \underset{l}{\twoheadrightarrow} M'\) then \(\lambda x.M \underset{l}{\twoheadrightarrow} \lambda x.M'\).
If \(M \underset{l}{\twoheadrightarrow} M', N \underset{l}{\twoheadrightarrow} N'\) then \(M N \underset{l}{\twoheadrightarrow} M' N'\).
If \(M \underset{l}{\twoheadrightarrow} M', N \underset{l}{\twoheadrightarrow} N'\) then \((\lambda x.M)N \underset{l}{\twoheadrightarrow} M'[x := N']\).

The last rule does the actual work, while the others pipe it to any subterms of a lambda expression.

Unlike \(\rightarrow\) it may involve more than one beta-reduction so we have a chance of proving the diamond property. Unlike \(\twoheadrightarrow\), we can only reduce redexes already present in a term; we never touch fresh redexes created by such reductions.

Furthermore, whenever beta-reduction copies an argument, the last rule forces us to choose one version of it. For example, we cannot reduce one copy and leave another copy untouched:

\[ (\lambda z . f z (g z)) (I x) \underset{l}{\twoheadrightarrow} f (I x) (g (I x)) \]

and:

\[ (\lambda z . f z (g z)) (I x) \underset{l}{\twoheadrightarrow} f x (g x) \]

but:

\[ (\lambda z . f z (g z)) (I x) \not\underset{l}{\twoheadrightarrow} f (I x) (g x) \]

We would need 2 steps:

\[ (\lambda z . f z (g z)) (I x) \underset{l}{\twoheadrightarrow} f (I x) (g (I x)) \underset{l}{\twoheadrightarrow} f (I x) (g x) \]

A few inductions later, and one use of the substitution lemma, we find \(\underset{l}{\twoheadrightarrow}\) has the diamond property, which implies its transitive closure \(\twoheadrightarrow\) has the diamond property. QED.

The above proof is due to Tait and Martin-Löf. The original proof by Church and Rosser is worth studying. Recall we showed \(\rightarrow\) has the weak diamond property. We show something stronger, by tolerating asymmetry:

Strip Lemma: Let \(M\) be a lambda term. Suppose \(M \rightarrow M_1, M \twoheadrightarrow M_2\). Then \(M_1 \twoheadrightarrow N, M_2 \twoheadrightarrow N\) for some \(N\).

In other words, we do take baby steps after all, but only on one side. From \(M,\) we perform exactly one beta-reduction to reach \(M_1\), but any number of beta-reductions to reach \(M_2\). Induction yields the Church-Rosser theorem: instead of stitching a quilt, we glue strips together.

Let \(\Delta\) be the redex of \(M\) that is reduced to produce \(M_1\). We tag the \(\lambda\) in \(\Delta,\) say, by replacing it with \(\lambda'\), and watch what happens to it on the journey from \(M\) to \(M_2\).

The redex \(\Delta\) may be copied, destroyed, or reduced. Its body and argument may have undergone many beta-reductions. But we can be sure that any surviving \(\lambda'\) must be part of a redex that was originally \(\Delta\): the application has yet to be reduced since the \(\lambda'\) still stands.

Define \(N\) to be \(M_2\) except we reduce every redex beginning with a \(\lambda'\) from the inside out. By definition \(M_2 \twoheadrightarrow N\). It turns out we can show \(M_1 \twoheadrightarrow N\) via induction and the substitution lemma.

To recap:

We wanted to prove something about \(\twoheadleftarrow M \twoheadrightarrow\) but this seems hard to tackle directly.
We tried \(\leftarrow M \rightarrow\) but this is too weak.
We tried \(\underset{l}{\twoheadleftarrow} M \underset{l}{\twoheadrightarrow}\), a reduction that is a compromise between the two extremes. It worked! (Tait and Martin-Löf.)
We alse tried \(\leftarrow M \twoheadrightarrow\). It worked! (Church and Rosser.)

Normalization Theorem

The normalization theorem was also proved by Church and Rosser, though we follow the path taken by Curry and Feys, who prove the normalization theorem by proving the standardization theorem.

To reach normal form we must reduce all redexes of a term. Reducing the leftmost redex first seems natural, because it is neither the body nor argument of any following redex. Thus if a normal form exists, surely we must reach it if we rinse and repeat?

Well, suppose reducing the leftmost redex creates a bunch of redexes, and then reducing the new leftmost redex creates a bunch more. If it keeps up, perhaps we’ll never reach normal form. What if the only way to prevent such an explosion is to reduce certain other redexes before touching the leftmost redex? We might suspect this to be impossible. This turns out to be true, but the proof is tricky.

We first establish some bounds. Given a term, choose any subset of its redexes and tag the lambda of each one. For example:

\[ (\lambda' x . x x) (\lambda y . y y) \]

Only the first lambda can be tagged, as the second lambda is not part of a redex.

We can reduce the tagged redex (which we’ll call a tagged reduction) to get:

\[ (\lambda y . y y) (\lambda y . y y) \]

No tagged redexes remain. More generally, starting from any set of tagged redexes in any term, we’ll see that we can reduce all tagged redexes so no tags remain in a finite number of steps.

Let’s take a bigger example, where we have tagged two redexes:

\[ (\lambda' x . x ( ( \lambda' y . y x y) (\lambda z . z (z z)))) (\lambda w . w) \]

We assign positive integer weights to each variable. We give them increasing powers of 2 going from right to left. For example:

\[ (\lambda' x . x^{128} ( ( \lambda' y . y^{64} x^{32} y^{16}) (\lambda z . z^8 (z^4 z^2)))) (\lambda w . w^1) \]

Then reducing a tagged lambda replaces a power of 2 with at most the sum of all smaller powers of 2, which is strictly smaller. Thus the total weight decreases.

One can show via induction (involving a couple of tricky cases) that further reductions of tagged lambdas continue to decrease the total weight.

As a decreasing sequence of positive integers must terminate, tagged reduction is strongly normalizing, that is, no matter what order they are reduced, after a finite number of steps, all tags will be eliminated. In particular, we can reduce them left to right.

By a similar proof to regular untagged beta-reduction, tagged reduction has the weak diamond property.

A strongly normalizing relation with the weak diamond property also has the diamond property, and hence a unique normal form.

It turns out \(\underset{l}{\twoheadrightarrow}\) from above is the same as tagging some redexes then reducing them until no tags remain, so we shall reuse this notation.

There’s more to do. Even if we initially tag all redexes, new redexes may be synthesized along the way as we reduce. Of course, we can then tag all the new redexes and reduce them left to right, but this is no longer normal order, which demands we always reduce the left-most redex whether or not it was initially tagged.

We solve this problem by showing various diagrams commute, implying that repeatedly retagging and reducing is equivalent to normal order.

If a term has the form:

\[ \lambda x_1 … x_n . (\lambda x . M) … \]

then we say its head redex is \( (\lambda x . M). \)

Let \(\underset{h}{\rightarrow}\) denote the reduction of the head redex, and \(\underset{i}{\rightarrow}\) denote the reduction of an internal redex: any redex besides the head redex. One can show:

And use this to show:

A single beta-reduction is equivalent to tagging a single redex and reducing it, thus removing all tags. Thus we can specialize the above and induct to get:

Any \( M \twoheadrightarrow N \) is a a mix of head and internal reductions:

\[ M \underset{h}\twoheadrightarrow M_1 \underset{i}\twoheadrightarrow M_2 \underset{h}\twoheadrightarrow M_3 \underset{i}\twoheadrightarrow … N \]

Applying the above commuting diagram repeatedly yields:

Lastly, structural induction yields:

Standardization theorem (Curry and Feys): If \(M \twoheadrightarrow N\) then \(M \underset{s}{\twoheadrightarrow} N\), where the latter denotes a standard reduction: a reduction where we never reduce redexes to the left of a redex we have already reduced.

The normalization theorem follows immediately: if \(M \underset{s}{\twoheadrightarrow} N\) and \(N\) is normal, then \(N\) contains no redexes which implies the standard reduction must have reduced the leftmost redex at each step.

Curiously, we relied on \(\underset{l}{\twoheadrightarrow}\) as featured in the proof due to Tait and Martin-Löf, as well as the tagged redex idea featured in the strip lemma due to Church and Rosser.

Böhm’s Theorem

See my online demo.

Diagrams

We draw the SVG diagrams in this page with the help of my toy diagrams library.

jsEval "curl_module('../haskell/Diagram.ob')"

import Diagram
gen name dia = jsEval_ $ name ++ ".innerHTML=`" ++ svg 20 dia ++ "`;"

textAdjust = translate (-0.4 :+ -0.25)
label = textAdjust . text
object s = label s <> (unitCircle |> named s)
dubArrow = straightArrowWith dubDart ""
existsDubArrow = straightArrowWith dubDart dashedAttrs

diamond [a, b, b', c] = vcat [a
  , strutY 3
  , beside -1 (strutX 6 ||| b') b
  , strutY 3
  , c]

gen "crdiamond" $ diamond (object <$> ["A", "B", "B'", "C"])
  |> dubArrow "A" "B"
  |> dubArrow "A" "B'"
  |> existsDubArrow "B" "C"
  |> existsDubArrow "B'" "C"

m1 = textAdjust (text "M" <> translate (0.75 :+ -0.25) (scale 0.7 $ text "1"))
  <> (unitCircle |> named "M1")
m2 = textAdjust (text "M" <> translate (0.75 :+ -0.25) (scale 0.7 $ text "2"))
  <> (unitCircle |> named "M2")

phantomUnitCircle = unitCircle { _svg = \_ -> id }

gen "reflexBetaQuestion" $ diamond [object "M", m1, m2, translate (-0.3 :+ -0.3) (scale 1.5 $ text "?") <> phantomUnitCircle |> named "?"]
  |> straightArrow "M" "M1"
  |> straightArrow "M" "M2"
  |> existsArrow "M1" "?"
  |> existsArrow "M2" "?"

gen "weakDiamond" $ diamond [object "M", m1, m2, object "N"]
  |> straightArrow "M" "M1"
  |> straightArrow "M" "M2"
  |> existsDubArrow "M1" "N"
  |> existsDubArrow "M2" "N"

gen "weakAintDiamond" $ hcat (intersperse (strutX 4) $ object . pure <$> ['A'..'D'])
  |> straightArrow "B" "A"
  |> straightArrow "C" "D"
  |> curvedArrow (-tau/6) "B" "C" (tau/8) (tau*3/8)
  |> curvedArrow -(tau/6) "C" "B" (-3*tau/8) (-tau/8)

gen "strip" $ ((strutX 4 ||| object "M") === (strutY 1 === m1))
  === (translate (5 :+ -10) $ (strutX 4 ||| m2) === (strutY 1 === object "N"))
  |> straightArrow "M" "M1"
  |> dubArrow "M" "M2"
  |> existsDubArrow "M1" "N"
  |> existsDubArrow "M2" "N"

gen "commuteHeadInternal" $ vcat
  [ hcat [object "M", strutX 5 <> translateY 0.3 (text "i"), object "M'"]
  , strutY 5 <> translateX -0.7 (text "h") <> translateX 7.2 (text "h")
  , hcat [object "N", strutX 5 <> translateY -0.7 (text "i"), object "N'"]
  ]
  |> straightArrow "M" "M'"
  |> dubArrow "M'" "N'"
  |> existsDubArrow "M" "N"
  |> existsDubArrow "N" "N'"

gen "beheadGeneral" $ vcat
  [ object "M"
  , strutY 5  <> translateX -0.7 (text "h")
  , hcat [object "M'", strutX 5 <> translateY -0.7 (text "i"), object "N"]
  ]
  |> dubArrow "M" "N"
  |> existsDubArrow "M" "M'"
  |> existsDubArrow "M'" "N"

gen "beheadDetagger" $ vcat
  [ object "M"
  , strutY 5  <> translateX -0.7 (text "h") <> translateX 4 (text "l")
  , hcat [object "M'", strutX 5 <> translate (-0.2 :+ -0.7) (text "l,i"), object "N"]
  ]
  |> dubArrow "M" "N"
  |> existsDubArrow "M" "M'"
  |> existsDubArrow "M'" "N"

gen "commuteHeadInternalDetagger" $ vcat
  [ object "M" <> translate (4.8 :+ 0.3) (text "l,i")
  , strutY 4 <> translateX -0.7 (text "h")
  , m1
  , strutY 3 <> translateX -0.7 (text "h") <> translate (5.2 :+ 0.2) (text "l")
  , object "N"
  ] ||| vcat
  [ object "M'"
  , strutY 9 <> translateX 0.7 (text "h")
  , object "N'" <> translate (-5.2 :+ -0.7) (text "l,i")
  ]
  |> dubArrow "M" "M'"
  |> straightArrow "M'" "N'"
  |> existsArrow "M" "M1"
  |> existsDubArrow "M1" "N'"
  |> existsDubArrow "M1" "N"
  |> existsDubArrow "N" "N'"