\x y z -> x z(y z)
Outcoding UNIX geniuses
Static types prevent disasters by catching bugs at compile time. Yet many languages have self-defeating type systems. Type annotations can be so laborious that weary programmers give up and switch to unsafe languages.
Adding insult to injury, some of these prolix type systems are simultaneously inexpressive. Some lack parametric polymorphism, forcing programmers to duplicate code or subvert static typing with casts. A purist might escape this dilemma by writing code to generate code, but this leads to another set of problems.
Type theory shows how to avoid these pitfalls, but many programmers seem unaware:
- Popular authors Bruce Eckel and Robert C. Martin mistakenly believe strong typing implies verbosity and, worse still, that testing conquers all. Tests are undoubtedly invaluable, but at best they "prove" by example. As Dijkstra put it, testing can be used "…to show the presence of bugs, but never to show their absence". One could even argue a test-heavy approach helps attackers find exploits, as untested cases may hint at overlooked bugs. Rather than relying on testing alone, we need strong static types so we can prove the correctness of programs.
- The designers of the Go language, including famed former Bell Labs researchers, have been stumped by polymorphism for years. To be fair, parametric polymorphism is only half the story. The other half is ad hoc polymorphism, which Haskell researchers only solved neatly in the late 1980s with type classes. Practical Haskell compilers also need more type trickery for unboxing.
Why is this so? Perhaps people think the theory is arcane, dry, and impractical?
By working through some programming interview questions, we’ll find the relevant theory is surprisingly accessible. We quickly arrive at a simple type inference or type reconstruction algorithm that seems too good to be true: it powers strongly typed languages that support parametric polymorphism without requiring any type declarations.
We’ll write code that figures out the function:

\x y z -> x z (y z)

has type:

(a -> b -> c) -> (a -> b) -> a -> c

And given that (+) has type Int -> Int -> Int, our code determines that the function:

\x -> (+) (x 42)

has type:

(Int -> Int) -> Int -> Int
1. Identifying twins
Determine if two binary trees storing integers in their leaf nodes are equal.
Solution: Two words: deriving Eq.
data Tree a = Leaf a | Branch (Tree a) (Tree a) deriving Eq
Haskell’s derived instance feature automatically works on any algebraic data type built on any type for which equality makes sense. It even works for mutually recursive data types (see Data.Tree):
data Tree a = Node a (Forest a) deriving Eq
data Forest a = Forest [Tree a] deriving Eq
Perhaps my interviewer would ask me to explain what deriving Eq does.
Roughly speaking, it generates code like the following, saving the programmer from stating the obvious:
data Tree a = Leaf a | Branch (Tree a) (Tree a)

eq (Leaf x) (Leaf y) = x == y
eq (Branch xl xr) (Branch yl yr) = eq xl yl && eq xr yr
eq _ _ = False
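A quick sanity check (my own example, not from the original text): with the derived instance from the start of this section, the first expression below should evaluate to True and the second to False.

Branch (Leaf 1) (Leaf 2) == Branch (Leaf 1) (Leaf 2)
Branch (Leaf 1) (Leaf 2) == Leaf 3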
2. On assignment
This time, the first tree may contain variables in place of integers. Can we unify the trees, that is, assign integers to all variables so the trees are equal? The same variable may appear more than once.
Solution: We extend our data structure to hold variables:
data Tree a = Var String | Leaf a | Tree a :. Tree a deriving Show
As before, we traverse both trees and look for nodes that differ in value or type. If the first is a variable, then we record a constraint, that is, a variable assignment that is required for the trees to be equal, such as a = 4 or b = 2. If there are conflicting values for the same variable, then we indicate failure by returning Nothing. Otherwise we return Just the list of assignments found.
solve1 = go [] where
  go as = \cases
    (xl :. xr) (yl :. yr) -> do
      as0 <- go as xl yl
      go as0 xr yr
    (Leaf x) (Leaf y)
      | x == y -> Just as
      | otherwise -> Nothing
    (Var v) (Leaf x) -> case lookup v as of
      Nothing -> Just $ (v, x):as
      Just x'
        | x == x' -> Just as
        | otherwise -> Nothing
    _ _ -> Nothing

solve1 (Leaf 1) (Leaf 2)
solve1 (Leaf 3) (Leaf 3)
solve1 (Leaf 4) (Var "x")
solve1 (Var "x" :. Var "y") (Leaf 5 :. Leaf 6)
solve1 (Var "x" :. Var "x") (Leaf 5 :. Leaf 6)
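If I have traced solve1 correctly, the five sample calls above should return the following (these annotations are mine, not part of the original):

-- solve1 (Leaf 1) (Leaf 2)                        ==> Nothing, since 1 /= 2
-- solve1 (Leaf 3) (Leaf 3)                        ==> Just []
-- solve1 (Leaf 4) (Var "x")                       ==> Nothing; only the first tree may hold variables
-- solve1 (Var "x" :. Var "y") (Leaf 5 :. Leaf 6)  ==> Just [("y",6),("x",5)]
-- solve1 (Var "x" :. Var "x") (Leaf 5 :. Leaf 6)  ==> Nothing; "x" cannot be both 5 and 6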
3. Both Sides, Now
Now suppose leaf nodes in both trees can hold integer variables. Can two trees be made equal by assigning certain integers to the variables? If so, find the most general solution.
Solution: We proceed as before, but now we may encounter constraints such as a = b, which equate two variables. To handle such a constraint, we pick one of the variables, such as a, and replace all occurrences of a with the other side, which is b in our example. This eliminates a from all constraints. Eventually, all our constraints have an integer on at least one side, which we check for consistency.
We write a function that substitutes a given tree for a given variable. This is more general than necessary since we only need to substitute variables for variables, but we’ll soon need its full power.
sub x t = \case
  Var x' | x == x' -> t
  l :. r -> sub x t l :. sub x t r
  a -> a

sub "x" (Leaf 1 :. Leaf 2) ((Var "x" :. (Leaf 3 :. Var "y")) :. Var "x")
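For reference (my annotation), the sample call replaces both occurrences of Var "x" and should evaluate to:

((Leaf 1 :. Leaf 2) :. (Leaf 3 :. Var "y")) :. (Leaf 1 :. Leaf 2)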
We discard redundant constraints where the same variable appears on both sides, such as a = a. Thus a variable may wind up with no integer assigned to it, which means that if a solution exists, it can take any value.
Our code is inefficient. Rather than repeat the same chain of substitutions to reach a representative variable, we should use a more efficient disjoint-set data structure.
foldSub = foldr $ uncurry sub

solve2 = go [] where
  go as = \cases
    (Leaf x) (Leaf y)
      | x == y -> Just as
      | otherwise -> Nothing
    (xl :. xr) (yl :. yr) -> go as xl yl >>= \as0 ->
      go as0 (foldSub xr as0) (foldSub yr as0)
    (Var _) (_ :. _) -> Nothing
    (Var v) (Var w) | v == w -> Just as
    (Var v) t -> Just $ (v, t):as
    x y@(Var _) -> go as y x
    _ _ -> Nothing

solve2 ((Var "x" :. Leaf 1) :. Var "z") ((Var "y" :. Var "y") :. Leaf 2)
solve2 ((Var "x" :. Leaf 1) :. Var "z") ((Var "y" :. Var "z") :. Leaf 2)
solve2 ((Var "x" :. Leaf 1) :. Var "x") ((Var "y" :. Var "z") :. Leaf 2)
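Regarding the efficiency remark above: a disjoint-set (union-find) structure gives near-constant-time access to each variable's representative. The following is only a rough sketch under my own assumptions (a persistent Data.Map, no path compression; the helper names DSU, find, and union are made up here and are not part of the code above):

import qualified Data.Map as M

-- Each variable points at its parent; a representative points at itself
-- or is absent from the map.
type DSU = M.Map String String

-- Follow parent links until we reach a representative.
find :: DSU -> String -> String
find p v = case M.lookup v p of
  Just w | w /= v -> find p w
  _               -> v

-- Merge the equivalence classes of two variables.
union :: String -> String -> DSU -> DSU
union a b p = M.insert (find p a) (find p b) p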
4. Once more, with subtrees
What if variables can represent subtrees?
Solution: Although we’ve significantly generalized the problem, our answer remains almost the same.
When unifying a variable v against a tree t, to avoid infinite recursion, we perform an occurs check: the variable v must not occur in t, except in the trivial case when v == t. If we pass this check, then we record that, in future, we should replace v with the tree resulting from applying all substitutions found so far to t.
treeUnify :: (Show a, Eq a) => Tree a -> Tree a -> Maybe [(String, Tree a)]
treeUnify = go [] where
  go as = \cases
    (Leaf x) (Leaf y)
      | x == y -> Just as
      | otherwise -> Nothing
    (xl :. xr) (yl :. yr) -> do
      as0 <- go as xl yl
      go as0 (foldSub xr as0) (foldSub yr as0)
    (Var v) (Var w) | v == w -> Just as
    (Var v) t
      | occurs t -> Nothing
      | otherwise -> Just $ (v, foldSub t as):as
      where
      occurs = \case
        l :. r -> occurs l || occurs r
        Var w | v == w -> True
        _ -> False
    x y@(Var _) -> go as y x
    _ _ -> Nothing

treeUnify (Var "x") (Leaf 5)
treeUnify (Var "x" :. Leaf 5) (Leaf 4 :. Var "y")
treeUnify ((Var "x" :. Var "x") :. Var "w") (((Leaf 3 :. Leaf 5) :. (Var "y" :. Var "z")) :. Var "w")
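Assuming I have traced the algorithm correctly (these annotations are mine), the three calls should produce:

-- Just [("x",Leaf 5)]
-- Just [("y",Leaf 5),("x",Leaf 4)]
-- Just [("z",Leaf 5),("y",Leaf 3),("x",Leaf 3 :. Leaf 5)]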
5. Type inference!
Design a language based on lambda calculus where integers and strings are primitive types, and where we can deduce whether a given expression is typable, that is, whether types can be assigned to the untyped bindings so that the expression is well-typed. If so, find the most general type.
For example, the expression \f -> f 2, which takes its first argument f and applies it to the integer 2, must have type (Int -> u) -> u. Here, u is a type variable, that is, we can substitute u with any type. This is known as parametric polymorphism.
More precisely, the inferred type is most general, or principal, if:
- Substituting types such as Int or (Int -> Int) -> Int (sometimes called type constants for clarity) for all the type variables results in a well-typed closed term.
- Any other type we can assign to the given expression is an instance of this one, obtained by substituting types for its type variables.
Solution: We define a tree to represent types. Like our previous trees, leaf nodes can be variables (TV for type variable) or constants (TC for type constant). The (:->) constructor plays the same role as (:.) above, but this time it also represents functions.
infixr 5 :->

data Type = TC String | Type :-> Type | TV String

instance Show Type where
  showsPrec d = \case
    TC s -> (s++)
    TV s -> (s++)
    a :-> b -> showParen (d > 0) $ showsPrec 1 a . (" -> "++) . shows b
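To make the encoding concrete (my own example, not from the original): the principal type (Int -> u) -> u from earlier is built as the value below, and the Show instance above should render it back as (Int -> u) -> u.

(TC "Int" :-> TV "u") :-> TV "u"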
We write a function tsub that substitutes a given tree for a given variable, analogous to sub above:
tsub (x, t) = \case
  TV x' | x == x' -> t
  a :-> b -> tsub (x, t) a :-> tsub (x, t) b
  y -> y
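For example (my annotation), substituting Int for the type variable a turns a -> b into Int -> b:

tsub ("a", TC "Int") (TV "a" :-> TV "b")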
Tree unification is almost the same, though this time we switch from Maybe to Either so we can add error messages:
unify :: Type -> Type -> Either String [(String, Type)]
unify = go [] where
  go as = \cases
    (TC x) (TC y) | x == y -> Right as
    (xl :-> xr) (yl :-> yr) -> do
      as0 <- go as xl yl
      go as0 (foldr tsub xr as0) (foldr tsub yr as0)
    (TV v) (TV w) | v == w -> Right as
    (TV v) t
      | occurs t -> Left "occurs check"
      | otherwise -> Right $ (v, foldr tsub t as):as
      where
      occurs = \case
        l :-> r -> occurs l || occurs r
        TV w | v == w -> True
        _ -> False
    x y@(TV _) -> go as y x
    x y -> Left $ show x ++ " versus " ++ show y
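A couple of quick checks (my own examples; the expected results are shown as comments, assuming I have traced the code correctly):

unify (TV "a" :-> TC "Int") (TC "String" :-> TV "b")
-- Right [("b",Int),("a",String)]
unify (TV "a") (TV "a" :-> TC "Int")
-- Left "occurs check"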
Next, we define a data structure to hold an expression in our language. Applications, lambda abstractions, variables, integers, and strings:
infixl 5 :@

data Expr = Expr :@ Expr | Lam String Expr | V String | I Int | S String

instance Show Expr where
  showsPrec d = \case
    a :@ b -> showParen (d > 0) $ shows a . (' ':) . showsPrec 1 b
    Lam v t -> ('\\':) . (v++) . (" -> "++) . shows t
    V s -> showParen (s == "+") $ (s++)
    I i -> shows i
    S s -> shows s
Unfortunately my base library lacks the state monad transformer, so we swipe some code from GHC:
data StateT s m a = StateT { runStateT :: s -> m (a, s) }

instance (Monad m) => Monad (StateT s m) where
  return a = StateT $ \ s -> return (a, s)
  m >>= k  = StateT $ \ s -> do
    (a, s') <- runStateT m s
    runStateT (k a) s'

instance (Functor m) => Functor (StateT s m) where
  fmap f m = StateT $ \ s ->
    fmap (\(a, s') -> (f a, s')) $ runStateT m s

instance (Functor m, Monad m) => Applicative (StateT s m) where
  pure a = StateT $ \ s -> return (a, s)
  StateT mf <*> StateT mx = StateT $ \ s -> do
    (f, s') <- mf s
    (x, s'') <- mx s'
    return (f x, s'')
  m *> k = m >>= \_ -> k

state f = StateT (return . f)
getT = state $ \ s -> (s, s)
putT s = state $ \ _ -> ((), s)
lift m = StateT $ \ s -> do
  a <- m
  return (a, s)
To finish off, we traverse the expression and unify types along the way to enforce the rules:
- The type of a primitive value is its corresponding type; for example, 5 has type TC "Int" and "Hello, World" has type TC "String".
- For an application f x, we recursively determine the type tf of f and tx of x, possibly requiring new assignments. We return a new type variable tfx, and unify tf with tx :-> tfx after applying any new assignments.
- For a lambda abstraction \x.t, we generate a new type variable tx to represent the type of x. Then we recursively find the type tt of t, during which we assign the type tx to any free occurrence of x, then return the type tx :-> tt.
Let’s walk through an example. The expression \f -> f 2 would be represented as the Expr tree Lam "f" (V "f" :@ I 2). To type this expression:
- At the top level we have a lambda, so begin by generating a new type variable t0.
- Recursively type the lambda body, with the additional context that the symbol f has type t0.
  - Generate a new type variable t1.
  - Recursively type the left child of the (:@) node:
    - We have the symbol f, which has type t0 from the context.
  - Recursively type the right child of the (:@) node:
    - We have the integer constant 2, so we return the type TC "Int".
  - Unify t0 with TC "Int" -> t1, which assigns the latter to the former.
  - Return the type t1.
- Return the type t0 -> t1.
Thus \f -> f 2 has type t0 -> t1, which expands to (Int -> t1) -> t1 if we apply all the assignments we found.
We use StateT to help with bookkeeping. To guarantee a unique name for each type variable, we maintain a counter which we increment for each new name; the counter is stored along with the type variable assignments collected so far. The environment gamma holds the types of any predefined functions, and also records the types of variables in lambda abstractions.
We ultimately return the type of an expression, or an error message explaining why it is ill-typed. We name the function gather because it accumulates more and more type variable assignments over time, but also because "to gather" can mean "to infer".
gather :: [(String, Type)] -> Expr -> StateT ([(String, Type)], Int) (Either String) Type
gather gamma = \case
  I _ -> pure $ TC "Int"
  S _ -> pure $ TC "String"
  f :@ x -> do
    tf <- gather gamma f
    tx <- gather gamma x
    tfx <- newTV
    (cs, i) <- getT
    let rewrite t = foldr tsub t cs
    cs' <- lift $ unify (rewrite tf) (rewrite $ tx :-> tfx)
    putT ((second rewrite <$> cs') ++ cs, i)
    pure tfx
  V s -> lift $ maybe (Left $ "missing " ++ s) Right $ lookup s gamma
  Lam x t -> do
    tx <- newTV
    (tx :->) <$> gather ((x, tx):gamma) t
  where
  newTV = do
    (cs, i) <- getT
    putT (cs, i + 1)
    pure $ TV $ 't':show i
This algorithm is the heart of the Hindley-Milner type system, or HM for short. See Mark P. Jones, Typing Haskell in Haskell.
The infer wrapper shows the type returned by gather with all assignments applied, or returns an error message:
infer gamma x = case runStateT (gather gamma x) ([], 0) of
  Right (ty, (cs, _)) -> show $ foldr tsub ty cs
  Left e -> "FAIL: " ++ e

infer [] (Lam "x" (V "x" :@ I 2))
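If I have followed the algorithm correctly, that last call should yield the string below, matching the principal type we derived by hand (the exact variable name comes from the counter):

(Int -> t1) -> t1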
We build a demo with a couple of predefined functions with certain types:
prelude :: [(String, Type)]
prelude = [ ("+", TC "Int" :-> TC "Int" :-> TC "Int"),
  ("length", TC "String" :-> TC "Int")]

demo expr = do
  putStrLn $ "type of `" ++ show expr ++ "`:"
  putStrLn $ " " ++ infer prelude expr ++ "\n"

demo $ V "length"
demo $ V "length" :@ S "hello"
demo $ V "length" :@ V "length"
demo $ V "+" :@ I 1 :@ I 1
demo $ V "+" :@ I 1 :@ S "hello"
demo $ Lam "x" (V "x" :@ I 2)
demo $ Lam "x" (V "x" :@ V "x")
demo $ Lam "x" ((V "+" :@ V "x") :@ I 42)
demo $ Lam "x" (V "+" :@ (V "x" :@ I 42))
demo $ Lam "x" (V "x")
demo $ Lam "x" (Lam "y" (V "x"))
demo $ Lam "x" (Lam "y" (Lam "z" ((V "x" :@ V "z") :@ (V "y" :@ V "z"))))