Haskell - Parser Combinators

= Parser Combinators =

Suppose we write a function to parse an `Integer`. What type should it have?

We might try `+String -> Integer+`:

[ ]:

parseInteger :: String -> Integer
parseInteger = foldl (\n d -> 10*n + fromIntegral (fromDigit d)) 0

fromDigit c = ord c - ord '0'

parseInteger "123" + parseInteger "456"

The `read` function generalizes this, and it suffices for some
applications. Nevertheless, often we'd prefer a parser that also:

  * handles invalid input
  * allows trailing characters

Our design should take into account that we may want to parse any trailing
characters with another function.

== The hard way ==

One solution is a function that returns all valid ways to parse an `Integer` at
the start of a string along with the remainder of the string. In particular, it
returns the empty list on failure.

----------------------------------------------------------------
String -> [(Integer, String)]
----------------------------------------------------------------

The `reads` function generalizes this, and
https://webdoc.sub.gwdg.de/ebook/serien/ah/UU-CS/2008-044.pdf[Doaitse
Swierstra, _Combinator Parsing: A Short Tutorial_] builds upon this approach.

However, we'll insist on at most one correct way to parse a string.
We also provide a mechanism for reporting an error message:

----------------------------------------------------------------
String -> Either String (Integer, String)
----------------------------------------------------------------

This type accommodates parser functions that take a string and return either an
error message or the successfully parsed integer at the start of the string
along with the remainder of the input.

With this design, we might write a function to parse a single digit as follows:

[ ]:

parseDigit = \case
  c:s' | '0' <= c, c <= '9' -> Right (fromDigit c, s')
  _ -> Left "want digit"

parseDigit "2b"
parseDigit "b2"

From this we can build a parser for a two-digit number:

[ ]:

parseTwoDigits s = do
  (a, s') <- parseDigit s
  (b, s'') <- parseDigit s'
  pure (10*a + b, s'')

parseTwoDigits "221b"
parseTwoDigits "2b21"

It works, but our design is infested with boilerplate. We must carefully manage
the input string, ensuring that the unparsed remainder of one step is fed to
the next. The pain worsens as we write more complex parsers.

We would prefer to write plumbing code like this once, and reuse it many times.
This is easy in Haskell.

== The easy way ==

We start by defining a data type to hold a parsing function, which we call a
`Charser`, a sort of portmanteau meant to evoke a parser that handles one Char
at a time. We also generalize so that our parsers can return any desired type.

[ ]:

data Charser a = Charser { unCharser :: String -> Either String (a, String) }

Next, we define some fundamental parsers. The `sat` parser returns the first
character if it satisfies a given predicate, and fails otherwise. THe `eof`
parser returns `()` if the given string is empty, and fails otherwise.

[ ]:

sat f = Charser \case
  h:t | f h -> Right (h, t)
  _ -> Left "unsat"

eof = Charser \case
  [] -> Right ((), "")
  _ -> Left "want EOF"

We can easily get at the function via `unCharser`:

[ ]:

unCharser (sat (== 'x')) "xyz"
unCharser (sat (== 'x')) "abc"
unCharser eof ""
unCharser eof "abc"

Lastly, we define `Functor`, `Applicative`, `Monad`, and `Alternative`
typeclass instances for the `Charser` type. The `Functor` instance is standard
stuff (that GHC can derive automatically) that lets us easily operate on a
parsed value, while the others describe how to compose two parsers. The `Monad`
and `Applicative` instances meticulously thread the input string from the first
parser to the second parser, while the `Alternative` instance builds a parser
that tries the first parser, and on failure tries the second parser.

These correpond to sequencing and the choice operator denoted by a vertical bar
"|" in a https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form[BNF grammar],
though in our framework we iterate through the choices in the order of
appearance and take the first match.

[ ]:

instance Functor Charser where fmap f (Charser x) = Charser $ fmap (first f) . x
instance Applicative Charser where
  pure a = Charser $ Right . (a,)
  f <*> x = Charser \s -> do
    (fun, t) <- unCharser f s
    (arg, u) <- unCharser x t
    pure (fun arg, u)
instance Monad Charser where
  Charser f >>= g = Charser $ (uncurry (unCharser . g) =<<) . f
  return = pure
instance Alternative Charser where
  empty = Charser $ const $ Left ""
  x <|> y = Charser \s -> either (const $ unCharser y s) Right $ unCharser x s

And we're done! We have just defined _parser combinators_, which we compose to
build complex parsers without needing to juggle bits and pieces of input
strings. Our two-digit parser example becomes:

[ ]:

digitChar = sat \c -> '0' <= c && c <= '9'

twoDigits = do
  a <- digitChar
  b <- digitChar
  pure $ 10*fromDigit a + fromDigit b

unCharser twoDigits "221b"
unCharser twoDigits "2b21"

== Arbitrary-precision calculator ==

We demonstrate a more sophisticated parser that parses an integer arithmetic
expression. Rather than store an expression in a tree data structure, we
evaluate as we parse.

[ ]:

char :: Char -> Charser Char
char = sat . (==)

oneOf :: String -> Charser Char
oneOf s = sat (`elem` s)

chainl1 :: Charser a -> Charser (a -> a -> a) -> Charser a
chainl1 p infixOp = go =<< p where
  go x = (go =<< ($x) <$> infixOp <*> p) <|> pure x

chainr1 :: Charser a -> Charser (a -> a -> a) -> Charser a
chainr1 p infixOp = (&) <$> p <*>
  (flip <$> infixOp <*> chainr1 p infixOp <|> pure id)

sp :: Charser a -> Charser a
sp p = p <* many (oneOf " \n")  -- Eat trailing spaces.

spch = sp . char

op c f = spch c *> pure f

line = sp (pure ()) *> expr <* eof where
  num  = parseInteger <$> sp (some digitChar)
  atm  = num <|> spch '(' *> expr <* spch ')'
  pow  = chainr1 atm $ op '^' (^)
  una  = foldr (.) id <$> many (op '+' id <|> op '-' negate)
  fac  = una <*> pow
  term = chainl1 fac $ op '*' (*) <|> op '/' div <|> op '%' mod
  expr = chainl1 term $ op '+' (+) <|> op '-' (-)

unCharser line "3 + 4 * 5"

Our `line` parser closely resembles the grammar it parses:

------------------------------------------------------------------------------
num    ::= ('0'|..|'9'|'.')+
atm    ::= num | '(' expr ')'
pow    ::= atm ('^' atm )*
una    ::= ('+'|'-')*
fac    ::= una pow
term   ::= fac ( ('*'|'/'|'%') fac )*
expr   ::= term ( ('+'|'-') term )*
------------------------------------------------------------------------------

In fact, one might say the definition of `line` is an
https://en.wikipedia.org/wiki/Attribute_grammar[attribute grammar].

We omit the rules involving spaces, which are traditionally considered part of
a separate phase called lexical analysis. With parser combinators, such
divisions are blurrier, and it can be practical to combine the two phases into
a single catch-all parsing phase.

We embed the parser in a calculator web widget. Type an expression into the
textbox and press enter.

[ ]:

calc = do
  s <- jsEval "calcIn.value;"
  putStr $ case unCharser line s of
    Left err -> err
    Right (v, _) -> s ++ " = " ++ show v

jsEval [r|
repl.outdiv.insertAdjacentHTML('beforeend', `
<textarea style="border: solid 4px #999999" id="calcIn" rows="1" cols="40">8888+88 / 8+8 / 8+8 * 8</textarea>
<div id="calcOut"></div>
`);

calcIn.addEventListener("keydown", ev => {
  if (ev.keyCode == 13) {
    const r = repl.run("chat", ["Main"], "calc");
    calcOut.innerText = r.out;
    calcIn.value = "";
    ev.preventDefault();
  }
});
|]

== The Fast Way ==

Suppose we have a complex grammar and a long input that fails to parse near the
end. Due to our implementation of `(<|>)`, after the first failure, the parser
will backtrack and explore every alternative as it unwinds, before ultimately
giving up anyway.

Not only is this slow, but backtracking also discards the error thrown by the
original failure, leading to error messages that are difficult to understand.

To address these issues, popular parser combinator libraries change the meaning
of `(<|>)`. If the first alternative fails _and_ no input has been consumed,
only then is the second alternative attempted. That is, if a parser processes
even a single character before failing, then the parser halts immediately with
failure.

Short-circuiting failure is indeed faster, but introduces a problem. At times
we really do want backtracking, that is, we want to undo consuming some of the
input before trying an alternative. Thus parser combinator libraries also
provide a `try` function: it tries a given parser on the input, and on failure,
acts as if no input has been consumed so that alternatitves can be tried.
Naturally,
http://blog.ezyang.com/2014/05/parsec-try-a-or-b-considered-harmful/[we should
use `try` sparingly].

One way to implement this feature is to have parsers return a boolean
indicating whether input has been consumed.

[ ]:

data Parser a = Parser
  { unParser :: String -> (Bool, Either String (a, String)) }

instance Functor Parser where
  fmap f (Parser x) = Parser $ second (fmap $ first f) . x
instance Applicative Parser where
  pure a = Parser $ (False,) . Right . (a,)
  f <*> x = Parser \s -> let (c0, mf) = unParser f s in case mf of
    Left err -> (c0, Left err)
    Right (fun, t) -> let (c1, mg) = unParser x t in case mg of
      Left err -> (c0 || c1, Left err)
      Right (arg, u) -> (c0 || c1, Right (fun arg, u))
instance Monad Parser where
  Parser f >>= g = Parser \s -> let (c0, mf) = f s in case mf of
    Left err -> (c0, Left err)
    Right (x, t) -> let (c1, mg) = unParser (g x) t in case mg of
      Left err -> (c0 || c1, Left err)
      r -> (c0 || c1, r)
  return = pure
instance Alternative Parser where
  empty = Parser $ const (False, Left "")
  x <|> y = Parser \s -> let cm@(consumed, m) = unParser x s in case m of
    Left err -> if consumed then cm else unParser y s
    Right _ -> cm

try p = Parser \s -> let cm@(_, m) = unParser p s in case m of
  Left err -> (False, Left err)
  _ -> cm

sat f = Parser \case
  h:t | f h -> (True, Right (h, t))
  _ -> (False, Left "unsat")

(We call this version `Parser` instead of `Charser` because my interprefer has
issues with redefining typeclasses.)

Here's some examples showing how the new `(<|>)` behaves:

[ ]:

char = sat . (==)
string = mapM char

unParser (string "foo" <|> string "bar") "foo"
unParser (string "foo" <|> string "bar") "bar"
unParser (string "foo" <|> string "bar") "baz"
unParser (string "foo" <|> string "bar") "qux"
unParser (string "qux" <|> string "quux") "quux"
unParser (try (string "qux") <|> string "quux") "quux"

== The evolution of parsing ==

As a child, I was baffled by parsers because I only understood loops.
How does the computer handle arbitrarily nested parentheses?!

Parsers might have been fundamentally beyond my reach at the time, as some
research suggests that
https://modeltheory.org/papers/2018children-recursion.pdf[10-year-olds struggle
with the concept of recursion, while 11-year-olds struggle with understanding
recursive programs]. As it happened, I was 12 or so when I read about
https://en.wikipedia.org/wiki/Recursive_descent_parser[recursive descent
parsers]. (For memory, the book was Herbert Schildt, _Advanced Turbo C_.) I was
awestruck by their simplicity, elegance, and power. I felt I had leveled-up in
real life. I now knew everything about parsers.

I was awestuck again on encountering
https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools[the
"red dragon book"] at university.
Forget those childish recursive descent parsers! Instead, the computer should
do the hard work; the human should just write a handful of regular expressions
and a BNF grammar. Not only is it simple and elegant, but there are
strong guarantees: the parser generator produces efficient code and detects
ambiguities. I felt I knew everything about parsers. This time for real.

I was awestruck yet again years later when I stumbled upon
https://en.wikipedia.org/wiki/Parser_combinator[parser combinators].
Forget those cumbersome parser generators! I had thought combinators were a
mathematical curiosity; a fun alternative to Turing machines for studying
computability. But parsing functions that return a parsing function in addition
to performing other duties lead to amazingly succinct recursive descent
parsers.

But the shocks didn't stop. I stumbled upon
http://bford.info/pub/lang/peg.pdf[parsing expression grammars], or PEGs, which
can be parsed in linear time and are unambiguous. They can parse some
non-context-free languages, and it is unknown if they can parse all
context-free langauges.

And then there's
link:pwd.html[parsing with derivatives], which
https://arxiv.org/pdf/1604.04695.pdf[parse any context-free grammar in cubic
time] (on par with Earley's algorithm or CYK). In brief, "even though
link:re.html[Brzozowski defined the derivative with regular languages] in mind,
it works unaltered for context-free languages if you toss in laziness,
memoization and fixed points."

I now feel I know little about parsers. In practice, I let expedience be my
guide. If I must have features guaranteed by the theory, or if parser
combinators are unavailable, then I'll resort to a parser generator. Otherwise,
I'll choose a parser combinator library such as
https://hackage.haskell.org/package/megaparsec[Megaparsec].

Ben Lynn blynn@cs.stanford.edu 💡

Contents