= Parser Combinators = Suppose we write a function to parse an `Integer`. What type should it have? We might try `+String -> Integer+`:
[ ]:
parseInteger :: String -> Integer parseInteger = foldl (\n d -> 10*n + fromIntegral (fromDigit d)) 0 fromDigit c = ord c - ord '0' parseInteger "123" + parseInteger "456"
The `read` function generalizes this, and it suffices for some applications. Nevertheless, often we'd prefer a parser that also: * handles invalid input * allows trailing characters Our design should take into account that we may want to parse any trailing characters with another function. == The hard way == One solution is a function that returns all valid ways to parse an `Integer` at the start of a string along with the remainder of the string. In particular, it returns the empty list on failure. ---------------------------------------------------------------- String -> [(Integer, String)] ---------------------------------------------------------------- The `reads` function generalizes this, and https://webdoc.sub.gwdg.de/ebook/serien/ah/UU-CS/2008-044.pdf[Doaitse Swierstra, _Combinator Parsing: A Short Tutorial_] builds upon this approach. However, we'll insist on at most one correct way to parse a string. We also provide a mechanism for reporting an error message: ---------------------------------------------------------------- String -> Either String (Integer, String) ---------------------------------------------------------------- This type accommodates parser functions that take a string and return either an error message or the successfully parsed integer at the start of the string along with the remainder of the input. With this design, we might write a function to parse a single digit as follows:
[ ]:
parseDigit = \case c:s' | '0' <= c, c <= '9' -> Right (fromDigit c, s') _ -> Left "want digit" parseDigit "2b" parseDigit "b2"
From this we can build a parser for a two-digit number:
[ ]:
parseTwoDigits s = do (a, s') <- parseDigit s (b, s'') <- parseDigit s' pure (10*a + b, s'') parseTwoDigits "221b" parseTwoDigits "2b21"
It works, but our design is infested with boilerplate. We must carefully manage the input string, ensuring that the unparsed remainder of one step is fed to the next. The pain worsens as we write more complex parsers. We would prefer to write plumbing code like this once, and reuse it many times. This is easy in Haskell. == The easy way == We start by defining a data type to hold a parsing function, which we call a `Charser`, a sort of portmanteau meant to evoke a parser that handles one Char at a time. We also generalize so that our parsers can return any desired type.
[ ]:
data Charser a = Charser { unCharser :: String -> Either String (a, String) }
Next, we define some fundamental parsers. The `sat` parser returns the first character if it satisfies a given predicate, and fails otherwise. THe `eof` parser returns `()` if the given string is empty, and fails otherwise.
[ ]:
sat f = Charser \case h:t | f h -> Right (h, t) _ -> Left "unsat" eof = Charser \case [] -> Right ((), "") _ -> Left "want EOF"
We can easily get at the function via `unCharser`:
[ ]:
unCharser (sat (== 'x')) "xyz" unCharser (sat (== 'x')) "abc" unCharser eof "" unCharser eof "abc"
Lastly, we define `Functor`, `Applicative`, `Monad`, and `Alternative` typeclass instances for the `Charser` type. The `Functor` instance is standard stuff (that GHC can derive automatically) that lets us easily operate on a parsed value, while the others describe how to compose two parsers. The `Monad` and `Applicative` instances meticulously thread the input string from the first parser to the second parser, while the `Alternative` instance builds a parser that tries the first parser, and on failure tries the second parser. These correpond to sequencing and the choice operator denoted by a vertical bar "|" in a https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form[BNF grammar], though in our framework we iterate through the choices in the order of appearance and take the first match.
[ ]:
instance Functor Charser where fmap f (Charser x) = Charser $ fmap (first f) . x instance Applicative Charser where pure a = Charser $ Right . (a,) f <*> x = Charser \s -> do (fun, t) <- unCharser f s (arg, u) <- unCharser x t pure (fun arg, u) instance Monad Charser where Charser f >>= g = Charser $ (uncurry (unCharser . g) =<<) . f return = pure instance Alternative Charser where empty = Charser $ const $ Left "" x <|> y = Charser \s -> either (const $ unCharser y s) Right $ unCharser x s
And we're done! We have just defined _parser combinators_, which we compose to build complex parsers without needing to juggle bits and pieces of input strings. Our two-digit parser example becomes:
[ ]:
digitChar = sat \c -> '0' <= c && c <= '9' twoDigits = do a <- digitChar b <- digitChar pure $ 10*fromDigit a + fromDigit b unCharser twoDigits "221b" unCharser twoDigits "2b21"
== Arbitrary-precision calculator == We demonstrate a more sophisticated parser that parses an integer arithmetic expression. Rather than store an expression in a tree data structure, we evaluate as we parse.
[ ]:
char :: Char -> Charser Char char = sat . (==) oneOf :: String -> Charser Char oneOf s = sat (`elem` s) chainl1 :: Charser a -> Charser (a -> a -> a) -> Charser a chainl1 p infixOp = go =<< p where go x = (go =<< ($x) <$> infixOp <*> p) <|> pure x chainr1 :: Charser a -> Charser (a -> a -> a) -> Charser a chainr1 p infixOp = (&) <$> p <*> (flip <$> infixOp <*> chainr1 p infixOp <|> pure id) sp :: Charser a -> Charser a sp p = p <* many (oneOf " \n") -- Eat trailing spaces. spch = sp . char op c f = spch c *> pure f line = sp (pure ()) *> expr <* eof where num = parseInteger <$> sp (some digitChar) atm = num <|> spch '(' *> expr <* spch ')' pow = chainr1 atm $ op '^' (^) una = foldr (.) id <$> many (op '+' id <|> op '-' negate) fac = una <*> pow term = chainl1 fac $ op '*' (*) <|> op '/' div <|> op '%' mod expr = chainl1 term $ op '+' (+) <|> op '-' (-) unCharser line "3 + 4 * 5"
Our `line` parser closely resembles the grammar it parses: ------------------------------------------------------------------------------ num ::= ('0'|..|'9'|'.')+ atm ::= num | '(' expr ')' pow ::= atm ('^' atm )* una ::= ('+'|'-')* fac ::= una pow term ::= fac ( ('*'|'/'|'%') fac )* expr ::= term ( ('+'|'-') term )* ------------------------------------------------------------------------------ In fact, one might say the definition of `line` is an https://en.wikipedia.org/wiki/Attribute_grammar[attribute grammar]. We omit the rules involving spaces, which are traditionally considered part of a separate phase called lexical analysis. With parser combinators, such divisions are blurrier, and it can be practical to combine the two phases into a single catch-all parsing phase. We embed the parser in a calculator web widget. Type an expression into the textbox and press enter.
[ ]:
calc = do s <- jsEval "calcIn.value;" putStr $ case unCharser line s of Left err -> err Right (v, _) -> s ++ " = " ++ show v jsEval [r| repl.outdiv.insertAdjacentHTML('beforeend', ` <textarea style="border: solid 4px #999999" id="calcIn" rows="1" cols="40">8888+88 / 8+8 / 8+8 * 8</textarea> <div id="calcOut"></div> `); calcIn.addEventListener("keydown", ev => { if (ev.keyCode == 13) { const r = repl.run("chat", ["Main"], "calc"); calcOut.innerText = r.out; calcIn.value = ""; ev.preventDefault(); } }); |]
== The Fast Way == Suppose we have a complex grammar and a long input that fails to parse near the end. Due to our implementation of `(<|>)`, after the first failure, the parser will backtrack and explore every alternative as it unwinds, before ultimately giving up anyway. Not only is this slow, but backtracking also discards the error thrown by the original failure, leading to error messages that are difficult to understand. To address these issues, popular parser combinator libraries change the meaning of `(<|>)`. If the first alternative fails _and_ no input has been consumed, only then is the second alternative attempted. That is, if a parser processes even a single character before failing, then the parser halts immediately with failure. Short-circuiting failure is indeed faster, but introduces a problem. At times we really do want backtracking, that is, we want to undo consuming some of the input before trying an alternative. Thus parser combinator libraries also provide a `try` function: it tries a given parser on the input, and on failure, acts as if no input has been consumed so that alternatitves can be tried. Naturally, http://blog.ezyang.com/2014/05/parsec-try-a-or-b-considered-harmful/[we should use `try` sparingly]. One way to implement this feature is to have parsers return a boolean indicating whether input has been consumed.
[ ]:
data Parser a = Parser { unParser :: String -> (Bool, Either String (a, String)) } instance Functor Parser where fmap f (Parser x) = Parser $ second (fmap $ first f) . x instance Applicative Parser where pure a = Parser $ (False,) . Right . (a,) f <*> x = Parser \s -> let (c0, mf) = unParser f s in case mf of Left err -> (c0, Left err) Right (fun, t) -> let (c1, mg) = unParser x t in case mg of Left err -> (c0 || c1, Left err) Right (arg, u) -> (c0 || c1, Right (fun arg, u)) instance Monad Parser where Parser f >>= g = Parser \s -> let (c0, mf) = f s in case mf of Left err -> (c0, Left err) Right (x, t) -> let (c1, mg) = unParser (g x) t in case mg of Left err -> (c0 || c1, Left err) r -> (c0 || c1, r) return = pure instance Alternative Parser where empty = Parser $ const (False, Left "") x <|> y = Parser \s -> let cm@(consumed, m) = unParser x s in case m of Left err -> if consumed then cm else unParser y s Right _ -> cm try p = Parser \s -> let cm@(_, m) = unParser p s in case m of Left err -> (False, Left err) _ -> cm sat f = Parser \case h:t | f h -> (True, Right (h, t)) _ -> (False, Left "unsat")
(We call this version `Parser` instead of `Charser` because my interprefer has issues with redefining typeclasses.) Here's some examples showing how the new `(<|>)` behaves:
[ ]:
char = sat . (==) string = mapM char unParser (string "foo" <|> string "bar") "foo" unParser (string "foo" <|> string "bar") "bar" unParser (string "foo" <|> string "bar") "baz" unParser (string "foo" <|> string "bar") "qux" unParser (string "qux" <|> string "quux") "quux" unParser (try (string "qux") <|> string "quux") "quux"
== The evolution of parsing == As a child, I was baffled by parsers because I only understood loops. How does the computer handle arbitrarily nested parentheses?! Parsers might have been fundamentally beyond my reach at the time, as some research suggests that https://modeltheory.org/papers/2018children-recursion.pdf[10-year-olds struggle with the concept of recursion, while 11-year-olds struggle with understanding recursive programs]. As it happened, I was 12 or so when I read about https://en.wikipedia.org/wiki/Recursive_descent_parser[recursive descent parsers]. (For memory, the book was Herbert Schildt, _Advanced Turbo C_.) I was awestruck by their simplicity, elegance, and power. I felt I had leveled-up in real life. I now knew everything about parsers. I was awestuck again on encountering https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools[the "red dragon book"] at university. Forget those childish recursive descent parsers! Instead, the computer should do the hard work; the human should just write a handful of regular expressions and a BNF grammar. Not only is it simple and elegant, but there are strong guarantees: the parser generator produces efficient code and detects ambiguities. I felt I knew everything about parsers. This time for real. I was awestruck yet again years later when I stumbled upon https://en.wikipedia.org/wiki/Parser_combinator[parser combinators]. Forget those cumbersome parser generators! I had thought combinators were a mathematical curiosity; a fun alternative to Turing machines for studying computability. But parsing functions that return a parsing function in addition to performing other duties lead to amazingly succinct recursive descent parsers. But the shocks didn't stop. I stumbled upon http://bford.info/pub/lang/peg.pdf[parsing expression grammars], or PEGs, which can be parsed in linear time and are unambiguous. They can parse some non-context-free languages, and it is unknown if they can parse all context-free langauges. And then there's link:pwd.html[parsing with derivatives], which https://arxiv.org/pdf/1604.04695.pdf[parse any context-free grammar in cubic time] (on par with Earley's algorithm or CYK). In brief, "even though link:re.html[Brzozowski defined the derivative with regular languages] in mind, it works unaltered for context-free languages if you toss in laziness, memoization and fixed points." I now feel I know little about parsers. In practice, I let expedience be my guide. If I must have features guaranteed by the theory, or if parser combinators are unavailable, then I'll resort to a parser generator. Otherwise, I'll choose a parser combinator library such as https://hackage.haskell.org/package/megaparsec[Megaparsec].