Parser Combinators

The above demo expects expressions conforming to a simple calculator grammar (the numbers and symbols have their usual meaning):

num    ::= ('0'|..|'9'|'.')+
factor ::= ('+'|'-')* ( num | '(' expr ')' )
term   ::= factor ( ('*'|'/') factor )*
expr   ::= term ( ('+'|'-') term )*
line   ::= expr EOF
We’ll see how to parse and evaluate expressions with code that takes about the same space as the grammar. But first, we need a few imports so we can show off our interpreter on this webpage:
import Control.Monad
import Haste.DOM
import Haste.Events
The interpreter itself closely follows the grammar. The main difference is one extra line for unary operators:
-- I recommend Megaparsec over Parsec.
-- I'm forced to use Parsec here because of Haste.
import Text.Parsec
line :: Parsec String () Double
line = expr <* eof where
  num = read <$> many1 (digit <|> char '.')
  una = product <$> many ((char '+' >> pure 1) <|> (char '-' >> pure (-1)))
  fac = (*) <$> una <*> (num <|> between (char '(') (char ')') expr)
  term = fac `chainl1` ((char '*' >> pure (*)) <|> (char '/' >> pure (/)))
  expr = term `chainl1` ((char '+' >> pure (+)) <|> (char '-' >> pure (-)))
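As a quick sanity check, here are a couple of my own examples, assuming line and its helpers are in scope (say in GHCi with only the Parsec import):

parse line "" "1+2*3"     -- Right 7.0
parse line "" "-(1+2)/4"  -- Right (-0.75)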
Lastly, we need code that pulls from the input text area when Enter is hit and writes the corresponding result to the output text area then scrolls down. For convenience, we strip out spaces from the input string.
main :: IO ()
main = withElems ["input", "output"] $ \[iEl, oEl] ->
  void $ iEl `onEvent` KeyDown $ \key -> when (key == mkKeyData 13) $ do
    s <- filter (/= ' ') <$> getProp iEl "value"
    setProp iEl "value" ""
    hist <- (++ ("> " ++ s ++ "\n")) <$> getProp oEl "value"
    setProp oEl "value" $ (hist ++) $ case parse line "" s of
      Left e  -> "Error: " ++ show e ++ "\n"
      Right d -> show d ++ "\n"
    getProp oEl "scrollHeight" >>= setProp oEl "scrollTop"
    preventDefault
$ haste-cabal install parsec
$ wget https://crypto.stanford.edu/~blynn/haskell/parse.lhs
$ hastec parse.lhs
$ sed 's/^\\.*{code}$/-----/' parse.lhs | asciidoc -o - - > parse.html

Open parse.html in a browser to see it in action.
Patching our partial parser
Our use of the partial function read should be viewed with suspicion, and indeed, we find inputs such as "1.2.3" cause exceptions. We can fix this with readMaybe from Text.Read:
num = do
  s <- many1 (digit <|> char '.')
  case readMaybe s of
    Just x  -> pure x
    Nothing -> fail "bad float" <?> "number"
Using readFloat from Numeric instead is similar.
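For instance, a sketch along these lines (my own wording of it; readFloat returns a list of (value, remaining input) pairs, so we insist the whole string is consumed):

num = do
  s <- many1 (digit <|> char '.')
  case readFloat s of
    [(x, "")] -> pure x
    _         -> fail "bad float"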
Unfortunately, a Haste bug means these solutions fail. To work around the bug, we can roll our own floating-point reading routine. If we’re taking the trouble to do this, we may as well also modify the grammar to accept at most one decimal point, which produces friendlier error messages:
num = do
  s <- many1 digit
  t <- option "" $ char '.' >> many1 digit
  pure $ fromIntegral (read $ s ++ t) / fromIntegral (10 ^ length t)
Alternatively, we could use the Parsec library’s routines:
import Text.Parsec.Language
import Text.Parsec.Token

num = do
  e <- naturalOrFloat $ makeTokenParser emptyDef
  pure $ case e of
    Left n  -> fromIntegral n
    Right d -> d
This also adds support for all numeric literals defined in the Haskell report such as "1.23e-4" or "0xface".
It turns out Parsec’s Expr module is specifically designed for expression grammars like ours, and can build a parser from a supplied table of operators. However, it hides the interesting part of the library, namely the combinators. Also, using the module hardly saves any lines of code in our case.
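For comparison, a sketch of our grammar with Text.Parsec.Expr might look as follows (the names and table layout are illustrative, and num is the same partial parser as before):

import Text.Parsec
import Text.Parsec.Expr

expr :: Parsec String () Double
expr = buildExpressionParser table fac where
  fac = num <|> between (char '(') (char ')') expr
  num = read <$> many1 (digit <|> char '.')
  -- Rows are ordered from highest to lowest precedence.
  table =
    [ [Prefix (char '-' >> pure negate), Prefix (char '+' >> pure id)]
    , [Infix (char '*' >> pure (*)) AssocLeft, Infix (char '/' >> pure (/)) AssocLeft]
    , [Infix (char '+' >> pure (+)) AssocLeft, Infix (char '-' >> pure (-)) AssocLeft]
    ]

One behavioral difference: prefix operators from the table apply at most once per operand, so an input like "--1" is rejected, whereas our hand-rolled una accepts it.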
Do or do not
Use try sparingly to prevent excessive backtracking and confusing error messages. If try must be used, then sometimes lookAhead may help clear up the error messages by also rewinding the position on failure.
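As a contrived sketch (the keywords are my own invention): both alternatives below begin with the same letter, so without try the first branch consumes the shared prefix before failing and the second branch is never attempted. Factoring out the shared prefix is one way to drop the try entirely.

keyword :: Parsec String () String
keyword = try (string "for") <|> string "function"

-- Equivalent, without try: commit to the shared prefix once.
keyword' :: Parsec String () String
keyword' = (:) <$> char 'f' <*> (string "or" <|> string "unction")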
I read about using breadth-first search in conjunction with parser combinators to avoid try, but haven’t tried it yet.
I also read about a parser combinator library focusing on diagnostics, but haven’t tried that either.
I’ve seen dubious claims that lexing and parsing should always be combined when using parser combinators. While great for some grammars, blindly following this rule can clutter the code and make try harder to eliminate.
It can be better to tokenize the input stream with parser combinators and output a list of lexemes (which is really a stream), then parse this list of lexemes with another set of parser combinators. In Parsec, we can write a thin wrapper to pass position information from the tokenizer to the parser so that error reporting is unaffected.
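Here is a minimal sketch of that arrangement (the Lexeme type, satisfyTok and the toy sums grammar are my own illustrations; the thin wrapper is the position function handed to tokenPrim):

import Text.Parsec

data Lexeme = LNum Double | LPlus deriving (Eq, Show)

-- Stage 1: tokenize, recording where each lexeme began.
type Tok = (SourcePos, Lexeme)

lexer :: Parsec String () [Tok]
lexer = spaces *> many (tok <* spaces) <* eof where
  tok = (,) <$> getPosition
            <*> (LNum . read <$> many1 digit <|> LPlus <$ char '+')

-- Stage 2: parse the token list. The wrapper feeds the saved positions
-- back to Parsec so errors still point into the original source.
satisfyTok :: (Lexeme -> Maybe a) -> Parsec [Tok] () a
satisfyTok f = tokenPrim (show . snd) nextPos (f . snd) where
  nextPos pos _ ((p, _):_) = p
  nextPos pos _ []         = pos

number :: Parsec [Tok] () Double
number = satisfyTok $ \l -> case l of LNum d -> Just d; _ -> Nothing

plus :: Parsec [Tok] () ()
plus = satisfyTok $ \l -> case l of LPlus -> Just (); _ -> Nothing

sums :: Parsec [Tok] () Double
sums = foldl (+) <$> number <*> many (plus *> number) <* eof

In use, we would run parse lexer name src first, then feed the resulting token list to parse sums name.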
Indeed, we can go further and trivially add more layers beyond a parser and a lexer. Doing this with parser generators and dedicated lexers is cumbersome because it involves multiple languages.
The evolution of parsing
As a child, I was baffled by parsers because I only pursued iterative approaches. How does the computer handle arbitrarily nested parentheses?!
I was awestruck when I learned about recursive descent parsers. (From memory, it was Advanced Turbo C by Herbert Schildt.) Simple, elegant, yet so powerful. I felt I knew everything about parsers.
I was awestruck again on encountering the "red dragon book" at university. Forget those childish recursive descent parsers! Instead, the computer should do the hard work; the human should just write a handful of regular expressions and a BNF grammar. Not only is it simple and elegant, but there are strong guarantees: the parser generator produces efficient code and detects ambiguities. I felt I knew everything about parsers. This time for real.
I was awestruck yet again years later when I stumbled upon parser combinators. Forget those cumbersome parser generators! I had thought combinators were a mathematical curiosity; a fun alternative to Turing machines for studying computability. But parsing functions that return a parsing function in addition to performing other duties lead to amazingly succinct recursive descent parsers.
More recently, I heard about parsing expression grammars, or PEGs, which can be parsed in linear time and are unambiguous. They can parse some non-context-free languages, and it is unknown if they can parse all context-free languages.
And then there’s parsing with derivatives, which parse any context-free grammar in cubic time (on par with Earley’s algorithm or CYK). In brief, "even though Brzozowski defined the derivative with regular languages in mind, it works unaltered for context-free languages if you toss in laziness, memoization and fixed points."
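To make the quoted idea concrete, here is a toy sketch of Brzozowski derivatives for plain regular expressions (my own code; the context-free case layers laziness, memoization and fixed points on top of this):

data Re = Nil | Eps | Chr Char | Alt Re Re | Cat Re Re | Star Re

-- Does the language of r contain the empty string?
nullable :: Re -> Bool
nullable Eps       = True
nullable (Alt a b) = nullable a || nullable b
nullable (Cat a b) = nullable a && nullable b
nullable (Star _)  = True
nullable _         = False

-- The derivative of r with respect to c: the language of suffixes of
-- words in r that begin with c.
deriv :: Char -> Re -> Re
deriv c (Chr d) | c == d = Eps
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Cat a b)
  | nullable a = Alt (Cat (deriv c a) b) (deriv c b)
  | otherwise  = Cat (deriv c a) b
deriv c r@(Star a) = Cat (deriv c a) r
deriv _ _ = Nil

-- Matching is repeated differentiation.
-- e.g. matches (Cat (Chr 'a') (Star (Chr 'b'))) "abbb" == True
matches :: Re -> String -> Bool
matches r = nullable . foldl (flip deriv) r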
I now feel I know hardly anything about parsers.
Knuth intends to write about parsers in Volume 5 of The Art of Computer Programming. I once thought it would be almost pointless; why not just read the dragon book?
But now I’d love a comprehensive reference that contained parser combinators, PEGs, and parsing with derivatives. Moreover, in order to teach parser combinators and parsing with derivatives, it seems one would have to teach combinators, lambda calculus, lazy evaluation, fixed points, type theory, and so on. It would be great for these topics to gain wider exposure, and great to see them neatly explained in a single series of books.
These days, I let expedience be my guide: if I must have features guaranteed by the theory, or if parser combinators are unavailable, then I’ll resort to a parser generator. Otherwise, I’ll choose parser combinators.