{-# LANGUAGE CPP #-}
#ifdef __HASTE__
{-# LANGUAGE PackageImports #-}
import "mtl" Control.Monad.State.Strict
import Haste.DOM
import Haste.Events
#else
import Control.Monad.State.Strict
#endif
import Data.List (intersperse)
import Data.Maybe
import qualified Data.Map as M
import System.Random
import Text.Parsec hiding (State)
-- NT = non-terminal. (:.) = concatenation.
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe
Parsing With Derivatives
Hit the "Parse" button to parse a given input with a given context-free grammar with start symbol "S". We randomly pick one tree from the parse forest to display. Clicking again rerolls the dice.
We also show the entire parse forest, using sharing to avoid exponential output. Try something like "1+1+1+1" for a more readable example.
context-free grammar:
input:
parse forest:
See:
-
Yacc is dead: a brief explanation of parsing with derivatives.
-
Parsing with Derivatives functional pearl: improved version of original paper.
-
On the Complexity and Performance of Parsing with Derivatives: proves the algorithm has cubic time complexity.
Derivative work
We define a Pe
data structure to represent parsing expressions, that is, the
right-hand side of the production rules of a grammar.
Although it represents the empty string, the Eps
(for epsilon) expression
holds a character that winds up in the abstract syntax trees (AST) returned by
the parser. Similarly, the Del
(for delta) expression, which is only generated
internally, holds an expression which later helps build ASTs.
Our ASTs are full binary trees whose leaf nodes are characters (the free magma on the alphabet). The tree structure captures the order the production rules are applied.
We compute a parse forest, that is, all possible ASTs. It may contain exponentially many trees, but we represent it compactly by sharing. For this reason, our AST supports indirection nodes, and we keep around a map from strings to ASTs.
We also maintain another similar map to help compute nullability.
Lastly, we turn the grammar into yet another map to which we add more entries as we parse.
data Ast = Lf Char | Ast :@ Ast | Union [Ast] | Ind String deriving Show
data PWD = PWD
{ nullMemo :: M.Map String Bool
, astMemo :: M.Map String Ast
, grammar :: M.Map String Pe
}
To parse an input string, we repeatedly derive the start symbol with respect to
each character of the input, taking care to leave bread crumbs in the Eps
and
Del
expressions to record consumed characters. (The Del
constructor is
named for the delta symbol from the paper, but I also think of it as "deleted",
because it remembers what has just been deleted from the input.)
Then the string is accepted if and only if the resulting expression is nullable, that is, accepts the empty string. We traverse the final derivative to recover the parse forest.
We memoize derivatives by adding entries to a state of type PWD
.
Initially, this cache contains only the input grammar, mapping nonterminal
symbols to Pe
values. Later, we place a derivative at the key formed by
concatenating the characters involved in the derivative with the
nonterminal symbol being derived.
For example, if S
is a nonterminal symbol in the input grammar, then the key
abS
maps to derive 'a' (derive 'b' (NT "S"))
. We assume no nonterminal
symbol in the input grammar is a suffix of any other nonterminal symbol, which
is fine for a prototype.
It may help to imagine the grammar growing over time, gaining new production rules as we process input characters.
pwd :: PWD -> String -> String -> Maybe (String, M.Map String Ast)
pwd st start inp = if b then Just (s, astMemo st') else Nothing where
s = reverse inp ++ start
(b, st') = runState (parseForest s) st
Nothing matters
The paper mentions using Kleene’s fixed point theorem to compute nullability. I believe for context-free grammars, rather than iterating a function until it stabilizes, we can specialize to the following simple algorithm.
We maintain a map from nonterminals to booleans indicating nullability. On
encountering a nonterminal T
, if already present in the map, then we use its
memoized nullability. Otherwise, we temporarily say T
is not nullable and
insert False
for its value in the map, before recursively computing
nullability on the definition of T
. Then we update the map with the returned
nullability boolean.
For example, consider the rule:
T = T "x" | ""
Then to compute the nullability of T
:
-
Assume
T
is non-nullable. We insertT → False
into our map. -
Recurse on
T "x" | ""
:-
The first branch is non-nullable because
T
is non-nullable according to the map. -
The second branch is a nullable terminal, so the whole expression is nullable.
-
-
We update the map with the result of the recursion, that is, we insert
T → True
.
nullable :: Pe -> State PWD Bool
nullable pe = case pe of
Eps _ -> pure True
Del _ -> pure True
Nul -> pure False
Ch _ -> pure False
Or xs -> or <$> mapM nullable xs
x :. y -> (&&) <$> nullable x <*> nullable y
NT s -> maybe (update s) pure . M.lookup s =<< gets nullMemo
where
update s = do
modify $ \st -> st { nullMemo = M.insert s False $ nullMemo st }
b <- nullable =<< memoDerive s
modify $ \st -> st { nullMemo = M.insert s b $ nullMemo st }
pure b
We wave our hands to prove this works. Since we start by assuming a nonterminal
T
is not nullable, the only possible pitfall is falsely concluding that a
nullable nonterminal is non-nullable.
If the production rule for T
contains an expression whose leaf nodes are all
nullable terminals in one of its branches, then let’s say T
is 1-nullable.
The above correctly determines T
is nullable. Otherwise if T
can reduce to
an expression whose leaf nodes at most k-nullable, then by inductive
assumption our algorithm correctly computes the nullability of each of these
nodes, and will therefore correctly infer that the (k+1)-nullable T
is
also nullable.
You must remember this
To compute derivatives, we follow the rules given in the paper, memoizing as we go. There is no danger of an infinite loop because our nullability algorithm avoids nodes that have already been seen.
memoDerive :: String -> State PWD Pe
memoDerive cs@(c:s) = maybe update pure =<< M.lookup cs <$> gets grammar
where
update = do
b <- derive c =<< memoDerive s
modify $ \st -> st { grammar = M.insert cs b $ grammar st }
pure b
memoDerive _ = error "unreachable"
derive :: Char -> Pe -> State PWD Pe
derive c pe = case pe of
NT s -> pure $ NT $ c:s
Ch x | x == c -> pure $ Eps x
Or xs -> Or <$> mapM (derive c) xs
Del x :. y -> (Del x :.) <$> derive c y
x :. y -> do
b <- nullable x
dx <- derive c x
if not b then pure $ dx :. y else do
dy <- derive c y
pure $ Or [dx :. y, Del x :. dy]
_ -> pure Nul
Some algebra might make this more efficient. For example, a Nul
branch can be
removed, which in turn may simplify an Or
expression. On the other hand,
laziness could already be doing this for us. We’ll leave it as is, as we want
our prototype to be simple.
See the forest and the trees
The paper’s version of parseNull
again computes a least fixed point. I again
believe for context-free grammars we can specialize to a simpler algorithm.
Our parseForest
function memoizes nullable nonterminals to avoid infinite
loops: the first time we encounter a nullable nonterminal, we insert a dummy
map entry before recursing on its definition. Afterwards we update the map with
the correct result.
If we encounter a nonterminal already present in the map, we just return its name without examining its value.
parseForest :: String -> State PWD Bool
parseForest s = do
m <- gets astMemo
case M.lookup s m of
Nothing -> do
pe <- memoDerive s
b <- nullable pe
when b $ do
modify $ \st -> st { astMemo = M.insert s (Ind s) $ astMemo st }
t <- maybe (error "nullable bug") id <$> parseNull pe
modify $ \st -> st { astMemo = M.insert s t $ astMemo st }
pure b
Just _ -> pure True
parseNull :: Pe -> State PWD (Maybe Ast)
parseNull pe = case pe of
Eps x -> pure $ Just $ Lf x
Del x -> parseNull x
Nul -> pure Nothing
Ch _ -> pure Nothing
Or xs -> uni <$> mapM parseNull xs
x :. y -> liftM2 (:@) <$> parseNull x <*> parseNull y
NT s -> parseForest s >>= \b -> pure $ if b
then Just $ Ind s
else Nothing
uni :: [Maybe Ast] -> Maybe Ast
uni xs = case concatMap deUni $ catMaybes xs of
[] -> Nothing
[x] -> Just x
t -> Just $ Union t
where
deUni (Union a) = a
deUni x = [x]
The following prints the parse forest in a somewhat comprehensible form. We
show the start symbol, followed by a table of definitions for each nullable
nonterminal, where each Eps
and Del
has been expanded to their
corresponding expressions. We surround nonterminals with braces to denote
shared expressions.
We could clean up a little by inlining, at the cost of obscuring what the algorithm does.
showsForest :: String -> M.Map String Ast -> ShowS
showsForest s m = linkify (s++) . (" where\n"++) . foldr (.) id (line <$> M.assocs m)
where
line (k, v) = (k++) . (" = "++) . go v . ('\n':)
go v = case v of
Ind t -> linkify (t++)
Union xs -> foldr1 (.) $ intersperse (" | "++) $ (go <$> xs)
x :@ y -> go x . (' ':) . go y
Lf c -> (c:)
linkify f = ('{':) . f . ('}':)
We gratuitously show off tying the
knot to dispense with the map. In other words, we replace the Ind
nodes
with native pointers to the Ast
nodes they point to. We use the tied version
in a routine that selects a random parse tree from the parse forest.
tie :: String -> M.Map String Ast -> Ast
tie start m = aux M.! start where
aux = go <$> m
go e = case e of
Ind t -> aux M.! t
Union xs -> Union $ go <$> xs
x :@ y -> go x :@ go y
Lf _ -> e
showsOne :: (ShowS -> ShowS) -> StdGen -> Ast -> ShowS
showsOne paren g ast = case ast of
Union xs -> let (n, g1) = randomR (0, length xs - 1) g
in showsOne paren g1 $ xs !! n
Lf c -> (c:)
x :@ y -> let (gx, gy) = split g
in paren $ showsOne id gx x . showsOne addParen gy y
_ -> error "knot-tying bug"
where addParen s = ('(':) . s . (')':)
A grammar for grammars
I should be eating my own dogfood and using parsing with derivatives to read the definition of a context-free grammar, but it’ll have to wait until I add more features. For now, we use parser combinators.
cfg :: Parsec String () PWD
cfg = PWD M.empty M.empty . M.fromList <$> between filler eof (many1 rule)
where
rule = (,) <$> sym <*> between (want "=") (want ";") expr
expr = Or <$> cat `sepBy1` want "|"
cat = foldl1 (:.) <$> many1 atm
atm = str <|> NT <$> sym
sym = many1 alphaNum <* filler
str = strGram <$> between (char '"') (want "\"") (many $ noneOf "\"")
want s = string s <* filler
filler = skipMany $ com <|> void space
com = try (string "--") >> skipMany (noneOf "\n")
strGram :: String -> Pe
strGram "" = Eps '\949'
strGram s = foldl1 (:.) $ Ch <$> s
For the empty string, we use a custom Eps
so it shows up as a curly epsilon
in the parse tree. This is cute, but bad for grammars that include curly
epsilons!
Frontend
View the HTML source to see the hidden textareas that we harvest below.
#ifdef __HASTE__
main :: IO ()
main = withElems ["grammar", "str", "out", "parse", "forest"] $
\[gEl, sEl, oEl, parseB, fEl] -> do
let
handle demo = do
Just b <- elemById $ demo ++ "B"
void $ b `onEvent` Click $ const $ preset demo
preset demo = do
Just g <- elemById $ demo ++ "G"
Just s <- elemById $ demo ++ "S"
setProp oEl "value" ""
setProp fEl "value" ""
setProp gEl "value" =<< getProp g "value"
setProp sEl "value" =<< getProp s "value"
dump (m, s) = do
rnd <- newStdGen
setProp oEl "value" $ showsOne id rnd (tie m s) ""
setProp fEl "value" $ showsForest m s ""
handle "1+1"
handle "par"
handle "pal"
preset "1+1"
void $ parseB `onEvent` Click $ const $ do
setProp oEl "value" ""
setProp fEl "value" ""
mg <- parse cfg "" <$> getProp gEl "value"
case mg of
Left e -> setProp oEl "value" $ "error: " ++ show e
Right g -> do
s <- getProp sEl "value"
if M.member "S" $ grammar g
then maybe (setProp oEl "value" "[parse failed]") dump $ pwd g "S" s
else setProp oEl "value" "missing start symbol S"
#else
main :: IO ()
main = print $ pwd g "S" $ concat (replicate 39 "1+") ++ "+1" where
Right g = parse cfg "" $ unlines
[ "S = T;"
, "T = T \"+\" T | N;"
, "N = \"1\";"
]
#endif