{-# LANGUAGE CPP #-}
import Data.Bool
import Data.List
import Data.Map ((!))
import qualified Data.Map as M
import Data.Tuple
import Control.Monad
import Text.Parsec
#ifdef __HASTE__
import Control.Arrow
import System.Random
import Haste.DOM
import Haste.Events
import Haste.Foreign (ffi)
import Haste.JSString (pack)
#endif
Regex Calculus
Over half a century ago, Brzozowski published Derivatives of Regular Expressions, which describes how to:
-
Directly convert a regular expression to a deterministic finite automaton (DFA). No NFAs needed.
-
Support rich regular expressions. In particular, we have logical AND and NOT, e.g.
[a-z]*&!(()|do|for|if|while)
. -
Instantly obtain small and often minimal DFAs in typical applications.
This elegant and powerful method was almost forgotten, and inferior constructions prevail today. Happily, it was rescued from obscurity by Owens, Reppy, and Turon, in Regular-expression derivatives reexamined. May regex derivatives one day earn their rightful place in computer science.
Regexes, irregularly
We tweak the traditional definition of regular expressions. For us, there is only one kind of constant:
-
\([c_0 c_1 …]\): a character set or character class; accepts one of the characters in the given set.
Some notational conveniences: we may omit the square brackets for
singleton sets, and we use hyphens to denote ranges of characters.
For example, x
means [x]
and [a-d5-9]
means [abcd56789]
.
The empty character class \([]\) rejects all strings; even the empty string.
All other regexes are built from other regexes \(r\) and \(s\):
-
\(rs\): concatenation.
-
\(r\mid s\): logical or (alternation); the union of the two languages.
-
\(r\mbox{*}\): Kleene closure; zero or more strings of \(r\) concatenated together.
-
\(r\& s\): logical and; the intersection of the two languages.
-
\(!r\): logical not (complement); accepts a string if and only if \(r\) rejects it.
Let \(() = []\mbox{*}\), that is, the Kleene closure of the empty language. Observe \(()\) accepts only the empty string.
Textbooks often assume a small alphabet, so they can, for instance, cheaply iterate over each symbol of the alphabet. This is unrealistic for Unicode, and arguably for ASCII too. Instead of demanding a small alphabet, we ask only that certain operations on character classes be efficient.
Our code represents a character class either as Pos [Char]
where the list
holds the characters of the class, or as Neg [Char]
where the list holds the
characters outside the class. The function elemCC
tests if a given character
is in a given character class.
For example, Pos "abc"
represents the character class [abc]
while Neg
"abc"
represents the all characters except those in [abc]
, which we denote
[^abc]
.
This scheme is easy to work with and suffices for common applications,
which stick to narrow ranges such as [a-z]
or [^0-9]
.
Since we store all characters in a range rather than just the endpoints, we’d
choose a more sophisticated data structure if we expected unusually large
ranges.
data CharClass = Pos String | Neg String deriving (Eq, Ord)
elemCC :: Char -> CharClass -> Bool
elemCC c (Pos cs) = c `elem` cs
elemCC c (Neg cs) = c `notElem` cs
We use lists for the arguments of logical operations. We later explain why.
data Re = OneOf CharClass | Re :. Re | Kleene Re
| ReOr [Re] | ReAnd [Re] | ReNot Re
deriving (Eq, Ord)
Useful regex constants:
-- | The regex `()`. The language containing only the empty string.
eps :: Re
eps = Kleene noGood
-- | The regex `[]`. The empty language.
noGood :: Re
noGood = OneOf $ Pos []
-- | The regex `.*`. The language containing everything.
allGood :: Re
allGood = Kleene $ OneOf $ Neg []
Regex exercises
Let \(f\) be a regex.
1. Does \(f\) accept the empty string?
If so, we say \(f\) is nullable.
Character sets never accept the empty string. The Kleene closure always accepts the empty string. Concatenation does if and only if \(r\) and \(s\) do. The logical operations commute with nullability.
nullable :: Re -> Bool
nullable re = case re of
OneOf _ -> False
Kleene _ -> True
r :. s -> nullable r && nullable s
ReOr rs -> any nullable rs
ReAnd rs -> all nullable rs
ReNot r -> not $ nullable r
2. What regex do we get after feeding the character \(c\) to \(f\)? In other words, what regex accepts a string \(s\) precisely when \(f\) accepts \(c\) followed by \(s\)?
We call this regex the derivative of the regex \(f\)
with respect to the character \(c\), and write \(\partial_c f\).
For example, \(\partial_a\) ab*c|d*e*f|g*ah
= b*c|h
.
For a character set, the answer is \(()\) if \(c\) is a member. Otherwise
it’s \([]\). For example, \(\partial_a\) [abc]
= ()
and \(\partial_a\) [xyz]
= []
.
For the other cases, let us follow Lagrange and use a prime mark to denote a derivative (with respect to \(c\)). We find \(r'\) and \(s'\), that is, we recursively answer the question for \(r\) and \(s\). Then the logical operations commute with taking derivatives:
-
\((r\mid s)' = r' \mid s'\)
-
\((r\& s)' = r' \& s'\)
-
\((!r)' = !r'\)
For the Kleene closure, we find:
-
\((r\mbox{*})' = r'r\mbox{*}\)
The trickiest is concatenation: \(rs\). First, we answer the first question, that is, determine if \(r\) accepts the empty string. If so, the answer is \(r’s\mid s'\). If not, the answer is just \(r’s\).
In the following code, for now, ignore mk
and pretend (#.) = (:.)
.
We define them later.
derive :: Char -> Re -> Re
derive c re = case re of
OneOf cc | elemCC c cc -> eps
| otherwise -> noGood
Kleene r -> derive c r #. mkKleene r
r :. s | nullable r -> mkOr [derive c r #. s, derive c s]
| otherwise -> derive c r #. s
ReAnd rs -> mkAnd $ derive c <$> rs
ReOr rs -> mkOr $ derive c <$> rs
ReNot r -> mkNot $ derive c r
A regex operation is completely defined by its nullability and its derivative. The following function determines if a given regex accepts a given string:
accepts :: Re -> String -> Bool
accepts re "" = nullable re
accepts re (c:s) = accepts (derive c re) s
Classy Regexes
Regexes are generally content with a blurry view of the alphabet. Particular states may only care about digits, or whitespace, or certain letters. Indeed, the error state is an extreme case, acting identically on all symbols. Especially with alphabets such as Unicode, it pays to have a single arrow cover vast swathes of symbols, rather than build one arrow per symbol.
Thus we desire an algorithm that, for a given regex, partitions the alphabet into as few character classes as possible so the regex behaves correctly on the first input character even if members of the same class are indistinguishable. This is infeasible, but we get by with an imperfect algorithm. It may chop up the alphabet too finely, but it’ll do.
(Why do we only care about the first input character? All will become clear in the next section!)
If the regex is just a character class, then we can achieve perfection. Divide the alphabet into the "haves" and the "have-nots": those characters within the class, and those without.
Otherwise, we recursively construct our partition of the alphabet. For the Kleene star and logical not, we use the partition of the underlying regex. The other operations have two regex arguments, and we can stay out of trouble by taking all pairs of intersections of character classes of their partitions.
For concatenation, we can do better when the first regex rejects the empty string: in this case, we use the partition of the first regex and ignore that of the second.
classy :: Re -> [CharClass]
classy re = case re of
OneOf (Pos cs) -> [Pos cs, Neg cs]
OneOf (Neg cs) -> [Pos cs, Neg cs]
Kleene r -> classy r
ReNot r -> classy r
ReOr rs -> foldl1' allPairs $ classy <$> rs
ReAnd rs -> foldl1' allPairs $ classy <$> rs
r :. s | nullable r -> classy r `allPairs` classy s
| otherwise -> classy r
where allPairs r s = nub $ intersectCC <$> r <*> s
intersectCC :: CharClass -> CharClass -> CharClass
intersectCC (Pos xs) (Pos ys) = Pos $ intersect xs ys
intersectCC (Pos xs) (Neg ys) = Pos $ xs \\ ys
intersectCC (Neg xs) (Pos ys) = Pos $ ys \\ xs
intersectCC (Neg xs) (Neg ys) = Neg $ union xs ys
unionCC :: CharClass -> CharClass -> CharClass
unionCC (Pos xs) (Pos ys) = Pos $ union xs ys
unionCC (Pos xs) (Neg ys) = Neg $ ys \\ xs
unionCC (Neg xs) (Pos ys) = Neg $ xs \\ ys
unionCC (Neg xs) (Neg ys) = Neg $ intersect xs ys
The function repCC
returns a member of a character class.
For Pos
, we pick the first character in the list.
This always succeeds because our code never calls repCC
with Pos []
.
For Neg
, we search for the smallest character not in the list.
We assume Neg
never applies to the whole alphabet.
repCC :: CharClass -> Char
repCC (Pos (h:_)) = h
repCC (Pos []) = error "BUG! Pos [] should be filtered out."
repCC (Neg cs) | Just c <- find (`notElem` cs) [minBound..] = c
| otherwise = error "Neg with entire alphabet."
Regexes restated
We can now directly construct a DFA for any regex \(f\). We view a regex as a state of a DFA.
The start state is the input regex \(f\).
For each character class \(C\) in a sufficiently fine partition for \(f\)
(i.e. classy f
), pick any representative \(c\),
create the state \(\partial_c f\) if it doesn’t already exist, then draw an
arrow labeled \(C\) from \(f\) to \(\partial_c f\).
Repeat on all freshly created states. The accepting states are those which accept the empty string. Done!
We map states to integers to simplify our interface; users of our engine need only deal with integers. We retain a map of integers to regexes in case the caller seeks a deeper understanding of our DFA.
mkDfa :: Re -> ([(Int, Re)], Int, [Int], [((Int, Int), CharClass)])
mkDfa r = (swap <$> M.assocs states, states!r, as, M.assocs collated) where
collated = M.fromListWith unionCC edges
(states, edges) = explore (M.singleton r 0, []) r
as = snd <$> filter (nullable . fst) (M.assocs states)
explore gr q = foldl' (goto q) gr $ filter (/= Pos []) $ classy q
goto q (qs, ds) cc
| Just w <- M.lookup qc qs = (qs, mkEdge w)
| otherwise = explore (M.insert qc sz qs, mkEdge sz) qc
where qc = derive (repCC cc) q
sz = M.size qs
mkEdge dst = ((qs!q, dst), cc):ds
The above always terminates so long as we’re mindful that the logical or operation is:
-
idempotent: \(r\mid r = r\)
-
commutative: \(r\mid s = s\mid r\)
-
associative: \((r\mid s)\mid t = r\mid (s\mid t)\)
This makes sense intuitively, because taking a derivative usually yields a simpler regex. The glaring exception is the Kleene star, but on further inspection, we ought to repeat ourselves eventually after taking enough derivatives so long as we can cope with the proliferating logical ors.
In practice, we apply more algebraic identities before comparing regexes to get smaller DFAs. Ideally, we’d like to tell if two given regexes are equivalent so we could generate the minimal DFA every time, but this is too costly.
We represent arguments of (&)
and (|)
with lists so we can call nub
to
notice idempotence and sort
to sort out commutativity. We capture
associativity by flattening lists, except for concatenation where we use
pattern matching.
(#.) :: Re -> Re -> Re
r #. s
| r == noGood || s == noGood = noGood
| r == eps = s
| s == eps = r
| x :. y <- r = x #. (y #. s)
| otherwise = r :. s
mkOr :: [Re] -> Re
mkOr xs
| allGood `elem` zs = allGood
| null zs = noGood
| [z] <- zs = z
| otherwise = ReOr zs
where
zs = nub $ sort $ filter (/= noGood) flat
flat = concatMap deOr xs
deOr (ReOr rs) = rs
deOr r = [r]
mkAnd :: [Re] -> Re
mkAnd xs
| noGood `elem` zs = noGood
| null zs = allGood
| [z] <- zs = z
| otherwise = ReAnd zs
where
zs = nub $ sort $ filter (/= allGood) flat
flat = concatMap deAnd xs
deAnd (ReAnd rs) = rs
deAnd r = [r]
mkKleene :: Re -> Re
mkKleene (Kleene s) = mkKleene s
mkKleene r = Kleene r
mkNot :: Re -> Re
mkNot (OneOf (Pos [])) = allGood
mkNot (ReNot s) = s
mkNot r = ReNot r
This completes our regex engine.
Reading Regexes
We employ parser combinators to parse regex patterns. We deviate from
conventional syntax slightly. We add the metacharacters &
and !
for logical
and and logical not. We lack +
and ?
, but:
-
r+
is equivalent to!()&r*
-
r?
is equivalent to()|r
type Parser = Parsec String ()
regex :: Parser Re
regex = mkOr <$> ands `sepBy` char '|' where
ands = mkAnd <$> cats `sepBy` char '&'
cats = foldr (#.) eps <$> many nots
nots = (char '!' >> mkNot <$> nots) <|> (atm >>= kle)
atm = chCl
<|> const (OneOf $ Neg []) <$> char '.'
<|> between (char '(') (char ')') regex
kle :: Re -> Parser Re
kle r = char '*' *> kle (mkKleene r) <|> pure r
chCl = fmap OneOf $ (Pos . (:[]) <$> single)
<|> between (char '[') (char ']') parity
parity = option Pos (const Neg <$> char '^') <*>
(nub . sort . concat <$> many rng)
rng = alphaNum >>= \lo -> hiEnd lo <|> pure [lo]
single = char '\\' *> oneOf meta <|> noneOf meta
hiEnd :: Char -> Parser String
hiEnd lo = do
hi <- char '-' *> alphaNum
when (hi < lo) $ fail "invalid range end"
pure [lo..hi]
meta :: String
meta = "\\|&!*.[]()"
Rendering Regexes
Tedious code to print character classes and regexes:
instance Show CharClass where
show cc = case cc of
Pos [] -> "[]"
Neg [] -> "."
Pos [c] -> showSingle c
Neg [c] -> concat ["[^", showSingle c, "]"]
Pos s -> concat ["[" , f $ sort s, "]"]
Neg s -> concat ["[^", f $ sort s, "]"]
where
showSingle c | c `elem` meta = '\\':[c]
| otherwise = [c]
f "" = ""
f [c] = [c]
f (c:t) = rangeFinder c c t
rangeFinder lo hi (h:t) | h == succ hi = rangeFinder lo h t
rangeFinder lo hi t
| lo == hi = lo:f t
| succ lo == hi = lo:hi:f t
| succ (succ lo) == hi = [lo..hi] ++ f t
| otherwise = lo:'-':hi:f t
instance Show Re where
show = show' (0 :: Int) where
show' p re = case re of
OneOf s -> show s
Kleene (OneOf (Pos [])) -> "()"
Kleene r -> show' 4 r ++ "*"
ReNot r -> paren (p > 3) $ ('!':) $ show' 3 r
r :. s -> paren (p > 2) $ show' 2 r ++ show' 2 s
ReAnd rs -> paren (p > 1) $ intercalate "&" $ show' 1 <$> rs
ReOr rs -> paren (p > 0) $ intercalate "|" $ show' 0 <$> rs
paren True s = '(':s ++ ")"
paren False s = s
We place DFA nodes with a force-directed layout algorithm.
Here, we bump into a downside of lazy evaluation: the function iterate
is
lazy, leading to stack issues if we iterate much more. We ought to fix this.
forceDirect :: (Int -> Int -> Bool) -> [(Int, [Double])] -> [(Int, [Double])]
forceDirect isEdge vs = iterate (step isEdge) vs!!128
step :: (Int -> Int -> Bool) -> [(Int, [Double])] -> [(Int, [Double])]
step isEdge m = nudge <$> m where
nudge (a, v) = (,) a $ foldl' (zipWith (+)) v $ force <$> m where
force (b, w)
| b == a = [0, 0]
| otherwise = ((0.1 * f) *) <$> ba
where
ba = zipWith (-) v w
d = sqrt . sum $ (^(2::Int)) <$> ba
f = 300 / (d*d*d) + bool 0 (7 / d - 1) (isEdge a b)
Lastly, we have messy code that draws the DFA with SVG, and sets up the above UI. It has nothing to do with regexes.
Exponential worst-case
Converting regexes to DFAs can involve an exponential blow-up. For example, my browser crashes on this page on:
(a|b)*a(a|b)(a|b)(a|b)(a|b)
One might try forgoing the DFA, and only computing derivatives for the strings that we wish to match, but regex derivatives also blow up badly. This is a pity, as it suggests in general we may need NFAs.