4 Customization

Both the parsers and the scanners can be customized. The parser is usually extended by subclassing, and the scanner can either be subclassed or completely replaced.

4.1 Customizing Parsers

If additional fields and methods are needed in order for a parser to work, Python subclassing can be used. (This is unlike parser classes written in static languages, in which these fields and methods must be defined in the generated parser class.) We simply subclass the generated parser, and add any fields or methods required. Expressions in the grammar can call methods of the subclass to perform any actions that cannot be expressed as a simple expression. For example, consider this simple grammar:

parser X:
    rule goal:  "something"  {{ self.printmsg() }}

The printmsg function need not be implemented in the parser class X; it can be implemented in a subclass:

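import Xparser  # the parser module generated from the grammar above

class MyX(Xparser.X):
    # Invoked by the {{ self.printmsg() }} action in rule goal
    def printmsg(self):
        print("Hello!")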

4.2 Customizing Scanners

The generated parser class is not dependent on the generated scanner class. A scanner object is passed to the parser object’s constructor in the parse function. To use a different scanner, write your own function to construct parser objects, with an instance of a different scanner. Scanner objects must have a token method that accepts an integer N as well as a list of allowed token types, and returns the Nth token, as a tuple. The default scanner raises NoMoreTokens if no tokens are available, and SyntaxError if no token could be matched. However, the parser does not rely on these exceptions; only the parse convenience function (which calls wrap_error_reporter) and the print_error error display function use those exceptions.
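As a minimal sketch of such a construction function: the helper below builds a parser around whatever scanner object it is given and then runs one rule. The name parse_with is hypothetical, and the sketch assumes that each grammar rule is available as a method of the parser object under the rule's name. It does no error reporting, so any exception raised by the scanner propagates to the caller.

```python
def parse_with(parser_class, scanner, rule):
    """Hypothetical helper: parse using a caller-supplied scanner."""
    # The scanner object is handed to the parser's constructor.
    parser = parser_class(scanner)
    # Assumption: the rule named `rule` is a method on the parser object.
    return getattr(parser, rule)()
```

With the grammar from section 4.1, a call might look like parse_with(MyX, my_scanner, "goal"), where my_scanner is any object that satisfies the scanner interface described here.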

The tuples representing tokens have four elements. The first two are the beginning and ending indices of the matched text in the input string. The third element is the type tag, matching either the name of a named token or the quoted regexp of an inline or ignored token. The fourth element of the token tuple is the matched text. If the input string is s, and the token tuple is (b,e,type,val), then val should be equal to s[b:e].
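The scanner interface and token tuple layout can be illustrated with a small replacement scanner. This is only a sketch under stated assumptions: SimpleScanner and its constructor arguments are made up for illustration, the NoMoreTokens class is a local stand-in for the runtime's exception, and Python's built-in SyntaxError stands in for the scanner's own. Unlike a scanner that consults the list of allowed token types, this one tokenizes the whole input up front and ignores that argument.

```python
import re


class NoMoreTokens(Exception):
    """Local stand-in for the exception raised when the input is exhausted."""


class SimpleScanner:
    """Hypothetical scanner that tokenizes the entire input in advance."""

    def __init__(self, patterns, ignore, text):
        # patterns: list of (type_tag, regexp_source) pairs, tried in order
        # ignore:   set of type tags whose matches are skipped
        compiled = [(tag, re.compile(src)) for tag, src in patterns]
        self.tokens = []
        pos = 0
        while pos < len(text):
            for tag, regex in compiled:
                m = regex.match(text, pos)
                if m and m.end() > pos:
                    if tag not in ignore:
                        # (begin, end, type, value), with value == text[begin:end]
                        self.tokens.append((pos, m.end(), tag, m.group(0)))
                    pos = m.end()
                    break
            else:
                # No pattern matched at this position.
                raise SyntaxError("no token matches at position %d" % pos)

    def token(self, i, restrict=None):
        # `restrict` (the allowed token types) is accepted but unused here.
        if i < len(self.tokens):
            return self.tokens[i]
        raise NoMoreTokens()


scanner = SimpleScanner(
    [("NUM", r"[0-9]+"), ("PLUS", r"\+"), ("WS", r"\s+")], {"WS"}, "12 + 3"
)
print(scanner.token(0))  # (0, 2, 'NUM', '12')
```

Tokenizing eagerly keeps the sketch short, but it means the scanner cannot use the parser's expectations to choose between overlapping token patterns.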

The generated parsers do not use the beginning or ending index. They use only the token type and value. However, the default error reporter uses the beginning and ending index to show the user where the error is.

Note: This isn't well documented and I recommend you also look through the source code. Also, see how Python RE can be used as a tokenizer, or read this long article.

Amit J Patel, amitp@cs.stanford.edu