Parser Generators


Almost all parser generators belong to one of two families-top-down or bottom-up-whose names describe how they go about comparing text to grammar.

Bottom-up parser — better known as LR parsers: "scan Left, expanding Rightmost subrule(in reverse)" — are usually implemented as a series of states, encoded in a lookup table. The parser moves from state to state by examining the next available token in the text, and then consulting the table to see what to do next. The choices are: perform an action, change to new state, or give up in digust. A bottom-up parser succeed if all the text is processed and the parser is left in a predefined "successful" state. Parsing can fail at anytime if the parser sees a token for which there is no suitable next state. In effect, a bottom-up parser converts the original grammar into a maze, which the text must thread its way through, state-to-state, one token at a time. If the trail of tokens leads to a dead end, then the text didn't conform to the rules of the grammar and is rejected. In this metaphor, any actions that were embedded in the grammar are scattered throughout the maze and are executed whenever they are encountered.

Top-down parsers — a.k.a LL parsers: "scan Left, expanding Leftmost subrule" — work quite differently. Instead of working token-to-token through a flattened-out version of the grammar rules, top-down parsers start at the highest level and attempt to match the most general rule first. Thus a top-down parser would start by attempting to match the entire DiffOutput rule, and immediately realize that to do so it must match one or more instances of the Command rule. So it would attempt to match a Command rule, and immediately realize that to do so it must match one of the three productions in the Command rule… It would keep moving ever further down the hierarchy of rules until it actually reach a point where it could try to match a terminal (against the first available token). If the terminal and token matched, the parser would move back up the hierarchy looking for the next bit of the enclosing production, and following that down util the next token was matched. If all the parts of the production were matched, the parser would move back up, matching higher and higher rules, until it succeeded in matching the top-most rule completely. If at some point a token fails to match, then the parser would work back up the tree, looking for an alternative way of matching the original rule (i.e. via a different production of the same rule). If all alternatives eventually fail, then the parse as a whole fails. In other words, a top-down parser converts the grammar into a tree (or more generally a graph), which it then traverses in a top-to-bottom, left-to-right order. If a particular branch of the tree fails to match, then the parser backs up and tries the next possible branch. If all branch fail, then the parse fail. In this approach, embedded actions represent leaves on the tree, and are executed if the traversal ever reaches them.

Bottom-up parser generator: yacc, perl-byacc, Parse::Yapp Top-down parser generator: PCCTS, libparse, Parse::RecDescent

Most parser generators involve an unpleasantly complicated ritual. First you create a lexer to split up your text into labelled tokens. That usually involves hand-crafting some code or using a lexer generator such as lex, flex, or DLG. Next, you create your grammar and feed to a parser generator. The parser generator spits out some code (your parser). Then you write your program, importing both the lexer and parser code you built previously. You hook the three bits together and viola! Of course, this multi-stage, multi-task, multi-file, multi-tool approach quickly becomes a multi-lobe headache.


As the match proceeds, the various actions specified for each rule will be executed. If the text successfully parsed, the actions which get fired along the way will cause the program to print a "diff-reversed" version of its input.

Other parsers:
Lemon Parser Generator Tutorial

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License