ANTLR

What is ANTLR?

ANTLR is a parser generator that helps us parse structured text and transform it into an organized structure, an Abstract Syntax Tree (AST). Being a parser generator, ANTLR generates the parser that will parse our structured text. ANTLR consists of two parts: the tool and the runtime. The tool must be installed on our development machine. The runtime must be included and used in our final product.

What does AST stand for?

Abstract Syntax Tree.

What languages does ANTLR support?

ANTLR can generate a parser for our structured text if our target language is among C#, Java, Python, and JavaScript.

What do we need to do in order to use ANTLR?

  1. Download and install ANTLR
  2. Create the grammar that describes our structure text
  3. Invoke ANTLR with appropriate options to generate the lexer and parser for our targeted language
  4. Use the generated lexer and parser in our code.

What are the two parts of ANTLR?

  1. The tool
  2. The runtime.

ANTLR consists of two parts: the tool and the runtime. The tool must be installed on our machine, but the runtime must be included in our final product.

Should we use ANTLR or should we implement our own parser?

Some people argue that by writing a parser by hand you can make it faster and produce better error messages. There is some truth in this, but in my experience parsers generated by ANTLR are always fast enough. You can tweak them and improve both performance and error handling by working on your grammar, if you really need to. And you can do that once you are happy with your grammar.

Where can we find examples of using ANTLR?

In the companion repository, you will find all the code, with tests, even where we don't show it in this article. The examples are in different languages, but the knowledge is generally applicable to any language.

How can we install the tool part of ANTLR?

mv antlr-4.6-complete.jar /usr/local/lib
export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar:$CLASSPATH"
alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.6-complete.jar:$CLASSPATH" org.antlr.v4.Tool'
alias test-rig='java org.antlr.v4.gui.TestRig'

What should we use as the filename extension for our grammar file?

.g4. When you use ANTLR you start by writing a grammar, a file with extension .g4 which contains the rules of the language that you are analyzing.

How should we name our grammar file?

First we have to give our grammar an arbitrary but meaningful name. If we name our grammar Chat, then the name of our grammar file should be Chat.g4.

How can we invoke ANTLR to generate the parser for our structured text if our targeted language is JavaScript?

antlr4 -Dlanguage=JavaScript grammarFileName.g4

Notice that the option is case-sensitive, so pay attention to the uppercase ‘S’. If you make a mistake you will receive a message like the following:

error(31):  ANTLR cannot generate Javascript code as of version 4.6

You can put your grammars in the same folder as your JavaScript files. The file containing the grammar must have the same name as the grammar, which must be declared at the top of the file.

How can we deploy the ANTLR run-time for JavaScript as part of our application?

ANTLR can be used both with node.js and in the browser. For the browser you need to use webpack or require.js. If you don’t know how to use either of the two you can look at the official documentation for some help or read this tutorial on antlr on the web. We are going to use node.js, for which you can install the ANTLR runtime simply by using the following standard command:

npm install antlr4

How can we invoke ANTLR to generate the parser for our structured text if our targeted language is Python 3?

When you have a grammar, you put it in the same folder as your Python files. The file must have the same name as the grammar, which must be declared at the top of the file. In the following example the name is Chat and the file is Chat.g4.

We can create the corresponding Python parser simply by specifying the correct option with the ANTLR4 Java program. For Python, you also need to pay attention to the version of Python, 2 or 3.

antlr4 -Dlanguage=Python3 Chat.g4

How can we deploy the ANTLR run-time for Python as part of our application?

The ANTLR runtime for Python is available from PyPI, so you can install it using pip:

pip install antlr4-python3-runtime

Again, you just have to remember to specify the proper Python version.

How can we use ANTLR with Maven?

Specify in our POM that we need antlr4-runtime as a dependency. We will also use a Maven plugin to run ANTLR through Maven. We can also specify whether we want ANTLR to generate visitors or listeners. To do that we define a couple of corresponding properties in our pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  [...]
  <dependencies>
    <dependency>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-runtime</artifactId>
      <version>4.6</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- Plugin to compile the g4 files ahead of the java files.
           Except that the grammar does not need to contain the package declaration as stated in the documentation (I do not know why).
           To use this plugin, type:
             mvn antlr4:antlr4
           In any case, Maven will invoke this plugin before the Java source is compiled. -->
      <plugin>
        <groupId>org.antlr</groupId>
        <artifactId>antlr4-maven-plugin</artifactId>
        <version>4.6</version>
        <configuration>
          <visitor>true</visitor>
          <listener>true</listener>
        </configuration>
        <executions>
          <execution>
            <goals><goal>antlr4</goal></goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
Now you have to put the *.g4 files of your grammar under src/main/antlr4/me/tomassetti/examples/MarkupParser. Once you have written your grammars you just run mvn package and all the magic happens: ANTLR is invoked, it generates the lexer and the parser and those are compiled together with the rest of your code.

There is a clear advantage in using Java for developing ANTLR grammars: there are plugins for several IDEs and it's the language that the main developer of the tool actually works on. So there are tools, like org.antlr.v4.gui.TestRig, that can be easily integrated in your workflow and are useful if you want to easily visualize the AST of an input.

How can we get ANTLR to work with C#?

See the "C# Setup" section from

What is a listener and what is a visitor?

As our parser parses the provided structured text, whenever it finds a match, it calls out to our code: our code is the listener. A visitor, by contrast, is code that we invoke on the tree ourselves: we call its visit* methods, so we control the traversal and can return values from each node.

How can we generate the visitor?

antlr4 -visitor grammarFileName.g4

By default only the listener is generated, so to create the visitor you use the -visitor command line option, and -no-listener if you don’t want to generate the listener. There are also the opposite options, -no-visitor and -listener, but they are the default values.

How can we generate the listener?

The listener is generated by default, so no option is needed; you can make it explicit with the -listener command line option. Use -no-listener if you don't want to generate the listener.

How can we test our grammar?

You can optionally test your grammar using a little utility named TestRig:

test-rig <grammar-name> <rule-to-test> <input-filename(s)>

The filename(s) are optional; you can instead analyze the input that you type on the console. If you want to use the testing tool you need to generate a Java parser, even if your program is written in another language. This is useful when manually testing the first draft of your grammar. As it becomes more stable you may want to rely on automated tests. We will cover more on automated testing later.

TestRig has a few useful options: -tokens, to show the tokens detected, and -gui, to generate an image of the AST.

Before trying our new grammar we have to add a name for it at the beginning of the file. The name must be the same as the file name, which should have the .g4 extension. The first line of our Chat.g4 grammar file should be:

grammar Chat;
// lines preceded by $ are commands
// > are input to the tool
// - are output from the tool
antlr4 Chat.g4 // generate the parser
javac Chat*.java // compile the parser
test-rig Chat chat // run the testing tool, Chat is the name of the grammar and 
                          // chat is the rule that we want to parse

The above steps produce:

line 1:0 mismatched input 'john SAYS: hello @michael this will not work\n' expecting WORD

Okay, it doesn't work. Why is it expecting WORD? It's right there! Let's try to find out, using the option -tokens to make it show the tokens it recognizes.

$ test-rig Chat chat -tokens
> john SAYS: hello @michael this will not work
- [@0,0:44='john SAYS: hello @michael this will not work\n',<TEXT>,1:0]
- [@1,45:44='<EOF>',<EOF>,2:0]

So it only sees the TEXT token. But we put that token at the end of the grammar, so what happens? The problem is that the lexer always tries to match the largest possible token, and all of this text is a valid TEXT token. How do we solve this problem? There are many ways. The first, of course, is just getting rid of that token, but for now we are going to see the second easiest.

Our Chat.g4 (only changes):


link                : TEXT TEXT ;


TEXT                : ('['|'(') ~[\])]+ (']'|')');

We have changed the problematic token to make it include a preceding parenthesis or square bracket. Note that this isn't exactly the same thing, because it would allow mismatched pairs of parentheses and square brackets. But it is a first step and we are learning here, after all. Let's check if it works:

$ test-rig Chat chat -tokens
> john SAYS: hello @michael this will not work
- [@0,0:3='john',<WORD>,1:0]
- [@1,4:4=' ',<WHITESPACE>,1:4]
- [@2,5:8='SAYS',<SAYS>,1:5]
- [@3,9:9=':',<':'>,1:9]
- [@4,10:10=' ',<WHITESPACE>,1:10]
- [@5,11:15='hello',<WORD>,1:11]
- [@6,16:16=' ',<WHITESPACE>,1:16]
- [@7,17:17='@',<'@'>,1:17]
- [@8,18:24='michael',<WORD>,1:18]
- [@9,25:25=' ',<WHITESPACE>,1:25]
- [@10,26:29='this',<WORD>,1:26]
- [@11,30:30=' ',<WHITESPACE>,1:30]
- [@12,31:34='will',<WORD>,1:31]
- [@13,35:35=' ',<WHITESPACE>,1:35]
- [@14,36:38='not',<WORD>,1:36]
- [@15,39:39=' ',<WHITESPACE>,1:39]
- [@16,40:43='work',<WORD>,1:40]
- [@17,44:44='\n',<NEWLINE>,1:44]
- [@18,45:44='<EOF>',<EOF>,2:0]

What are the useful options for TestRig?

  1. -tokens: shows the tokens detected
  2. -gui: generate an image of the AST.

What is the definition of lexer and parser?

Before looking into parsers, we first need to look into lexers, also known as tokenizers. They are basically the first stepping stone toward a parser, and of course ANTLR allows you to build them too. A lexer takes the individual characters and transforms them into tokens, the atoms that the parser uses to create the logical structure.

What are the rules for organizing the grammar file?


 * Parser Rules

operation  : NUMBER '+' NUMBER ;

 * Lexer Rules

NUMBER     : [0-9]+ ;

WHITESPACE : ' ' -> skip ;

The above grammar is used to parse mathematical expressions such as 437 + 734. The above grammar is not complete, but we can already see that lexer rules are all uppercase, while parser rules are all lowercase. Technically the rule about case applies only to the first character of their names, but usually they are all uppercase or lowercase for clarity.

Rules are typically written in this order: first the parser rules and then the lexer ones, although logically they are applied in the opposite order. It’s also important to remember that lexer rules are analyzed in the order that they appear, and they can be ambiguous.

The typical example is the identifier: in many programming languages it can be any string of letters, but certain combinations, such as "class" or "function", are forbidden because they indicate a class or a function. So the order of the rules solves the ambiguity by using the first match, and that's why the tokens identifying keywords such as class or function are defined first, while the one for the identifier is put last.
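The two rules described above, longest match first and then declaration order, can be sketched with a toy lexer in Python. This is only an illustrative sketch with hypothetical rule names, not ANTLR's actual matching algorithm:

```python
import re

# Keyword rules come first; the catch-all identifier rule comes last.
RULES = [
    ("CLASS", r"class"),
    ("FUNCTION", r"function"),
    ("ID", r"[a-z]+"),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # collect every rule that matches at the current position
        matches = [(name, m.group()) for name, pattern in RULES
                   if (m := re.match(pattern, text[pos:]))]
        # longest match wins; on a tie, max() keeps the earlier rule
        name, lexeme = max(matches, key=lambda t: len(t[1]))
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

print(tokenize("class classy"))  # [('CLASS', 'class'), ('ID', 'classy')]
```

Note how "classy" is not mistaken for the keyword "class": the identifier match is longer, so it wins despite being declared last.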

The basic syntax of a rule is easy: there is a name, a colon, the definition of the rule, and a terminating semicolon.

The definition of NUMBER contains a typical range of digits and a '+' symbol to indicate that one or more matches are allowed. These are all very typical indications with which I assume you are familiar; if not, you can read more about the syntax of regular expressions.

The most interesting part is at the end: the lexer rule that defines the WHITESPACE token. It's interesting because it shows how to tell ANTLR to ignore something. Consider how ignoring whitespace simplifies parser rules: if we couldn't tell it to ignore WHITESPACE, we would have to include it between every single subrule of the parser, to let the user put spaces where he wants. Like this:

operation  : WHITESPACE* NUMBER WHITESPACE* '+' WHITESPACE* NUMBER WHITESPACE* ;
And the same typically applies to comments: they can appear everywhere and we do not want to handle them specifically in every single piece of our grammar, so we just ignore them (at least while parsing).
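The effect of "-> skip" can be sketched in Python: the lexer still matches the whitespace, but the matched token is dropped instead of being handed to the parser. The rule names here mirror the grammar above; the matching loop itself is a simplification:

```python
import re

TOKEN_SPEC = [("NUMBER", r"[0-9]+"), ("PLUS", r"\+"), ("WHITESPACE", r"\s+")]
SKIP = {"WHITESPACE"}  # rules marked "-> skip"

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name not in SKIP:  # skipped tokens never reach the parser
                    tokens.append((name, m.group()))
                pos += len(m.group())
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return tokens

print(tokenize("437 + 734"))  # [('NUMBER', '437'), ('PLUS', '+'), ('NUMBER', '734')]
```

The parser thus sees only NUMBER '+' NUMBER, which is exactly what the operation rule expects.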

What is the difference between the top-down approach and the bottom-up approach for defining our grammar?

With the top-down approach, we start out with the big pieces and work our way down to the smaller pieces. For example, when parsing a language similar to Java, we would start with the package statement, then the class statement, and then go down to the child statements.

With the bottom-up approach, we would start with the child statements and work our way up to the class statement and the package statement. The bottom-up approach consists of focusing in the small elements first: defining how the tokens are captured, how the basic expressions are defined and so on, and then we move to higher level constructs until we define the rule representing the whole file.

I personally prefer to start from the bottom, with the basic items, which are analyzed with the lexer. Then you grow naturally from there to the structure, which is dealt with by the parser. This approach lets us focus on a small piece of the grammar, build tests for that, ensure it works as expected, and then move on to the next bit.

This approach mimics the way we learn. Furthermore, there is the advantage of starting with real code that is actually quite common among many languages. In fact, most languages have things like identifiers, comments, whitespace, etc. Obviously you might have to tweak something; for example, a comment in HTML is functionally the same as a comment in C#, but it has different delimiters.

The disadvantage of a bottom-up approach rests on the fact that the parser is the thing you actually care about. You weren't asked to build a lexer, you were asked to build a parser that provides a specific functionality. So by starting with the last part, the lexer, you might end up doing some refactoring, if you don't already know how the rest of the program will work.

What are rule fragments?

In this example we use rule fragments: they are reusable building blocks for lexer rules. You define them and then you refer to them in lexer rules. If you define them but do not include them in lexer rules, they simply have no effect. For example:

fragment DIGIT : [0-9] ;
NUMBER         : DIGIT+ ([.,] DIGIT+)? ;

In the above grammar, we define a fragment named DIGIT, and then we define a rule named NUMBER which is based on the DIGIT fragment. A number starts with one or more digits, optionally followed by either a dot or a comma and more digits.
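The NUMBER rule above corresponds to an ordinary regular expression, which we can check directly in Python to see which inputs it accepts:

```python
import re

# DIGIT+ followed by an optional '.' or ',' and more DIGITs
NUMBER = re.compile(r"[0-9]+([.,][0-9]+)?")

print(bool(NUMBER.fullmatch("437")))   # True: digits only
print(bool(NUMBER.fullmatch("3.14")))  # True: dot as separator
print(bool(NUMBER.fullmatch("3,14")))  # True: comma as separator
print(bool(NUMBER.fullmatch(".5")))    # False: must start with a digit
```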

Should we use the parser to enforce semantic checks?

Maybe not. Semantic rules, such as which colors are available, must be checked by the logic of the program, which can access that information. You have to find the right balance in dividing enforcement between the grammar and your own code.

The parser should only check the syntax. So the rule of thumb is that when in doubt you let the parser pass the content up to your program. Then, in your program, you check the semantics and make sure that the rule actually has a proper meaning.

What files are created when we feed our grammar file through ANTLR?

Assuming that we will be using our structured text parser with the JavaScript language, when we feed our grammar file through ANTLR using:

antlr4 -Dlanguage=JavaScript Chat.g4

it will create some new files with names such as ChatLexer.js and ChatParser.js, and there are also *.tokens files, none of which contains anything interesting for us, unless you want to understand the inner workings of ANTLR.

The file you want to look at is ChatListener.js. You are not going to modify anything in it, but it contains methods and functions that we will override with our own listener. We are not going to modify it directly, because changes would be overwritten every time the grammar is regenerated.

Looking into it you can see several enter/exit functions, a pair for each of our parser rules. These functions will be invoked when a piece of code matching the rule is encountered. This is the default implementation of the listener, which allows you to just override the functions that you need on your derived listener and leave the rest as they are.
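The generated-listener idea can be sketched in a few lines of Python: a base class provides empty enter/exit hooks for each rule (the hook names here are hypothetical, not the actual generated names), and our listener subclass overrides only what it needs:

```python
# Base class with no-op hooks, standing in for the generated listener
class ChatListenerBase:
    def enterName(self, ctx): pass
    def exitName(self, ctx): pass
    def enterMessage(self, ctx): pass
    def exitMessage(self, ctx): pass

class MyListener(ChatListenerBase):
    def __init__(self):
        self.names = []

    def enterName(self, ctx):       # override just the hook we care about
        self.names.append(ctx)

listener = MyListener()
listener.enterName("john")          # in a real run, the walker calls these
listener.enterMessage("hello")      # inherited no-op
print(listener.names)  # ['john']
```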

What are the differences between a listener and a visitor?

The alternative to creating a listener is creating a visitor. The main differences are that you can neither control the flow of a listener nor return anything from its functions, while you can do both with a visitor. So if you need to control how the nodes of the AST are entered, or to gather information from several of them, you probably want to use a visitor. This is useful, for example, with code generation, where some information that is needed to create new source code is spread around many parts. Both the listener and the visitor use depth-first search.

This time we are building a visitor, whose main advantage is the chance to control the flow of the program.

What is the depth-first approach?

A depth-first search means that when a node is accessed, its children will be accessed, and if one of the child nodes has its own children, those will be accessed before continuing with the other children of the first node.
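Depth-first traversal can be demonstrated on a toy parse-tree structure in Python (the node names are made up for illustration):

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def walk(node, visited):
    visited.append(node.name)        # "enter" the node
    for child in node.children:
        walk(child, visited)         # fully visit each child before its siblings

tree = Node("chat", [Node("line", [Node("name"), Node("message")]),
                     Node("line", [Node("name")])])
order = []
walk(tree, order)
print(order)  # ['chat', 'line', 'name', 'message', 'line', 'name']
```

Notice that the first line's message is visited before the second line is even touched: each subtree is exhausted before the traversal moves on.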

How can we actually use our generated parser?

A typical ANTLR program (antlr.js) looks like this:

const http = require('http');
const antlr4 = require('antlr4/index');
const ChatLexer = require('./ChatLexer');
const ChatParser = require('./ChatParser');
const HtmlChatListener = require('./HtmlChatListener').HtmlChatListener;

http.createServer((req, res) => {

   res.writeHead(200, {
       'Content-Type': 'text/html'
   });

   res.write('<html><head><meta charset="UTF-8"/></head><body>');

   var input = "john SHOUTS: hello @michael /pink/this will work/ :-) \n";
   var chars = new antlr4.InputStream(input);
   var lexer = new ChatLexer.ChatLexer(chars);
   var tokens  = new antlr4.CommonTokenStream(lexer);
   var parser = new ChatParser.ChatParser(tokens);
   parser.buildParseTrees = true;
   var tree = parser.chat();
   var htmlChat = new HtmlChatListener(res);
   antlr4.tree.ParseTreeWalker.DEFAULT.walk(htmlChat, tree);

   res.write('</body></html>');
   res.end();

}).listen(1337);

At the beginning of the main file we import (using require) the necessary libraries and files: antlr4 (the runtime), our generated parser, plus the listener that we are going to see later.

The above code shows how to create the lexer and the parser and actually parse a message. We create the stream of chars from the input and give it to the lexer, which transforms it into tokens that are then interpreted by the parser.

What is a walker and can we stop its execution?

These functions, enter* and exit*, are called by the walker every time the corresponding nodes are entered or exited while it's traversing the AST that represents the program. A listener allows you to execute some code, but it's important to remember that you can't stop the execution of the walker and the execution of the functions.

How can we solve Ambiguities with Semantic Predicates?

As the name implies, semantic predicates are expressions that produce a boolean value. They selectively enable or disable the following rule and thus permit solving ambiguities. Another reason they could be used is to support different versions of the same language, for instance a version with a new construct or an old one without it.

Technically they are part of the larger group of actions, which allow us to embed arbitrary code into the grammar. The downside is that the grammar is no longer language independent, since the code in the action must be valid for the target language. For this reason, it's usually considered a good idea to only use semantic predicates when they can't be avoided, and leave most of the code to the visitor/listener. A partial Chat.g4:

link                : '[' TEXT ']' '(' TEXT ')';

TEXT                : {self._input.LA(-1) == ord('[') or self._input.LA(-1) == ord('(')}? ~[\])]+ ;

In the above grammar, we added a semantic predicate to the TEXT token, written inside curly brackets and followed by a question mark. We use self._input.LA(-1) to check the character before the current one; if this character is a square bracket or an open parenthesis, we activate the TEXT token. It's important to repeat that this must be valid code in our target language.

This matters not just for the syntax itself, but also because different targets might have different fields or methods; for instance, LA returns an int in Python, so we have to convert the char to an int.
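The predicate's logic, taken out of the grammar, is plain Python: LA(-1) yields the code point of the previous character, so we compare it against ord(...). The helper function name here is just for illustration:

```python
# mirrors the predicate {self._input.LA(-1) == ord('[') or self._input.LA(-1) == ord('(')}?
def text_token_enabled(prev_char_code):
    return prev_char_code in (ord('['), ord('('))

print(text_token_enabled(ord('[')))  # True
print(text_token_enabled(ord('x')))  # False
```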

If you want to test for the preceding token, you can use _input.LT(-1), but you can only do that in parser rules.

How can we write automated unit test for our grammar and parser?

python3 -m unittest discover -s . -p

That's how you run the tests, but before that we have to write them. Actually, even before that, we have to write an ErrorListener to manage the errors that we could find. While we could simply read the text output by the default error listener, there is an advantage in using our own implementation, namely that we can more easily control what happens.

import sys
from antlr4 import *
from ChatParser import ChatParser
from ChatListener import ChatListener
from antlr4.error.ErrorListener import *
import io

class ChatErrorListener(ErrorListener):

    def __init__(self, output):
        self.output = output        
        self._symbol = ''

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):        
        self._symbol = offendingSymbol.text

    def symbol(self):
        return self._symbol

See the "Testing with Python" section from

What does a typical ANTLR application look like in Java?

package me.tomassetti.examples.MarkupParser;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class App {
    public static void main( String[] args ) {
        ANTLRInputStream inputStream = new ANTLRInputStream(
            "I would like to [b][i]emphasize[/i][/b] this and [u]underline [b]that[/b][/u] ." +
            "Let's not forget to quote: [quote author=\"John\"]You're wrong![/quote]");
        MarkupLexer markupLexer = new MarkupLexer(inputStream);
        CommonTokenStream commonTokenStream = new CommonTokenStream(markupLexer);
        MarkupParser markupParser = new MarkupParser(commonTokenStream);

        MarkupParser.FileContext fileContext = markupParser.file();
        MarkupVisitor visitor = new MarkupVisitor();
        visitor.visit(fileContext);
    }
}

package me.tomassetti.examples.MarkupParser;

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.misc.*;
import org.antlr.v4.runtime.tree.*;

public class MarkupVisitor extends MarkupParserBaseVisitor<String> {

    @Override
    public String visitFile(MarkupParser.FileContext context) {
        visitChildren(context);
        return null;
    }

    @Override
    public String visitContent(MarkupParser.ContentContext context) {
        return visitChildren(context);
    }
}

You can see how to control the flow, either by calling visitChildren, or any other visit* function, and deciding what to return. We just need to override the methods that we want to change; otherwise, the default implementation behaves like visitContent: it visits the children nodes and allows the visitor to continue. Just like for a listener, the argument is the proper context type. If you want to stop the visitor, just return null.

Transforming code, even at a very simple level, comes with some complications. Let’s start easy with some basic visitor methods. (part):

public String visitContent(MarkupParser.ContentContext context) {
    return context.getText();
}

public String visitElement(MarkupParser.ElementContext context) {
    if(context.parent instanceof MarkupParser.FileContext) {
        if(context.content() != null)
            System.out.print(visitContent(context.content()));
        if(context.tag() != null)
            System.out.print(visitTag(context.tag()));
    }
    return null;
}

Before looking at the main method, let's look at the supporting ones. Foremost, we have changed visitContent by making it return its text instead of printing it. Second, we have overridden visitElement so that it prints the text of its child, but only if it's a top element, and not inside a tag. In both cases, it achieves this by calling the proper visit* method. It knows which one to call because it checks whether it actually has a tag or a content node. (part):

public String visitTag(MarkupParser.TagContext context) {
    String text = "";
    String startDelimiter = "", endDelimiter = "";

    String id = context.ID(0).getText();

    switch(id) {
        case "b":
            startDelimiter = endDelimiter = "**";
            break;
        case "u":
            startDelimiter = endDelimiter = "*";
            break;
        case "quote":
            String attribute = context.attribute().STRING().getText();
            attribute = attribute.substring(1, attribute.length() - 1);
            startDelimiter = System.lineSeparator() + "> ";
            endDelimiter = System.lineSeparator() + "> " + System.lineSeparator() + "> – "
                         + attribute + System.lineSeparator();
            break;
    }

    text += startDelimiter;

    for (MarkupParser.ElementContext node : context.element()) {
        if(node.tag() != null)
            text += visitTag(node.tag());
        if(node.content() != null)
            text += visitContent(node.content());
    }

    text += endDelimiter;

    return text;
}

visitTag contains more code than any other method, because a tag can also contain other elements, including other tags that have to be managed themselves, and thus they cannot simply be printed. We don't need to check that the corresponding end tag matches, because the parser will ensure that, as long as the input is well formed.

How does a visitor work?

  1. every top element visits each child
    • if it's a content node, it directly returns the text
    • if it's a tag, it sets up the correct delimiters and then checks its children. It repeats this process for each child and then returns the gathered text
  2. it prints the returned text

Return null to stop the visit, return the result of visiting the children to continue, or return something to perform an action ordered at a higher level of the tree.
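The steps above can be sketched in Python on a toy tree. The Content and Tag classes and the delimiter table are hypothetical stand-ins for the generated context classes, but the flow is the one described: content returns its text, tags wrap the gathered text of their children:

```python
class Content:
    def __init__(self, text):
        self.text = text

class Tag:
    def __init__(self, name, children):
        self.name, self.children = name, children

DELIMITERS = {"b": "**", "u": "*"}

def visit(node):
    if isinstance(node, Content):
        return node.text                          # step 1a: return the text directly
    delimiter = DELIMITERS.get(node.name, "")
    inner = "".join(visit(child) for child in node.children)
    return delimiter + inner + delimiter          # step 1b: wrap the gathered text

doc = Tag("b", [Content("emphasize "), Tag("u", [Content("this")])])
print(visit(doc))  # **emphasize *this***
```

Because each visit call returns a string, the parent can gather and combine the results of its children, which is exactly what a listener cannot do.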

How can we write automated test cases for our ANTLR Java application? (part):

// private variables inside the class AppTest
private MarkupErrorListener errorListener;
private MarkupLexer markupLexer;

public void testText() {
    MarkupParser parser = setup("anything in here");

    MarkupParser.ContentContext context = parser.content();
}

public void testInvalidText() {
    MarkupParser parser = setup("[anything in here");

    MarkupParser.ContentContext context = parser.content();
}

public void testWrongMode() {
    MarkupParser parser = setup("author=\"john\"");

    MarkupParser.AttributeContext context = parser.attribute();
    TokenStream ts = parser.getTokenStream();

    assertEquals(MarkupLexer.DEFAULT_MODE, markupLexer._mode);
}

public void testAttribute() {
    MarkupParser parser = setup("author=\"john\"");
    // we have to manually push the correct mode

    MarkupParser.AttributeContext context = parser.attribute();
    TokenStream ts = parser.getTokenStream();
}

public void testInvalidAttribute() {
    MarkupParser parser = setup("author=/\"john\"");
    // we have to manually push the correct mode

    MarkupParser.AttributeContext context = parser.attribute();
}


How can we deal with the left-recursive problem?

An expression usually contains other expressions. For example, the typical binary expression is composed of an expression on the left, an operator in the middle, and another expression on the right. This can lead to ambiguities. Think, for example, of the expression 5 + 3 * 2. For ANTLR this expression is ambiguous because there are two ways to parse it: either as 5 + (3 * 2) or as (5 + 3) * 2.

These types of rules are called left-recursive rules. You might say: just parse whatever comes first. The problem with that is semantic: the addition comes first, but we know that multiplications have a precedence over additions. Traditionally the way to solve this problem was to create a complex cascade of specific expressions like this:

expression     : addition;
addition       : multiplication ('+' multiplication)* ;
multiplication : atom ('*' atom)* ;
atom           : NUMBER ;

This way ANTLR would have known to search first for a number, then for multiplications and finally for additions. This is cumbersome and also counterintuitive, because the last expression is the first to be actually recognized. Luckily ANTLR4 can create a similar structure automatically, so we can use a much more natural syntax.

expression : expression '*' expression
           | expression '+' expression
           | NUMBER
           ;

In practice ANTLR considers the order in which we defined the alternatives to decide the precedence. By writing the rule in this way we are telling ANTLR that multiplication takes precedence over addition.
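The precedence behavior of the cascade grammar can be sketched as a recursive-descent evaluator in Python: each grammar rule becomes a function, and the deepest rule in the cascade binds tightest. This is an illustrative sketch over pre-split tokens, not what ANTLR generates:

```python
def evaluate(tokens):
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat():
        token = tokens[pos[0]]
        pos[0] += 1
        return token

    def atom():                        # atom : NUMBER
        return int(eat())

    def multiplication():              # multiplication : atom ('*' atom)*
        value = atom()
        while peek() == '*':
            eat()
            value *= atom()
        return value

    def addition():                    # addition : multiplication ('+' multiplication)*
        value = multiplication()
        while peek() == '+':
            eat()
            value += multiplication()
        return value

    return addition()                  # expression : addition

print(evaluate(['5', '+', '3', '*', '2']))  # 11: multiplication binds tighter
```

Because addition calls multiplication for its operands, every '*' is consumed before any '+' is applied, which is exactly the precedence the cascade (or the ordered alternatives) encodes.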

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License