Starting out with ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator.

You might be designing your own programming/scripting language or defining something a little simpler, such as a query language for your application, or even just something to parse user input (which is not necessarily simpler).

ANTLR can be downloaded from the ANTLR website (antlr.org). Whilst the tool itself is written in Java, it can generate parser code in Java, C#, Python, JavaScript and more.

Note: I’ve downloaded the “Complete ANTLR 4.8 Java binaries jar”, so the commands listed all relate to that JAR.

Grammar files

Before we can do anything meaningful with ANTLR we need to define a grammar for our lexer and parser to be generated from. Grammars are text files with the extension .g or .g4 (for ANTLR 4 compatible grammars) and look a lot like BNF. Let’s create a grammar for a very simple query language that we’d like to incorporate into an application.

We start by creating a file named querylanguage.g4

I’m using Visual Studio Code to write the grammar and using the excellent ANTLR4 grammar syntax support extension to aid in developing the grammar.

The first thing we add to the grammar is

grammar querylanguage;

The grammar name must match the name of your grammar file. It can be prefixed with lexer or parser if you want to write a grammar specific to either a lexer or a parser; without these we’re creating a combined lexer and parser grammar.

Here are all the options for the grammar declaration.

lexer grammar querylanguage;

// or

parser grammar querylanguage;

// or 

grammar querylanguage;

Before we extend our grammar let’s talk about comments – single line comments can be created using // and multi line using /* */.

We create our grammar using a combination of rules and tokens. Parser rule names must start with a lower case letter (camelCase is the convention), whereas token names must start with an upper case letter (all upper case is the convention). We’ll need an entry point (or start rule), so let’s extend our grammar with a start rule named query, which is made up of an expression rule. We’ll also add tokens to handle strings and numbers.

grammar querylanguage;

query
    : expression
    ;

expression
    : STRING
    | NUMBER
    ;

STRING
    : '"' .*? '"'
    ;

SIGN
    : ('+' | '-')
    ;

NUMBER
    : SIGN? ( [0-9]* '.' )? [0-9]+
    ;

We’ll look at the syntax of our rules etc. in a moment.

As you can see, a rule is made up of a name followed by a colon, then a sequence of other rules or tokens, and is terminated with a ;. Alternatives are separated using the | operator. So in the example above we have two parser rules, query and expression, and three token rules, STRING, SIGN and NUMBER.

If we input "Hello World" (including the double quotes) then ANTLR will tokenize and parse this as a STRING. If we input 1234 then we’ll see this tokenized and parsed as a NUMBER. We can now start to build up our grammar from these basic building blocks, so let’s add some logic operators, AND and OR. We’ll also add a token rule to deal with whitespace characters (which we want to ignore). So change/add the following in the existing grammar

expression
    : STRING
    | NUMBER
    | expression 'AND' expression
    | expression 'OR' expression
    ;

WS  : (' '|'\t'|'\r'|'\n')+ -> skip;

WS is a token rule which skips/ignores the various whitespace characters. The name is only a convention; it’s the -> skip command that tells the lexer to discard whatever the rule matches.
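To get a feel for what the lexer does with these token rules, here’s a minimal sketch in Python that approximates them with regular expressions. This is not the code ANTLR generates: the names TOKEN_SPEC, MASTER and tokenize are my own, and Python’s first-alternative-wins matching only stands in for ANTLR’s longest-match behaviour (the two agree for the inputs shown here).

```python
import re

# Hand-written approximations of the grammar's token rules (an assumption,
# not ANTLR output). Earlier entries are tried first.
TOKEN_SPEC = [
    ("STRING", r'".*?"'),                      # '"' .*? '"'
    ("NUMBER", r'[+-]?(?:[0-9]*\.)?[0-9]+'),   # SIGN? ( [0-9]* '.' )? [0-9]+
    ("AND", r'AND'),                           # literal used in the expression rule
    ("OR", r'OR'),                             # literal used in the expression rule
    ("SIGN", r'[+-]'),                         # ('+' | '-')
    ("WS", r'[ \t\r\n]+'),                     # (' '|'\t'|'\r'|'\n')+ -> skip
]

# DOTALL lets '.' match newlines too, as ANTLR's wildcard does.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(text):
    """Return a list of (token_type, token_text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        # Mimic '-> skip': WS is matched but never handed on.
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize('"HELLO" AND 123'))
# [('STRING', '"HELLO"'), ('AND', 'AND'), ('NUMBER', '123')]
```

Notice that the whitespace between the tokens never appears in the result, which is exactly the effect we want from the WS rule.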

Let’s now take a look at the rule syntax we’ve used.

  • ' ' – literals are denoted within single quotes, so our previous declaration of STRING shows that a string starts with a literal double quote and ends with the same.
  • | – this separates alternatives, so the WS token rule is a ' ' OR '\t' OR…
  • (…) – brackets act as a subrule or grouping, so in the case of the WS token rule we’re simply creating a group of whitespace characters followed by a +, hence the + acts upon the whole group
  • + – the + sign means 1 or more, hence for the WS token rule we’re saying that WS is 1 or more of any of the supplied literal characters.
  • -> – this introduces a lexer command; in the case of WS the command is skip, which tells the lexer to discard the matched text rather than pass a token to the parser
  • . – the dot is a wildcard, so in the STRING example we’re simply saying match any single character
  • * – this means zero or more; in the STRING example, when applied to the ., i.e. .*, we’re saying a string is 0 or more of any character
  • ? – means optional, an example is the use of SIGN? which simply means a SIGN is optional. When it follows *, + or ? it instead makes that operator non-greedy, so the .*? in STRING stops at the first closing double quote rather than the last
  • […] – square brackets denote a character set, for example [0-9] means characters 0 through to 9 inclusive
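The STRING rule uses .*? rather than .*, and the trailing ? matters: it makes the match non-greedy. ANTLR’s lexer isn’t a regular expression engine, but Python’s re module uses the same greedy/non-greedy notation, so as a rough analogy (the variable names are just for illustration):

```python
import re

text = '"HELLO" AND "WORLD"'

# Non-greedy, like the grammar's '"' .*? '"': stop at the first closing quote.
non_greedy = re.findall(r'".*?"', text)

# Greedy, like '"' .* '"': run on to the last closing quote in the input.
greedy = re.findall(r'".*"', text)

print(non_greedy)  # ['"HELLO"', '"WORLD"']
print(greedy)      # ['"HELLO" AND "WORLD"']
```

With the greedy form, two strings and the operator between them would be swallowed into a single STRING, which is why the non-greedy form is used in the grammar.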

See Grammars for the rest of the Grammar syntax.

Another interesting keyword is fragment, for example

HexLiteral : '0' ('x'|'X') HexDigit+ ;
fragment HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;

A fragment is a modifier which doesn’t result in a token being visible to the parser; instead it acts like an inline rule, used within other lexer rules. What it gives us is a way to create more reusable constructs, which is useful if we want to share rules amongst other rules or just make rules more readable.
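If you’re more used to regular expressions, a fragment is a little like a pattern string that you only ever embed in other patterns. Here’s a small Python sketch of that idea; it’s an analogy rather than anything ANTLR produces, and HEX_DIGIT, HEX_LITERAL and is_hex_literal are names I’ve made up for it.

```python
import re

# Plays the role of 'fragment HexDigit': a building block that is never
# matched on its own, only embedded in other patterns.
HEX_DIGIT = r"[0-9a-fA-F]"

# Counterpart of: HexLiteral : '0' ('x'|'X') HexDigit+ ;
HEX_LITERAL = re.compile(rf"0[xX]{HEX_DIGIT}+")


def is_hex_literal(text):
    """Return True if the whole of text is a hex literal such as 0x1F."""
    return HEX_LITERAL.fullmatch(text) is not None


print(is_hex_literal("0x1F"))  # True
print(is_hex_literal("1F"))    # False
```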

Reserved keywords

ANTLR4 includes the following reserved words


import, fragment, lexer, parser, grammar, returns,
locals, throws, catch, finally, mode, options, tokens

So obviously you cannot use these for your own rule names.

Debugging our grammar with the VS Code, ANTLR4 grammar syntax support extension

This extension is really useful when working with our grammars. First off we have syntax highlighting and auto completion, which are always useful, but it also includes a debugger and can display ANTLR exceptions when we’re not matching tokens or rules.

To debug your grammar in VS Code, either edit your launch.json or add a configuration via Run | Add Configuration. Here’s a complete launch.json for testing our grammar

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug ANTLR4 grammar",
      "type": "antlr-debug",
      "request": "launch",
      "input": "sample.txt",
      "grammar": "querylanguage.g4",
      "startRule": "query",
      "printParseTree": true,
      "visualParseTree": true
    }
  ]
}

You might want to change the “input” file to a file of your choice, but basically this is just a text file where we put the text to be parsed via the grammar, so for example my sample.txt file looks like this

"HELLO" AND 123

The “startRule” is the top level rule we want to parse our input through, so in our example it’s the query rule. Of course we also need to tell antlr-debug which “grammar” to use.

Now in VS Code press the run button using this configuration and you’ll get to see the output of the parse tree, for example

Parse Tree:
query (
 expression (
  expression (
   ""HELLO""
  )
  "AND"
  expression (
   "123"
  )
 )
)

The extension also displays a lovely graphical parse tree (enabled by “visualParseTree”), so you can see how your input was parsed and which rules matched which input.
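As a rough illustration of how a tree like the one above comes about, here’s a hand-written Python sketch of the query and expression rules. It’s not what ANTLR generates. It assumes the input has already been split into (type, text) token pairs, the function names are hypothetical, and it relies on ANTLR 4 giving earlier alternatives of a left-recursive rule higher precedence (so AND binds tighter than OR) and treating binary operators as left associative. Unlike our grammar, which has no EOF in the query rule, the sketch rejects leftover tokens.

```python
# AND is listed before OR in the expression rule, so it binds more tightly.
PRECEDENCE = {"OR": 1, "AND": 2}


def parse_expression(tokens, pos, min_precedence):
    """Parse an expression starting at tokens[pos]; return (tree, next_pos)."""
    if pos >= len(tokens) or tokens[pos][0] not in ("STRING", "NUMBER"):
        raise SyntaxError(f"expected STRING or NUMBER at token {pos}")
    left = ("expression", [tokens[pos][1]])
    pos += 1
    # Fold in operators at or above the current precedence level,
    # nesting to the left for operators of equal precedence.
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos][0], 0) >= min_precedence:
        operator_type, operator_text = tokens[pos]
        right, pos = parse_expression(tokens, pos + 1, PRECEDENCE[operator_type] + 1)
        left = ("expression", [left, operator_text, right])
    return left, pos


def parse_query(tokens):
    """Entry point, mirroring the grammar's 'query' start rule."""
    tree, pos = parse_expression(tokens, 0, 1)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return ("query", [tree])


def format_tree(node, depth=0):
    """Return the tree as indented lines, one rule or token per line."""
    indent = " " * depth
    if isinstance(node, str):
        return [f'{indent}"{node}"']
    name, children = node
    lines = [f"{indent}{name} ("]
    for child in children:
        lines.extend(format_tree(child, depth + 1))
    lines.append(f"{indent})")
    return lines


sample = [("STRING", '"HELLO"'), ("AND", "AND"), ("NUMBER", "123")]
print("\n".join(format_tree(parse_query(sample))))
```

For the sample input this prints the same nesting as the extension’s output, with the two inner expression nodes sitting either side of "AND".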

Generating code

Everything we’ve looked at is great, but of course we’ll want to actually include our grammar within our application, and ANTLR helps by allowing us to use the previously downloaded JAR to generate the code for our preferred (and of course supported) language.

By default if you run

java -jar antlr-4.8-complete.jar .\querylanguage.g4

then you’ll get a set of files which includes Java source files for a lexer, parser and listener. If you want to generate a visitor as well then add the -visitor switch, like this

java -jar antlr-4.8-complete.jar -visitor .\querylanguage.g4  

As I want to generate C# source files, I can simply add the -Dlanguage switch, for example

java -jar antlr-4.8-complete.jar -visitor -Dlanguage=CSharp .\querylanguage.g4

But sadly this doesn’t work for me. It seems that, whilst the JAR supports generating C# code, the generated code is not compatible with the ANTLR .NET package. In another post we’ll look at how we can generate C# code from our grammar.

References

ANTLR Tool Command Line Options
ANTLR 4 Documentation
vscode-antlr4
Sample grammars
Grammars
Cheat Sheet
Lexer Rules