ANTLR in C#

In the previous post, Starting out with ANTLR, we looked at the basics of creating a grammar and generating code from it. Now let’s take that very simple grammar and integrate it into a C# application.

Here’s the grammar again (from our grammar file, QueryLanguage.g4).

Note: We’re going to capitalize the grammar name (and the file name, since the two must match) as this will then be more in the style of C# class names.

grammar QueryLanguage;

query
    : expression
    ;

expression
    : STRING
    | NUMBER
    | expression 'AND' expression
    | expression 'OR' expression
    ;

WS  : (' '|'\t'|'\r'|'\n')+ -> skip;

STRING : '"' .*? '"';
SIGN
   : ('+' | '-')
   ;
NUMBER  
    : SIGN? ( [0-9]* '.' )? [0-9]+;

The C# code generated by the ANTLR4 JAR is not compatible with the ANTLR4 NuGet package, so for our Example application we’ll use an alternative, the Antlr4.CodeGenerator package. Follow these steps to create an application

  • Create a .NET Core Console application
  • Edit the SDK project file and change netcoreapp3.1 to net472
  • Add the Antlr4.Runtime and Antlr4.CodeGenerator NuGet packages
  • Add your QueryLanguage.g4 grammar to the project

If you select the .g4 file you can now view the properties for that file within Visual Studio 2019 and (if you wish to) change what’s generated by ANTLR. Let’s just ensure Generate Visitor is Yes.

For some reason a project created as a .NET Framework 4.7.2 project does not include these properties, and whilst we can edit the .csproj file to get things working, I’ve found the above steps the simplest way to get ANTLR running in a .NET application at the time of writing.

I’ve found I do still need to edit the .csproj file to add the following

<ItemGroup>
  <Antlr4 Update="QueryLanguage.g4">
    <Listener>false</Listener>
    <CustomToolNamespace>Example.Generated</CustomToolNamespace>
  </Antlr4>
</ItemGroup>

<PropertyGroup>
  <Antlr4UseCSharpGenerator>True</Antlr4UseCSharpGenerator>
</PropertyGroup>

Change Example.Generated to the preferred namespace for the generated files.

Now build the project and if all goes well there should be no errors and the ANTLR code should be generated in obj/Debug/net472 (or whatever configuration you’re using).

Let’s now make some changes to our grammar to make writing Visitor code simpler, by adding labels to the alternatives of our expression rule. The changes are listed below

expression
    : STRING #String
    | NUMBER #Number
    | expression 'AND' expression #And
    | expression 'OR' expression  #Or
    ;

We use # to label an alternative, and each label turns into a Visit method named after it, i.e. VisitAnd, VisitOr etc.

All we’re going to do with this grammar is use the Visitor pattern/class to generate output where strings are all lowercase, AND becomes & and OR becomes |. Of course you could produce byte code or do all sorts of things with your input.

Create a new file named QueryLanguageVisitor.cs and it should look like this

using Example.Generated;

namespace Example
{
  public class QueryLanguageVisitor : QueryLanguageBaseVisitor<string>
  {
    public override string VisitString(QueryLanguageParser.StringContext context)
    {
      return context.GetText().ToLower();
    }

    public override string VisitNumber(QueryLanguageParser.NumberContext context)
    {
      return context.GetText();
    }

    public override string VisitAnd(QueryLanguageParser.AndContext context)
    {
      return Visit(context.expression(0)) + "&" + Visit(context.expression(1));
    }

    public override string VisitOr(QueryLanguageParser.OrContext context)
    {
      return Visit(context.expression(0)) + "|" + Visit(context.expression(1));
    }
  }
}

As you can see from the above code, we subclass QueryLanguageBaseVisitor (a generated class) and set the generic parameter to string, as the result of running our input through the QueryLanguageVisitor will simply be another string.

AND and OR are of course binary expressions, i.e. they require an operand either side of the AND or OR, and these operands may themselves be expressions, hence we call Visit on each of them.
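To see why the recursive Visit calls matter, here is a minimal sketch of the same idea without ANTLR. The node classes (Str, Num, And, Or) and the Renderer class are hypothetical stand-ins for the generated context classes and our visitor; they are not part of what ANTLR generates.

```csharp
using System;

// Hypothetical stand-ins for the generated parse tree contexts.
public abstract class Node { }
public sealed class Str : Node { public string Text; public Str(string text) { Text = text; } }
public sealed class Num : Node { public string Text; public Num(string text) { Text = text; } }
public sealed class And : Node { public Node Left, Right; public And(Node l, Node r) { Left = l; Right = r; } }
public sealed class Or : Node { public Node Left, Right; public Or(Node l, Node r) { Left = l; Right = r; } }

public static class Renderer
{
    // Mirrors QueryLanguageVisitor: leaves return their text,
    // binary nodes recurse into both operands before joining them.
    public static string Render(Node node)
    {
        switch (node)
        {
            case Str s: return s.Text.ToLower();
            case Num n: return n.Text;
            case And a: return Render(a.Left) + "&" + Render(a.Right);
            case Or o: return Render(o.Left) + "|" + Render(o.Right);
            default: throw new ArgumentException("Unknown node type", nameof(node));
        }
    }
}
```

For example, Renderer.Render(new Or(new And(new Str("\"A\""), new Str("\"B\"")), new Num("3"))) walks the nested And first and returns "a"&"b"|3.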

At this point, we have nothing to actually run the QueryLanguageVisitor so in the Main method place the following code

// add these using clauses
// using Antlr4.Runtime;
// using Example.Generated;

// example expression
var expression = "\"HELLO\" AND 123";

var inputStream = new AntlrInputStream(expression);
var lexer = new QueryLanguageLexer(inputStream);
var tokenStream = new CommonTokenStream(lexer);
var parser = new QueryLanguageParser(tokenStream);

var visitor = new QueryLanguageVisitor();
var query = parser.query();
var result = visitor.Visit(query);

In the code above, we create an ANTLR input stream (you can of course use an AntlrFileStream if you’re taking input from a file). Next we pass the input stream to our generated lexer, which is passed into the CommonTokenStream, and this in turn is passed into our generated QueryLanguageParser.

Finally we create our newly added QueryLanguageVisitor. The parser has a method for each rule in our grammar; in our case the start rule is query, hence we call this method and pass the result into the Visit method of our QueryLanguageVisitor. The result (assuming no errors) will be

"hello"&123

A more fully featured (i.e. includes error handling) implementation would be as follows (concepts and code snippets taken from various existing samples)

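// add these using clauses
// using System;
// using System.Linq;
// using Antlr4.Runtime;
// using Example.Generated;

public class ParserResult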
{
  public bool IsValid { get; internal set; }
  public int ErrorPosition { get; internal set; } = -1;
  public string ErrorMessage { get; internal set; }
  public string Result { get; internal set; }
}

public static class Query
{
  public static ParserResult Parse(string expression, bool secondRun = false)
  {
    if (String.IsNullOrWhiteSpace(expression))
    {
      return new ParserResult
      {
        IsValid = true,
        Result = null
      };
    }

    var inputStream = new AntlrInputStream(expression);
    var lexer = new QueryLanguageLexer(inputStream);
    var tokenStream = new CommonTokenStream(lexer);
    var parser = new QueryLanguageParser(tokenStream);

    lexer.RemoveErrorListeners();
    parser.RemoveErrorListeners();
    var customErrorListener = new QueryLanguageErrorListener();
    parser.AddErrorListener(customErrorListener);
    var visitor = new QueryLanguageVisitor();

    var queryExpression = parser.query();
    var result = visitor.Visit(queryExpression);
    var isValid = customErrorListener.IsValid;
    var errorLocation = customErrorListener.ErrorLocation;
    var errorMessage = customErrorListener.ErrorMessage;
    // a null result means the visitor produced nothing, so treat the input as invalid
    if (result == null)
    {
      isValid = false;
    }

    if (!isValid && !secondRun)
    {
      var cleanedFormula = string.Empty;
      var tokenList = tokenStream.GetTokens().ToList();
      for (var i = 0; i < tokenList.Count - 1; i++)
      {
        cleanedFormula += tokenList[i].Text;
      }
      var originalErrorLocation = errorLocation;
      var retriedResult = Parse(cleanedFormula, true);
      if (!retriedResult.IsValid)
      {
        retriedResult.ErrorPosition = originalErrorLocation;
        retriedResult.ErrorMessage = errorMessage;
      }
      return retriedResult;
    }
    return new ParserResult
    {
      IsValid = isValid,
      Result = isValid || result != null 
        ? result
        : null,
      ErrorPosition = errorLocation,
      ErrorMessage = isValid ? null : errorMessage
    };
  }
}

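// add these using clauses (namespaces assumed from the Antlr4.Runtime package used above)
// using Antlr4.Runtime;
// using Antlr4.Runtime.Atn;
// using Antlr4.Runtime.Dfa;
// using Antlr4.Runtime.Sharpen;

public class QueryLanguageErrorListener : BaseErrorListener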
{
  public bool IsValid { get; private set; } = true;
  public int ErrorLocation { get; private set; } = -1;
  public string ErrorMessage { get; private set; }

  public override void ReportAmbiguity(
    Parser recognizer, DFA dfa, 
    int startIndex, int stopIndex, 
    bool exact, BitSet ambigAlts, 
    ATNConfigSet configs)
  {
    IsValid = false;
  }

  public override void ReportAttemptingFullContext(
    Parser recognizer, DFA dfa, 
    int startIndex, int stopIndex, 
    BitSet conflictingAlts, SimulatorState conflictState)
  {
    IsValid = false;
  }

  public override void ReportContextSensitivity(
    Parser recognizer, DFA dfa, 
    int startIndex, int stopIndex, 
    int prediction, SimulatorState acceptState)
  {
    IsValid = false;
  }

  public override void SyntaxError(
    IRecognizer recognizer, IToken offendingSymbol, 
    int line, int charPositionInLine, 
    string msg, RecognitionException e)
  {
    IsValid = false;
    ErrorLocation = ErrorLocation == -1 ? charPositionInLine : ErrorLocation;
    ErrorMessage = msg;
  }
}

Now the code that uses our parser simply looks like this (and includes error handling)

var expression = "\"HELLO\" AND 123";
var result = Query.Parse(expression);
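Since ParserResult carries an ErrorPosition (the charPositionInLine reported to SyntaxError), one thing we might do with it is show the user where the problem is. The following is only a sketch: the ErrorDisplay class and PointAt method are names I’ve made up for illustration, and it assumes a single-line expression.

```csharp
using System;

public static class ErrorDisplay
{
    // Returns the expression with a caret on the following line, under errorPosition.
    // A position of -1 (the "no error" default of ParserResult) returns the expression unchanged.
    public static string PointAt(string expression, int errorPosition)
    {
        if (errorPosition < 0)
        {
            return expression;
        }

        // Clamp so that a position at or past the end of the input still yields a caret.
        var column = Math.Min(errorPosition, expression.Length);
        return expression + "\n" + new string(' ', column) + "^";
    }
}
```

With the result from Query.Parse above, this could be used as Console.WriteLine(ErrorDisplay.PointAt(expression, result.ErrorPosition));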

Starting out with ANTLR

ANTLR (ANother Tool for Language Recognition) is a parser generator.

You might be designing your own programming/scripting language or defining something a little simpler, such as a query language for your application or even just something to parse user input (which is not necessarily simpler).

ANTLR can be downloaded from the ANTLR website and whilst the tool itself is written in Java, it can generate parser code in Java, C#, Python, JavaScript and more.

Note: I’ve downloaded the “Complete ANTLR 4.8 Java binaries jar” hence the commands listed will all be in relation to that JAR.

Grammar files

Before we can do anything meaningful with ANTLR we need to define a grammar for our lexer and parser to be generated from. Grammars are text files with the extension .g or .g4 (for ANTLR 4 compatible grammars) and look a lot like BNF. Let’s create a grammar for a very simple query language that we’d like to incorporate into an application.

We start by creating a file named querylanguage.g4

I’m using Visual Studio Code to write the grammar and using the excellent ANTLR4 grammar syntax support extension to aid in developing the grammar.

The first thing we add to the grammar is

grammar querylanguage;

The grammar name must match the name of your grammar file. It can be prefixed with lexer or parser if you want to write grammar specific to either a lexer or parser, without these we’re creating a combined lexer and parser grammar.

Here are all the options for the grammar keyword.

lexer grammar querylanguage;

// or

parser grammar querylanguage;

// or 

grammar querylanguage;

Before we extend our grammar let’s talk about comments – single line comments can be created using // and multi line using /* */.

We create our grammar using a combination of rules and tokens. Parser rule names must start with a lowercase letter (camelCase by convention) whereas token names must start with an uppercase letter (all upper case by convention). We’ll need an entry point (or start rule), so let’s extend our grammar with a start rule named query, which will be made up of an expression rule. We’ll also add tokens to handle strings and numbers.

grammar querylanguage;

query
    : expression
    ;

expression
    : STRING
    | NUMBER
    ;

STRING : '"' .*? '"';
SIGN
   : ('+' | '-')
   ;
NUMBER  
    : SIGN? ( [0-9]* '.' )? [0-9]+;

We’ll look at the syntax of our rules etc. in a moment.

As you can see, a rule is made up of a name followed by a colon, then one or more rules or tokens, and finally a terminating ;. Alternatives are separated using the | operator. So in the example above we have two parser rules, query and expression, and three token rules, STRING, SIGN and NUMBER.

If we input “Hello World” (including the double quotes) then ANTLR will tokenize and parse this as a STRING. If we input 1234 then we’ll see this tokenized and parsed as a NUMBER. We can now start to build up our grammar from these basic building blocks, so let’s add some logic operators, AND and OR. We’ll also add a token rule to deal with whitespace characters (which we want to ignore). So change/add the following to the existing grammar

expression
    : STRING
    | NUMBER
    | expression 'AND' expression
    | expression 'OR' expression
    ;

WS  : (' '|'\t'|'\r'|'\n')+ -> skip;

WS is a special token rule which basically skips/ignores the various whitespace characters.

Let’s now take a look at the rule syntax we’ve used.

  • ‘ ‘ – we denote literals within single quotes, so our previous declaration of a STRING shows that a string starts with a literal double quote and ends with the same.
  • | – this is used to separate alternatives, so the WS token rule is a ‘ ‘ OR ‘\t’ OR…
  • (…) – brackets act as a subrule or grouping, so in the case of the WS token rule we’re simply creating a group of whitespace characters followed by a +, hence the + acts upon the whole group of characters
  • + – the + sign means 1 or more, hence for the WS token rule we’re saying that WS is 1 or more of any of the supplied literal characters.
  • -> – this introduces a lexer command; in the case of WS the command is skip, which simply means the matched whitespace is discarded and never reaches the parser
  • . – the dot is a wildcard, so in the STRING example we’re simply saying match any character
  • * – this means zero or more; in the STRING example it is applied to the ., i.e. .* says a string contains 0 or more of any character
  • ? – means optional, an example is the use of SIGN? which simply means a SIGN is optional. Placed after * or + (as in .*? in STRING) it instead makes the match non-greedy, so the string ends at the first closing double quote
  • […] – square brackets denote a character set, for example [0-9] means characters 0 through to 9 inclusive
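If you’re more used to regular expressions, the token rules can be read in much the same way. The sketch below expresses NUMBER and STRING as .NET regular expressions purely to illustrate the operators listed above. It is only an approximation (ANTLR’s lexer is not a regex engine) and the TokenPatterns class is a name made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class TokenPatterns
{
    // NUMBER : SIGN? ( [0-9]* '.' )? [0-9]+
    private static readonly Regex Number =
        new Regex(@"\A[+-]?([0-9]*\.)?[0-9]+\z");

    // STRING : '"' .*? '"'
    // The non-greedy .*? stops at the first closing quote,
    // which is written here as "anything except a double quote".
    private static readonly Regex QuotedString =
        new Regex(@"\A""[^""]*""\z");

    public static bool IsNumber(string text) => Number.IsMatch(text);

    public static bool IsString(string text) => QuotedString.IsMatch(text);
}
```

So TokenPatterns.IsNumber("-.5") and TokenPatterns.IsString("\"Hello World\"") both return true, whereas TokenPatterns.IsNumber("1.") returns false because the rule requires at least one digit after the decimal point.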

See Grammars for the rest of the Grammar syntax.

Another interesting keyword is fragment, for example

HexLiteral : '0' ('x'|'X') HexDigit+ ;
fragment HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;

A fragment is a modifier which doesn’t result in a token being visible to the parser; it’s more like an inline rule, in that it’s used “inline” within other token rules. What it gives us is a way to create more reusable constructs, which is useful if we want to share a pattern amongst several rules or just make rules more readable.
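A loose analogy in C#, for anyone who finds it helpful: a fragment is a bit like a private string constant that is only ever used to build up other patterns. The sketch below mirrors the HexLiteral example using a .NET regular expression; the HexPatterns class is a made-up name and this is not how ANTLR implements fragments.

```csharp
using System.Text.RegularExpressions;

public static class HexPatterns
{
    // Plays the role of "fragment HexDigit": never matched on its own, only reused.
    private const string HexDigit = "[0-9a-fA-F]";

    // HexLiteral : '0' ('x'|'X') HexDigit+
    private static readonly Regex HexLiteral =
        new Regex(@"\A0[xX]" + HexDigit + @"+\z");

    public static bool IsHexLiteral(string text) => HexLiteral.IsMatch(text);
}
```

Here HexPatterns.IsHexLiteral("0x1F") returns true and HexPatterns.IsHexLiteral("0x") returns false, as at least one hex digit is required.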

Reserved keywords

ANTLR4 includes the following reserved words


import, fragment, lexer, parser, grammar, returns,
locals, throws, catch, finally, mode, options, tokens

So obviously you cannot use these for your own rule names.

Debugging our grammar with the VS Code, ANTLR4 grammar syntax support extension

This extension is really useful when working with our grammars. First off we have syntax highlighting and auto completion, which is always useful, but it also includes a debugger and can display ANTLR exceptions when we’re not matching tokens or rules.

To debug your grammar in VS Code, either edit your launch.json or add a configuration via Run | Add Configuration. Here’s the configuration for testing our grammar

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug ANTLR4 grammar",
      "type": "antlr-debug",
      "request": "launch",
      "input": "sample.txt",
      "grammar": "querylanguage.g4",
      "startRule": "query",
      "printParseTree": true,
      "visualParseTree": true
    }
  ]
}

You might want to change the “input” file to a file of your choice, but basically this is just a text file where we put the text to be parsed via the grammar, so for example my sample.txt file looks like this

"HELLO" AND 123

The “startRule” is the top level rule we want to parse our input through, so in our example it’s the query rule. Of course we also need to tell antlr-debug which “grammar” to use.

Now in VS Code press the run button using this configuration and you’ll get to see the output of the parse tree, for example

Parse Tree:
query (
 expression (
  expression (
   ""HELLO""
  )
  "AND"
  expression (
   "123"
  )
 )
)

There’s also a lovely visual parse tree (enabled by the visualParseTree setting) so you can see how your input was parsed and which rules matched which part of it.

Generating code

Everything we’ve looked at is great, but of course we’ll want to actually include our grammar within our application, and ANTLR helps by allowing us to use the previously downloaded JAR to generate the code for our preferred (and of course supported) language.

By default if you run

java -jar antlr-4.8-complete.jar .\querylanguage.g4

Then you’ll get a set of files produced which include Java source files for a Lexer, Parser and Listener. If you want to generate a Visitor then add the -visitor switch, like this

java -jar antlr-4.8-complete.jar -visitor .\querylanguage.g4  

As I want to generate C# source files, we can simply add the -Dlanguage switch, for example

java -jar antlr-4.8-complete.jar -visitor -Dlanguage=CSharp .\querylanguage.g4

But sadly this doesn’t give us code we can use. It seems that whilst the JAR supports generating C# code, the generated code is not compatible with the ANTLR .NET NuGet package. In the next post, ANTLR in C#, we’ll look at how we can generate C# code from our grammar.

References

ANTLR Tool Command Line Options
ANTLR 4 Documentation
vscode-antlr4
Sample grammars
Grammars
Cheat Sheet
Lexer Rules