Oct 31, 2011

Parsing queries with ANTLR using embedment helper

ANTLR is the most popular Java-based parser generator, used by many products from Hibernate to Hive. Its developer has written two extremely well-written and highly practical books about it, and Martin Fowler's most recent book also relies on ANTLR for its external DSL examples.

So much high-quality information can be confusing: not only can each language be described by several different grammars, but there are also different options for the parser design. The rule of thumb seems to be that building and then walking an AST is appropriate when "compilation" is expected to take multiple passes (e.g. tree rewriting for optimization purposes). For one-pass "compilation" it can be more beneficial to use what Martin Fowler calls an Embedment Helper.

The idea is to have a Builder-like object that the parser calls whenever it detects an interesting script element. The object is implemented in the target language and embedded into the generated parser as Foreign Code. Thus no resources are spent on AST processing, and the ANTLR grammar does not become unintelligible because of large amounts of injected code.

A typical use case is building a semantic model for future use, as opposed to executing the script while parsing. The model is populated by the embedment helper. This post highlights a few simple idioms for developing a query parser with this approach.

Grammar headers
  • Remember that if you name your grammar TestQuery, ANTLR will generate the classes TestQueryParser and TestQueryLexer
  • You will want to put them into a package using the @header and @lexer::header sections
  • Embedment helper-related code goes into the @members section
grammar TestQuery;

tokens {
    FROM = 'FROM' ;
}

@header {
package net.ndolgov.antlrtest;

import org.antlr.runtime.*;
import java.io.IOException;
}

@members {
    private EmbedmentHelper helper;
}

@lexer::header {
package net.ndolgov.antlrtest;
}

Embedment helper interface

The helper is a reference to an interface that defines a callback method for each interesting grammar element.

/**
 * Embedment Helper object (see http://martinfowler.com/dslCatalog/embedmentHelper.html) called by ANTLR-generated query parser.
 */
interface EmbedmentHelper {
    /**
     * Set storage id
     * @param id storage id
     */
    void onStorage(String id);

    /**
     * Add variable definition
     * @param type variable type
     * @param name variable name
     */
    void onVariable(Type type, String name);
}
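
To make the callbacks concrete, here is a minimal sketch of a helper implementation that accumulates them into a semantic model. The names EmbedmentHelperImpl, QueryDescriptor and queryDescriptor() are the ones used by the facade later in this post, but their bodies are not shown here, so everything inside them (the fields, the list-based storage, the null check) is an assumption made for illustration. The Type enumeration and the interface are repeated only so that the sketch compiles on its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Variable types; mirrors the Type enumeration referenced by the grammar. */
enum Type { LONG, DOUBLE }

/** Callback interface from the post, repeated so the sketch is self-contained. */
interface EmbedmentHelper {
    void onStorage(String id);
    void onVariable(Type type, String name);
}

/** Hypothetical semantic model: the storage id plus the variables in query order. */
final class QueryDescriptor {
    final String storageId;
    final List<String> variableNames;
    final List<Type> variableTypes;

    QueryDescriptor(String storageId, List<String> names, List<Type> types) {
        this.storageId = storageId;
        // Defensive copies keep the model immutable after parsing.
        this.variableNames = Collections.unmodifiableList(new ArrayList<>(names));
        this.variableTypes = Collections.unmodifiableList(new ArrayList<>(types));
    }
}

/** Hypothetical helper implementation that collects parser callbacks. */
final class EmbedmentHelperImpl implements EmbedmentHelper {
    private String storageId;
    private final List<String> names = new ArrayList<>();
    private final List<Type> types = new ArrayList<>();

    @Override
    public void onStorage(String id) {
        this.storageId = id;
    }

    @Override
    public void onVariable(Type type, String name) {
        types.add(type);
        names.add(name);
    }

    /** Build the model once parsing has finished. */
    QueryDescriptor queryDescriptor() {
        if (storageId == null) {
            throw new IllegalStateException("No storage id was parsed");
        }
        return new QueryDescriptor(storageId, names, types);
    }
}
```

Because the helper is a plain Java object, it can be unit-tested by calling onVariable and onStorage directly, without involving ANTLR at all.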

Production rules

Now let's look at the rest of the grammar file to see how the helper is invoked. Note that the top-level production rule name "query" will result in a method called "query()" generated by ANTLR in the parser.

query : SELECT variable (',' variable )*
   FROM ID {helper.onStorage($ID.text);};
variable : type=varType id=ID {helper.onVariable(type, $id.text);};
varType returns [Type type]
 : LONG_TYPE {$type = Type.LONG;}
 | DOUBLE_TYPE {$type = Type.DOUBLE;};
  • Notice how Java code is embedded into the grammar using curly brackets, e.g. "{helper.onStorage($ID.text);}"
  • Notice how to refer to a parsed piece of the query, e.g. "$ID.text" in the case of the storage id
  • Notice how an alias (label) can be used for the same purpose, e.g. "type=varType" and later "{helper.onVariable(type ..."
  • Instead of returning just a substring of the original query, a rule can return a Java object, e.g. "varType returns [Type type]" where Type is an enumeration type. To set the returned value we assign "{$type = Type.LONG;}". This also shows how to use Java enumerations in ANTLR grammars.
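
To show where the embedded actions fire, the following hand-written recursive-descent parser mirrors the three production rules one method per rule. It is only a stand-in for illustration and is not the code ANTLR generates. The class name HandWrittenQueryParser is made up, the tokenizer is deliberately crude, and the keyword spellings 'SELECT', 'LONG' and 'DOUBLE' are assumptions because the post does not show those token definitions.

```java
/** Variable types; mirrors the Type enumeration referenced by the grammar. */
enum Type { LONG, DOUBLE }

/** Callback interface from the post, repeated so the sketch is self-contained. */
interface EmbedmentHelper {
    void onStorage(String id);
    void onVariable(Type type, String name);
}

/** Hand-written stand-in for the generated parser (NOT what ANTLR emits). */
final class HandWrittenQueryParser {
    private final String[] tokens;
    private final EmbedmentHelper helper;
    private int pos = 0;

    HandWrittenQueryParser(String query, EmbedmentHelper helper) {
        // Crude tokenizer: pad commas with spaces, then split on whitespace.
        this.tokens = query.trim().replace(",", " , ").split("\\s+");
        this.helper = helper;
    }

    // query : SELECT variable (',' variable)* FROM ID {helper.onStorage($ID.text);}
    void query() {
        expect("SELECT");
        variable();
        while (pos < tokens.length && tokens[pos].equals(",")) {
            pos++;
            variable();
        }
        expect("FROM");
        helper.onStorage(next());
    }

    // variable : type=varType id=ID {helper.onVariable(type, $id.text);}
    private void variable() {
        Type type = varType();
        helper.onVariable(type, next());
    }

    // varType returns [Type type]
    private Type varType() {
        String token = next();
        switch (token) {
            case "LONG":   return Type.LONG;    // assumed spelling of LONG_TYPE
            case "DOUBLE": return Type.DOUBLE;  // assumed spelling of DOUBLE_TYPE
            default: throw new IllegalArgumentException("Unknown type: " + token);
        }
    }

    private String next() {
        if (pos >= tokens.length) {
            throw new IllegalArgumentException("Unexpected end of query");
        }
        return tokens[pos++];
    }

    private void expect(String keyword) {
        String token = next();
        if (!token.equals(keyword)) {
            throw new IllegalArgumentException("Expected " + keyword + " but found " + token);
        }
    }
}
```

For a query such as "SELECT LONG a, DOUBLE b FROM storage1" the helper receives onVariable once per variable, in declaration order, and onStorage last, because the storage action sits at the end of the query rule.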

Embedment helper injection
Let's look again at the @members section, this time updated to include helper injection.
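@members {
    private EmbedmentHelper helper;

    public <EHT extends EmbedmentHelper> EHT parseWithHelper(EHT helper) throws RecognitionException {
        this.helper = helper;
        query(); // run the top-level rule; the embedded actions call back into the helper
        return helper;
    }
}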

The parseWithHelper method offers three conveniences:
  • the method returns exactly the helper type it is given
  • it assigns the given helper to a private field so that the helper can be called from the code embedded into the grammar
  • it hides the call to the top-level query method of the generated parser
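
The first point relies on a generic method whose type parameter is bounded by the helper interface. The sketch below shows the effect in isolation. FakeQueryParser and CountingHelper are hypothetical names, and the query() method here simply fakes the callbacks for one variable instead of parsing anything.

```java
/** Variable types; mirrors the Type enumeration referenced by the grammar. */
enum Type { LONG, DOUBLE }

/** Callback interface from the post, repeated so the sketch is self-contained. */
interface EmbedmentHelper {
    void onStorage(String id);
    void onVariable(Type type, String name);
}

/** Hypothetical helper that only counts the declared variables. */
final class CountingHelper implements EmbedmentHelper {
    private int variables;

    @Override
    public void onStorage(String id) {
        // The storage id is not needed for counting.
    }

    @Override
    public void onVariable(Type type, String name) {
        variables++;
    }

    int variableCount() {
        return variables;
    }
}

/** Stand-in for the generated parser; query() fakes the callbacks of a real parse. */
final class FakeQueryParser {
    private EmbedmentHelper helper;

    /** Same shape as parseWithHelper in the grammar: the caller gets its own helper type back. */
    <EHT extends EmbedmentHelper> EHT parseWithHelper(EHT helper) {
        this.helper = helper;
        query();
        return helper;
    }

    private void query() {
        helper.onVariable(Type.LONG, "a");
        helper.onStorage("s");
    }
}

final class ParseWithHelperDemo {
    public static void main(String[] args) {
        // No cast is needed: the compiler infers EHT = CountingHelper.
        int count = new FakeQueryParser().parseWithHelper(new CountingHelper()).variableCount();
        System.out.println(count); // prints 1
    }
}
```

Had the method been declared to return EmbedmentHelper, the caller would need a cast before reaching implementation-specific methods such as variableCount().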

Parser facade
It can also be convenient to hide parser instantiation from the client code behind a facade class:

import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;

/**
 * Parse a given query and return extracted execution-time representation of the parsed query
 */
public final class QueryParser {
    /**
     * Parse a query expression and return the extracted request configuration
     * @param expr query expression
     * @return extracted request configuration
     */
    public static QueryDescriptor parse(String expr) {
        try {
            final TestQueryParser parser = new TestQueryParser(new CommonTokenStream(new TestQueryLexer(new ANTLRStringStream(expr))));
            final EmbedmentHelperImpl helper = parser.parseWithHelper(new EmbedmentHelperImpl());

            return helper.queryDescriptor();
        } catch (RecognitionException e) {
            throw new RuntimeException("Could not parse query: " + expr, e);
        }
    }
}

Parsing error processing
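One last thing to remember is error processing. By default an ANTLR 3 parser reports a syntax error by printing a message to stderr and then tries to recover and continue. In our case we want to fail fast, so we override emitErrorMessage and just throw a runtime exception. Note that the @members section applies to the generated parser only; if lexer errors should fail the same way, a similar override would presumably be needed in a @lexer::members section.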

@members {
    // ... helper field and parseWithHelper as shown above

    @Override
    public void emitErrorMessage(String msg) {
        throw new IllegalArgumentException("Query parser error: " + msg);
    }
}

The complete source code is available as a Maven project on GitHub.
