Dec 27, 2016

Parquet file query DSL with ANTLR

Last time on Sheldon Cooper presents fun with flags we saw a lightweight alternative to ANTLR used to parse a simple query DSL. I was having too much trouble with another alternative so in the meantime I decided to finish a proper ANTLR version of the parser. As luck would have it, in ANTLR4 there is more than one way to write a parser. So for a low, low price of a single grammar file you get a parser written in the officially recommended way and the other following a more traditional approach.

Strictly speaking, there are two grammar files but if you squint a little you can see one was created by adding actions to a copy of the other. ANTLR4 makes an attempt to avoid actions in a grammar file to simplify the file and make possible parser generation in languages other than Java. 

Instead of actions one can implement a parse tree listener interface. The good news is that it indeed allows better isolation between grammars and Java code. The bad news is that there is even more choice if you want to do something that requires sharing state between parser rules. Like building an AST.

Of the two simplest choices, I decided to use a manually maintained stack. It could probably be done with a visitor implementation returning AST nodes but the listener with a stack approach makes the juxtaposition with Parboiled even better. A new trick in ANTLR4 is using "# Something" snippets in the grammar to make ANTLR generate a listener method for every rule alternative. 

The listener ended up looking more similar to the Parboiled parser than I had expected. That made me think about trying the traditional approach with actions embedded into grammar. At first glance, the listener approach looks cleaner indeed at no discernible cost.

ANTLR is still very good with skipping white spaces at the lexer level. No need to contaminate the parser rules with redundant checks everywhere. But to my surprise I actually found a couple other problems. To begin with, there is no easy way to make keywords case-insensitive. Secondly, I failed to write a parser rule that would return "unquoted" string values without breaking the ID lexer rule. I am probably not the only one confused with the latter. After some experimentation I simply settled on manually calling String::substring.

In addition to two spectacular books there is a lot of online documentation as well. The ANTLR maven plugin works seamlessly with no additional configuration so the gossip of "additional build process step" is widely exaggerated. 

Between an easier to read grammar file and very powerful IDE integration ANTLR remains the leader. But from what I see the code size and complexity are roughly the same as for Parboiled. There seems to be the only 300K ANTLR runtime library. Parboiled has half a dozen dependencies (mostly ASM jars) of comparable total size.

So as a provisional conclusion, I don't feel strongly anymore that ANTLR is the only right answer to parsing needs in Java. I should probably run a performance test at some point but I doubt +-20ms is of real concern to most small DSLs consumers.

Dec 24, 2016

Parquet file query DSL with parboiled

Now that we know how to push down Parquet predicates the next logical step is to come up with a DSL to make writing predicates easier. The DSL would be handy for debugging, especially if it can be implemented with few additional dependencies. Running Spark just to have access to Dataset query syntax is not always convenient.

What I have in mind is something very straightforward, after all there is little else to express than predicates. It could be as simple as "SELECT * FROM '/tmp/somefile.par' WHERE ('column1'=42) || ('column2'=24)" to filter a file by a couple of columns.

http://mshang.ca/syntree/?i=%5BQUERY%20%5BSELECT%20123%2C456%5D%20%5BFROM%20%2Ftmp%2Ffile.par%5D%20%5BWHERE%20%5B%26%26%20%5B%3D%5B'col1'%5D%20%5B33%5D%5D%20%5B%7C%7C%20%5B%3D%5B'col2'%5D%20%5B42%5D%5D%20%5B%3D%5B'col2'%5D%20%5B24%5D%5D%5D%5D%5D%5D

When I think about query parsers ANTLR is the default choice. Its functionality, stability, and documentation are outstanding. But the syntax I mentioned makes me think that a lighter approach could be possible. So to a significant degree this exercise will be an attempt to compare a few parser libraries in the context of very small DSLs.

I decided to start with a heterogeneous AST representation and the pseudo-SQL syntax I mentioned above. To evaluate the alternatives I am going to pay attention to:
  • parser-related code size, complexity
  • grammar readability
  • transitive dependencies brought by the parser library
  • build process changes required (e.g. maven plugins)
  • tool chain, support for debugging of parser errors

Today's contender is Parboiled and its Java library in particular. The actual parser for my DSL is a reasonably small class. Syntactically, the PEG rules look quite isomorphic to how I imagine the corresponding CFG. The library provides a built-in stack that is very helpful for building AST nodes. When the stack is not enough, one can also use action variables that allow collecting results from other rules. 

Parboiled wiki provides a lot of useful information and in addition there are a few nice parser examples. Parboiled is a pure Java library so there is no need for additional maven plugins. It has just a few ASM dependencies. As for debugging, I found the tracing parser very convenient. 

Coding-wise, it took me some time to get used to calling "push/pop" as the last Rule in a Sequence. I also started with too many action variables but was able to covert most of them to operations on the stack. 

One minor problem with Parboiled is the lack of easy means of skipping white spaces in the tokenizer. I am pretty sure my current parser can be broken by adding space characters to a query here and there. Guarding every Sequence with a white space consuming rule seems to be the only viable approach.