Aug 21, 2016

Customizing Lucene scoring loop with DIY scorer

Since my last Lucene post I have been shown what seems to be the ultimate way to customize Lucene scoring loop. If you squint enough, you could probably see it discussed in blogs and official documentation. But it is rather hard to visualize it in detail unless you are already well knowledgeable of the internals. In a certain sense it's more like wholesale replacement of the standard Lucene query processing path with something more lean and specialized. 

As usual, I am mostly interested in using Lucene in the OLAP context as opposed to classical full-text search. It implies the need to retrieve values from all the documents matching some predicates. It also mostly requires numeric fields. There are a few systems out there that internally use an inverted index for fast analytics and you could imagine them using Lucene in early prototypes.

Before we start, I should probably remind you that internally Lucene creates multiple index segments. Every segment is represented by a leaf reader context in the search path. You can easily obtain them from the standard IndexReader. And they will be our entry point into the DIY scoring loop.

If you are interested in this area, you most likely have some proprietary query DSL. Under normal circumstances, an expression in that DSL format is converted to a Lucene query. Most likely it takes a top-level BooleanQuery with multiple clauses to represent all query predicates. Internally, Lucene uses such constructions as ConjunctionDISI to represent those clauses.

Fundamentally, the idea is to:

  • use your DSL instead of standard Lucene Query/Weight abstractions
  • implement your own scorer hierarchy, mostly TermScorer-like ones for numeric types and Conjunction/Disjunction ones to compose them into trees; yours will be simpler and you also will be able to support resetting them and save on memory allocation
  • implement you own query compiler to transform your DSL query into a composite of the new scorers
  • continue using document processing of the kind discussed in previous posts; this is an efficient way to accumulate results in a more iterative manner
To illustrate this approach I implemented one scorer and a couple of toy DSL classes to represent a query, a compiler for it, and a document processor. With the exception of the scorer, all of them would be different in your system and so cannot be implemented in a reusable way.

In my example I have got a few representative abstractions and their most trivial implementations ever. Any realistic implementation would venture too far from Lucene proper but you should be able to see the idea. A quick look at the test code should be enough to see how the parts fit each other.

  • IndexSearcher - a perfectly standard Lucene searcher that we use to find leaf contexts
  • QueryBuilder - represents a DIY Lucene query compiler that transforms a DSL query into a Scorer (in real life, most likely hiding a Scorer tree behind a composite class)
  • Searcher - DIY IndexSearcher replacement that knows how to find index segments and apply the same scorer and processor to all of them
  • Processor - a mean of collecting field values from matching documents
  • QueryScorer -  DIY scorer that replaces Scorers and Weights and can be reused
I can imagine a less onerous approach where you would reuse standard scorers. Unfortunately it requires dealing with many low-level details and interfaces even though you presumably don't need their functionality. Well, if nothing else, this discussion illustrates how little it takes to retrieve data from a Lucene index when you don't need to actually score the results. 

No comments: