Sep 3, 2015

Customizing Lucene scoring loop

By default, one thinks about Lucene as a full-text search engine. You give it many text files and it will be able to find the most relevant ones for a query. The API is quite straightforward and there is no need to know anything about what happens underneath it.

There is another, arguably less orthodox, way to think about it. Lucene is widely known for its search speed. So instead of looking for the most relevant text documents, one could use it as a more generic data store with excellent indices. Some people have used it this way to store time series.

Once you start this kind of development, you quickly recognize the need to call Lucene efficiently. It's trivial to have millions of documents matching your query. So if you want to extract data from all of them, and not just a few top-scoring ones, you will be wise to pay attention to new object allocation. It starts with simple things such as caching and re-using Lucene document and field instances. Then you stop calling search methods that return arrays and switch to callback-based ones.

And then there is the next stage, when you find out that it's possible to go a level down and abuse Lucene a little. This approach is usually referred to as a "custom scoring loop". Oddly enough, by far the best explanation of it I have seen is a blog post. I can't recommend that description enough: the author knows his stuff and goes quite deep into Lucene internals.

The idea is to:

  • create a special kind of Query that can register a listener for matching documents
  • when the listener is called, instead of scoring a document in a different way, read the fields required for processing; if possible, process them right there
  • ignore the results returned by IndexSearcher

It's actually surprisingly simple to implement this idea. It's enough to extend three classes. The sequence diagram below shows a representative control flow:

  • extend StoredFieldVisitor to have a Visitor that knows the Lucene document fields you want to read
  • create a Processor that owns the Visitor and can be notified about availability of document field values
  • extend CustomScoreProvider to have a matching document listener; when called, apply the Visitor to read field values and make the Processor use them
  • extend CustomScoreQuery to register the listener with Lucene searcher
  • call IndexSearcher with an instance of the new query type; once finished, enjoy the data gathered by the Processor
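
The control flow of the steps above can be modelled in plain Java without Lucene on the classpath. This is only a sketch: FieldVisitor, FakeReader, ListeningScoreProvider and FakeSearcher are hypothetical stand-ins for StoredFieldVisitor, the index reader, the CustomScoreProvider subclass and IndexSearcher. Their method signatures and the field name "value" are assumptions made for illustration, not the real Lucene API.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Lucene's StoredFieldVisitor: it is offered
// stored fields one by one and picks the ones it cares about.
interface FieldVisitor {
    boolean needsField(String name);
    void longField(String name, long value);
}

// The Processor owns the Visitor and aggregates the values it reads.
final class SumProcessor {
    private long current;      // reused slot, so no per-document allocation
    private boolean hasValue;
    private long sum;
    private int count;

    final FieldVisitor visitor = new FieldVisitor() {
        @Override public boolean needsField(String name) {
            return "value".equals(name); // assumed field name
        }
        @Override public void longField(String name, long value) {
            current = value;
            hasValue = true;
        }
    };

    // Called after the visitor has been applied to one matching document.
    void documentRead() {
        if (hasValue) {
            sum += current;
            count++;
            hasValue = false;
        }
    }

    long sum() { return sum; }
    int count() { return count; }
}

// Hypothetical stand-in for an index reader: stored fields per document id.
final class FakeReader {
    private final List<Map<String, Long>> docs;

    FakeReader(List<Map<String, Long>> docs) { this.docs = docs; }

    void document(int docId, FieldVisitor visitor) {
        for (Map.Entry<String, Long> field : docs.get(docId).entrySet()) {
            if (visitor.needsField(field.getKey())) {
                visitor.longField(field.getKey(), field.getValue());
            }
        }
    }
}

// Plays the role of the CustomScoreProvider subclass: the "listener"
// that is invoked once for every matching document.
final class ListeningScoreProvider {
    private final FakeReader reader;
    private final SumProcessor processor;

    ListeningScoreProvider(FakeReader reader, SumProcessor processor) {
        this.reader = reader;
        this.processor = processor;
    }

    float customScore(int docId, float subQueryScore) {
        reader.document(docId, processor.visitor); // read the needed fields
        processor.documentRead();                  // process them right there
        return subQueryScore;                      // the score itself is ignored
    }
}

// Hypothetical stand-in for the searcher's scoring loop.
final class FakeSearcher {
    static void search(int[] matchingDocs, ListeningScoreProvider provider) {
        for (int docId : matchingDocs) {
            provider.customScore(docId, 1.0f);
        }
    }
}
```

After FakeSearcher.search returns, the caller ignores any search results and reads the aggregate from the processor via sum() and count(). In a real implementation the same roles would be filled by subclasses of the three Lucene classes listed above.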
