Apr 28, 2016

Faster Lucene numeric field retrieval

I wrote previously about using a custom scoring loop to retrieve Lucene document values. It turns out there is another simple optimization that allows much faster field value retrieval: the DocValues field type. In addition to the LongField/DoubleField types, which can be indexed (and so filtered on when running a Lucene query), there is another kind of numeric field.

The difference is that DocValues fields cannot be indexed. So the pattern here is to have two subsets of Lucene document fields. The fields in the first subset are indexed but not stored; they are used to match documents. The second subset consists of DocValues fields only. For a matching document, the values of the DocValues fields can be retrieved from the index reader without fetching the entire document. DocValues fields are implemented using memory-efficient techniques, including compression.
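
To make the two-subset pattern concrete, here is a toy model in plain Java. It is not Lucene API and not the code from the example: the class `TwoSubsetSketch`, its methods, and the `price`/`weight` fields are all hypothetical. A map from term to document IDs stands in for the indexed fields, and two per-field columns addressed by document ID stand in for the DocValues fields. Storing doubles as raw long bits is an assumption of this sketch, made so that every column holds the same primitive type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: not Lucene API. All names are hypothetical.
final class TwoSubsetSketch {
    // Subset 1 ("indexed, not stored"): term -> IDs of the documents containing it.
    private final Map<String, List<Integer>> postings = new HashMap<>();
    // Subset 2 ("DocValues"): one column per numeric field, addressed by document ID.
    private final List<Long> priceColumn = new ArrayList<>();
    private final List<Long> weightColumn = new ArrayList<>();

    // Adds a document and returns its ID (its position in every column).
    int addDocument(String term, long price, double weight) {
        int docId = priceColumn.size();
        postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        priceColumn.add(price);
        weightColumn.add(Double.doubleToRawLongBits(weight));
        return docId;
    }

    // Matching uses only the indexed subset.
    List<Integer> match(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptyList());
    }

    // Value retrieval uses only the columns; no whole document is ever loaded.
    long price(int docId) {
        return priceColumn.get(docId);
    }

    double weight(int docId) {
        return Double.longBitsToDouble(weightColumn.get(docId));
    }

    public static void main(String[] args) {
        TwoSubsetSketch index = new TwoSubsetSketch();
        index.addDocument("lucene", 10L, 1.5);
        index.addDocument("solr", 20L, 2.5);
        index.addDocument("lucene", 30L, 3.5);

        long total = 0;
        for (int docId : index.match("lucene")) {
            total += index.price(docId);
        }
        System.out.println(total); // prints 40
    }
}
```

The point of the split is that `match` never touches the columns and `price`/`weight` never touch the postings, so each side can be laid out for its own access pattern.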

In the example:

  • there are two numeric fields whose values are collected for each matching document by Processor
  • Searcher can execute a custom scoring loop query
  • DocValuesQuery obtains entry points to DocValues storage and creates a new Provider
  • Provider retrieves values of the required DocValues fields and notifies the Processor 
  • Processor represents the logic responsible for acting on the collected values for every matching Lucene document
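
The collaboration between these roles could look roughly like the sketch below. It only borrows the role names from the list; every signature is a guess made for illustration, and it runs on plain Java arrays rather than on a Lucene index. The assumptions are that two `long[]` columns stand in for the DocValues storage, that an `IntPredicate` stands in for the query that decides which documents match, and that the helper `run` exists only to make the sketch easy to try.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical shapes for the roles listed above; not Lucene API.
final class CollaborationSketch {

    /** Acts on the collected values of every matching document. */
    interface Processor {
        void process(int docId, long first, long second);
    }

    /** Reads the required per-document values and notifies the Processor. */
    static final class Provider {
        private final long[] firstColumn;
        private final long[] secondColumn;
        private final Processor processor;

        Provider(long[] firstColumn, long[] secondColumn, Processor processor) {
            this.firstColumn = firstColumn;
            this.secondColumn = secondColumn;
            this.processor = processor;
        }

        void onMatch(int docId) {
            processor.process(docId, firstColumn[docId], secondColumn[docId]);
        }
    }

    /** Holds the entry points to the value storage and creates a Provider. */
    static final class DocValuesQuery {
        private final long[] firstColumn;
        private final long[] secondColumn;

        DocValuesQuery(long[] firstColumn, long[] secondColumn) {
            this.firstColumn = firstColumn;
            this.secondColumn = secondColumn;
        }

        Provider createProvider(Processor processor) {
            return new Provider(firstColumn, secondColumn, processor);
        }
    }

    /** Runs the matching loop; the predicate stands in for a real query. */
    static final class Searcher {
        private final int maxDoc;

        Searcher(int maxDoc) {
            this.maxDoc = maxDoc;
        }

        void search(IntPredicate matches, DocValuesQuery query, Processor processor) {
            Provider provider = query.createProvider(processor);
            for (int docId = 0; docId < maxDoc; docId++) {
                if (matches.test(docId)) {
                    provider.onMatch(docId);
                }
            }
        }
    }

    /** Helper: wires the roles together and records what the Processor saw. */
    static List<String> run(long[] first, long[] second, IntPredicate matches) {
        List<String> seen = new ArrayList<>();
        Processor processor = (docId, a, b) -> seen.add(docId + ":" + a + ":" + b);
        new Searcher(first.length).search(matches, new DocValuesQuery(first, second), processor);
        return seen;
    }

    public static void main(String[] args) {
        long[] first = {1, 2, 3, 4};
        long[] second = {10, 20, 30, 40};
        // Documents with an even ID "match" in this toy setup.
        System.out.println(run(first, second, docId -> docId % 2 == 0)); // prints [0:1:10, 2:3:30]
    }
}
```

Keeping the Processor behind a small interface means the matching loop and the value lookup stay the same whatever is done with the values.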