As I discussed before, the Parquet file format is extremely well suited for storing data in analytics systems because of its columnar approach. But it gets better (even after you take column data compression into account). The Parquet library supports predicate push-down, which makes it very complementary to Spark-like analytics query engines. Depending on your data model and data distribution, it may be possible to skip entire blocks when reading a Parquet file.
The Filter API supports predicates that apply the usual comparison operators to column field values. Individual predicates can be composed using logical AND and OR operators. In addition, user-defined predicates are supported: they can skip an entire file block if the column metadata indicates that the block contains no matching values.
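As a quick illustration, composing such a predicate with FilterApi might look like the sketch below; the column names, types, and values are assumptions made up for this post, and the imports assume the org.apache.parquet package layout of recent parquet-mr releases:

```java
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;
import org.apache.parquet.filter2.predicate.Operators.LongColumn;

import static org.apache.parquet.filter2.predicate.FilterApi.*;

public class PredicateExamples {
  // Hypothetical column names "status" and "ts", used purely for illustration.
  static FilterPredicate httpLogPredicate() {
    IntColumn status = intColumn("status");
    LongColumn ts = longColumn("ts");

    // (status == 200 OR status == 304) AND ts >= 1000
    return and(
        or(eq(status, 200), eq(status, 304)),
        gtEq(ts, 1000L));
  }
}
```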
The Filter API consists of just a few abstractions:
- FilterCompat.Filter - the push-down predicate interface taken by the ParquetReader builder as an optional parameter
- FilterApi - builder methods that create individual predicates and compose them
- UserDefinedPredicate - the abstract class extended by user-defined filters; it supports block-level predicates in addition to the usual record-level ones (see the sketch after this list)
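Here is a minimal sketch of what a user-defined predicate can look like; the InRangePredicate name, the Long column type, and the range semantics are just assumptions for illustration:

```java
import org.apache.parquet.filter2.predicate.Statistics;
import org.apache.parquet.filter2.predicate.UserDefinedPredicate;

import java.io.Serializable;

// Hypothetical predicate: keep only records whose column value falls within [lo, hi].
public class InRangePredicate extends UserDefinedPredicate<Long> implements Serializable {
  private final long lo;
  private final long hi;

  public InRangePredicate(long lo, long hi) {
    this.lo = lo;
    this.hi = hi;
  }

  // Record-level check: invoked for each column value.
  @Override
  public boolean keep(Long value) {
    return value != null && value >= lo && value <= hi;
  }

  // Block-level check: the whole block can be dropped if its min/max range
  // cannot overlap [lo, hi].
  @Override
  public boolean canDrop(Statistics<Long> statistics) {
    return statistics.getMax() < lo || statistics.getMin() > hi;
  }

  // Block-level check for the negated predicate (used when the filter is wrapped in not()).
  @Override
  public boolean inverseCanDrop(Statistics<Long> statistics) {
    return statistics.getMin() >= lo && statistics.getMax() <= hi;
  }
}
```

Implementing Serializable lets an instance with constructor arguments be passed to FilterApi.userDefined; the class-based overload instead requires a no-argument constructor.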
I updated my Parquet example to demonstrate usage of a disjunctive filter and a user-defined one. At a high level (a sketch of the full flow follows the list):
- FilterApi is repeatedly called to resolve a column name into a column reference, create individual predicates using the reference, and compose predicates
- Behind the scenes, a tree of instances of the inner Operators classes is created
- The FilterCompat facade is called to wrap the tree into a Filter implementation recognized by the Parquet file reader
- A new file reader is created using the Filter
- Internally, different visitor types are used to apply record-level and block-level predicates
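Putting these steps together, the wiring could look roughly like the following sketch; the file path, the "amount" column, and the Avro-based reader are assumptions for illustration, and InRangePredicate is the hypothetical user-defined filter sketched earlier:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.LongColumn;
import org.apache.parquet.hadoop.ParquetReader;

import static org.apache.parquet.filter2.predicate.FilterApi.*;

public class FilteredReadSketch {
  public static void main(String[] args) throws Exception {
    // Resolve the column name into a typed column reference ("amount" is made up).
    LongColumn amount = longColumn("amount");

    // Compose a disjunction of a simple predicate and the hypothetical user-defined one.
    FilterPredicate predicate = or(
        eq(amount, 0L),
        userDefined(amount, new InRangePredicate(100L, 200L)));

    // Wrap the predicate tree into a Filter that the file reader recognizes.
    FilterCompat.Filter filter = FilterCompat.get(predicate);

    // Create a reader with the filter; non-matching records (and, where block
    // metadata allows, whole blocks) are skipped during the read.
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(new Path("example.parquet"))
                 .withFilter(filter)
                 .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        System.out.println(record);
      }
    }
  }
}
```

The same withFilter call works with any ParquetReader builder; FilterCompat.get is what turns the FilterPredicate tree into the Filter form that the reader's visitors understand.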