Oct 8, 2016

Parquet file metadata, standard and custom

When I wrote about the Parquet file format earlier, I neglected to mention one auxiliary but potentially useful API. You might remember that a Parquet file is composed of blocks. Even though each column is compressed individually, the block determines which rows are compressed together as a single chunk. The API I would like to discuss today provides access to the file metadata, both custom and built-in.

If you are interested in using Parquet files without the help of a Spark-like query engine, chances are you will end up using metadata. It could be as simple as attaching a schema version. Or you might want to use column statistics to optimize query processing, predicate pushdown-style: skip a whole block when its statistics prove that no row in it can match.

Every Parquet file has detailed metadata associated with it. If you see it in the picture you can read it. Among the attributes most likely to be useful are:
  • schema name
  • row count for every block
  • the number of NULLs, MIN and MAX values for every column in a block
  • application-specific key/value pairs you can attach to a file
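
To make the predicate pushdown idea concrete, here is a minimal sketch of block pruning driven by per-block column statistics. It deliberately does not touch the Parquet API: BlockStats and blocksToScan are hypothetical names introduced for illustration, and the statistics are modeled as plain long values for a single column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockPruning {

    /** Hypothetical stand-in for the statistics of one column within one block. */
    static final class BlockStats {
        final int blockIndex;
        final long rowCount;
        final long nullCount;
        final long min;
        final long max;

        BlockStats(int blockIndex, long rowCount, long nullCount, long min, long max) {
            this.blockIndex = blockIndex;
            this.rowCount = rowCount;
            this.nullCount = nullCount;
            this.min = min;
            this.max = max;
        }
    }

    /** Returns the indexes of blocks that may contain a row where column == target. */
    static List<Integer> blocksToScan(List<BlockStats> blocks, long target) {
        List<Integer> result = new ArrayList<>();
        for (BlockStats b : blocks) {
            // A block holding only NULLs cannot match an equality predicate.
            boolean allNull = b.nullCount == b.rowCount;
            // Outside [min, max] no row can match, so the block is skipped.
            boolean inRange = b.min <= target && target <= b.max;
            if (!allNull && inRange) {
                result.add(b.blockIndex);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<BlockStats> blocks = Arrays.asList(
                new BlockStats(0, 1000, 0, 1, 100),
                new BlockStats(1, 1000, 10, 101, 200),
                new BlockStats(2, 50, 50, 0, 0)); // all NULLs
        System.out.println(blocksToScan(blocks, 150)); // prints [1]
    }
}
```

Statistics can only rule a block out. A block that survives the check still has to be scanned, because a value inside the [min, max] range is not guaranteed to be present.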

I have an example of reading standard metadata. There is a static method, ParquetFileReader.readFooter(), that returns a ParquetMetadata instance for a given file path. From there you can traverse all blocks and their columns. No surprises.
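
The traversal described above might look roughly like the sketch below. It assumes parquet-hadoop and the Hadoop client libraries are on the classpath and uses the org.apache.parquet package names; the class name PrintParquetMetadata and the command-line argument are made up for illustration. Newer Parquet releases mark readFooter() as deprecated, so treat this as a sketch rather than the recommended call for every version.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class PrintParquetMetadata {

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: the file path is passed as the first argument.
        Path file = new Path(args[0]);

        // Read the footer only; NO_FILTER keeps the metadata of every block.
        ParquetMetadata footer = ParquetFileReader.readFooter(
                new Configuration(), file, ParquetMetadataConverter.NO_FILTER);

        System.out.println("schema: " + footer.getFileMetaData().getSchema().getName());

        for (BlockMetaData block : footer.getBlocks()) {
            System.out.println("block with " + block.getRowCount() + " rows");
            for (ColumnChunkMetaData column : block.getColumns()) {
                Statistics<?> stats = column.getStatistics();
                String name = column.getPath().toDotString();
                // Min and max are only meaningful if the chunk has non-null values.
                if (stats != null && stats.hasNonNullValue()) {
                    System.out.println("  " + name
                            + " nulls=" + stats.getNumNulls()
                            + " min=" + stats.genericGetMin()
                            + " max=" + stats.genericGetMax());
                } else {
                    System.out.println("  " + name + " (no min/max statistics)");
                }
            }
        }
    }
}
```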

In addition to the attributes defined by the Parquet format, you can also attach arbitrary String key/value pairs to a file. In the write path, your WriteSupport class can override the finalizeWrite() method to return a custom metadata Map, wrapped in a FinalizedWriteContext. In the read path, you have access to the map in the same place you get access to the file schema: it is passed to your ReadSupport alongside the schema. Alternatively, the standard API mentioned above lets you read custom metadata as well.
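
On the write side, a minimal sketch could look like this. To keep it short it extends GroupWriteSupport from Parquet's example package instead of implementing a full WriteSupport; the class name VersionedGroupWriteSupport, the "schema.version" key and its value are assumptions made up for illustration. It needs parquet-hadoop on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.parquet.hadoop.example.GroupWriteSupport;

public class VersionedGroupWriteSupport extends GroupWriteSupport {

    @Override
    public FinalizedWriteContext finalizeWrite() {
        // Called once when the file is being closed; the returned pairs
        // are stored in the footer as file-level key/value metadata.
        Map<String, String> extra = new HashMap<>();
        extra.put("schema.version", "3"); // hypothetical key and value
        return new FinalizedWriteContext(extra);
    }
}
```

Reading the value back with the footer API is then a matter of calling getFileMetaData().getKeyValueMetaData() on the ParquetMetadata instance and looking up the same key.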