Feb 19, 2016

Reading a Parquet file

Now that we can write a Parquet file, it is useful to be able to read it back. We still assume a simple record type as the data format. The example can be adapted to more complicated schemas with nested structures, though in that case you should probably consider using Protocol Buffers or Avro messages instead.

Unsurprisingly, reading a file is quite similar to writing it. You need to:
  • extend the ReadSupport class to parse the file schema and create readers for individual fields
  • create a new ParquetReader instance that makes it possible to read the file one row at a time
  • when called by the Parquet library, copy each row field value to your current row buffer

Your read support class is responsible for returning a concrete RecordMaterializer. This class represents the ability to read a single row. Internally, it holds a reference to a GroupConverter that knows which PrimitiveConverter instance to use for the row field with a given index. 

Each PrimitiveConverter is responsible for writing a field value to the corresponding slot in the data structure that represents a file row in your application. The row reading sequence is mostly self-explanatory; please have a look at the example source code to see it in action.
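
The converter chain described above can be illustrated with a minimal, self-contained sketch. The abstract classes below are simplified stand-ins that only mirror the shape of Parquet's RecordMaterializer, GroupConverter and PrimitiveConverter; they are not the library classes. Names such as RowReadingSketch, RowMaterializer, readOneRow and addString are made up for illustration (the real PrimitiveConverter, for instance, receives strings through addBinary). The readOneRow method plays the role of the Parquet library driving the converters for a single row.

```java
import java.util.Arrays;

public class RowReadingSketch {

    // Stand-in for a primitive converter: receives the value of one field.
    static abstract class PrimitiveConverter {
        void addLong(long value) { throw new UnsupportedOperationException(); }
        // Hypothetical simplification; strings are not delivered this way in Parquet.
        void addString(String value) { throw new UnsupportedOperationException(); }
    }

    // Stand-in for a group converter: maps a field index to its converter.
    static abstract class GroupConverter {
        abstract PrimitiveConverter getConverter(int fieldIndex);
        abstract void start(); // called before the fields of a row
        abstract void end();   // called after the fields of a row
    }

    // Stand-in for a record materializer: exposes the root converter and the row.
    static abstract class RecordMaterializer<T> {
        abstract T getCurrentRecord();
        abstract GroupConverter getRootConverter();
    }

    // Application side: a row is an Object[] with one slot per field.
    static final class RowMaterializer extends RecordMaterializer<Object[]> {
        private final Object[] row;
        private final PrimitiveConverter[] converters;
        private final GroupConverter root;

        RowMaterializer(int fieldCount) {
            row = new Object[fieldCount];
            converters = new PrimitiveConverter[fieldCount];
            for (int i = 0; i < fieldCount; i++) {
                final int slot = i;
                // Each converter writes into the slot of the field it serves.
                converters[i] = new PrimitiveConverter() {
                    @Override void addLong(long value) { row[slot] = value; }
                    @Override void addString(String value) { row[slot] = value; }
                };
            }
            root = new GroupConverter() {
                @Override PrimitiveConverter getConverter(int fieldIndex) {
                    return converters[fieldIndex];
                }
                @Override void start() { Arrays.fill(row, null); } // reset the buffer
                @Override void end() { /* row is complete */ }
            };
        }

        @Override Object[] getCurrentRecord() { return row.clone(); }
        @Override GroupConverter getRootConverter() { return root; }
    }

    // Simulates what the library does while reading one row with two fields.
    public static Object[] readOneRow() {
        RowMaterializer materializer = new RowMaterializer(2);
        GroupConverter root = materializer.getRootConverter();
        root.start();
        root.getConverter(0).addLong(42L);
        root.getConverter(1).addString("alice");
        root.end();
        return materializer.getCurrentRecord();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(readOneRow())); // prints [42, alice]
    }
}
```

Returning a copy from getCurrentRecord is one possible design choice. Reusing the same buffer would avoid an allocation per row, but then the caller must not keep a reference to a row across reads.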

