Jul 10, 2016

Processing data without deserialization with flatbuffers

While reading about Arrow I was reminded of flatbuffers. I decided to try them on a simple case to see how they compare to hand-coded data structures. The promise is that, given a schema described in an IDL similar to Protocol Buffers or Thrift, the flatc compiler generates data structures that can be operated on without deserialization. That comes in handy when you have a massive input data stream.

I created a trivial schema that purports to represent a simplistic time series. It's basically a sequence of (time, value) pairs. I was curious how useful the generated code could be. For example, would it be smart enough to unwrap an array of fixed-size value pairs into two parallel arrays? 
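For comparison, here is roughly what a hand-coded "two parallel arrays" layout could look like on top of a plain java.nio.ByteBuffer. This is only a sketch of the idea, not anything flatc generates: the class name TimeSeriesLayout, the 4-byte count header and the little-endian byte order are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical hand-coded layout (NOT produced by flatc):
// [int count][count x long timestamps][count x double values]
final class TimeSeriesLayout {
    static final int HEADER_BYTES = Integer.BYTES;

    // Writes the two arrays back to back after a 4-byte element count.
    static ByteBuffer encode(long[] times, double[] values) {
        if (times.length != values.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int n = times.length;
        ByteBuffer buf = ByteBuffer
                .allocate(HEADER_BYTES + n * (Long.BYTES + Double.BYTES))
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(n);
        for (long t : times) {
            buf.putLong(t);
        }
        for (double v : values) {
            buf.putDouble(v);
        }
        buf.flip();
        return buf;
    }

    static int count(ByteBuffer buf) {
        return buf.getInt(0);
    }

    // Absolute reads: no objects are materialized and the buffer position is untouched.
    static long timeAt(ByteBuffer buf, int i) {
        return buf.getLong(HEADER_BYTES + i * Long.BYTES);
    }

    static double valueAt(ByteBuffer buf, int i) {
        int valuesStart = HEADER_BYTES + count(buf) * Long.BYTES;
        return buf.getDouble(valuesStart + i * Double.BYTES);
    }

    // Example of an operation performed directly on the encoded bytes.
    static double sum(ByteBuffer buf) {
        double total = 0.0;
        int n = count(buf);
        for (int i = 0; i < n; i++) {
            total += valueAt(buf, i);
        }
        return total;
    }
}
```

With this layout the only overhead is the 4-byte count, and any operation over the values touches one contiguous run of doubles.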

The first nasty surprise was learning that flatbuffers don't have an official Maven artifact. They don't even have a Maven plugin. This alone is a huge red flag, even though some kind people came to the rescue. At least "brew install flatbuffers" actually installed the flatc compiler on macOS.

Flatbuffers have "struct" and "table" abstractions. The former is supposed to be a bare-bones serializable data structure. The latter helps with schema changes but adds overhead to store some schema details. I was interested in the most compact format possible. My first attempt was to introduce a DataPoint struct and then have an array of them. No can do, flatc did not accept it. My second attempt was more successful. With two arrays of primitive types I was actually able to compile my IDL.

To begin with, flatbuffers internally use ByteBuffer instances and not byte arrays. That complicates life if you want something simple. I imagine it complicates life even if your use case is more complex, because there is no support for configurable buffer allocation. You can supply a ByteBuffer instance of your own, but if you fail to size it appropriately, the framework will allocate a new, larger instance.
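To make the byte array versus buffer friction concrete, here is a small sketch using only java.nio. The class BufferInterop and its methods are hypothetical helpers, and grow() shows a generic capacity-doubling pattern as an assumption about what outgrowing a buffer costs in general; it is not the library's actual code.

```java
import java.nio.ByteBuffer;

// Hypothetical helpers for moving between byte[] and ByteBuffer.
final class BufferInterop {
    // Wraps an existing array without copying; writes through the buffer land in the array.
    static ByteBuffer wrap(byte[] storage) {
        return ByteBuffer.wrap(storage);
    }

    // Copies the readable bytes (position..limit) into a fresh array,
    // leaving the original buffer's position untouched.
    static byte[] toBytes(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    // Generic doubling strategy (an assumption, not library code):
    // allocate a new buffer and copy everything written so far.
    static ByteBuffer grow(ByteBuffer old) {
        ByteBuffer bigger = ByteBuffer
                .allocate(Math.max(1, old.capacity() * 2))
                .order(old.order());
        ByteBuffer written = old.duplicate();
        written.flip(); // limit = bytes written so far, position = 0
        bigger.put(written);
        return bigger;
    }
}
```

The point of grow() is the cost it makes visible: every time the initial size is wrong, there is a fresh allocation plus a full copy, and the caller's original buffer is no longer the one holding the data.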

When I looked at the way an array is written to the ByteBuffer, I was disappointed to see that it is copied one element at a time. Simple omissions like that, or the absence of signatures such as "write(byte[] array, int from, int length)", are unexpected in a framework aimed at people who care about efficiency. It was also disappointing to see 50+ bytes of overhead when serializing a record of 10*16 bytes, which is more than 30% on top of a 160-byte payload. Those tables are not free.
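For reference, plain java.nio already offers a bulk alternative to the element-at-a-time copy. The sketch below contrasts the two; the class BulkWrite and both method names are made up for illustration and are not part of flatbuffers.

```java
import java.nio.ByteBuffer;

// Hypothetical comparison of per-element and bulk writes of a long[] slice.
final class BulkWrite {
    // One putLong call per element.
    static void putLongsOneByOne(ByteBuffer buf, long[] src, int from, int length) {
        for (int i = from; i < from + length; i++) {
            buf.putLong(src[i]);
        }
    }

    // Single bulk transfer through a LongBuffer view of the same memory.
    static void putLongsBulk(ByteBuffer buf, long[] src, int from, int length) {
        // The view starts at buf's current position and inherits its byte order.
        buf.asLongBuffer().put(src, from, length);
        // The view has its own position, so advance the byte buffer manually.
        buf.position(buf.position() + length * Long.BYTES);
    }
}
```

Both methods leave the buffer with identical contents and position; the bulk version simply hands the whole slice to the JDK in one call instead of looping in user code.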

So flatbuffers could still be an option if you deal with complicated, deeply-nested data structures. For simple record types they do not seem to provide any benefit but still require exorbitant effort in configuration management.