Jun 30, 2015

Insufficient support for sparse data in sparksql

I was experimenting with sparse data in sparksql and I learned at least one surprising thing. I'm really interested in OLAP-style analytics use cases. If you squint enough SparkSQL looks like an in-memory analytics database. So even though SparkSQL is not a data storage product its DataFrame cache can be abused for doing in-memory filtering and aggregation of multidimensional data. 

In my experiment I started from an 820,000 rows by 900 columns Parquet file that simulated a very sparse denormalized OLAP dataset. I had a dozen or so dense columns of the long type representing dimension member ids. The rest were very sparse columns of the double type pretending to be metric values. I was impressed to see that Parquet compressed it into a 22 MB binary file.

The next thing I tried was actually loading that file as a DataFrame and caching it in memory to accelerate future processing. The good news is that reading such a file is a one-liner in Spark. The bad news is that it took 3 GB of SparkSQL cache storage. Initially I though that I had misconfigured something. After all to me 3 GB very much looks like 800K * 900 * 8 bytes with a modicum of compression. After looking at my executor's debug-level logs I could see that indeed what happened. Double-type columns were "compressed" with the PassThrough compression scheme which is a no-op.

I dug just a little bit deeper to see that no compression scheme is actually supported for floating point types. Well, who knows, even assertEquals treats floating types differently. Maybe data science is all about integer-type feature vectors. But while looking at the NullableColumnBuilder I also noticed that it is biased towards dense data and is rather unhelpful if what you have is sparse. It encodes a null value by remembering its int (i.e. 4-byte) position. So if you have an empty long/double (i.e. 8-byte) column, the best you can expect is to spend half of the memory a fully populated column would take.

I actually couldn't make the suggested workaround work in Java but I was not really trying. So I am still convinced that currently SparkSQL is rather sparse data-unfriendly. The corresponding compression functionality is not pluggable. I guess one would need to implement a new NullableColumnBuilder trait, extend that trait by new DoubleColumnBuilder-style classes, and instantiate those classes in ColumnBuilder::apply(). I feel like trying it to see if I can make something simple work.