Jan 30, 2017

Compressing time series with Gorilla in Java

You might know FB for their censorship efforts and spying on the users but did you know they also write software? :) In a two-year old paper they discuss a very insightful and practical approach to compressing time series. Better yet, they even have an OSS implementation. Though C++ is not hard to read, it makes the original code not that useful in the JVM land. 

First, a brief recap of the compression ideas:
  • The data model is a sequence of (timestamp:int64, value:double) pairs. They split a sequence into blocks aligned with two-hour windows but we'll ignore such higher-level concerns.
  • Timestamps and values are compressed separately. 
  • Timestamps are stored as the first timestamp value followed by variable-length encoded delta of deltas. Four value ranges are supported ([-63:64] , [-255:256], [-2047:2048], "fits into 32 bits").
  • Values are stored as a sequence of the first value followed by variable-length encoded values produced by XOR-ing next value with the previous one. The binary representation of an XOR-ed value is split into "leading zeros", "block value", and "trailing zeroes" parts. The numbers of leading and trailing zeroes are stored as bytes.

I wanted to see how well Gorilla compression can cope with less predictable data. The kind of time series I am interested in is not as periodic as IoT data. So I ended up porting a couple of classes to Java. The time series class can be used for writing data or reading it but not both simultaneously. 

The current implementation is intended to help with testing compression quality and does not necessarily support other uses. The unit test shows how to read a compressed time series. The only catch is that reading the first timestamp requires a dedicated method call while all the values can be read calling the same method.