Sep 7, 2017

Tantalizing promise of GPUs for analytics workloads

For the last few years, the trend in analytics systems has been to implement a traditional MPP-style architecture as an open source, usually JVM-based product. Impala was the only exception. Surprisingly, its C++/LLVM foundation did not make it the leader of the "SQL on Hadoop" space, which is probably a testament to modern Java performance.

More recently, everyone began converging on the triad of columnar storage (off-heap when in-memory), vectorized operators, and operator bytecode generation. A Parquet / Arrow-style columnar format helps with compression and reduces the volume of data transferred from disk to memory. Vectorization can take advantage of SIMD (e.g. see Lecture #21 - Vectorized Execution) or at the very least alleviate CPU cache line thrashing.
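To make the difference between row-at-a-time and vectorized operators concrete, here is a minimal sketch in plain Python. The column names, the sample values, and the helper functions are all hypothetical, and Python itself does no SIMD; the point is only the shape of the code: a columnar operator scans one contiguous typed array per batch and passes a selection vector to the next operator, instead of touching every field of every record.

```python
from array import array

# Row-oriented layout: one tuple per record (hypothetical sample data).
rows = [(1, 12.5), (2, 7.0), (1, 3.5), (3, 20.0)]


def sum_fares_row_at_a_time(rows, wanted_passengers):
    """Classic iterator-style loop: every record is fetched and tested on its own."""
    total = 0.0
    for passengers, fare in rows:
        if passengers == wanted_passengers:
            total += fare
    return total


# Column-oriented layout: one contiguous typed array per column.
passenger_col = array("i", [1, 2, 1, 3])
fare_col = array("d", [12.5, 7.0, 3.5, 20.0])


def filter_batch(column, wanted):
    """Scan a single column batch and return a selection vector of matching positions."""
    return array("i", (i for i, value in enumerate(column) if value == wanted))


def sum_selected(column, selection):
    """Aggregate only the positions named by the selection vector."""
    return sum(column[i] for i in selection)


def sum_fares_columnar(passenger_col, fare_col, wanted, batch_size=1024):
    """Process the columns batch by batch, one tight loop per operator."""
    total = 0.0
    for start in range(0, len(passenger_col), batch_size):
        end = start + batch_size
        selection = filter_batch(passenger_col[start:end], wanted)
        total += sum_selected(fare_col[start:end], selection)
    return total
```

Both functions compute the same answer. In a real engine the inner loops of `filter_batch` and `sum_selected` are the parts a compiler or JIT can turn into SIMD instructions, because each one walks a single primitive array.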

By v2.0, Spark had covered most of the ground. It's interesting to note that even though cached dataframes are stored in a compressed columnar format, physical operators are still not vectorized and so process one row at a time (e.g. see a rare glimpse of the internal machinery in the design doc attached to SPARK-14098 "Generate Java code to build CachedColumnarBatch").

Judging from SPARK-15687 "Columnar execution engine", the need for vectorized operators vis-a-vis whole stage code generation seems to be a topic for serious debate this year. Nevertheless, it would still be a well-understood approach that other features, such as SPARK-19489 "Stable serialization format", could benefit from.

What is easy to miss behind all this progress is that the GPU revolution ignited by DL/ML has quietly arrived in the analytics world too. My high-level understanding of the GPU is essentially "very fast operations on columnar data once you copy it from main memory to the GPU". That almost looks like very advanced SIMD with high latency. So my initial expectation was that analytics queries would not be much faster because of the sheer data volume required by an average query.
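That expectation can be written down as a back-of-envelope model. The sketch below is only an illustration: the function name is hypothetical, and every bandwidth figure is a round placeholder chosen to make the arithmetic easy, not a measurement of any real hardware or product.

```python
def gpu_offload_estimate(data_gb, pcie_gb_per_s, gpu_scan_gb_per_s, cpu_scan_gb_per_s):
    """Rough scan-time estimate in seconds for three scenarios.

    All inputs are assumptions supplied by the caller:
    data_gb            size of the columns the query touches
    pcie_gb_per_s      host-to-GPU copy bandwidth
    gpu_scan_gb_per_s  rate at which the GPU processes resident data
    cpu_scan_gb_per_s  rate at which the CPU processes data in main memory
    """
    transfer_s = data_gb / pcie_gb_per_s
    gpu_compute_s = data_gb / gpu_scan_gb_per_s
    return {
        "cpu_s": data_gb / cpu_scan_gb_per_s,
        # Data starts in main memory and must be copied over first.
        "gpu_cold_s": transfer_s + gpu_compute_s,
        # Data is already resident in GPU memory.
        "gpu_resident_s": gpu_compute_s,
    }


# Placeholder inputs, NOT benchmark numbers.
estimate = gpu_offload_estimate(
    data_gb=10.0,
    pcie_gb_per_s=10.0,
    gpu_scan_gb_per_s=100.0,
    cpu_scan_gb_per_s=20.0,
)
for scenario, seconds in estimate.items():
    print(f"{scenario}: {seconds:.2f} s")
```

With these placeholder inputs, the copy step dominates whenever the data starts in main memory, which is the intuition described above. The same model also shows what would have to be true for a GPU engine to win: the working set stays resident in GPU memory across queries, or the query does enough work per byte to amortize the copy.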

I was clearly wrong. My favorite source of query engine benchmarking statistics shows that MapD and Brytlyt score very high. Both were designed from the ground up for the GPU. One challenge with GPU-related technology is that it is far too low-level in comparison with any JVM-based development.

So most people will probably end up just learning the big picture. For a reasonable overview of GPU usage in the OLAP context, I would recommend the Red Fox research project and their "Red Fox: An Execution Environment for Relational Query Processing on GPUs" paper in particular.

To the best of my knowledge, this topic is currently of limited interest to the Spark project, at least partially because it would take columnar data formats and vectorized operators to align Spark internals with GPU coding idioms and data layouts. It probably makes sense to keep an eye on SPARK-12620 "Proposal of GPU exploitation for Spark" and whatever else its author does, even if that prototype did not succeed.