May 3, 2015

Data storage for SparkSQL-based analytics

I have been mulling over the question of Spark-friendly data storage recently. My favorite use-case here is still the one where:
  • in a multi-tenant environment 
  • you can provision enough memory for all/most of the data to reside in RAM
  • your target system is expected to have a response time of a few seconds, preferably less
When I imagine an ideal data store in this context, I see things such as:
  • In a steady state, all data is in memory, backed by some disk-based cold storage.
  • Datasets are split into multiple partitions held on different servers. This makes it possible to execute a query in parallel across multiple servers, and in case of a crash only a smallish data subset is lost and has to be recovered from the cold storage.
  • A catalog service shared between the storage component and the Spark scheduler. It would allow the scheduler to honor data locality for expensive computations.
  • Obvious features such as efficient column compression and support for predicate push-down.
  • It may be too much to ask, but using the same storage (or at least transfer) format as the internal Spark cache would be nice for server recovery time.
Curiously enough, I don't see storage options discussed much in the Spark documentation and books, probably because Spark positions itself against MapReduce by default, so one is simply expected to have ORC/Parquet files sitting on HDFS already.
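For reference, that default setup looks roughly like the sketch below. It is a minimal sketch against a Spark 1.3-style SQLContext; the HDFS path, table and column names are invented for illustration, but it shows the column pruning and predicate push-down from my wish list above.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-on-hdfs-baseline"))
    val sqlContext = new SQLContext(sc)

    // Parquet files sitting on HDFS, read through SparkSQL.
    val events = sqlContext.parquetFile("hdfs:///warehouse/events")

    // Only the referenced columns are read, and the filter can be pushed
    // down into the Parquet reader instead of being applied after a full scan.
    val recent = events
      .select("tenant_id", "event_type", "ts")
      .filter("ts >= '2015-04-01'")

    recent.registerTempTable("recent_events")
    sqlContext.sql("SELECT tenant_id, count(*) AS cnt FROM recent_events GROUP BY tenant_id").show()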

My impression is that there are three reasonable alternatives. All of them are works in progress to such an extent that only your own experiments can show how far they are from production quality. It is not entirely clear how stable they are, but there are companies running them in production despite their official experimental/beta status. So I can imagine some startups betting on these technologies with the expectation of growing together.
  • The Spark cache itself (a usage sketch follows this list). On the positive side, it's easy to use, it's integrated into the engine, and it's believed to be memory-efficient thanks to column orientation and compression. The problem is that it's not real storage (try to imagine updates); it's called a cache for a reason. What is much worse, if you lose your driver, you lose every single cached DataFrame it owned. That's a really big deal, and there are only tepid answers to it right now.
  • Tachyon. You would think that for a project that originated at the same lab, the integration (and the PR around it) would be top-notch. Apparently there are missing pieces, and Tachyon is not explicitly mentioned in what passes for a Spark roadmap.
  • Cassandra. With an official connector on GitHub, the story here seems clearer. For people who already run it, this would not be a bad idea: you already have an in-memory columnar store that can evaluate predicates and serve column subsets. Co-locate Spark workers with Cassandra nodes, and data transfer might not be that expensive (see the sketch after this list).
  • GridGain Ignite. Frankly, I guess I am not being serious here. I have never looked at their in-memory file system anyway. But according to some not-so-fringe voices...
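To make the Spark cache and Cassandra options concrete, here is roughly what the entry points look like. This is a minimal sketch rather than a benchmarked setup: it assumes a Spark 1.3-style SQLContext plus the DataStax spark-cassandra-connector on the classpath, and the host, keyspace, table and column names are all made up.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import com.datastax.spark.connector._   // DataStax spark-cassandra-connector

    val conf = new SparkConf()
      .setAppName("storage-options")
      .set("spark.cassandra.connection.host", "cassandra-seed.internal")  // hypothetical seed node

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Spark cache: columnar and compressed, but tied to the driver's lifetime.
    val events = sqlContext.parquetFile("hdfs:///warehouse/events")
    events.registerTempTable("events")
    sqlContext.cacheTable("events")   // gone if the driver goes down

    // Cassandra: ship the column subset and the predicate to the nodes holding the data.
    val fromCassandra = sc.cassandraTable("analytics", "events")
      .select("tenant_id", "event_type", "ts")
      .where("ts >= ?", "2015-04-01")

    println(fromCassandra.count())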
When I first started thinking about it, serving Parquet files via Tachyon as your SparkSQL data storage sounded like a buzzword-compliant joke. I am not so sure anymore, even though it still looks strange to me. I would say that the Cassandra option looks much more traditional and so is likely to be production-worthy sooner than anything else. But I admit to having little certainty as to which of these will be solid enough to use this year.
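For completeness, the Parquet-via-Tachyon idea amounts to something like the following. Again, this is a minimal sketch under stated assumptions: a Tachyon master reachable at its default port, the Tachyon client jar on the Spark classpath, and invented paths.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-on-tachyon"))
    val sqlContext = new SQLContext(sc)

    // Copy the cold data from HDFS into Tachyon as Parquet...
    val events = sqlContext.parquetFile("hdfs:///warehouse/events")
    events.saveAsParquetFile("tachyon://tachyon-master:19998/warehouse/events")

    // ...and serve queries from the in-memory copy from then on.
    val hot = sqlContext.parquetFile("tachyon://tachyon-master:19998/warehouse/events")
    hot.registerTempTable("events_hot")
    sqlContext.sql("SELECT count(*) FROM events_hot").show()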