I have been mulling over the question of Spark-friendly data storage recently. My favorite use-case here is still the one where:
- in a multi-tenant environment
- you can provision enough memory for all/most of the data to reside in RAM
- your target system is expected to have a response time of a few seconds, preferably less
What I would want from the storage layer:
- In a steady state, all data is in memory, backed by some disk-based cold storage.
- Datasets are split into multiple partitions held on different servers. This makes it possible to execute a query in parallel across multiple servers, and in case of a crash only a smallish subset of the data is lost and has to be recovered from cold storage.
- A catalog service shared between the storage component and the Spark scheduler. It would allow the scheduler to honor data locality for expensive computations.
- Obvious features such as efficient column compression and support for predicate push-down (see the sketch after this list)
- It's too much to ask, but using the same storage (or at least transfer) format as the internal Spark cache would be nice for server recovery time.
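To make the column-pruning and push-down bullet concrete, here is a minimal sketch with plain Parquet through the Spark 1.x SQLContext API; the path and column names are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("pushdown-sketch"))
val sqlContext = new SQLContext(sc)

// Parquet is columnar: only the requested columns are read,
// and simple comparison predicates are pushed down to the scan.
val events = sqlContext.read.parquet("hdfs:///data/events.parquet")
events
  .select("tenant_id", "ts", "value")   // column pruning
  .filter("ts >= '2015-01-01'")         // predicate push-down candidate
  .show()
```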
Curiously enough, I don't see storage options discussed much in the Spark documentation and books. Probably because the default Spark positioning is against MapReduce, so one is expected to have ORC/Parquet files on HDFS. The options I can think of:
- Spark cache itself. On the positive side, it's easy to use, integrated into the engine, and believed to be memory-efficient because of column-orientation and compression. The problem is that it's not real storage (try to imagine updates); it's called a cache for a reason. What is much worse is that if you lose your driver, you lose every single cached DataFrame it owned. That's a really big deal, and there are only tepid answers right now.
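For completeness, this is roughly all it takes to use, which is exactly its appeal; df stands for some hypothetical DataFrame, and everything below lives and dies with the driver that owns it:

```scala
import org.apache.spark.storage.StorageLevel

// The cached representation is columnar and compressed,
// built lazily on the first full pass over the data.
df.persist(StorageLevel.MEMORY_ONLY_SER)  // or simply df.cache()
df.count()                                // force materialization
```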
- Tachyon. You would think that for a project that originated at the same lab, the integration (and the PR around it) would be top-notch. Apparently there are missing pieces, and Tachyon is not explicitly mentioned in what passes for a roadmap.
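What does exist today is integration at the file-system level: Tachyon exposes a Hadoop-compatible FileSystem, so you can point ordinary Parquet I/O at a tachyon:// URI, assuming the Tachyon client jar is on the classpath. The host and path below are placeholders:

```scala
// 19998 is the default Tachyon master port; the host and path
// are made up for illustration.
df.write.parquet("tachyon://tachyon-master:19998/warehouse/events")
val fromTachyon = sqlContext.read.parquet(
  "tachyon://tachyon-master:19998/warehouse/events")
```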
- Cassandra. With an official connector on github, the story here seems clearer. For people who already run Cassandra, I would expect it to be not a bad idea: you already have an in-memory columnar storage that can process predicates and serve column subsets. Co-locate Spark workers and Cassandra nodes, and data transfer might not be that expensive.
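With the spark-cassandra-connector on the classpath, reading a table into a DataFrame looks roughly like this (keyspace and table names are invented); column subsets and simple predicates are pushed down to Cassandra, and the connector is locality-aware when workers sit next to nodes:

```scala
// DataFrame view over a Cassandra table via the official connector.
val events = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "metrics", "table" -> "events"))
  .load()

// Column pruning and this predicate can be served by Cassandra itself.
events.select("tenant_id", "value")
  .filter("tenant_id = 42")
  .show()
```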
- GridGain/Ignite. Frankly, I guess I am not being serious here; I have never looked at their in-memory file system anyway. But according to some not-so-fringe voices...
When I first started thinking about it, serving Parquet files via Tachyon as your SparkSQL data storage sounded like a buzzword-compliant joke. I am not so sure anymore, even though it still looks strange to me. I would say that the Cassandra option looks much more traditional and so is likely to be production-worthy sooner than anything else. But I admit to having little certainty as to which one is solid enough to be used this year.