Dec 5, 2014


The other day I noticed that the Hyracks research prototype had a continuation. A few years ago it was quite interesting, back in the day when the Dryad paper was in vogue but its source code was unavailable. The SQL-on-Hadoop segment was not as overcrowded then. Heck, the term itself did not even exist, if I remember correctly. I perused their recent papers and I am not quite sure what to think about them.

I liked the one on data storage because it nicely illustrates the use of LSM-tree indexes. Their streaming support looks somewhat unnatural compared with the way event-oriented systems are built and used, Storm-style. The critique of mainstream Java memory wastefulness is nice but too basic. They don't even consider vectorization, which is such a hot topic in analytics. I'd rather see what the researchers think about using off-heap storage with the latest JDK.

I understand that academics avoid making real contributions to production systems. I am not convinced it's always justified, though. Consider something like a DB cost-based optimizer. Judging from the literature, it's a legit academic topic. There is only one open source optimizer project I know of, Apache Calcite. It's actually a very mature framework with a long history and major products using it. It is also very meta, which makes it extremely difficult to really grasp despite its impressive javadocs. I have not found any design papers on it, and I regret that deficiency very much. Anyway, the AsterixDB guys don't even mention Calcite/Optiq, even though they admit that their own optimizer is rule-based. The way they describe their optimizer makes me think Algebricks wants to be Calcite when it grows up.

On a lighter note, we are running out of previously unused words. When I hear that AsterixDB is a BDMS I conjure up images of a venerable video genre. When I read about Algebricks I remember Algerbird. 
