Mar 11, 2015

Scala is hot in analytics companies

It is hard to believe it's been six years since I started worrying that Java is obsolete :) We lived long enough to see Java 8 in production. A couple of years ago I thought that Scala had flat-lined. I remember keynotes about it never reaching the mainstream and compiler folks leaving. But this year I can see a very different picture. Most of the companies dealing with analytics are writing Scala. A great many are either already running Spark in production or prototyping it. At this point it feels like a very good time to join such a company. There are too few people around with real Scala experience for employers to make it a strict requirement.

It is interesting to observe that some arguments against Scala sound so familiar from the good old C++ days. Some companies ban the use of a certain feature subset (e.g. implicits). Some folks have trouble with a multi-paradigm language where there is so much choice. Few people are excited about function signatures in the standard library (remember STL? partially specialized templates, anyone?). I know good engineers are supposed to value simplicity. I must be a bad one: I am attracted to complexity. Scala reminds me of C++ in this respect. There is always some language tidbit to pick up every time you read a book. In Java only the j.u.c package could fill one with so much joy for years :)

Mar 10, 2015

Research projects inch towards mainstream big data analytics

Just recently I wrote about a previously small research prototype that grew up into a much larger system. It took me some time to recognize that another university project from a few years ago got even closer to the world of real software. After rebranding, it is known as Flink and resides in the Apache Incubator.

It raises a few interesting questions about the evolution of what could be liberally called SQL-on-Hadoop query engines. A growing number of open source systems used for analytics look architecturally very similar. If you squint enough, each and every one of Presto/Impala/Hive/Drill/Spark/AsterixDB/Flink and their ilk is fundamentally a relational MPP database. Or at least its query engine half.

Some of them started with a poor computational model. To be more precise, the MapReduce revolution of Hadoop 1.0 gave people access to the kind of computational infrastructure previously absent from the open source world. Without alternatives, people were forced to abuse a simple model invented for batch processing of logs. Others didn't limit themselves to two kinds of jobs and went for real relational operators.
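To make the contrast concrete, here is a toy sketch of the same join written both ways: first squeezed into map and reduce steps, then as a single hash join operator. It runs in one process and is not the code of any of the engines mentioned; the tables and all function names (`map_phase`, `shuffle`, `reduce_phase`, `hash_join`) are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample tables: (user_id, name) and (user_id, url).
users = [(1, "ann"), (2, "bob")]
visits = [(1, "/home"), (1, "/docs"), (2, "/home")]

def map_phase(users, visits):
    """Map step: tag every record with its source table and emit it by key."""
    for uid, name in users:
        yield uid, ("U", name)
    for uid, url in visits:
        yield uid, ("V", url)

def shuffle(pairs):
    """Stand-in for the framework's shuffle: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: untangle the tagged values and pair them up per key."""
    for uid, values in groups.items():
        names = [v for tag, v in values if tag == "U"]
        urls = [v for tag, v in values if tag == "V"]
        for name in names:
            for url in urls:
                yield uid, name, url

def mapreduce_join(users, visits):
    """The join expressed with nothing but map, shuffle and reduce."""
    return sorted(reduce_phase(shuffle(map_phase(users, visits))))

def hash_join(build, probe):
    """The same join as one relational operator: build a table, then probe it."""
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return sorted((key, b, p) for key, p in probe for b in table.get(key, []))
```

Both functions return the same rows for the sample data. The difference is that the first has to smuggle the notion of "which table did this come from" through tags, while the second states the operation directly, which is roughly what an engine built around relational operators gets to plan and optimize.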

Impala was arguably the first wake-up call from the database universe. Its technological stack alone was a clear indication of the kind of gun real database folks bring to a Hadoop knife fight. Nowadays it's Spark that is so bright and shiny that every other day people suggest it could replace MapReduce in the Hadoop stack.

I reckon it'll take years to see how it all plays out. If Spark, with its academic pedigree, can challenge Hadoop/MR, could it be the case that AsterixDB or Flink will mature enough in a couple of years to do something similar? They also started at respected universities, roughly at the same time, had the right relational MPP core from day one, and have some Scala in the code base. Are they too late to the game to ever acquire enough traction? Do they need a company to provide support (and aggressive marketing) as a precondition? Will they be able to run on YARN/Mesos well enough to compete in a common Hadoop environment?