Jan 29, 2012

Scheduler design in YARN / MapReduce2 / Hadoop 0.23


The Hadoop branch started with version 0.23 introduced a few significant changes to Hadoop implementation. From the middleware perspective, the most salient one is a brand new approach to scheduling. Previously, Map and Reduce slots used to be explicitly distinguished for scheduling purposes which limited potential for using Hadoop framework in non-MapReduce environments. In YARN, one of the key goals was to make the entire framework more generic, Mesos-style, so that internal distributed machinery can be used for other computational models such as MPI.

I am particularly curious about the possibility to heavily customize YARN for running distributed computations based on a non-standard computational model. Now that even MSFT dropped their, arguably more powerful alternative, Hadoop is destined to become the most advanced open source distributed infrastructure. So many in-house computational platforms could potentially benefit from reusing its core components.

There is still interesting tension between fundamental Hadoop trade-offs and low latency requirements. To guarantee control over resource consumption and allow termination of misbehaving tasks Hadoop starts a new JVM instance for each allocated container. JVM startup and loading of task classes take a few seconds which would be spared if JVMs were reused. This approach is taken by, for example, by GridGain for the price of inability to restrict resource consumption of a particular task on a given node. We'll see how Hadoop developers extend resource allocation to CPUs which Mesos achieves with Linux Containers.

Scheduler API

The way the Scheduler API is defined has significant influence on how general and so reusable YARN will be. Currently, YARN is shipped with the same schedulers as before but if it gets used outside of default MapReduce world custom schedulers will likely be one of the most popular extension points.

If you look at the diagram below, you can see key abstractions of YARN Scheduler design.

  • Scheduler is notified about topology changes via event-based mechanism
  • Scheduler operates on one or more hierarchical task queues. A queue can be chosen in Job Configuration using the mapred.job.queue.name property.
  • Scheduler allocates resources on the basis of resource requests that currently are limited to memory only. A request includes optional preference for a particular host or rack.
  • Scheduler supports fault tolerance with the ability to recover itself from a state snapshot provided by Resource Manager

Scheduler events include notifications about changes to the topology of available nodes and the set of running applications:
Side notes

It remains to be seen how quickly YARN takes over classical Hadoop. Even though the new code base is more interesting, it seems to be rather immature despite three years since the previous major version. Some things which surprised me:

  • There are important pieces of code such as Scheduler recovery from failures which are currently commented out. I also saw a TODO item stating something like "synchronization approach is broken here" in another place. Probably it's me, but before I merged a couple files with trunk I had not been able to build original v0.23.
  • The good news is that they finally migrated to Maven. The bad news is that for a project of this magnitude they have only a handful of Maven modules and those are strangely nested and interleaved with multiple non-maven directories.
  • It's quite odd not to have any dependency injection framework used in a large server-side Java system.
  • Not only they have their own tiny web framework but it is mixed with pure server-side code
  • Even though Guava is declared as a dependency it is used only sporadically. As an example, they even have their own Service hierarchy

Proliferation of Hadoop branches does not make it any easier I guess.