Apr 23, 2011

Delay Scheduling and Hadoop Fair Scheduler

I am blessed with a job in a domain sophisticated enough to encourage reading research papers. I plan to post my notes on the most interesting ones, at least as a way of keeping the key points readily available.

So without further ado, here goes a summary of "Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling":

The key trade-off is fairness (in the allocation of resources) vs. data locality (running a task on a node that already holds its input data). The scheduler's goal is to share a cluster between multiple users with a mix of long batch jobs and short interactive queries over a common data set. Fair scheduling requires reallocating resources when the number of jobs changes.

The original Hadoop FIFO scheduler:

  • assign tasks in response to heartbeats sent by slaves, which report the number of free map and reduce slots on the slave
  • scan through jobs in order of priority and submit time to find one with a task of the required type
  • for maps, after selecting a job, greedily pick the map task in the job whose input data is closest to the slave (on the same node if possible, otherwise the same rack, otherwise remote)
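
A minimal sketch of that loop (hypothetical names, not Hadoop's actual code): on each heartbeat the master walks the jobs in priority/submit-time order and, for maps, greedily prefers the pending task whose input data is closest to the reporting slave.

    def assign_on_heartbeat(jobs, slave, free_map_slots, free_reduce_slots):
        """jobs is assumed to be kept sorted by (priority, submit_time)."""
        assignments = []
        for _ in range(free_map_slots):
            for job in jobs:
                # closest_pending_map stands in for the node-local /
                # rack-local / remote preference order
                task = job.closest_pending_map(slave)
                if task is not None:
                    assignments.append((slave, task))
                    break
        for _ in range(free_reduce_slots):
            for job in jobs:
                task = job.next_pending_reduce()  # reduces have no locality preference
                if task is not None:
                    assignments.append((slave, task))
                    break
        return assignments
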
Locality problems with naive fair sharing (assign free slots to the job that has the fewest running tasks):

  • Head-of-line scheduling: the job at the front of the sorted order is usually a small job with few data blocks, so it rarely has data on the node that frees a slot and its tasks end up on non-local nodes
  • Sticky Slots: a tendency for a job to be assigned the same slot repeatedly (when one of its tasks completes, the job again has fewer running tasks than the others, so the freed slot is reassigned to the same job)
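
A toy illustration of the sticky-slot effect (not code from the paper): under naive fair sharing, the job whose task just finished is now the one furthest below its share, so the freed slot is handed straight back to it.

    def naive_fair_assign(jobs):
        # assign the free slot to the job with the fewest running tasks
        return min(jobs, key=lambda j: j["running"])

    jobs = [{"name": "A", "running": 10}, {"name": "B", "running": 10}]

    for step in range(3):
        jobs[0]["running"] -= 1            # suppose a task of job A finishes
        winner = naive_fair_assign(jobs)   # A is now below its share...
        winner["running"] += 1             # ...so the same slot goes back to A
        print("slot reassigned to", winner["name"])
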
The Hadoop Fair Scheduler: 

  • divide resources using max-min fair sharing to achieve statistical multiplexing
  • place computations near their input data to maximize system throughput
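
A generic max-min fair sharing routine, as a hedged illustration of the first bullet rather than the scheduler's actual code: capacity is split evenly, jobs that need less than an even split keep only their demand, and the leftover is redistributed among the rest.

    def max_min_share(demands, capacity):
        shares = {j: 0.0 for j in demands}
        active = set(demands)
        remaining = float(capacity)
        while active and remaining > 1e-9:
            split = remaining / len(active)
            for j in sorted(active, key=lambda j: demands[j]):
                give = min(split, demands[j] - shares[j])
                shares[j] += give
                remaining -= give
                if shares[j] >= demands[j]:
                    active.discard(j)
        return shares

    # e.g. 100 slots and demands of 20, 50 and 100 tasks: the small job is
    # capped at 20 and the other two split the remaining 80 -> 40 each
    print(max_min_share({"A": 20, "B": 50, "C": 100}, 100))
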
A two-level scheduling hierarchy: 

  • allocate task slots across pools using weighted fair sharing
  • let each pool allocate its slots using either FIFO with priorities or a second level of fair sharing (each pool can be given a minimum share that is guaranteed as long as the pool contains jobs)
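
A rough sketch of the first level (assumed field names and a simplified policy, not the Fair Scheduler's real code): slots first go to satisfying pool minimum shares, and whatever is left is divided in proportion to pool weights.

    def divide_among_pools(pools, total_slots):
        """pools: list of dicts with 'name', 'weight', 'min_share', 'demand'."""
        shares = {}
        # 1. satisfy the minimum share of every pool that has runnable tasks
        for p in pools:
            shares[p["name"]] = min(p["min_share"], p["demand"])
        leftover = total_slots - sum(shares.values())
        # 2. split the remainder in proportion to weight, capped by demand
        total_weight = sum(p["weight"] for p in pools) or 1.0
        for p in pools:
            extra = leftover * p["weight"] / total_weight
            shares[p["name"]] = min(p["demand"], shares[p["name"]] + extra)
        return shares

    pools = [
        {"name": "prod",  "weight": 2.0, "min_share": 20, "demand": 100},
        {"name": "adhoc", "weight": 1.0, "min_share": 0,  "demand": 100},
    ]
    print(divide_among_pools(pools, 80))   # prod: 20 + 40 = 60, adhoc: 20
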
Delay Scheduling

  • each job has two timeouts, one for node locality and one for rack locality
  • if the job being considered has no pending task with data on the node offering a slot, it is skipped (made to wait) rather than run non-locally; once it has waited longer than the first timeout it may launch rack-local tasks
  • only after waiting longer than the second timeout may the job launch tasks with no locality at all (off-rack)
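
A condensed sketch of that policy (assumed method and field names, with the per-job wait tracked as time since the job last launched a local task): a job that cannot run node-locally simply waits, and only after its wait exceeds the first and then the second timeout does it fall back to rack-local and then off-rack tasks.

    import time

    NODE_LOCAL_TIMEOUT = 5.0    # illustrative values; tuned per cluster in practice
    RACK_LOCAL_TIMEOUT = 10.0

    def assign_slot(sorted_jobs, node):
        now = time.time()
        for job in sorted_jobs:                  # jobs in fair-share order
            waited = now - job.last_local_launch
            task = job.pending_task_on_node(node)
            if task is None and waited > NODE_LOCAL_TIMEOUT:
                task = job.pending_task_on_rack(node.rack)
            if task is None and waited > RACK_LOCAL_TIMEOUT:
                task = job.any_pending_task()
            if task is not None:
                job.last_local_launch = now      # reset the wait once a task launches
                return job, task
        return None                              # every job chose to wait; leave the slot free
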
Key enablers: 

  • most tasks are short compared to jobs
  • multiple locations in which a task can run to read a given data block (including multiple task slots per node)
