Most Hadoop features must be managed manually through multiple obscure configuration parameters. For example, Hadoop supports dynamic cluster membership changes, but it offers no support for deciding when to add or remove nodes or when to rebalance the data layout. Instead of aiming for peak performance, the Starfish project wants to provide good Hadoop performance automatically.
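To make "obscure parameters" concrete, here is a minimal sketch of the kind of hand-tuning a single MR job normally requires. The property names are standard Hadoop 2 job settings, but the values chosen are illustrative guesses, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class HandTunedJob {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Three of the ~190 job-level knobs that are usually set by trial and error:
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer size
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // buffer fill ratio that triggers a spill
        conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate map output
        Job job = Job.getInstance(conf, "hand-tuned-job");
        job.setNumReduceTasks(16);                                // reducer parallelism, another guess
        return job;
    }
}
```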
Three levels of Hadoop workload optimization:
- Individual MR jobs
- MR jobs assembled into a workflow (e.g. generated from HiveQL or by a Cascading-style framework)
- Collections of workflows
- JIT optimizer (instead of manual configuration of 190 parameters) based on
  - Profiler (dynamic instrumentation to learn performance models) and
  - Sampler (statistics about the input/intermediate/output key-value spaces of a job)
- Views of the resulting job profile (a stripped-down profile is sketched after this list):
  - Timings view: where wall-clock time is spent in each phase
  - Data flow view: how much data is processed in each phase
  - Resource-level view: how many resources, such as CPU and memory, are used in each phase
- Workflow-aware Scheduler ("global optimization")
- WhatIf engine (uses performance models and a job profile to estimate a new profile under different configuration parameters; a toy what-if estimate is sketched after this list)
- Data Manager (rebalances HDFS data blocks using different block placement policies)
- Workload Optimizer to generate an equivalent, but optimized, collection of workflows using
  - Data-flow sharing (reusing the same job on behalf of different workflows)
  - Materialization (caching intermediate data for later reuse, possibly by other workflows; also helps avoid cascading re-execution; a minimal reuse check is sketched after this list)
  - Reorganization (automatically chosen alternative means of keeping intermediate data, such as key-value and column stores)
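As a rough illustration of what the three profile views contain, here is a hypothetical, stripped-down job profile. All class and field names are invented for this sketch and do not match Starfish's actual profile format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, minimal stand-in for a Starfish job profile: one record per
// execution phase (map, shuffle, reduce, ...), each carrying the three views
// described above.
public class ToyJobProfile {
    public static class PhaseStats {
        final double wallClockMs;   // timings view: where time is spent
        final long bytesProcessed;  // data flow view: how much data moves through the phase
        final double cpuSeconds;    // resource-level view: CPU used
        final long peakMemoryBytes; // resource-level view: memory used

        PhaseStats(double wallClockMs, long bytesProcessed, double cpuSeconds, long peakMemoryBytes) {
            this.wallClockMs = wallClockMs;
            this.bytesProcessed = bytesProcessed;
            this.cpuSeconds = cpuSeconds;
            this.peakMemoryBytes = peakMemoryBytes;
        }
    }

    final Map<String, PhaseStats> phases = new LinkedHashMap<>();

    public static void main(String[] args) {
        ToyJobProfile profile = new ToyJobProfile();
        profile.phases.put("map",     new PhaseStats(42_000, 1_500_000_000L, 38.0, 300_000_000L));
        profile.phases.put("shuffle", new PhaseStats(18_000,   400_000_000L,  5.0, 120_000_000L));
        profile.phases.put("reduce",  new PhaseStats(25_000,   400_000_000L, 21.0, 250_000_000L));
        profile.phases.forEach((phase, s) ->
                System.out.printf("%-8s %8.0f ms %14d bytes %6.1f cpu-s%n",
                        phase, s.wallClockMs, s.bytesProcessed, s.cpuSeconds));
    }
}
```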
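In the same spirit, a toy what-if estimate under an invented linear cost model: given per-phase timings from a profile, guess how wall-clock time changes when the number of reducers changes. Starfish's real WhatIf engine uses far more detailed performance models; this only shows the shape of the question it answers.

```java
import java.util.HashMap;
import java.util.Map;

// Toy what-if estimate: assume reduce-side phases scale inversely with the
// number of reducers and map-side phases are unaffected. This linear model is
// invented for illustration; it is not Starfish's cost model.
public class ToyWhatIf {
    static Map<String, Double> estimate(Map<String, Double> phaseMs,
                                        int oldReducers, int newReducers) {
        Map<String, Double> predicted = new HashMap<>();
        for (Map.Entry<String, Double> e : phaseMs.entrySet()) {
            boolean reduceSide = e.getKey().equals("shuffle") || e.getKey().equals("reduce");
            double scale = reduceSide ? (double) oldReducers / newReducers : 1.0;
            predicted.put(e.getKey(), e.getValue() * scale);
        }
        return predicted;
    }

    public static void main(String[] args) {
        Map<String, Double> measured = new HashMap<>();
        measured.put("map", 42_000.0);
        measured.put("shuffle", 18_000.0);
        measured.put("reduce", 25_000.0);
        // "What if the same job ran with 32 reducers instead of 16?"
        System.out.println(estimate(measured, 16, 32));
    }
}
```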
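For materialization, the core idea fits in a few lines: before re-running a workflow stage, check whether its intermediate output already exists in HDFS and reuse it if so. The directory layout and the job-submission stub below are placeholders; only the FileSystem calls are real Hadoop APIs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the reuse check behind materialization: skip a workflow stage when
// its materialized intermediate output is already present in HDFS. The path is
// a made-up placeholder; runStageAndWriteTo() stands in for submitting the
// actual MR job of that stage.
public class MaterializationCheck {
    public static Path intermediateFor(String stageId, Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path cached = new Path("/warehouse/intermediate/" + stageId); // hypothetical layout
        if (fs.exists(cached)) {
            return cached;              // reuse: no cascading re-execution of upstream stages
        }
        runStageAndWriteTo(cached);     // placeholder for submitting the stage's MR job
        return cached;
    }

    private static void runStageAndWriteTo(Path output) {
        // Job submission elided; a real implementation would configure and run the stage here.
    }
}
```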
Starfish is an open-source project.