Most Hadoop features must be managed manually through multiple obscure configuration parameters. For example, Hadoop supports dynamic cluster membership changes, but it offers no support for deciding when to add or remove nodes or when to rebalance the data layout. Instead of aiming for peak performance, the Starfish project wants to provide good Hadoop performance automatically.
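To make "obscure parameters" concrete, here is a minimal sketch of the kind of hand-tuning a single MR job normally requires. The property names are standard Hadoop 2 job settings, but the values chosen are illustrative guesses, not recommendations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class HandTunedJob {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Three of the ~190 job-level knobs that are usually set by trial and error:
        conf.setInt("mapreduce.task.io.sort.mb", 256);            // map-side sort buffer size
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // buffer fill ratio that triggers a spill
        conf.setBoolean("mapreduce.map.output.compress", true);   // compress intermediate map output
        Job job = Job.getInstance(conf, "hand-tuned-job");
        job.setNumReduceTasks(16);                                // reducer parallelism, another guess
        return job;
    }
}
```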
Three levels of Hadoop workload optimization:
- Individual MR jobs
- MR jobs assembled into a workflow (e.g. generated from HiveQL or by a Cascading-style framework)
- Collections of workflows
- JIT optimizer (instead of manual configuration of 190 parameters) based on
  - Profiler (dynamic instrumentation to learn performance models) and
  - Sampler (statistics about the input/intermediate/output key-value spaces of a job)
- Views of the resulting job profile (a stripped-down profile is sketched after this list):
  - Timings view: where wall-clock time is spent in each phase
  - Data flow view: how much data is processed in each phase
  - Resource-level view: how many resources, such as CPU and memory, are used in each phase
- Workflow-aware Scheduler ("global optimization")
- WhatIf engine (uses performance models and a job profile to estimate a new profile under different configuration parameters; a toy what-if estimate is sketched after this list)
- Data Manager (rebalances HDFS data blocks using different block placement policies)
- Workload Optimizer to generate an equivalent, but optimized, collection of workflows using
  - Data-flow sharing (reusing the same job on behalf of different workflows)
  - Materialization (caching intermediate data for later reuse, possibly by other workflows; also helps avoid cascading re-execution; a minimal reuse check is sketched after this list)
  - Reorganization (automatically chosen alternative means of keeping intermediate data, such as key-value and column stores)
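As a rough illustration of what the three profile views contain, here is a hypothetical, stripped-down job profile. All class and field names are invented for this sketch and do not match Starfish's actual profile format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, minimal stand-in for a Starfish job profile: one record per
// execution phase (map, shuffle, reduce, ...), each carrying the three views
// described above.
public class ToyJobProfile {
    public static class PhaseStats {
        final double wallClockMs;   // timings view: where time is spent
        final long bytesProcessed;  // data flow view: how much data moves through the phase
        final double cpuSeconds;    // resource-level view: CPU used
        final long peakMemoryBytes; // resource-level view: memory used

        PhaseStats(double wallClockMs, long bytesProcessed, double cpuSeconds, long peakMemoryBytes) {
            this.wallClockMs = wallClockMs;
            this.bytesProcessed = bytesProcessed;
            this.cpuSeconds = cpuSeconds;
            this.peakMemoryBytes = peakMemoryBytes;
        }
    }

    final Map<String, PhaseStats> phases = new LinkedHashMap<>();

    public static void main(String[] args) {
        ToyJobProfile profile = new ToyJobProfile();
        profile.phases.put("map",     new PhaseStats(42_000, 1_500_000_000L, 38.0, 300_000_000L));
        profile.phases.put("shuffle", new PhaseStats(18_000,   400_000_000L,  5.0, 120_000_000L));
        profile.phases.put("reduce",  new PhaseStats(25_000,   400_000_000L, 21.0, 250_000_000L));
        profile.phases.forEach((phase, s) ->
                System.out.printf("%-8s %8.0f ms %14d bytes %6.1f cpu-s%n",
                        phase, s.wallClockMs, s.bytesProcessed, s.cpuSeconds));
    }
}
```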
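In the same spirit, a toy what-if estimate under an invented linear cost model: given per-phase timings from a profile, guess how wall-clock time changes when the number of reducers changes. Starfish's real WhatIf engine uses far more detailed performance models; this only shows the shape of the question it answers.

```java
import java.util.HashMap;
import java.util.Map;

// Toy what-if estimate: assume reduce-side phases scale inversely with the
// number of reducers and map-side phases are unaffected. This linear model is
// invented for illustration; it is not Starfish's cost model.
public class ToyWhatIf {
    static Map<String, Double> estimate(Map<String, Double> phaseMs,
                                        int oldReducers, int newReducers) {
        Map<String, Double> predicted = new HashMap<>();
        for (Map.Entry<String, Double> e : phaseMs.entrySet()) {
            boolean reduceSide = e.getKey().equals("shuffle") || e.getKey().equals("reduce");
            double scale = reduceSide ? (double) oldReducers / newReducers : 1.0;
            predicted.put(e.getKey(), e.getValue() * scale);
        }
        return predicted;
    }

    public static void main(String[] args) {
        Map<String, Double> measured = new HashMap<>();
        measured.put("map", 42_000.0);
        measured.put("shuffle", 18_000.0);
        measured.put("reduce", 25_000.0);
        // "What if the same job ran with 32 reducers instead of 16?"
        System.out.println(estimate(measured, 16, 32));
    }
}
```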
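For materialization, the core idea fits in a few lines: before re-running a workflow stage, check whether its intermediate output already exists in HDFS and reuse it if so. The directory layout and the job-submission stub below are placeholders; only the FileSystem calls are real Hadoop APIs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the reuse check behind materialization: skip a workflow stage when
// its materialized intermediate output is already present in HDFS. The path is
// a made-up placeholder; runStageAndWriteTo() stands in for submitting the
// actual MR job of that stage.
public class MaterializationCheck {
    public static Path intermediateFor(String stageId, Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path cached = new Path("/warehouse/intermediate/" + stageId); // hypothetical layout
        if (fs.exists(cached)) {
            return cached;              // reuse: no cascading re-execution of upstream stages
        }
        runStageAndWriteTo(cached);     // placeholder for submitting the stage's MR job
        return cached;
    }

    private static void runStageAndWriteTo(Path output) {
        // Job submission elided; a real implementation would configure and run the stage here.
    }
}
```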
Starfish is an open-source project.