Jun 12, 2016

Druid and Pinot low latency OLAP design ideas

0. Introduction

Imagine a data schema that can be described as (M, D1, .. , Dn, {(t,v)} ) where every entry has
  • M -  a measure (e.g. Site Visits)
  • Di - a few dimensions (e.g. Country, Traffic Source)
  • {(t,v)} - a time series with a measure value for each timestamp

Calling it a time series OLAP could be a misnomer. Time is just another dimension in a typical OLAP schema. Time series usually implies just a metric without dimensions (e.g. server CPU usage as a function of time). Nevertheless in real life systems dealing with such data is quite common. They also face similar challenges. Chief among them is the need for low latency/interactive queries. So traditional Hadoop scalability is not enough. There are a couple systems out there that are less well known than Spark or Flink but seem to be good at solving the time series OLAP puzzle.

1. Druid ideas 
  • For event streams, pre-aggregate small batches (at some minimum granularity e.g. 1 min)
  • Time is a dimension to be treated differently because all the queries use it
  • Partition on time into batches (+ versions), each to become a file
  • Each batch to have (IR-style): a dictionary of terms, a compressed bitmap for each term (and so query filters are compiled into efficient binary AND/OR operations on large bitmaps)
  • Column-oriented format for partitions
  • A separate service to decide which partitions to be stored /cached by which historic data server
  • Metadata is stored as BLOBs in a relational DB
  • ZK for service discovery and segment mappings
  • Historical and real-time nodes, queried by coordinators separately to produce final results

2. Pinot ideas
  • Kafka and Hadoop M/R based
  • Historical and real-time nodes, queried by brokers
  • Data segment : fixed-size blocks of records stored as a single file
  • Data segments
  • Time is special. Ex: brokers know how to query realtime and historic nodes for the same query by sending queries with different time filters. Realtime is slower because it aggregates at different granularity ("min/hour range instead of day")
  • ZK for service discovery
  • Realtime has segments in-memory, flushes to disk periodically, queries both.
  • Multiple segments (with the same schema) constitute a table
  • Column-oriented format for partition for segments
  • A few indices for a segment (forward, single-value sorted, )
3. Common threads
  • Column oriented storage formats
  • Partition on time
  • Information Retrieval/Search Engine techniques for data storage (inverted index)
  • ZK is a popular choice of discovery service
4. In other news

There also smaller scale products using search engine technology for OLAP data storage.

No comments: