0. Introduction
Imagine a data schema that can be described as (M, D1, ..., Dn, {(t,v)}), where every entry has
- M - a measure (e.g. Site Visits)
- Di - a few dimensions (e.g. Country, Traffic Source)
- {(t,v)} - a time series with a measure value for each timestamp
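A minimal sketch of that shape in Python (the names are illustrative, not taken from any particular system):

```python
from dataclasses import dataclass

@dataclass
class Series:
    measure: str                      # M, e.g. "site_visits"
    dimensions: dict[str, str]        # Di -> value, e.g. {"country": "US"}
    points: list[tuple[int, float]]   # {(t,v)}: (unix timestamp, measure value)

visits = Series(
    measure="site_visits",
    dimensions={"country": "US", "traffic_source": "organic"},
    points=[(1700000000, 42.0), (1700000060, 57.0)],
)
```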
Calling this "time series OLAP" could be a misnomer: in a typical OLAP schema, time is just another dimension, while "time series" usually implies a bare metric without dimensions (e.g. server CPU usage as a function of time). Nevertheless, real-life systems dealing with such data are quite common, and they face similar challenges. Chief among them is the need for low-latency, interactive queries, so traditional Hadoop-style batch scalability is not enough. There are a couple of systems out there, less well known than Spark or Flink, that seem to be good at solving this time series OLAP puzzle.
1. Druid ideas
- For event streams, pre-aggregate small batches at some minimum granularity, e.g. 1 min (see the roll-up sketch after this list)
- Time is a dimension treated differently because every query uses it
- Partition on time into batches (+ versions), each to become a file
- Each batch has, IR-style, a dictionary of terms and a compressed bitmap for each term, so query filters compile into efficient bitwise AND/OR operations on large bitmaps (see the bitmap sketch after this list)
- Column-oriented format for partitions
- A separate service decides which partitions are stored/cached by which historical data server
- Metadata is stored as BLOBs in a relational DB
- ZK for service discovery and segment mappings
- Historical and real-time nodes, queried separately by brokers that merge the partial results into the final answer
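A toy sketch of the roll-up idea from the first bullet, assuming a 1-minute minimum granularity and hourly time partitions (both numbers are illustrative, not Druid's):

```python
from collections import defaultdict

GRANULARITY = 60  # seconds: the minimum pre-aggregation granularity (1 min)

def rollup(events):
    """Pre-aggregate raw events: (minute bucket, dimension values) -> summed measure.

    events: iterable of (timestamp, dims, value),
            e.g. (1700000042, ("US", "organic"), 1.0)
    """
    buckets = defaultdict(float)
    for ts, dims, value in events:
        minute = ts - ts % GRANULARITY     # truncate timestamp to its minute
        buckets[(minute, dims)] += value   # one row per minute per dimension combo
    return buckets

def partition_by_time(buckets, segment_seconds=3600):
    """Group rolled-up rows into hourly partitions; each would become one segment file."""
    segments = defaultdict(dict)
    for (minute, dims), value in buckets.items():
        segments[minute - minute % segment_seconds][(minute, dims)] = value
    return segments
```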
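And a toy version of the IR-style indexing from the fourth bullet. Druid uses compressed bitmap formats (such as Roaring); this sketch fakes them with Python ints used as bitsets:

```python
from collections import defaultdict

class BitmapIndex:
    """Per-column inverted index: dictionary of terms, one bitmap of row ids per term."""

    def __init__(self, column_values):
        self.bitmaps = defaultdict(int)          # term -> bitset (int as a bitmap)
        for row_id, term in enumerate(column_values):
            self.bitmaps[term] |= 1 << row_id    # set the bit for this row

# One index per dimension column, five rows of data:
countries = BitmapIndex(["US", "DE", "US", "FR", "US"])
sources   = BitmapIndex(["organic", "ads", "ads", "organic", "ads"])

# The filter country = 'US' AND source = 'ads' compiles to one bitwise AND:
matches = countries.bitmaps["US"] & sources.bitmaps["ads"]
row_ids = [i for i in range(5) if matches >> i & 1]   # -> [2, 4]
```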
2. Pinot ideas
- Kafka and Hadoop M/R based
- Historical and real-time nodes, queried by brokers
- Data segment: a fixed-size block of records stored as a single file
- Time is special. For example, brokers know how to serve a single query from both realtime and historical nodes by sending each a copy with a different time filter (sketched after this list). Realtime is slower because it aggregates at a finer granularity ("min/hour range instead of day")
- ZK for service discovery
- Realtime nodes keep segments in memory, flush them to disk periodically, and query both (also sketched after this list)
- Multiple segments (with the same schema) constitute a table
- Column-oriented format for segments
- A few kinds of indices per segment (forward, single-value sorted, inverted)
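A rough sketch of the two realtime-related bullets above, assuming an invented one-day cutoff between node types and an invented flush threshold; none of these names come from Pinot:

```python
import time

CUTOFF_AGE = 24 * 3600  # assumption: historical nodes own data older than one day

def split_time_filter(start, end, now=None):
    """Broker side: split one query's time range between historical and realtime nodes.

    The same query goes to both node types with different time filters, and the
    broker merges the partial results. Either half may be None.
    """
    now = now if now is not None else int(time.time())
    cutoff = now - CUTOFF_AGE
    historical = (start, min(end, cutoff)) if start < cutoff else None
    realtime = (max(start, cutoff), end) if end > cutoff else None
    return historical, realtime

class RealtimeNode:
    """Buffers incoming rows in memory and periodically flushes them as immutable segments."""

    def __init__(self, flush_threshold=100_000):
        self.buffer = []       # in-memory segment under construction
        self.on_disk = []      # flushed segments
        self.flush_threshold = flush_threshold

    def ingest(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold:
            self.on_disk.append(tuple(self.buffer))  # "flush": freeze the buffer
            self.buffer = []

    def query(self, predicate):
        # Queries must see both the flushed segments and the in-memory buffer.
        for segment in self.on_disk + [self.buffer]:
            for row in segment:
                if predicate(row):
                    yield row
```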
3. Common threads
- Column oriented storage formats (a tiny illustration after this list)
- Partition on time
- Information Retrieval/Search Engine techniques for data storage (inverted index)
- ZK is a popular choice of discovery service
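For the first thread, a tiny illustration of what column orientation buys; the layout is invented, not either system's actual file format:

```python
def to_columns(rows, column_names):
    """Pivot row-oriented records into one contiguous array per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}

rows = [
    (1700000000, "US", "organic", 42.0),
    (1700000060, "DE", "ads",     13.0),
]
segment = to_columns(rows, ("timestamp", "country", "source", "visits"))
# A filter on country now scans only segment["country"] == ["US", "DE"],
# instead of touching every field of every record.
```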
4. In other news
There are also smaller-scale products using search engine technology for OLAP data storage.