Dec 10, 2014


I expect Aeron will follow a trajectory similar to the Disruptor's: fewer people will actually use it in production than will learn valuable lessons from its design principles. Aside from a market feed processing system, I have actually seen the Disruptor used only in a low-latency logger. But I still remember how surprising the emphasis on a single writer and all the nifty non-blocking tricks looked a few years ago. 

So I went through the original Strange Loop presentation for a first taste of extreme messaging wisdom. For better or worse, actually learning any new low-level details will take some source code digging. But from the first pass, one technique in particular reminded me of both Disruptor and Kafka thinking. 

On the sender side, there are three buffers. A clean one (empty for the time being), a dirty one (completely filled, could be used to go a little back in sent message history) and the current one (partially filled, to be used for the next send request). There is a current offset pointer shared by all the sender threads. When a message is being sent, the sender thread first tries to increment that counter with a CAS, retrying if necessary. That CAS allows multiple senders to do most work, including actually writing a message to the output buffer, without synchronization.
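The claim-then-write step above can be sketched as follows. This is a minimal illustration, not Aeron's actual code; the class and method names are mine, and real implementations deal with buffer rotation, padding, and memory ordering that are elided here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: multiple producers claim space in a shared buffer
// with a CAS on a single tail counter, then write without further locking.
class ClaimBuffer {
    private final byte[] buffer;
    private final AtomicLong tail = new AtomicLong(0);

    ClaimBuffer(int capacity) {
        this.buffer = new byte[capacity];
    }

    // Claim `length` bytes; returns the start offset of the claimed slot,
    // or -1 if the current buffer is full (time to rotate to the clean one).
    long claim(int length) {
        while (true) {
            long current = tail.get();
            long next = current + length;
            if (next > buffer.length) {
                return -1;
            }
            if (tail.compareAndSet(current, next)) {
                return current;   // slot claimed; no other producer owns it
            }
            // CAS lost the race; retry with the updated counter
        }
    }

    // Once a slot is claimed, the producer writes into it unsynchronized.
    void write(long offset, byte[] message) {
        System.arraycopy(message, 0, buffer, (int) offset, message.length);
    }
}
```

The point is that contention is confined to the single CAS loop; the (comparatively expensive) copying of message bytes happens outside any critical section.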

On the receiver side, the setup is similar but there are two offset pointers: one for the high water mark (the highest message offset seen so far) and the other for the last message received in sequential order (i.e. without missing messages at lower offsets). The two pointers allow the receiver to asynchronously restore the original sequence for messages received out of order. In addition, another thread can periodically check the difference between the two offsets to detect a loss of messages in transit, and hence the need for re-transmission. 
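The two-pointer bookkeeping can be sketched like this. Again a hypothetical illustration of the idea rather than Aeron's code; a real implementation would use a fixed window over the term buffer instead of an unbounded bitset.

```java
import java.util.BitSet;

// Hypothetical sketch: track received offsets with two pointers.
// highWaterMark  = highest offset seen so far
// contiguous     = last offset received with no gaps below it
class ReceiverWindow {
    private final BitSet received = new BitSet();
    private long highWaterMark = -1;
    private long contiguous = -1;

    void onMessage(long offset) {
        received.set((int) offset);
        if (offset > highWaterMark) {
            highWaterMark = offset;
        }
        // Advance the contiguous pointer past any newly filled gaps.
        while (received.get((int) (contiguous + 1))) {
            contiguous++;
        }
    }

    // A persistent gap between the two pointers signals message loss
    // in transit, i.e. the need to request re-transmission.
    boolean hasGap() {
        return highWaterMark > contiguous;
    }
}
```

A monitoring thread only needs to read the two counters to decide whether re-transmission is needed; it never touches the data path.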

Dec 5, 2014


The other day I noticed that the Hyracks research prototype had a continuation. A few years ago it was quite interesting: the Dryad paper was in vogue but its source code was unavailable, and the SQL-on-Hadoop segment was not as overcrowded. Heck, if I remember correctly, the term itself did not exist then. I perused their recent papers and I am not quite sure what to think about them. 

I liked the one on data storage because it nicely illustrates the use of LSM tree indexes. Their streaming support looks somewhat unnatural compared with the way event-oriented systems are built and used in the Storm style. The critique of mainstream Java memory wastefulness is fair but too basic. They don't even consider vectorization, which is such a hot topic in analytics. I'd rather see what the researchers think about using off-heap storage with the latest JDK. 

I understand that academics avoid making real contributions to production systems. I am not convinced it's always justified, though. Consider something like a DB cost-based optimizer. Judging from the literature, it's a legit academic topic. The only open-source optimizer project I know of is Apache Calcite. It's actually a very mature framework with a long history and major products using it. It is also very meta, which makes it extremely difficult to really grasp despite its impressive javadocs. I have not found any design papers on it, and I regret that deficiency very much. Anyway, the AsterixDB guys don't even mention Calcite/Optiq, even though they admit that their own optimizer is rule-based. The way they describe their optimizer makes me think Algebricks wants to be Calcite when it grows up. 

On a lighter note, we are running out of previously unused words. When I hear that AsterixDB is a BDMS, I conjure up images of a venerable video genre. When I read about Algebricks, I remember Algebird. 

Nov 30, 2014

Kafka design ideas

A new company behind Kafka development and a fascinating new high-throughput messaging framework making waves inspired me to look at the major ideas in Kafka's design. The overall strategy is to maximize throughput through non-blocking, sequential-access logic accompanied by decentralized decision making. 
  • A partition corresponds to a logical file, which is implemented as a sequence of approximately same-size physical files. 
  • New messages are appended to the current physical file. 
  • Each message is addressed by its global offset in the logical file. 
  • A consumer reads from a partition file sequentially (pre-fetching chunks in the background). 
  • No application level cache. Kafka relies on the file system page cache. 
  • A highly efficient Unix API (available in Java via FileChannel::transferTo()) is used to minimize data copying between application/kernel buffers. 
  • A consumer is responsible for remembering how much it has consumed, to avoid broker-side bookkeeping. 
  • A message is deleted by the broker after a configured timeout (typically a few days) for the same reason. 
  • A partition is consumed by a single consumer (from each consumer group) to avoid synchronization. 
  • Brokers, consumers and partition ownership are registered in ZK. 
  • Broker and consumer membership changes trigger a rebalancing process. 
  • The rebalancing algorithm is decentralized and relies on typical ZK idioms. 
  • Consumers periodically update ZK registry with the last consumed offset. 
  • Kafka provides at-least-once delivery semantics. When a consumer crashes, the corresponding last read offset in ZK could lag behind, so the consumer that takes over will re-deliver the events from that window. 
  • The event-consuming application itself is expected to apply deduplication logic if necessary.
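The first few bullets — a logical file backed by same-size segment files, with each message addressed by a global offset — boil down to a simple lookup. A minimal sketch of the idea (names and structure are mine; Kafka's actual log implementation is considerably richer, with per-segment sparse indexes):

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: locate a message by global offset in a segmented log.
// Each segment is keyed by the global offset of its first message.
class SegmentedLog {
    // base offset of each segment -> segment name (stand-in for a file handle)
    private final TreeMap<Long, String> segments = new TreeMap<>();

    void addSegment(long baseOffset, String fileName) {
        segments.put(baseOffset, fileName);
    }

    // Find the segment holding the offset, plus the position within it.
    String locate(long globalOffset) {
        Map.Entry<Long, String> entry = segments.floorEntry(globalOffset);
        if (entry == null) {
            throw new IllegalArgumentException("offset precedes log start");
        }
        long relative = globalOffset - entry.getKey();
        return entry.getValue() + "@" + relative;
    }
}
```

Because segments are append-only and named by base offset, both the broker and a consumer can resolve an offset to a file position with a single ordered-map lookup, no central index required.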