May 30, 2011

ZooKeeper-based implementation of GridGain Discovery SPI

The GridGain team is not particularly fond of ZooKeeper. Their argument is that ZK calls for a rather static configuration, which is somewhat at odds with the original vision of GridGain as a highly dynamic framework. Nevertheless, nowadays a growing number of products use ZK as their discovery service to enable SOA-style architectures. Once you have ZK installed, it's quite natural to try to reuse it to keep track of all kinds of servers in your system, including GridGain nodes.

GridGain is a pretty interesting niche player. To the best of my knowledge, it was the only open source Java product that started as a computational grid framework back when everyone else was implementing K/V-store-style distributed caches. Curiously enough, nowadays everyone else is bolting on something similar while GridGain is working on a distributed cache of its own.

When I first learned about GridGain a couple of years ago, the two most striking features for me were its SPI-based architecture and the quality of its javadocs. Except for the kernel, pretty much every piece of functionality in GridGain belongs to one of a dozen well-defined SPI interfaces. So, in theory, if you feel like it you can provide your own implementation of any slice of the system. On the flip side, the entire framework is still shipped as one jar file, so Maven-based configuration management is complicated by the need to explicitly exclude at least half of the transitive dependencies it pulls in by default.

The Discovery SPI has a few obvious implementations out of the box. The recommended one is a recently rebranded and updated TCP-based implementation. Its protocol seems to have gossip characteristics, but there is also the notion of a coordinator. Under the hood, this implementation relies on an auxiliary notion of an IP finder to determine the initial set of peers to try to connect to. If a connection succeeds, the peer server forwards the connection request to the coordinator, which is responsible for keeping a versioned view of the current topology.

The easy way

The TCP Discovery implementation internals are pretty complicated because the connection state machine, the corresponding message processing logic, and the socket IO all reside in the same Java class. Nevertheless, it looks like the easiest way to integrate with ZK is to implement a new IP finder. At a high level (a bare-bones sketch follows the list):
  • Choose a persistent znode with a sufficiently unique name/path (especially if you share a ZK ensemble among multiple GridGain clusters)
  • When a new instance of your finder is created: connect to ZK, check whether the base znode exists and create it if necessary, and set a watcher for the children of the base znode
  • When your ChildrenCallback is notified, extract URLs from the returned list of strings
  • When the local GridTcpDiscoverySpi instance registers an address with its finder, create an ephemeral child node with a name based on the address (e.g. following the "hostname:port" convention) under the base znode
  • There is no need to remove the children created by a GridGain instance when the instance is stopped. ZK will detect a closed session and delete the corresponding znodes automatically.
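Here is what those steps might look like against the plain ZooKeeper client. The znode path and class name are mine, and I use the synchronous getChildren() instead of the asynchronous ChildrenCallback for brevity; the real thing would wrap this in GridGain's IP finder interface:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    /** Hypothetical ZK-backed address registry for a custom IP finder. */
    public class ZkAddressRegistry implements Watcher {
        private static final String BASE = "/gridgain-peers"; // assumed, unique per cluster
        private final ZooKeeper zk;
        private volatile List<String> peers = new ArrayList<String>();

        public ZkAddressRegistry(String connectString) throws Exception {
            zk = new ZooKeeper(connectString, 30000, this);
            try { // create the persistent base znode unless it already exists
                zk.create(BASE, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException ignored) {
                // another instance got there first - that's fine
            }
            refreshPeers(); // the initial read also installs the children watcher
        }

        /** Publish this node's address ("hostname:port") as an ephemeral child. */
        public void register(String hostPort) throws Exception {
            zk.create(BASE + "/" + hostPort, new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            // no explicit cleanup on shutdown: ZK deletes the znode with the session
        }

        /** Currently known peer addresses, parsed from child znode names. */
        public List<String> getRegisteredAddresses() {
            return peers;
        }

        private void refreshPeers() throws Exception {
            // "true" re-arms the watcher so we hear about future membership changes
            peers = new ArrayList<String>(zk.getChildren(BASE, true));
        }

        @Override public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeChildrenChanged) {
                try {
                    refreshPeers();
                } catch (Exception e) {
                    // a real implementation would log and retry here
                }
            }
        }
    }
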
I had a positive experience with an implementation along these lines with the previous version (3.0.5c). My understanding, though, is that this is more of a hack because the TCP SPI implementation will still have all of its socket-based machinery running redundantly. The DRY principle is clearly broken here. So this approach is probably better suited as a prototype to see how much you like ZK-based discovery.

The right way

As they say, no pain no GridGain ;) And apparently the only correct way is to create a brand new Discovery SPI implementation. The good news is that there are at least two detailed examples: the JMS-based implementation and the Coherence-based one. The latter seems more relevant because Coherence is certainly more similar to GridGain than JMS is.

The bad news is that the Discovery SPI itself is not as trivial as one might think. You will need to keep in mind a few design details pertaining to the way the SPI is used by the rest of the system. Even when you think some of them might be optional, you will likely discover (ha-ha) that other parts of the system actually rely on them.

The notion of a GridGain node includes the following elements (a rough sketch of the corresponding state follows the list):
  • a UUID used to uniquely identify a node (so you'll need to map them onto whatever you put into ZK)
  • one or more IP addresses associated with the node (certainly to be held in ZK)
  • a set of node attributes created when the GridGain instance starts and given to the SPI by the kernel (another thing to put into ZK). Attributes are frequently used by SPI implementations to exchange configuration information via a generic mechanism
  • an implicit set of periodically refreshed node metrics. Here comes the pain: they change, so a chunk of data will need to be written to ZK periodically (triggering a bunch of updates in the ZK servers and the other GridGain instances)
  • a discovery listener to be notified about nodes that joined or left, for internal event-based bookkeeping
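To make the scope concrete, here is a rough illustration of the per-node state that would have to live in ZK. The class and field names are my own invention, not GridGain API; the discovery listener stays local and is not part of the stored state:

    import java.io.Serializable;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    /**
     * Hypothetical snapshot of what a Discovery SPI would have to publish
     * to ZK for a single node. Attributes are written once at startup;
     * metrics change periodically and force repeated writes.
     */
    public class NodeState implements Serializable {
        private final UUID nodeId;                    // unique node identifier
        private final List<String> addresses;         // "hostname:port" entries
        private final Map<String, String> attributes; // static, set at startup
        private volatile byte[] metrics;              // refreshed periodically

        public NodeState(UUID nodeId, List<String> addresses,
                         Map<String, String> attributes) {
            this.nodeId = nodeId;
            this.addresses = addresses;
            this.attributes = attributes;
        }

        /** The one mutable piece: each refresh means another write to ZK. */
        public void updateMetrics(byte[] snapshot) {
            this.metrics = snapshot;
        }
    }
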
In comparison to the previous approach, the node representation in ZK is not only larger but also more dynamic. So a more promising ZK node structure could be borrowed from Norbert (a bootstrap sketch follows the list):

  • Create two persistent children of the base znode called "members" and "available"
  • The "members" node will have an ephemeral child node for each GridGain instance with the node state (attributes, metrics, probably UUID) kept as data 
  • The "available" node will have an ephemeral node for each IP address of all currently available instances (e.g. assuming the same "hostname:port"  name format). Each "available" node would refer to the corresponding "members" node (to simplify support for multiple IP adresses for the same instance).
  • The SPI would listen for changes to "available" nodes to detect joined and left nodes and to "members" nodes to detect updated metrics
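A minimal sketch of bootstrapping and populating this layout with the plain ZooKeeper client could look as follows. The base path and the way "available" entries point back to their member are assumptions of mine, not something GridGain or Norbert prescribes:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class NorbertStyleLayout {
        private static final String BASE = "/gridgain"; // assumed, unique per cluster
        private final ZooKeeper zk;

        public NorbertStyleLayout(ZooKeeper zk) {
            this.zk = zk;
        }

        /** Create the persistent skeleton, tolerating creation races. */
        public void ensureSkeleton() throws KeeperException, InterruptedException {
            for (String path : new String[] {BASE, BASE + "/members", BASE + "/available"}) {
                try {
                    zk.create(path, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                } catch (KeeperException.NodeExistsException ignored) {
                    // another instance got there first - that's fine
                }
            }
        }

        /** Register a node: full state under "members", one entry per address under "available". */
        public void register(String nodeId, byte[] state, Iterable<String> addresses)
            throws KeeperException, InterruptedException {
            zk.create(BASE + "/members/" + nodeId, state,
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            for (String addr : addresses) { // e.g. "hostname:port"
                // keep the member id as data so an address can be traced back to its node
                zk.create(BASE + "/available/" + addr, nodeId.getBytes(),
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            }
        }
    }
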
The middle way

I am wary of frequent updates written to ZK. So another option would be to keep only URLs in ZK, as in the first approach, and come up with a metrics exchange protocol of your own, with manually created sockets or even Netty (a toy sketch follows). My inner child thinks it could be fun. My inner engineer thinks there has been enough reimplementing of the same wheel over and over again. So this middle way would trade fewer ZK updates for a lot of coding and a very real potential for subtle bugs.
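For the sake of argument, the hand-rolled part could start as small as a periodic push of a serialized metrics blob over a plain socket. Everything below is a toy illustration of the idea, not a protocol design:

    import java.io.DataOutputStream;
    import java.net.Socket;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** Toy metrics pusher: periodically sends a length-prefixed blob to one peer. */
    public class MetricsPusher {
        private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

        public void start(final String host, final int port) {
            timer.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    try {
                        Socket socket = new Socket(host, port);
                        try {
                            DataOutputStream out =
                                new DataOutputStream(socket.getOutputStream());
                            byte[] metrics = snapshotMetrics();
                            out.writeInt(metrics.length); // simple length-prefixed frame
                            out.write(metrics);
                            out.flush();
                        } finally {
                            socket.close();
                        }
                    } catch (Exception e) {
                        // a real protocol needs reconnects, versioning, backpressure...
                    }
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

        private byte[] snapshotMetrics() {
            return new byte[0]; // placeholder: serialize real node metrics here
        }
    }

Even this toy version hints at where the subtle bugs would come from: framing, reconnects, and versioning are all left as an exercise.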
