Dec 19, 2010

Apache Avro serialization without code generation and IDL

It is not a secret that real programmers just love writing their own serialization frameworks. Everyone from JBoss to GOOG has one. So it's only natural for the Hadoop stack guys to have one of their own :) What makes Avro different from Protocol Buffers and Thrift, though, is that one does not need an IDL and code generation to send data streams between the nodes of a distributed system. An Avro data stream always carries its JSON-formatted schema with it and so can be deserialized by a client without any prior knowledge of what could be inside.

While experimenting with this feature I did not find good examples that avoid the IDL and code generation, so I came up with my own. What I have in this tiny test project is a couple of builders for the Avro Schema and GenericRecord classes and the corresponding unit tests. The tests demonstrate writing to an output stream and reading the data back.
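
The builders are thin wrappers over the stock Avro API. The classes in the actual project may differ in details, but a minimal sketch of what they could look like is:

import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

// Collects typed fields and produces a record schema with the given name.
class AvroSchemaBuilder {
    private final List<Schema.Field> fields;

    AvroSchemaBuilder(int expectedFieldCount) {
        this.fields = new ArrayList<Schema.Field>(expectedFieldCount);
    }

    AvroSchemaBuilder string(String name) {
        return add(name, Schema.create(Schema.Type.STRING));
    }

    AvroSchemaBuilder int32(String name) {
        return add(name, Schema.create(Schema.Type.INT));
    }

    AvroSchemaBuilder int64(String name) {
        return add(name, Schema.create(Schema.Type.LONG));
    }

    private AvroSchemaBuilder add(String name, Schema type) {
        fields.add(new Schema.Field(name, type, null, null));
        return this;
    }

    Schema build(String recordName) {
        final Schema schema = Schema.createRecord(recordName, null, null, false);
        schema.setFields(fields);
        return schema;
    }
}

// Fills a GenericRecord field by field in schema order and resets itself after build().
class AvroRecordBuilder {
    private final Schema schema;
    private GenericData.Record record;
    private int nextField;

    AvroRecordBuilder(Schema schema) {
        this.schema = schema;
        reset();
    }

    AvroRecordBuilder field(Object value) {
        record.put(nextField++, value);
        return this;
    }

    GenericRecord build() {
        final GenericRecord result = record;
        reset();
        return result;
    }

    private void reset() {
        record = new GenericData.Record(schema);
        nextField = 0;
    }
}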

In the most basic case, an entire stream/file represents a serialized sequence of records of the same schema. All one needs to write data to an output stream is a schema and data records created with that schema. The very same generic record type is used to read data from an input stream. In the test below we just assume a certain schema, but in production code one would need to check the schema type of each field (a generic way to do this is sketched after the generated schema below). Also, note the idiomatic Avro approach of reusing the same record instance to read the whole data set.

@Test
public void testUniformSchemaSerialization() throws Exception {
    // A schema with two fields: a string and an int (CITY and POPULATION in the output below).
    final Schema schema = new AvroSchemaBuilder(2).string(COLUMN1).int32(COLUMN2).build("DATA");

    // Write two records of that schema to an in-memory stream.
    final ByteArrayOutputStream out = new ByteArrayOutputStream();
    final DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema)).create(schema, out);

    final AvroRecordBuilder builder = new AvroRecordBuilder(schema);
    writer.append(builder.field(CITY1).field(POPULATION1).build());
    writer.append(builder.field(CITY2).field(POPULATION2).build());
    writer.close();

    // Read the records back; the reader takes the schema from the stream itself.
    final DataFileStream<GenericRecord> reader =
            new DataFileStream<GenericRecord>(new ByteArrayInputStream(out.toByteArray()), new GenericDatumReader<GenericRecord>());
    GenericRecord deserialized = null;

    assertTrue(reader.hasNext());
    deserialized = reader.next(deserialized);
    assertEquals(new Utf8(CITY1), deserialized.get(COLUMN1));
    assertEquals(POPULATION1, deserialized.get(COLUMN2));

    assertTrue(reader.hasNext());
    deserialized = reader.next(deserialized);
    assertEquals(new Utf8(CITY2), deserialized.get(COLUMN1));
    assertEquals(POPULATION2, deserialized.get(COLUMN2));

    assertFalse(reader.hasNext());
}

The generated schema looks like this:

{
  "type" : "record",
  "name" : "DATA",
  "fields" : [ {
    "name" : "CITY",
    "type" : "string"
  }, {
    "name" : "POPULATION",
    "type" : "int"
  } ]
}
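
As noted above, production code should not hard-code the field names and types; a minimal sketch of inspecting the reader's schema generically (reusing the deserialized record from the test) could look like this:

// Walk the record's own schema instead of assuming what is in it.
for (Schema.Field field : deserialized.getSchema().getFields()) {
    final Object value = deserialized.get(field.name());
    switch (field.schema().getType()) {
        case STRING:
            // Avro hands strings back as org.apache.avro.util.Utf8
            System.out.println(field.name() + " = " + value.toString());
            break;
        case INT:
            System.out.println(field.name() + " = " + value);
            break;
        default:
            throw new IllegalStateException("Unexpected field type: " + field.schema().getType());
    }
}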

In real life the need to support headers (e.g. with version-like fields) and/or footers (e.g. with checksum-like fields) is also likely to arise. In contrast to the actual data tuples, the header and footer formats are likely to be the same in all streams. There is one obvious way to implement this with the Avro union type: we declare the output schema to be a union of three possible record schemas, and the header and footer schemas will each appear in only one record per stream.

final String HEADER_FIELD1 = "VERSION";
final String FOOTER_FIELD1 = "CHKSUM";
final String HEADER_SCHEMA = "HEADER";
final String BODY_SCHEMA = "DATA";
final String FOOTER_SCHEMA = "FOOTER";
final int version = 2010;
final long total = 2;

// The top-level schema is a union of three record schemas: HEADER, DATA and FOOTER.
final Schema schema = Schema.createUnion(
        Lists.<Schema>newArrayList(
                new AvroSchemaBuilder(1).int32(HEADER_FIELD1).build(HEADER_SCHEMA),
                new AvroSchemaBuilder(2).string(COLUMN1).int32(COLUMN2).build(BODY_SCHEMA),
                new AvroSchemaBuilder(1).int64(FOOTER_FIELD1).build(FOOTER_SCHEMA)));

final ByteArrayOutputStream out = new ByteArrayOutputStream();
final DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema)).create(schema, out);

// One header record, two data records, one footer record.
final AvroRecordBuilder headerBuilder = new AvroRecordBuilder(schema.getTypes().get(0));
writer.append(headerBuilder.field(version).build());

final AvroRecordBuilder bodyBuilder = new AvroRecordBuilder(schema.getTypes().get(1));
writer.append(bodyBuilder.field(CITY1).field(POPULATION1).build());
writer.append(bodyBuilder.field(CITY2).field(POPULATION2).build());

final AvroRecordBuilder footerBuilder = new AvroRecordBuilder(schema.getTypes().get(2));
writer.append(footerBuilder.field(total).build());

writer.close();

To read it back we again assume we know what is in the input stream. In a real application one would need to compare schema names against a well-known/hard-coded list of alternatives (in this test, HEADER, DATA and FOOTER) to understand how to interpret a record (a sketch follows the reading code below). It would also be a good idea in production code to check that the first record is always of type HEADER and the last of type FOOTER.

final DataFileStream<GenericRecord> reader =
        new DataFileStream<GenericRecord>(new ByteArrayInputStream(out.toByteArray()), new GenericDatumReader<GenericRecord>());
GenericRecord deserialized = null;

assertTrue(reader.hasNext());
deserialized = reader.next(deserialized);
assertEquals(HEADER_SCHEMA, deserialized.getSchema().getName());
assertEquals(version, deserialized.get(HEADER_FIELD1));

assertTrue(reader.hasNext());
deserialized = reader.next(deserialized);
assertEquals(BODY_SCHEMA, deserialized.getSchema().getName());
assertEquals(new Utf8(CITY1), deserialized.get(COLUMN1));
assertEquals(POPULATION1, deserialized.get(COLUMN2));

assertTrue(reader.hasNext());
deserialized = reader.next(deserialized);
assertEquals(BODY_SCHEMA, deserialized.getSchema().getName());
assertEquals(new Utf8(CITY2), deserialized.get(COLUMN1));
assertEquals(POPULATION2, deserialized.get(COLUMN2));

assertTrue(reader.hasNext());
deserialized = reader.next(deserialized);
assertEquals(FOOTER_SCHEMA, deserialized.getSchema().getName());
assertEquals(total, deserialized.get(FOOTER_FIELD1));

assertFalse(reader.hasNext());
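
In production code the reading loop would dispatch on the schema name instead of relying on a fixed record order; a rough sketch:

// Decide how to interpret each record by the name of its schema.
final String recordType = deserialized.getSchema().getName();
if (HEADER_SCHEMA.equals(recordType)) {
    // e.g. verify the version before processing the rest of the stream
} else if (BODY_SCHEMA.equals(recordType)) {
    // handle a regular data record
} else if (FOOTER_SCHEMA.equals(recordType)) {
    // e.g. compare the checksum/total against what has been read so far
} else {
    throw new IllegalStateException("Unknown record type: " + recordType);
}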

In this case the generated schema looks slightly more interesting:

[ {
  "type" : "record",
  "name" : "HEADER",
  "fields" : [ {
    "name" : "VERSION",
    "type" : "int"
  } ]
}, {
  "type" : "record",
  "name" : "DATA",
  "fields" : [ {
    "name" : "CITY",
    "type" : "string"
  }, {
    "name" : "POPULATION",
    "type" : "int"
  } ]
}, {
  "type" : "record",
  "name" : "FOOTER",
  "fields" : [ {
    "name" : "CHKSUM",
    "type" : "long"
  } ]
} ]

Sep 3, 2010

Lies, damned lies, and bare metal

A former colleague wrote about something I occasionally think about. In my view the post touches on at least three areas that could result in a holy war ("The need for bare metal programming skills", "The need for strong concurrent programming skills", "The right approach to concurrent programming"). What fascinates me is that I am not aware of a coherent narrative for at least some of them.

To begin with, the notion of bare metal is a blurred and moving target. From Cliff Click's detailed discussions to Joshua Bloch's higher-level recommendations, there seems to be a widespread belief that contemporary hardware is virtually incomprehensible to most programmers. There are too many moving parts and indirection levels to have a mental model good enough for reliable predictions. And it's not just about high-level Java-like platforms.

As the history of computing demonstrates, hardware progress enables programming in terms of higher-level abstractions, and that pushes the border with bare metal higher with time. It's true for programming languages (consider how VM-based or interpreted/dynamic languages became practical in the last ten years) and for particular sub-fields such as concurrent programming (think about the recent surge of interest in Actors and STM).

Concurrent programming is a similarly gray area, also because of the strong influence of hardware. On the one hand, shared memory and actors are just models and can be used to solve the same problems. On the other, software runs on real hardware, and commodity hardware nowadays means you start from the XCHG instruction and layer levels of abstraction on it. Be it Clojure or Scala, you are looking at java.util.concurrent (and, ultimately, the very same CPU-level support for CAS) in disguise.
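
To make that concrete, virtually every lock-free construct in this stack bottoms out in a compare-and-swap retry loop of roughly this shape (a textbook sketch of what AtomicInteger.incrementAndGet() does, not any particular library's code):

import java.util.concurrent.atomic.AtomicInteger;

// Read, compute the new value, attempt to swap it in, retry if another thread won the race.
static int increment(AtomicInteger counter) {
    for (;;) {
        final int current = counter.get();
        final int next = current + 1;
        if (counter.compareAndSet(current, next)) {
            return next;
        }
    }
}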

So with time some indirection levels become so low-level that only very few people have the time/need/desire to look at them. Today java.util.concurrent is the cornerstone, and very few people bother even to look inside (for example, compare the number of people intimately familiar with JCiP with those fluent in AoMP). In a few years it might as well be Actors, if not STM/HTM. And then knowledge of j.u.c will count as low-level black magic. It's not easy to see when such a transition happens and adjust one's definition of "bare" or "low-level".

In addition, there are entire domains such as big data. People in the map-reduce land think in terms of [dozens of] servers, not threads. Actually, even in mainstream software not many are lucky enough to work with j.u.c-level abstractions, and some explicitly prefer even higher-level ones.

But in general, my experience confirms that serious usage of any technology implies a comprehensive understanding of its design and some of its implementation details. As an example, you do not need to know about the GC to program in Java, but if you actually don't, you probably work on something trivial.

For complicated technologies this can be hard and take time, so one is necessarily limited and cannot be familiar with everything even within a particular language universe (just think about Java - from J2ME to Hadoop, "and still counting"). At least in this day and age most software comes with source code, so you can always dig deeper to learn the details.

Jun 15, 2010

Summer reading on actors

For such a popular topic, I do not see enough design discussions around Actors online. It can be difficult to learn about idioms and pitfalls without going deep into OTP or Akka. Off the top of my head I can think of only two interesting blogs:
  • James Iry wrote a fascinating series on why Actors are not a silver bullet for all things concurrent. He is such a polished author that you should read his posts even if you don't care much about Scala/actors.
  • Somebody recently started a very informative blog with detailed examples of actor-based FP-style designs.

May 12, 2010

Object block initializer


An object block initializer (instance initializer block) in Java is a code block in braces executed before the constructor body. I always thought it was a bad idea because by default I expect all initialization code in a constructor. In addition, I have never seen such initializers used anywhere except for Java textbooks and interview questions.

Yesterday, for the first time, I read about a potentially useful way to use them. The idea is to save some typing (compare the creation of b1 and b2):
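
The example from that post is not reproduced here; a sketch of the idea with a hypothetical list of strings looks roughly like this:

import java.util.ArrayList;
import java.util.List;

// b1: the usual way - create the collection, then populate it.
final List<String> b1 = new ArrayList<String>();
b1.add("foo");
b1.add("bar");

// b2: an anonymous subclass of ArrayList whose instance initializer block populates it
// (the so-called "double brace" trick). It saves a few lines at the cost of creating
// an extra anonymous class.
final List<String> b2 = new ArrayList<String>() {{
    add("foo");
    add("bar");
}};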


May 5, 2010

Thunk

I first learned about this approach to lazy initialization a few months ago, but every time I look at its Java implementation I think it's a really cute piece of code. I wonder why my book on Singletons did not mention it :)
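
The code itself is not reproduced in this post; one common Java rendering of a thunk - a value computed on first access and cached afterwards - goes roughly like this (a sketch, the actual implementation may differ in details):

// A lazily evaluated, memoized value. This version is thread-safe via plain
// synchronization; the cuter variants achieve the same with a nested holder class
// or a self-replacing reference.
abstract class Thunk<T> {
    private T value;
    private boolean computed;

    // Subclasses supply the expensive computation.
    protected abstract T compute();

    public synchronized T get() {
        if (!computed) {
            value = compute();
            computed = true;
        }
        return value;
    }
}

// Usage: the array is allocated only on the first call to get().
final Thunk<int[]> lazyTable = new Thunk<int[]>() {
    protected int[] compute() {
        return new int[1024 * 1024];
    }
};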

Feb 15, 2010

Sizing up Java employers

While trying to understand what kind of company I would like to work for next, I made another observation about employers in the Java land. At first glance they fall into three categories:
  • Typical Java shops doing something of little interest from the engineering point of view. Their job descriptions usually mention Spring and Hibernate. Frequently they also ask for either JEE or presentation-tier technologies a la JSP/Struts (IMO the latter is a particularly strong indicator that you do not want to work for them).
  • Companies working on something more challenging or in more exciting domains. For me a good indicator is that they ask for more advanced/less known technologies such as Lucene, Hadoop or one of the numerous DHT implementations.
  • One-of-a-kind companies, GOOG-style. They do not list any particular technologies except for Java itself. They are explicitly after what they refer to as "bright people".

Jan 6, 2010

Meaningful approach to interviewing?

My current company is still looking for a Sr. Java developer in SF, and I am usually one of the first-round interviewers. Naturally, this makes me ponder the optimal approach to interviewing. Recently I started noticing that on more and more topics my real answer is "it depends", and that it frequently requires agreeing on some conceptual model first. I do not believe in a universal best answer to an important question, but I also recognize that in real life any progress is possible only when one actually chooses and pursues one of the possible answers.

Interviewing is not any different. In my life I have witnessed at least five distinctly different interview types:
  • No real questions - you chat with a few members of the team but they do not test you. You tell them about your background, ask about their system and answer a couple of trivial technical questions. Strangely enough, I worked almost five years for two companies that hired me in this style. Although I have heard that some people consider it a red flag (as in "Do you really think they will have sufficiently bright people with such poor screening?"), one of those two was the best company in my eleven years. And yes, it happens in the US too.
  • Outsourcing-style - an endless stream of extremely low-level technical questions on particular technologies and APIs: RDBMS isolation levels, the servlet lifecycle, the methods on EJB home/remote interfaces, JSP scopes. A design pattern for them is always one from the GoF book or the Sun J2EE patterns guidelines. Characteristically, they do not care about your design skills or code quality and really like the J2EE style of doing things.
  • CS undergraduate-style - in my opinion a question such as "implement a RW-lock on the whiteboard" is a good example. Sure enough, real-life locks have nothing to do with whatever you will come up with in 15 minutes and, unless they prepared specifically, quite a few people will probably not be able to remember enough textbook material.
  • Common sense-style - reverse a list. Suggest a design for a logging system. Implement a Map with additional non-standard requirements. You can answer these on the basis of real-life experience or with some thinking of the kind you do day by day at work.
  • GOOG-style (actually, I read it was MSFT who originated it) - CS undergraduate-style on steroids with a puzzle or two thrown in. Personally, I was not happy during some of those interviews, especially the ones I knew I was blowing badly :) Incidentally, I also learned a lot from them. But when I have only an hour with a candidate I am not quite sure which approach to follow.
From what I see, a typical Java-centered company is either a J2EE enterprise or a place where they care about util.concurrent. Amusingly, this seems to be a very clear dichotomy. In both cases there is a pile of [de facto] standard APIs and frameworks that take considerable time to master. And people who gravitate to one of the two types tend to be rather apprehensive of, if not ignorant about, the other. I am also inclined to think that very little of even basic algorithms & data structures material is used in those companies. At least not deeper than TreeMap vs. HashMap and such. So asking more theoretical questions seems almost unfair (and not justified by actual day-to-day development needs).

As an example, today I interviewed another "Sr. Java" engineer. He did not know exactly what the volatile keyword meant (wtf? after all, it's a keyword and not some obscure optional library). He was immediately confused by "try { foo(); return 1; } catch (Exception e) { return 2; } finally { return 3; }". And so on. He listed Hibernate experience in his resume, but he actually meant "experience in calling Hibernate via JPA". Obviously he did not know about Hibernate's caching capabilities or the mapping of OO models onto relational schemas, although it takes one book and a weekend to learn about them. Frankly, I cannot imagine how a senior-level engineer could be so ignorant. After all, you learn by osmosis at work and glean questions from interviews throughout your career.
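
For reference, here is that snippet expanded into a compilable method. A return in the finally block always wins, so the method returns 3 no matter what foo() does, and such a return would even discard an exception that the catch block does not handle:

// Always returns 3: the return in finally supersedes the results of try and catch.
static int tricky() {
    try {
        foo();           // any method that may or may not throw
        return 1;
    } catch (Exception e) {
        return 2;
    } finally {
        return 3;        // overrides the earlier returns and any pending exception
    }
}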

But what's the point of discussing, say, software architecture with people who fail simple questions relevant to daily coding? In a language universally regarded as outdated and extremely simple, no less? Someone knowledgeable about actual CS must be pretty comfortable with basic hashCode/equals/HashMap stuff. Could I be gratuitously harsh on candidates because of some misplaced self-righteousness? Or am I right that people who are too lazy to read half a dozen books before an interview do not deserve much? Or is my current company so unappealing that only losers apply?