Simple Lego Blocks for Big Data

Data engineers should abstract their code in the most lightweight way possible to facilitate downstream integration in a large-scale data system.

You want lego blocks, not puzzle pieces.

The creators of the C programming language once famously said, “first make it work, then make it right, and, finally, make it fast.” This adage still applies today.

The difference is that we now have tools to take working code and validate that it is right against reams of data. Many of these tools can also make the working, right code run really fast across a cluster of machines, possibly even in real-time, as the data comes in.

But, making code work, then right, then fast, requires some discipline.

I’d suggest that to use the “code execution infrastructure” modern projects have available (e.g. processes/threads, Hadoop, Spark, Storm), you need to intentionally decouple your code from that infrastructure.

Lambda Architecture, in a nutshell

“Lambda Architecture” is a fanciful term coined recently to describe a simple idea. All derived calculations in a large data system can be expressed as a re-computation function over all of your data.

This is the way you should think about your work. Have a clever text extraction algorithm? You want to express it as a re-computation function over individual raw HTML documents (strings). Have a clever image extraction algorithm? You want to express it as a re-computation function over the sequence of individual raw images (bytes).
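As a minimal sketch of what such a re-computation function might look like in Python: the names extract_title and recompute are made up for illustration, and pulling out the page title stands in for whatever your “clever text extraction algorithm” really does. Only the standard library is assumed.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(raw_html):
    """Hypothetical extraction step: one raw HTML string in, one string out."""
    parser = _TitleParser()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.title_parts).strip()


def recompute(raw_documents):
    """Re-derive every result from scratch by mapping over all raw inputs."""
    return [extract_title(doc) for doc in raw_documents]
```

Because recompute only needs an iterable of strings, it does not care whether those strings came from a file, a crawl, or a queue.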

Coupling

Though it’d be nice if your re-computation functions were machine-efficient, that’s not nearly as important as making sure they are decoupled from other systems. We can solve the efficiency problem with hardware (which is cheap), but the only way to solve the decoupling problem is with programmer time (which is expensive).

So, your functions should not include code about:

databases
data file formats
caching
threads
processes
clusters
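One way to picture that separation, as a rough sketch: the core function below knows nothing about storage or caching, and both concerns are layered on from the outside. The function names, the CSV layout, and the "referrer" column are assumptions made up for this example.

```python
import csv
import io
from functools import lru_cache
from urllib.parse import urlsplit


def referrer_domain(url):
    """Core logic: a plain string in, a plain string out."""
    return urlsplit(url).netloc.lower()


# Caching is added from the outside; the core function is untouched.
cached_referrer_domain = lru_cache(maxsize=10_000)(referrer_domain)


def domains_from_csv(csv_file, column="referrer"):
    """Adapter: the only code that knows the data happens to be stored as CSV."""
    return [cached_referrer_domain(row[column]) for row in csv.DictReader(csv_file)]


# An in-memory file stands in for a real one here.
sample = io.StringIO("referrer\nhttp://Example.com/a\nhttps://news.example.org/b?x=1\n")
domains = domains_from_csv(sample)
```

If the file format or the caching strategy changes later, only the adapter changes; referrer_domain stays exactly as it is.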

Functions

If you write code in Python, ideally, you will expose your code as a simple, plain function. (See also: “Stop Writing Classes”.) Dependencies on nothing but the standard library, ideally! But sure, you could also use the odd Python library, like NumPy. Definitely no dependencies on Spark, Hadoop, Storm, or other “execution infrastructure”.

If you are writing code in Java (or Scala, or Clojure), aim for a simple, static function. Always prefer functions, despite Java’s object-orientation. Java 8 has finally recognized the value of functions, so don’t think you’re going against the language designers any longer by using them.

Simple inputs

Functions should operate on simple inputs and outputs, ideally all of them JSON-serializable. Though not every execution framework uses JSON, the idea of JSON serializability is a good rule of thumb or lowest common denominator. In other words, think data values (maps, lists, integers, strings), not data structures or object instances. The more complex your data is to serialize and deserialize, the more difficult it will be to parallelize your algorithm.
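A quick way to check the rule of thumb is to round-trip a value through JSON and see whether it comes back unchanged. The sketch below assumes a hypothetical by_hour step and a made-up pageview record; survives_json is just a helper for this illustration.

```python
import json


def by_hour(pageview):
    """Hypothetical step: takes a plain dict, returns a plain dict."""
    # The timestamp stays an ISO 8601 string rather than a datetime object,
    # so the value can be serialized without any custom encoder.
    hour = pageview["ts"][:13]  # keeps the date and the hour
    return {"url": pageview["url"], "hour": hour}


def survives_json(value):
    """True if a value is unchanged after being serialized and parsed again."""
    return json.loads(json.dumps(value)) == value


record = by_hour({"url": "/a", "ts": "2014-06-01T09:15:00"})
```

Here survives_json(record) holds, while a value containing a tuple, such as {"t": (1, 2)}, does not survive, because the tuple comes back as a list.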

If your functions require small static data files, e.g. for learning models you produced ahead of time, then you should consider exposing your lego block as a very simple class whose constructor loads and initializes the relevant data structures, and that exposes a single method to execute the computation itself. But do this sparingly. In other words, even when using classes, keep it simple!
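Such a class might look roughly like this. LanguageGuesser, its JSON model format (language code mapped to a list of common words), and the scoring rule are all invented for illustration; the point is only the shape: load once in the constructor, compute in one method.

```python
import io
import json


class LanguageGuesser:
    """Hypothetical lego block that needs a small, precomputed model file."""

    def __init__(self, model_file):
        # Load the static data once, at construction time.
        # Assumed format: JSON mapping language code -> list of common words.
        self._common_words = {
            lang: set(words) for lang, words in json.load(model_file).items()
        }

    def guess(self, text):
        """The single computation method: plain string in, plain string out."""
        tokens = set(text.lower().split())
        scores = {
            lang: len(tokens & words) for lang, words in self._common_words.items()
        }
        return max(scores, key=scores.get)


# An in-memory file stands in for the real model file.
model = io.StringIO('{"en": ["the", "and"], "es": ["el", "y"]}')
guesser = LanguageGuesser(model)
```

The constructor takes an open file rather than a path, so the class stays ignorant of where the model lives.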

Functional data pipelines

Some algorithms will be more like a data pipeline. In these cases, it should be possible to construct the entire pipeline in memory using chained function calls. If you have a multi-stage analytics processing pipeline where the input is a sequence of raw events and the output is a JSON record format ready to be inserted into a data store, you could have a call sequence that is something like this in Python:

persist(in_5min_windows(with_metadata(by_referrer(raw_events))))

Or, in a language like Clojure, this could be expressed like a pipeline over the input:

(-> raw_events by_referrer with_metadata in_5min_windows persist)

This looks a lot like a UNIX pipeline:

cat raw_events | by_referrer | with_metadata | in_5min_windows | persist

If you’re thinking simple functions and simple data pipelines, you’re thinking in the right direction. This is possible in every language, and will facilitate integration.
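To make the chained call concrete, here is one possible set of stage implementations. The stage names come from the example above, but their bodies are guesses: the event fields (ts as epoch seconds, referrer as a URL), the meaning of each stage, and the list used in place of a real data store are all assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def by_referrer(raw_events):
    """Keep only events that carry a referrer (assumed meaning of this stage)."""
    return (e for e in raw_events if e.get("referrer"))


def with_metadata(events):
    """Attach a derived field; here, the referrer's domain."""
    for e in events:
        yield {**e, "referrer_domain": urlsplit(e["referrer"]).netloc}


def in_5min_windows(events):
    """Count events per (window start, referrer domain)."""
    counts = Counter()
    for e in events:
        window_start = e["ts"] - e["ts"] % 300  # 300 seconds = 5 minutes
        counts[(window_start, e["referrer_domain"])] += 1
    return [
        {"window_start": w, "referrer_domain": d, "count": n}
        for (w, d), n in sorted(counts.items())
    ]


def persist(records, store=None):
    """Stand-in for a data store write: appends to a list and returns it."""
    store = [] if store is None else store
    store.extend(records)
    return store


raw_events = [
    {"ts": 0, "referrer": "http://a.com/x"},
    {"ts": 299, "referrer": "http://a.com/y"},
    {"ts": 300, "referrer": "http://b.org/"},
    {"ts": 10, "referrer": None},
]
saved = persist(in_5min_windows(with_metadata(by_referrer(raw_events))))
```

Every stage takes and returns plain values, so the whole pipeline runs in memory with no infrastructure at all.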

If you have your core logic defined in this basic way, then we can map those functions onto various execution paradigms.

Take Python as an example. To run your functions in Pig, we can create a streaming UDF (user-defined function). For Spark, we can use PySpark’s built-in interoperation facilities. For Storm, we can use streamparse and embed your functions inside Bolt classes.
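The common thread in these integrations is a thin wrapper around an untouched function. As a generic sketch only (this is not the Pig, PySpark, or streamparse API, and word_count and stream are hypothetical names), a line-oriented wrapper that reads JSON records and writes JSON results could look like this:

```python
import io
import json


def word_count(record):
    """Core logic (hypothetical): no framework imports anywhere in sight."""
    return {"id": record["id"], "words": len(record["text"].split())}


def stream(func, infile, outfile):
    """Read one JSON record per line, apply func, write one JSON line per result."""
    for line in infile:
        if line.strip():
            outfile.write(json.dumps(func(json.loads(line))) + "\n")


# In a real wrapper script you would pass sys.stdin and sys.stdout;
# in-memory files keep this sketch easy to try out.
source = io.StringIO('{"id": 1, "text": "hello big data"}\n{"id": 2, "text": ""}\n')
sink = io.StringIO()
stream(word_count, source, sink)
```

Each framework needs its own version of stream, but word_count never has to change.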

Keeping things simple will ensure we can run your logic across all historical data, and also against the live stream of data in the real production system. It will also make your life easier.

Rest and motion

In data processing circles, we draw a distinction between data-at-rest and data-in-motion. The best way to think about this notion is to ask the question, “Does my function operate against a data file, or a data stream?”

Most well-engineered data streams are automatically archived as data files so that ad-hoc analysis can happen after the fact. Stream processing performance is usually about per-tuple latency, whereas file processing performance is usually about total re-computation runtime. Well-designed lego blocks will be able to operate in both of these modes, real-time and batch.
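Here is a small sketch of one function serving both modes. The score function and the event fields are invented; the unbounded generator stands in for a live stream, and the list stands in for an archived file.

```python
from itertools import count, islice


def score(event):
    """Hypothetical per-event computation shared by both modes."""
    return {"id": event["id"], "score": event["value"] * 2}


def process_batch(archived_events):
    """Data-at-rest: recompute everything and return the full result."""
    return [score(e) for e in archived_events]


def process_stream(live_events):
    """Data-in-motion: yield each result as soon as its event arrives."""
    for event in live_events:
        yield score(event)


# A finite archive and an unbounded stream, both fed to the same function.
archive = [{"id": i, "value": i} for i in range(3)]
live = ({"id": i, "value": i} for i in count())

batch_results = process_batch(archive)
first_three_live = list(islice(process_stream(live), 3))
```

The batch path is judged by how long the whole list takes; the streaming path is judged by how quickly each single result comes out. The score function is identical in both.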

Batch analysis of data-at-rest

Once you have your functions that do interesting algorithmic work, you may want to try them out against large data sets that have been saved and cached offline in a data store like S3. If this is the case, then you’ll want to use one of these options:

single-node: simple parallel processing that happens on your machine, using several machine cores; in Python this can happen with a library like joblib
multi-node: more complex cluster analysis, using many cores across many machines; in Python, a good choice is a framework like PySpark
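For the single-node case, here is a sketch that uses the standard library’s concurrent.futures as a stand-in for joblib, so it runs without extra installs. token_count and run_parallel are hypothetical names; the executor is a parameter so the same function can run on a process pool or a thread pool.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def token_count(text):
    """Hypothetical lego block: module-level, so worker processes can import it."""
    return len(text.split())


def run_parallel(func, items, executor_cls=ProcessPoolExecutor, workers=4):
    """Map func over items using a pool; results keep the input order."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(func, items))


# A thread pool has the same interface and is handy for a quick check.
counts = run_parallel(token_count, ["a b", "c", ""], executor_cls=ThreadPoolExecutor)
```

When you use the default process pool in a script, call run_parallel from under an if __name__ == "__main__": guard, which some platforms require for multiprocessing. Note that this only works because token_count takes and returns plain values that are cheap to ship between processes.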

Key point: batch analysis should be viewed as code running over all our data. It should not make the single-machine assumption: with too much data and too little time, you’re gonna need a bigger boat.

Keep the lambdas nice and clean, and you’ll be able to run in parallel across nodes. Ideally, the output of your analysis will be a simple data structure or static file that can be saved just as easily as it was computed.

Streaming analysis of data-in-motion

For analyzing data in motion and deploying the live, online analytical system, the best choices today are Apache Kafka and Storm.

New data comes in through Kafka. It gets processed by a Storm cluster via Spout classes that subscribe to those Kafka topics. Storm topologies have Bolt instances laid out in a way to execute the algorithms. Some of these Bolts are “terminal bolts” that end a live computation graph. They will output their results to Kafka or a distributed data store.

Storm communicates with components written in other languages via a lightweight JSON protocol. We can make Kafka communicate likewise in this lightweight way. We can mix components in one or more Storm topologies to lay out data processing pipelines that fit our needs for the running production system. Again, we will keep it simple!

It’s not the execution stack that matters

What matters is your code. The execution stack is about translating the various processing stages of your code into a Directed Acyclic Graph (DAG) of computation. This should be a straightforward exercise, like assembling lego pieces together.
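To show how little is needed once the stages are plain functions, here is a toy, in-process DAG runner built on the standard library’s graphlib (Python 3.9 or later). It is only an illustration of the idea, not how Storm or Spark schedule work; run_dag, the stage names, and the calling convention (upstream results passed as arguments in alphabetical order of stage name) are assumptions of this sketch.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_dag(stages, dependencies, source):
    """Run each stage after its upstream stages, passing results along.

    stages: stage name -> plain function
    dependencies: stage name -> set of upstream stage names
    A stage with no upstream stages receives the raw source instead.
    """
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = sorted(dependencies.get(name, ()))
        inputs = [results[u] for u in upstream] or [source]
        results[name] = stages[name](*inputs)
    return results


stages = {
    "tokenize": lambda text: text.split(),
    "count": lambda tokens: len(tokens),
    "unique": lambda tokens: sorted(set(tokens)),
    "report": lambda n, uniq: {"tokens": n, "unique": uniq},
}
dependencies = {
    "tokenize": set(),
    "count": {"tokenize"},
    "unique": {"tokenize"},
    "report": {"count", "unique"},
}
results = run_dag(stages, dependencies, "b a b")
```

The graph description is pure data and the stages are pure functions, which is what lets a real execution stack lay the same graph out across machines.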

However complex you make your core code, you will suffer a pain ten times worse when you try to make it embarrassingly parallel.

Don’t embarrass yourself: in large-scale data processing problems, simple parts are not only about code simplicity. They are also about solution viability. To have a chance to assemble your data castle, you’ll need to start with building blocks that actually fit together.