streamparse: Python + Apache Storm for real-time stream processing

Parse.ly released streamparse today, which lets you run Python code against real-time streams of data by integrating with Apache Storm.

We released it for our talk, “Real-time streams & logs with Apache Kafka and Storm” at PyData Silicon Valley 2014.

The initial release (0.0.5) includes a command-line tool, sparse, that can set up and run local Storm-friendly Python projects.

Running sparse quickstart creates a local Storm + Python project from a project template, using the streamparse framework. The basic example implements a simple word count against a stream of words. Changing into the project directory and running sparse run spins up a local Apache Storm cluster and executes your topology of Python code against it.
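To make the word count idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the streamparse API: word_spout, WordCountBolt, and run_topology are hypothetical names chosen for illustration, standing in for a spout that emits words and a bolt that counts them.

```python
from collections import Counter
from itertools import cycle, islice


def word_spout(words):
    """Stand-in for a spout: yields an endless stream of one-word tuples."""
    for word in cycle(words):
        yield (word,)


class WordCountBolt:
    """Stand-in for a bolt: keeps a running count per word."""

    def __init__(self):
        self.counts = Counter()

    def process(self, tup):
        word = tup[0]
        self.counts[word] += 1
        # Return the word with its updated count, as a bolt would emit it.
        return (word, self.counts[word])


def run_topology(words, limit):
    """Feed the first `limit` tuples from the spout through the bolt."""
    bolt = WordCountBolt()
    return [bolt.process(tup) for tup in islice(word_spout(words), limit)]


if __name__ == "__main__":
    for word, count in run_topology(["dog", "cat", "dog"], 6):
        print(word, count)
```

In a real Storm topology the spout and bolt run as separate processes, possibly many of each, and Storm routes tuples between them; this sketch only shows the data flow in a single process.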

In short: it’s never been easier to develop with Storm and Python, thanks to streamparse. In the coming weeks and months, we plan to bundle a lot more functionality which will make it easier and easier to use Python’s excellent data analysis stack atop real-time streams of data using Storm.

How it works

Under the hood, streamparse is a new implementation of Storm’s multi-lang protocol for Python. To run Storm topologies locally, it relies on the lein build tool, which is the library’s only local requirement. lein is used to resolve dependencies on Storm itself.
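For a rough sense of what a multi-lang implementation has to do: Storm talks to non-JVM components by exchanging JSON messages over standard input and output, with each message followed by a line containing only end. The sketch below shows just that framing. It is not streamparse’s code, and write_message and read_message are hypothetical helper names; the handshake, acks, and the rest of the protocol are left out.

```python
import io
import json


def write_message(stream, message):
    """Serialize one message as JSON, followed by a line containing only 'end'."""
    stream.write(json.dumps(message))
    stream.write("\nend\n")
    stream.flush()


def read_message(stream):
    """Read lines up to the 'end' delimiter and decode them as one JSON message."""
    lines = []
    while True:
        line = stream.readline()
        if not line:
            raise EOFError("stream closed before the 'end' delimiter")
        if line.strip() == "end":
            break
        lines.append(line)
    return json.loads("".join(lines))


if __name__ == "__main__":
    # An in-memory buffer stands in for the pipe between Storm and the component.
    pipe = io.StringIO()
    write_message(pipe, {"command": "emit", "tuple": ["dog", 1]})
    pipe.seek(0)
    print(read_message(pipe))
```

A real component would read from sys.stdin and write to sys.stdout instead of an in-memory buffer, which is also why such components must avoid printing anything else to standard output.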

A small command-line tool (which happens to be written in Clojure) is bundled with streamparse. This tool handles 100% of the Java interop for you, as well as compiling and validating your topology definitions, and it leverages Storm’s extremely handy Clojure DSL. This smooths over the “rough edges” that come from Storm being a JVM-based technology, while still allowing you to mix Java, Clojure, Python, Ruby, or any other language that supports the multi-lang protocol in a single Storm topology.

Conveniently, the sparse tool will not only download the full Storm framework locally, but it will also let you spin up clusters with 2, or 10, or 100 parallel processes in complex data flow topologies within seconds. It will let you debug these topologies locally using slices of your real-time data. And it will package your topologies as an uberjar for submission to a production Storm cluster (which could be running across tens or hundreds of machines) without you lifting a finger or learning anything about Java or Clojure.

All of streamparse’s extensions will leverage fabric, invoke, and virtualenv to manage remote Storm worker machines and synchronize Python dependencies. Configuration of local, beta, and production environments is handled with a simple config.json file.

Join us!

Anyone who is working with Storm in production, or planning to work with Storm and Python, should check out the GitHub page or the Google Group, or get in touch with us on Twitter!

streamparse is currently being developed by Mike Sukmanowsky (@msukmanowsky), Keith Bourgoin (@kbourgoin), and me (@amontalenti), though we are looking for other contributors. Also, if you’re interested in more information on Parse.ly’s contributions to open source, our presentations at conferences, and what it’s like to work on our team, check out the Parse.ly Code & Tech page.
