Apache Storm, Kafka, and Spark are gaining a lot of momentum in the data analysis and processing communities. I was curious whether the interest in using these technologies with Python, in particular, is growing. Based on these Google Trends reports, it seems like it is.
I used a query for “python pandas” to anchor expectations, since this is by far the most popular single-core library in the Python community for data analysis. It looks like there is as much momentum for these JVM-based Apache projects as there is for Python’s Pandas library. Based on the trend, it looks like Apache Spark may also become as popular a topic of inquiry as Pandas currently is.
If you unpack this a little, you see that there is also interest in using Storm, Kafka, and Spark together with Python:
Yet the tooling for all of these technologies with Python is somewhat poor. That seems like an opportunity, and it is one that my team is working on.
See streamparse, samsa, and pyspark-cassandra, for example.
A bit late perhaps … but take a look at https://github.com/TargetHolding/pyspark-cassandra/. It started out as a fork of the pyspark-cassandra project by Parsely and has been taken further since.
Cheers,
Frens