Apache Storm, Kafka, and Spark are gaining a lot of momentum in the data analysis and processing communities. I was curious whether the interest in using these technologies with Python, in particular, is growing. Based on these Google Trends reports, it seems like it is.
I used a query for “python pandas” to anchor expectations, since this is by far the most popular single-core library in the Python community for data analysis. It looks like there is as much momentum for these JVM-based Apache projects as there is for Python’s Pandas library. Based on the trend, it looks like Apache Spark may also become as popular a topic of inquiry as Pandas currently is.
If you dig into this a little, you will see that there is also interest in using Storm, Kafka, and Spark together with Python. Yet the Python tooling for all of these technologies is somewhat poor. That seems like an opportunity, and it is one that my team is working on.
See streamparse, pykafka, and pyspark-cassandra, for example.
A bit late perhaps … but take a look at https://github.com/TargetHolding/pyspark-cassandra/. It started out as a fork of the pyspark-cassandra project by Parsely and has been taken further since.
Cheers,
Frens