========================
Analytics for Journalism
========================
Andrew Montalenti, CTO
.. rst-class:: logo
.. image:: ./_static/parsely.png
:width: 40%
:align: right
What is Parse.ly?
=================
Analytics for digital storytellers.
.. image:: ./_static/banner_01.png
:align: center
.. image:: ./_static/banner_02.png
:align: center
.. image:: ./_static/banner_03.png
:align: center
.. image:: ./_static/banner_04.png
:align: center
Online content ecosystem
========================
Marginal cost of web content: **zero**.
.. rst-class:: build
* | **Print-era content**: Monopolist one-way megaphones.
| Newspapers, TV, radio.
* | **Web-era content**: Distributed n-way channels.
| Media websites, blogs, social networks.
* | **Content deluge**: Commoditized the writer.
| Enthroned the editor.
* | **Editors vs Algorithms**: Google News and Reddit.
| Prismatic and Flipboard.
What is online journalism?
==========================
Being redefined, but not without trouble.
.. rst-class:: build
* | **Legacy definition**: "Content by professionals."
| Is BuzzFeed journalism?
* | **Romantic definition**: "Content in the public interest."
| Is Wikipedia journalism?
* | **Market definition**: "Content worth my money."
| Is Hulu journalism?
My definition
=============
A blend of these definitions:
* Content that informs, inspires, or outrages me.
* Content worth my time and money.
* Content worth my shares and links.
.. image:: ./_static/assange.png
:width: 40%
:align: center
What makes journalism "good"?
=============================
"Good journalism" is that which, it would be beneficial to society that its content were widely dissemenated.
Snowden / Greenwald / Guardian is a recent example.
A whole lot of "bad journalism" subsidizes the good stuff.
.. rst-class:: spaced
.. image:: ./_static/old_news.png
:width: 50%
:align: center
Onward to the tech
==================
.. rst-class:: spaced
.. image:: /_static/tech_stack.png
:width: 90%
:align: center
Why does journalism need analytics?
===================================
Websites have a variety of interesting "first-party" metrics:
* pageviews
* unique visitors
* sessions and paths
* time spent
* page engagement (scroll, copy/paste)
* referrers
* search keywords
E-commerce & ads drove web analytics industry.
Is online journalism special?
=============================
Yes.
* **Short Shelf Life**: average content shelf-life <48 hours
* **High Frequency Publishing**: 1000's posts per day
* **Unclear Conversion Goals**: nothing to buy
.. image:: ./_static/pulse.png
:width: 60%
:align: center
Content metadata is rich
========================
=========== ===================================================
Field Description
=========== ===================================================
title Post or page title (article headline)
link Canonical URL for post/page
image_url URL for associated image
type ``post, frontpage, sectionpage``
media_type ``article, slideshow, video``
pub_date Publication date
section Section of the site (e.g. Politics)
author Author who created the post
tags List of editorially-provided tags
topics List of machine-generated topics
genres List of machine-generated genres
=========== ===================================================
Third-party metrics emerging
============================
* **Comments**: Disqus, LiveFyre, Wordpress
* **Shares**: Twitter, Google+, LinkedIn, Facebook
* **Pins and Saves**: Pinterest, Delicious
* **Upvotes and Likes**: Reddit, Digg
* **Queues**: Instapaper, Readability
.. image:: ./_static/social_icons.png
:width: 60%
:align: center
Time series data
================
.. image:: ./_static/sparklines_multiple.png
:align: center
.. image:: ./_static/sparklines_stacked.png
:align: center
Summary data
============
.. rst-class:: spaced
.. image:: ./_static/summary_viz.png
:align: center
Benchmark data
==============
.. rst-class:: spaced
.. image:: ./_static/benchmarked_viz.png
:align: center
Information radiators
=====================
.. rst-class:: spaced
.. image:: ./_static/glimpse.png
:width: 100%
:align: center
Demo time
=========
.. image:: ./_static/dash.png
:width: 70%
:align: center
Parse.ly tech stack
===================
Parse.ly is a **Python** & **Javascript** shop.
(Some Java used begrudgingly and as necessary.)
.. rst-class:: spaced
.. image:: ./_static/monitors.jpg
:width: 90%
:align: center
Data centers (1)
================
Long evolution to get our current state:
* **Small Beginnings**: 2009, a 1U server I snuck into my friend's cage.
* **Scaling Up**: 2010-2013, from 3 to 80 nodes running in Rackspace Cloud.
* **Bleeding Edge**: 2012-2013, second custom data center with 1 terabyte of RAM across 5 machines.
Through all of this, heavy user of Amazon ELB and S3 for data collection
and archiving, and EMR for Hadoop jobs.
Data centers (2)
================
In early 2014, Amazon launched their i2 instance types:
=========== ======== ======== =========
Instance RAM SSD Cores
=========== ======== ======== =========
i2.8xlarge 244 GB 6.4 TB 32
i2.4xlarge 122 GB 3.2 TB 16
i2.2xlarge 61 GB 1.6 TB 8
=========== ======== ======== =========
* $20/GB of RAM per month on-demand
* 1/2 the price of Rackspace Cloud
* Only 3X the fully-baked price of running your own colo
* Big memory, performant CPU, and fast I/O: all three!
* The golden age of analytics.
Scale
=====
* **8 billion pageviews per month** in Jan 2014
* Typical **>3,000 requests per second** daily peak
* Nearly **30 terabytes** of raw compressed data
.. rst-class:: spaced
.. image:: ./_static/pv_growth.png
:width: 90%
:align: center
Stack Overview
==============
.. rst-class:: spaced
.. image:: ./_static/oss_logos.png
:width: 90%
:align: center
Backend Stack, v1 (2010-2011)
=============================
============= =======================================
Tool Usage
============= =======================================
nginx data collection
Amazon S3 raw logs for offline analysis
MongoDB pre-aggregated data
feedparser RSS/Atom feed parsing
Celery distributed task queue
============= =======================================
Backend Stack, v2 (2011-2012)
=============================
============= =======================================
Tool Usage
============= =======================================
Cloud LBs\* data collection **without SPOF**
node.js\* **fast, dynamic** Javascript config
Amazon S3 raw logs for offline analysis
MongoDB\* **sharded** pre-aggregated data
Redis\* **real-time** data; past 24h, minutely
Scrapy\* **maintainable** web crawling
Celery distributed task queue
ZeroMQ\* **lightweight** service communication
Hadoop\* **compute-intensive** offline analysis
Solr\* **rich** content indexing
============= =======================================
Backend Stack, v3 (2012-2013)
=============================
============= ==========================================
Tool Usage
============= ==========================================
Cloud LBs data collection without SPOF
node.js fast, dynamic Javascript configuration
Amazon S3 raw logs for offline analysis
MongoDB sharded, replicated aggregate data
Redis real-time data; past 24h, minutely
Scrapy maintainable web crawling
Storm\* **elastic** distributed task queue
Kafka\* **fast, reliable** service communication
hll\* **memory-stable** estimated cardinality
Pig\* **readable** offline analysis scripts
SolrCloud\* **scalable** content indexing, trends
============= ==========================================
Frontend Stack, v1 (2010-2012)
==============================
============= ==========================================
Tool Usage
============= ==========================================
Django web app framework
jQuery Javascript utilities
Protoviz.js data visualization framework
============= ==========================================
Frontend Stack, v2 (2012-2013)
==============================
============= ==========================================
Tool Usage
============= ==========================================
Django web app framework
jQuery Javascript utilities
Bootstrap\* **responsive** Javascript/CSS layouts
Pandas\* **in-memory** data manipulation
LESS\* **modular** CSS styling
d3.js\* **customizable** dataviz framework
rq\* **asynchronous** reporting
Tornado\* **high-performance**, REST/JSON API
============= ==========================================
Other important infrastructure
==============================
============= ==========================================
Tool Usage
============= ==========================================
Graphite internal service statistics
Munin system health and heartbeat metrics
Sentry plant-wide exception catching
logstash plant-wide logging
Chef server configuration management
vagrant local VM-based development
Fabric scriptable SSH sessions
============= ==========================================
Things we learned
=================
.. rst-class:: build
* Web data is messy.
* Aggregates over time series data is a hard problem.
* Must "become one" with your raw data.
* Server automation (via e.g. Chef) is crucial.
* Batch vs Stream path: an important pattern.
* High-memory cloud servers have changed the game.
Steamroller, meet road
======================
.. rst-class:: build
* Sunk cost thinking is dangerous: full steam ahead.
* What we built one year ago: "how did we do this!?"
* Looking at it today: some project's "Hello, World!"
* Analytics space is loaded with tough problems.
* Some are being solved by open source, others not.
* Focus is **essential**.
2014 areas of interest
======================
* **Text mining**: Wikidata, content clustering.
* **Hourly storage**: MongoDB, schema redesigns.
* **Milestone Alerting**: Statistical approaches.
* **Content optimization**: Solr, Function Queries.
* **Visitor analysis**: Cassandra, wide-row storage.
* **Network trends**: More work with Apache Pig.
Crazy data ideas
================
.. rst-class:: build
* Storm-DRPC for live aggregates?
* Real-time Map/Reduce?
* Clojure and Python in harmony?
* ElasticSearch/Solr for time series data?
Where are we going with this?
=============================
.. rst-class:: spaced
.. image:: ./_static/gel_metrics.png
:width: 90%
:align: center
Growth
======
* New Monthly Visitors
* New Linking Domains
* New Shares
* Web-Wide Trends
Engagement
==========
* Avg Time Spent
* Avg Posts per Visit
* Shares Per Post
Loyalty
=======
* Monthly Repeat Visitors
* Monthly Homepage Visitors
* Visits Per Month
* Percent with Multiple Daily Visits
Using GEL for visitor targeting
===============================
============== ======================================
Segment Target with...
============== ======================================
**Growth** Ads, e-mail newsletters, follows
**Engagement** Premium ads, sponsored content
**Loyalty** Subscriptions, ebooks, paid content
============== ======================================
Using GEL for content strategy
==============================
============== ======================================
Maturity Invest in...
============== ======================================
**Growth** Short-form, shareable, unique
**Engagement** Typical, emotional, convenient
**Loyalty** Long-form, insightful, indispensible
============== ======================================
API Engagement Tools
====================
.. rst-class:: spaced
.. image:: ./_static/ars_related_stories.png
:align: center
:width: 80%
API Loyalty Tools
=================
.. rst-class:: spaced
.. image:: ./_static/ars_mystories.png
:align: center
Conclusion
==========
.. rst-class:: spaced
.. image:: ./_static/parsely.png
:width: 90%
:align: center
.. rst-class:: build
* We believe **great stories deserve a big audience**.
* We've built an **analytics platform for digital storytellers**.
* "Big Data": it actually applies here.
* "Big Payoff": **help journalism thrive in the digital age**.
Contact Us
==========
Get in touch. We're hiring :)
* http://parse.ly
* http://twitter.com/parsely
And me:
* http://pixelmonkey.org
* http://twitter.com/amontalenti
.. ifnotslides::
.. raw:: html
.. ifslides::
.. raw:: html