Analytics for Journalism

Andrew Montalenti, CTO

What is

Analytics for digital storytellers.

Online content ecosystem

Marginal cost of web content: zero.

What is online journalism?

Being redefined, but not without trouble.

My definition

A blend of these definitions:


What makes journalism "good"?

"Good journalism" is journalism whose wide dissemination would benefit society.

Snowden / Greenwald / Guardian is a recent example.

A whole lot of "bad journalism" subsidizes the good stuff.


Onward to the tech


Why does journalism need analytics?

Websites have a variety of interesting "first-party" metrics:

E-commerce and ads drove the web analytics industry.

Is online journalism special?



Content metadata is rich

Field       Description
title       Post or page title (article headline)
link        Canonical URL for post/page
image_url   URL for associated image
type        post, frontpage, sectionpage
media_type  article, slideshow, video
pub_date    Publication date
section     Section of the site (e.g. Politics)
author      Author who created the post
tags        List of editorially provided tags
topics      List of machine-generated topics
genres      List of machine-generated genres
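
The fields above map naturally onto a small record type. The following is a minimal sketch, not the actual schema: the class name `ContentMetadata`, the defaults, the validation, and the example values (including the example.com URL) are assumptions made for illustration. Only the field names and the allowed values for `type` and `media_type` come from the table.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# Allowed values as listed in the table above.
VALID_TYPES = {"post", "frontpage", "sectionpage"}
VALID_MEDIA_TYPES = {"article", "slideshow", "video"}


@dataclass
class ContentMetadata:
    """Hypothetical record for one post or page."""
    title: str
    link: str
    image_url: Optional[str] = None
    type: str = "post"
    media_type: str = "article"
    pub_date: Optional[datetime] = None
    section: Optional[str] = None
    author: Optional[str] = None
    tags: List[str] = field(default_factory=list)    # editorially provided
    topics: List[str] = field(default_factory=list)  # machine-generated
    genres: List[str] = field(default_factory=list)  # machine-generated

    def __post_init__(self):
        # Reject values outside the enumerations from the table.
        if self.type not in VALID_TYPES:
            raise ValueError("unknown type: %r" % self.type)
        if self.media_type not in VALID_MEDIA_TYPES:
            raise ValueError("unknown media_type: %r" % self.media_type)


# Illustrative values only.
example = ContentMetadata(
    title="Example headline",
    link="http://example.com/politics/example-headline",
    pub_date=datetime(2014, 1, 15, 9, 30),
    section="Politics",
    author="Jane Doe",
    tags=["election"],
)
```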

Third-party metrics emerging


Time series data

[Figures: sparklines, multiple and stacked]

Summary data


Benchmark data


Information radiators


Demo time

[Figure: dash.png]

Tech stack: we are a Python & JavaScript shop.

(Some Java used begrudgingly and as necessary.)


Data centers (1)

Long evolution to reach our current state:

Throughout, we were heavy users of Amazon ELB and S3 for data collection and archiving, and of EMR for Hadoop jobs.

Data centers (2)

In early 2014, Amazon launched their i2 instance types:

Instance    RAM     SSD     Cores
i2.8xlarge  244 GB  6.4 TB  32
i2.4xlarge  122 GB  3.2 TB  16
i2.2xlarge  61 GB   1.6 TB  8



Stack Overview


Backend Stack, v1 (2010-2011)

Tool        Usage
nginx       data collection
Amazon S3   raw logs for offline analysis
MongoDB     pre-aggregated data
feedparser  RSS/Atom feed parsing
Celery      distributed task queue
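
To make "pre-aggregated data" concrete: the idea is to roll raw events into counters at write time, so that dashboards read small summaries instead of scanning logs. The sketch below is an assumption about what that could look like, using an in-memory `Counter` in place of MongoDB. The function names, the hourly bucket size, and the sample events are all hypothetical. In MongoDB the same pattern is often written as an upsert with the `$inc` operator.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def pre_aggregate(events):
    """Roll raw (url, timestamp) pageview events into per-URL hourly counts."""
    counts = Counter()
    for url, ts in events:
        counts[(url, hour_bucket(ts))] += 1
    return counts


# Hypothetical raw events.
events = [
    ("/a", datetime(2011, 1, 1, 10, 5)),
    ("/a", datetime(2011, 1, 1, 10, 40)),
    ("/b", datetime(2011, 1, 1, 10, 59)),
    ("/a", datetime(2011, 1, 1, 11, 0)),
]
hourly = pre_aggregate(events)
```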

Backend Stack, v2 (2011-2012)

Tool        Usage
Cloud LBs*  data collection without SPOF
node.js*    fast, dynamic JavaScript config
Amazon S3   raw logs for offline analysis
MongoDB*    sharded pre-aggregated data
Redis*      real-time data; past 24h, minutely
Scrapy*     maintainable web crawling
Celery      distributed task queue
ZeroMQ*     lightweight service communication
Hadoop*     compute-intensive offline analysis
Solr*       rich content indexing
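
The "past 24h, minutely" entry describes a rolling window of one-minute buckets. Here is a rough sketch of that data shape, with a plain dict standing in for Redis. The class name `MinutelySeries` and its methods are hypothetical, and this is not a description of how the production system stores its keys.

```python
WINDOW_MINUTES = 24 * 60  # keep the past 24 hours


class MinutelySeries:
    """Per-minute counters that drop buckets older than the window."""

    def __init__(self, window=WINDOW_MINUTES):
        self.window = window
        self.buckets = {}  # minute index -> count

    def record(self, epoch_seconds, n=1):
        minute = int(epoch_seconds // 60)
        self.buckets[minute] = self.buckets.get(minute, 0) + n
        self._expire()

    def _expire(self):
        # Keep only the `window` most recent minutes, newest bucket included.
        cutoff = max(self.buckets) - self.window
        for minute in [m for m in self.buckets if m <= cutoff]:
            del self.buckets[minute]

    def total(self):
        return sum(self.buckets.values())


series = MinutelySeries()
for t in (0, 30, 60):        # two hits in minute 0, one in minute 1
    series.record(t)
total_before = series.total()
series.record(24 * 3600)     # 24 hours later: minute 0 falls out of the window
total_after = series.total()
```

In a real deployment the expiry would more likely be delegated to the store itself, which is one reason a key-value store with time-to-live support suits this kind of data.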

Backend Stack, v3 (2012-2013)

Tool        Usage
Cloud LBs   data collection without SPOF
node.js     fast, dynamic JavaScript configuration
Amazon S3   raw logs for offline analysis
MongoDB     sharded, replicated aggregate data
Redis       real-time data; past 24h, minutely
Scrapy      maintainable web crawling
Storm*      elastic distributed task queue
Kafka*      fast, reliable service communication
hll*        memory-stable estimated cardinality
Pig*        readable offline analysis scripts
SolrCloud*  scalable content indexing, trends
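
"Memory-stable estimated cardinality" refers to HyperLogLog-style counting: a fixed number of small registers estimates how many distinct items (for example, unique visitors) have been seen, however many arrive. The following is a from-scratch sketch of the standard algorithm for illustration. It is not the `hll` tool named in the table, and the class name, the choice of SHA-1, and `p=12` are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Fixed-memory estimate of the number of distinct items seen."""

    def __init__(self, p=12):
        self.p = p                     # number of index bits
        self.m = 1 << p                # number of registers (4096 for p=12)
        self.registers = [0] * self.m  # memory use never grows past this
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit hash value
        idx = h >> (64 - self.p)                # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic-mean estimator over all registers.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add("visitor-%d" % i)
estimate = hll.count()
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Re-adding an item that was already seen leaves the registers unchanged.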

Frontend Stack, v1 (2010-2012)

Tool      Usage
Django    web app framework
jQuery    JavaScript utilities
Protovis  data visualization framework

Frontend Stack, v2 (2012-2013)

Tool        Usage
Django      web app framework
jQuery      JavaScript utilities
Bootstrap*  responsive JavaScript/CSS layouts
Pandas*     in-memory data manipulation
LESS*       modular CSS styling
d3.js*      customizable dataviz framework
rq*         asynchronous reporting
Tornado*    high-performance REST/JSON API

Other important infrastructure

Tool      Usage
Graphite  internal service statistics
Munin     system health and heartbeat metrics
Sentry    plant-wide exception catching
logstash  plant-wide logging
Chef      server configuration management
Vagrant   local VM-based development
Fabric    scriptable SSH sessions

Things we learned

Steamroller, meet road

2014 areas of interest

Crazy data ideas

Where are we going with this?





Using GEL for visitor targeting

Segment     Target with...
Growth      Ads, e-mail newsletters, follows
Engagement  Premium ads, sponsored content
Loyalty     Subscriptions, ebooks, paid content
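
As a small illustration, the table above can be expressed as a lookup from segment to targeting options. The names `GEL_TARGETING` and `targets_for` are hypothetical, and how a visitor gets assigned to a segment is outside the scope of this sketch.

```python
# Segment -> targeting options, copied from the table above.
GEL_TARGETING = {
    "growth": ["ads", "e-mail newsletters", "follows"],
    "engagement": ["premium ads", "sponsored content"],
    "loyalty": ["subscriptions", "ebooks", "paid content"],
}


def targets_for(segment):
    """Return the targeting options for a GEL segment (case-insensitive)."""
    try:
        return GEL_TARGETING[segment.lower()]
    except KeyError:
        raise ValueError("unknown GEL segment: %r" % segment)


loyalty_targets = targets_for("Loyalty")
```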

Using GEL for content strategy

Maturity    Invest in...
Growth      Short-form, shareable, unique
Engagement  Typical, emotional, convenient
Loyalty     Long-form, insightful, indispensable

API Engagement Tools


API Loyalty Tools




Contact Us

Get in touch. We're hiring :)

And me: