Analytics for Journalism

Andrew Montalenti, CTO

What is Parse.ly?

Analytics for digital storytellers.

_images/banner_01.png _images/banner_02.png _images/banner_03.png _images/banner_04.png

Online content ecosystem

Marginal cost of web content: zero.

What is online journalism?

Being redefined, but not without trouble.

My definition

A blend of these definitions:

_images/assange.png

What makes journalism "good"?

"Good journalism" is journalism whose wide dissemination would benefit society.

Snowden / Greenwald / Guardian is a recent example.

A whole lot of "bad journalism" subsidizes the good stuff.

_images/old_news.png

Onward to the tech

_images/tech_stack.png

Why does journalism need analytics?

Websites have a variety of interesting "first-party" metrics:

E-commerce & ads drove web analytics industry.

Is online journalism special?

Yes.

_images/pulse.png

Content metadata is rich

Field       Description
title       Post or page title (article headline)
link        Canonical URL for post/page
image_url   URL for associated image
type        post, frontpage, sectionpage
media_type  article, slideshow, video
pub_date    Publication date
section     Section of the site (e.g. Politics)
author      Author who created the post
tags        List of editorially-provided tags
topics      List of machine-generated topics
genres      List of machine-generated genres
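As a rough illustration, the fields above could be carried around as a small typed record. This is only a sketch: the class name `ContentMetadata`, the `validate` helper, the choice of `datetime` for `pub_date`, and the example values are all assumptions for illustration, not Parse.ly's actual schema or API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

# Allowed values, taken from the "type" and "media_type" rows of the table.
PAGE_TYPES = {"post", "frontpage", "sectionpage"}
MEDIA_TYPES = {"article", "slideshow", "video"}


@dataclass
class ContentMetadata:
    """Hypothetical container for one page's metadata (not an official schema)."""
    title: str
    link: str
    image_url: str
    type: str
    media_type: str
    pub_date: datetime  # assumption: the source only says "Publication date"
    section: str
    author: str
    tags: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)
    genres: List[str] = field(default_factory=list)


def validate(meta: ContentMetadata) -> List[str]:
    """Return a list of human-readable problems; empty if the record looks sane."""
    problems = []
    if meta.type not in PAGE_TYPES:
        problems.append(f"unknown type: {meta.type!r}")
    if meta.media_type not in MEDIA_TYPES:
        problems.append(f"unknown media_type: {meta.media_type!r}")
    return problems


# Made-up example values on a placeholder domain.
example = ContentMetadata(
    title="City council approves budget",
    link="https://example.com/politics/budget",
    image_url="https://example.com/img/budget.jpg",
    type="post",
    media_type="article",
    pub_date=datetime(2014, 1, 15, 9, 30),
    section="Politics",
    author="Jane Doe",
    tags=["budget", "city council"],
)
print(validate(example))  # prints []
```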

Third-party metrics emerging

_images/social_icons.png

Time series data

_images/sparklines_multiple.png _images/sparklines_stacked.png

Summary data

_images/summary_viz.png

Benchmark data

_images/benchmarked_viz.png

Information radiators

_images/glimpse.png

Demo time

_images/dash.png

Parse.ly tech stack

Parse.ly is a Python & JavaScript shop.

(Some Java used begrudgingly and as necessary.)

_images/monitors.jpg

Data centers (1)

Long evolution to get to our current state:

Through all of this, we have been heavy users of Amazon ELB and S3 for data collection and archiving, and of EMR for Hadoop jobs.

Data centers (2)

In early 2014, Amazon launched their i2 instance types:

Instance    RAM     SSD     Cores
i2.8xlarge  244 GB  6.4 TB  32
i2.4xlarge  122 GB  3.2 TB  16
i2.2xlarge  61 GB   1.6 TB  8

Scale

_images/pv_growth.png

Stack Overview

_images/oss_logos.png

Backend Stack, v1 (2010-2011)

Tool        Usage
nginx       data collection
Amazon S3   raw logs for offline analysis
MongoDB     pre-aggregated data
feedparser  RSS/Atom feed parsing
Celery      distributed task queue

Backend Stack, v2 (2011-2012)

Tool        Usage
Cloud LBs*  data collection without SPOF
node.js*    fast, dynamic JavaScript config
Amazon S3   raw logs for offline analysis
MongoDB*    sharded pre-aggregated data
Redis*      real-time data; past 24h, minutely
Scrapy*     maintainable web crawling
Celery      distributed task queue
ZeroMQ*     lightweight service communication
Hadoop*     compute-intensive offline analysis
Solr*       rich content indexing
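The "past 24h, minutely" row describes a rolling window of per-minute counters. The sketch below shows that idea in plain Python, with an in-memory dict standing in for Redis; the class name `MinutelyWindow` and its methods are hypothetical, and this is not how Parse.ly's store is actually implemented.

```python
WINDOW_MINUTES = 24 * 60  # keep one bucket per minute for the past 24 hours


class MinutelyWindow:
    """Hypothetical rolling window of per-minute hit counts."""

    def __init__(self, window=WINDOW_MINUTES):
        self.window = window
        self.buckets = {}  # minute index (epoch seconds // 60) -> count

    def hit(self, epoch_seconds, n=1):
        """Record n hits at the given timestamp and drop expired buckets."""
        minute = int(epoch_seconds // 60)
        self.buckets[minute] = self.buckets.get(minute, 0) + n
        self._evict(minute)

    def _evict(self, now_minute):
        # Buckets at or before the cutoff fall outside the window.
        cutoff = now_minute - self.window
        for m in [m for m in self.buckets if m <= cutoff]:
            del self.buckets[m]

    def total(self, now_epoch_seconds):
        """Sum of hits in the window ending at the given timestamp."""
        now_minute = int(now_epoch_seconds // 60)
        cutoff = now_minute - self.window
        return sum(c for m, c in self.buckets.items() if cutoff < m <= now_minute)


w = MinutelyWindow()
w.hit(0)
w.hit(30)       # same minute as the first hit
w.hit(60)       # next minute
print(w.total(60))  # prints 3
```

Memory stays bounded at one counter per minute in the window, regardless of traffic volume, which is the main appeal of this layout for real-time dashboards.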

Backend Stack, v3 (2012-2013)

Tool        Usage
Cloud LBs   data collection without SPOF
node.js     fast, dynamic JavaScript configuration
Amazon S3   raw logs for offline analysis
MongoDB     sharded, replicated aggregate data
Redis       real-time data; past 24h, minutely
Scrapy      maintainable web crawling
Storm*      elastic distributed task queue
Kafka*      fast, reliable service communication
hll*        memory-stable estimated cardinality
Pig*        readable offline analysis scripts
SolrCloud*  scalable content indexing, trends
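The "hll" row refers to HyperLogLog-style cardinality estimation: counting unique visitors with a fixed amount of memory. The following is a textbook-style sketch of the technique, not the library used in the stack; the class name, the SHA-1 hash, and the precision `p=12` (4096 registers) are assumptions chosen for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch; memory is fixed at 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"visitor-{i}")
print(round(hll.count()))  # an estimate near 50000, not an exact count
```

With 4096 registers the expected relative error is roughly 1.6%, and adding the same visitor twice never changes the estimate, which is why the memory footprint stays stable as traffic grows.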

Frontend Stack, v1 (2010-2012)

Tool      Usage
Django    web app framework
jQuery    JavaScript utilities
Protovis  data visualization framework

Frontend Stack, v2 (2012-2013)

Tool        Usage
Django      web app framework
jQuery      JavaScript utilities
Bootstrap*  responsive JavaScript/CSS layouts
Pandas*     in-memory data manipulation
LESS*       modular CSS styling
d3.js*      customizable dataviz framework
rq*         asynchronous reporting
Tornado*    high-performance REST/JSON API

Other important infrastructure

Tool      Usage
Graphite  internal service statistics
Munin     system health and heartbeat metrics
Sentry    plant-wide exception catching
logstash  plant-wide logging
Chef      server configuration management
vagrant   local VM-based development
Fabric    scriptable SSH sessions

Things we learned

Steamroller, meet road

2014 areas of interest

Crazy data ideas

Where are we going with this?

_images/gel_metrics.png

Growth

Engagement

Loyalty

Using GEL for visitor targeting

Segment     Target with...
Growth      Ads, e-mail newsletters, follows
Engagement  Premium ads, sponsored content
Loyalty     Subscriptions, ebooks, paid content

Using GEL for content strategy

Maturity    Invest in...
Growth      Short-form, shareable, unique
Engagement  Typical, emotional, convenient
Loyalty     Long-form, insightful, indispensable

API Engagement Tools

_images/ars_related_stories.png

API Loyalty Tools

_images/ars_mystories.png

Conclusion

_images/parsely.png

Contact Us

Get in touch. We're hiring :)

And me: