Analytics for Journalism
Andrew Montalenti, CTO
Analytics for digital storytellers.
Marginal cost of web content: zero.
Being redefined, but not without trouble.
A blend of these definitions:
"Good journalism" is journalism whose wide dissemination would benefit society.
Snowden / Greenwald / Guardian is a recent example.
A whole lot of "bad journalism" subsidizes the good stuff.
Websites have a variety of interesting "first-party" metrics:
E-commerce & ads drove web analytics industry.
Yes.
Field       Description
title       Post or page title (article headline)
link        Canonical URL for post/page
image_url   URL for associated image
type        post, frontpage, sectionpage
media_type  article, slideshow, video
pub_date    Publication date
section     Section of the site (e.g. Politics)
author      Author who created the post
tags        List of editorially-provided tags
topics      List of machine-generated topics
genres      List of machine-generated genres
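The metadata fields above map naturally onto a small record type. As a rough sketch in Python, assuming the field names from the table: the class name `PageMetadata`, the default values, and the example URL are illustrative assumptions, not Parse.ly's actual schema code.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
from typing import List


@dataclass
class PageMetadata:
    """Hypothetical record mirroring the metadata fields listed above."""
    title: str                      # post or page title (article headline)
    link: str                       # canonical URL for the post/page
    pub_date: datetime              # publication date
    image_url: str = ""             # URL for associated image
    type: str = "post"              # post, frontpage, sectionpage
    media_type: str = "article"     # article, slideshow, video
    section: str = ""               # section of the site, e.g. "Politics"
    author: str = ""                # author who created the post
    tags: List[str] = field(default_factory=list)    # editorially provided
    topics: List[str] = field(default_factory=list)  # machine generated
    genres: List[str] = field(default_factory=list)  # machine generated


# Example record; every value here is made up for illustration.
example = PageMetadata(
    title="Example headline",
    link="http://example.com/politics/example-headline",
    pub_date=datetime(2014, 1, 15),
    section="Politics",
    author="Jane Doe",
    tags=["election"],
)

# asdict() yields a plain dict, convenient for JSON encoding or storage.
record = asdict(example)
```

Keeping the editorial `tags` separate from the machine-generated `topics` and `genres` preserves the distinction the table draws between human and automated classification.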
Parse.ly is a Python & Javascript shop.
(Some Java used begrudgingly and as necessary.)
Long evolution to get to our current state:
Through all of this, heavy user of Amazon ELB and S3 for data collection and archiving, and EMR for Hadoop jobs.
In early 2014, Amazon launched their i2 instance types:
Instance     RAM     SSD     Cores
i2.8xlarge   244 GB  6.4 TB  32
i2.4xlarge   122 GB  3.2 TB  16
i2.2xlarge   61 GB   1.6 TB  8
Tool        Usage
nginx       data collection
Amazon S3   raw logs for offline analysis
MongoDB     pre-aggregated data
feedparser  RSS/Atom feed parsing
Celery      distributed task queue
Tool        Usage
Cloud LBs*  data collection without SPOF
node.js*    fast, dynamic Javascript config
Amazon S3   raw logs for offline analysis
MongoDB*    sharded pre-aggregated data
Redis*      real-time data; past 24h, minutely
Scrapy*     maintainable web crawling
Celery      distributed task queue
ZeroMQ*     lightweight service communication
Hadoop*     compute-intensive offline analysis
Solr*       rich content indexing
Tool        Usage
Cloud LBs   data collection without SPOF
node.js     fast, dynamic Javascript configuration
Amazon S3   raw logs for offline analysis
MongoDB     sharded, replicated aggregate data
Redis       real-time data; past 24h, minutely
Scrapy      maintainable web crawling
Storm*      elastic distributed task queue
Kafka*      fast, reliable service communication
hll*        memory-stable estimated cardinality
Pig*        readable offline analysis scripts
SolrCloud*  scalable content indexing, trends
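To make "memory-stable estimated cardinality" concrete, here is a minimal HyperLogLog sketch in pure Python. It is a teaching illustration under stated assumptions (SHA-1 as the hash, 1024 registers, a class named `HyperLogLog`), not the hll library referenced in the table. Memory stays fixed at the register array no matter how many distinct visitors are added.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog: fixed memory, approximate distinct counts."""

    def __init__(self, p=10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (1024 for p=10)
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1 (an assumption;
        # any well-mixed 64-bit hash would do).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Usage: count distinct (made-up) visitor IDs with a fixed-size structure.
hll = HyperLogLog()
for i in range(10000):
    hll.add(f"visitor-{i}")
estimate = hll.count()
```

With 1024 registers the expected relative error is roughly 3%, and re-adding a visitor that was already seen never changes the estimate, which is what makes the structure suitable for unique-visitor counts.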
Tool      Usage
Django    web app framework
jQuery    Javascript utilities
Protovis  data visualization framework
Tool        Usage
Django      web app framework
jQuery      Javascript utilities
Bootstrap*  responsive Javascript/CSS layouts
Pandas*     in-memory data manipulation
LESS*       modular CSS styling
d3.js*      customizable dataviz framework
rq*         asynchronous reporting
Tornado*    high-performance, REST/JSON API
Tool      Usage
Graphite  internal service statistics
Munin     system health and heartbeat metrics
Sentry    plant-wide exception catching
logstash  plant-wide logging
Chef      server configuration management
vagrant   local VM-based development
Fabric    scriptable SSH sessions
Segment     Target with...
Growth      Ads, e-mail newsletters, follows
Engagement  Premium ads, sponsored content
Loyalty     Subscriptions, ebooks, paid content
Maturity    Invest in...
Growth      Short-form, shareable, unique
Engagement  Typical, emotional, convenient
Loyalty     Long-form, insightful, indispensable
Get in touch. We're hiring :)
And me: