Andrew Montalenti, CTO
Marginal cost of web content: zero.
Being redefined, but not without trouble.
A blend of these definitions:
“Good journalism” is content whose wide dissemination would benefit society.
Snowden / Greenwald / Guardian is a recent example.
A whole lot of “bad journalism” subsidizes the good stuff.
Websites have a variety of interesting “first-party” metrics:
E-commerce & ads drove web analytics industry.
Yes.
Field        Description
title        Post or page title (article headline)
link         Canonical URL for post/page
image_url    URL for associated image
type         post, frontpage, sectionpage
media_type   article, slideshow, video
pub_date     Publication date
section      Section of the site (e.g. Politics)
author       Author who created the post
tags         List of editorially-provided tags
topics       List of machine-generated topics
genres       List of machine-generated genres
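As a rough illustration, the metadata fields listed above could be modeled as a single record type. This is only a sketch: the class name `PageMetadata`, the choice of which fields are optional, the default values, and the example URL are all assumptions made for illustration, not Parse.ly's actual schema or code.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# Allowed values taken from the field table; everything else is assumed.
PAGE_TYPES = {"post", "frontpage", "sectionpage"}
MEDIA_TYPES = {"article", "slideshow", "video"}


@dataclass
class PageMetadata:
    """Hypothetical record for one crawled post or page."""
    title: str                            # post or page title (headline)
    link: str                             # canonical URL for the post/page
    type: str = "post"                    # post, frontpage, or sectionpage
    media_type: str = "article"           # article, slideshow, or video
    image_url: Optional[str] = None       # URL for associated image
    pub_date: Optional[datetime] = None   # publication date
    section: Optional[str] = None         # e.g. "Politics"
    author: Optional[str] = None          # author who created the post
    tags: List[str] = field(default_factory=list)    # editorially provided
    topics: List[str] = field(default_factory=list)  # machine-generated
    genres: List[str] = field(default_factory=list)  # machine-generated

    def __post_init__(self) -> None:
        # Reject values outside the enumerations given in the table.
        if self.type not in PAGE_TYPES:
            raise ValueError(f"unknown type: {self.type!r}")
        if self.media_type not in MEDIA_TYPES:
            raise ValueError(f"unknown media_type: {self.media_type!r}")


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    page = PageMetadata(
        title="Example headline",
        link="http://example.com/politics/example-headline",
        section="Politics",
        tags=["example"],
    )
    print(page)
```

Keeping editorial `tags` separate from machine-generated `topics` and `genres` mirrors the distinction the table draws between the two sources.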
Parse.ly is a Python & Javascript shop.
(Some Java used begrudgingly and as necessary.)
Servers running across:
Over 60 production nodes with approximately 1 terabyte of hot production RAM.
Tool         Usage
nginx        data collection
Amazon S3    raw logs for offline analysis
MongoDB      pre-aggregated data
feedparser   RSS/Atom feed parsing
Celery       distributed task queue

Tool         Usage
Cloud LBs*   data collection without SPOF
node.js*     fast, dynamic Javascript config
Amazon S3    raw logs for offline analysis
MongoDB*     sharded pre-aggregated data
Redis*       real-time data; past 24h, minutely
Scrapy*      maintainable web crawling
Celery       distributed task queue
ZeroMQ*      lightweight service communication
Hadoop*      compute-intensive offline analysis
Solr*        rich content indexing

Tool         Usage
Cloud LBs    data collection without SPOF
node.js      fast, dynamic Javascript configuration
Amazon S3    raw logs for offline analysis
MongoDB      sharded, replicated aggregate data
Redis        real-time data; past 24h, minutely
Scrapy       maintainable web crawling
Storm*       elastic distributed task queue
Kafka*       fast, reliable service communication
hll*         memory-stable estimated cardinality
Pig*         readable offline analysis scripts
SolrCloud*   scalable content indexing, trends
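The "real-time data; past 24h, minutely" role assigned to Redis can be pictured as a set of per-minute counters that expire after 24 hours. The sketch below is a pure-Python, in-memory stand-in for that idea. It does not use Redis and is not Parse.ly's implementation; the class `MinutelyCounter` and its methods are hypothetical names chosen for illustration.

```python
MINUTE = 60                 # seconds per bucket
WINDOW_MINUTES = 24 * 60    # keep the past 24 hours


class MinutelyCounter:
    """Hypothetical rolling counter with one bucket per minute."""

    def __init__(self, window_minutes: int = WINDOW_MINUTES) -> None:
        self.window_minutes = window_minutes
        self.buckets = {}  # minute index -> event count

    def record(self, timestamp: float, count: int = 1) -> None:
        """Add events at a Unix timestamp and drop expired buckets."""
        minute = int(timestamp // MINUTE)
        self.buckets[minute] = self.buckets.get(minute, 0) + count
        self._expire(minute)

    def _expire(self, current_minute: int) -> None:
        # Anything at or before the cutoff falls outside the window.
        cutoff = current_minute - self.window_minutes
        for m in [m for m in self.buckets if m <= cutoff]:
            del self.buckets[m]

    def total(self, now: float, last_minutes: int) -> int:
        """Sum events in the most recent `last_minutes` buckets."""
        current = int(now // MINUTE)
        start = current - last_minutes + 1
        return sum(c for m, c in self.buckets.items()
                   if start <= m <= current)


if __name__ == "__main__":
    counter = MinutelyCounter()
    counter.record(0)     # minute 0
    counter.record(30)    # minute 0
    counter.record(61)    # minute 1
    print(counter.total(now=61, last_minutes=2))
```

Fixed-width buckets keep memory bounded by the window length (at most 1,440 buckets per counter here) regardless of traffic volume.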
Tool         Usage
Django       web app framework
jQuery       Javascript utilities
Protovis     data visualization framework

Tool         Usage
Django       web app framework
jQuery       Javascript utilities
Bootstrap*   responsive Javascript/CSS layouts
Pandas*      in-memory data manipulation
LESS*        modular CSS styling
d3.js*       customizable dataviz framework
rq*          asynchronous reporting
Tornado*     high-performance, REST/JSON API

Tool         Usage
Graphite     internal service statistics
Munin        system health and heartbeat metrics
Sentry       plant-wide exception catching
logstash     plant-wide logging
Chef         server configuration management
vagrant      local VM-based development
Fabric       scriptable SSH sessions
Segment      Target with...
Growth       Ads, e-mail newsletters, follows
Engagement   Premium ads, sponsored content
Loyalty      Subscriptions, ebooks, paid content

Maturity     Invest in...
Growth       Short-form, shareable, unique
Engagement   Medium-form, emotional, convenient
Loyalty      Long-form, insightful, indispensable