========================
Analytics for Journalism
========================

Andrew Montalenti, CTO

.. rst-class:: logo

.. image:: ./_static/parsely.png
    :width: 40%
    :align: right

What is Parse.ly?
=================

Analytics for digital storytellers.

.. image:: ./_static/banner_01.png
    :align: center

.. image:: ./_static/banner_02.png
    :align: center

.. image:: ./_static/banner_03.png
    :align: center

.. image:: ./_static/banner_04.png
    :align: center

Online content ecosystem
========================

Marginal cost of web content: **zero**.

.. rst-class:: build

* | **Print-era content**: Monopolist one-way megaphones.
  | Newspapers, TV, radio.

* | **Web-era content**: Distributed n-way channels.
  | Media websites, blogs, social networks.

* | **Content deluge**: Commoditized the writer.
  | Enthroned the editor.

* | **Editors vs Algorithms**: Google News and Reddit.
  | Prismatic and Flipboard.

What is online journalism?
==========================

Being redefined, but not without trouble.

.. rst-class:: build

* | **Legacy definition**: "Content by professionals."
  | Is BuzzFeed journalism?

* | **Romantic definition**: "Content in the public interest."
  | Is Wikipedia journalism?

* | **Market definition**: "Content worth my money."
  | Is Hulu journalism?

My definition
=============

A blend of these definitions:

* Content that informs, inspires, or outrages me.
* Content worth my time and money.
* Content worth my shares and links.

.. image:: ./_static/assange.png
    :width: 40%
    :align: center

What makes journalism "good"?
=============================

"Good journalism" is content whose wide dissemination would benefit society.

Snowden / Greenwald / Guardian is a recent example.

A whole lot of "bad journalism" subsidizes the good stuff.

.. rst-class:: spaced

.. image:: ./_static/old_news.png
    :width: 50%
    :align: center

Onward to the tech
==================

.. rst-class:: spaced

.. image:: /_static/tech_stack.png
    :width: 90%
    :align: center

Why does journalism need analytics?
===================================

Websites have a variety of interesting "first-party" metrics:

* pageviews
* unique visitors
* sessions and paths
* time spent
* page engagement (scroll, copy/paste)
* referrers
* search keywords

E-commerce and ads drove the web analytics industry.

Is online journalism special?
=============================

Yes.

* **Short Shelf Life**: average content shelf-life <48 hours
* **High Frequency Publishing**: 1000s of posts per day
* **Unclear Conversion Goals**: nothing to buy

.. image:: ./_static/pulse.png
    :width: 60%
    :align: center

Content metadata is rich
========================

=========== ===================================================
Field       Description
=========== ===================================================
title       Post or page title (article headline)
link        Canonical URL for post/page
image_url   URL for associated image
type        ``post, frontpage, sectionpage``
media_type  ``article, slideshow, video``
pub_date    Publication date
section     Section of the site (e.g. Politics)
author      Author who created the post
tags        List of editorially-provided tags
topics      List of machine-generated topics
genres      List of machine-generated genres
=========== ===================================================

Third-party metrics emerging
============================

* **Comments**: Disqus, LiveFyre, WordPress
* **Shares**: Twitter, Google+, LinkedIn, Facebook
* **Pins and Saves**: Pinterest, Delicious
* **Upvotes and Likes**: Reddit, Digg
* **Queues**: Instapaper, Readability

.. image:: ./_static/social_icons.png
    :width: 60%
    :align: center

Time series data
================

.. image:: ./_static/sparklines_multiple.png
    :align: center

.. image:: ./_static/sparklines_stacked.png
    :align: center

Summary data
============

.. rst-class:: spaced

.. image:: ./_static/summary_viz.png
    :align: center

Benchmark data
==============

.. rst-class:: spaced

.. image:: ./_static/benchmarked_viz.png
    :align: center

Information radiators
=====================

.. rst-class:: spaced

.. image:: ./_static/glimpse.png
    :width: 100%
    :align: center

Demo time
=========

.. image:: ./_static/dash.png
    :width: 70%
    :align: center

Parse.ly tech stack
===================

Parse.ly is a **Python** & **JavaScript** shop.

(Some Java used begrudgingly and as necessary.)

.. rst-class:: spaced

.. image:: ./_static/monitors.jpg
    :width: 90%
    :align: center

Data centers (1)
================

A long evolution to reach our current state:

* **Small Beginnings**: 2009, a 1U server I snuck into my friend's cage.
* **Scaling Up**: 2010-2013, from 3 to 80 nodes running in Rackspace Cloud.
* **Bleeding Edge**: 2012-2013, second custom data center with 1 terabyte
  of RAM across 5 machines.

Through all of this, we have been heavy users of Amazon ELB and S3 for data
collection and archiving, and of EMR for Hadoop jobs.

Data centers (2)
================

In early 2014, Amazon launched their i2 instance types:

=========== ======== ======== =========
Instance    RAM      SSD      Cores
=========== ======== ======== =========
i2.8xlarge  244 GB   6.4 TB   32
i2.4xlarge  122 GB   3.2 TB   16
i2.2xlarge  61 GB    1.6 TB   8
=========== ======== ======== =========

* $20/GB of RAM per month on-demand
* 1/2 the price of Rackspace Cloud
* Only 3X the fully-baked price of running your own colo
* Big memory, performant CPU, and fast I/O: all three!
* The golden age of analytics.

Scale
=====

* **8 billion pageviews per month** in Jan 2014
* Typical daily peak of **>3,000 requests per second**
* Nearly **30 terabytes** of raw compressed data

.. rst-class:: spaced

.. image:: ./_static/pv_growth.png
    :width: 90%
    :align: center

Stack Overview
==============

.. rst-class:: spaced

.. image:: ./_static/oss_logos.png
    :width: 90%
    :align: center

Backend Stack, v1 (2010-2011)
=============================

============= =======================================
Tool          Usage
============= =======================================
nginx         data collection
Amazon S3     raw logs for offline analysis
MongoDB       pre-aggregated data
feedparser    RSS/Atom feed parsing
Celery        distributed task queue
============= =======================================

Backend Stack, v2 (2011-2012)
=============================

============= =======================================
Tool          Usage
============= =======================================
Cloud LBs\*   data collection **without SPOF**
node.js\*     **fast, dynamic** JavaScript config
Amazon S3     raw logs for offline analysis
MongoDB\*     **sharded** pre-aggregated data
Redis\*       **real-time** data; past 24h, minutely
Scrapy\*      **maintainable** web crawling
Celery        distributed task queue
ZeroMQ\*      **lightweight** service communication
Hadoop\*      **compute-intensive** offline analysis
Solr\*        **rich** content indexing
============= =======================================

Backend Stack, v3 (2012-2013)
=============================

============= ==========================================
Tool          Usage
============= ==========================================
Cloud LBs     data collection without SPOF
node.js       fast, dynamic JavaScript configuration
Amazon S3     raw logs for offline analysis
MongoDB       sharded, replicated aggregate data
Redis         real-time data; past 24h, minutely
Scrapy        maintainable web crawling
Storm\*       **elastic** distributed task queue
Kafka\*       **fast, reliable** service communication
hll\*         **memory-stable** estimated cardinality
Pig\*         **readable** offline analysis scripts
SolrCloud\*   **scalable** content indexing, trends
============= ==========================================

Frontend Stack, v1 (2010-2012)
==============================

============= ==========================================
Tool          Usage
============= ==========================================
Django        web app framework
jQuery        JavaScript utilities
Protovis      data visualization framework
============= ==========================================

Frontend Stack, v2 (2012-2013)
==============================

============= ==========================================
Tool          Usage
============= ==========================================
Django        web app framework
jQuery        JavaScript utilities
Bootstrap\*   **responsive** JavaScript/CSS layouts
Pandas\*      **in-memory** data manipulation
LESS\*        **modular** CSS styling
d3.js\*       **customizable** dataviz framework
rq\*          **asynchronous** reporting
Tornado\*     **high-performance** REST/JSON API
============= ==========================================

Other important infrastructure
==============================

============= ==========================================
Tool          Usage
============= ==========================================
Graphite      internal service statistics
Munin         system health and heartbeat metrics
Sentry        plant-wide exception catching
logstash      plant-wide logging
Chef          server configuration management
vagrant       local VM-based development
Fabric        scriptable SSH sessions
============= ==========================================

Things we learned
=================

.. rst-class:: build

* Web data is messy.
* Aggregating over time series data is a hard problem.
* You must "become one" with your raw data.
* Server automation (via e.g. Chef) is crucial.
* Batch vs Stream path: an important pattern.
* High-memory cloud servers have changed the game.

Steamroller, meet road
======================

.. rst-class:: build

* Sunk cost thinking is dangerous: full steam ahead.
* What we built one year ago: "how did we do this!?"
* Looking at it today: some project's "Hello, World!"
* Analytics space is loaded with tough problems.
* Some are being solved by open source, others not.
* Focus is **essential**.

2014 areas of interest
======================

* **Text mining**: Wikidata, content clustering.
* **Hourly storage**: MongoDB, schema redesigns.
* **Milestone Alerting**: Statistical approaches.
* **Content optimization**: Solr, Function Queries.
* **Visitor analysis**: Cassandra, wide-row storage.
* **Network trends**: More work with Apache Pig.

Crazy data ideas
================

.. rst-class:: build

* Storm-DRPC for live aggregates?
* Real-time Map/Reduce?
* Clojure and Python in harmony?
* ElasticSearch/Solr for time series data?

Where are we going with this?
=============================

.. rst-class:: spaced

.. image:: ./_static/gel_metrics.png
    :width: 90%
    :align: center

Growth
======

* New Monthly Visitors
* New Linking Domains
* New Shares
* Web-Wide Trends

Engagement
==========

* Avg Time Spent
* Avg Posts per Visit
* Shares Per Post

Loyalty
=======

* Monthly Repeat Visitors
* Monthly Homepage Visitors
* Visits Per Month
* Percent with Multiple Daily Visits

Using GEL for visitor targeting
===============================

============== ======================================
Segment        Target with...
============== ======================================
**Growth**     Ads, e-mail newsletters, follows
**Engagement** Premium ads, sponsored content
**Loyalty**    Subscriptions, ebooks, paid content
============== ======================================

Using GEL for content strategy
==============================

============== ======================================
Maturity       Invest in...
============== ======================================
**Growth**     Short-form, shareable, unique
**Engagement** Typical, emotional, convenient
**Loyalty**    Long-form, insightful, indispensable
============== ======================================

API Engagement Tools
====================

.. rst-class:: spaced

.. image:: ./_static/ars_related_stories.png
    :align: center
    :width: 80%

API Loyalty Tools
=================

.. rst-class:: spaced

.. image:: ./_static/ars_mystories.png
    :align: center

Conclusion
==========

.. rst-class:: spaced

.. image:: ./_static/parsely.png
    :width: 90%
    :align: center

.. rst-class:: build

* We believe **great stories deserve a big audience**.
* We've built an **analytics platform for digital storytellers**.
* "Big Data": it actually applies here.
* "Big Payoff": **help journalism thrive in the digital age**.

Contact Us
==========

Get in touch. We're hiring :)

* http://parse.ly
* http://twitter.com/parsely

And me:

* http://pixelmonkey.org
* http://twitter.com/amontalenti

.. ifnotslides::

    .. raw:: html

.. ifslides::

    .. raw:: html