========================
Analytics for Journalism
========================

Andrew Montalenti, CTO

.. rst-class:: logo

.. image:: ./_static/parsely.png
    :width: 40%
    :align: right

What is Parse.ly?
=================

Analytics provider for large-scale content sites.

.. image:: ./_static/banner_01.png
    :align: center

.. image:: ./_static/banner_02.png
    :align: center

.. image:: ./_static/banner_03.png
    :align: center

.. image:: ./_static/banner_04.png
    :align: center

Online content ecosystem
========================

Marginal cost of web content: **zero**.

.. rst-class:: build

* | **Print-era content**: Monopolist one-way megaphones.
  | Newspapers, TV, radio.
* | **Web-era content**: Distributed n-way channels.
  | Media websites, blogs, social networks.
* | **Content deluge**: Commoditized the writer.
  | Enthroned the editor.
* | **Editors vs Algorithms**: Google News and Reddit.
  | Prismatic and Flipboard.

What is online journalism?
==========================

Being redefined, but not without trouble.

.. rst-class:: build

* | **Legacy definition**: "Content by professionals."
  | Is BuzzFeed journalism?
* | **Romantic definition**: "Content in the public interest."
  | Is Wikipedia journalism?
* | **Market definition**: "Content worth my money."
  | Is Hulu journalism?

My definition
=============

A blend of these definitions:

* Content that informs, inspires, or outrages me.
* Content worth my time and money.
* Content worth my shares and links.

.. image:: ./_static/assange.png
    :width: 40%
    :align: center

What makes journalism "good"?
=============================

"Good journalism" is content whose wide dissemination would benefit society.

Snowden / Greenwald / Guardian is a recent example.

A whole lot of "bad journalism" subsidizes the good stuff.

.. rst-class:: spaced

.. image:: ./_static/old_news.png
    :width: 50%
    :align: center

Onward to the tech
==================

.. rst-class:: spaced

.. image:: /_static/tech_stack.png
    :width: 90%
    :align: center

Why does journalism need analytics?
===================================

Websites have a variety of interesting "first-party" metrics:

* pageviews
* unique visitors
* sessions and paths
* time spent
* page engagement (scroll, copy/paste)
* referrers
* search keywords

E-commerce and ads drove the web analytics industry.

Is online journalism special?
=============================

Yes.

* **Short Shelf Life**: average content shelf life <48 hours
* **High Frequency Publishing**: 1000s of posts per day
* **Unclear Conversion Goals**: nothing to buy

.. image:: ./_static/pulse.png
    :width: 60%
    :align: center

Content metadata is rich
========================

=========== ===================================================
Field       Description
=========== ===================================================
title       Post or page title (article headline)
link        Canonical URL for post/page
image_url   URL for associated image
type        ``post, frontpage, sectionpage``
media_type  ``article, slideshow, video``
pub_date    Publication date
section     Section of the site (e.g. Politics)
author      Author who created the post
tags        List of editorially-provided tags
topics      List of machine-generated topics
genres      List of machine-generated genres
=========== ===================================================

Third-party metrics emerging
============================

* **Comments**: Disqus, LiveFyre, WordPress
* **Shares**: Twitter, Google+, LinkedIn, Facebook
* **Pins and Saves**: Pinterest, Delicious
* **Upvotes and Likes**: Reddit, Digg
* **Queues**: Instapaper, Readability

.. image:: ./_static/social_icons.png
    :width: 60%
    :align: center

Time series data
================

.. image:: ./_static/sparklines_multiple.png
    :align: center

.. image:: ./_static/sparklines_stacked.png
    :align: center

Summary data
============

.. rst-class:: spaced

.. image:: ./_static/summary_viz.png
    :align: center

Benchmark data
==============

.. rst-class:: spaced

.. image:: ./_static/benchmarked_viz.png
    :align: center

Information radiators
=====================

.. rst-class:: spaced

.. image:: ./_static/glimpse.png
    :width: 100%
    :align: center

Demo time
=========

.. image:: ./_static/dash.png
    :width: 70%
    :align: center

Parse.ly tech stack
===================

Parse.ly is a **Python** & **JavaScript** shop.

(Some Java used begrudgingly and as necessary.)

.. rst-class:: spaced

.. image:: ./_static/monitors.jpg
    :width: 90%
    :align: center

Data centers
============

Servers running across:

* **Amazon Web Services**: data collection and archiving.
* **Rackspace Cloud**: data aggregation, web crawling, APIs.
* **Himem Colo**: live analysis, dashboard worker nodes.

Over **60 production** nodes with approximately **1 terabyte** of hot
production RAM.

Scale
=====

* **5 billion pageviews per month** in May 2013
* Typical daily peak of **>2,500 requests per second**
* Nearly **20 terabytes** of raw compressed data

.. rst-class:: spaced

.. image:: ./_static/pv_growth.png
    :width: 90%
    :align: center

Stack Overview
==============

.. rst-class:: spaced

.. image:: ./_static/oss_logos.png
    :width: 90%
    :align: center

Backend Stack, v1 (2010-2011)
=============================

============= =======================================
Tool          Usage
============= =======================================
nginx         data collection
Amazon S3     raw logs for offline analysis
MongoDB       pre-aggregated data
feedparser    RSS/Atom feed parsing
Celery        distributed task queue
============= =======================================

Backend Stack, v2 (2011-2012)
=============================

============= =======================================
Tool          Usage
============= =======================================
Cloud LBs\*   data collection **without SPOF**
node.js\*     **fast, dynamic** JavaScript config
Amazon S3     raw logs for offline analysis
MongoDB\*     **sharded** pre-aggregated data
Redis\*       **real-time** data; past 24h, minutely
Scrapy\*      **maintainable** web crawling
Celery        distributed task queue
ZeroMQ\*      **lightweight** service communication
Hadoop\*      **compute-intensive** offline analysis
Solr\*        **rich** content indexing
============= =======================================

Backend Stack, v3 (2012-2013)
=============================

============= ==========================================
Tool          Usage
============= ==========================================
Cloud LBs     data collection without SPOF
node.js       fast, dynamic JavaScript configuration
Amazon S3     raw logs for offline analysis
MongoDB       sharded, replicated aggregate data
Redis         real-time data; past 24h, minutely
Scrapy        maintainable web crawling
Storm\*       **elastic** distributed task queue
Kafka\*       **fast, reliable** service communication
hll\*         **memory-stable** estimated cardinality
Pig\*         **readable** offline analysis scripts
SolrCloud\*   **scalable** content indexing, trends
============= ==========================================

Frontend Stack, v1 (2010-2012)
==============================

============= ==========================================
Tool          Usage
============= ==========================================
Django        web app framework
jQuery        JavaScript utilities
Protovis      data visualization framework
============= ==========================================

Frontend Stack, v2 (2012-2013)
==============================

============= ==========================================
Tool          Usage
============= ==========================================
Django        web app framework
jQuery        JavaScript utilities
Bootstrap\*   **responsive** JavaScript/CSS layouts
Pandas\*      **in-memory** data manipulation
LESS\*        **modular** CSS styling
d3.js\*       **customizable** dataviz framework
rq\*          **asynchronous** reporting
Tornado\*     **high-performance** REST/JSON API
============= ==========================================

Other important infrastructure
==============================

============= ==========================================
Tool          Usage
============= ==========================================
Graphite      internal service statistics
Munin         system health and heartbeat metrics
Sentry        plant-wide exception catching
logstash      plant-wide logging
Chef          server configuration management
vagrant       local VM-based development
Fabric        scriptable SSH sessions
============= ==========================================

Things we learned
=================

.. rst-class:: build

* Web data is messy.
* Aggregating over time series data is a hard problem.
* Must "become one" with your raw data.
* Server automation (e.g. via Chef) is crucial.
* Batch vs Stream path: an important pattern.
* High-memory cloud servers will change the game.

2013 areas of interest
======================

* **Text mining**: Wikidata, content clustering.
* **More social data**: gevent-based API integrations.
* **Hourly storage**: MongoDB, schema redesigns.
* **Content optimization**: Solr, Function Queries.
* **Visitor analysis**: Cassandra, wide-row storage.
* **Network trends**: more work with Pig.

Crazy data ideas
================

.. rst-class:: build

* Solr for time series data?
* Storm-DRPC for live aggregates?
* Cassandra > MongoDB + Redis?
* Real-time Map/Reduce?

Where are we going with this?
=============================

.. rst-class:: spaced

.. image:: ./_static/gel_metrics.png
    :width: 90%
    :align: center

Growth
======

* New Monthly Visitors
* New Linking Domains
* New Shares

Engagement
==========

* Avg Time Spent
* Avg Posts per Visit
* Comments Per Post
* Shares Per Post

Loyalty
=======

* Monthly Repeat Visitors
* Monthly Homepage Visitors
* Visits Per Month
* Percent with Multiple Daily Visits

Using GEL for visitor targeting
===============================

============== ======================================
Segment        Target with...
============== ======================================
**Growth**     Ads, e-mail newsletters, follows
**Engagement** Premium ads, sponsored content
**Loyalty**    Subscriptions, ebooks, paid content
============== ======================================

Using GEL for content strategy
==============================

============== ======================================
Maturity       Invest in...
============== ======================================
**Growth**     Short-form, shareable, unique
**Engagement** Medium-form, emotional, convenient
**Loyalty**    Long-form, insightful, indispensable
============== ======================================

API Engagement Tools
====================

.. rst-class:: spaced

.. image:: ./_static/ars_related_stories.png
    :align: center
    :width: 80%

API Loyalty Tools
=================

.. rst-class:: spaced

.. image:: ./_static/ars_mystories.png
    :align: center

Conclusion
==========

.. rst-class:: spaced

.. image:: ./_static/parsely.png
    :width: 90%
    :align: center

.. rst-class:: build

* Parse.ly aims to become the **definitive analytics system** for online
  journalism and content.
* "Big Data": it actually applies here.
* "Big Payoff": **help journalism thrive in the digital age**.

Contact Us
==========

Get in touch. We're hiring :)

* http://parse.ly
* http://twitter.com/parsely

And me:

* http://pixelmonkey.org
* http://twitter.com/amontalenti

.. ifnotslides::

    .. raw:: html

.. ifslides::

    .. raw:: html
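
Appendix: computing GEL metrics
===============================

The Engagement and Loyalty slides list metrics such as Avg Time Spent, Avg
Posts per Visit, Monthly Repeat Visitors, Visits Per Month, and Percent with
Multiple Daily Visits. As a rough illustration only, the sketch below computes
them in plain Python from one month of visit records. The ``Visit`` record
shape, the ``gel_summary`` name, and the exact metric definitions (for example,
"Visits Per Month" taken as visits per visitor) are assumptions made for this
example, not Parse.ly's actual schema or formulas.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Visit:
    """One visit by one visitor. Hypothetical record shape for this sketch."""
    visitor_id: str
    day: date
    posts_viewed: int
    seconds_spent: float


def gel_summary(visits):
    """Compute a few Engagement and Loyalty metrics for one month of visits.

    The metric definitions here are assumptions chosen for illustration.
    """
    if not visits:
        raise ValueError("need at least one visit")

    # Group visits by visitor so per-visitor loyalty metrics can be derived.
    visits_by_visitor = defaultdict(list)
    for v in visits:
        visits_by_visitor[v.visitor_id].append(v)

    n_visits = len(visits)
    n_visitors = len(visits_by_visitor)

    # Loyalty: visitors with more than one visit in the month.
    repeat_visitors = sum(1 for vs in visits_by_visitor.values() if len(vs) > 1)

    # Loyalty: visitors with two or more visits on the same calendar day.
    multi_daily = 0
    for vs in visits_by_visitor.values():
        per_day = defaultdict(int)
        for v in vs:
            per_day[v.day] += 1
        if any(count > 1 for count in per_day.values()):
            multi_daily += 1

    return {
        "avg_time_spent": sum(v.seconds_spent for v in visits) / n_visits,
        "avg_posts_per_visit": sum(v.posts_viewed for v in visits) / n_visits,
        "monthly_repeat_visitors": repeat_visitors,
        "visits_per_month": n_visits / n_visitors,
        "pct_multiple_daily_visits": 100.0 * multi_daily / n_visitors,
    }


if __name__ == "__main__":
    sample = [
        Visit("a", date(2013, 5, 1), 3, 120.0),
        Visit("a", date(2013, 5, 1), 1, 60.0),
        Visit("a", date(2013, 5, 9), 2, 90.0),
        Visit("b", date(2013, 5, 2), 1, 30.0),
    ]
    print(gel_summary(sample))
```

At the scale described earlier in the deck, this in-memory pass would be replaced by pre-aggregated or streaming counts; the sketch is only meant to make the metric definitions concrete.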