========================
Analytics for Journalism
========================

Andrew Montalenti, CTO

.. rst-class:: logo

    .. image:: ./_static/parsely.png
        :width: 40%
        :align: right

What is Parse.ly?
=================

Analytics for digital storytellers.

    .. image:: ./_static/banner_01.png
        :align: center
    .. image:: ./_static/banner_02.png
        :align: center
    .. image:: ./_static/banner_03.png
        :align: center
    .. image:: ./_static/banner_04.png
        :align: center

Online content ecosystem
========================

Marginal cost of web content: **zero**.

.. rst-class:: build

* | **Print-era content**: Monopolist one-way megaphones.
  | Newspapers, TV, radio.
* | **Web-era content**: Distributed n-way channels. 
  | Media websites, blogs, social networks.
* | **Content deluge**: Commoditized the writer.
  | Enthroned the editor.
* | **Editors vs Algorithms**: Google News and Reddit.
  | Prismatic and Flipboard.

What is online journalism?
==========================

Being redefined, but not without trouble.

.. rst-class:: build

* | **Legacy definition**: "Content by professionals." 
  | Is BuzzFeed journalism?
* | **Romantic definition**: "Content in the public interest."
  | Is Wikipedia journalism?
* | **Market definition**: "Content worth my money." 
  | Is Hulu journalism?

My definition
=============

A blend of these definitions:

* Content that informs, inspires, or outrages me.
* Content worth my time and money.
* Content worth my shares and links.

.. image:: ./_static/assange.png
    :width: 40%
    :align: center

What makes journalism "good"?
=============================

"Good journalism" is content that society benefits from seeing widely disseminated.

Snowden / Greenwald / Guardian is a recent example.

A whole lot of "bad journalism" subsidizes the good stuff.

.. rst-class:: spaced

    .. image:: ./_static/old_news.png
        :width: 50%
        :align: center

Onward to the tech
==================

.. rst-class:: spaced

    .. image:: ./_static/tech_stack.png
        :width: 90%
        :align: center

Why does journalism need analytics?
===================================

Websites have a variety of interesting "first-party" metrics:

* pageviews
* unique visitors
* sessions and paths
* time spent
* page engagement (scroll, copy/paste)
* referrers
* search keywords

E-commerce and ads drove the web analytics industry.
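
A minimal sketch of how a few of the metrics above fall out of raw events. The event shape ``(visitor_id, url, referrer)`` and the sample values are assumptions for illustration, not Parse.ly's actual schema.

.. code-block:: python

    from collections import Counter

    # Hypothetical raw events: (visitor_id, url, referrer).
    events = [
        ("v1", "/politics/a", "google.com"),
        ("v2", "/politics/a", "twitter.com"),
        ("v1", "/sports/b", "google.com"),
        ("v3", "/politics/a", None),  # direct visit, no referrer
    ]

    # Pageviews: one count per event, grouped by URL.
    pageviews = Counter(url for _, url, _ in events)

    # Unique visitors: distinct visitor ids seen on each URL.
    unique_visitors = {
        url: len({visitor for visitor, u, _ in events if u == url})
        for url in pageviews
    }

    # Referrers: where the traffic came from, ignoring direct visits.
    referrers = Counter(ref for _, _, ref in events if ref is not None)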

Is online journalism special?
=============================

Yes.

* **Short Shelf Life**: average content shelf-life <48 hours
* **High Frequency Publishing**: 1000s of posts per day
* **Unclear Conversion Goals**: nothing to buy
 
.. image:: ./_static/pulse.png
    :width: 60%
    :align: center

Content metadata is rich
========================

    =========== ===================================================
    Field        Description
    =========== ===================================================
    title        Post or page title (article headline)
    link         Canonical URL for post/page
    image_url    URL for associated image
    type         ``post, frontpage, sectionpage``
    media_type   ``article, slideshow, video``
    pub_date     Publication date
    section      Section of the site (e.g. Politics)
    author       Author who created the post
    tags         List of editorially-provided tags
    topics       List of machine-generated topics
    genres       List of machine-generated genres
    =========== ===================================================
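
One way to picture a single record with the fields above is a small Python dataclass. The field names come from the table; the types, defaults, and sample values are assumptions.

.. code-block:: python

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List


    @dataclass
    class ContentMetadata:
        title: str
        link: str
        image_url: str
        type: str          # "post", "frontpage", or "sectionpage"
        media_type: str    # "article", "slideshow", or "video"
        pub_date: datetime
        section: str
        author: str
        tags: List[str] = field(default_factory=list)    # editorial
        topics: List[str] = field(default_factory=list)  # machine-generated
        genres: List[str] = field(default_factory=list)  # machine-generated


    # Hypothetical example record.
    post = ContentMetadata(
        title="Example headline",
        link="http://example.com/politics/example-headline",
        image_url="http://example.com/img/example.png",
        type="post",
        media_type="article",
        pub_date=datetime(2014, 1, 15, 9, 30),
        section="Politics",
        author="Jane Doe",
        tags=["election"],
    )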

Third-party metrics emerging
============================

* **Comments**: Disqus, Livefyre, WordPress
* **Shares**: Twitter, Google+, LinkedIn, Facebook
* **Pins and Saves**: Pinterest, Delicious
* **Upvotes and Likes**: Reddit, Digg
* **Queues**: Instapaper, Readability

.. image:: ./_static/social_icons.png
    :width: 60%
    :align: center

Time series data
================

.. image:: ./_static/sparklines_multiple.png
    :align: center

.. image:: ./_static/sparklines_stacked.png
    :align: center

Summary data
============

.. rst-class:: spaced

    .. image:: ./_static/summary_viz.png
        :align: center

Benchmark data
==============

.. rst-class:: spaced

    .. image:: ./_static/benchmarked_viz.png
        :align: center

Information radiators
=====================

.. rst-class:: spaced

    .. image:: ./_static/glimpse.png
        :width: 100%
        :align: center

Demo time
=========

.. image:: ./_static/dash.png
    :width: 70%
    :align: center

Parse.ly tech stack
===================

Parse.ly is a **Python** & **JavaScript** shop.

(Some Java used begrudgingly and as necessary.)

.. rst-class:: spaced

    .. image:: ./_static/monitors.jpg
        :width: 90%
        :align: center

Data centers (1)
================

Long evolution to get to our current state:

* **Small Beginnings**: 2009, a 1U server I snuck into my friend's cage.
* **Scaling Up**: 2010-2013, from 3 to 80 nodes running in Rackspace Cloud.
* **Bleeding Edge**: 2012-2013, second custom data center with 1 terabyte of RAM across 5 machines.

Through all of this, we were heavy users of Amazon ELB and S3 for data
collection and archiving, and of EMR for Hadoop jobs.

Data centers (2)
================

In early 2014, Amazon launched their i2 instance types:

    =========== ======== ======== =========
    Instance    RAM      SSD      Cores
    =========== ======== ======== =========
    i2.8xlarge  244 GB   6.4 TB   32
    i2.4xlarge  122 GB   3.2 TB   16
    i2.2xlarge  61 GB    1.6 TB   8
    =========== ======== ======== =========

* $20/GB of RAM per month on-demand
* 1/2 the price of Rackspace Cloud
* Only 3X the fully-baked price of running your own colo
* Big memory, performant CPU, and fast I/O: all three!
* The golden age of analytics.
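
A back-of-envelope sketch of what the $20/GB figure implies per instance. It uses only the RAM sizes from the table and the quoted rate; the 30-day month is an assumption, and the results are implied values, not quoted AWS prices.

.. code-block:: python

    PRICE_PER_GB_MONTH = 20      # USD, from the bullet above
    HOURS_PER_MONTH = 24 * 30    # assumed 30-day month

    # RAM in GB per instance type, from the table above.
    ram_gb = {"i2.8xlarge": 244, "i2.4xlarge": 122, "i2.2xlarge": 61}

    # Implied monthly and hourly cost per instance at $20/GB/month.
    implied_monthly = {name: gb * PRICE_PER_GB_MONTH for name, gb in ram_gb.items()}
    implied_hourly = {name: cost / HOURS_PER_MONTH for name, cost in implied_monthly.items()}

    for name in ram_gb:
        print(f"{name}: ${implied_monthly[name]}/month, ${implied_hourly[name]:.2f}/hour")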

Scale
=====

* **8 billion pageviews per month** in Jan 2014
* Typical **>3,000 requests per second** daily peak
* Nearly **30 terabytes** of raw compressed data

.. rst-class:: spaced

    .. image:: ./_static/pv_growth.png
        :width: 90%
        :align: center
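
A quick sanity check on the headline numbers, assuming one collected request per pageview (the slides don't say). The monthly average alone works out to roughly 3,000 requests per second, so the quoted daily peak reads as a floor.

.. code-block:: python

    PAGEVIEWS_PER_MONTH = 8_000_000_000  # Jan 2014 figure from the slide
    DAYS_IN_MONTH = 31                   # January
    SECONDS_PER_DAY = 24 * 60 * 60

    # Average request rate if traffic were spread evenly over the month.
    avg_requests_per_second = PAGEVIEWS_PER_MONTH / (DAYS_IN_MONTH * SECONDS_PER_DAY)

    print(f"average: {avg_requests_per_second:.0f} requests/second")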

Stack Overview
==============

.. rst-class:: spaced

    .. image:: ./_static/oss_logos.png
        :width: 90%
        :align: center


Backend Stack, v1 (2010-2011)
=============================

    ============= =======================================
    Tool          Usage
    ============= =======================================
    nginx         data collection
    Amazon S3     raw logs for offline analysis
    MongoDB       pre-aggregated data
    feedparser    RSS/Atom feed parsing
    Celery        distributed task queue
    ============= =======================================

Backend Stack, v2 (2011-2012)
=============================

    ============= =======================================
    Tool          Usage
    ============= =======================================
    Cloud LBs\*   data collection **without SPOF**
    node.js\*     **fast, dynamic** JavaScript config
    Amazon S3     raw logs for offline analysis
    MongoDB\*     **sharded** pre-aggregated data 
    Redis\*       **real-time** data; past 24h, minutely
    Scrapy\*      **maintainable** web crawling
    Celery        distributed task queue
    ZeroMQ\*      **lightweight** service communication
    Hadoop\*      **compute-intensive** offline analysis
    Solr\*        **rich** content indexing
    ============= =======================================
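
A sketch of the "past 24h, minutely" idea from the Redis row: per-minute buckets with old minutes expired. This is a pure-Python, in-memory stand-in, not Redis code and not Parse.ly's implementation; the class and method names are hypothetical.

.. code-block:: python

    WINDOW_MINUTES = 24 * 60  # keep the past 24 hours


    class MinutelyCounter:
        """Rolling per-minute counter (in-memory stand-in for a real store)."""

        def __init__(self, window=WINDOW_MINUTES):
            self.window = window
            self.buckets = {}  # minute index -> count

        def record(self, epoch_seconds, count=1):
            minute = int(epoch_seconds // 60)
            self.buckets[minute] = self.buckets.get(minute, 0) + count
            self._expire(minute)

        def _expire(self, current_minute):
            # Drop buckets that have fallen out of the rolling window.
            cutoff = current_minute - self.window
            for minute in [m for m in self.buckets if m <= cutoff]:
                del self.buckets[minute]

        def total(self):
            return sum(self.buckets.values())

With a real Redis deployment, key expiry could do the work of ``_expire``; the bucketing logic stays the same.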

Backend Stack, v3 (2012-2013)
=============================

    ============= ==========================================
    Tool          Usage
    ============= ==========================================
    Cloud LBs     data collection without SPOF
    node.js       fast, dynamic JavaScript configuration
    Amazon S3     raw logs for offline analysis
    MongoDB       sharded, replicated aggregate data 
    Redis         real-time data; past 24h, minutely
    Scrapy        maintainable web crawling
    Storm\*       **elastic** distributed task queue
    Kafka\*       **fast, reliable** service communication
    hll\*         **memory-stable** estimated cardinality
    Pig\*         **readable** offline analysis scripts
    SolrCloud\*   **scalable** content indexing, trends 
    ============= ==========================================
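
To show what "memory-stable estimated cardinality" means, here is a minimal HyperLogLog sketch. It is a textbook-style illustration, not the ``hll`` library named above; the class name, the SHA-1 hash, and the precision ``p=12`` are all assumptions.

.. code-block:: python

    import hashlib
    import math


    class HyperLogLog:
        """Estimates distinct counts using a fixed number of registers."""

        def __init__(self, p=12):
            self.p = p
            self.m = 1 << p                  # number of registers (4096)
            self.registers = [0] * self.m
            self.alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128

        def add(self, item):
            digest = hashlib.sha1(item.encode("utf-8")).digest()
            h = int.from_bytes(digest[:8], "big")        # 64-bit hash
            idx = h >> (64 - self.p)                     # first p bits pick a register
            rest = h & ((1 << (64 - self.p)) - 1)        # remaining bits
            # Rank = position of the leftmost 1-bit in the remaining bits.
            rank = (64 - self.p) - rest.bit_length() + 1
            self.registers[idx] = max(self.registers[idx], rank)

        def count(self):
            raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
            zeros = self.registers.count(0)
            if raw <= 2.5 * self.m and zeros:
                # Small-range correction: fall back to linear counting.
                return self.m * math.log(self.m / zeros)
            return raw

Memory stays at 4096 small registers no matter how many visitors are added, which is the appeal for unique-visitor counts.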

Frontend Stack, v1 (2010-2012)
==============================

    ============= ==========================================
    Tool          Usage
    ============= ==========================================
    Django        web app framework
    jQuery        JavaScript utilities
    Protovis      data visualization framework
    ============= ==========================================

Frontend Stack, v2 (2012-2013)
==============================

    ============= ==========================================
    Tool          Usage
    ============= ==========================================
    Django        web app framework
    jQuery        JavaScript utilities
    Bootstrap\*   **responsive** JavaScript/CSS layouts
    Pandas\*      **in-memory** data manipulation
    LESS\*        **modular** CSS styling
    d3.js\*       **customizable** dataviz framework
    rq\*          **asynchronous** reporting
    Tornado\*     **high-performance**, REST/JSON API
    ============= ==========================================

Other important infrastructure
==============================

    ============= ==========================================
    Tool          Usage
    ============= ==========================================
    Graphite      internal service statistics
    Munin         system health and heartbeat metrics
    Sentry        plant-wide exception catching
    logstash      plant-wide logging
    Chef          server configuration management
    Vagrant       local VM-based development
    Fabric        scriptable SSH sessions
    ============= ==========================================

Things we learned
=================

.. rst-class:: build

* Web data is messy.
* Aggregating over time series data is a hard problem.
* Must "become one" with your raw data.
* Server automation (via e.g. Chef) is crucial.
* Batch vs Stream path: an important pattern.
* High-memory cloud servers have changed the game.
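
One common reading of the batch vs stream pattern, sketched under assumptions: a batch path recomputes counts from raw logs up to a cutoff, a stream path covers events since the cutoff, and queries add the two. The event shape, cutoff, and function names are hypothetical.

.. code-block:: python

    from collections import Counter

    # Hypothetical raw events as (unix_timestamp, url) pairs.
    raw_events = [(100, "/a"), (160, "/b"), (220, "/a"), (290, "/a")]
    BATCH_CUTOFF = 200  # the batch job has processed everything before this


    def batch_counts(events, cutoff):
        """Slow, complete path: recomputed from archived raw data."""
        return Counter(url for ts, url in events if ts < cutoff)


    def stream_counts(events, cutoff):
        """Fast, recent path: only events the batch job has not seen yet."""
        return Counter(url for ts, url in events if ts >= cutoff)


    def combined_counts(events, cutoff):
        """Serve queries by merging both views."""
        return batch_counts(events, cutoff) + stream_counts(events, cutoff)


    totals = combined_counts(raw_events, BATCH_CUTOFF)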

Steamroller, meet road
======================

.. rst-class:: build

* Sunk cost thinking is dangerous: full steam ahead.
* What we built one year ago: "how did we do this!?"
* Looking at it today: some project's "Hello, World!"
* Analytics space is loaded with tough problems.
* Some are being solved by open source, others not.
* Focus is **essential**.

2014 areas of interest
======================

* **Text mining**: Wikidata, content clustering.
* **Hourly storage**: MongoDB, schema redesigns.
* **Milestone Alerting**: Statistical approaches.
* **Content optimization**: Solr, Function Queries.
* **Visitor analysis**: Cassandra, wide-row storage.
* **Network trends**: More work with Apache Pig.
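
As one example of a statistical approach to milestone alerting, a z-score check flags a post whose current traffic sits far above its recent history. The slides don't say which method is planned; the function name, threshold, and sample numbers are assumptions.

.. code-block:: python

    from statistics import mean, stdev


    def is_milestone(history, current, threshold=3.0):
        """Return True if `current` is at least `threshold` standard
        deviations above the mean of `history` (needs >= 2 data points)."""
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any increase counts as unusual.
            return current > mu
        return (current - mu) / sigma >= threshold


    # Hypothetical hourly pageviews for one post.
    hourly_pageviews = [100, 110, 90, 105, 95]
    spike = is_milestone(hourly_pageviews, 130)
    normal = is_milestone(hourly_pageviews, 110)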

Crazy data ideas
================

.. rst-class:: build

* Storm-DRPC for live aggregates?
* Real-time Map/Reduce?
* Clojure and Python in harmony?
* ElasticSearch/Solr for time series data?

Where are we going with this?
=============================

.. rst-class:: spaced

    .. image:: ./_static/gel_metrics.png
        :width: 90%
        :align: center

Growth
======

* New Monthly Visitors
* New Linking Domains
* New Shares
* Web-Wide Trends

Engagement
==========

* Avg Time Spent
* Avg Posts per Visit
* Shares Per Post
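
A small sketch of how these three engagement numbers might be computed. The visit and share records are made up, and the exact metric definitions are assumptions, since the slide only names them.

.. code-block:: python

    # Hypothetical per-visit records.
    visits = [
        {"visitor": "v1", "seconds": 120, "posts": 3},
        {"visitor": "v2", "seconds": 30, "posts": 1},
        {"visitor": "v1", "seconds": 90, "posts": 2},
    ]
    # Hypothetical share counts per post URL.
    shares_by_post = {"/a": 10, "/b": 2, "/c": 0}

    avg_time_spent = sum(v["seconds"] for v in visits) / len(visits)
    avg_posts_per_visit = sum(v["posts"] for v in visits) / len(visits)
    shares_per_post = sum(shares_by_post.values()) / len(shares_by_post)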

Loyalty
=======

* Monthly Repeat Visitors
* Monthly Homepage Visitors
* Visits Per Month
* Percent with Multiple Daily Visits
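
The loyalty numbers can be sketched the same way from a one-month visit log. The log shape and the definitions ("repeat" means more than one visit in the month, "multiple daily" means two or more visits on the same day) are assumptions; homepage visitors are left out because they need URL data.

.. code-block:: python

    from collections import Counter
    from datetime import date

    # Hypothetical visit log for one month: (visitor_id, day of visit).
    visit_log = [
        ("v1", date(2014, 1, 3)),
        ("v1", date(2014, 1, 3)),
        ("v1", date(2014, 1, 9)),
        ("v2", date(2014, 1, 5)),
        ("v3", date(2014, 1, 7)),
        ("v3", date(2014, 1, 8)),
    ]

    visits_per_visitor = Counter(visitor for visitor, _ in visit_log)

    # Monthly repeat visitors: came back at least once.
    repeat_visitors = sum(1 for n in visits_per_visitor.values() if n > 1)

    # Visits per month, averaged over all visitors.
    visits_per_month = len(visit_log) / len(visits_per_visitor)

    # Percent of visitors with two or more visits on a single day.
    visits_per_visitor_day = Counter(visit_log)
    multi_daily = {visitor for (visitor, _), n in visits_per_visitor_day.items() if n > 1}
    pct_multi_daily = 100 * len(multi_daily) / len(visits_per_visitor)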

Using GEL for visitor targeting
===============================

   ============== ======================================
   Segment        Target with...
   ============== ======================================
   **Growth**     Ads, e-mail newsletters, follows
   **Engagement** Premium ads, sponsored content
   **Loyalty**    Subscriptions, ebooks, paid content
   ============== ======================================

Using GEL for content strategy 
==============================

   ============== ======================================
   Maturity       Invest in...
   ============== ======================================
   **Growth**     Short-form, shareable, unique
   **Engagement** Typical, emotional, convenient 
   **Loyalty**    Long-form, insightful, indispensable
   ============== ======================================

API Engagement Tools
====================

.. rst-class:: spaced

    .. image:: ./_static/ars_related_stories.png
        :align: center
        :width: 80%

API Loyalty Tools
=================

.. rst-class:: spaced

    .. image:: ./_static/ars_mystories.png
        :align: center

Conclusion
==========

.. rst-class:: spaced

    .. image:: ./_static/parsely.png
        :width: 90%
        :align: center

.. rst-class:: build

* We believe **great stories deserve a big audience**.
* We've built an **analytics platform for digital storytellers**.
* "Big Data": it actually applies here.
* "Big Payoff": **help journalism thrive in the digital age**.

Contact Us
==========

Get in touch. We're hiring :)

* http://parse.ly
* http://twitter.com/parsely

And me:

* http://pixelmonkey.org
* http://twitter.com/amontalenti

.. ifnotslides::

    .. raw:: html

        <script>
        $(function() {
            $("body").css("width", "1080px");
            $(".sphinxsidebar").css({"width": "200px", "font-size": "12px"});
            $(".bodywrapper").css("margin", "auto");
            $(".documentwrapper").css("width", "880px");
            $(".logo").removeClass("align-right");
        });
        </script>

.. ifslides::

    .. raw:: html

        <script>
        $("tr").each(function() { 
            $(this).find("td:first").css("background-color", "#eee"); 
        });
        </script>