On multi-form data

I read an excellent debrief on a startup’s experience with MongoDB, called “A Year with MongoDB”.

It was excellent because of its level of detail. Some of its points are important, particularly the global write lock and uncompressed field names, both issues that needlessly afflict large MongoDB clusters and will likely be fixed eventually.

However, it’s also pretty clear from this post that they were not using MongoDB in the best way. For example, in a small part of their criticism of “safe off by default”, they write:

“We lost a sizable amount of data at Kiip for some time before realizing what was happening and using safe saves where they made sense (user accounts, billing, etc.).”

You shouldn’t be storing user accounts and billing information in MongoDB. Perhaps MongoDB’s marketing made you believe you should store everything in MongoDB, but you should know better.

In addition to that data being highly relational, it also requires the transactional semantics present in mature relational databases. When I read “user accounts, billing” here, I cringed.
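To make “transactional semantics” concrete, here is a minimal sketch of an all-or-nothing billing write. It uses Python’s built-in sqlite3 module as a stand-in for a mature relational database such as Postgres. The table names, columns, and the `make_billing_db` and `charge` helpers are all hypothetical and are not taken from Kiip’s system.

```python
import sqlite3


def make_billing_db():
    """Create an in-memory database with hypothetical accounts/charges tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0))"
    )
    conn.execute(
        "CREATE TABLE charges ("
        " id INTEGER PRIMARY KEY,"
        " account_id INTEGER NOT NULL REFERENCES accounts(id),"
        " amount_cents INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO accounts (id, balance_cents) VALUES (1, 500)")
    conn.commit()
    return conn


def charge(conn, account_id, amount_cents):
    """Record a charge and debit the balance as one atomic unit."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO charges (account_id, amount_cents) VALUES (?, ?)",
                (account_id, amount_cents),
            )
            # An overdraft violates the CHECK constraint; the rollback then
            # undoes the INSERT above as well.
            conn.execute(
                "UPDATE accounts SET balance_cents = balance_cents - ? WHERE id = ?",
                (amount_cents, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Charging 300 cents twice against the 500-cent balance succeeds once and then fails, and the failed attempt leaves no stray row in `charges`. That “both writes or neither” guarantee is the property billing data needs.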

Things that it makes total sense to use MongoDB for:

Analytics Systems: where server write throughput, client-side async (unsafe) upserts/inserts, and the atomic $inc operator become very valuable tools. See this post for one example.

Content Management Systems: Here, schema-free design, avoidance of joins, its query language, and support for arbitrary metadata become an excellent set of tradeoffs vs. tabular storage in an RDBMS. MongoDB’s website has some nice examples of uses in media.

Document Management Systems: MongoDB can be used to great success as the canonical store of documents which are then indexed in a full-text search engine like Solr or Elasticsearch. You can do this kind of storage in an RDBMS, but MongoDB has less administrative overhead, a simpler development workflow, and less impedance mismatch with document-based stores like Solr and Elasticsearch. Further, with GridFS, you can even use MongoDB as a store for actual files, and leverage MongoDB’s replica sets for spreading those files across machines.
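To illustrate the analytics case, here is a rough sketch of why upserts combined with an atomic increment are so convenient for counters. This is a pure-Python, in-memory model of the semantics, not MongoDB’s API: the `CounterCollection` class, its methods, and the `record_pageview` helper are hypothetical names invented for this example. In a real deployment the same effect would come from an update with upsert enabled and a `$inc` modifier, which the server applies atomically.

```python
from datetime import datetime


class CounterCollection:
    """Toy in-memory stand-in for a collection of counter documents."""

    def __init__(self):
        self._docs = {}

    def upsert_inc(self, key, increments):
        """Create the document for `key` if missing, then add each increment."""
        doc = self._docs.setdefault(key, {})
        for field, amount in increments.items():
            doc[field] = doc.get(field, 0) + amount

    def find_one(self, key):
        """Return a copy of the document for `key`, or None if absent."""
        doc = self._docs.get(key)
        return dict(doc) if doc is not None else None


def record_pageview(counters, url, timestamp, nbytes):
    """Bump the hourly counters for one URL; no read-before-write is needed."""
    hour_bucket = timestamp.strftime("%Y-%m-%dT%H")
    counters.upsert_inc((url, hour_bucket), {"views": 1, "bytes": nbytes})


counters = CounterCollection()
record_pageview(counters, "/home", datetime(2012, 4, 19, 17, 54), 100)
record_pageview(counters, "/home", datetime(2012, 4, 19, 17, 59), 50)
```

The caller never checks whether the hourly document already exists and never reads the current count. That is what makes fire-and-forget writes tolerable for this workload: a lost increment skews a chart slightly, whereas a lost billing record is a real problem.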

So, when evaluating MongoDB for your project, look at these use cases and see if they match yours. These are some of MongoDB’s “sweet spots”, where you will likely get the most benefit out of its design.

However, SQL databases were developed over the course of decades because of patterns of software data storage requirements. Therefore, before choosing MongoDB, you shouldn’t flush all of this industry knowledge and learning down the toilet. You need to ask yourself: Is my data relational? Can I benefit from transactional semantics? Can I benefit from on-the-fly data aggregation (SQL aggregates)?

Answered “yes” to these questions? Then, by all means, use a relational database. Just because a technology isn’t brand new doesn’t mean it isn’t right. You should also check out this video by Brandon Rhodes at PyCon to get a better appreciation of SQL databases: Flexing SQLAlchemy’s Relational Power.

Using multiple data stores is a reality of all large-scale technology companies. Pick the right tool for the right job. At my company, we use MongoDB, Postgres, Redis, and Solr — and we use them each on the part of our stack where we leverage their strengths and avoid their weaknesses. (Update from the future: in Shipping the Second System, I describe how we upgraded our stack to a system involving Elasticsearch and Cassandra, replacing MongoDB; we still retained usage of Redis and Postgres, however.)

The original article reads to me as if someone decided to store all of the canonical data for an e-commerce site in Solr, and then complained on realizing that re-indexing documents takes a long time, that index corruption can occur upon Solr/Lucene upgrades, or that referential integrity is not supported. Solr gives you excellent full-text search, and makes a lot of architectural trade-offs to achieve this. Such is the reality of technology tools. What, were you expecting Solr to make your coffee, too?

Likewise, MongoDB made a lot of architectural tradeoffs to achieve the goals it set out in its vision, as described in their Philosophy document.

It may be a cool technology, but no, it won’t make your coffee, too.

In the end, the author writes, “Over the past 6 months, we’ve scaled MongoDB by moving data off of it. […] we looked at our data access patterns and chose the right tool for the job. For key-value data, we switched to Riak, which provides predictable read/write latencies and is completely horizontally scalable. For smaller sets of relational data where we wanted a rich query layer, we moved to PostgreSQL.”

That’s the best lesson. They ended up in the right place — storing/indexing their big, multi-form data in multiple, purpose-built data stores.

New Things and Terrible Ideas

Here’s a terrible idea: implementing full text search with MongoDB, by stuffing all your keyword tokens into the B-Tree index. Instead, you should use Solr, Sphinx, or Elasticsearch, which are well-tuned tools for the job.

(Note: MongoDB shipped its own text search feature in 2.4, about a year after this article was written. Still a bad idea to use MongoDB for full-text search, IMO, especially with the growing maturity of Elasticsearch and Solr. But, if you do, you should definitely use MongoDB’s provided facility for it, rather than rolling your own keyword collection!)

Here’s another terrible idea: implementing grouping and aggregation using MongoDB’s map/reduce support. Though this “works”, you should just use Postgres or any other database that supports these operations out of the box.

(Note: MongoDB shipped aggregation pipelines in 2.2, a bit after this article was written. It’s actually a great feature that satisfies a lot of aggregation needs with reasonable performance, and avoids some of the awkwardness of the map/reduce framework.)
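For a sense of what “out of the box” means here, this is a minimal sketch of grouping and aggregation in plain SQL. It runs against Python’s built-in sqlite3 module rather than Postgres, purely so the example is self-contained. The `subscriptions` table, its rows, and the helper names are made up for illustration.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory table of hypothetical subscription rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE subscriptions (customer TEXT, plan TEXT, monthly_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO subscriptions VALUES (?, ?, ?)",
        [
            ("ann", "pro", 5000),
            ("bob", "basic", 1000),
            ("cat", "pro", 5000),
            ("dan", "basic", 1000),
            ("eve", "basic", 1000),
        ],
    )
    conn.commit()
    return conn


def revenue_by_plan(conn):
    """Return (plan, customer_count, total_cents) rows, largest total first."""
    return conn.execute(
        """
        SELECT plan, COUNT(*) AS customers, SUM(monthly_cents) AS total_cents
        FROM subscriptions
        GROUP BY plan
        ORDER BY total_cents DESC
        """
    ).fetchall()
```

The whole “job” is one declarative statement. There are no map and reduce functions to write, and the same GROUP BY query should work unchanged in Postgres.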

Sometimes it’s good to share the worst resources on the web along with the best.

Any new, shiny technology brings with it a bunch of terrible ideas. I actually think it’s kind of funny to see the ways people on the web contort perfectly good technologies in terrible ways. For example, I regularly see people trying to implement relations in Solr/Elasticsearch, implement document storage in SQL, implement transactions with MongoDB… the list goes on and on.

In my work, we try to fit the right tool to the job. This can be challenging and lead to technology fragmentation, but I think it’s a reality one simply must navigate these days. It’s not good enough for a startup CTO to say “We’re a MongoDB company” or “We’re a Postgres company”, as if picking a data store is a cultural statement. You need to have reasons to pick one data storage approach over another.

MongoDB happens to be a really good fit for analytics applications. We’re not the only ones who think so. Chartbeat implemented their real-time analytics system on MongoDB, Gilt Group implemented theirs the same way (Hummingbird), and Square implemented their analytics system with it and released some related open source functionality in a project named Cube. There are lots of other examples.

I have heard that Cassandra might also be a good fit for analytics systems. It’s how Twitter implemented its own internal analytics (which it now provides to advertising customers) — but my team hasn’t had the cycles to evaluate it yet. Plus, we’re pretty happy with the architecture we landed on.

We still store customer data (e.g. names, billing addresses, credit card tokens, API keys, etc.) in Postgres. I love Postgres, especially for this kind of data. And there are times I wish I had some subset of my data in Postgres so I could use aggregates and views.

We use Redis for ephemeral, real-time data where we have to sustain higher write throughput than even MongoDB will muster. Having a data store that can automatically expire keys can simplify some use cases drastically. (Note: in version 2.2, MongoDB also added data expiration.) We also use Redis as a simple queuing mechanism in some systems.
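As a rough illustration of why automatic key expiration simplifies things, here is a tiny in-memory model of a store with per-key time-to-live. It is not Redis and does not use a Redis client: `ExpiringStore`, its methods, and the injectable `clock` parameter are hypothetical, and the sketch only evicts lazily on read.

```python
import time


class ExpiringStore:
    """Toy key-value store whose entries vanish after a per-key TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so tests can control time
        self._data = {}      # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        """Store `value` under `key` for `ttl_seconds` seconds."""
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        """Return the live value for `key`, or `default` if missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict the stale entry on read
            return default
        return value
```

With expiry handled by the store, application code for things like “visitors active in the last 30 seconds” needs no cleanup job of its own: it writes with a TTL and reads whatever is still there.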

Finally, we use Solr for search — because Solr is awesome at it and we leverage features like TermVectorComponent, MoreLikeThisHandler, and FunctionQuery in a lot of places. (Update from the future: we upgraded from Solr to Elasticsearch.)

You could implement queuing in MongoDB; you could implement file storage in Postgres; you could implement relational data in Redis. But just because you can doesn’t mean you should.

The reason NoSQL has become such a terrible moniker is that it suggests that perhaps SQL “was a mistake”. It wasn’t. It just represents one set of tradeoffs for data storage. The only mistake that happened with SQL is the dogma that it is the “only” way to store data. Even I can remember many years when “knowing data storage” simply meant “knowing SQL”.

We don’t live in that world anymore, and I am glad we don’t. There is more choice and diversity in data stores today, and that is A Good Thing, because there is bigger data, taking more diverse forms, than ever before.

Let’s embrace the new things, while denouncing the terrible ideas of how to use them!

A note on this post from 2012. It has been a long time since this post was written and the world of distributed databases has changed quite a bit. If you want to learn more about how to evolve a distributed database stack to meet growing scale demands, check out Shipping the Second System.

9 thoughts on “On multi-form data”

jamieorc says:

April 19, 2012 at 5:54 pm

Excellent comments, Andrew. I enjoyed the discussion at our first BeCraft meeting. You offered a lot of insightful comments.
alex says:

April 19, 2012 at 7:18 pm

nice to see all of this in print. i agree with jamie; becraft was enlightening. thank you!
Joel Kemp says:

April 21, 2012 at 10:08 am

Andrew, this post is excellent. I’ve been struggling with understanding the strengths and weaknesses of the different storage systems and determining the types of data that work best with the respective solutions.

As you explained with Kiip, getting to the right solution is fantastic, but making the initial mistakes is costly. Your post makes it easier to avoid one such early mistake: incorrectly matching data to a data store.

Thanks a ton for taking the time to give some rare and valuable insight! Much appreciated.
Horse Mask says:

May 4, 2012 at 12:43 am

Taking my first SQL course and this was a tough read for me, hah. I need to go back because I know there’s solid info in there. Thanks for the writeup.
Auth says:

May 27, 2012 at 11:31 pm

I have used MongoDB for about a year for research and development, and I am very impressed by the way MongoDB works. I had some questions that I could not find answers to on mongodb.org or in the forums, but after reading this book I came to know some basic principles and how they are applied in MongoDB. The book is aimed at developers who would like to get started on a NoSQL platform, as well as DBAs who know nothing about MongoDB. It was written a few months ago based on version 1.6, but we now have version 1.8 with some additional features. The title says “Database for Cloud and Desktop Computing”, but there are no clear references to what that actually means for developers and DBAs. The topics are structured very nicely (basics to advanced), which motivated me to read the book in two days.

Positives
* Gives a clear motivation for why we need MongoDB, which I think is a big plus
* Very detailed
* Covers most aspects of MongoDB

Negatives
* Inconsistencies in the syntax of the examples given. Though all of them are valid, there is no clear explanation of why. That is, some keywords are in double quotes, some are in single quotes, and some have no quotes at all, e.g. $elemMatch, $or
* There are some references to mapReduce, but there is no separate topic about it at all.

All in all, a must-have book for starters.
Baron Schwartz says:

July 17, 2013 at 11:37 am

Related reading:

http://sergei.clustrix.com/2011/01/mongodb-vs-clustrix-comparison-part-1.html
http://sergei.clustrix.com/2011/02/mongodb-vs-clustrix-comparison-part-2.html

Context: I was involved with a company Clustrix hired to perform extensive (months of work) stress-tests of all aspects of performance, scalability, and high availability. These blog posts are credible, techie-to-techie meat and potatoes, not marketing fluff.
Jane says:

April 14, 2015 at 11:46 pm

A long time MongoDB user here, pretty satisfied with the usability of the program, best regards to the OP for the main article together with Baron Schwartz in the comments for the links, great content guys!
ramki says:

July 30, 2015 at 4:19 am

A good read, trying to start something in data analysis and learnt a good deal as a starting point from this page. Thank you
