I read an excellent debrief on a startup’s experience with MongoDB, called “A Year with MongoDB”.
It was excellent due to its level of detail. Some of its points are important, particularly the global write lock and uncompressed field names, both issues that needlessly afflict large MongoDB clusters and will likely be fixed eventually.
However, it’s also pretty clear from this post that they were not using MongoDB in the best way. For example, in a small part of their criticism of “safe off by default”, they write:
We lost a sizable amount of data at Kiip for some time before realizing what was happening and using safe saves where they made sense (user accounts, billing, etc.).
You shouldn’t be storing user accounts and billing information in MongoDB. Perhaps MongoDB’s marketing made you believe you should store everything in MongoDB, but you should know better.
In addition to that data being highly relational, it also requires the transactional semantics present in mature relational databases. When I read “user accounts, billing” here, I cringed.
Things that it makes total sense to use MongoDB for:
Analytics Systems: where high server write throughput, client-side async (unsafe) upserts/inserts, and the atomic $inc operator become very valuable tools. See this post for one example.
Content Management Systems: here, schema-free design, avoidance of joins, a flexible query language, and support for arbitrary metadata make an excellent set of tradeoffs vs. tabular storage in an RDBMS. MongoDB’s website has some nice examples of uses in media.
Document Management Systems: MongoDB can be used to great success as the canonical store of documents which are then indexed in a full-text search engine like Solr or Elasticsearch. You can do this kind of storage in an RDBMS, but MongoDB has less administrative overhead, a simpler development workflow, and less impedance mismatch with document-based stores like Solr and Elasticsearch. Further, with GridFS, you can even use MongoDB as a store for actual files, and leverage MongoDB’s replica sets for spreading those files across machines.
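To make the analytics use case above concrete, here's a minimal pure-Python sketch of the upsert-plus-$inc counter pattern that MongoDB makes atomic on the server. This is an in-memory stand-in, not real pymongo code, and all names (collection, fields, keys) are illustrative:

```python
from collections import defaultdict
from threading import Lock

class CounterStore:
    """In-memory stand-in for MongoDB's upsert + $inc counter pattern.

    In MongoDB this would be a single atomic server-side operation,
    roughly:
        db.pageviews.update({"url": url, "hour": hour},
                            {"$inc": {"views": 1}}, upsert=True)
    """
    def __init__(self):
        self._lock = Lock()                # models the server-side atomicity
        self._counts = defaultdict(int)    # models the collection

    def inc(self, url, hour, n=1):
        # Upsert semantics: a missing document is created on first increment.
        with self._lock:
            self._counts[(url, hour)] += n

    def views(self, url, hour):
        return self._counts.get((url, hour), 0)

store = CounterStore()
for _ in range(3):
    store.inc("/article/42", "2012-04-12T10")
print(store.views("/article/42", "2012-04-12T10"))  # 3
```

The point of the pattern is that the read-modify-write happens inside the store, so many unacknowledged client writes can race without losing counts.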
So, when evaluating MongoDB for your project, look at these use cases and see if they match yours. Because these are some of MongoDB’s “sweet spots”, and where you will likely get the most benefit out of its design.
However, SQL databases were refined over the course of decades in response to recurring patterns in software data storage requirements. Before choosing MongoDB, you shouldn’t flush all of that industry knowledge and learning down the toilet. You need to ask yourself: Is my data relational? Can I benefit from transactional semantics? Can I benefit from on-the-fly data aggregation (SQL aggregates)?
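As a concrete illustration of what "on-the-fly data aggregation" buys you, here's a sketch using Python's stdlib sqlite3 as a stand-in for Postgres; the table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("alice", 35.0), ("bob", 15.0)],
)

# One declarative query does grouping, aggregation, and ordering --
# no application-side loops, no map/reduce jobs.
rows = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 2, 55.0), ('bob', 1, 15.0)]
```

If queries like this are central to your application, that's a strong signal a relational database belongs in your stack.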
Answered “yes” to these questions? Then, by all means, use a relational database. Just because a technology isn’t brand new doesn’t mean it isn’t right. You should also check out this video by Brandon Rhodes at PyCon to get a better appreciation of SQL databases: Flexing SQLAlchemy’s Relational Power.
Using multiple data stores is a reality of all large-scale technology companies. Pick the right tool for the right job. At my company, we use MongoDB, Postgres, Redis, and Solr — and we use them each on the part of our stack where we leverage their strengths and avoid their weaknesses. (Update from the future: in Shipping the Second System, I describe how we upgraded our stack to a system involving Elasticsearch and Cassandra, replacing MongoDB; we still retained usage of Redis and Postgres, however.)
The original article reads to me like someone who decided to store all of their canonical data for an e-commerce site in Solr, and then complained when they realized that re-indexing their documents takes a long time, that index corruption occurs upon Solr/Lucene upgrades, or that referential integrity is not supported. Solr gives you excellent full-text search, and it makes a lot of architectural trade-offs to achieve this. Such is the reality of technology tools. What, were you expecting Solr to make your coffee, too?
Likewise, MongoDB made a lot of architectural tradeoffs to achieve the goals it set out in its vision, as described in their Philosophy document.
It may be a cool technology, but no, it won’t make your coffee, too.
In the end, the author writes, “Over the past 6 months, we’ve scaled MongoDB by moving data off of it. […] we looked at our data access patterns and chose the right tool for the job. For key-value data, we switched to Riak, which provides predictable read/write latencies and is completely horizontally scalable. For smaller sets of relational data where we wanted a rich query layer, we moved to PostgreSQL.”
That’s the best lesson. They ended up in the right place — storing/indexing their big, multi-form data in multiple, purpose-built data stores.
New Things and Terrible Ideas
Here’s a terrible idea: implementing full text search with MongoDB, by stuffing all your keyword tokens into the B-Tree index. Instead, you should use Solr, Sphinx, or Elasticsearch, which are well-tuned tools for the job.
(Note: MongoDB has shipped its own text search feature in 2.4; this article was written 1 year before it shipped. Still a bad idea to use MongoDB for full-text search, IMO. Especially with the growing maturity of Elasticsearch and Solr. But, if you do, you should definitely use MongoDB’s provided facility for it, rather than rolling your own keyword collection!)
Here’s another terrible idea: implementing grouping and aggregation using MongoDB’s map/reduce support. Though this “works”, you should just use Postgres or any other database that supports these operations out of the box.
(Note: MongoDB shipped aggregation pipelines in 2.2, a bit after this article was written. It’s actually a great feature that satisfies a lot of aggregation needs with reasonable performance, and avoids some of the awkwardness of the map/reduce framework.)
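To see why the map/reduce framework feels awkward for simple grouping, here's a plain-Python sketch of the two-phase ceremony it imposes; the equivalent single aggregation-pipeline stage is shown in a comment (the field names are made up):

```python
from collections import defaultdict

# Toy event stream: (page, view count) pairs.
events = [("page_a", 1), ("page_b", 1), ("page_a", 1)]

# Map/reduce style: emit key/value pairs, shuffle them by key,
# then reduce each key's values -- three steps for one grouped sum.
def map_phase(records):
    for key, value in records:
        yield key, value

def reduce_phase(pairs):
    shuffled = defaultdict(list)
    for key, value in pairs:
        shuffled[key].append(value)
    return {key: sum(values) for key, values in shuffled.items()}

totals = reduce_phase(map_phase(events))
print(totals)  # {'page_a': 2, 'page_b': 1}

# The same grouping as a single MongoDB 2.2+ aggregation pipeline stage:
#   {"$group": {"_id": "$page", "total": {"$sum": "$views"}}}
# or in SQL: SELECT page, SUM(views) FROM events GROUP BY page
```

The declarative forms say *what* to compute; map/reduce makes you spell out *how*, which is exactly the awkwardness the note above refers to.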
Sometimes it’s good to share the worst resources on the web along with the best.
Any new, shiny technology brings with it a bunch of terrible ideas. I actually think it’s kind of funny to see the ways people on the web contort perfectly good technologies in terrible ways. For example, I regularly see people trying to implement relations in Solr/Elasticsearch, implement document storage in SQL, implement transactions with MongoDB… the list goes on and on.
In my work, we try to fit the right tool to the job. This can be challenging and lead to technology fragmentation, but I think it’s a reality one simply must navigate these days. It’s not good enough for a startup CTO to say “We’re a MongoDB company” or “We’re a Postgres company”, as if picking a data store is a cultural statement. You need to have reasons to pick one data storage approach over another.
MongoDB happens to be a really good fit for analytics applications. We’re not the only ones who think so: Chartbeat implemented their real-time analytics system on MongoDB, Gilt Group built theirs the same way (Hummingbird), and Square did as well, releasing some related open source functionality in a project named Cube. There are lots of other examples.
I have heard that Cassandra might also be a good fit for analytics systems. It’s how Twitter implemented its own internal analytics (which it now provides to advertising customers) — but my team hasn’t had the cycles to evaluate it yet. Plus, we’re pretty happy with the architecture we landed on.
We still store customer data (e.g. names, billing addresses, credit card tokens, API keys, etc.) in Postgres. I love Postgres, especially for this kind of data. And there are times I wish I had some subset of my data in Postgres so I could use aggregates and views.
We use Redis for ephemeral, real-time data where we have to sustain higher write throughput than even MongoDB will muster. Having a data store that can automatically expire keys can simplify some use cases drastically. (Note: in version 2.2, MongoDB also added data expiration.) We also use Redis as a simple queuing mechanism in some systems.
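Automatic key expiry is easy to take for granted, so here's a minimal pure-Python sketch of the semantics (in Redis this is just `SET key val EX ttl` or `EXPIRE`; the class and names here are illustrative, and a fake clock keeps the example deterministic):

```python
import time

class ExpiringStore:
    """Toy model of Redis-style key expiry."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, deadline)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, deadline = item
        if self._clock() >= deadline:  # lazily expire on read
            del self._data[key]
            return None
        return value

# Fake clock so the example is deterministic:
now = [0.0]
store = ExpiringStore(clock=lambda: now[0])
store.set("session:abc", "user-1", ttl_seconds=60)
print(store.get("session:abc"))  # user-1
now[0] += 61
print(store.get("session:abc"))  # None
```

When the data store handles expiry for you, entire classes of cleanup cron jobs and "is this stale?" checks disappear from application code, which is the simplification the paragraph above is getting at.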
Finally, we use Solr for search — because Solr is awesome at it and we leverage features like TermVectorComponent, MoreLikeThisHandler, and FunctionQuery in a lot of places. (Update from the future: we upgraded from Solr to Elasticsearch.)
You could implement queuing in MongoDB; you could implement file storage in Postgres; you could implement relational data in Redis. But just because you can doesn’t mean you should.
NoSQL has become such a terrible moniker because it suggests that perhaps SQL “was a mistake”. It wasn’t. It just represents one set of tradeoffs for data storage. The only mistake that happened with SQL is the dogma that it is the “only” way to store data. There were many years, within even my own memory, when “knowing data storage” simply meant “knowing SQL”.
We don’t live in that world anymore, and I am glad we don’t. There is more choice and diversity in data stores today, and that is A Good Thing, because there is bigger data, taking more diverse forms, than ever before.
Let’s embrace the new things, while denouncing the terrible ideas of how to use them!
Interested in Python and/or DevOps? Want to work at a company dealing with petabyte-scale data, real-time analytics, and cloud scaling challenges? Parse.ly is always looking for great talent for its small but nimble Python backend engineering and DevOps teams. Write an email to [email protected] with “Re: multi-form data” in the subject line.