The NoSQL movement has spawned a slew of alternative data stores, all of which attempt to fill voids left by traditional relational database implementations. But while it's easy to fit the various relational databases (MySQL, Oracle, DB2, and so on) under a single categorical umbrella, the NoSQL world is much more diverse, and the NoSQL label is too general. NoSQL data stores such as MongoDB and Cassandra are so vastly different from each other that apples-to-apples comparisons are practically impossible. Thus, within the world of NoSQL, there are subcategories such as key-value stores, graph databases, and document-oriented stores.
Document-oriented stores, or document stores for short, aren't new to the world of computing. Industry graybeards will quickly recognize Lotus Notes as one of the first successful NoSQL document stores from the late '80s. Document stores encapsulate data into loosely defined documents, rather than tables with columns and rows. Implementations of the underlying document vary by data store, with some representing a document as XML and others as JSON, for instance.
[ Also on InfoWorld: NoSQL standouts: New databases for new applications | First look: Oracle NoSQL Database | Follow the latest developments in business technology news and get a digest of the key stories each day in the InfoWorld Daily newsletter. ]
But in general, documents aren't rigidly defined, and in fact they offer a high degree of flexibility when it comes to defining data. This flexibility has costs. For example, these data stores do not support SQL, instead supporting custom query languages better suited to the underlying document structure (such as XPath-like query languages for XML data stores). But the lack of rigidity in data definition has many benefits as well. In many cases, compared to traditional relational databases, the more flexible document stores enable faster iterative-style development where data requirements are evolving more rapidly than the pace of development.
MongoDB: Flexible, scalable NoSQL In recent years, a number of document stores have come out and garnered a high degree of developer mind share. One of the most popular of these is MongoDB, an open source, schema-free document store written in C++ that boasts support for a wide array of programming languages, a SQL-like query language, and a number of intriguing features related to performance and scalability.
Out of the box, Mongo supports sharding, which permits horizontal scaling by divvying up a collection of documents across a cluster of nodes, thus making reads faster. What's more, Mongo offers replication in two modes: master-slave and replica sets. In a replica set, there is no master node; instead, all nodes are copies of one another and there is no single point of failure. Replica sets therefore bring more fault tolerance to larger environments supporting massive amounts of data. These features and more don't require an army of DBAs to implement, nor do they need massive hardware expenditures. Mongo can run on commodity hardware platforms, provided there is a healthy amount of memory.
Mongo is schema-less -- it'll store any document you decide to put into it. There is no upfront document definition requirement. Ultimately, documents are grouped into collections, which are akin to tables in a relational database. Collections can be defined on the fly as well. Documents are stored in a binary JSON format, dubbed BSON, and encapsulate data represented as name-value pairs (which are somewhat like columns and rows).
MongoDB: JSON document storeJSON is an extremely understandable format. Humans can easily read it (as opposed to XML, for example) and machines can efficiently parse it. A document in Mongo representing a business card, for example, would look something like this:
In this case, the _id attribute in the document above represents a primary key in Mongo. Like a relational database, Mongo can index data and force uniqueness on data attributes. By default, the _id attribute is indexed; moreover, this document can further index individual fields or even a combination of them (for example, the name and address). Additionally, when defining an index, you can specify that its value be unique.
Mongo, however, doesn't provide for constraints or triggers. Documents in Mongo are free to refer to each other. For example, a document in a contact_log collection could refer back to a business card's _id above, thus providing a foreign keylike link. But there is no way, currently, to specify corresponding actions to be taken should a related document be removed, such as remove all referencing documents as well, which you can do in a typical RDBMS. You can, of course, add this sort of logic in application code, and triggers are planned for a future release.
JSON documents in Mongo do not force particular data types on attribute values. That is, there is no need to define upfront the format of a particular attribute. The data can be a string, an integer, or even an object type, provided it makes sense. By default, data types in Mongo documents include string, integer, boolean, double, array, date, object ID (which you can see in action in the business card example above), binary data (similar to a blob), and regular expression, although support for these latter types varies by driver.
The freedom from rigid data definition is where a document store shines. Taking the business card example a bit further, the same collection could include this document:
Note the subtle but important differences between the two. While both documents clearly represent business cards, the first includes a fax number and the second includes a Twitter handle instead.
Take a moment to think through how this same entity would be represented in an RDBMS. In this case, a business card table would need columns for both fax and Twitter, even though many rows would not have any data in those fields. Furthermore, altering a table's definition after the fact can be problematic, especially when that table contains a large amount of data. Thus, in some cases, a document store's freedom of data definition permits a high degree of variance in rapidly evolving data collections. In essence, a document store can permit data agility.
MongoDB: RDBMS style queries Although many document stores (and NoSQL implementations in general) eschew the notion of SQL and thus implement custom query languages and data access schemes, Mongo's query language is SQL-like. In fact, at a high level, it operates a lot like SQL and is rather easy to pick up.
For instance, using Mongo's shell, inserting a document representing a business card is done as follows:
In this case, I've inserted a JSON document into a collection named business_cards by issuing an insert command on that collection directly. I can search the business_card collection via the find command like so:
This would return the document I just inserted. Plus, that document would contain Mongo's _id field, which was automatically generated upon insert. Note that Mongo's query language is closely analogous to SQL -- the same query in SQL would be something like:
Mongo's query language supports a wide variety of searches, ranging from boolean expressions:
This leverages a boolean OR and returns:
It also covers regular expressions:
In return, you get two documents:
Naturally, Mongo supports updating documents and removing them. Updates are quite powerful, as there are a number of operations available for updating various aspects of a document. For instance, updating the cell number in a document containing the Twitter handle "msmith" would be performed as follows:
In this case, the $set modifier changes a particular value in the first document matching the query. If all documents matching a query should be updated, additional parameters to the update command can be added.
Removing documents is similarly straightforward. Simply use the remove command and issue a query:
Finally, Mongo offers MapReduce, a powerful searching algorithm for batch processing and aggregations that is somewhat similar to SQL's group by. At a high level, the MapReduce algorithm breaks a big task into two smaller steps. The map function is designed to take a large input and divide it into smaller pieces, then hand that data off to a reduce function, which distills the individual answers from the map function into one final output. For instance, MapReduce could be used in the business_card collection to determine how many documents contain a Twitter attribute.
MongoDB: New paradigm, new challengesAlthough MongoDB itself is open source, the code base is actively maintained and indeed sponsored by a commercial entity known as 10gen, which provides support, consulting, monitoring, and training for Mongo. 10gen supports a number of well-known companies using Mongo, including Disney, Intuit, and MTV Networks, to name a few. In addition to strong community backing and commercial support, Mongo benefits from excellent documentation. A number of published books are available.
Working with MongoDB is not without challenges. For starters, Mongo requires a lot of memory, preferring to put as much data as possible into working memory for fast access. In fact, data isn't immediately written to disk upon an insert (although you can optionally require this via a flag) -- a background process eventually writes unsaved data to disk. This makes writes extremely fast, but corresponding reads can occasionally be inconsistent. As a result, running Mongo in a nonreplicated environment courts the possibility of data loss. Furthermore, Mongo doesn't support the notion of ACID transactions, which is a touchstone of the RDBMS world.
As with traditional databases, indexing in Mongo must be thought through carefully. Improperly indexed collections will result in degraded read performance. Moreover, while the freedom to define documents at will does provide a high degree of agility, it has repercussions when it comes to data maintenance over the long term. Random documents in a collection present search challenges.
The relational database is still the staple data store for the vast majority of applications built today. But for some applications, the flexibility offered by Mongo provides advantages with respect to development speed and overall application performance. Plus, Mongo's relative ease-of-use and low price tag make it an attractive option for companies large and small.
- Out-of-the-box scalability and availability with sharding and replica sets
- Support for wide array of programming languages via drivers
- Excellent documentation, including a number of published books
- Ad hoc queries similar to SQL
- Strong community and commercial support
- No ACID transactons
- Lack of deep security features
- BI tool integration lagging (although Jaspersoft supports MongoDB)
- Queries don't support joins between collections
This article, "Flexing NoSQL: MongoDB in review," was originally published at InfoWorld.com. Follow the latest developments in data management and cloud computing at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
Read more about data management in InfoWorld's Data Management Channel.
Join the CIO Australia group on LinkedIn. The group is open to CIOs, IT Directors, COOs, CTOs and senior IT managers.