Document databases may be the most popular NoSQL database variant of them all. Their great flexibility -- schemas can be grown or changed with remarkable ease -- makes them suitable for a wide range of applications, and their object nature fits in well with current programming practices. In turn, Couchbase Server and MongoDB have become two of the more popular representatives of open source document databases, though Couchbase Server is a recent arrival among the ranks of document databases.
[ Andrew C. Oliver answers the question on everyone's mind: Which freaking database should I use? | Also on InfoWorld: The time for NoSQL standards is now | Get a digest of the key stories each day in the InfoWorld Daily newsletter. ]
The latest versions of Couchbase Server and MongoDB are both newly arrived. In December 2012, Couchbase released Couchbase Server 2.0, a version that makes Couchbase Server a full-fledged document database. Prior to that release, users could store JSON data into Couchbase, but the database wrote JSON data as a blob. Couchbase was, effectively, a key/value database.
10gen released MongoDB 2.4 just this week. MongoDB has been a document database from the get-go. This latest release incorporates numerous performance and usability enhancements.
Key differencesOf course there are differences. First, MongoDB's handling of documents is better developed. This becomes most obvious in the mongo shell, which serves the dual purpose of providing a management and development window into a MongoDB database. Database, collections, and documents are first-class entities in the shell. Collections are actually properties on database objects.
This is not to say that Couchbase is hobbled. You can easily manage your Couchbase cluster -- adding, deleting, and fetching documents -- from the Couchbase Management GUI, for which MongoDB has no counterpart. Indeed, if you prefer management via GUI consoles, score one for Couchbase Server. If, however, you favor life at the command line, you will be tipped in MongoDB's direction.
The cloud-based MongoDB Monitoring Service (MMS), which gathers statistics, is not meant to be a full-blown database management interface. But MongoDB's environment provides a near seamless connection between the data objects abstracted in the mongo shell and the database entities they model. This is particularly apparent when you discover that MongoDB allows you to create indexes on a specific document field using a single function call, whereas indexes in Couchbase must be created by more complex mapreduce operations.
In addition, while Couchbase documents are described via JSON, MongoDB documents are described in BSON; the latter notation includes a richer number of useful data types, such as 32-bit and 64-bit integer types, date types, and byte arrays. Both support geospatial data and queries, but this support in Couchbase is currently in an experimental phase and likely won't stay there long. New in version 2.4, MongoDB's full text search capability is also integrated with the database. A similar capability is available in Couchbase Server, but requires a plug-in for the elasticsearch tool.
Both Couchbase Server and MongoDB provide data safety via replication, both within a cluster (where live documents are protected from loss by the invisible creation of replica documents) and outside of a cluster (through cross data-center replication). Also, both provide access parallelism through sharding. However, where both Couchbase and MongoDB support hash sharding, MongoDB supports range sharding and "tag" sharding. This is a two-edged sword. On the one hand, it puts a great deal of flexibility at a database administrator's fingertips. On the other hand, its misuse can result in an imbalanced cluster.
Mapreduce is a key tool used in both Couchbase and MongoDB, but for different purposes. In MongoDB, mapreduce serves as the means of general data processing, information aggregating, and analytics. In Couchbase, it is the means of creating indexes for the purpose of querying data in the database. (We suspect that this, like the poorer document handling, is an effect of Couchbase's only having recently morphed into a document database.) As a result, it's easier to create indexes and perform ad hoc queries in MongoDB.
Couchbase's full incorporation of Memcached has no counterpart in MongoDB, and Memcached is a powerful adjunct as general object caching system for high-throughput, data-intensive Internet and intranet applications. If your application needs a Memcache server with your database, then look no further than Couchbase.
In general, the two systems are neck-and-neck in terms of features provided, though the ways those features are implemented may differ. Further, the advantages that one might hold over the other will certainly come and go as development proceeds. Both provide database drivers and client frameworks in all the popular programming languages, both are open source, both are easily installed, and both enjoy plenty of online documentation and active community support. As is typical for such well-matched systems, the best advice anyone could give for determining one over the other will be that you install them both and try them out.
Couchbase promotes Couchbase Server as a solution for real-time access, not data warehousing. Nor is Couchbase Server suitable for batch-oriented analytic processing -- it is designed to be an operational data store.
Though Couchbase Server is based on Apache CouchDB, it is more than CouchDB with incremental modifications. For starters, Couchbase is an amalgam of CouchDB and Memcached, the distributed, in-memory, key/value storage system. In fact, Couchbase can be used as a direct replacement for Memcached. The system provides a separate port that unmodified, legacy Memcached clients can use, as well as "smart SDK" and proxy tools that improve its performance as a Memcached server.
For example, you can use a "thick client" deployment model, which will place the continuously updated knowledge of Memcached node topology on the client tier. This speeds response, as any request for a particular Memcached object will be sent from the client directly to the caching node for that object. This thick-client approach also plays an important role in the Couchbase system's resilience to node crashes (described later).
Couchbase includes its own object-level caching system based on Memcached, though with enhancements. For example, Couchbase tracks working sets (the documents most frequently accessed on a given node) in its object cache using NRU (not recently used) algorithms. All I/O operations act on this in-memory cache. Updates to documents in the cache are eventually persisted to disk. In addition, for updates, locking is employed at the document level -- not at the node, database, or partition level (which would hobble throughput with numerous I/O waits), nor at the field level (which would snarl the system with memory and CPU cycles required to track the locks).
Couchbase accelerates access by using "append only" persistence. This is used not only with the data, but with indexes as well. Updated information is never overwritten; instead, it is appended to the end of whatever data structure is being modified. Further, deleted space is reclaimed by compaction, an operation that can be scheduled to take place during times of low activity. Append-only storage speeds updates and allows read operations to occur while writes are taking place.
Couchbase scaling and replicationTo facilitate horizontal scaling, Couchbase uses hash sharding, which ensures that data is distributed uniformly across all nodes. The system defines 1,024 partitions (a fixed number), and once a document's key is hashed into a specific partition, that's where the document lives. In Couchbase Server, the key used for sharding is the document ID, a unique identifier automatically generated and attached to each document. Each partition is assigned to a specific node in the cluster. If nodes are added or removed, the system rebalances itself by migrating partitions from one node to another.
There is no single point of failure in a Couchbase system. All partition servers in a Couchbase cluster are equal, with each responsible for only that portion of the data assigned to it. Each server in a cluster runs two primary processes: a data manager and a cluster manager. The data manager handles the actual data in the partition, while the cluster manager deals primarily with intranode operations.
System resilience is enhanced by document replication. The cluster manager process coordinates the communication of replication data with remote nodes, and the data manager process shepherds whatever replica data the cluster has assigned to the local node. Naturally, replica partitions are distributed throughout the cluster so that the replica copy of a partition is never on the same physical server as the active partition.
Like the documents themselves, replicas exist on a bucket basis -- a bucket being the primary unit of containment in Couchbase. Documents are placed into buckets, and documents in one bucket are isolated from documents in other buckets from the perspective of indexing and querying operations. When you create a new bucket, you are asked to specify the number of replicas (up to three) to create for that bucket. If a server crashes, the system will detect the crash, locate the replicas of the documents that lived on the crashed system, and promote those replicas to active status. The system maintains a cluster map, which defines the topology of the cluster, and this is updated in response to the crash.
Note that this scheme relies on thick clients -- embodied in the API libraries that applications use to communicate with Couchbase -- that are in constant communication with server nodes. These thick clients will fetch the updated cluster map, then reroute requests in response to the changed topology. In addition, the thick clients participate in load-balancing requests to the database. The work done to provide load balancing is actually distributed among the smart clients.
Changes in topology are coordinated by an orchestrator, which is a server node elected to be the single arbiter of cluster configuration changes. All topology changes are sent to all nodes in the cluster; even if the orchestrator node goes down, a new node can be elected to that position and system operation can continue uninterrupted.
Couchbase supports cross-data-center replication (XDCR), which provides live replication of database contents of one Couchbase cluster to a geographically remote cluster. Note that XDCR operates simultaneously with intracluster replication (the copying of live documents to their inactive replica counterparts on other cluster members), and all systems in an XDCR arrangement invisibly synchronize with one another. However, Couchbase does not provide automatic fail-over for XDCR arrangements, relying instead on techniques such as using a load-balancing mechanism to reroute traffic at the network layer, in which case the XDCR group will have been set up in a master-master configuration.
Couchbase indexing and queriesQueries on Couchbase Server are performed via "views," Couchbase terminology for indexes. Put another way, when you create an index, you're provided with a view that serves as your mechanism for querying Couchbase data. Views are new to Couchbase 2.0, as is the incremental mapreduce engine that powers the actual creation of views. Note that queries really didn't exist prior to Couchbase Server 2.0. Until this latest release, the database was a key/value storage system that simply did not understand the concept of a multifield document.
The map function in a design document's mapreduce specification filters and extracts information from the documents against which it executes. The result is a set of key/value pairs that comprise the query-accelerating index. The reduce function is optional. It is typically used to aggregate or sum the data manipulated by the map operation. Code in the reduce function can be used to implement operations that correspond roughly to SQL's ORDER BY, SORT, and aggregation features.
Couchbase Server supplies built-in reduce functions: _count, _stats, and _sum. These built-in functions are optimized beyond what would be possible if written from scratch. For example, the _count function (which counts the number of rows returned by the map function) doesn't have to recount all the documents when called. If an item is added to or removed from the associated index, the count is incremented or decremented appropriately, so the _count function need merely retrieve the maintained value.
Query parameters offer further filtering of an index. For example, you can use query parameters to define a query that returns a single entry or a specified range of entries from within an index. In addition, in Couchbase 2.0, document metadata is available. The usefulness of this becomes apparent when building mapreduce functions, as the map function can employ metadata to filter documents based on parameters such as expiration date and revision number.
Couchbase indexes are updated incrementally. That is, when an index is updated, it's not reconstructed wholesale. Updates only involve those documents that have been changed or added or removed since the last time the index was updated. You can configure an index to be updated when specific circumstances occur. For example, you might configure an index to be updated whenever a query is issued against it. That, however, might be computationally expensive, so an alternative is to configure the index to be updated only after a specified number of documents within the view have been modified. Still another alternative is to have the view updated based on a time interval.
Whatever configuration you choose, it's important to realize that a design document can hold multiple views and the configuration applies to all views in the document. If you update one index, all indexes in the document will be updated.
Finally, Couchbase distinguishes between development and production views, and the two are kept in separate namespaces. Development views can be modified; production views cannot. The reason for the distinction arises from the fact that, if you modify a view, all views in that design document will be invalidated, and all indexes defined by mapreduce functions in the design document will be rebuilt. Therefore, development views enable you to test your view code in a kind of sandbox before deploying it into a production view.
You can manage Couchbase Server via the Web-based management console. The view of active servers, shown above, is open to a single member of the cluster. Memory cache and disk storage usage information is readily available.
Managing Couchbase For gathering statistics and managing a Couchbase Server cluster, the Couchbase Web Console -- available via any modern browser on port 8091 of a cluster member -- is the place to go. It provides a multitab view into cluster mechanics. The tabs include:
- Cluster overview, which has general RAM and disk usage information (aggregated for the whole cluster). Also, operations per second bucket usage. The information is presented in a smoothly scrolling line graph.
- Server nodes, which provides information similar to the above, but for individual members of the cluster. You can also see CPU usage and swap space usage. On this tab, you can add a new node to the cluster: Click the Add Server button and you're prompted for IP address and credentials.
- Data buckets, which shows all the buckets on the cluster. You can see which nodes participate in the storage of a given bucket, how many items are in each bucket, RAM and disk usage attributed to a bucket, and so on.
The Couchbase Web Console provides much more information than can be covered here. An in-depth presentation of its capabilities can be found in Couchbase Server's online documentation.
For administrators who would rather perform their management duties on the metal, Couchbase provides a healthy set of command-line tools. General management functions are found in the couchbase-cli tool, which lists all the servers in a cluster, retrieves information for a single node, initiates rebalancing, manages buckets, and more. The cbstats command-line tool displays cluster statistics, and it can be made to fetch the statistics for a single node (the variety of statistical information retrieved is too diverse to list here). The cbepctl command lets you modify a cluster's RAM and disk management characteristics. For example, you can control the "checkpoint" settings, which govern how frequently replication occurs.
Other command-line tools include data backup and restore, a tool to retrieve data from a node that has (for whatever reason) stopped running, and even a tool for generating an I/O workload on a cluster member to test its performance.
Couchbase Server is available in both Enterprise and Community editions. The Enterprise edition undergoes more thorough testing than the Community edition, and it receives the latest bug fixes. Also, hot fixes are available, as is 24/7 support (with the purchase of an annual subscription). Nevertheless, the Enterprise edition is free for testing and development on any number of nodes or for production use on up to two nodes. The Community edition, as you might guess, is free for any number of production nodes.
- Provides legacy Memcached capabilities
- Supports spatial data and views
- Now a true document database
- Indexing mechanisms not well developed
- JSON support is relatively immature
- Does not support range sharding
MongoDB is about three years old, first released in late 2009. The goal behind MongoDB was to create a NoSQL database that offered high performance and did not cast out the good aspects of working with RDBMSes. For instance, the way that queries are designed and optimized in MongoDB is similar to how that would be done in an RDBMS. MongoDB's designers also wanted to make the database easier for application developers to work with -- for example, by allowing developers to change the data model quickly. MongoDB, whose name is short for "humongous," stores documents in BSON (Binary JSON), an extension of JSON that allows for the use of data types such as integers, dates, and byte arrays.
Two primary processes are at work in a MongoDB system, mongod and mongos. The mongod process is the real workhorse. In a sharded MongoDB cluster, mongod can be found playing one of two roles: config server or shard server. The config server tracks the cluster's metadata. (In a sharded MongoDB cluster, there must be at least three config servers for redundancy's sake.) Each config server knows which server in the cluster is responsible for a given document or, more precisely, where a given contiguous range of shard keys (called a chunk) belongs in the cluster.
Other mongod processes in the cluster run as shard servers, and these handle the actual reading and writing of the data. For fail-over purposes, two instances of a mongod process on a given cluster member run as shard servers. One process is primary, and the other is secondary. All write requests go to the primary, while read requests can go to either primary or secondary.
Secondaries are updated asynchronously from the primary so that they can take over in the event of a primary's crash. This, however, means that some read requests (sent to secondaries) may not be consistent with write requests (sent to primaries). This is an instance of MongoDB's "eventual consistency." Over time, all secondaries will become consistent with write operations on the primary. Note that you can guarantee consistent read/write behavior by configuring a MongoDB system such that all I/O -- reads and writes -- go to the primary instances. In such an arrangement, secondaries act as standby servers, coming online only when the primary fails.
The mongos process, which runs at a conceptually higher level than the mongod processes, is best thought of as a kind of routing service. Database requests from clients arrive first at a mongos process, which determines which shard(s) in a sharded cluster can service each request. The mongos process dispatches I/O requests to the appropriate mongod processes, gathers the results, and returns them to the client. Note that in a nonsharded cluster, clients talk directly to a mongod process.
MongoDB scaling and replicationMongoDB doesn't have an explicit memory caching layer. Instead, all MongoDB operations are performed through memory-mapped files. Consequently, MongoDB hands off the chore of juggling memory caching versus persistence-to-disk to the operating system. You can tweak various flush-to-disk settings for optimal performance, however. For example, MongoDB maintains a journal of database operations (for recovery purposes) that is flushed to the disk every 100ms. Not only is this interval configurable, but you can configure the system so that write operations return only after the journal has been written to disk.
Documents are placed in named containers called collections, which are roughly equivalent to Couchbase's buckets. A collection serves as a means of partitioning related documents into separate groups. The effects of many multidocument operations in a MondoDB database are restricted to the collection in which those operations are performed. MongoDB supports sharding at the collection level, which means -- should requirements dictate -- you could construct a database with unsharded and sharded collections. Of course, only a sharded collection is protected against a single point of failure.
A document's membership in a particular shard is determined by a shard key, which is derived from one or more fields in each document. The exact fields can be specified by the database administrator. In addition, MongoDB provides autosharding, which means that, once you've configured sharding, MongoDB will automatically manage the storage of documents in the appropriate physical location. This includes rebalancing shards as the number of documents grows or the number of mongod instances changes.
As of the 2.4 release, MongoDB supports both hash-based sharding and range-based sharding. As you might guess, hash-based sharding hashes the shard key, which creates a relatively even distribution of documents across the cluster. With range-based sharding (the sole sharding type prior to 2.4), a given member of a MongoDB sharded cluster will store all the documents within a given subrange of the shard key's overall domain. More precisely, MongoDB defines a logical container, called a chunk, which is a subset of documents whose shard keys fall within a specific range. The mongos process then dictates which mongod process will manage a given chunk.
Typically, you permit the load balancer to determine which cluster member manages a given shard range. However, with version 2.4, you can associate tags with shard ranges (a tag being nothing more than an identifying string). Once that's done, you can specify which member of a cluster will manage any shard ranges associated with a tag. In a sense, this lets you override some of the load balancer's decision making and steer identifiable subsets of the database to specific servers. For example, you could put the data most frequently accessed from California on the cluster member in California, the data most frequently accessed from Texas on the cluster member in Texas, and so on.
MongoDB's locking is on the database level, whereas it was global prior to version 2.2. The system implements shared-read, exclusive-write locking (many concurrent readers, but only one writer) with priority given to waiting writers over waiting readers. MongoDB avoids contentions via yield operations within locks. Predictive coding was added to the 2.2 release; if a process requests a document that is not in memory, it yields its lock so that other processes -- whose documents are in memory -- can be serviced. Long-running operations will also periodically yield locks.
You'll find no clear notion of transactions in MongoDB. Certainly, you cannot perform pure ACID transactions on a MongoDB installation. Database changes can be made durable if you enable journaling, in which case write operations are blocked until the journal entry is persisted to disk (as described earlier). And MongoDB defines the $atomic isolation operator, which imposes what amounts to an exclusive-write lock on the document involved. However, $atomic is applied at the document level only. You cannot guard multiple updates across documents or collections.
MongoDB indexing and queriesMongoDB makes it easy to create secondary indexes for all document fields. A primary index always exists on the document ID. As with Couchbase Server, this is automatically generated for each document. However, with MongoDB, you can specify a separate field as being the document's unique identifier. For example, a database of bank accounts might use the bank's generated account number as the document ID field. Indexes exist at the collection level, and they can be compound -- that is, created on multiple fields. MongoDB can also handle multikey indexes. If you index a field that includes an array, MongoDB will index each value in that array. Finally, MongoDB supports geospatial indexes.
MongoDB's querying capabilities are well developed. If you're coming to MongoDB from the RDBMS world, the online documentation shows how SQL queries might be mapped to MongoDB operations. For example, in most cases, the equivalent of SQL's SELECT can be performed by a find() function. The find() function takes two arguments: a query document and a projection document. The query document specifies filter operations on specific document fields that are fetched. You could use it to request that only documents with a quantity field whose contents are greater than, say, 100 be returned. Therefore, the query document corresponds to the WHERE clause in an SQL statement. The projection document identifies which fields are to be returned in the results, which allows you to request that, say, only the name and address fields of matching documents be returned from the query. The sort() function, which can be executed on the results of find(), corresponds to SQL's ORDER BY statement.
You can locate documents with the command db.<collection>.find(), possibly the simplest query you can perform. The find() command will return the first 20 members of the result, but it also provides a cursor, which allows you to iterate through all the documents in the collection. If you'd like to navigate the results more directly, you can reference the elements of the cursor as though it were an array.
More complex queries are possible thanks to MongoDB's set of prefix operators, which can describe comparisons as well as boolean connections. MongoDB also provides the $regex operator in case you want to apply regular expressions to document fields in the result set. These prefix operators can be used in the update() command to construct the MongoDB equivalent of SQL's UPDATE ... WHERE statement.
One of the more significant new features in MongoDB's 2.4 release is the arrival of text search. In the past, developers accomplished this by integrating Apache Lucene with MongoDB, which piled on considerable complexity. Adding Lucene in a clustered system with replication and fault tolerance is not an easy thing to do. MongoDB users now get text search for free. The new text search feature is not meant to match Lucene, but to provide basic capabilities such as more efficient Boolean queries ("dog and cat but not bird"), stemming (search for "reading" and you'll also get "read"), and the automatic culling of stop words (for example, "and", "the", "of") from the index.
You can define a text index on multiple string fields, but there can be only a single text index per collection, and indexes do not store word proximity information (that is, how close words are to one another, which can affect how matches are weighted). In addition, the text index is fully consistent: when you update data, the index is also updated.
Ease-of-use features have been added to version 2.4 as well. For example, you can now define a "capped array" as a data element, which works sort of like an ordered circular buffer. If, for example, you're keeping track of the top 10 entries in a blog, using a capped array will allow you to add new entries, and (based on the specified ordering) previous entries will be removed to cap the array at 10 or whatever number you specify.
MongoDB 2.4 also has improved geospatial capabilities. For example, you can now perform polygon operations, which would allow you to determine if two regions overlap. The spherical model used in 2.4 is improved too; it now takes into account the fact that the earth is not perfectly spherical, so distance calculations are more accurate.
You can filter the documents passed into the map function via comparison operators, or you can limit the number of documents to a specific number. This allows you to create what amounts to an incremental mapreduce operation. Initially, you run mapreduce over the entire collection. For subsequent executions, you add a query function that includes only newly added documents. From there, set the output of mapreduce to be a separate collection, and configure your code so that the new results are merged into the existing results.
You have to be careful with long-running mapreduce operations, because their execution involves lengthy locks. As mentioned earlier, the system has built-in facilities to mitigate this. For example, the read lock on the input collection is yielded every 100 documents. The MongoDB documentation describes the various locks that must be considered -- as well as mechanisms to relieve the possible problems.
Other useful command-line utilities are mongostat, which returns information concerning the number of operations -- inserts, updates, deletes, and so on -- within a specific time period. The mongotop utility likewise returns statistical information on a MongoDB instance, this time focusing on a specific collection. You can see the amount of time spent reading or writing in the collection, for instance.
In addition, 10gen provides the free cloud-based MongoDB Monitoring Service (MMS) which provides a monitoring dashboard for MongoDB installations. Built on the SaaS model, MMS requires you to run a small agent on your MongoDB cluster that communicates with the management system.
10gen's MongoDB Monitoring Service shows statistics -- in this case, for a replica set -- but management of the database is done from the command line.
Finally, the Enterprise edition welcomes Kerberos-based authentication. In all editions, MongoDB now supports role-based privileges, which gives you finer-grained control over users' access and operations on databases and collections.
10gen's release of MongoDB 2.4 is accompanied by new subscription levels: Community, Basic, Standard, and Enterprise. The Community subscription level is free, but it's also free of any support. The other subscription levels provide varying support response times and hours of availability. In addition, the Enterprise subscription level comes with the Enterprise version of MongoDB, which has more security features and SNMP support. It has also undergone more rigorous testing.
- New release incorporates text search
- Free MongoDB training courseware available from 10gen
- Text index doesn't store proximity information
- No GUI-based management console
- Kerberos authentication available in Enterprise edition only
Mongo or Couch?As usual, which product is the best choice depends heavily on the target application. Both are highly regarded NoSQL databases with outstanding pedigrees. On the one hand, MongoDB has spent much more of its lifetime as a document database, and its support for document-level querying and indexing is richer than that in Couchbase. On the other hand, Couchbase can serve equally well as a document database, a Memcached replacement, or both.
Happily, exploring either Couchbase or MongoDB is remarkably simple. A single-node system for either database server is easily installed. And if you want to experiment with a sharded system (and have enough memory and processor horsepower), you can easily set up a gang of virtual machines on a single system, and lash them together via a virtual network switch. The documentation for both systems is voluminous and well maintained. 10Gen even provides free online MongoDB classes for developers, complete with video lectures, quizzes, and homework.
This article, "NoSQL showdown: MongoDB vs. Couchbase," was originally published at InfoWorld.com. Follow the latest developments in application development,data management, cloud computing, and open source at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
Read more about data management in InfoWorld's Data Management Channel.
Join the CIO Australia group on LinkedIn. The group is open to CIOs, IT Directors, COOs, CTOs and senior IT managers.