This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter’s approach.
Many organizations are turning to NoSQL for its ability to support Big Data’s volume, variety and velocity, but how do you know which one to choose?
A NoSQL database can be a good fit for many projects, but to keep down development and maintenance costs you need to evaluate each project’s requirements to make sure specialized criteria are addressed. Keep in mind that it is not just a question of being able to develop the application specified; you must also be able to easily manage and support applications that may grow dramatically in scope and size over many years in production. One of my customers doubled the size of their business 12 times in less than 4 years.
With that in mind, the next step is to identify which category of NoSQL best aligns with your needs. The categories are:
* Key Value (KV) databases store data as an associative array (also known as a map or dictionary), where the key is unique and serves as the primary means of accessing the value (data). Many KV databases can support richer and more complex data models on top of the basic key-value architecture.
* Document databases store documents, which contain key-value collections in a hierarchical format such as JSON and XML. Because this is NoSQL, and therefore no relational schema is required, the structure of each document can be different from other documents in the same database.
* Column databases are sparse matrix systems that use a row and a column as keys. Much like a hash table or dictionary, they map a row key to a set of key-value pairs. There can be a lot of columns, but each record only uses the columns it needs, so records can actually be relatively small in size.
* Graph databases are very different from the other categories. They focus on the relationships among the entities. Nodes are entities such as a person or object, and the edges between them detail the type of relationship between the nodes.
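To make the four categories concrete, here is a minimal sketch that models each one with plain Python data structures. It is only an illustration of the data shapes described above, not how any particular product stores data, and every name and value in it is hypothetical.

```python
# Key-value: a unique key is the primary way to reach an opaque value.
kv_store = {"user:42": b'{"name": "Ada"}'}

# Document: hierarchical key-value collections; no shared schema is required,
# so the second document can carry fields the first one lacks.
documents = [
    {"_id": 1, "name": "Ada", "emails": ["ada@example.com"]},
    {"_id": 2, "name": "Grace", "address": {"city": "Arlington"}},
]

# Column: a row key maps to a set of column/value pairs, and each row
# only stores the columns it actually uses (a sparse matrix).
wide_rows = {
    "row1": {"name": "Ada", "email": "ada@example.com"},
    "row2": {"name": "Grace", "phone": "555-0100", "city": "Arlington"},
}

# Graph: nodes are entities, edges carry the type of relationship.
nodes = {
    "ada": {"kind": "person"},
    "grace": {"kind": "person"},
    "acme": {"kind": "company"},
}
edges = [("ada", "KNOWS", "grace"), ("grace", "WORKS_AT", "acme")]


def neighbors(node, relation):
    """Return the nodes reachable from `node` over edges of type `relation`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]
```

Asking the graph a relationship question, such as `neighbors("ada", "KNOWS")`, is a one-line traversal here, whereas the other three shapes would need the relationship encoded inside the values.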
While it’s practical to assign databases to categories, many vendors have overlaid more advanced features on the base configuration, providing richer data models and advanced functionality. So while there are industry-defined categories, many NoSQL databases have abilities that don’t fit neatly in a box.
When you start your evaluation of databases, consider that some key-value databases can function like a document database, or a document database can function as a graph database and turn out to be a better fit for what you’re looking for than if you only consider databases in the graph category.
Define the parameters for your project, such as the data (what kind, how much, size, format, sources), how you are going to use the data, what kind of growth you are projecting, how many concurrent users you expect on your site, performance, uptime, etc. Know which criteria are essential to your business requirements and rank them in order of importance. As you can see, this is a long list, but it will help you in your evaluation by allowing you to ask the right questions.
Some considerations when evaluating your solution:
* Scalability. There are many aspects of scalability. For data alone, you need to understand how much data you will be adding to the database per day, how long the data remains relevant, and what you are going to do with older data (offload it to other storage for analysis, keep it in the database but move it to a different storage tier, both, or does it matter?). What sources is the data coming from, and does it arrive in real time or in batches? Does it need any pre-processing? How easy is it to add this data to your database?
In some situations, your overall data size stays the same; in others, the data continues to accumulate and grow. How is your database going to handle this growth? Can your database easily grow by adding new resources, such as servers or storage space? How easy will it be to add resources? Will the database be able to redistribute the data automatically or does it require manual intervention? Will there be any downtime during this process?
How many servers and what kind of disk capacity are required to handle the data you will store? Too many servers translate into higher hardware, data center and personnel costs. In some situations there may be significant peaks and valleys in your data usage, such as ecommerce on Black Friday and during the December holiday shopping season. How easy is it to scale up and down in size? Can the cloud be used during the periods of higher resource usage?
You must be able to make projections about all aspects of your data and database growth. No matter how well a database can do all these things, you should still do continuous monitoring of resource usage so you can proactively scale up to stay ahead of your usage and not overload the database.
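A rough projection does not need much machinery. The sketch below, under stated assumptions, estimates how much storage a steady ingest rate settles at and how many days remain before a cluster crosses a scale-up threshold. The function names, the replication factor of 3 and the 80% threshold are illustrative assumptions, not recommendations from any vendor.

```python
import math


def steady_state_gb(daily_ingest_gb, retention_days, replication_factor=3):
    """Storage needed once data older than `retention_days` is aged out.

    Assumes a constant ingest rate and that every byte is stored
    `replication_factor` times (hypothetical default of 3).
    """
    return daily_ingest_gb * retention_days * replication_factor


def days_until_threshold(current_gb, capacity_gb, daily_growth_gb, threshold_pct=80):
    """Days left before usage crosses `threshold_pct` percent of capacity.

    Returns 0 if the threshold is already crossed and None if usage
    is not growing. Assumes linear growth.
    """
    headroom_gb = capacity_gb * threshold_pct / 100 - current_gb
    if headroom_gb <= 0:
        return 0
    if daily_growth_gb <= 0:
        return None
    return math.floor(headroom_gb / daily_growth_gb)
```

For example, a cluster holding 6,000 GB out of 10,000 GB and growing by 100 GB a day has `days_until_threshold(6000, 10000, 100)` days, which works out to 20, before it passes the 80% mark. Real growth is rarely linear, so a figure like this is a trigger for a closer look rather than a forecast.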
* Uptime. Applications differ in when they need to be available: some only during trading hours, others 24x7 with five nines (99.999%) availability, though what their owners really mean is 100% of the time. Is this possible? Absolutely!
Uptime depends on a number of features, such as replication, which keeps multiple copies of the data within the database. Should a single node or storage device go down, the data is still available, so your application can continue to do CRUD (Create, Read, Update and Delete) operations without interruption. This is failover and high availability.
What happens if an entire cluster goes down? There can be a natural disaster such as a hurricane or a power failure in an entire region that lasts longer than most backup plans allow for. Do you have a disaster recovery plan? With a secondary database in a different geographic location, you can continue to operate uninterrupted. One customer I’d worked with had been in operation 100% of the time in the 4 years since they went into production with NoSQL and continue to grow that record.
With good planning and management on the part of Dev and IT, and the right NoSQL database architecture and design, it is possible to have the database up and running all the time.
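It helps to translate availability targets into minutes. The sketch below shows the arithmetic for how much downtime a given availability allows per year, and how replication raises the availability of the data. The second function assumes replica failures are independent, which is optimistic in practice (a regional outage takes out co-located replicas together), and both function names are made up for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability):
    """Downtime budget per year for an availability such as 0.99999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(node_availability, replicas):
    """Chance that at least one of `replicas` copies is reachable.

    Assumes each replica fails independently of the others.
    """
    return 1.0 - (1.0 - node_availability) ** replicas
```

Five nines leaves a budget of a little over five minutes a year (`downtime_minutes_per_year(0.99999)` is about 5.26). Under the independence assumption, three replicas that are each available 99% of the time give roughly 99.9999% availability for the data itself, which is why replication and a geographically separate secondary matter so much.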
* Full-Featured. As a second customer determined during their evaluation, one NoSQL solution could do what they needed by integrating a dozen components and it would fulfill everything on their checklist. But realistically, how well would it be able to operate, and still be able to achieve over 25,000 transactions/s, support over 35 million global browsers accessing the main site on multiple types of devices and update over 10,000 web pages as the events were happening without giving them a lot of grief?
It is certainly easier to use a solution that has all the features “under one roof” so to speak, so they work together seamlessly and require fewer resources on your part.
* Performance. How well can your database do what you need it to do and still have reasonable performance? There are two general classes of performance requirements for NoSQL.
The first group is applications that need to be real time, often under 20ms or sometimes as low as 10ms or 5ms. These applications likely have more simplified data and query needs, but this usually means having a cache or in-memory database to accommodate these kinds of speeds.
The second group is applications that need to have human-reasonable performance, so that we, as recipients of the information, don’t notice the lag time too much. These applications may need to look at more complex data, span larger sets and do more complicated filtering. Response time for these is usually around 0.1s to 1s.
And there is the combination, where you have a system of record that cannot be replaced, and a NoSQL database is used as a cache to speed up the ability to use the information.
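The combination case is commonly implemented with a pattern often called cache-aside: read from the cache first, and fall back to the system of record only on a miss. The following is a minimal in-process sketch of that pattern, assuming a simple time-to-live expiry. The class name, the `load_from_system_of_record` callable and the 60-second default are hypothetical; in a real deployment the dictionary would be replaced by the NoSQL database acting as the cache.

```python
import time


class CacheAside:
    """Serve reads from a cache, loading from the system of record on a miss."""

    def __init__(self, load_from_system_of_record, ttl_seconds=60.0,
                 clock=time.monotonic):
        self._load = load_from_system_of_record  # slow, authoritative lookup
        self._ttl = ttl_seconds
        self._clock = clock                      # injectable for testing
        self._entries = {}                       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit, still fresh
        value = self._load(key)                  # miss or expired: go to the source
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, for example after the system of record is updated."""
        self._entries.pop(key, None)
```

The trade-off is staleness: a value can be up to `ttl_seconds` old unless the application calls `invalidate` whenever it writes to the system of record.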
* Interface. NoSQL databases commonly have programmatic interfaces to access the information, supporting Java and JavaScript variants, C, C++ and C#, as well as various scripting languages like Perl, PHP, Python, and Ruby. Some have included a SQL interface to support RDBMS users in transitioning to NoSQL solutions. Many NoSQL databases also provide a REST interface to allow for more flexibility in accessing both the data and the functionality of the database.
Evaluate how comprehensive the API is. Is the API extensible? Can it do all the things you need the database to do?
* Security. Security is not just about restricting access to the database; it’s also about protecting the content in your database. If you have data that certain people must not see or change, and the database does not accommodate this level of granularity, the application can be used as the means of safeguarding the data. But this adds work to your application layer. If you are in government, finance or healthcare, to name a few groups, this may be a big factor in whether a specific NoSQL solution can be used for sensitive projects.
You should also consider how easy it is to administer user rights, roles and access. Can the database integrate easily with your LDAP or other Single Sign On solution you might have? What kind of granularity do you have? Is it at the database, “table” or record level?
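To give a sense of the extra work involved when the database cannot enforce record-level permissions, here is a minimal sketch of a role check done in the application layer. The `Record` type, the role names and the helper functions are all hypothetical; a database with built-in record-level security would make this filtering unnecessary.

```python
from dataclasses import dataclass


@dataclass
class Record:
    data: dict
    read_roles: frozenset   # roles allowed to see this record
    write_roles: frozenset  # roles allowed to change this record


def can_read(user_roles, record):
    """True if the user holds at least one role that may read the record."""
    return not record.read_roles.isdisjoint(user_roles)


def can_write(user_roles, record):
    """True if the user holds at least one role that may change the record."""
    return not record.write_roles.isdisjoint(user_roles)


def readable(records, user_roles):
    """Filter query results down to what this user is allowed to see."""
    return [r.data for r in records if can_read(user_roles, r)]


records = [
    Record({"patient": "A", "diagnosis": "..."},
           frozenset({"doctor"}), frozenset({"doctor"})),
    Record({"patient": "A", "invoice": 120},
           frozenset({"doctor", "billing"}), frozenset({"billing"})),
]
```

With this data, a user holding only the `billing` role gets back the invoice record from `readable(records, {"billing"})` and nothing else. Every query path in the application has to remember to apply the filter, which is the maintenance burden the text refers to.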
* Management and Administration. An ongoing requirement of production applications is the management and maintenance of your database. How easy is it to manage and maintain the servers and the database software? How easy is it to manage situations where you need to add a server or storage resource? How well does the database perform when a node or disk crashes? Does a DBA need to be paged to take action or does the database architecture handle this gracefully without need of immediate intervention (assuming good capacity planning)?
How easily can the database integrate with your management system to alert you to any issues? How granular is the information you can get about the database, and is it sufficient?
* Open Source and Cost. Open Source has become a major factor when organizations evaluate software, for many reasons. One is that open source is believed to be more robust, because everyone can view the code and provide feedback or contribute to the code base to fix any holes. But in February 2015, one well-known open source database was found to have tens of thousands of unsecured servers among its users. It was not due to the code; it was due to documentation that did not advise the users to secure the server properly.
Another assumption is that Open Source costs less, since many projects can be done on the community edition and the community can answer many questions in lieu of paying for a support contract. This is true for some projects. You must be sure that you are evaluating all the cost factors, not just the “free” software. If you must integrate other core capabilities into the base open source database, you are paying for the time your team needs to do the integration or extra development work and continue to maintain that work. “Free” is not looking so free after all.
One NoSQL customer switched from Open Source to a commercial solution because their original configuration on open source used almost 200 servers. Switching to the commercial solution allowed them to use fewer than 20 servers, which saved them costs in hardware, data center and administration (server and DBA).
It is easy to get caught up in the “we’re only using Open Source for everything” approach. If you can do this successfully, that is terrific! But if it means that instead of focusing on your business application, you are also having to integrate all the pieces together and incorporate this into your application, this may not be the best solution for the long term.
There are many types of applications that NoSQL can solve, ranging from small and simple to very large and complex, and anywhere in between. You need to make sure you do your due diligence in evaluating the completeness of the solution and avoid getting caught up in the industry hype.
Embrace the change that NoSQL provides and realize the possibilities!
For more than a decade MarkLogic has delivered a powerful, agile and trusted Enterprise NoSQL database platform that enables organizations to turn all data into valuable and actionable information. Organizations around the world rely on MarkLogic’s enterprise-grade technology to power the new generation of information applications. For more information, please visit www.marklogic.com.