Room to grow: Tips for data center capacity planning

Capacity planning needs to provide answers to two questions: What are you going to need to buy in the coming year? And when are you going to need to buy it?

To answer those questions, you need to know the following information:

Current usage: Which components can influence service capacity? How much of each do you use at the moment?

Normal growth: What is the expected growth rate of the service, without the influence of any specific business or marketing events? Sometimes this is called organic growth.

Planned growth: Which business or marketing events are planned, when will they occur, and what is the anticipated growth due to each of these events?

Headroom: Which kind of short-term usage spikes does your service encounter? Are there any particular events in the coming year, such as the Olympics or an election, that are expected to cause a usage spike? How much spare capacity do you need to handle these spikes gracefully? Headroom is usually specified as a percentage of current capacity.

Timetable: For each component, what is the lead time from ordering to delivery, and from delivery until it is in service? Are there specific constraints for bringing new capacity into service, such as change windows?

From that information, you can calculate the amount of capacity you expect to need for each resource by the end of the following year with a simple formula:

Future Resources = Current Usage x (1 + Normal Growth + Planned Growth) + Headroom

You can then calculate for each resource the additional capacity that you need to purchase:

Additional Resources = Future Resources - Current Resources

Perform this calculation for each resource, whether or not you think you will need more capacity. It is okay to reach the conclusion that you don't need any more network bandwidth in the coming year. It is not okay to be taken by surprise and run out of network bandwidth because you didn't consider it in your capacity planning. For shared resources, the data from many teams will need to be combined to determine whether more capacity is needed.
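As a rough illustration, the two formulas can be written as a small Python sketch. The function names and sample numbers are made up for this example, and headroom is assumed to have already been converted from a percentage of current capacity into the same units as usage.

```python
def future_resources(current_usage, normal_growth, planned_growth, headroom):
    """Future Resources = Current Usage x (1 + Normal Growth + Planned Growth) + Headroom.

    Growth rates are fractions (0.20 means 20 percent); headroom is in resource units.
    """
    return current_usage * (1 + normal_growth + planned_growth) + headroom


def additional_resources(future, current_resources):
    """Additional Resources = Future Resources - Current Resources.

    A zero or negative result means no purchase is needed for this resource.
    """
    return future - current_resources


# Hypothetical example: network bandwidth in Gbps.
current_capacity = 500.0
current_usage = 400.0
headroom = 0.25 * current_capacity  # assumed: 25 percent of current capacity

future = future_resources(current_usage, normal_growth=0.20,
                          planned_growth=0.10, headroom=headroom)
to_buy = additional_resources(future, current_capacity)
print(f"future need: {future:.0f} Gbps, additional: {to_buy:.0f} Gbps")
```

Running the same two functions over every tracked resource, including the ones expected to need nothing, gives the per-resource list the text asks for.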

Current usage

Before you can consider buying additional equipment, you need to understand what you currently have available and how much of it you are using. Before you can assess what you have, you need a complete list of all the things that are required to provide the service. If you forget something, it won't be included in your capacity planning, and you may run out of that one thing later, and as a result be unable to grow the service as quickly as you need.

What to track

If you are providing Internet-based services, the two most obvious things needed are some machines to provide the service and a connection to the Internet. Some machines may be generic machines that are later customized to perform given tasks, whereas others may be specialized appliances.

Going deeper into these items, machines have CPUs, caches, RAM, storage and network. Connecting to the Internet requires a local network, routers, switches and a connection to at least one ISP. Going deeper still, network cards, routers, switches, cables and storage devices all have bandwidth limitations. Some appliances may have higher-end network cards that need special cabling and interfaces on the network gear. All networked devices need IP addresses. These are all resources that need to be tracked.

Taking one step back, all devices run some sort of operating system, and some run additional software. The operating systems and software may require licenses and maintenance contracts. Data and configuration information on the devices may need backing up to yet more systems. Stepping even farther back, machines need to be installed in a data center that meets their power and environment needs. The number and type of racks in the data center, the power and cooling capacity and the available floor space all need to be tracked. Data centers may provide additional per-machine services, such as console service. For companies that have multiple data centers and points of presence, there may be links between those sites that also have capacity limits. These are all additional resources to track.

Outside vendors may provide some services. The contracts covering those services specify cost or capacity limits. To make sure that you have covered every possible aspect, talk to people in every department, and find out what they do and how it relates to the service. For everything that relates to the services, you need to understand what the limits are, how you can track them and how you can measure how much of the available capacity is used.

How much do you have

There is no substitute for a good up-to-date inventory database for keeping track of your assets. The inventory database should be kept up to date by making it a core component in the ordering, provisioning and decommissioning processes. An up-to-date inventory system gives you the data you need to find out how much of each resource you have. It should also be used to track the software license and maintenance contract inventory, and the contracted amount of resources that are available from third parties.

Using a limited number of standard machine configurations and having a set of standard appliances, storage systems, routers and switches makes it easier to map the number of devices to the lower-level resources, such as CPU and RAM, that they provide.

Terms to know

QPS: Queries per second. Usually how many web hits or API calls received per second.

Active Users: The number of users who have accessed the service in the specified timeframe.

MAU: Monthly active users. The number of users who have accessed the service in the last month.

Engagement: How many times on average an active user performs a particular transaction.

Primary resource: The one system-level resource that is the main limiting factor for the service.

Capacity limit: The point at which performance starts to degrade rapidly or become unpredictable.

Core driver: A factor that strongly drives demand for a primary resource.

Time series: A sequence of data points measured at equally spaced time intervals. For example, data from monitoring systems.

How much are you using now

Identify the limiting resources for each service. Your monitoring system is likely already collecting resource use data for CPU, RAM, storage and bandwidth. Typically it collects this data at a higher frequency than required for capacity planning. A summarization or statistical sample may be sufficient for planning purposes and will generally simplify calculations. Combining this data with the data from the inventory system will show how much spare capacity you currently have.

Tracking everything in the inventory database and using a limited set of standard hardware configurations also makes it easy to specify how much space, power, cooling and other data center resources are used per device. With all of that data entered into the inventory system, you can automatically generate the data-center utilization rate.
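The combination of inventory data and summarized monitoring data can be sketched in a few lines of Python. This is only an illustration: the nearest-rank 95th percentile is one assumed way to summarize the samples, and the installed capacity and usage figures are invented.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct percent at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def utilization_report(installed, usage_samples, pct=95):
    """Compare summarized usage (from monitoring) with installed capacity (from inventory)."""
    summarized = percentile(usage_samples, pct)
    return {
        "summarized_usage": summarized,
        "utilization": summarized / installed,
        "spare": installed - summarized,
    }


# Hypothetical figures: CPU cores installed vs. daily peak cores in use.
installed_cores = 2000
daily_peak_cores = [900, 1000, 1100, 1200, 1300, 1250, 1400, 1500, 1450, 1350]

report = utilization_report(installed_cores, daily_peak_cores)
print(report)
```

A lower percentile or a plain average would hide short spikes, so the choice of summary statistic is itself a planning decision.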

Normal growth

The monitoring system directly provides data on current usage and current capacity. It can also supply the normal growth rate for the preceding years. Look for any noticeable step changes in usage, and see if these correspond to a particular event, such as the roll-out of a new product or a special marketing drive. If the offset due to that event persists for the rest of the year, calculate the change and subtract it from subsequent data to avoid including this event-driven change in the normal growth calculation. Plot the data from as many years as possible on a graph, to determine if the normal growth rate is linear or follows some other trend.
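One possible way to carry out this adjustment is sketched below: subtract a known event-driven offset from the data that follows the event, then fit a straight line by least squares. The monthly figures, the size of the step and the assumption of a linear trend are all invented for the example; real data may follow a different curve.

```python
def remove_step(series, start_index, offset):
    """Subtract a persistent event-driven offset from every point at or after start_index."""
    return [v - offset if i >= start_index else v for i, v in enumerate(series)]


def linear_fit(series):
    """Ordinary least-squares line through (0, y0), (1, y1), ...; returns (slope, intercept)."""
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical monthly usage: steady growth of 5 units per month,
# plus a product launch in month 6 that added a lasting 30 units.
monthly_usage = [100 + 5 * m + (30 if m >= 6 else 0) for m in range(12)]

organic = remove_step(monthly_usage, start_index=6, offset=30)
slope, intercept = linear_fit(organic)
print(f"normal growth: {slope:.1f} units per month from a base of {intercept:.1f}")
```

Fitting the unadjusted series instead would fold the launch into the slope and overstate the normal growth rate.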

Planned growth

The second step is estimating additional growth due to marketing and business events, such as new product launches or new features. For example, the marketing department may be planning a major campaign in May that it predicts will increase the customer base by 20 to 25 percent. Or perhaps a new product is scheduled to launch in August that relies on three existing services and is expected to increase the load on each of those by 10 percent at launch, increasing to 30 percent by the end of the year. Use the data from any changes detected in the first step to validate the assumptions about expected growth.

Headroom

Headroom is the amount of excess capacity that is considered routine. Any service will have usage spikes or edge conditions that require extended resource usage occasionally. To prevent these edge conditions from triggering outages, spare resources must be routinely available. How much headroom is needed for any given service is a business decision. Since excess capacity is largely unused capacity, by its very nature it represents potentially wasted investment. Thus a financially responsible company wants to balance the potential for service interruption with the desire to conserve financial resources.

Your monitoring data should be picking up these resource spikes and providing hard statistical data on when, where and how often they occur. Data on outages and postmortem reports are also key in determining reasonable headroom.

Another component in determining how much headroom is needed is the amount of time it takes to have additional resources deployed into production from the moment that someone realizes that additional resources are required. If it takes three months to make new resources available, then you need to have more headroom available than if it takes two weeks or one month. At a minimum, you need sufficient headroom to allow for the expected growth during that time period.
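The minimum described here, enough headroom to absorb expected growth while new resources are on their way, can be estimated with a short sketch. It assumes growth is spread evenly over the year, which is a simplification, and the usage and growth figures are hypothetical.

```python
def lead_time_headroom(current_usage, annual_growth, lead_time_days):
    """Expected growth in usage during the deployment lead time.

    Assumes growth accrues evenly across a 365-day year (a simplifying assumption).
    """
    return current_usage * annual_growth * lead_time_days / 365


# Hypothetical service: 1,000 units in use, growing 30 percent per year.
for label, days in [("two weeks", 14), ("one month", 30), ("three months", 90)]:
    needed = lead_time_headroom(1000, 0.30, days)
    print(f"lead time of {label}: at least {needed:.1f} units of headroom")
```

Headroom for usage spikes comes on top of this figure; the sketch covers only the growth that accumulates during the lead time.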

Resiliency

Reliable services also need additional capacity to meet their SLAs. The additional capacity allows for some components to fail, without the end users experiencing an outage or service degradation. The additional capacity needs to be in a different failure domain; otherwise, a single outage could take down both the primary machines and the spare capacity that should be available to take over the load.

Failure domains also should be considered at a large scale, typically at the data-center level. For example, facility-wide maintenance work on the power systems requires the entire building to be shut down. If an entire data center is offline, the service must be able to smoothly run from the other data centers with no capacity problems. Spreading the service capacity across many failure domains reduces the additional capacity required for handling the resiliency requirements, which makes it the most cost-effective way to provide this extra capacity. For example, if a service runs in one data center, a second data center is required to provide the additional capacity, so about 50 percent of the total capacity is spare. If a service runs in nine data centers, a tenth is required to provide the additional capacity; in this configuration only 10 percent of the total capacity is spare.

The gold standard is to provide enough capacity for two data centers to be down at the same time. This permits one to be down for planned maintenance while the organization remains prepared for another data center going down unexpectedly.
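The arithmetic behind these figures can be checked with a brief sketch. It reads the percentages as the share of total installed capacity that sits idle as spare, and it assumes equally sized data centers that split the load evenly; the function names and the load figure are illustrative only.

```python
def spare_fraction(serving_sites, spare_sites=1):
    """Share of total capacity held as spare when spare_sites are added to serving_sites."""
    return spare_sites / (serving_sites + spare_sites)


def capacity_per_site(peak_load, total_sites, tolerated_failures):
    """Capacity each site needs so the survivors can carry peak_load after failures."""
    surviving = total_sites - tolerated_failures
    if surviving <= 0:
        raise ValueError("at least one site must survive")
    return peak_load / surviving


print(f"1 serving + 1 spare: {spare_fraction(1):.0%} of capacity is spare")
print(f"9 serving + 1 spare: {spare_fraction(9):.0%} of capacity is spare")

# Hypothetical peak load of 900 units spread over ten data centers.
print(capacity_per_site(900, total_sites=10, tolerated_failures=1))  # one site down
print(capacity_per_site(900, total_sites=10, tolerated_failures=2))  # two sites down
```

Tolerating two simultaneous outages raises the per-site requirement, but with many sites the increase stays modest.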

Timetable

Most companies plan their budgets annually, with expenditures split into quarters. Based on your expected normal growth and planned growth bursts, you can map out when you need the resources to be available. Working backward from that date, you need to figure out how long it takes from "go" until the resources are available.

How long does it take for purchase orders to be approved and sent to the vendor? How long does it take from receipt of a purchase order until the vendor has delivered the goods? How long does it take from delivery until the resources are available? Are there specific tests that need to be performed before the equipment can be installed? Are there specific change windows that you need to aim for to turn on the extra capacity? Once the additional capacity is turned on, how long does it take to reconfigure the services to make use of it? Using this information, you can provide an expenditures timetable.
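Working backward from the date the capacity is needed is simple date arithmetic, as the sketch below shows. The stage names, durations and target date are hypothetical, and the sketch ignores change windows, which would push the start date earlier still.

```python
from datetime import date, timedelta


def order_deadline(needed_by, lead_times_days):
    """Latest date to start the purchase process so capacity is ready by needed_by."""
    total_days = sum(lead_times_days.values())
    return needed_by - timedelta(days=total_days)


# Hypothetical lead times, in calendar days, for each stage of the process.
lead_times = {
    "purchase order approval": 14,
    "vendor delivery": 42,
    "installation and testing": 10,
    "service reconfiguration": 4,
}

needed_by = date(2025, 5, 1)  # assumed: capacity must be live before a May campaign
deadline = order_deadline(needed_by, lead_times)
print(f"start the purchase process by {deadline.isoformat()}")
```

Repeating this for each planned growth event yields the list of dates that feeds the expenditures timetable.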

Physical services generally have a longer lead time than virtual services. Part of the popularity of IaaS and PaaS offerings such as Amazon's EC2 and Elastic Storage is that newly requested resources have virtually instant delivery time.

It is always cost-effective to reduce resource delivery time because it means paying for less excess capacity to cover that delivery time. This is a place where automation that prepares newly acquired resources for use has immediate value.

Large, high-growth environments such as popular Internet services require a different approach to capacity planning. Standard enterprise-style capacity planning techniques are often insufficient. The customer base may change rapidly in ways that are hard to predict, requiring deeper and more frequent statistical analysis of the service monitoring data to detect significant changes in usage trends more quickly. This kind of capacity planning requires deeper technical knowledge. Capacity planners will need to be familiar with concepts such as QPS, active users, engagement, primary resources, capacity limit and core drivers.

Correlation coefficient: Describes how strongly measurements for different data sources resemble each other.

Moving average: A series of averages, each of which is taken across a short time interval (window), rather than across the whole data set.

Regression analysis: A statistical method for analyzing relationships between different data sources to determine how well they correlate, and to predict changes in one based on changes in another.

EMA: Exponential moving average. It applies a weight to each data point in the window, with the weight decreasing exponentially for older data points.

MACD: Moving average convergence/divergence. An indicator used to spot changes in strength, direction and momentum of a metric. It measures the difference between an EMA with a short window and an EMA with a long window.

Zero line crossover: A crossing of the MACD line through zero happens when there is no difference between the short and long EMAs. A move from positive to negative shows a downward trend in the data, and a move from negative to positive shows an upward trend.

MACD signal line: An EMA of the MACD measurement.

Signal line crossover: The MACD line crossing over the signal line indicates that the trend in the data is about to accelerate in the direction of the crossover. It is an indicator of momentum.
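To make the last few definitions concrete, here is a minimal Python sketch of an EMA, the MACD line, the signal line and zero line crossover detection. It uses the common recursive form of the EMA with a smoothing factor of 2 / (window + 1), seeded with the first data point; the window sizes of 12, 26 and 9 are conventional defaults borrowed from financial charting, not values given in the text, and the input series is synthetic.

```python
def ema(values, window):
    """Exponential moving average; older points get exponentially smaller weight."""
    alpha = 2 / (window + 1)  # assumed smoothing convention
    result = [values[0]]      # seed with the first data point
    for v in values[1:]:
        prev = result[-1]
        result.append(prev + alpha * (v - prev))
    return result


def macd(values, short_window=12, long_window=26, signal_window=9):
    """Return (macd_line, signal_line): short EMA minus long EMA, and an EMA of that."""
    short_ema = ema(values, short_window)
    long_ema = ema(values, long_window)
    macd_line = [s - l for s, l in zip(short_ema, long_ema)]
    return macd_line, ema(macd_line, signal_window)


def zero_crossovers(macd_line):
    """List (index, direction) pairs where the MACD line crosses zero."""
    crossings = []
    for i in range(1, len(macd_line)):
        prev, cur = macd_line[i - 1], macd_line[i]
        if prev <= 0 < cur:
            crossings.append((i, "up"))
        elif prev >= 0 > cur:
            crossings.append((i, "down"))
    return crossings


# Synthetic usage metric: steady growth followed by a steady decline.
usage = list(range(40)) + list(range(40, 0, -1))
macd_line, signal_line = macd(usage)
print(zero_crossovers(macd_line))
```

On real monitoring data the series would first be resampled to equally spaced intervals, since the windows are counted in data points rather than in units of time.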

This excerpt is from the book The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems Vol 2 by Thomas A. Limoncelli, Strata R. Chalup and Christina J. Hogan, published by Pearson/Addison-Wesley Professional. Reprinted with permission of the authors and publisher.