On Nov. 19, 2014 the IT department of a Texas contracting company started getting reports that the Microsoft Office 365 cloud-based email system was unavailable to its employees. Users couldn't get email on their phones or via Outlook. As the day rolled on some users' email came back, others didn't. When US workers signed off, international employees started reporting similar issues. For some users, email was out for 24 hours.
After the outage IT leaders huddled and filed a claim with Microsoft for a breach of the company's service-level agreement (SLA), which guarantees that Office and other Microsoft Online services will be available 99.9% of a given month. If the service is available for less than that, a 25% credit can be issued to customers. But the response they got from Microsoft surprised them: Web access was still available so the service was not technically unavailable and therefore it was not a breach of the SLA.
"The number of people willing, able and knowledgeable enough to use that option is pretty low," said a senior member of the IT staff, who requested anonymity so he doesn't sour his relationship with Microsoft. In response, the contracting company has since educated employees on how to use web email access when Outlook is down.
In response to a request for comment on the situation, Microsoft issued a statement saying it strives for "an always available service" and that SLAs are in place to provide financial backing for that commitment. If a Microsoft online service is available for less than 95% of a given month, customers can get a full statement credit for that period.
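To make the credit arithmetic concrete, here is a minimal Python sketch of the two thresholds mentioned so far: a 25% credit when monthly availability falls below 99.9%, and a full credit when it falls below 95%. The function names are hypothetical, the actual credit schedule may contain tiers not mentioned in this article, and the calculation assumes one outage affecting the whole service, which is simpler than how a provider measures downtime.

```python
def monthly_uptime_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Share of the month, in percent, during which the service was available."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def service_credit_pct(uptime_pct: float) -> int:
    """Credit tier for a given monthly uptime.

    Only the two thresholds named in the article are modeled here;
    treat the tiers as illustrative, not as the provider's full schedule.
    """
    if uptime_pct < 95:
        return 100  # full statement credit
    if uptime_pct < 99.9:
        return 25
    return 0


# Hypothetical case: a 24-hour outage in a 30-day month.
minutes_in_month = 30 * 24 * 60
uptime = monthly_uptime_pct(minutes_in_month, 24 * 60)
print(f"uptime {uptime:.2f}% -> credit {service_credit_pct(uptime)}%")
```

A day-long outage leaves uptime well under 99.9% but well above 95%, so under these assumptions it would land in the 25% tier, provided the outage counts as downtime under the SLA's definition at all.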
This episode illustrates the need to understand all the terms and conditions in cloud SLAs. Enterprise agreements can be complicated, so here are 10 things to watch out for when reviewing SLAs for Microsoft Office 365 (the SaaS offering) and Microsoft Azure (which includes IaaS and PaaS components). Many of the tips apply to other cloud platforms too, such as AWS, but they are written specifically with Microsoft cloud services in mind. See Microsoft's list of Azure IaaS SLA uptime guarantees here; the online services SLA can be found here.
This may seem obvious, but many people don't actually read the contract, just like they skim over End User License Agreements. "I run into an amazing number of people who zip through a PowerPoint and then sign the contract," says Paul DeGroot, who works as a consultant at Pica Communications advising clients on Microsoft licensing. If you don't understand something in the contract after analyzing it, ask for help. The key to understanding your SLA is reading it.
Contracts can be confusing though. DeGroot says sometimes relevant information is in a supporting document. SLA parameters can be outlined in one section of a document but the contract can be subject to terms that are defined in other literature. Make sure to read the entire contract, including any supporting documents.
Some providers will automatically credit customers when there is an outage, others will not. It is imperative that customers report any outages they believe breach the SLA. DeGroot has run into instances where customers experienced a multi-day outage and were sure their bill would simply reflect the event with a credit. But if you don't document and report it, you don't have any way to prove you experienced downtime. If you have a problem, record it, inform your provider immediately and file a claim for the breach of an SLA.
Microsoft requires that customers submit an SLA breach claim to customer support by the end of the calendar month after the event has happened. (So for example if an incident happens in mid February, the customer has until the end of March to report it.) The claim must include: a detailed description of the incident; duration of incident; number of users or sites impacted; description of your attempts to remedy the situation.
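The reporting window described above is easy to miscount, so here is a small Python sketch that computes it. It assumes the deadline is the last day of the calendar month following the incident, as in the February-to-March example; `claim_deadline` is a hypothetical helper, and the contract itself is the authority on the exact cutoff.

```python
import calendar
from datetime import date


def claim_deadline(incident: date) -> date:
    """Last day of the calendar month after the month of the incident."""
    # Roll over to January of the next year for December incidents.
    year = incident.year + (1 if incident.month == 12 else 0)
    month = 1 if incident.month == 12 else incident.month + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


print(claim_deadline(date(2015, 2, 14)))  # 2015-03-31
```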
Many of Microsoft's services come with a 99.9% uptime guarantee (three-nines). That sounds good. But being up for 99.9% of the year still allows for roughly 8 hours and 45 minutes of downtime each year with no breach of the SLA. How would you feel if your workload were unavailable for 8 hours in a single day? This uptime calculator can help users predict how much downtime they should expect from their provider based on their SLA uptime guarantee.
Each individual service can have its own SLA uptime guarantee. For example, Microsoft Azure VMs have a 99.95% uptime guarantee (if deployed across two Availability Sets; more on that later) and the SQL database has a 99.9% uptime guarantee. Most Microsoft Online SaaS products come with a 99.9% uptime guarantee too. But 99.9% uptime allows for up to 43 minutes of downtime to occur in a month without breaching the SLA.
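The downtime figures quoted here follow directly from the uptime percentage. A minimal Python sketch of that arithmetic, assuming a 30-day month and a 365-day year (the helper name is hypothetical):

```python
def allowed_downtime_minutes(uptime_pct: float, period_hours: float) -> float:
    """Minutes of downtime a period can absorb without falling below uptime_pct."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return (1 - uptime_pct / 100) * period_hours * 60


HOURS_PER_30_DAY_MONTH = 30 * 24
HOURS_PER_YEAR = 365 * 24

for pct in (99.9, 99.95):
    monthly = allowed_downtime_minutes(pct, HOURS_PER_30_DAY_MONTH)
    yearly = allowed_downtime_minutes(pct, HOURS_PER_YEAR)
    print(f"{pct}% uptime: {monthly:.1f} min/month, {yearly / 60:.2f} h/year")
```

At 99.9% this works out to about 43 minutes per month and just under 8.8 hours per year, which matches the figures in the text; at 99.95% the monthly allowance is about half that.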
As Troy Hunt, a blogger and Microsoft expert, points out in this piece, those downtime events do not have to occur at the same time for the provider's SLA to remain intact. So, for example, if you have a system that relies on Azure VMs, a SQL database and Azure storage, then on the first day of a month an Azure VM could go down for 21 minutes and bring your workload down. The next day Azure SQL could go down for another 42 minutes and bring the application down again. Both of those outages would still be within the terms of each service's SLA, even though the application was down for more than an hour. For more on this, blogger Brent Stineman explores how to calculate aggregate SLAs across multiple services here.
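One common way to estimate the combined exposure is to multiply the individual availabilities, which assumes the services fail independently and that the workload needs all of them at once. The Python sketch below applies that to the example; the Azure storage figure is an assumption for illustration, since this article does not state it, and `composite_availability` is a hypothetical helper.

```python
from math import prod


def composite_availability(availabilities) -> float:
    """Availability of a workload that needs every listed service up at the same time."""
    return prod(availabilities)


services = {
    "Azure VMs": 0.9995,      # 99.95%, from the article
    "SQL database": 0.999,    # 99.9%, from the article
    "Azure storage": 0.999,   # assumed for illustration; not stated in the article
}

combined = composite_availability(services.values())
minutes = (1 - combined) * 30 * 24 * 60  # 30-day month
print(f"combined availability: {combined:.4%}")
print(f"downtime allowed in a 30-day month: {minutes:.0f} minutes")
```

Under these assumptions the workload as a whole can be down considerably longer than any single service's allowance before any individual SLA is breached.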
One of the mantras of cloud computing is to prepare for failure. In fact, some cloud providers, including Microsoft and AWS, mandate that customers architect their systems to tolerate failure in order to meet the terms of the SLA. AWS, for example, requires that virtual machines be deployed across multiple Availability Zones (which are different data centers in AWS's cloud), and both copies of the VM must be unavailable for the SLA to be breached. Microsoft uses the term Availability Sets instead of Availability Zones, but it's the same idea. Customers must heed the best-practice architectures to ensure their systems comply with the terms of the SLA.
One thing to keep in mind is that if you architect your system to be fault tolerant and to fail over to another VM or Availability Set, that action itself could cause problems, such as a reboot. If your system goes down because it was not set up to handle a migration to a new set of VMs, then that failure is not the provider's fault and will not count as a breach of the SLA. Tools from Netflix's Simian Army, such as Chaos Monkey and Chaos Gorilla, can help AWS customers test how well their systems tolerate outages.
In the example of the Texas company above, IT staff believed the outage was Microsoft's fault, which it was. But the service wasn't technically unavailable because web access was still an option, so it didn't count against the SLA. So if your app goes down, is it really your vendor's fault? Is the service unavailable from all access points? Similarly, sometimes cloud services go down but it's not the vendor's fault. For Microsoft's SLA to be breached the service must be down because of "circumstances within Microsoft's control," the company states. When an outage occurs, check to see if something on your end caused it. Is your network connection to the cloud good, for example? Customers have to prove that their vendor was at fault and the service was truly down in order to be compensated for an SLA breach. Service health dashboards, where Microsoft and AWS report which services have been unavailable, are a helpful tool for determining whether your provider has had an outage.
The cloud is a fast-moving industry and offerings from providers can change. When offerings change, so too can the SLAs. Typically SLAs will outline whether a provider has to notify customers of a change to the service or SLA, or whether customers should be prepared for a service disruption. But whether customers will be informed of changes can vary from provider to provider and from service to service. If a sudden change to a service would impact your workload, check that your provider will notify you of such changes.
Microsoft will notify customers of what it calls "disruptive changes" to its core products, notes Donald Retallack, a research vice president at Directions on Microsoft, a consultancy. Microsoft defines "disruptive changes" as: "change(s) where a customer or administrator is required to take action in order to avoid significant degradation to the normal operation of the online service." Microsoft promises to inform customers six months in advance of a disruptive change to its Dynamics CRM platform, for example. But other non-disruptive changes can occur without Microsoft notifying customers.
It is one thing for a service to go down for an unexpected reason, but sometimes the cloud can go down because the service providers take it down. Verizon, for example, had an almost 48-hour planned outage earlier this year. Outages like that can mean the service is down, but it doesn't count against the SLA. Customers can ask their provider to ensure they will be informed of any planned downtime.
Many providers offer free tiers of service or other products that are in preview. Typically, those free and preview services are not covered by SLAs. So feel free to use them, but make sure you understand the terms and the risks before relying on them for critical functions.