Amazon Web Services (AWS) learned a lot of lessons from the outage that affected its Dublin data center, and will now work to improve power redundancy, load balancing and the way it communicates when something goes wrong with its cloud, the company said in a summary of the incident.
The post-mortem delved deeper into what caused the outage, which affected the availability of Amazon's EC2 (Elastic Compute Cloud), EBS (Elastic Block Store), RDS (Relational Database Service) and Amazon's network. The service disruption began Aug. 7, at 10:41 a.m., when Amazon's utility provider suffered a transformer failure. At first, a lightning strike was blamed, but the provider now believes that was not the cause, and is continuing to investigate, according to Amazon.
Normally, when primary power is lost, the electrical load is seamlessly picked up by backup generators. Programmable Logic Controllers (PLCs) ensure that the electrical phase is synchronized between generators before their power is brought online. But in this case one of the PLCs did not complete its task, likely because of a large ground fault, which led to the failure of some of the generators as well, according to Amazon.
To prevent this from recurring, Amazon will add redundancy and more isolation for its PLCs so they are insulated from other failures, it said.
Amazon's cloud infrastructure is divided into regions and availability zones. Each region -- for example, the data center in Dublin, which is also called the EU West Region -- consists of one or more Availability Zones, which are engineered to be insulated from failures in other zones in the same region. The thinking is that customers can use multiple zones to improve reliability, something that Amazon is working on simplifying.
At the time of the disruption, customers who had EC2 instances and EBS volumes independently operating in multiple EU West Region Availability Zones did not experience service interruption, according to Amazon. However, management servers became overloaded as a result of the outage, which had an impact on performance in the whole region.
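The multi-zone idea can be illustrated with a minimal Python sketch. It is not Amazon's tooling: the function names, the instance names and the zone labels are all hypothetical, and the round-robin placement is just one simple way to spread instances so that losing a single zone leaves some of them running.

```python
from itertools import cycle


def spread_across_zones(instance_ids, zones):
    """Assign each instance to a zone in round-robin order (illustrative only)."""
    zone_cycle = cycle(zones)
    return {instance_id: next(zone_cycle) for instance_id in instance_ids}


def surviving_instances(placement, failed_zone):
    """Return the instances that are not located in the failed zone."""
    return sorted(
        instance_id
        for instance_id, zone in placement.items()
        if zone != failed_zone
    )


# Hypothetical zone labels and instance names, chosen for the example.
zones = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
placement = spread_across_zones(["web-1", "web-2", "web-3", "web-4"], zones)

# Simulate the loss of one zone and list what keeps running.
survivors = surviving_instances(placement, "eu-west-1a")
print(survivors)
```

In this toy placement, the instances outside the failed zone continue to serve traffic, which mirrors what Amazon reported for customers running independently in multiple zones.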
To prevent this from recurring, Amazon will implement better load balancing, it said. Also, over the last few months, Amazon has been "developing further isolation of EC2 control plane components to eliminate possible latency or failure in one Availability Zone from impacting our ability to process calls to other Availability Zones," it wrote. The work is still ongoing, and will take several months to complete, according to Amazon.
The service that caused Amazon the biggest problem was EBS, which is used to store data for EC2 instances. The service replicates volume data across a set of nodes for durability and availability. Following the outage, the nodes started talking to each other to replicate changes. Amazon has spare capacity to allow for this, but the sheer amount of traffic proved too much this time.
When all nodes related to one volume lost power, Amazon in some cases had to re-create the data by putting together a recovery snapshot. The process of producing these snapshots was time-consuming, because Amazon had to move all of the data to Amazon Simple Storage Service (S3), process it, turn it into the snapshot storage format and then make the data accessible from a user's account.
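A rough way to picture the difference between the two cases is the toy model below. It is a sketch under stated assumptions, not a description of how EBS is implemented: the volume and node identifiers, the status labels and the `classify_volumes` function are invented for illustration. A volume that keeps at least one powered node can be re-replicated from it, while a volume whose nodes all lost power falls into the recovery snapshot path described above.

```python
def classify_volumes(replicas, powered_off):
    """Classify each volume by how many of its replica nodes still have power.

    replicas:    dict mapping a volume id to the set of nodes holding its data
    powered_off: set of node ids that lost power
    """
    status = {}
    for volume_id, nodes in replicas.items():
        alive = nodes - powered_off
        if alive == nodes:
            status[volume_id] = "healthy"            # no replica was affected
        elif alive:
            status[volume_id] = "re-mirror"          # copy from a surviving node
        else:
            status[volume_id] = "recovery-snapshot"  # every replica lost power
    return status


# Hypothetical layout: three volumes, each replicated on two nodes.
replicas = {
    "vol-a": {"n1", "n2"},
    "vol-b": {"n2", "n3"},
    "vol-c": {"n3", "n4"},
}
status = classify_volumes(replicas, powered_off={"n3", "n4"})
print(status)
```

The more volumes that land in the "re-mirror" group at once, the more replication traffic competes for the spare capacity, which is the bottleneck Amazon described.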
By 8:25 p.m. PDT on Aug. 10, 98 percent of the recovery snapshots had been delivered, with the remaining few requiring manual attention, Amazon said.
For EBS, Amazon's goal will be to drastically reduce the recovery time after a significant outage. It will, for example, create the capability to recover volumes directly on the EBS servers upon restoration of power, without having to move the data elsewhere.
The availability of the storage service was not just impacted by the power outage, but also by separate software and human errors, which started when the hardware failure wasn't correctly handled.
As a result, some data blocks were incorrectly marked for deletion. The error was subsequently discovered and the data tagged for further analysis, but human checks in the process failed and the deletion process was executed, according to Amazon. To prevent that from happening again, it is putting in place a new alarm feature that will alert Amazon if any unusual situations are detected.
How users experience an outage of this magnitude also depends on how well the affected company keeps them up to date.
"Customers are understandably anxious about the timing for recovery and what they should do in the interim," Amazon wrote. While the company did its best to keep users informed, there are several ways it can improve, it acknowledged. For example, it can accelerate the pace at which it increases the staff on the support team to be even more responsive early on, and make it easier for users to tell if their resources have been impacted, Amazon said.
The company is working on tools to do the latter, and hopes to have them ready in the next few months.
Amazon also apologized for the outage, and will give affected users service credits. Users of EC2, EBS and RDS will receive a credit that equals 10 days of usage. Also, companies that were affected by the EBS software bug will be awarded a 30-day credit covering their EBS usage.
The credits will be automatically subtracted from the next AWS bill, so users won't have to do anything to receive them.