Microsoft blames last week's Azure outage on a configuration error
- 03 August, 2012 20:44
A system configuration mistake caused the outage that affected Windows Azure customers in western Europe last week, according to Microsoft.
As a result, Microsoft's public cloud application hosting and development platform was unavailable in that region for about two and a half hours on Thursday. Microsoft didn't say how many customers were affected.
At issue was a "safety valve" mechanism in the Azure network infrastructure designed to prevent cascading network failures. It does so by capping the number of connections that network hardware devices accept.
"Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity," wrote Mike Neil, Windows Azure general manager, in a blog post.
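The kind of mismatch Neil describes, capacity growing while a device limit stays at its old value, can be illustrated with a minimal sketch. Everything in it is hypothetical: the `Device` fields, the assumed linear relationship between node count and connections, and the `validate_limits` helper are for illustration only and do not reflect Microsoft's tooling or Azure's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical network device with a 'safety valve' connection cap."""
    name: str
    connection_limit: int  # maximum connections the device will accept


def required_limit(node_count: int, connections_per_node: int) -> int:
    """Connections the cluster could need at its current capacity (assumed linear)."""
    return node_count * connections_per_node


def validate_limits(devices, node_count, connections_per_node):
    """Return the names of devices whose cap is below what current capacity requires."""
    needed = required_limit(node_count, connections_per_node)
    return [d.name for d in devices if d.connection_limit < needed]


# Illustrative numbers only: one limit was sized for 100 nodes, then capacity grew to 150.
devices = [Device("switch-a", 10_000), Device("switch-b", 15_000)]
stale = validate_limits(devices, node_count=150, connections_per_node=100)
print(stale)  # ['switch-a']
```

A check of this sort, run whenever capacity is added, would flag any device whose limit no longer matches the cluster it serves.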
A sudden rise in the affected cluster's usage led to the "safety valve" threshold being exceeded, which generated a storm of network management alerts. "The increased management traffic in turn triggered bugs in some of the cluster's hardware devices, causing these to reach 100% CPU utilization impacting data traffic," Neil wrote.
At the time, Microsoft solved the problem by increasing the affected cluster's "safety valve" limits. To prevent the situation from recurring, Microsoft is patching the identified bugs in the networking hardware devices, and it is also improving the network monitoring systems, so that they can identify and address connectivity issues before they cause outages.
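Microsoft did not describe how the improved monitoring works. As a rough illustration only, a monitor could warn when connection usage approaches a device's limit rather than waiting for the limit to be exceeded. The function name `headroom_alert` and the 80% warning ratio below are assumptions made for this sketch.

```python
def headroom_alert(current_connections: int, connection_limit: int,
                   warn_ratio: float = 0.8) -> str:
    """Classify usage against a connection cap (hypothetical thresholds)."""
    if connection_limit <= 0:
        raise ValueError("connection_limit must be positive")
    usage = current_connections / connection_limit
    if usage >= 1.0:
        return "exceeded"  # the 'safety valve' would trip here
    if usage >= warn_ratio:
        return "warning"   # early signal, before the cap is hit
    return "ok"


print(headroom_alert(850, 1000))
```

The point of the warning band is to give operators time to raise a limit or add capacity before a sudden rise in usage reaches the cap.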
Juan Carlos Perez covers enterprise communication/collaboration suites, operating systems, browsers and general technology breaking news for The IDG News Service. Follow Juan on Twitter at @JuanCPerezIDG.