Microsoft blames last week's Azure outage on a configuration error
- 03 August, 2012 20:44
A system configuration mistake caused the outage that affected Windows Azure customers in western Europe last week, according to Microsoft.
As a result, Windows Azure, Microsoft's public cloud application hosting and development platform, was unavailable in that region for about two and a half hours on Thursday. Microsoft didn't say how many customers were affected.
At issue was a "safety valve" mechanism in the Azure network infrastructure designed to prevent cascading network failures. It does so by capping the number of connections that network hardware devices accept.
"Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity," wrote Mike Neil, Windows Azure general manager, in a blog post.
A sudden rise in the affected cluster's usage led to the "safety valve" threshold being exceeded, which generated a storm of network management alerts. "The increased management traffic in turn triggered bugs in some of the cluster's hardware devices, causing these to reach 100% CPU utilization impacting data traffic," Neil wrote.
At the time, Microsoft solved the problem by increasing the affected cluster's "safety valve" limits. To prevent the situation from recurring, Microsoft is patching the identified bugs in the networking hardware devices, and it is also improving the network monitoring systems, so that they can identify and address connectivity issues before they cause outages.
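The failure mode Neil describes, a connection cap left at its old value after capacity was added, can be illustrated with a toy model. The sketch below is not Microsoft's code: the `Device` class, the `limit_matches_capacity` check and all of the numbers are hypothetical, and counting one alert per rejected connection is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical network device with a connection cap (the "safety valve")."""
    connection_limit: int
    active: int = 0
    rejected: int = 0

    def accept(self) -> bool:
        """Accept one connection unless the cap has already been reached."""
        if self.active >= self.connection_limit:
            # Assumption: each rejection stands in for one management alert.
            self.rejected += 1
            return False
        self.active += 1
        return True


def limit_matches_capacity(device: Device, cluster_capacity: int) -> bool:
    """Validation check: the cap should cover the capacity the cluster now serves."""
    return device.connection_limit >= cluster_capacity


# Assumed numbers: the cap was sized for 1,000 connections,
# then capacity grew to 1,500 without the cap being raised.
device = Device(connection_limit=1000)
cluster_capacity = 1500

for _ in range(1200):  # a sudden rise in usage
    device.accept()

print(device.active, device.rejected)                     # 1000 200
print(limit_matches_capacity(device, cluster_capacity))   # False
```

In this model, raising `connection_limit` to the new capacity lets all 1,200 connections through, and a check like `limit_matches_capacity` run when capacity is added would flag the mismatch before traffic reaches the cap.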
Juan Carlos Perez covers enterprise communication/collaboration suites, operating systems, browsers and general technology breaking news for The IDG News Service. Follow Juan on Twitter at @JuanCPerezIDG.