Blog: Microsoft's Zune Meltdown: Three Lessons Developers Should Learn
- 09 January, 2009 09:36
- Comments
On December 31, all the 30GB Zune models turned into bricks because of a Leap Year firmware coding error. This quality assurance and testing debacle demonstrates three lessons every software developer should take to heart.
On the last day of 2008, every one of an older model of the Microsoft Zune MP3 player (the ones with 30GB of storage) locked up. The devices were back in operation again a day later, and Microsoft explained the cause of the trouble:
"A bug in the internal clock driver related to the way the device handles a leap year. The issue should be resolved over the next 24 hours as the time change moves to January 1, 2009. We expect the internal clock on the Zune 30GB devices will automatically reset tomorrow (noon, GMT)."
With that data, Zune and technical users have some idea of what happened in the "Z2K9" incident. Microsoft's Scott Hanselman wrote a very good technical analysis for programmers about the dangers of such edge cases, and apparently he's not the only one to cover the bug. (My thanks to Indrajit Chakrabarty for the pointer.)
However, aside from "how not to write code like that," there are three important things for developers and software QA professionals—and their managers—to take away from the experience.
This Was a Failure of the Software Development Process and QA Testing
It's great that the technical problem was so easily addressed ("wait a day"), but it's one heck of an embarrassment for Microsoft. I'm not talking about their PR issues per se, though Microsoft is still trying to live down the Red Ring of Death debacle with its XBox. However, Microsoft has a long history of, shall we say, a less-than-stellar reputation for quality, and they did not do themselves any favors with this incident. I feel especially sorry for the authors of the new book, How We Test Software at Microsoft (cue: pointing and giggling) and the many smart people I have met from the company. (They have great people. Really. Some of the smartest techies I've met. But somehow Microsoft doesn't seem to create a culture that demands quality.)
But the bottom line is that this problem was entirely preventable. As a London-based web developer pointed out to me, "Edge conditions such as year transitions on leap years really ought to be tested as a matter of course, and shouldn't be that difficult to do on devices where you can adjust the clock." The date problem really should have been spotted before it was checked in, he says; any sort of code review probably would have spotted the infinite loop possibility. So why wasn't it done? Why wasn't it caught?
I do understand the notion of "ship on time," and that some things get lost in the eternal desire to make a production date. Quality assurance testing is not the only victim. But this is a well-defined problem set with pretty darned obvious unit tests. (I won't be surprised if I get e-mail messages from QA Tools companies telling me that their products include such tests as a matter of course. Just post a response to this post, folks. In this context, it's fine.)
We all make mistakes. But the purpose of software engineering is to catch and fix errors before the product is released.
For further contemplation: Would your company's software development process have caught an error like this?
Join the CIO Australia group on LinkedIn. The group is open to CIOs, IT Directors, COOs, CTOs and senior IT managers.
- Bookmark this page
- Share this article
- Got more on this story? Email CIO
- Follow CIO on twitter
-
Australia's first 4G smartphone is the HTC Velocity 4G
-
Swedish e-commerce startup's execs linked to NYC sex crime
-
Face Time - Interview with John Brennan and Robert DiStefano
-
How to implement next-generation storage infrastructure for Big Data
-
Pfizer's Future Depends on IT Transformation
-
Optimizing Storage and Protecting Data with Oracle Database 11g
This paper focuses on key Oracle Database 11g capabilities that help IT departments better optimise their storage infrastructure, enabling administrators to deliver a cost-effective, scalable data management platform that is easy to manage, reduces costs, and protects data while continuing to deliver the performance and availability that today’s businesses require. -
Optimised Data Protection for VMware® Environments with Symantec NetBackup™ Appliances
VMware® remains the most widely deployed virtualisation solution. The explosive growth of VMware infrastructure in organisations both large and small has enabled corporations to more fully exploit their hardware investments. With multiple virtual machines running on few physical hardware nodes, hardware costs are reduced, as well as space, power, and cooling requirements. This white paper discusses in more detail how VMware environments can be protected with the NetBackup appliances. Read more. -
Look both ways - Protecting your data with content inspection
Today’s threat environment is as dynamic as the business world in which we operate. As the communications channels we use continue to proliferate and evolve, so too have the vulnerabilities. Finding the right balance between ensuring the security of sensitive data, enabling the free flow of information and making full use of the latest web-based technologies can be a challenge. Deep content inspection is a vital layer in any unified information security strategy, helping organisations to take control over their information assets while proactively protecting against malware and data leakage. Read on.

















Comments
Post new comment