A major IT systems crash at British Airways has highlighted how important it is for businesses to test backup systems and disaster recovery procedures to ensure that they work as planned.
The airline experienced what CEO Alex Cruz described as "a major IT systems failure" that, he said, affected all check-in and operational systems.
The failure on Saturday, May 27, resulted in the delay or cancellation of hundreds of flights, leaving thousands of passengers stranded at London's Heathrow Airport on a holiday weekend. Things were still not back to normal two days later.
Cruz described the cause of the failure as "a power supply issue," without going into detail.
A spokeswoman for British Airways elaborated.
"It was a power supply issue at one of our U.K. data centres. An exceptional power surge caused physical damage to our infrastructure and as a result many of our hugely complex operational IT systems failed," she said.
Before you ask: "We do have a back-up system," the spokeswoman said.
"But on this occasion it failed."
British Airways isn't the first airline to be laid low by a power failure. Delta Air Lines suffered similarly in August 2016 when a switch box carrying power into the company's headquarters failed, grounding flights worldwide.
A single point of failure had also brought down systems at Southwest Airlines the previous month, although on that occasion the problem was in a network router.
Although British Airways had more than one data center, it's not inconceivable that the same power surge could have damaged two sited close together.
Back in 2012, the company revealed that it had two data centers on sites right next to its Waterside global headquarters near Heathrow. Those sites housed 500 data cabinets in six halls, according to Sunbird, the company that supplied the airline's DCIM (data center infrastructure management) system.
So far, British Airways doesn't know why its backup plans failed. IT staff have spent the last two days getting systems up and running again, and aren't done yet.
"When the customer disruption is completely over, we will undertake an exhaustive investigation to find out the exact circumstances and most importantly ensure that this can never happen again," said the spokeswoman.
That was probably Delta's intention too -- until its IT systems went down again in January 2017, resulting in the cancellation of around 150 flights. That time around, the U.S. Federal Aviation Administration said "automation issues" had caused the flight cancellations.
Between ticket refunds and compensation payments, British Airways, like Delta before it, will be hundreds of millions out of pocket as a result of the failure of its backup systems. But even businesses that aren't trying to juggle hundreds of planes in the air would do well to test whether their backup, failover and business continuity plans work before disaster strikes.