It was a normal Monday batch process at a well-respected global bank - until, that is, a critical back-office system failed. At first, IT administrators took it in stride. This wasn't the first time they'd had to recover lost data. But soon it became clear something more ominous was occurring: the bank's multi-terabyte database had become corrupted.
The administrators tried to switch to the hot offsite backup. No luck: it had mirrored the corruption. In the IT world, the situation was beginning to spell 'crisis'. Application teams and anyone else who could help suspended all other priorities to focus on the failure. Despite best efforts, the target recovery time - four hours - came and went without a clue as to the problem's root cause or fix.
It began to look like an episode of 'House', with IT managers anxiously brainstorming for more than a day, trying to diagnose the mysterious disorder in their dying patient. They knew a premature move could make matters worse.
To the outside world, the bank showed no sign of its grave condition. Customers continued trading, unaware that this high-profile institution was on the verge of losing millions, being investigated by regulators, and spoiling its good reputation.
Out of view of customers, the IT teams struggled to keep the patient alive. They scrambled to find a clean backup. They discovered the corruption had begun two days before the crash; it would take 36 hours to run a check on earlier copies of the data to see whether they were clean. They worked on updating the production system, rerunning transaction log files to catch up to the crash point, and processing the days of transactions that had since accumulated. Senior managers burned the midnight oil deciding which processes to give priority. By end of day Friday, the bank was uncertain it could open for business on Monday: it might be too risky to go more than five days without accurate settlement reconciliation. The bank alerted regulators. The team plugged away at catch-up processing over the weekend and, fortunately, completed it in time. By Monday the patient was out of danger and the bank was able to open its doors.
A Matter Of When, Not If
This bank is not alone. Indeed, similar near misses are increasingly common. One global retailer had its point-of-sale transactions freeze for 18 hours during the holiday shopping season; the cause was a storage-network software bug that was never precisely identified. And despite the happy ending at the global bank, its senior managers and IT teams were left troubled. Losses had been modest, but had the failure struck at year-end instead - when trading was running at full tilt as investors tidied their portfolios - the outcome could have been disastrous.
It turns out that an obscure conflict between a bug in packaged software and the server-management software - something nobody could have foreseen - had caused the potentially monumental disaster. This was a problem not addressed in any standard operations manual. For the bank's leadership, the unsettling truth was this: despite the bank's full compliance with internal policies and external regulations, and despite its readiness for the loss of a site or the failure of a major hardware component, it remained ill prepared for disaster recovery.
The bank is one of many enterprises and public institutions for which a combination of complacency, complexity and strained legacy systems is raising the risk of IT disasters to an alarming level. This is despite the fact that in recent years, disaster recovery and business continuity have gained visibility and significant funding. Improved as these practices are, they are no longer enough.
Many large businesses are now so dependent on the flawless operation of their systems that they are dangerously vulnerable to substantial, even irreparable, business damage. The likelihood of disaster is becoming more a matter of when than if.