When a fire destroyed the local office of Centrelink in Warrnambool, Victoria, the government agency’s smooth recovery turned a potential disaster into a textbook example of the value of business continuity planning.
Planning for a disaster is one thing, but putting that plan into practice is something most IT executives never want to have to do. For massive Commonwealth agency Centrelink, business continuity planning (BCP) efforts came good in the days after October 6, 2002, when fire savaged a row of shops in Warrnambool, Victoria that contained, amongst other things, the local offices of Centrelink and the Department of Human Services.
Fire crews came from as far afield as Ballarat to battle the blaze, which was among the largest in Warrnambool for years. But as soon as word of the conflagration spread up the ranks at Centrelink, emergency planners went into action. The physical office, as well as the IT infrastructure that linked it into the agency’s centralized data centre in the ACT, had to be replaced quickly and effectively to ensure the organization could continue providing its crucial services to the Warrnambool community.
In this case, good planning, effective local leadership and cooperation across the organization paid off for Centrelink: a caravan was set up in the shopping centre’s car park and opened for business the next day, providing continuity of service while a new location could be sourced and fitted. Centrelink provided access to its mainframe-based information systems through notebook PCs connected to the data centre via dial-up modem.
The speed and effectiveness of Centrelink’s response validated its ongoing efforts to ensure continuity of services, something that had always been an issue but that came to a head with millennium bug remediation efforts in the late 1990s. By that time, the growing complexity of the agency’s information systems had highlighted the need for flexible, robust BCP strategies.
As became clear in the Warrnambool blaze, mobile computing played a critical part in separating access to information from the physical premises in which terminals were located. “In terms of IT processes, one of the things we’ve been working to in the past few years is mobile computing,” says Nico Padovan, national manager, infrastructure, with Centrelink. “Accessing the mainframe is key, so we’ve worked to give staff functionality on mobile devices.”
Politically speaking, the smooth recovery from the Warrnambool blaze could not have come at a better time for Centrelink, which was in the midst of an Australian National Audit Office (ANAO) audit of its business continuity practices when the fire struck.
The audit, later documented formally in a report that was publicly released late last year, was generally positive about Centrelink’s BCP efforts. Although auditors suggested changes that would improve communication between different segments of the Centrelink organization, they concluded that the agency’s BCP strategy “effectively addresses the main elements of business continuity outlined in the better practice literature, namely crisis response, crisis management, interim processing and business process recovery”. They also commended the efforts of Centrelink staff, operating in an environment where business continuity is a way of life.
That’s a good endorsement from an organization that has been known to be unrelenting in its mission to identify problems in governmental policy and management processes. But for Centrelink, the positive audit report was simply a well-earned recognition of decades of planning and careful refinement of the procedures necessary to ensure the continuity and seamless management of the more than $55 billion in social services benefits paid to more than six million Australians every year.
Minimizing Points of Failure
As is typical with a network that stretches to more than 320 local customer service centres (it is Australia’s fourth largest IT network), the IT-reliant Centrelink has built its high-availability computing environment around a heavily centralized computing model.
Long-established IBM mainframes handle core customer and benefit information from their bases at a pair of highly redundant data centres located on opposite sides of the Australian Capital Territory. Select mission-critical services within each of the data centres have been designed for maximum availability: data is replicated using conventional fibre-optic WAN links and failover is provided at the application and operating system level.
The distance between the centres, 40 kilometres, is considered optimal by Centrelink planners. The facilities are far enough apart that each is isolated from power outages or natural disasters that might affect the other. At the same time, the distance is short enough that it is relatively easy to dispatch staff quickly to either site from anywhere within the ACT.
The data centres bristle with communications links, connecting the two facilities together and linking them both with far-flung Centrelink branch offices using conventional frame relay and ISDN connections. Although data may be stored locally in those offices using conventional network attached storage, that storage is effectively a temporary buffer to counter the slow WAN links: data at local offices is replicated back to the central mainframes on a daily basis for archiving and backup.
“Any customer information is backed up on a mainframe and eventually finds its way onto tape offsite,” says Pat Fegan, national manager, business and information protection with Centrelink. “We’re able to recover fairly quickly.”
This statement was proven by the speed with which the Warrnambool office was able to get up and running after the fire without loss of data. By the day after the disaster, current data was accessible from the data centre from WAN-connected notebooks. Centrelink was able to resume operational efficiency quickly while making arrangements for more permanent connectivity.
Centralization presents its own issues, however. For one, ensuring adequate availability of storage space remains an ever-present issue. Backing up so many remote offices’ data has accelerated Centrelink’s already considerable consumption of storage space, a growing need that has been met through continued investment in scalable storage area networks (SANs) and massive tape libraries for backup.
Security is a key requirement for Centrelink, which continually tests its network security and data integrity through regular penetration testing, external vulnerability assessments, and careful testing and certification of new applications or upgrades. Given the increasing intimacy of business systems and the Internet, such security precautions have become absolutely essential.
Meticulous backup and physical separation between the sites helped the data centres escape interruption from the raging Canberra bushfires of last summer. Nevertheless, that disaster led the ANAO to question Centrelink’s plans should both of its sites — and its offsite backup facility — be destroyed at the same time. Centrelink has recently been considering the implications of this scenario and has made its resolution a key priority for 2004.
In the long term, the agency’s key mainframe environment will benefit from an ongoing investment in fibre optic cabling linking its critical sites. Although the systems have previously run in a mirroring and failover configuration, addition of new fibre will allow Centrelink to reconfigure them in a Sysplex cluster that erases the logical separation between units.
Sysplex clusters logically unite multiple groups of mainframe CPUs, even those located at different physical sites, into a single logical core. An external clock, called the Sysplex Timer, provides minutely detailed transaction recording that allows for full recovery of transactions in the event of a service disruption. The timer coordinates transaction processing across many sites, allowing clusters to stretch seamlessly between facilities rather than having each site mirror an identical copy of the other.
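To see why a common time source matters for recovery, consider a deliberately simplified sketch in Python. It is not a description of how Sysplex works internally; it only illustrates the general idea that when every site stamps its transactions from the same clock, separate logs can be merged into one unambiguous replay order. The `LogEntry` type, the `merge_site_logs` function and the sample transactions are all hypothetical.

```python
import heapq
from typing import Iterable, NamedTuple


class LogEntry(NamedTuple):
    timestamp: float   # reading from the shared clock (hypothetical unit: seconds)
    site: str          # which site recorded the entry
    transaction: str   # opaque description of the work done


def merge_site_logs(*site_logs: Iterable[LogEntry]) -> list[LogEntry]:
    """Merge per-site logs, each already in time order, into one global sequence."""
    return list(heapq.merge(*site_logs, key=lambda entry: entry.timestamp))


# Two invented logs, one per site, each already sorted by the shared clock.
site_a = [LogEntry(1.0, "A", "open claim"), LogEntry(4.0, "A", "approve payment")]
site_b = [LogEntry(2.5, "B", "update address"), LogEntry(3.0, "B", "record income")]

# The merged list is the order in which transactions would be replayed.
replay_order = merge_site_logs(site_a, site_b)
```

If each site kept its own unsynchronized clock, the timestamps could not be compared across sites and the merged order would be unreliable, which is the problem a shared timer is meant to remove.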
As it goes live early this year, the new environment will create a single high-powered computing resource with the smarts to keep running even if one of its many subunits is compromised. In this way, fault tolerance will have become a core feature of the agency’s information systems. Down the road, extending the Sysplex cluster to a third site outside the ACT could theoretically help address the issue of simultaneous data centre destruction. Centrelink is weighing up its options and, as per the ANAO’s suggestions, is preparing comprehensive BCP documents for all of its key IT systems and applications, and the data centres as a whole.
Services . . . At Your Service
Because of its size and complexity, Centrelink’s BCP efforts necessarily involve a broad range of technologies and applications. In the past, the agency’s reliance on mainframes ensured that data integrity and availability were relatively straightforward. However, the introduction of open systems servers — which now include more than 400 Unix and Windows-based systems — has presented another challenge altogether.
As is typical of open systems deployments, each new application Centrelink introduces has its own associated servers, with storage and computing requirements typically handled on an application-level basis. Recognizing that this approach makes for considerable complexity in continuity planning, Centrelink has recently made service consolidation (as opposed to server consolidation) a major strategic focus.
Service consolidation takes a far more unified approach to business continuity than previous server-based application rollouts. By using server clustering capabilities that link together large numbers of high-availability servers, Centrelink is unifying its infrastructure and shifting requirements planning away from individual server platforms.
Rather, it is building out a flexible data centre utility model that has been engineered for business continuity from the start. This allows the IT organization to focus on providing a high-availability computing environment that is capable of accommodating new applications with an ever-expanding pool of resources.
The result of this shift is a single, logically consistent computing environment that can be used for all manner of applications — which are being written to common data interchange and Web services application standards that provide an abstraction layer between applications and the IT infrastructure on which they run.
Padovan describes the shift as a change from a technology-centric approach to one that is focused on the delivery of end-to-end services. “Previously, if a new project came along and it was important, we’d put platforms in each data centre to provide failover,” he says. “Under the service-centric model, which utilizes highly clustered mid-range and Windows servers operating in a Utility Data Centre model, as each project comes along we undertake a detailed capacity analysis and add the new workload to the clustered computing capability. This not only provides considerable improvements in availability, but also provides scalability benefits.”
This scalability relies on the inherent expandability of commoditized open systems servers, which offer in flexibility what mainframes offer in rock-solid reliability.
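The capacity analysis Padovan describes can be pictured with a small Python sketch. This is an illustration under stated assumptions, not Centrelink's actual method: it assumes capacity and load can be expressed in a single hypothetical unit, and it applies a simple rule that a new workload is accepted only if the pool could still carry everything after losing its largest server. The `Node` class and `can_absorb` function are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    capacity: float  # hypothetical capacity units


def can_absorb(nodes: list[Node], current_load: float, new_load: float) -> bool:
    """True if the pool can carry the combined load even after losing its largest node."""
    if len(nodes) < 2:
        # A single server has no redundancy, so nothing survives its loss.
        return False
    surviving = sum(n.capacity for n in nodes) - max(n.capacity for n in nodes)
    return current_load + new_load <= surviving


pool = [Node("srv1", 100.0), Node("srv2", 100.0), Node("srv3", 100.0)]
fits = can_absorb(pool, current_load=120.0, new_load=60.0)      # 180 <= 200
too_big = can_absorb(pool, current_load=120.0, new_load=100.0)  # 220 > 200
```

Under this rule, growth is handled by adding servers to the shared pool rather than by building a dedicated failover pair for each project.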
By engineering open systems environments to mainframe reliability standards, availability and IT continuity become far more predictable. Centrelink will be able to execute and monitor BCP efforts at a high level, rather than having to reinvent the wheel by designing application-specific redundancy every time a new system is installed or developed. Business analysts focus not on the technological details of a new server application, but rather on its functionality; IT strategists worry about delivering on their vision.
“The biggest business impact of this approach is not having to tackle [IT planning] on a case-by-case basis,” says Fegan. “We have inherent redundancy and inherent availability targets built into the systems. This is much more efficient from our perspective, and also provides much better services from a business continuity perspective.”
Keeping Up with the Business
Organizations as large as Centrelink are natural homes for a considerable level of bureaucracy. Within Centrelink’s BCP efforts, however, careful delineation of roles and responsibilities has long been critical in overcoming potential bureaucratic inefficiency and ensuring that continuity plans can be put into effect as quickly as possible.
Central management of crisis response is handled through Centrelink’s Crisis Command Centre, a formalized unit that works with specially formed Business Resumption Teams to coordinate responses to potential IT and business service interruptions. Centrelink also recently established both an IT Service Continuity Management (ITSCM) team — specifically focused on IT-related aspects of business continuity responses — and a Business Continuity and Emergency Management (BCEM) team to address emergency management strategies in other parts of the organization.
The ANAO commended the formalization of BCP response teams but cautioned that Centrelink must be careful to delineate and coordinate responsibilities between ITSCM and BCEM. Keeping those units’ roles aligned with overall business objectives is essential, as is delegating command authority to an appropriate body capable of coordinating the activities of the two.
Focusing BCP efforts into those units allows them to become BCP centres of excellence within Centrelink at large. The BCEM unit, for example, has compiled a database of best-practice BCP policies for consultation across Centrelink’s business units. The ANAO’s report lauded this move and recommended that the units become more proactive in disseminating BCP knowledge by creating formal business continuity management policy documents. The repository would also include the continuity plans from all new projects, allowing those plans to be compared and improved.
Sharing of BCP knowledge has long been a core part of Centrelink’s IT strategies. Multiple steering committees address various parts of the IT planning process, providing myriad links to business stakeholders that underscore the organization’s increasingly service-focused strategy. IT executives are involved in high-level decisions that ensure IT service continuity, which is essential to Centrelink’s very functioning as an organization, is addressed as a core goal of any new project.
One such project was the recent establishment of another business unit called The Data Shop, which has been created with the mission of providing information life cycle management capabilities across all of Centrelink’s data and paper assets.
The Data Shop’s first and biggest success was the recent completion of a massive digitization effort that used Tower Technology’s TRIM document management system to index millions of documents that Fegan says comprise “the bulk of our paper holdings”.
“This represents a considerable rationalization for us, and the records management system that goes with that has been considerably progressed.”
While high-level coordination is essential to delivering necessary reliability for such initiatives, some level of local procedural autonomy is also important. To this end, individual Centrelink area managers are responsible for execution of BCP policies for their geographical service areas. However, those policies are guided by established best practice guidelines dictated way up the food chain and modified to suit local requirements. At every point in the chain of command, there is an outlet for escalating a disaster response so that local offices are never left unsupported.
To support such a broad network of offices and local conditions, Centrelink has worked to maintain an effective and cohesive leadership team. The organization’s “guiding coalition” includes all senior executives, who meet every six weeks to discuss business issues facing Centrelink as a whole. Continuity of IT services is a major part of their discussions.
Combining such high-level discussions with myriad function-related IT committees — there is one, for example, that deals with the specific issue of offsite file storage — ensures that the implications of change are addressed through key inter-process communication at every level. Centrelink also works with key IT vendors such as IBM and StorageTek, and outside organizations such as Emergency Management Australia, to ensure that emergency management plans can accommodate future technological directions.
“Having everybody attuned to what’s happening as opposed to a siloed approach, where just a few people focus on an area and nobody else feels they have any responsibility for it, gives us high levels of redundancy and collaboration across the whole organization,” says Fegan. “These governance mechanisms are in place to ensure that the business people responsible for delivering services are kept well informed of progress.”
Towards an Uninterrupted Future
While the ANAO was generally complimentary about Centrelink’s ability to coordinate BCP efforts across organizations, it did identify several areas of the organization’s operations that need attention.
Foremost among these are Centrelink’s telecommunications capabilities, which are essential given the agency’s utter reliance on information systems. Given that it relies on commercially provided WAN services — usually a frame relay primary link with ISDN backup — control over those telecommunications services is out of Centrelink’s hands.
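The primary-plus-backup arrangement can be sketched in a few lines of Python. This is a minimal illustration of ordered failover in general, not a description of Centrelink's routers or its carrier's service; the `choose_link` function and the stand-in health checks are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each link is a (name, health_check) pair, listed in order of preference.
Link = Tuple[str, Callable[[], bool]]


def choose_link(links: Sequence[Link]) -> Optional[str]:
    """Return the name of the first link whose health check passes, or None."""
    for name, is_up in links:
        if is_up():
            return name
    return None


# Stand-in health checks; a real check would probe the carrier's service.
normal = choose_link([("frame-relay", lambda: True), ("isdn", lambda: True)])
degraded = choose_link([("frame-relay", lambda: False), ("isdn", lambda: True)])
outage = choose_link([("frame-relay", lambda: False), ("isdn", lambda: False)])
```

The last case is the one that matters for continuity planning: if both commercially provided links fail together, no amount of logic at the branch office can restore the connection.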
The ANAO concluded that the agency’s WAN was adequately protected, but Centrelink is already working to improve it through a tender to be awarded early this year under the auspices of its Data Network Replacement Program. That project will see the current WAN replaced with a managed network providing IP connections running natively over an Asynchronous Transfer Mode network core — ensuring both quality of service and high reliability. Better communications will also improve the speed at which data can be transferred from branch offices to the central data centres, thereby further improving backup procedures.
Overall, the ANAO’s audit lauded Centrelink’s BCP efforts and focused on the need for better processes for authoring and maintaining BCP documentation. These issues are being addressed, as are others Centrelink has prioritized for itself. The key thing to remember, says Fegan, is that there is no endgame for business continuity planning; rather, it is simply a part of the way the organization functions.
“You don’t want business continuity to be something you think about once in a year,” he says. “You never know what event may be just around the corner, so you need good organizational arrangement and clarity. You also need sound judgment and confidence in yourself and your people. Everyone loves to lend a hand, but that can be counterproductive unless it’s appropriately managed.”