The 99.999 per cent Solution
- 04 April, 2000 14:41
Nobody misses the network - until it's down. When a corporate network crashes, its damaging ramifications often sweep across national headlines, wreaking havoc on the company's bottom line and throwing a fiery spotlight on its CIO.
Indeed, too often the blame for what is mechanical error is dumped on the CIO, the chief architect, who has to spin an explanation for why things went horribly wrong. Just ask Maynard Webb, former CIO at Gateway and now president of eBay Technologies. He's in charge of the network that drives the Silicon Valley-based online auction house.
Last year, before Webb's arrival, eBay experienced a 22-hour outage that deflated the company's market cap by $US2.25 billion in a single day and pinched quarterly sales. "There are huge challenges facing any company growing as fast as us," says Webb, who was brought on board to ensure such debacles won't happen again. "We got a little behind the curve in our capacity and operational excellence."

eBay's infamous black eye is today's rallying cry for the importance of network uptime in an accelerated electronic world. Companies are spending exorbitant amounts of capital and manpower to strengthen their networks. Still, the potential for a network outage looms on the horizon like a deadly offshore hurricane. No network is safe from disaster; however, CIOs can learn a few tips from companies that have endured head-on collisions with network downtime, and thus improve their chances of survival.
Previously, CIOs relied on systems management tools from Tivoli Systems, Computer Associates International, Hewlett-Packard and others to safeguard their networks, but the stakes have changed with e-commerce. Suddenly, networks are wider and subject to more weak links. Network downtime is now immediately apparent to customers and business partners. The cost of network downtime has risen exponentially. And it's difficult to measure downtime's effects on sales, market branding, customer loyalty and competition - all moving targets in the new Internet economy.
San Jose, California-based market researcher Infonetics predicts US corporations will spend $US11.2 billion on network and systems management products in 2003, spurred by trends toward virtual private networks, network security and e-commerce.
The stakes are high for everyone, not just for eBay and the rest of the dotcoms. Infonetics surveyed companies with average annual revenues of $US3 billion and found that they lost an average of $US4 million annually because of local area network downtime. For wide area networks, companies reported an average annual loss of $US3.3 million to downtime. (These figures primarily represent lost employee productivity and don't include losses due to customers being unable to access business services.) Another key finding of the survey: companies are allocating more time toward planning and designing their networks. On average, respondents expect to increase their networking staffs to a total of 48 people (up from 39) by mid-2001. "We're seeing people becoming proactive with their networks," says Mike McConnell, analyst at Infonetics. "That's great because it'll save them in the long run."
Dangers lurk inside every cubicle and every networking device. Attaining 99.999 per cent availability - five nines is network uptime's equivalent of the Holy Grail - is a tedious and toilsome lifetime pursuit. Failure points include faulty software in routers and switches, increased bandwidth traffic that crashes servers, human errors, configuration problems, power failures, major carrier outages and even the applications that run on the networks.
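To put the five-nines target in perspective, a quick back-of-the-envelope calculation (standard availability arithmetic, not a figure from the article) shows how little downtime each extra nine allows per year:

```python
# Allowed downtime per year at a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year that a given availability permits."""
    return MINUTES_PER_YEAR * (1 - availability)

for label, avail in [("three nines", 0.999),
                     ("four nines ", 0.9999),
                     ("five nines ", 0.99999)]:
    print(f"{label}: {allowed_downtime_minutes(avail):7.1f} minutes/year")
```

At 99.999 per cent, the annual downtime budget is barely five minutes - less than the span of a single router reboot.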
It was the application more than the network that bedevilled eBay. But the distinction doesn't matter to CNN or the auction community, Webb found. After more than 20 years as an IT executive at various high-tech companies including Bay Networks, Quantum, Gateway and IBM, Webb accepted the chief technology post at eBay during a precarious time. eBay was in the midst of networking throes that threatened the company's future. "We are confident Maynard possesses the vision and hands-on experience to help ensure eBay's site stability moving forward and to help scale eBay's system," said CEO Meg Whitman, at the time of Webb's hiring.
Webb, too, felt the immense pressure in his new role. He recalls visiting eBay's headquarters shortly after the company's network went down. "I pulled up to eBay and saw CNN and CNBC outside, and it gave me a sense of the kind of attention any blip might receive," says Webb. "I hadn't been announced yet and no one knew who I was, so I walked through the front door like a salesperson - nobody pays attention to those guys."
In Webb's first few months, he implemented a warm backup solution - one that allows network recovery within four hours - by increasing redundancy on NT servers, routers, switches and RAID drives. He then built a hot backup solution - basically, a running duplicate of major systems. This reduced his recovery window to within an hour. Webb assembled an eight-person IT group dedicated to evaluating and implementing next-generation networking architectures.
The IT group's biggest challenge was adding resiliency to eBay's crown jewel: a massive database that stores 4.2 million auction items appearing concurrently on the Web site.
However, this impressive number hides the fact that a lone database corruption can knock out the entire network. The server driving the database was also fast approaching architectural limits. "Our database is a pretty big single point of failure," Webb says. "We just can't throw more hardware at it anymore."
After evaluating possible fixes, such as migrating the database to a more scalable hardware platform or tweaking the Web application to alleviate some of the pressure, Webb decided to split the database.
He created separate databases to handle separate auction categories such as antiques and sports memorabilia. By deploying multiple databases, Webb bought back much-needed headroom.
If eBay encountered a database corruption, chances are only one database would go down - and hopefully, only the people participating in auctions hitting the database would notice. "I like my life to be very boring," Webb says.
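The article doesn't detail eBay's implementation, but the partitioning idea is easy to sketch: route each auction category to its own database shard, so a corruption takes down only one slice of the listings. All names below are hypothetical:

```python
# Hypothetical sketch: give each auction category its own database shard,
# so corrupting one shard takes down only that slice of the listings.

SHARD_MAP = {
    "antiques": "db-antiques",
    "sports-memorabilia": "db-sports",
}
DEFAULT_SHARD = "db-general"  # catch-all for categories without a dedicated shard

def shard_for(category: str) -> str:
    """Return the database shard that stores auctions in this category."""
    return SHARD_MAP.get(category, DEFAULT_SHARD)

def reachable_categories(healthy_shards: set) -> list:
    """Categories still online given the set of healthy shards."""
    return [c for c, s in SHARD_MAP.items() if s in healthy_shards]
```

The pay-off is exactly the one Webb describes: if "db-antiques" is corrupted, only antiques bidders notice, while the rest of the site keeps trading.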
The Bane of Bandwidth
Of course, it takes more than mere database partitioning to prevent network downtime. SBC Communications, a telecommunications service provider, has made networking a way of life. More than 200,000 SBC employees send 66 terabytes of data over a global network every month, including 16 million e-mail messages a week. The traffic flies over an ATM backbone linking 40 major network centres around the world. All told, SBC's networking hardware consists of 4400 routers, 6300 hubs and 1100 switches. "Our internal business network is truly the lifeline of our corporation," says Ed Glotzback, CIO at SBC. "It connects our applications, our data centres and our employees."
Such a massive network is bound to run into problems. Glotzback admits his company has experienced a few outages. The culprit is often traced back to a dual failure in a redundant system, an initial network design flaw or quirky third-party software. Consequently, SBC's IS staff works closely with vendor development groups and conducts rigorous tests on individual products and integrated systems. These efforts, albeit costly and time-consuming, have helped reduce potential network threats, Glotzback says.
Companies also overlook basic physical requirements when designing a network, according to Glotzback. Properly grounding data cabinets and other equipment can save hours of downtime and lost productivity, he says, as can utilising multiple commercial power feeds.
It's not a perfect science, either. While SBC uses only the Internet Protocol for wide area networking - limiting Novell's IPX and Apple Computer's AppleTalk to local segments - other companies prefer utilising multiple wide area networking protocols and platforms to act as safety nets.
Sandy Goldstein, CIO at Airgas, a Pennsylvania-based distributor of specialty gases, learned about the advantages of many protocols the hard way. Airgas relied on MCI Worldcom's frame relay network to connect 700 branch offices throughout North America. In August 1999, the frame relay network went down.
"We had 10 days of zero access to our network," recalls Goldstein. "My employees were demoralised, my customers angry." Airgas's contingency plan centred on slow remote dial-up and other arcane technology, making it almost impossible for employees to perform their duties.
Moreover, Airgas's field employees, dependent on handheld Sprint PCS telephones to conduct business transactions and check on orders, couldn't get a dial tone. "If you're doing business with MCI Worldcom, keep in mind other carriers are affected," says Goldstein. "You're never really sure who's sharing whose wire." Even internal redundant systems can be rendered useless if they utilise the same egress or share the same power supply.
Airgas survived the outage, and MCI Worldcom issued an official apology. Not surprisingly, Airgas's board of directors demanded Goldstein develop a more flexible backup strategy.
Balancing risk and exposure, Goldstein decided to keep MCI Worldcom as his sole long-distance service provider. Having more than one carrier isn't a panacea, because it increases the management challenge; and if one went down, half of his workforce would still be exposed. But for localised networking, Goldstein did the opposite: he ordered every branch office to contract with at least two ISPs for redundancy.
Goldstein also upgraded routers and deployed modems supporting multiple technologies such as Internet, serial tunnelling, ISDN and DSL lines. "We no longer rely solely on frame relay," he explains. "Should the frame go down again, we'll switch over to another networking platform." The end result of Airgas's wake-up call was "half a million dollars in hardware investments, an additional $US80,000 in new dedicated communications, some more extra expenses and an insurance policy", Goldstein says.
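Goldstein's layered approach can be sketched as an ordered failover list - try the primary frame relay link first, then fall back through the alternate platforms. This is an illustrative sketch, not Airgas's actual configuration:

```python
# Illustrative sketch (not Airgas's actual configuration): branch offices
# try the primary frame relay link first, then fall back in order.

FALLBACK_ORDER = ["frame-relay", "isdn", "dsl", "dial-up"]

def pick_link(link_is_up: dict):
    """Return the first healthy link in preference order, or None.

    link_is_up: mapping from link name to a boolean health flag.
    """
    for link in FALLBACK_ORDER:
        if link_is_up.get(link, False):
            return link
    return None  # total outage: no path to the branch office
```

Should the frame go down again, the next call simply lands on ISDN or DSL instead of stranding the office.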
All the hardware and software technology investments in the world won't safeguard a network against human errors. This sentiment is echoed by John Carrow, CIO at Unisys, who believes most outages are caused by employee blunders.
Case in point: Unisys's financial staffers were busily closing the August books last year - which was also the end of the company's third quarter - when the system crashed. A Unisys employee conducting a maintenance check at a centre in Michigan unknowingly disabled the local area network of the finance department at the company's headquarters in Blue Bell, Pennsylvania. The network was shut down for several hours.
It was a simple mistake, concedes Carrow, adding that because of what the employee learned, "that person will never make that mistake again". Unisys has spent considerable effort developing training programs, discipline programs and configuration control methods, along with a sort of cross-training for employees from disparate specialties. "Companies can't afford to have telecom-only or data-centre-only people anymore," he says. "You need people thinking end-to-end systems all the time, especially with the interdependency you have today."
Network outsourcer Intria-HP, based in Toronto, also emphasises the importance of the human element and its impact on high-availability networking. The company, a joint venture between Intria and Hewlett-Packard, has a network that supports 14,000 branches, 2000 local area networks and 120,000 point-of-sale terminals. The network has three central sites that triangulate information for backup and recovery purposes. Remote devices ping the network and execute representative business functions over the Internet to measure latency from the end user's point of view.
The most important factor in keeping the network up and running is the people, says Mike Somerville, vice president of technology planning and technical services at Intria-HP. The company encourages its employees to find and fix network problems no matter how trivial they might seem.
For instance, Intria-HP's monitoring software revealed that an ATM backbone was experiencing a slight service degradation. The backup solution, which was another ATM backbone, didn't kick in. In a search akin to hunting for a needle in a haystack, Intria-HP employees dug until they found the error: a circuit was dropping every three milliseconds - each drop too brief to register as a problem and alert the secondary system to take over.
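One way a failover monitor can catch that failure mode - triggering on a rapid series of brief drops as well as on a single sustained outage - is sketched below. This is a generic illustration, not the vendor's actual product:

```python
# Generic sketch (not the vendor's product): a failover monitor that triggers
# on a sustained outage OR on a rapid series of brief drops, so a circuit
# flapping every few milliseconds still forces the backup to take over.

def should_fail_over(drops, now_ms,
                     sustained_ms=500, flap_count=10, flap_window_ms=1000):
    """Decide whether the backup link should take over.

    drops: list of (start_ms, duration_ms) pairs for observed link drops.
    """
    # Classic trigger: any single drop lasting past the threshold.
    if any(duration >= sustained_ms for _, duration in drops):
        return True
    # Flap trigger: too many brief drops packed into the recent window.
    recent = [start for start, _ in drops if now_ms - start <= flap_window_ms]
    return len(recent) >= flap_count
```

A detector with only the sustained-outage check would have slept through eBay-style flapping; the second test is what the episode argues for.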
After manually starting the backup, Intria-HP staffers called the vendor of the automation software. They asked the ISV to fix its product. At first, the software developer claimed that its product operated within normal parameters. Intria-HP countered that it wouldn't tolerate any downtime in any product that it uses. In the end, the software developer conceded. "Our customers don't want to see heroics, they want things that work," says Somerville. "Sometimes you need weight to get carriers and vendors to deal with you, perhaps in a different way from what they'd prefer."
Occasionally, Intria-HP conducts a unique fire drill. Upon hearing an alarm, employees at a central site are ordered to leave the building and board buses. They're transported to another site where they transfer the network to the new facility and maintain it for up to 24 hours. "We're refining high availability all the time," says Somerville. "High availability is an attitude, and we have the same attitude as air traffic control."
If building and maintaining a network is comparable to air traffic control, then maybe such a feat is better left to experts. Network service providers AT&T Solutions, Digital Island, Exodus, MCI Worldcom and others have knitted together global networks, and now pitch them to companies as highly specialised services. Internet companies that lack the resources to build networks and hire people to run them have emerged as early adopters; their business hinges on the reliability and scalability of their network.
Outsourcing is the wave of the future, believes Doug Nassaur, CIO at LinuxCare and former vice president of technical operations at E-Trade, which suffered sporadic outages last year, affecting about 20,000 E-Trade customers. "CIOs felt they had to build their networks, own the assets and worry about security themselves - and it cost them millions of dollars," he says. "A company like MCI Worldcom can build in much more redundancy than you can ever afford."
With so many loaded guns pointing at an online network, Hungry Minds, an Internet start-up in San Francisco that provides customised research and other information over the Web, has chosen to outsource its network to Exodus and Digital Island. Hungry Minds attracts 3000 unique visitors to its Web site every day - but its goal is 10 million. According to Bill Schaefer, chief technology officer at Hungry Minds, he can't reach this goal using an in-house network. He doesn't have the capital to build a network that can scale quickly and handle massive transactional spikes.
Schaefer publishes most of his Web content directly to Exodus and Digital Island servers. Hungry Minds' five-person IS staff monitors the traffic and controls the distribution process using CA-Unicenter. For instance, Hungry Minds can take Exodus offline and transfer all traffic to Digital Island with a single mouse click. There's virtually unlimited capacity, and Hungry Minds pays per hit.
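The one-click switchover Schaefer describes can be sketched as a weighted traffic table: draining one provider shifts its share to the providers that remain. The names and mechanism below are hypothetical, not Hungry Minds' actual setup:

```python
# Hypothetical sketch of the one-click switchover: each hosting provider
# carries a share of the traffic, and draining one shifts its share to
# the providers that remain.
import random

weights = {"exodus": 0.5, "digital-island": 0.5}

def take_offline(provider: str) -> None:
    """Drain one provider and redistribute its share evenly."""
    drained = weights.pop(provider)
    for p in weights:
        weights[p] += drained / len(weights)

def route_request() -> str:
    """Pick a provider for one request in proportion to its weight."""
    r = random.random()
    cumulative = 0.0
    for provider, w in weights.items():
        cumulative += w
        if r <= cumulative:
            return provider
    return provider  # guard against floating-point rounding
```

After `take_offline("exodus")`, every request routes to Digital Island - the "single mouse click" of the article, reduced to a table update.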
The downside: Hungry Minds designers and content providers can't make sweeping changes to the Web site in real time. While some Web site content is stored on a secure server at Hungry Minds' headquarters, most of the static information resides on third-party data centre caches. Despite these limitations, Schaefer is content with the trade-off. "I don't ever want to have an eBay experience," he says.
"Outsourcing isn't a panacea," counters eBay's Webb. "It's about your flexibility and responsiveness." According to Webb, eBay's fate rests on complex features and functions of his Web application and their ability to drive volume. And he doesn't want to give up any control.
Nevertheless, last year, eBay announced plans to expand its partnerships with Exodus and AboveNet Communications, a provider of Internet connectivity. The two outsourcers host eBay's Web servers, database servers and Internet routers. "Having our technology hosted at Exodus and AboveNet will help us manage network capacity and provide a more robust Web backbone," says Bob Quinn, eBay's CIO.
When it comes to high-availability networks, there's no single answer, just as there is no single point of failure. While CIOs implement best practices to improve their chances, it's still a crapshoot. Simply put, there are just too many factors that can bring down the network. The only sure things are planning and preparation.