Amazon Web Services has its roots in the needs of Amazon.com, the retailer, but that doesn't mean that all of the book seller's operations run on Web Services.
"A large part of Amazon's website runs on Amazon Web Services, but there are also pieces that are not suitable to move, mainly because of the way we built the architecture really specific for some hardware, or we have a very dedicated, highly tuned environment where just moving them over to Amazon Web Services would give no direct benefit," said Werner Vogels, CTO of Amazon Web Services.
Vogels spoke on Tuesday evening at the company's new offices in Seattle during an open house. He offered a few other surprising tidbits about the Web Services operations, described how the business came about and disclosed a few problems it has faced along the way.
While many people think of surges in demand as driving the development of cloud services like those offered by Amazon, the opposite -- drops in demand -- is equally important, Vogels said.
"Scaling isn't only about scaling up," he said. "In reality, scaling down is almost as important as scaling up because that's where you get the cost benefits."
In past years, Amazon engineers requested additional servers before each holiday season in anticipation of a spike in traffic. Even before it built its Web Services platform, Amazon was good at allocating servers quickly, within hours, he said.
"However, the engineers would never release capacity," he said. After the holiday season, the engineers would always say they wanted to hang on to the added capacity for the next expected surge. In the meantime, that capacity would go unused.
"Even though we had efficient mechanisms certainly compared to traditional enterprises that would take weeks or months to get hardware, we needed to do something in the design of our compute servers such that we could radically change that behavior," he said.
Cloud computing made it easier for Amazon not only to provision capacity but to reallocate it too, he said.
Now that Web Services is used by companies around the world, Amazon still has a good handle on the high end of capacity -- making sure that there's enough to go around. Because its customers come from such a wide variety of market segments, there aren't worldwide events that Amazon worries will create massive usage spikes that it can't handle, Vogels said.
But the drops in usage tend to set off alarms. Amazon uses order rates as a metric of the overall health of the system, he said. "So if the order rate drops, it's an indicator that there's a problem," he said.
Once, Amazon noticed that orders dropped to zero in Germany and the company scrambled to figure out what problem had cropped up. It turned out that Germany's soccer team was having an important match and the "whole country was at a standstill," he said. "It's more those events that are surprising than worldwide events."
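The order-rate check Vogels describes can be illustrated with a minimal sketch. The function name, the baseline figure and the 50 percent threshold below are hypothetical and chosen only for illustration; the article does not say how Amazon actually implements its alarms.

```python
def order_rate_alarm(current_rate, baseline_rate, drop_threshold=0.5):
    """Return True when orders have fallen well below the expected baseline.

    drop_threshold is an assumed fraction: 0.5 means "alarm if the rate
    is more than 50 percent below baseline".
    """
    if baseline_rate <= 0:
        # No meaningful baseline to compare against.
        return False
    return current_rate < baseline_rate * (1 - drop_threshold)


# Hypothetical numbers: a market that normally sees 1,000 orders per minute.
print(order_rate_alarm(current_rate=0, baseline_rate=1000))
print(order_rate_alarm(current_rate=900, baseline_rate=1000))
```

A check like this cannot tell an outage from a national soccer match, which is why a drop is only an indicator that something needs investigating.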
As one of the largest data center operators around, Amazon faces some unique challenges. "At the scale of Amazon, we have to deal with every possibility as a reality," Vogels said.
For instance, the average annual failure rate of disks is 8 percent to 10 percent. If a data center has 10,000 nodes, each with four disks, that's about 10 disk failures per day. "Now imagine you have 40,000 servers and each has not four but eight disks. Then you have a whole staff doing nothing else but replacing disks all day," he said. "These are realities and we have to build software to deal with it."
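The arithmetic behind those figures can be checked with a short sketch. It assumes failures are spread evenly across a 365-day year, which is a simplification; the function name is illustrative only.

```python
def disk_failures_per_day(nodes, disks_per_node, annual_failure_rate):
    """Expected disk failures per day, assuming failures are spread evenly over a year."""
    total_disks = nodes * disks_per_node
    return total_disks * annual_failure_rate / 365


# The two fleets from the article, at both ends of the quoted 8-10 percent range.
for rate in (0.08, 0.10):
    small = disk_failures_per_day(10_000, 4, rate)
    large = disk_failures_per_day(40_000, 8, rate)
    print(f"rate {rate:.0%}: {small:.1f}/day (10,000 x 4), {large:.1f}/day (40,000 x 8)")
```

Under these assumptions the first fleet loses roughly 9 to 11 disks a day, in line with the "about 10" cited, while the larger fleet loses roughly 70 to 88 a day.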
The sheer size of the operation presents other challenges. Making changes that span multiple services becomes hard to coordinate, he said.
Also, because Amazon has chosen a very decentralized organizational structure, sharing information among engineering teams becomes more difficult than if there were fewer, larger teams, he said. Amazon addressed the issue by having more senior engineers and asking them to meet frequently to discuss issues. "Still, communication and sharing of experiences is harder in a world like that," he said.
The company has also learned some lessons the hard way, he said. "In hindsight, it always seems like you make decisions too late," he said. For instance, there have been occasions when third-party technologies weren't able to handle Amazon's scale. "It's not that I blame the vendors for not being as scalable as Amazon, but we could have made the decision earlier to build technologies ourselves," he said.
Vogels stressed that Amazon Web Services isn't just about selling Amazon.com's excess capacity but is a business itself. Jeff Bezos, Amazon's founder, president and CEO, has said that he thinks Web Services could become as big as the company's retail business, Vogels said.
"The joke is that we're a technology company that happens to do retail," Vogels said.