Happy birthday, Apache Hadoop!
Ten years ago, on Jan. 28, 2006, Doug Cutting and Mike Cafarella split the distributed file system and MapReduce facility from their open source Web crawler project (Apache Nutch) and spun them off as a subproject called Hadoop. The subproject was named after Cutting's son's stuffed elephant toy.
Cutting and Cafarella had been working on Nutch since 2003. In 2004, a lightning bolt of inspiration struck in the form of two papers published by Google describing a distributed file system (GFS) and an execution engine (MapReduce) that allowed Google engineers to write a computation with just a few lines of code that would run in parallel on thousands of machines. Since Cutting and Cafarella were attempting to build a distributed system that could process billions of web pages, Cutting says it was clear that Google's approach would make Nutch a much more viable system. And the tools could probably be used in a lot of other applications too.
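To give a rough sense of what "a few lines of code" means in this model, here is a minimal single-machine sketch of the MapReduce pattern applied to word counting, written in Python. It is not Google's or Hadoop's actual API. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines and do the grouping itself.

```python
from collections import defaultdict


def map_phase(document):
    # Map step (hypothetical name): emit a (word, 1) pair for every word.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Group values by key. A real framework does this between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce step (hypothetical name): combine all counts for one word.
    return key, sum(values)


def word_count(documents):
    # Run map over every document, group the pairs, then reduce each group.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the web is big", "the crawler reads the web"])
print(counts)
```

The programmer writes only the map and reduce functions. In a distributed setting, the framework would handle splitting the input, moving data between machines and recovering from failures.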
Poster child for big data
Ten years later, Hadoop in many ways has come to represent big data and has helped to spark a revolution in data and analytics technologies that is disrupting businesses and industries of all stripes today. Cutting, now chief architect at Cloudera, never imagined where Hadoop would go.
"The fact that it's spread to so many corporations around the world is a total surprise to me," he says. "It's part of a larger story that I didn't see completely. All of industry is becoming digitized. That is the fuel for growth and progress that we're seeing in company after company."
Cutting says he expected Hadoop to help Web companies that were already processing massive amounts of data. But today even decidedly non-digital companies, from railways and airlines to hospitals, are becoming highly digitized.
"Hadoop has benefitted from that trend," Cutting says. "There was a need for something that could store and process all this data."
Hacker and enterprise worlds collide
But Cutting says that what has surprised him even more is how the past 10 years have seen two disparate software traditions — what he calls the "enterprise" tradition and the "hacker" tradition — merge.
"In the enterprise tradition, vendors developed and sold software to businesses who ran it — the two rarely collaborated," Cutting wrote in a blog post celebrating Hadoop's 10th birthday. "Enterprise software relied on a Relational Database Management System (RDBMS) to address almost every problem. Users trusted only their RDBMS to store and process business data. If it was not in the RDBMS, it was not business data."
"In the hacker tradition, software was largely used by the same party that developed it, at universities, research centers and Silicon Valley web companies," he wrote. "Developers wrote software to address specific problems, like routing network traffic, generating and serving web pages, and so on. I came out of this latter tradition, specifically working on search engines for over a decade. We had little use for an RDBMS, since it did not scale well to searching the entire web, becoming too slow, inflexible and expensive."
In 2006, after splitting Hadoop into its own project, Cutting joined Yahoo and gained access to a dozen or so Yahoo engineers and thousands of the company's computers.
For hardy developers only
"Ten years ago, it was barely working," he says. "You had to be a pretty hardy developer to try it and get it working."
But with Yahoo's resources, the team had a relatively stable, reliable system that could process petabytes using affordable commodity hardware. That, Cutting says, meant developers could much more quickly and easily build better methods of advertising, spell-checking, page layout, etc. Yahoo began using it internally, but users outside the company also started deploying it, especially at companies like Facebook, Twitter and LinkedIn. Some of the projects that would become core to the new Hadoop ecosystem were also being built on top of Hadoop, including Apache Pig, Apache Hive and Apache HBase. Academic researchers also began using it, Cutting says.
"We had reached the target I had initially imagined: a popular open source project that enabled easy, affordable storage and analysis of bulk data," Cutting says.
Embraced by mainstream America
But, of course, it didn't end there. Cutting was approached by venture capitalists who saw uses for Hadoop beyond the Web and academia, despite its lack of security, clunky API and the fact that it only supported big batch computation.
"I thought they were crazy," Cutting says. "Banks, insurance companies and railways would never run the open source "hacker" software that I worked on."
Cutting turned them away. But the venture capitalists weren't deterred. In 2008 they funded Cloudera with the mission of bringing Hadoop and related technologies to traditional enterprises.
"What I didn't imagine was that software from this hacker tradition would get embraced by mainstream organizations, by corporate America," he adds. "That really took a company like Cloudera getting started and filling in the gaps so that it could be used."
It took Cutting a year to come around to what the VCs had seen.
"If we could make Hadoop approachable to Fortune 500 companies, it had the potential to change their businesses," he says. "As companies were adopting more technology, from websites and call centers to cash registers and bar code scanners, more and more data about their businesses passed through their fingers."
Institutions that could capture and use that data would be in a position to better understand and improve their businesses. But the traditional RDBMS technology these institutions were used to was not well suited to the task. As Cutting notes, it was simply too rigid to support variable, messy data and rapid experimentation. It couldn't scale easily to petabytes of data. Perhaps most important, it was expensive; the cost and the procurement process did not support engineers experimenting with new ideas for using the data. Hadoop could handle all those issues.
"Open source is pretty amazing," Cutting says. "It's an accelerant for software development and adoption. It kind of gives open source software an unfair advantage over software created in other ways. People are much more relaxed about trying and adopting open source software. If an engineer in company is trying to do some analysis, he's got an experiment in mind and some data he access to, does he want to talk to IT about deploying a new database or does he just want to download something and try it?"
Today, Cutting says, the enterprise and hacker traditions have merged. There is no sharp dividing line in the enterprise between those who develop software and those who use it. Cloudera customers regularly collaborate with Cloudera engineers, he says, and users are often directly involved in software advances.
"No single software component dominates," he says. "Hadoop is perhaps the oldest and most successful component, but new, improved technologies arrive every year. New execution engines like Apache Spark and new storage systems like Apache Kudu (incubating) demonstrate that this software ecosystem evolves rapidly, with no central point of control. Users get better software sooner."
Looking ahead, Cutting says he can't say what the next hot new software thing will be, but he believes big changes are coming in the deployment model. Things are moving to the cloud slowly but surely, which means tools need to respond to that and better support cloud-based operation. Advances in container tools, like Docker, will change things.
Hardware developments, too, promise big changes. For instance, Intel's new 3D XPoint technology is designed to combine the persistence of flash memory with speeds closer to those of DRAM.
"If you fundamentally change the performance and economics of the architecture that you're deploying on, then the software needs to change to take advantage of those economies," Cutting says. "I think we're going to see a lot of modifications to tools or all new tools."