As enterprise adoption of Hadoop booms, the pool of IT personnel able to build and maintain deployments hasn't kept pace. In May, analyst firm IDC pegged the compound annual growth rate for the Hadoop software market at more than 60 percent, forecasting an ascent to $812.8 million in 2016 from a base of $77 million in 2011.
The Apache Software Foundation's Hadoop distributed computing technology initially staked out a role in search engines. Yahoo helped get the software off the ground, announcing in 2008 what it termed the largest Hadoop production application. Distributions of the open-source software have since boosted availability beyond the earliest adopters. Cloudera kicked off its Hadoop distribution in 2009, followed by Hortonworks and MapR Technologies.
As Hadoop penetrates a broadening range of industries, from publishing to agriculture, IT departments look to Hadoop distributors and specialized consulting firms to fill Hadoop skills gaps. CIOs and IT managers look for outside help to launch projects, write code and generally navigate the Hadoop ecosystem. IT organizations also tap channel partners for training as they seek to grow in-house Hadoop talent.
Staff supplementation and training are about the only options for organizations struggling to hire Hadoop experts.
"Short supply is an understatement," says Geoffrey Weber, CIO at Shutterfly, describing the scarcity of Hadoop expertise. "I think, realistically, it is virtually impossible for a company our size to expect to go into the market and find a pocket of...Hadoop veterans."
Shutterfly, which offers an Internet-based personal publishing service, is hardly a small business (2011 revenue exceeded $473 million), but the company competes with social media giants such as Facebook and LinkedIn for a limited supply of Hadoop expertise.
"If you were, yourself, a Hadoop expert, coming out of Yahoo and part of the original team, your skills and experience are almost unique. You can name wherever you want to work and how much you get paid," Weber says. "It's difficult for us to go out and acquire those kinds of skills."
For Large-Scale Deployments, Hadoop Skills In Short Supply
Hadoop targets data sets too large and cumbersome to manage and analyze using conventional database technology. It does this by dispersing big data processing tasks across multiple computing nodes.
Hadoop software tends to be grouped alongside NoSQL databases as big data technology. The core components of Hadoop consist of MapReduce, which distributes processing jobs on a Hadoop cluster, and the Hadoop Distributed File System. A number of other open-source projects, as well as some commercial software, round out the Hadoop ecosystem.
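To make the MapReduce idea concrete, here is a minimal sketch in plain Python of how a word-count job can be split into a map step, a grouping step and a reduce step. It runs in a single process and does not use Hadoop's own APIs; the function names and the `num_nodes` parameter are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_nodes=2):
    # Split the input into one chunk per hypothetical node.
    chunks = [lines[i::num_nodes] for i in range(num_nodes)]
    # Each chunk is mapped independently; here the chunks are processed
    # one after another, standing in for work spread across a cluster.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample = ["big data big cluster", "data node", "big node"]
    print(word_count(sample))
```

Because each chunk is mapped without reference to the others, the same logic can in principle be spread over many machines, which is the property that lets this style of job scale with data volume.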
A company's journey into this ecosystem could well begin as an informal experiment. For example, Weber says a company may have an employee interested in Hadoop who downloads the software and builds a small cluster.
Doing something more ambitious with Hadoop will typically require additional resources. In Shutterfly's case, the organization started with in-house resources, but now works with an outside contractor and plans to bring on additional help. Shutterfly also aims to harness Hadoop for website analytics. The company hopes to glean greater insight into customer transactions and the website's overall technical performance.
While Shutterfly works with the contractor on a limited basis, the company will be working with vendors like Hortonworks to start "an effort that is much more formalized," Weber says. Contractor and vendor resources will initially focus on getting the company's Hadoop project off the ground. Weber says he also aims to train a small group of in-house personnel beyond introductory Hadoop knowledge.
Monsanto, an agricultural products company based in St. Louis, also finds itself cultivating internal resources and looking for outside support. The company's geographic location, away from the big IT centers on the East and West coasts, creates Hadoop recruiting and hiring issues. "Being in the Midwest, that is a challenge for us," says Lori Yancey, R&D IT staffing lead at Monsanto.
The company has been evaluating Hadoop since late 2009. Last year, Monsanto decided to build out a full production cluster, notes Erich Hochmuth, R&D IT high performance analytics lead at Monsanto. He says the company has a couple of Hadoop projects underway and uses the platform "for analytics over large unstructured and semi-structured datasets."
Monsanto's Hadoop initiatives focus on using the platform to build enterprise data processing pipelines for analyzing and storing data generated from scientific instruments. Hochmuth says building these analysis pipelines in Hadoop will allow Monsanto to scale as newly adopted scientific instruments increase data volume. Traditional solutions, on the other hand, require IT personnel to rewrite and re-engineer the analysis pipelines to accommodate increases in data volume.
Hochmuth says Monsanto has tapped Cloudera as a source of Hadoop know-how. Cloudera will offer consulting services to get Monsanto's Hadoop projects up and running. Once Monsanto has a team using Hadoop, the next step will involve building up its in-house knowledge, Hochmuth notes. To that end, Cloudera will provide on-site training sessions for Hadoop administrators, as well as ongoing enterprise support, he adds.
Consulting, Development, Training Address Hadoop Skills Shortage
Vendors pursuing the Hadoop skills gap offer a mix of consulting, software development and training services. Key players here include Hadoop distributors and specialized IT services companies.
"Hadoop is only now moving from R&D domain to mainstream corporate arena. There are not very many professionals out there in the market," notes Timothy Diep, business development manager at DCKAP, a technology consulting company that provides Hadoop development and consulting. "There is a premium on people who know enough about the guts of Hadoop."
Diep says customers ask for three main skill sets: data analysts/data scientists, data engineers and data management professionals.
Analysts should have experience in SAS, SPSS and programming languages such as R. "These are the professionals who will generate, analyze, share and integrate intelligence gathered and stored in Hadoop environments," he says.
Data engineers, meanwhile, are responsible for creating the data processing jobs and building the distributed MapReduce algorithms that the data analysts use, Diep explains. Finally, data management personnel do three things, he says: decide whether to deploy Hadoop in the cloud or on premises, and with which vendors and distributions; determine the size of the cluster; and decide whether the cluster will be used for running production applications or for quality-testing purposes.
Lunexa, a boutique technology consulting firm, also pursues the Hadoop space. The company focuses on helping customers develop business solutions on top of the Hadoop platform, says David Cole, partner at Lunexa.
CIOs and IT managers aren't just looking for Hadoop skills. They also seek knowledge of how Hadoop interacts with business intelligence tools, Cole says, adding that "virtually every BI vendor is putting together a Hadoop story."
Lunexa launched eight years ago as a traditional extract, transform, load (ETL) and data warehousing company. The company deployed an in-house Hadoop cluster about a year ago, Cole says, and invested in Hadoop training for its consultants at the same time.
Much of Lunexa's Hadoop business is in developing MapReduce code for complex analytics or ETL, Cole says. Demand cuts across a range of industries, but Cole notes that Lunexa sees a bit more work coming from large media companies and financial services firms, where large data volumes are common.
Lunexa also works with Hive, a Hadoop data warehousing system, while a consulting firm may also find a role in Hadoop administration. "There are multiple facets of working on Hadoop," Cole says.
Training is another Hadoop service, and one that distributors provide. Hortonworks, for instance, offers courses in developing solutions with Hadoop and in administering it. The developer class spans four days; the administrator course takes two days.
Renee Beckloff, senior director of global delivery services for Hortonworks, cites the scarcity of Hadoop professionals. "I've been in education for the last 15 years, and this is probably the shortest supply of quality personnel for any type of platform or any type of technology that I've seen," she says.
MapR Technologies offers training for administrators, data modelers and developers. Cloudera also offers a combination of developer and administrator training.
Omer Trajman, vice president of technology solutions for Cloudera, says the company has trained about 12,000 people since launching its training program three years ago. The company is currently training Hadoop personnel at the rate of 1,500 per month.
"Training is in high demand and the skills are in high demand," Trajman says. "Not enough people have learned how to use Hadoop."
John Moore has written on business and technology topics for more than 20 years. His areas of focus include mobile app development, health IT, cloud computing, government IT and distribution channels.