Hadoop becomes critical cog in the big data machine
- 19 June, 2012 10:08
Apache's Hadoop technologies are becoming critical in helping enterprises manage vast amounts of data, with users ranging from NASA to Twitter to Netflix increasing their reliance on the open source distributed computing platform.
Hadoop has gathered momentum as a mechanism for dealing with big data, the rapidly growing volumes of information from which enterprises seek to derive value. Recognizing Hadoop's potential, users are both deploying the existing Hadoop platform technologies and developing their own technologies to complement the Hadoop stack.
Hadoop's corporate usage now and in the future
NASA expects Hadoop to handle large data loads in projects such as its Square Kilometer Array sky-imaging effort, which will churn out 700TBps when built in the next decade. The data systems will include Hadoop, as well as technologies such as Apache OODT (Object Oriented Data Technology), to cope with the massive data loads, says Chris Mattmann, a senior computer scientist at NASA.
Twitter is a big user of Hadoop. "All of the relevance products [offering personalized recommendations to users] have some interaction with Hadoop," says Oscar Boykin, a Twitter data scientist. The company has been using Hadoop for about four years and has even developed Scalding, a Scala library intended to make it easy to write Hadoop MapReduce jobs; it is built on top of the Cascading Java library, which is designed to abstract away Hadoop's complexity.
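The MapReduce model that Scalding and Cascading abstract away can be sketched without a cluster at all. The following is a minimal, Hadoop-free word count in Python (the function names are illustrative, not part of any Hadoop API) showing the map, shuffle, and reduce phases that frameworks like Scalding let developers express in a few lines:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each word's counts into a single total
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

On a real cluster the shuffle is distributed across machines and the data lives in HDFS, but the three-phase shape is the same; Scalding's contribution is letting a Scala programmer write the map and reduce logic as ordinary collection operations.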
Hadoop subprojects include MapReduce, a software framework for processing large data sets on compute clusters; HDFS (Hadoop Distributed File System), which provides high-throughput access to application data; and Common, which offers utilities to support other Hadoop subprojects. Movie rental service Netflix has begun using Apache ZooKeeper, a Hadoop-related technology, for configuration management. "We use it for all kinds of things: distributed locks, some queuing, and leader election" for prioritizing service activity, says Jordan Zimmerman, a senior platform engineer at Netflix. "We open-sourced a client for ZooKeeper that I wrote called Curator"; the client serves as a library for developers to connect to ZooKeeper.
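The leader election Zimmerman mentions follows ZooKeeper's standard recipe, which libraries like Curator package up: each candidate creates an ephemeral sequential znode under an election path, and the candidate holding the lowest sequence number is the leader. A minimal local sketch of just that selection step in Python, with hypothetical znode names (a real client would also watch the next-lowest node to detect leader failure):

```python
def elect_leader(znodes):
    """Pick the leader from a list of ephemeral-sequential znode names.

    ZooKeeper appends a monotonically increasing sequence number to
    each candidate's node name; the lowest number wins the election.
    """
    def sequence(name):
        # e.g. "candidate-0000000007" -> 7
        return int(name.rsplit("-", 1)[1])
    return min(znodes, key=sequence)

# Three hypothetical candidates registered under an election path
candidates = ["candidate-0000000012", "candidate-0000000007", "candidate-0000000031"]
print(elect_leader(candidates))  # candidate-0000000007
```

Because the znodes are ephemeral, a crashed leader's node disappears automatically and the next-lowest candidate takes over, which is what makes the recipe robust without a central coordinator.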
The Tagged social network is using Hadoop technology for data analytics, processing about half a terabyte of new data daily, says Rich McKinley, Tagged's senior data engineer. Hadoop is being applied to tasks beyond the capacity of its Greenplum database, which is still in use at Tagged: "We're looking toward doing more with Hadoop just for scale."
Although they laud Hadoop, users see issues that need fixing, such as deficiencies in reliability and job-tracking. Tagged's McKinley notes a problem with latency: "The time to get data in is quite quick and then, of course, I think everybody's big complaint is the high latency for doing your queries." Tagged has used Apache Hive, another Hadoop-derived project, for ad hoc queries. "That can take several minutes to produce a result that Greenplum would return in a couple of seconds." Using Hadoop is cheaper than using Greenplum, though.
What's in store for Hadoop 2.0
Hadoop 1.0 was released late in 2011, featuring strong authentication via Kerberos and support for the HBase database. The release also prevents individual users from taking down clusters, via constraints on MapReduce. But a new version is on the horizon: HortonWorks CTO Eric Baldeschwieler has provided a road map for Hadoop that includes the upcoming 2.0 release. (HortonWorks has been a contributor to Apache Hadoop.) Version 2.0, which entered an alpha release phase earlier this year, "has an end-to-end rewrite of the MapReduce layer and a pretty complete rewrite of all the storage logic and the HDFS layer as well," Baldeschwieler says.
Hadoop 2.0 focuses on scale and innovation, with YARN (next-generation MapReduce) and federation capabilities. YARN will let users plug in their own compute models so that they are not limited to MapReduce. "We're really looking forward to the community inventing many new ways of using Hadoop," Baldeschwieler says. Expected uses include real-time applications and machine-learning algorithms. Scalable, pluggable storage is also planned.
Always-on capabilities in Version 2.0 will enable clusters with no downtime. General availability of Hadoop 2.0 is expected within a year.
This story, "Hadoop becomes critical cog in the big data machine," was originally published at InfoWorld.com.