Large Data Set Analysis in the Cloud: Hadoop Gets a Boost
- 10 April, 2009 07:15
Traditional business intelligence solutions can't scale to the degree necessary in today's data environment. One solution getting a lot of attention recently: Hadoop, an open-source product inspired by Google's search architecture.
Twenty years ago, most companies' data came from fundamental transaction systems: payroll, ERP, and so on. The amounts of data seemed large, but were usually bounded by well-understood limits: the overall growth of the company and the growth of the general economy. For companies that wanted to gain more insight from those systems' data, the related data warehousing systems reflected the underlying systems' structure: regular data schemas, smooth growth, well-understood analysis needs.
The typical business intelligence constraint was the amount of processing power that could be applied. Consequently, a great deal of effort went into designing the data so that the processing required fit within the processing power available. This led to the now time-honored business intelligence data warehouse structures: fact tables, dimension tables, star schemas.
Today, the nature of business intelligence has changed completely. Computing is far more widespread throughout the enterprise, leading to many more systems generating data. Companies are on the Internet, generating huge torrents of unstructured data: searches, clickstreams, interactions, and the like. And it's much harder, if not impossible, to forecast what kinds of analytics a company might want to pursue.
Today it might be clickstream patterns through the company website. Tomorrow it might be cross-correlating external blog postings with order patterns. The day after it might be something completely different.
And the system bottleneck has shifted. While in the past the problem was how much processing power was available, today the problem is how much data needs to be analyzed. At Internet scale, a company might be dealing with dozens or hundreds of terabytes. At that size, the number of drives required to hold the data guarantees frequent drive failures. And attempting to centralize the data generates too much network traffic for the data to be moved conveniently to the processors.
One thing is clear: the traditional business intelligence solutions can't scale to the degree necessary in today's data environment.
Fortunately, several solutions have been developed. One, in particular, has gotten a lot of attention recently: Hadoop. Essentially, Hadoop is an open source product inspired by Google's search architecture. Interestingly, unlike previous open source products that were usually implementations of existing proprietary products, Hadoop has no proprietary predecessor. The innovation in this aspect of big data resides in the open source community, not in a private company.
Hadoop creates a pool of computers, each running the Hadoop Distributed File System (HDFS). A central master node tracks where the data lives and spreads it across the machines in a file structure designed for large block reads and writes. During processing, records are hashed by key so that related data elements end up on the same machine, which makes processing large data sets efficient. For robustness, three copies of each block of data are kept by default, so that hardware failures do not halt processing.
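The two ideas in the paragraph above, hashing to keep related records together and keeping three copies of the data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the CRC32 hash, and the round-robin placement are hypothetical simplifications, not Hadoop's actual code or placement policy.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition number using a stable hash.

    CRC32 is an assumption chosen for this sketch; the point is only
    that equal keys always land in the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def place_replicas(block_id: int, nodes: list[str], replication: int = 3) -> list[str]:
    """Choose `replication` distinct nodes to hold copies of one block.

    Simple round-robin starting at block_id; a real cluster would also
    weigh rack layout and free space, which this sketch ignores.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]


# Hypothetical four-machine pool.
nodes = ["node-a", "node-b", "node-c", "node-d"]
replicas_for_block_5 = place_replicas(5, nodes)
```

Because every block lives on three different machines in this sketch, losing any single machine still leaves two readable copies of each block it held.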
When it comes time to mine the data, the programmer can ignore the details of how the data is laid out. One function, known as the map function, reads through the overall data set and outputs it organized as key/value pairs. A second function, known as the reduce function, then goes through the map output, grouped by key, and selects or aggregates the desired data, outputting it to a temporary file, organizing it in a table in memory, or even putting it into a data mart to be analyzed with traditional BI tools.
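To make the map and reduce steps concrete, here is a minimal single-machine sketch in plain Python that counts page views per URL in a clickstream. It does not use Hadoop's API; the names `map_fn`, `reduce_fn`, and `run_job`, and the "user url" record layout, are assumptions made for illustration. On a real cluster the grouping step in the middle would be carried out by the framework across many machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (url, 1) pair for each well-formed click record.

    The 'user url' layout is assumed; malformed lines are skipped.
    """
    parts = line.split()
    if len(parts) == 2:
        yield parts[1], 1


def reduce_fn(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Collapse all values seen for one key into a single total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> dict[str, int]:
    """Run map, group the pairs by key, then reduce each group."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Hypothetical clickstream sample.
clicks = ["alice /home", "bob /pricing", "alice /pricing", "malformed"]
page_views = run_job(clicks)
print(page_views)
```

The programmer writes only `map_fn` and `reduce_fn`; nothing in either function refers to where the data is stored, which is the point the paragraph above makes.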