Large Data Set Analysis in the Cloud: Hadoop Gets a Boost
- 10 April, 2009 07:15
Traditional business intelligence solutions can't scale to the degree necessary in today's data environment. One solution getting a lot of attention recently: Hadoop, an open-source product inspired by Google's search architecture.
Twenty years ago, most companies' data came from fundamental transaction systems: payroll, ERP, and so on. The amounts of data seemed large, but were usually bounded by well-understood limits: the overall growth of the company and the growth of the general economy. For companies that wanted to gain more insight from those systems' data, the related data warehousing systems reflected the structure of the underlying systems: regular data schemas, smooth growth, well-understood analysis needs.
The typical business intelligence constraint was the amount of processing power that could be applied. Consequently, a great deal of effort went into designing the data so that the processing required stayed within the processing power available. This led to the now time-honored business intelligence data warehouse structures: fact tables, dimension tables, star schemas.
Today, the nature of business intelligence has changed completely. Computing is far more widespread throughout the enterprise, so many more systems are generating data. Companies are on the Internet, generating huge torrents of unstructured data: searches, clickstreams, interactions, and the like. And it's much harder, if not impossible, to forecast what kinds of analytics a company might want to pursue.
Today it might be clickstream patterns through the company website. Tomorrow it might be cross-correlating external blog postings with order patterns. The day after it might be something completely different.
And the system bottleneck has shifted. While in the past the problem was how much processing power was available, today the problem is how much data needs to be analyzed. At Internet scale, a company might be dealing with dozens or hundreds of terabytes. At that size, the number of drives required to hold the data guarantees frequent drive failures. And attempting to centralize the data generates too much network traffic for the data to be moved conveniently to the processors.
One thing is clear: the traditional business intelligence solutions can't scale to the degree necessary in today's data environment.
Fortunately, several solutions have been developed. One in particular has gotten a lot of attention recently: Hadoop. Essentially, Hadoop is an open source product inspired by Google's search architecture. Interestingly, unlike earlier open source products, which were usually implementations of existing proprietary products, Hadoop has no proprietary predecessor. The innovation in this aspect of big data resides in the open source community, not in a private company.
Hadoop creates a pool of computers, each running the Hadoop file system. A central master node spreads data across the machines in a file structure designed for large block reads and writes. When a job runs, a hash of each record's key groups records that share a key on the same machine, which makes processing large data sets efficient. For robustness, three copies of all data are kept by default, so that hardware failures do not halt processing.
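To make the idea of spreading blocks and keeping three copies concrete, here is a minimal sketch in plain Python. It is not Hadoop's actual placement logic (which this article does not describe); the function names, the round-robin placement rule, the block size, and the node names are all assumptions chosen for illustration.

```python
def place_blocks(file_size, block_size, nodes, replicas=3):
    """Split a file into blocks and assign each block to `replicas` distinct nodes.

    Hypothetical round-robin rule for illustration only; a real cluster
    uses its own placement policy.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes starting at the block's index, wrapping around.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replicas)]
    return placement


def readable_after_failure(placement, failed_nodes):
    """Return True if every block still has at least one copy on a live node."""
    return all(
        any(node not in failed_nodes for node in holders)
        for holders in placement.values()
    )


# Assumed example: a 1000 MB file, 256 MB blocks, five nodes.
nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = place_blocks(1000, 256, nodes)
two_down = readable_after_failure(placement, {"n1", "n2"})
three_down = readable_after_failure(placement, {"n1", "n2", "n3"})
```

With three copies of every block, this toy cluster survives any two node failures. Losing all three holders of one block (here `n1`, `n2` and `n3`, which hold block 0) makes that block unreadable.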
When it comes time to mine the data, the programmer can ignore the details of how the data is laid out. A single function organizes the overall data set by reading through it and emitting its contents as key/value pairs. This is known as the map function. A second function, known as the reduce function, then goes through the output of the map function, grouped by key, and combines it into the desired result, outputting it to a temporary file, organizing it in a table in memory, or even putting it into a data mart to be analyzed with traditional BI tools.
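The map and reduce steps described above can be sketched in a few lines of plain Python. This is a single-machine illustration of the programming model only, not Hadoop's API: `map_fn`, `reduce_fn`, `run_job`, and the tab-separated clickstream record format are all hypothetical names and assumptions.

```python
from collections import defaultdict


def map_fn(record):
    """Map: read one clickstream record and emit (key, value) pairs.

    Assumed record format: "user_id<TAB>url".
    """
    _user, url = record.split("\t")
    yield url, 1


def reduce_fn(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """Run map over every record, group the output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Assumed sample data: three page views by two users.
records = ["u1\t/home", "u2\t/home", "u1\t/cart"]
page_views = run_job(records, map_fn, reduce_fn)
# page_views == {"/cart": 1, "/home": 2}
```

The programmer writes only `map_fn` and `reduce_fn`. In this sketch `run_job` stands in for the framework, which on a real cluster would run the same two functions in parallel across many machines.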