The 7 steps in Big Data delivery
11 July, 2012 15:38
This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note that it will likely favor the submitter's approach.
The Big Data trend represents the evolving need to process large amounts of data with a new crop of technology solutions that aren't necessarily your father's database. So, what does a company need to consider when contemplating getting started with Big Data?
First, they need to know what Big Data is. Here is how I define it:
"The emerging technologies and practices that enable the collection, processing, discovery and storage of large volumes of structured and unstructured data quickly and cost-effectively."
Big Data -- from financial trades to human genomes to telemetry sensors in cars to social media interactions to Web logs and beyond -- is expensive to process and store in traditional databases. To solve that problem, new technologies use open source software and commodity hardware to store data efficiently, parallelize workloads and deliver screaming-fast processing power.
As more IT departments research Big Data alternatives, the discussion centers on stacks, processing speeds and platforms. And although these IT departments are savvy enough to grasp the limitations of their incumbent technologies, many can't articulate the business value of these alternative solutions, let alone how they will classify and prioritize the data once they identify it. Enter Big Data governance.
In fact, as we look at the emerging need for Big Data, discussions of platforms and processing are only part of the overall approach to Big Data delivery. In reality, we're seeing seven steps in realizing the full potential of a Big Data development effort:
Collect: Data is collected from the data sources and distributed across multiple nodes -- often a grid -- each of which processes a subset of data in parallel.
Process: The system then uses that same high-powered parallelism to perform fast computations against the data on each node. The nodes then "reduce" the resulting data findings into more consumable data sets to be used by either a human being (in the case of analytics) or a machine (in the case of large-scale interpretation of results).
Manage: Often the Big Data being processed is heterogeneous, originating from different transactional systems. That data usually needs to be understood, defined, annotated, cleansed and audited for security purposes.
Measure: Companies will often measure the rate at which that data can be integrated with other customer behaviors or records and whether the rate of integration or correction is increasing over time. Business requirements should inform the type of measurement and ongoing tracking.
Consume: The resulting use of the data should fit in with the original requirement for the processing. For instance, if bringing in a few hundred terabytes of social media interactions helps us understand whether and how social media data drives additional product purchases, then we should set up rules for how social media data should be accessed and updated. This is equally important for machine-to-machine data access.
Store: As the "data as a service" trend takes shape, increasingly the data stays in a single location as the programs that access it move around. Whether the data is stored for short-term batch processing or longer-term retention, storage solutions should be deliberately addressed.
Data Governance: Data governance is the business-driven policy-making and oversight of data. As defined, data governance applies to each of the six preceding stages of Big Data delivery. By establishing processes and guiding principles, it sanctions behaviors around data. And Big Data needs to be governed according to its intended consumption. Otherwise, the risk is disaffection of constituents, not to mention overinvestment.
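The Collect and Process steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the function names (collect, process, combine, run), the "source" field on each record and the use of threads to stand in for grid nodes are all hypothetical choices made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def collect(records, node_count):
    """Collect: split the records into one subset per node (round-robin)."""
    return [records[i::node_count] for i in range(node_count)]


def process(partition):
    """Process: each node counts records per source in its own subset."""
    return Counter(record["source"] for record in partition)


def combine(left, right):
    """Reduce: merge two per-node results into one smaller data set."""
    return left + right


def run(records, node_count=3):
    """Run collect, parallel process and reduce over a list of records."""
    partitions = collect(records, node_count)
    # Assumption: threads stand in for grid nodes; a real grid uses separate machines.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(process, partitions))
    return reduce(combine, partials, Counter())


if __name__ == "__main__":
    sample = [
        {"source": "web_log"},
        {"source": "social"},
        {"source": "web_log"},
        {"source": "sensor"},
    ]
    print(run(sample, node_count=2))
```

The point of the sketch is the shape of the work rather than the tooling: each node sees only its own subset, and the reduce step is what turns many partial results into something a person or another system can consume.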
Most staff members charged with researching and acquiring Big Data solutions focus on the Collect and Store steps at the expense of the others. The question is implicit: "How do we gather all these petabytes of data and where do we put 'em all once we have 'em?"
But the processes for defining discrete business requirements for Big Data still elude many IT departments. Business people often see the Big Data trend as just another pretext for IT resume-building with no clear end game. Such an environment of mutual cynicism is the single biggest reason Big Data efforts so often fail to get past the tire-kicking phase.
As IT Business Edge author Lorraine Lawson said in a recent blog post, "The only way to ensure your analysis is sound is to ensure you have a governance program in place for Big Data."
Entrenching data governance processes on behalf of a Big Data effort ensures that:
- Business value and desired outcomes are clear
- Policies for the treatment of key data have been sanctioned
- The right subject matter expertise is applied to the Big Data problem
- Definitions and rules for key data are clear
- There is an escalation process for conflict and questions
- Data management -- the tactical execution of data governance policies -- is deliberate and relevant
- There are decision rights for key issues during development
- Data privacy policies are enforced
In short, data governance means that the application of Big Data is useful and relevant. It's an insurance policy that the right questions are being asked, so that we don't squander the immense power of new Big Data technologies that make processing, storage and delivery speed more cost-effective and nimble than ever.
Jill Dyché is vice president of thought leadership, strategic products for SAS. SAS DataFlux Data Management solutions enable business agility and IT efficiency by providing innovative data management technology and services that transform data into a strategic asset. See www.datafluxinsight.com for the latest education on data governance and Big Data best practices.