Big data - Part 1
- 19 September, 2011 10:12
According to IDC’s Digital Universe report, the volume of data created globally each year will leap from 1.2 zettabytes this year to 35 zettabytes in 2020 (one zettabyte is equal to one billion terabytes).
Even when scaled back to the data generated by a single business, the numbers can be just as scary. The internet has opened new marketing and communication channels whose digital nature results in the generation of data by the gigabyte, while many organisations have also become better at capturing transactional data. And most organisations have not even begun to analyse the massive piles of unstructured data being generated through social media.
Many of the data sets in existence today defy being crunched with conventional tools, so the wisdom they could yield lies unrealised.
It’s little wonder that the explosion of data volumes and complexity — and the techniques for dealing with it — has become known as Big Data.
While the data sets might be daunting, Big Data promises to unlock business value by delivering better intelligence in areas such as fraud detection, loan risk analysis and customer behavioural analytics, potentially even in real time.
Director of the business intelligence specialist company C3, Cameron Wall, says Big Data is pushing the limits of the current technology, and the glory days of traditional relational database management technology are fading.
For some sectors, such as genomics and bio-informatics research, dealing with Big Data issues is nothing new, and in many ways the technologies pioneered in these sectors are forming the basis of solutions used in the commercial world. The reason commercial organisations are becoming more interested in Big Data techniques is simple: Big Data analysis can deliver insights beyond what is possible with even current relational technology. Online companies such as Amazon and eBay in particular are showing what can be done, and rivals are keen to whittle down their competitive advantage.
“There used to be this catch-cry that knowledge is power, and that information is the asset of the organisation, and I think companies were paying lip service to that,” Wall says. “But now they are paying full attention and are forced to deliver on that promise, because they are seeing companies overseas perform, and perform well — especially the online businesses.”
The Australian commercial online video company Movideo is using Big Data analysis techniques to perform advanced analytics on its video feeds. Movideo is a subsidiary of the digital media company MCM Entertainment and streams video content for popular television programming such as MasterChef and Formula 1 racing, as well as music video streaming.
Movideo’s chief technology officer, Cameron Moore, says his company is using massively parallel database technology from Greenplum to create a data warehouse for storing and analysing the terabytes of data being collected about its video streams, such as how long a viewer watches a video, where they are watching it from, and what format they are watching it in.
“Basically we use that information to work with key performance indicators of the platform and content editors use that to work out if the content is doing well or not,” Moore says.
The advantage of using Greenplum is that Movideo can store and process terabytes of data while keeping the data in its most basic native form, rather than ‘rolling it up’ into averaged data.
“We don’t roll up our data, which the traditional systems do,” Moore says. “Once you roll that data up, you destroy what’s underneath and you can never go and do an ad hoc query after the fact. If you roll up all your data for the averages, you can’t drill down to a specific moment in time.”
For instance, were Movideo to roll up regional data to a state or country level, it would never be able to go back into that data to pin-point data at a suburban level.
“If you are a local provider, having data at a country level is pretty much useless to you,” he says.
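The trade-off Moore describes can be illustrated with a minimal Python sketch. It is not Movideo's or Greenplum's implementation; the event fields, sample values and function names are hypothetical and chosen only to show that a country-level average can no longer answer a suburb-level question, while the raw rows still can.

```python
from collections import defaultdict

# Hypothetical raw view events: (video_id, country, suburb, seconds_watched).
# The field layout and the values are assumptions made for illustration.
RAW_EVENTS = [
    ("clip-1", "AU", "Richmond", 120),
    ("clip-1", "AU", "Fitzroy", 30),
    ("clip-1", "AU", "Richmond", 60),
    ("clip-2", "AU", "Fitzroy", 90),
]


def roll_up_by_country(events):
    """Average watch time per (video, country); suburb detail is discarded."""
    totals = defaultdict(lambda: [0, 0])  # key -> [total_seconds, view_count]
    for video, country, _suburb, seconds in events:
        entry = totals[(video, country)]
        entry[0] += seconds
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def watch_times_in_suburb(events, video, suburb):
    """Ad hoc query that is only possible while the raw rows still exist."""
    return [secs for v, _country, s, secs in events if v == video and s == suburb]


rolled = roll_up_by_country(RAW_EVENTS)
richmond = watch_times_in_suburb(RAW_EVENTS, "clip-1", "Richmond")
```

Here `rolled` holds only one average per video and country, so nothing in it can be broken back down by suburb, whereas the query against `RAW_EVENTS` can still isolate the Richmond views.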
The massively parallel nature of Greenplum means that huge tasks can be broken down into manageable bites and executed in parallel on commodity hardware, significantly reducing the cost of processing Big Data problems.
Greenplum was acquired by EMC in 2010, and is just one of dozens of tools that have emerged to handle these problems. Other commonly used distributed processing tools include MapReduce, Hadoop and Cassandra, which are all designed to process large data sets across clusters of computers. MapReduce was developed by Google in 2004; the latter two are open source projects, with Cassandra initially developed by Facebook to power its Inbox Search feature. The online dating service eHarmony uses Hadoop to determine with whom members are ideally matched.
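The idea of breaking a large job into pieces, processing them in parallel and combining the partial results can be sketched in a few lines of Python. This is a toy illustration of the map-and-reduce pattern only: it uses threads on a single machine rather than a cluster, it does not use the Greenplum, Hadoop or Cassandra APIs, and the event data and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical log of (video_id, region) view events, assumed for illustration.
EVENTS = [
    ("clip-1", "VIC"), ("clip-2", "NSW"), ("clip-1", "VIC"),
    ("clip-1", "NSW"), ("clip-2", "NSW"), ("clip-3", "QLD"),
]


def split(items, n_chunks):
    """Cut the input into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(chunk):
    """Map step: count views per region within one slice."""
    return Counter(region for _video, region in chunk)


def count_views_by_region(events, workers=3):
    """Run the map step on each slice in parallel, then merge the counts."""
    chunks = split(events, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: add the partial counters together.
    return reduce(lambda a, b: a + b, partials, Counter())


totals = count_views_by_region(EVENTS)
```

Each slice is counted independently, so in a real distributed system the same map step could run on separate commodity machines, with only the small partial counts travelling over the network to be merged.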
Moore says Movideo has been working with Greenplum for about 12 months, having previously used traditional open source databases. But these simply weren’t fast enough to process the growth in the data that it was collecting.
Now, he says, Movideo is able to offer customers a filtering system whereby they can ‘drag’ in a video and then immediately analyse metrics such as how often it was viewed in a specific region and for how long.
“The system can do an ad hoc query on the fly and return that data, and draw that on a map.”
In future, Moore hopes to use the power of massively parallel processing to provide new options to consumers, such as ad hoc matching of preferences to help them discover additional content.
“What we are looking at doing right now is exposing that data back to the users in a social networking sense,” Moore says. “One user watching content can see other users watching content at the same time, so we can create a social atmosphere around premium content.”