Big data is a relative term. We have been managing data growth since the 1880s, when American statistician Herman Hollerith developed a mechanical tabulator based on punched cards to tabulate statistics from millions of pieces of data.
Hollerith’s tabulator was used in the 1890 United States Census, which as a result was completed months ahead of schedule and significantly under budget. Hollerith went on to found the company that would later become IBM.
Since those early days of data collection and analysis, organisations and individuals have shown a growing appetite to collect and analyse data for purposes such as operational efficiency, sales and marketing, and forecasting.
As organisations found new uses for data and began to ask more complex questions of it, technology evolved to meet the demand for managing it, and this led to a symbiotic relationship between information and technology.
Data thus far had been a measurable quantum and was organised in a structured manner that was easily identifiable and retrievable. Technology evolved at a commensurate pace to ensure the equation was balanced and stayed relatively the same.
However, this relied on all relevant data being managed centrally, where it could be neatly indexed, filed in rows and columns in multiple layers, and then called upon and manipulated to give any number of results based on the questions asked.
This is the typical modus operandi of a relational database management system (RDBMS), which will store an organisation’s operational data to be used to derive insights at various stages of the information lifecycle.
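To make the rows-and-columns model concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns and the sample figures are hypothetical and chosen only for illustration; they do not come from any system described in this article.

```python
import sqlite3

# Hypothetical operational data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("NSW", "widget", 120.0),
        ("NSW", "gadget", 80.0),
        ("VIC", "widget", 200.0),
    ],
)
# An index lets the database find rows for a region without scanning the table.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")


def total_by_region(connection):
    """Ask one question of the structured data: total sales per region."""
    rows = connection.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)


totals = total_by_region(conn)
print(totals)
```

Because every record shares the same columns, a different question only needs a different query, not a different data structure.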
Regardless of where, when and how information was collected, it was rendered to the database structure to comply with Structured Query Language (SQL). This standard of information management gave rise to the database administrator (DBA), who became something of a demigod in the IT department as he managed the loads and bottlenecks in the flow of information to an organisation’s mission-critical applications.
This also brought about bespoke technologies that automated many of the DBA’s functions and allowed him to concentrate on more important tasks in managing an ever-growing data warehouse.
As queries became more complex and multi-dimensional, business intelligence tools came to the forefront, giving organisations meaningful insights into the data they collected and stored to help them manage their business operations.
Thus far the DBA existed in relative comfort, orchestrating the flow of information in and out of the data warehouse with the help of many management tools that kept the organisation running at an optimum level. Then something unexpected happened. To understand this, we need to look quickly at what IDC classifies as the third platform of technology.
The first platform saw users connecting through a terminal to a central computer, either a mainframe or some other host system. The second platform saw this evolve to the use of personal computers in a client-server relationship and then, as the Internet came into the equation, to application servers and web-enabled applications.
The third platform, the current iteration, sees a democratisation of technology across enterprises and consumers through such trends as mobile devices, cloud computing, social media, application development platforms and analytics.
The third platform does not restrict access to any of the above. Where the first and second platforms were the domain of the enterprise, the third platform now lies in the palms of consumers of all ages, both to create and access information. This contributes to what IDC calls the digital universe, and it is the exponential growth of the digital universe which has led to the phenomenon known as big data.
By IDC’s estimation, 90 per cent of the world’s total data has been created in the last two years and 70 per cent of it by individuals. IDC predicts the digital universe will expand in 2013 by almost 50 per cent to just under 4 trillion gigabytes. Despite this, 38 per cent of organisations still don’t understand what big data is.
Big data is a term that has been thrown around quite extensively over the last couple of years, and in the process has been misused, misaligned, misconceived and misinterpreted.
At its heart, big data describes the sudden explosion of data through the proliferation of smartphones, tablets, sensors, scanners, machines and any other receptacle of electronic information, but the concept is far more encompassing than that.
IDC defines big data as a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis.
The real problem of big data is not so much about volume. Technologies are continually evolving to manage the growth of data, and the Hadoop Distributed File System seems to be the emerging standard most solutions are adopting. The real problem lies in the variety and velocity of data.
Big data is messy. It is unstructured and does not fit neatly into the rows and columns of the relational database. It is varied and comes in different types and from different sources.
Organisations are now collecting social media feeds, images, streaming video, text files, documents, telemetry data and so on, reading everything from sentiment, to expression, to electronic forms, to genomes, to soil temperatures and pH levels. This variety of data is hard to render into a structured format and almost impossible for Structured Query Language (SQL) to interpret.
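A small sketch can show why variety is awkward for a fixed table. The three records and all of their field names below are invented for illustration; the point is only what happens when differently shaped data is forced into one set of columns.

```python
# Hypothetical records from three different sources; every field name is made up.
records = [
    {"source": "social", "user": "a1", "text": "Loving the new store layout"},
    {"source": "sensor", "device": "soil-07", "temp_c": 18.4, "ph": 6.8},
    {"source": "form", "customer_id": 42, "fields": {"postcode": "2000"}},
]


def to_fixed_row(record, columns):
    """Force a record into a fixed set of columns; missing values become None."""
    return tuple(record.get(col) for col in columns)


# The only schema that fits every record is the union of all their fields.
columns = sorted({key for record in records for key in record})
rows = [to_fixed_row(record, columns) for record in records]

empty_cells = sum(value is None for row in rows for value in row)
total_cells = len(rows) * len(columns)
print(f"{empty_cells} of {total_cells} cells are empty")
```

In this toy case more than half of the table is empty, and the nested form data still does not reduce to a single cell value, which is the kind of mismatch the paragraph above describes.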
Data is being created as fast as it is being collected. High-velocity and streaming data can become obsolete minutes after it is created, as with the movement of markets on a stock trading floor or multimedia streaming used for surveillance and security. The challenge is to be able to act on insights from information that is constantly changing.
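One common way to handle data that goes stale quickly is a sliding time window, which keeps only recent events and discards the rest. The sketch below is an illustration rather than a description of any particular product: the SlidingWindow class, the 60-second limit and the sample prices are all assumptions, and it assumes events arrive in timestamp order.

```python
from collections import deque


class SlidingWindow:
    """Keep only events newer than max_age_seconds (hypothetical helper)."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        """Record a new event, then drop anything that has become stale."""
        self.events.append((timestamp, value))
        self._expire(timestamp)

    def _expire(self, now):
        # Assumes timestamps arrive in order, so stale events are at the front.
        while self.events and now - self.events[0][0] > self.max_age_seconds:
            self.events.popleft()

    def average(self):
        """Average of the values still inside the window, or None if empty."""
        if not self.events:
            return None
        return sum(value for _, value in self.events) / len(self.events)


# Made-up price ticks as (seconds, price) pairs.
window = SlidingWindow(max_age_seconds=60)
for ts, price in [(0, 10.0), (30, 11.0), (90, 12.0), (100, 14.0)]:
    window.add(ts, price)

print(window.average())
```

Any decision based on the window's average reflects only the last minute of activity, which is the trade-off streaming analysis makes: older data is deliberately forgotten so that action can be taken on what is current.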
However, even the variety and velocity of data may be the least of an organisation’s concerns. In a recent IDC study, which polled 300 organisations from all industries across Australia, 47 per cent of respondents revealed they do not have the skill sets required to manage big data.
While the DBA was adept at managing structured data in a relational database, and trusted to do so, he is suddenly out of his depth when it comes to mapping and contextualising large volumes of data of different types and from different sources.
The dilemma is that the skill sets required to manage big data are not those a DBA can typically up-skill for, which leaves many organisations exposed when it comes to dealing with the new unstructured data coming into their environment.