Shining a light on dark data

Guy Betar looks at some of the legal and other implications of dark data

You have to love the IT industry – it really does conceive the best new terms, acronyms and buzzwords. “Dark data” sounds like something Ridley Scott came up with for the Prometheus sequel.

Whilst there are a variety of definitions for the term, they are all very similar. My suggested meaning is: retained data or information that is not being used or accessed for current business activities, and whose contents are not well known or structured in a readily accessible manner.

This most likely sounds like a verbose way of saying “old, retained information.” The interesting thing about dark data is that in fact at its simplest, it literally is “old, retained information.”

That being so, then why is it attracting attention? What does it have to do with contemporary business? For that matter, what legal interest does it have?

Data has become one of the world’s most valued commodities – however, unlike gold, it is definitely not the case that the more you collect, the greater the wealth.

The value in data is knowing what information it contains, being able to organise and connect it, and understanding how that information can be best used.

The interesting duality of dark data is that on the one hand, it may contain information that could be extremely valuable and useful – alternatively, it might be information that could be very damaging – conceivably it might be both at the same time!

The most problematic aspect of dark data is that by its very nature, the entity to which it belongs does not really know in detail what it contains and what value and/or risk it presents.

With the expansion of digital storage capacity, and a corresponding reduction in the cost of such storage, there has for some while been a tendency for organisations to store almost all the data that they create and/or collect.

However, several factors are now having the effect of slowing down this “store everything” tendency. Firstly, even digital storage has a cost, and the massive increase in retained data is giving rise to a noticeable overhead.

Secondly, whilst digital data is not vulnerable to the problems of traditional warehouse storage, there is a whole other range of vulnerabilities and hazards associated with digital data and storage – hacking and security breaches not the least of them.

Thirdly, the amount of retained data is becoming so large that for many substantial organisations, there is no cost-effective way to order it, categorise it and make it readily accessible. In its worst form, this problem amounts to literally not knowing what data you have.

Why is this becoming a concern? If an organisation has lots of data, but it doesn’t know what is in it – what is the problem?

There is huge marketing value in data pertaining to people’s buying patterns and choices. However, customer data very likely contains personal information, and retention and use of such data is regulated stringently by the various Privacy Acts around the world.

The Australian Privacy Act, and in particular the Australian Privacy Principles which form part of that Act, place stringent controls and requirements on the collection of personal information, the purpose for which it was collected, retention of it, security measures applied to it, and use of it.

If you do not know clearly what is in your data, you will not know if it contains personal information and whether you initially complied, and continue to comply, with the Privacy Act.
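
As a rough illustration of what a first pass over unreviewed data might look like, the sketch below flags text that may contain personal information. The function names, the keyword list and the email pattern are all assumptions made for this example; they are not a test of compliance with the Privacy Act, and real personal information takes many more forms than these.

```python
import re

# Illustrative pattern and keywords only (assumptions for this sketch).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KEYWORDS = ("date of birth", "passport", "medicare", "home address")


def flag_personal_information(text):
    """Return the email addresses and keywords found in a block of text."""
    lowered = text.lower()
    return {
        "email_addresses": EMAIL_PATTERN.findall(text),
        "keywords": [word for word in KEYWORDS if word in lowered],
    }


def may_contain_personal_information(text):
    """True if the text matched any illustrative indicator."""
    result = flag_personal_information(text)
    return bool(result["email_addresses"] or result["keywords"])
```

A positive result only means the data deserves a closer look by someone who understands the privacy obligations; a negative result does not show that the data is free of personal information.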

In addition to concerns relating to personal information, there is a huge potential for volatile information and documents to be retained, with little or no current knowledge of their existence.

From a risk/liability point of view, if you do not know what information and documents you have, you cannot assess liability and risk. You also cannot determine your compliance with corporate and/or industry specific regulation.

Potentially problematic and/or damaging documents could range from emails to internal communications, to company forms and agreements.

This is not necessarily referring to documents pointing to deliberate illegal behaviour – it could be records or communications relating to what might once have been acceptable conduct or practices, but which have subsequently been legislated against.

From an alternative perspective, a company should know what sensitive records it has so it can comply with legal obligations of production where necessary.

Further, as part of its document management and security processes, knowing what documents and information are in its possession enables an organisation to properly assess risk and liability, and to ensure appropriate measures, including insurance, are taken to protect them.

Recent incidents of large-scale data breaches have shown that hackers are more than prepared to trawl through large volumes of data in the hope of finding valuable information.

The Sony hack of November 2014 involved possibly 100 terabytes of data. Around 1.1 million customer records were stolen during an attack on the network of US healthcare provider CareFirst BlueCross BlueShield.

An attack on Harvard University also exposed records of eight separate schools, and an improperly handled data transfer revealed personal information belonging to more than 850,000 current and former US Army National Guard members.

It is bad enough when you know what is taken. If your company suffers a security breach, and the data taken contains sensitive information that you did not even realise was there or are unsure about, it could give rise to major liability issues and significant damage to reputation.

The increasing use of cloud facilities is adding to this potential problem. The “out of sight out of mind” adage is all too likely to apply.

Old data is a prime candidate for storage on the cloud, as it is cheaper than having it reside on your own storage facilities and, given its perceived nature, the company does not need regular and easy access to it.

Allowing ‘cost’ to be the dominant driver of your data solutions and philosophy could be a high risk approach.

From a data management point of view, as well as risk minimisation, I suggest the following guidelines should be applied when looking at data, and how and where it is stored:

  • Data should not be stored anywhere, locally or in the cloud, unless it has been reviewed, its contents identified and appraised, and that information recorded.
  • At least one manager outside of the IT group should be aware of all data repositories, their contents, and how to retrieve the data.
  • An essential part of the storage process should be a risk assessment review, from the perspectives of internal loss, liability to third parties, and compliance with legal requirements.
  • There should be annual reviews of the processes and whether the guidelines are being complied with, including an assessment of the guidelines themselves.
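
One way to make the guidelines above checkable is to keep a simple register of data repositories. The following is a minimal sketch under stated assumptions: the `RepositoryRecord` fields, the 365-day review interval and the wording of each issue are invented for illustration and would need to be adapted to an organisation's own processes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumed interval for the annual review guideline.
REVIEW_INTERVAL = timedelta(days=365)


@dataclass
class RepositoryRecord:
    """One entry in a hypothetical register of data repositories."""
    name: str
    location: str                  # for example "local" or "cloud"
    contents_summary: str          # empty if contents are not yet identified
    business_owner: str            # manager outside the IT group; empty if none
    risk_assessed: bool            # has a risk assessment review been done?
    last_reviewed: Optional[date]  # None if never reviewed


def outstanding_issues(record: RepositoryRecord, today: date) -> List[str]:
    """List the guidelines this repository does not yet satisfy."""
    issues = []
    if not record.contents_summary.strip():
        issues.append("contents not identified and recorded")
    if not record.business_owner.strip():
        issues.append("no manager outside IT assigned")
    if not record.risk_assessed:
        issues.append("risk assessment not completed")
    if record.last_reviewed is None or today - record.last_reviewed > REVIEW_INTERVAL:
        issues.append("annual review overdue")
    return issues
```

For example, a record created as `RepositoryRecord("2009 email archive", "cloud", "", "", False, None)` would be reported with all four issues, which is the position many organisations are in with their dark data.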

There is an aspect of false economy in placing large quantities of non-contemporary data onto cheaper data storage solutions, where there is inadequate or no knowledge of the contents of that data.

The cost of properly investigating the data and understanding what it contains may be far outweighed at some later stage by the costs and possible damage that arise from production and review in contentious and/or liability situations. Know what data you have, and keep that information up to date.

Guy Betar is a corporate/IT lawyer with more than 20 years’ experience. He is currently special counsel at Salvos Legal and can be contacted by email at
