
Sensible behaviours for nonsensical data

Data quality is critical to the success of any enterprise application. Systems from business intelligence to customer relationship management are destined to fail without high-quality data - the "garbage in, garbage out" theory.

George Preston scratched his head as he contemplated the road accident data from NSW. Something about it just did not make sense . . .

A director of software company Prometheus Information, Preston was preparing some educational and promotional material for the HealthWiz product his company produces for the Commonwealth Department of Health and Ageing. The subject of road accidents seemed to offer fertile ground, so he had started with an age-standardized comparison of hospitalization rates across all states. Next, with no particularly surprising comparisons standing out, he had moved on to prepare an age breakdown, expecting - as all anecdotal evidence leads us to believe - to find a peak in rates for young drivers. Yet while all other states were living up to those expectations, NSW was a distinctly different case.

Could there be something about the NSW P-plate system that explained why drivers in their 20s were hospitalized at half the rate that they are in other states? Perhaps, but this would not explain why elderly people and infants had hospitalization rates three to five times higher. Puzzled, Preston then turned to the geographic distribution of the rates for males and females, finding that the low hospitalization rates were a NSW-wide phenomenon applying to both sexes.

Concluding there must be something systematically different about how the data is collected or compiled in NSW, Preston tried further analysis, hoping it would reveal whether the difference held across the whole of NSW and whether there might be some obvious reason for it. There was not. With nothing obvious standing out, he now plans to ask someone at NSW Health whether they can explain the difference.

"I'm working on trying to determine how much I want to rely on this information, so I'm looking for a measure of confidence of some kind," Preston says. "The data has got to have an internal consistency, and it should line up with known information, so that there's kind of external validity as well.

"How one uses that information really depends on whether you've got some kind of coherent explanation for how the information is actually being generated. If you don't have some mental model in your head, it's generally pretty hard to use information, I think. The really important point is that usually you're looking for some kind of signal in the midst of noise. You need to try and get a handle on what the difference is between the noise and the signal."

In some cases a statistical test can help distinguish signal from noise; in others the organization can adjust the data for known causal factors. What counts is putting all data on a comparable basis.
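One common way to put rates on a comparable basis is direct age standardization, the kind of adjustment behind an age-standardized comparison between states. The sketch below is illustrative only: the age bands, counts and standard-population weights are invented for the example and are not HealthWiz or NSW figures, and the function name is hypothetical.

```python
def age_standardized_rate(cases, population, standard_weights):
    """Directly age-standardized rate per 100,000 people.

    cases, population and standard_weights are dicts keyed by age band.
    standard_weights holds each band's share of a reference population
    and must sum to 1.
    """
    if abs(sum(standard_weights.values()) - 1.0) > 1e-9:
        raise ValueError("standard weights must sum to 1")
    rate = 0.0
    for band, weight in standard_weights.items():
        # Age-specific rate, weighted by the reference population share.
        rate += weight * cases[band] / population[band]
    return rate * 100_000


# Invented figures for two hypothetical regions (not real data).
weights = {"0-19": 0.25, "20-39": 0.30, "40-64": 0.30, "65+": 0.15}

region_a = {
    "cases": {"0-19": 50, "20-39": 120, "40-64": 60, "65+": 30},
    "population": {"0-19": 100_000, "20-39": 100_000,
                   "40-64": 100_000, "65+": 50_000},
}
region_b = {
    "cases": {"0-19": 40, "20-39": 60, "40-64": 90, "65+": 80},
    "population": {"0-19": 80_000, "20-39": 60_000,
                   "40-64": 120_000, "65+": 80_000},
}

for name, region in (("A", region_a), ("B", region_b)):
    rate = age_standardized_rate(region["cases"], region["population"], weights)
    print(f"Region {name}: {rate:.1f} per 100,000")
```

Because both regions are weighted by the same reference age structure, a gap between the two results reflects differences in age-specific rates and not differences in how old each population is.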

Ultimately, it is a matter of applying common sense, Preston says. "The important thing is to actually pick up on the signal - the presence of the signal - and then you can work harder to try and get a better handle on what the signal is so you can focus your efforts around that aspect of the data quality, without worrying about the rest of it."

To alert users to how reliable its data is, the HealthWiz team has developed a warning system that can be set to trigger for any specific value, variable or category in the data. The warnings are authored in consultation with the data custodian who supplied the collection in question.
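A warning system of this general shape can be sketched in a few lines. This is not the HealthWiz implementation, whose design the article does not describe; the class, function, variable names, categories and thresholds below are all hypothetical, chosen only to show a rule that matches on a variable, an optional category and an optional value condition.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WarningRule:
    """A custodian-authored note attached to part of a data collection."""
    variable: str                    # variable the rule applies to
    message: str                     # text shown to the user
    category: Optional[str] = None   # None matches any category
    trigger: Optional[Callable[[float], bool]] = None  # None matches any value

    def applies(self, variable: str, category: str, value: float) -> bool:
        if variable != self.variable:
            return False
        if self.category is not None and category != self.category:
            return False
        return self.trigger is None or self.trigger(value)


def warnings_for(rules, variable, category, value):
    """Return the messages of every rule that matches this data point."""
    return [r.message for r in rules if r.applies(variable, category, value)]


# Hypothetical rules: one tied to a category, one tied to a value range.
rules = [
    WarningRule("hospitalization_rate",
                "Collection method differs here; compare with care.",
                category="Region X"),
    WarningRule("hospitalization_rate",
                "Very low rate; check against other sources.",
                trigger=lambda v: v < 5.0),
]

print(warnings_for(rules, "hospitalization_rate", "Region X", 3.2))
print(warnings_for(rules, "hospitalization_rate", "Region Y", 40.0))
```

Keeping the rules as plain data separate from the figures means a custodian can add or retire a warning without touching the collection itself.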

"I think in general you should be able to alert users to parts of the data that are stronger and weaker," Preston says.
