Sensible behaviours for nonsensical data
- 07 December, 2004 13:24
Data quality is critical to the success of any enterprise application. Systems from business intelligence to customer relationship management are destined to fail without high-quality data - the "garbage in, garbage out" theory.
George Preston scratched his head as he contemplated the road accident data from NSW. Something about it just did not make sense . . .
A director of software company Prometheus Information, Preston was preparing some educational and promotional material for the HealthWiz product his company produces for the Commonwealth Department of Health and Ageing. The subject of road accidents seemed to offer fertile ground, so he had started with an age-standardized comparison of hospitalization rates across all states. Next, with no particularly surprising comparisons standing out, he had moved on to prepare an age breakdown, expecting - as all anecdotal evidence leads us to believe - to find a peak in rates for young drivers. Yet while all other states were living up to those expectations, NSW was a distinctly different case.
Could there be something about the NSW P-plate system that explained why drivers in their 20s were hospitalized at half the rate that they are in other states? Perhaps, but this would not explain why elderly people and infants had hospitalization rates three to five times higher. Puzzled, Preston then turned to the geographic distribution of the rates for males and females, finding that the low hospitalization rates were a NSW-wide phenomenon applying to both sexes.
Concluding that there must be something systematically different about how the data is collected or compiled in NSW, Preston tried further analysis, hoping it would reveal whether the difference applied across the whole of NSW and whether there was some obvious reason for it. There was not. With nothing obvious standing out, he now plans to ask NSW Health whether it can explain the difference.
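For readers unfamiliar with the age-standardized comparison Preston started with, the idea can be sketched in a few lines of Python. This is a minimal illustration of direct age standardization, not the HealthWiz method; the function name, age bands and all figures are assumptions made up for the example.

```python
from typing import Dict


def age_standardized_rate(
    cases: Dict[str, int],
    population: Dict[str, int],
    standard_population: Dict[str, int],
    per: int = 100_000,
) -> float:
    """Direct age standardization.

    Each age-specific rate (cases / population) is weighted by the share
    of that age group in a common standard population, so that states
    with different age structures can be compared on the same basis.
    """
    total_standard = sum(standard_population.values())
    rate = 0.0
    for group, standard_count in standard_population.items():
        age_specific_rate = cases[group] / population[group]
        rate += age_specific_rate * (standard_count / total_standard)
    return rate * per


# Illustrative figures only; they do not come from any real collection.
standard = {"0-19": 3_000, "20-59": 5_000, "60+": 2_000}
example_cases = {"0-19": 30, "20-59": 100, "60+": 40}
example_population = {"0-19": 30_000, "20-59": 50_000, "60+": 20_000}

example_rate = age_standardized_rate(example_cases, example_population, standard)
print(f"{example_rate:.1f} hospitalizations per 100,000")
```

A standardized rate can look unremarkable while the age-specific rates inside it diverge sharply, which is why an age breakdown is a natural next step.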
"I'm working on trying to determine how much I want to rely on this information, so I'm looking for a measure of confidence of some kind," Preston says. "The data has got to have an internal consistency, and it should line up with known information, so that there's kind of external validity as well.
"How one uses that information really depends on whether you've got some kind of coherent explanation for how the information is actually being generated. If you don't have some mental model in your head, it's generally pretty hard to use information, I think. The really important point is that usually you're looking for some kind of signal in the midst of noise. You need to try and get a handle on what the difference is between the noise and the signal."
In some cases a statistical test can help distinguish signal from noise; in others the organization can adjust the data for known causal factors. What counts is putting all the data on a comparable basis.
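As one example of such a test, the sketch below compares two hospitalization rates using a normal approximation to the log of their ratio, treating the counts as Poisson. The article does not say which test Preston would use; this choice, the function name and the figures are assumptions for illustration only.

```python
import math
from typing import Tuple


def rate_ratio_test(
    cases_a: int, population_a: int, cases_b: int, population_b: int
) -> Tuple[float, float, float]:
    """Compare two rates, treating the case counts as Poisson.

    Returns (rate ratio, z statistic, two-sided p-value). The standard
    error of the log rate ratio is approximated by sqrt(1/a + 1/b),
    which is only reasonable when both counts are fairly large.
    """
    ratio = (cases_a / population_a) / (cases_b / population_b)
    standard_error = math.sqrt(1 / cases_a + 1 / cases_b)
    z = math.log(ratio) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return ratio, z, p_value


# Illustrative figures only: one group hospitalized at half the rate of another.
ratio, z, p_value = rate_ratio_test(50, 100_000, 100, 100_000)
print(f"rate ratio={ratio:.2f}, z={z:.2f}, p={p_value:.5f}")
```

A small p-value says only that the gap is unlikely to be chance. It cannot say whether the cause lies in the roads or in how the data was collected, which is the question Preston is left with.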
Ultimately, it is a matter of applying common sense, Preston says. "The important thing is to actually pick up on the signal - the presence of the signal - and then you can work harder to try and get a better handle on what the signal is so you can focus your efforts around that aspect of the data quality, without worrying about the rest of it."
To alert users to how reliable its data is, the HealthWiz team has developed a warning system that can be set to trigger for any specific value, variable or category in the data. The warnings are authored in consultation with the data custodian who supplied the collection in question.
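The article does not describe how the HealthWiz warnings are implemented. As a rough sketch of the general idea, a warning can be stored as a rule that names a variable, a category, a value, or any combination, and is shown whenever the data being viewed matches. Every name, field and message below is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataWarning:
    """A custodian-authored caveat; a field left as None matches anything."""

    message: str
    variable: Optional[str] = None  # e.g. the measure being reported
    category: Optional[str] = None  # e.g. the dimension used for the breakdown
    value: Optional[str] = None     # e.g. one specific value of that dimension

    def applies_to(self, variable: str, category: str, value: str) -> bool:
        return (
            self.variable in (None, variable)
            and self.category in (None, category)
            and self.value in (None, value)
        )


def warnings_for(
    rules: List[DataWarning], variable: str, category: str, value: str
) -> List[str]:
    """Return the messages of every rule that matches the cell being viewed."""
    return [r.message for r in rules if r.applies_to(variable, category, value)]


# Hypothetical rules, for illustration only.
rules = [
    DataWarning("Illustrative caveat for one state.", category="state", value="NSW"),
    DataWarning("Illustrative caveat for a whole variable.", variable="hospitalizations"),
]

print(warnings_for(rules, "hospitalizations", "state", "NSW"))
print(warnings_for(rules, "population", "state", "VIC"))
```

Keeping the rules as data rather than code means a data custodian could review and edit the warning text without touching the application.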
"I think in general you should be able to alert users to parts of the data that are stronger and weaker," Preston says.