Data quality is critical to the success of any enterprise application. Systems from business intelligence to customer relationship management are destined to fail without high-quality data: "garbage in, garbage out", as the old adage goes.
George Preston scratched his head as he contemplated the road accident data from NSW. Something about it just did not make sense...
A director of software company Prometheus Information, Preston was preparing educational and promotional material for the HealthWiz product his company produces for the Commonwealth Department of Health and Ageing. The subject of road accidents seemed to offer fertile ground, so he had started with an age-standardized comparison of hospitalization rates across all states. With no particularly surprising comparisons standing out, he had moved on to prepare an age breakdown, expecting, as anecdotal evidence suggests, to find a peak in rates for young drivers. Yet while all other states lived up to those expectations, NSW was a distinctly different case.
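Age standardization of the kind Preston ran removes differences in age structure before states are compared. HealthWiz's actual method is not described in the article; the sketch below uses direct standardization, a standard epidemiological technique, with purely illustrative figures rather than real NSW data.

```python
# Direct age standardization: weight each age group's hospitalization
# rate by a common standard population, so states with different age
# structures can be compared fairly. All numbers are illustrative.

# (age group, hospitalizations, population) for a hypothetical state
state_data = [
    ("0-14",  120,   950_000),
    ("15-24", 480,   700_000),
    ("25-64", 900, 2_600_000),
    ("65+",   400,   600_000),
]

# Standard population shares used as weights (must sum to 1.0)
standard_weights = {"0-14": 0.20, "15-24": 0.14, "25-64": 0.52, "65+": 0.14}

def age_standardized_rate(data, weights, per=100_000):
    """Weighted sum of age-specific rates, expressed per 100,000."""
    return sum(
        (cases / pop) * per * weights[age]
        for age, cases, pop in data
    )

rate = age_standardized_rate(state_data, standard_weights)
print(f"Age-standardized rate: {rate:.1f} per 100,000")
```

Two states with identical age-specific rates but different age profiles will produce the same standardized figure, which is what makes the cross-state comparison meaningful.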
Could there be something about the NSW P-plate system that explained why drivers in their 20s were hospitalized at half the rate of their counterparts in other states? Perhaps, but that would not explain why elderly people and infants had hospitalization rates three to five times higher. Puzzled, Preston turned to the geographic distribution of the rates for males and females, finding that the low hospitalization rates were a NSW-wide phenomenon applying to both sexes.
Concluding there must be something systematically different about how the data is collected or compiled in NSW, Preston tried further analysis, hoping it would reveal some obvious reason for the phenomenon. It did not. Now, with nothing standing out, he plans to seek guidance from someone in NSW Health over whether they can explain the difference.
"I'm working on trying to determine how much I want to rely on this information, so I'm looking for a measure of confidence of some kind," Preston says. "The data has got to have an internal consistency, and it should line up with known information, so that there's kind of external validity as well.
"How one uses that information really depends on whether you've got some kind of coherent explanation for how the information is actually being generated. If you don't have some mental model in your head, it's generally pretty hard to use information, I think. The really important point is that usually you're looking for some kind of signal in the midst of noise. You need to try and get a handle on what the difference is between the noise and the signal."
In some cases a statistical test can help distinguish signal from noise; in others the organization can adjust the data for known causal factors. What counts is putting all the data on a comparable basis.
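One such test, sketched here with hypothetical counts rather than any figures from the article, is a two-proportion z-test: it asks whether a gap between one state's hospitalization rate and the others' is larger than sampling noise alone would produce.

```python
# Two-proportion z-test: is the difference between two rates signal
# or noise? Counts below are illustrative, not actual state data.
from math import erf, sqrt

def two_proportion_z(cases_a, pop_a, cases_b, pop_b):
    """Return the z statistic and two-sided p-value for rate_a vs rate_b."""
    p_a, p_b = cases_a / pop_a, cases_b / pop_b
    pooled = (cases_a + cases_b) / (pop_a + pop_b)
    se = sqrt(pooled * (1 - pooled) * (1 / pop_a + 1 / pop_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: one state's hospitalizations at half the pooled rate
z, p = two_proportion_z(350, 1_000_000, 700, 1_000_000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

A very small p-value says the gap is almost certainly not chance, which is exactly Preston's situation: the remaining question is whether the signal reflects the real world or a difference in how the data is compiled.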
Ultimately, it is a matter of applying common sense, Preston says. "The important thing is to actually pick up on the signal - the presence of the signal - and then you can work harder to try and get a better handle on what the signal is so you can focus your efforts around that aspect of the data quality, without worrying about the rest of it."
To help alert those using its data to its degree of reliability or otherwise, the HealthWiz team has developed a warning system that can be set to trigger for any specific value, variable or category in the data. These warnings are authored in consultation with the data custodian who supplied the collection in question.
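A system like the one described, warnings keyed to a variable, a category or a specific value, can be sketched as a small rule engine. The rule structure and names below are assumptions for illustration, not HealthWiz's actual design.

```python
# Minimal sketch of a data-reliability warning system: rules authored
# per collection (in consultation with the data custodian) fire when a
# record touches a flagged variable, category or value. Field names
# and messages here are invented for illustration.

warnings_config = [
    {"variable": "hospitalization_rate", "state": "NSW",
     "message": "NSW figures may be compiled differently; "
                "compare with other states cautiously."},
    {"variable": "age_group", "category": "65+",
     "message": "Small denominators in this group inflate variance."},
]

def warnings_for(record, rules=warnings_config):
    """Return the messages of every rule whose conditions match the record."""
    matched = []
    for rule in rules:
        ok = True
        for key, val in rule.items():
            if key == "message":
                continue
            if key == "variable":
                ok = ok and val in record       # record uses this variable
            else:
                ok = ok and record.get(key) == val  # exact field match
        if ok:
            matched.append(rule["message"])
    return matched

record = {"state": "NSW", "age_group": "25-34",
          "hospitalization_rate": 0.0012}
for msg in warnings_for(record):
    print("WARNING:", msg)
```

Keeping the rules as data rather than code is what lets a custodian, rather than a developer, decide which parts of a collection deserve a caution.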
"I think in general you should be able to alert users to parts of the data that are stronger and weaker," Preston says.