Five years after BEACH was introduced to analyse data collected from Australian GPs, its creators have a message for CIOs: achieving data quality can be a long, hard road.
About 85 per cent of Australians visit a GP at least once in any year. Altogether, that amounts to roughly 100 million consultations annually. Every consultation is unique, because every patient is an individual with their own combination of complaints (morbidity and associated other illnesses), and patients differ from each other in age, gender and background.
That makes it a hugely complex task to assess the value the community gets from the $2.5 billion a year spent in direct costs of GP visits, plus a further $7.5 billion a year in secondary costs such as prescribed drugs, specialist referrals and pathology, not to mention the health outcomes. So for decades the Family Medicine Research Centre at the University of Sydney and its predecessors have been developing and refining methods to improve the quality of their data about patients and “true life” outcomes.
The result of their work is BEACH (Bettering the Evaluation And Care of Health), a continuous national survey of GPs and the patients they treat, launched in 1998 with the aim of studying what goes on in GP surgeries and the outcome of the consultations. BEACH relies on data quality methodology developed within the Department of General Practice at the University of Sydney (which gave birth to the Family Medicine Research Centre in August 1999), refined during comparisons with population-reported information such as that found in the Australian Health Survey.
“The ideal research would be able to test in real life what happens when a real patient with mixed morbidity and a specific lifestyle is given a product as a treatment — looking at real-life effectiveness, rather than efficacy of treatment — which is what is being tested in randomised controlled trials,” says associate professor and director Helena Britt. “Now this is just huge — it is just fraught with difficulty, so for the past 30-odd years we’ve been developing methods to allow us to at least look at the consultations.”
Britt says the very size of the task (see “Here’s to Your Health”, below), not to mention the complexities involved and the solutions devised, has cast significant light on aspects of the ongoing data quality issues facing many organisations. She says the lessons she has learned about data quality have been invaluable.
Data collection and assessment for the survey have proven enormously complex, Britt says. For instance, a major issue in collecting data in Australian health is that it is very difficult to follow a patient. A researcher might think they have the patient’s full general practice medical record on their desk, when in fact the patient may very well have visited several other doctors, either within the same practice or externally.
Nevertheless, Britt’s team developed a method within the University of Sydney General Practice, as it was then, and worked out methods of collecting usable longitudinal patient data on its own patient population. “And that’s fine; that can work in a single general practice because you are representing — you think — your practice population. It sounds pretty simple. But it is actually not easy in Australia to define your practice population. Next we tested whether data collected from GPs at the consultation actually ‘reflects’ population-reported information, as is collected in the Australian Health Survey,” Britt says. “The question was: Does the GP really record what problems are managed at the consultation? And people were saying: ‘Well, if you ask the patient they will know, so you can test it against that’ — people regard patient-reported morbidity almost as a gold standard,” Britt says. “Now we know of course that patient self-report has its shortcomings. If you ask my great-aunt what she’s got she’ll tell me she’s got ‘blood pressure’. And I say: ‘Well, is it high or low? If you have no blood pressure you’re dead dear’.”
To resolve the dilemma the unit conducted a small local study under a National Health and Medical Research Council (NHMRC) grant, collecting information about a series of consultations recorded by a group of GPs and then following up by asking patients what they thought they had been treated for at the consultation. The outcome revealed a mismatch of about 30 per cent: a patient might assert that they consulted the doctor about, for example, a cold, while the GP would record hypertension.
“At a population level it all worked out in the wash but you couldn’t then break it down to over 65 year olds, or under 15s, because of this overlapping error,” Britt says.
“We only got a 29 per cent response rate from the GPs in this project, and it was always thought that if you had a low response rate you couldn’t reliably reflect what was really happening across all the GPs in the population you sampled. Ideally in studies you’d look for an 80 per cent response rate, although nobody ever gets that anyway. So we tested whether that low response rate actually reflected in the results, and we concluded that it didn’t — it was still representative of the group.”
That initial work done, the centre then conducted the first national study of general practice in 20-odd years, called the AMTS (the Australian Morbidity and Treatment Survey 1990-91), giving it the chance to apply the methods it had developed to a national survey, and leading to a database to help it model how such surveys should really be done.
That study involved a national random sample of 495 GPs (stratified by state) who each recorded details of all surgery and home consultations for two periods of one week, six months apart. Encounter details were recorded on structured paper forms, and GP recording weeks were evenly spread throughout the year. The resulting database incorporated records of more than 110,000 doctor-patient encounters and included more than 160,000 problem contacts.
But that study led to new complications.
“See, the trouble with pitfalls is you don’t know what they are until they have happened,” Britt says. “So we could say: ‘Alright, well here we have a data set. What does it tell us and what can we do with it?’”
First the centre modelled the sample size that would define the best and most cost-effective manner of representing national general practice activity. It relied on achieving a truly random sample, and if possible, standardising that against the sample frame (the population it was drawn from), and adjusting for any possible inadvertent bias due to low response rate.
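The kind of sample-size modelling described here can be sketched with the textbook formula for estimating a proportion, inflated by a design effect to account for clustering. The figures below are illustrative only, not BEACH's actual parameters:

```python
import math

def sample_size(p=0.5, margin=0.05, z=1.96, deff=1.0):
    """Encounters needed to estimate a proportion p to within +/- margin
    at ~95% confidence, inflated by a design effect for clustering."""
    n_srs = (z ** 2) * p * (1 - p) / margin ** 2  # simple random sample size
    return math.ceil(n_srs * deff)

# Illustrative: a 2 percentage-point margin with a design effect of 2
print(sample_size(p=0.5, margin=0.02, deff=2.0))
```

Using p = 0.5 is the conservative choice, since it maximises the required sample size.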
“So in BEACH you will see we report that we always have an under-representation of the very youngest group of GPs, the new completed registrars, because they don’t have to do quality assurance,” Britt says. “And if they don’t do quality assurance they don’t want to participate in BEACH, because that’s how they get rewarded for participating. We know about the under-representation of young GPs because we can test that against the original population of GPs that we drew the sample from. Usually that is not possible in health services research.”
Another pitfall is finding a reliable source of the population that you are trying to randomly sample. In this instance the only really reliable source is the Health Insurance Commission’s Medicare data, which can identify which GPs are actively practising.
Britt says while researchers can adjust for differences between their population and the source population, the problem is that many people who do health services research use a “convenience” population of GPs. “So they really don’t know how representative it is, and unless you can find a good baseline against which to compare and adjust, then you really don’t know how generalisable your results are.”
A further issue was that the cluster the centre was trying to sample was encounters, or consultations, yet the sample they were drawing on was general practitioners. According to Britt this complication highlights an issue many people who rely on data fail to recognise — that any cluster sample has to be dealt with statistically. “You have to adjust for that cluster sample, and you can do that with statistical programs such as SAS 8.2, but people first have to recognise they’ve got a cluster,” Britt says. “If you don’t adjust for that cluster effect you get much tighter estimates than you should, so your results look better than they should.”
Actually, it can be even more complicated than that. If you sample practices, and then sample GPs within the practice, and then conduct a sample of clusters around the individual GP, you would end up with a double cluster effect, she points out.
Like cluster effects, double cluster effects can be dealt with by using statistical programs to adjust for the recognised problem. These systems can also be used to deal with low response rates, problems of representativeness and other such issues, but again, that work will get done only if the problems are first recognised by the researcher. “I think one of the major problems is that people don’t recognise the pitfalls. There are ways to deal with them, but they don’t recognise them,” Britt says.
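The penalty from a cluster sample is often summarised by the design effect, deff = 1 + (m - 1) * ICC, where m is the number of encounters per GP and ICC is the intra-cluster correlation. A toy illustration with assumed values:

```python
# Why a cluster sample widens confidence intervals: the variance of a naive
# estimate must be inflated by the design effect. Values are illustrative.
def design_effect(m, icc):
    """deff = 1 + (m - 1) * icc for clusters of size m with correlation icc."""
    return 1 + (m - 1) * icc

m, icc = 100, 0.02            # 100 encounters per GP, modest within-GP correlation
deff = design_effect(m, icc)
print(deff)                    # naive variance understates truth by this factor
print(deff ** 0.5)             # standard errors widen by its square root
```

Even a small intra-cluster correlation of 0.02 nearly triples the variance at 100 encounters per GP, which is exactly the "much tighter estimates than you should" problem Britt describes.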
Putting It to the Test
Of course the complications did not stop there. The next step was to assess the reliability of the data the GP wrote down. To determine that, the centre videoed GPs during patient encounters and then compared the data they entered with the assessments of two independent clinicians who subsequently viewed the encounters on video.
The issue here is that there are numerous ways to label symptoms, and that each clinician has his or her own style of labelling. One GP might label a set of symptoms under a single broad heading, such as viral illness, whereas another GP might write the two symptoms down separately, not yet convinced those two symptoms represent a condition they could call a viral illness. “So we found differences in style like that — you know, grouping and not grouping — and such differences are reflected in the relative rate of events we report,” Britt says. It was also clearly no good relying on the same sample of GPs year after year, because it has long been known that, despite huge variances between GPs, individual GPs are remarkably consistent over time in the labels they use and the way they manage patients.
“So it’s not good having a system where you say: ‘I’ve got these 500 GPs and they’re going to record for me every three months’,” Britt says. “And you’ve also got to account for seasonality. It’s no good only recording in winter and then talking about what GPs do, because they do something totally different in winter than they do in summer. It’s no good looking at the relative rate of giving influenza vaccines in April and then talking about that as if it goes on the whole time, because of course it doesn’t — it’s only in April or May.” Another lesson, says Britt, is that you need to give participants (in this case the GPs) lots of instructions about filling in the forms. (She also always feeds back the results to participating GPs: a) because it rewards them to some degree for their time and effort, and b) because knowing they will get feedback motivates them to be more accurate about the forms they fill in.)
Of course when the data does come back from the GPs, it then has to be entered into a computer. That means finding ways to code and classify the data. “The reliability of that data entry is incredibly important,” Britt says. “It’s no good me calling a few casual staff in and saying: ‘Listen, just code and classify that to the following classification.’ You’ve got to put lots of time into training them, making sure, testing your internal coder consistency, having other staff check for error.
“After that, in a project like BEACH which is ongoing, you need to build in computer test checks. So if a data entry person tries to enter pregnancy and the patient is less than 10 years old, [the system] says: ‘Are you sure? Have another look at this will you?’ Then we do other tests later on a regular basis that look for other sorts of known inconsistencies, such as that you can’t have a vasectomy for a female. All those need to be built in.
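The computer test checks Britt mentions are essentially logical cross-field rules applied to each record at data entry. A hypothetical sketch (field names and rules are invented for illustration):

```python
# Sketch of logical cross-checks of the kind described in the article,
# applied to one encounter record at data entry. Fields are hypothetical.
def check_record(record):
    """Return a list of warnings for a single encounter record."""
    warnings = []
    if "pregnancy" in record.get("problems", []) and record.get("age", 0) < 10:
        warnings.append("Pregnancy recorded for patient under 10 - are you sure?")
    if "vasectomy" in record.get("procedures", []) and record.get("sex") == "F":
        warnings.append("Vasectomy recorded for a female patient - please re-check.")
    return warnings

print(check_record({"age": 8, "sex": "F",
                    "problems": ["pregnancy"], "procedures": []}))
```

Flagged records would be queued for a human to re-check against the paper form rather than silently corrected.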
“I think the underlying issue is you’ve got to recognise there is room for error at each step of the process,” Britt says. “And there will always be error, there is no question. We estimate our error rate to be just less than 1 per cent. That’s not bad in 100,000 consultations a year.
“When you’re doing specific analyses, looking at the data about a particular subject in much more detail, you can also ask: ‘Well, is this logical?’ If you’ve got an extreme outlier on prescribed daily dose, you go and check why it is there. If that is what the GP has written, it gets left in; but if it’s pretty illegible, it is probably better omitted, so it will be corrected or removed.”
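Outlier screening of this kind can be sketched with a simple interquartile-range rule; the dose values below are invented for illustration:

```python
import statistics

# Flag extreme outliers in a prescribed-daily-dose column using the
# interquartile range. A flagged value triggers a manual check, not deletion.
def flag_outliers(doses, k=3.0):
    qs = statistics.quantiles(doses, n=4)      # quartiles of the data
    q1, q3 = qs[0], qs[2]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [d for d in doses if d < lo or d > hi]

doses = [10, 12, 11, 9, 10, 13, 11, 250]       # 250 warrants a manual check
print(flag_outliers(doses))
```

Consistent with the article's practice, a flagged dose would be checked against what the GP actually wrote before being kept, corrected or removed.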
The final data quality check involves constantly looking at the results for inconsistencies. Britt says in her teaching role she never fails to be surprised by the tendency of young people to trust a computer ahead of their own brains and hence their failure to notice “impossible” results. “You need to be constantly looking at the logic and having a good feel for the data so that you can see where something is wrong. You can almost rely on your gut feelings about a result if you know your data well enough.”
Another problem to be aware of here is that as people move in and out of projects, you can lose continuity of knowledge, so a small change introduced in method some years ago is lost to corporate memory but may have resulted in a sudden change of output.
Results Consistent
Britt says the BEACH experience highlights that achieving quality data involves many steps, each of which can create error. Those aiming for data quality need to examine the reliability and validity of every step in the process, not just the total process.
“But what I can say now is that after five years, our results are so consistent. The things you would not expect to change are not changing, but we can measure changes in things you may expect to change. For instance, we unexpectedly discovered a decrease in the asthma management rate, after the introduction of the management plans. We thought: ‘This is funny. Is this just what we would call a glitch? You know, statistically you can test a million things and come up with 200,000 differences.’ So the next year we had a look, and that glitch had stayed steady. There had been [an initial] drop, it appeared to be the result of the introduction of the management plans, and now it had stayed steady.”
Now the quality of data is so refined that researchers can measure differences when new products come on the market, and the shifts that the GPs go through in changing from old products to new products.
“You can measure that sort of thing. And yet hypertension stays consistently managed at the same rate every year. So repeatability is something that gives you a better idea of the reliability of your data, particularly considering that you’ve got a changing sample of GPs,” Britt says.
Here’s to Your Health
BEACH (Bettering the Evaluation And Care of Health) combines health services research with traditional epidemiological research, assessing patient risk factors and health states in parallel with the study of health care delivery. The information is designed to provide general practice patient population estimates of the incidence and prevalence of conditions and risk factors. It will also serve to investigate the relationships between risk factors and health states and other aspects of the consultation (for example, problems managed).
It uses a cross-sectional, paper-based data collection system developed over the past 20 years in the Department of General Practice. (The system is paper-based because less than 10 per cent of GPs rely totally on computers for their patient medical records.) Data generated is used by researchers, government and industry.
The survey aims to fill what was an information black hole by getting a random sample of 1000 GPs every year throughout Australia — each of whom had claimed a minimum of 375 general practice Medicare items in the most recently available three-month Health Insurance Commission data period — to fill out forms regarding 100 consecutive consultations with their patients.
Information collected includes three inter-related data collections: encounter data, general practitioner characteristics and patient health behaviours. A different sample is surveyed every year. “With 100 million consultations in a year going on in Australia, and knowing damn-all about what happens at these encounters, we felt the least we could do was to go and find out,” associate professor and director Helena Britt says.
BEACH is conducted by the AIHW GP Statistics and Classification Unit, a collaborating unit of the Australian Institute of Health & Welfare and the Family Medicine Research Centre, University of Sydney. The program is funded by the Commonwealth Department of Health and Ageing, AstraZeneca (Australia), Roche Products Pty Ltd, Janssen-Cilag Pty Ltd, Merck Sharp & Dohme (Aust) Pty Ltd.