Data scientists: Question the integrity of your data

You can build a shiny model, but what use is it if it’s based on bogus data?

If there’s one lesson website traffic data can teach you, it’s that information is not always genuine. Yet, companies still base major decisions on this type of data without questioning its integrity.

At ADMA's Advancing Analytics in Sydney this week, Claudia Perlich, chief scientist of Dstillery, a marketing technology company, spoke about the importance of filtering out noisy or artificial data that can skew an analysis.

“Big data is killing your metrics,” she said, pointing to the large portion of bot traffic on websites.

“If the metrics are not really well aligned with what you are truly interested in, they can find you a lot of clicking and a lot of homepage visits, but these are not the people who will buy the product afterwards because they saw the ad.”

Predictive models that look at which users visit a brand’s home page, for example, can be completely flawed if nobody questions the integrity of the underlying data, she said.

“It turns out it is much easier to predict bots than real people. People write apps that skim advertising, so a model can very quickly pick up what that traffic pattern of bots was; it can predict very, very well who would go to these brands’ homepages as long as there was bot traffic there.”

In this case the predictive model will score well when its predictions are tested. However, that doesn’t bring marketers or the business any closer to their objective of real human ad conversions, Perlich said.

In addition, any model that looks too good to be true probably is, and data scientists must not let themselves be blinded by shiny models that boast perfect performance.

“This is [how we found] the [fake data] issue, because the performance of our model doubled to [help us] predict who goes to a brand's home page. Now, a double in performance in predictive modelling usually takes a lot of work. But I didn’t do anything, it just happened,” Perlich said of her experience with website traffic data.

“At that point you start to wonder what the hell is going on? We started seeing a huge amount of traffic of basically cookies circling from sequences of sites over and over again and in between hitting some of the brands that we were running campaigns for.

“There is a network of sites that are created just so that bots can ‘buy’ ads on those sites.”
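The looping pattern Perlich describes, cookies circling through the same sequence of sites, can be turned into a rough filter. The sketch below is only an illustration and not Dstillery's method: it assumes each cookie's browsing history is available as an ordered list of site names, and the function names, the `min_visits` cut-off and the `threshold` value are all hypothetical choices.

```python
def repetition_score(visits):
    """Share of site-to-site transitions that repeat an earlier transition.

    0.0 means every transition is new; values near 1.0 mean the cookie
    keeps walking the same path over and over.
    """
    transitions = list(zip(visits, visits[1:]))
    if not transitions:
        return 0.0
    return 1.0 - len(set(transitions)) / len(transitions)


def flag_looping_cookies(histories, min_visits=10, threshold=0.7):
    """Return the cookie ids whose history looks like a repeating loop.

    min_visits and threshold are assumed values that would need tuning
    against real traffic; people also revisit sites, so this is a
    heuristic rather than proof of bot activity.
    """
    return {
        cookie
        for cookie, visits in histories.items()
        if len(visits) >= min_visits and repetition_score(visits) >= threshold
    }


# Invented histories, purely for illustration.
histories = {
    "cookie_loop": ["site_a", "site_b", "brand_home"] * 5,
    "cookie_varied": [f"site_{i}" for i in range(10)],
}
flagged = flag_looping_cookies(histories)
```

In this toy data only `cookie_loop` is flagged, because it repeats the same three transitions throughout its history. Flagged cookies would typically be excluded before a model is trained or evaluated.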

Click-through data can also be misleading. Perlich gave the example of ads shown in the Flashlight app, which attract a high click-through rate not because people are interested in the ad, but because fumbling in the dark with a torch app usually results in accidental clicks.

“Here’s the dirty secret: If you ever want to have a really, really good mobile campaign in terms of click-through rate, just show the ad on the Flashlight app. Independent of the brand and the product, the click-through rates are the highest on the Flashlight app and some other game [apps].

“Now with that data available we can optimise [a predictive model] towards these things, and you end up with a population that is completely uninterested in your product.

“A click is a click, whether or not it was an intentional request for more information about the product is a completely different question. It’s the human interpretation that’s typically wrong, and it often requires a geek to question the typical interpretation.”
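The gap between clicks and intent shows up as soon as placements are ranked by two different metrics. The following is a minimal sketch with invented numbers, not figures from Perlich's talk; the `PlacementStats` class and the placement names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PlacementStats:
    """Aggregated counts for one ad placement (hypothetical schema)."""
    name: str
    impressions: int
    clicks: int
    conversions: int

    @property
    def ctr(self):
        # Click-through rate: clicks per impression.
        return self.clicks / self.impressions if self.impressions else 0.0

    @property
    def conversion_rate(self):
        # Conversions per impression, closer to what the business wants.
        return self.conversions / self.impressions if self.impressions else 0.0


# Invented counts, chosen only to show how the two rankings can disagree.
placements = [
    PlacementStats("utility_app", impressions=10_000, clicks=800, conversions=2),
    PlacementStats("news_site", impressions=10_000, clicks=100, conversions=20),
]

best_by_ctr = max(placements, key=lambda p: p.ctr).name
best_by_conversion = max(placements, key=lambda p: p.conversion_rate).name
```

Here the placement with the highest click-through rate is not the one with the most conversions per impression, which is the trap a model optimised purely for clicks would fall into.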

Noisy geographic data is another problem that many people miss, partly because of the hype surrounding this kind of data at the moment, especially in marketing, Perlich said.

“It turns out location is actually nothing 80 per cent of the time.”

She gave the example of an analysis that showed an implausibly large concentration of people in a rural part of the US.

“It’s the geographic mid point of the US, which many of the ad requests default to when they have to send latitude and longitude but don’t really know where you are. There is a very, very small percentage of geolocation data that is reliable,” she said.

“The same is true for probabilistic matching of devices. Is it really the same person who is sitting on that laptop or holding that mobile? You can tell with certainty for a small percentage of people but you can tell with much less certainty for a larger percentage of people,” she added.
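The default-coordinate problem Perlich describes for geolocation data lends itself to a simple sanity check: if one exact point accounts for an implausible share of all requests, it is more likely a placeholder than a crowd. The sketch below assumes location records arrive as `(latitude, longitude)` pairs. The function names, the rounding precision and the `max_share` threshold are hypothetical, and the sample coordinates are illustrative only.

```python
from collections import Counter


def suspicious_coordinates(points, precision=2, max_share=0.05):
    """Map each over-represented rounded (lat, lon) pair to its share of points.

    precision and max_share are assumed defaults; a genuinely dense area
    could trip the check too, so flagged points deserve a manual look.
    """
    if not points:
        return {}
    rounded = [(round(lat, precision), round(lon, precision)) for lat, lon in points]
    total = len(rounded)
    counts = Counter(rounded)
    return {coord: n / total for coord, n in counts.items() if n / total > max_share}


def drop_suspicious(points, precision=2, max_share=0.05):
    """Return only the points that do not sit on a flagged coordinate."""
    flagged = suspicious_coordinates(points, precision, max_share)
    return [
        (lat, lon)
        for lat, lon in points
        if (round(lat, precision), round(lon, precision)) not in flagged
    ]


# Illustrative sample: one coordinate repeated many times plus a few others.
sample_points = [(39.83, -98.58)] * 8 + [
    (40.71, -74.01),
    (34.05, -118.24),
    (41.88, -87.63),
    (29.76, -95.37),
]
flagged = suspicious_coordinates(sample_points, max_share=0.5)
clean = drop_suspicious(sample_points, max_share=0.5)
```

With this sample the repeated coordinate is flagged and removed, leaving the four distinct points. On real data the threshold would need to be far lower and checked against known population centres.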
