The (un)problem of Dirty Data

One of the oft-cited reasons to delay (or avoid) moving to an evidence-based culture is lack of trust in data quality. The data is riddled with errors, biased, and inaccurate. If used to support decision-making, it could drive you toward the wrong decisions. Right?

This is almost certainly false. More often than not, these concerns are an excuse to avoid change or to justify complacency.

Let’s look more closely at what is meant by poor data quality. First of all, there is no such thing as perfect data. No data collection scheme can ensure perfect measurement, transcription, and storage. There will be outliers, blanks, discontinuities, and mistakes. But to paraphrase George Box: “all data is wrong, but sometimes it’s useful”. In fact, you can almost always gain insights from data, even if it’s not error-free.

How is this possible?  

We need to consider two things: the nature of the error and the resolution of the decision.  

Random Error

Let’s first consider the nature of the error. The most common type is random error, and it shows up in virtually every data set that involves measurement. Sometimes you measure higher than reality; sometimes you measure lower. The individual errors vary unpredictably, and the reasons behind them are effectively unknown.

The trick with random error is that as you average together more and more observations, the errors tend to cancel each other out. You don’t need better data to make it go away; you just need more of it.

Imagine you're a fire chief looking to speed up your response performance. You decide to look at the turnout times of your different platoons. You pull yesterday’s numbers and see that Platoon A took 140 seconds on its first call and 160 seconds on its second call. Platoon B took 130 seconds on its only call.

Intuitively, you know that you can say very little from this data. To know which platoon is the fastest, you would have to look at dozens of responses and average them. And the more responses you look at, the more confident you can be. This is how you address the random error in turnout times.
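The averaging effect is easy to demonstrate with a quick simulation. A minimal sketch (the true turnout times and the 20-second error spread below are invented for illustration, not taken from the example):

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Invented "true" turnout times (seconds); Platoon A is really 5 s slower.
TRUE_A, TRUE_B = 145, 140

def observed(true_time):
    """One recorded turnout time: the truth plus random error (std dev ~20 s)."""
    return true_time + random.gauss(0, 20)

def mean_of(n, true_time):
    """Average n noisy observations."""
    return sum(observed(true_time) for _ in range(n)) / n

# With 2 calls the averages can easily rank the platoons backwards;
# with 200 the averages sit close to the true values.
for n in (2, 20, 200):
    print(n, round(mean_of(n, TRUE_A)), round(mean_of(n, TRUE_B)))
```

Nothing about the individual measurements improves as n grows; the error in the *average* simply shrinks, which is exactly why more data (not better data) is the remedy for random error.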

Systematic Error

The other main type of error is systematic error, and it is more insidious. It is difficult to detect and can lead you to the wrong conclusion. Worse, collecting more data doesn’t help: because the errors all point in the same direction, they reinforce one another instead of cancelling out.

Let’s continue the example from above. You find that Platoon A is about 20 seconds faster than Platoon B over the last year. You’re just about to lay down the law on Platoon B when one of your analysts contacts you. She has gone on some ride-alongs and noticed something interesting. Platoon A is younger than Platoon B, and its members all use the mobile data terminals to signal their departure. Platoon B is older and calls in its departures by radio.

After some further digging, she estimates that dispatchers take about 30 seconds to enter the en-route marker after a radio call. The method of collection is adding, on average, 30 seconds to each of Platoon B’s recorded turnout times. Good thing you didn’t bring down the hammer.

How do you fix this? You correct for it analytically: subtract 30 seconds from the reported turnout times (and add 30 seconds to the travel times) of Platoon B. Systematic error is dangerous, but it is not fatal.
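The correction itself is a small, mechanical adjustment. A sketch, assuming call records are simple dictionaries with the fields shown (the field names and the sample call are invented for illustration):

```python
DISPATCH_LAG = 30  # seconds: the analyst's estimate of radio-reporting delay

def corrected(record):
    """Shift the dispatch lag out of Platoon B's turnout time.

    The lag is moved into travel time rather than discarded, so the
    total response time for the call is unchanged.
    """
    if record["platoon"] == "B":
        record = {**record,
                  "turnout": record["turnout"] - DISPATCH_LAG,
                  "travel": record["travel"] + DISPATCH_LAG}
    return record

call = {"platoon": "B", "turnout": 160, "travel": 240}
print(corrected(call))  # turnout drops to 130; turnout + travel still totals 400
```

Once the bias is estimated, applying the fix to a year of records is trivial; the hard part, as the example shows, is noticing the bias in the first place.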

Decision Resolution

Now let’s talk about decision resolution. Decision resolution is the amount of precision you need to make a confident decision.

Again, an example is helpful. Suppose you are planning to add a vehicle to your fleet. Your rule of thumb is to add one unit for every thousand calls per year. The three-year average has been 14,500 calls, and you have 13 vehicles.

You present your request to council, but are denied. One of the council members noticed that some of your structure fire calls are really alarm calls. His audit estimates that as much as two per cent of calls are miscategorized and shouldn’t be counted. He suggests that your analysis is flawed.

Well, even if the council member is correct, a two per cent reduction would bring your call count down to 14,210. That is still above the 14,000 threshold that justifies a 14th vehicle. In fact, the error rate would have to roughly double (to four per cent) before you would need to reconsider your request.
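The arithmetic behind that judgment is worth making explicit. A sketch of the threshold check (the function name is invented; the figures come from the example above):

```python
CALLS = 14_500          # three-year average call volume
CALLS_PER_UNIT = 1_000  # rule of thumb: one vehicle per 1,000 calls per year

def justified_units(calls, error_rate):
    """Vehicles justified after discounting miscategorized calls."""
    return int(calls * (1 - error_rate) // CALLS_PER_UNIT)

print(justified_units(CALLS, 0.02))  # 14 -- the request still stands
print(justified_units(CALLS, 0.04))  # 13 -- only now does the case collapse
```

This is decision resolution in action: the decision flips only when the error exceeds the slack between your measurement (14,500) and the threshold (14,000), so a two per cent error is simply immaterial here.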

Perception is Reality

As far as a decision-maker is concerned, perceived accuracy is the same as true accuracy, and even small, immaterial errors can devastate your credibility. A good analyst deals with data quality up front, through outlier coding or error-robust analysis, so that by the time she presents results there are no noticeable flaws in the data.

In summary, analytics is by its nature equipped to deal with dirty data. A good analyst can identify, remove, or even correct erroneous data so that it doesn’t impact the quality of the decision. As a decision-maker, you should ask your analysts about sample size, and you should try to identify systematic errors. This process will improve both the quality of your team’s analysis and your confidence in it.