How do you know if you can trust analytical outcomes? Do you know where the data came from? Is the quality appropriate for the use case? Was the right data used? Have you considered the potential sources and effects of bias?
All of these issues matter, and bias is one of the most insidious because its sources and effects aren’t always obvious. Sadly, there are more types of bias than I can cover in this blog post, but the following are a few common ones.
Vendor research studies are a good example of selection bias, and the selection can creep in at several stages.
Think about it: Whom do they survey? Their customers. What are the questions? The questions are crafted and selected based on their ability to prove a point. If the survey reveals a data point or trend that does not advance the company agenda, that data point or trend will likely be removed.
Data can similarly be cherry-picked for an analysis. Different algorithms and different models can be applied to data, so selection bias can happen there. Finally, when the results are presented to business leaders, some information may be supplemented or withheld, depending on the objective.
This type of bias, when intentional, is commonly used to persuade or deceive. Not surprisingly, it can also undermine trust. What’s less obvious is that selection bias sometimes occurs unintentionally.
A sound analysis starts with a hypothesis, but never mind that. I want the data to prove I’m right.
Let’s say I’m convinced that bots are going to replace doctors in the next 10 years. I’ve gathered lots of research that demonstrates the inefficiencies of doctors and the healthcare system. I have testimonials from several futurists and technology leaders. Not enough? Fine. I’ll torture as much data as necessary until I can prove my point.
As you can see, selection bias and confirmation bias go hand-in-hand.
Outliers are values that deviate significantly from the norm. When they’re included in an analysis, the analysis tends to be skewed.
People who don’t understand statistics are probably more likely to include outliers in their analysis because they don’t understand their effect. For example, to get an average value, just add up all the values and divide by the number of individuals being analyzed (whether that’s people, products sold, or whatever). And voilà! End of story. Except it isn’t…
What if 9 people spent $100 at your store in a year, and the 10th spent $10,000? You could say that your average customer spend per year is $1,090. According to simple math, the calculation is correct: (9 × $100 + $10,000) / 10 = $1,090. However, it would likely be unwise to use that number for financial forecasting purposes, because nine of your ten customers spent less than a tenth of it.
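To make the arithmetic concrete, here is a minimal Python sketch that uses the ten customers from the example above and compares the mean with the median. The variable names are my own, and the median is just one of several ways to describe a "typical" customer.

```python
from statistics import mean, median

# Annual spend per customer, taken from the example above:
# nine customers at $100 and one at $10,000.
spend = [100] * 9 + [10_000]

average = mean(spend)    # pulled upward by the single large value
typical = median(spend)  # middle of the sorted values

print(f"mean:   ${average:,.0f}")
print(f"median: ${typical:,.0f}")
```

The mean comes out at $1,090 while the median stays at $100, which is much closer to what nine out of ten customers actually spent.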
Outliers aren’t “bad” per se. In use cases such as cybersecurity and fraud prevention, they are exactly what you are looking for. You just have to be careful about the effect outliers may have on your analysis. If you blindly remove outliers from a dataset without understanding them, you may miss an important indicator or the beginning of an important trend, such as an equipment failure or a disease outbreak.
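One way to avoid blindly removing outliers is to flag them for review instead of deleting them. The sketch below uses the common 1.5 × IQR rule of thumb, which is my choice of technique and only one of many. The `flag_outliers` helper and the sensor readings are hypothetical and invented purely for illustration.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return the values outside the k * IQR fences, without removing anything."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical equipment readings, made up for this example.
readings = [98, 101, 99, 102, 100, 97, 103, 100, 250]

flagged = flag_outliers(readings)
print("flagged for review:", flagged)
print("full dataset kept: ", readings)
```

The flagged values still need a human decision: a single extreme reading could be a faulty sensor, or it could be the first sign of the equipment failure you care about.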
Simpson’s Paradox drives another important point home: validate your analysis. When Simpson’s Paradox occurs, trends at one level of aggregation may reverse themselves at different levels of aggregation. Stated another way, datasets may tell one story, but when you combine them, they may tell the opposite story.
A famous example is a lawsuit that was filed against the University of California at Berkeley. At the aggregate level, one could “prove” that men were admitted at a higher rate than women. At the departmental level, the reverse proved true in some cases.
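A small Python sketch shows how such a reversal can happen. The department names and counts below are hypothetical numbers I made up for illustration. They are not the Berkeley figures.

```python
# Hypothetical (applicants, admitted) counts, invented for illustration only.
admissions = {
    "Dept A": {"men": (100, 80), "women": (20, 18)},
    "Dept B": {"men": (20, 4), "women": (100, 30)},
}

def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants

# Departmental level: compare rates within each department.
for dept, groups in admissions.items():
    men, women = rate(*groups["men"]), rate(*groups["women"])
    print(f"{dept}: men {men:.0%}, women {women:.0%}")

# Aggregate level: add up applicants and admits across departments first.
totals = {}
for groups in admissions.values():
    for group, (applicants, admitted) in groups.items():
        seen_applicants, seen_admitted = totals.get(group, (0, 0))
        totals[group] = (seen_applicants + applicants, seen_admitted + admitted)

overall = {group: rate(*counts) for group, counts in totals.items()}
print(f"Overall: men {overall['men']:.0%}, women {overall['women']:.0%}")
```

In this made-up data, women have the higher admission rate in both departments, yet men have the higher rate overall, because most women applied to the department that admits far fewer applicants. Checking an analysis at more than one level of aggregation is a cheap way to catch this.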