Some common statistical mistakes

September 15, 2020

“Everyone has their faults. Even I have one.” is one of my favorite quotes, made (mercifully) in jest by a former colleague. Statistics is hard, and every statistician, however esteemed, has made mistakes. How consequential those mistakes were, and whether they really were mistakes, may be a matter of opinion.

Even experienced statisticians may disagree about which model is “best,” for instance. This can rarely be settled mechanically by AIC, BIC or similar criteria, or by k-fold cross-validation. Statistics is not arithmetic.

Statistics is getting harder too, in terms of theoretical sophistication as well as mathematical and computational complexity. I’ve been at it more than thirty years and it’s certainly not gotten any easier for me.

Setting all this aside, what are some common statistical mistakes? Perhaps the most persistent and most serious are not understanding the purpose of the analysis we’re conducting and not having sufficient background information. The two are often interrelated. Statistics can’t just be done “by the numbers,” as noted earlier.

Equally indefensible are not clearly understanding what the variables in our data mean or where the data came from (e.g., the sampling method or data sources), and not sufficiently checking and cleaning the data. None of this should still happen in the 21st century, but it does, and even statisticians, who should know better, are not entirely innocent.

Though less common among statisticians, it’s not unusual to see claims that something “works” when there is no control group or when it is unclear how similar the groups or organizations being compared were prior to the treatment or exposure. This is quite common in the business media and blogosphere, but I’ve also noticed it in medical and epidemiological journals, which does not make one rest easily at night.

Confusing cause and effect and ignoring the possibility that an association between two variables is spurious – the result of a third variable – are two other common and potentially serious mistakes. The frequencies of drinking beer and eating ice cream are correlated, for instance, but the underlying cause of this association is the weather.
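The beer and ice cream example can be sketched with a small simulation. The numbers here are invented purely for illustration: temperature is assumed to drive both behaviors, and the coefficients and noise levels are arbitrary. The raw correlation is clearly positive, but once temperature is regressed out of both variables, the correlation between what is left over should be close to zero.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, x):
    """Residuals from a simple least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 2000

# Hypothetical data: temperature is the only thing linking the two outcomes.
temperature = [rng.gauss(20, 8) for _ in range(n)]
beer = [0.5 * t + rng.gauss(0, 3) for t in temperature]
ice_cream = [0.4 * t + rng.gauss(0, 3) for t in temperature]

raw_r = pearson(beer, ice_cream)
# Partial correlation: correlate what remains after removing temperature.
partial_r = pearson(residuals(beer, temperature),
                    residuals(ice_cream, temperature))

print(f"raw correlation:                 {raw_r:.2f}")
print(f"after adjusting for temperature: {partial_r:.2f}")
```

In real data the third variable is rarely known this cleanly, and often it is not measured at all, which is why background knowledge matters more than the arithmetic.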

Determining the most appropriate general model form can be complicated. For example, many will use OLS linear regression when the outcome (dependent variable) is a count. This is not ideal, since OLS can predict negative counts and assumes the variance is constant. Moreover, more than twenty kinds of models have been developed for count data, so deciding which specific type of model to use is typically challenging. The line between a mistake and a judgment call is often fuzzy.
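To make the count-data point concrete, here is a minimal sketch that fits both a straight line and a Poisson regression (log link) to simulated counts. Everything in it is an assumption for illustration: the true coefficients, the sample size, and the hand-rolled Newton-Raphson fit, which stands in for what a statistics package would normally do. Poisson regression is only one of the many count models mentioned above, not necessarily the right one for a given data set.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def fit_ols(x, y):
    """Intercept and slope of a simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def fit_poisson(x, y, iterations=25):
    """Poisson regression with a log link, fitted by Newton-Raphson."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * a) for a in x]
        # Gradient of the log-likelihood.
        g0 = sum(b - m for b, m in zip(y, mu))
        g1 = sum((b - m) * a for a, b, m in zip(x, y, mu))
        # Information matrix (2 x 2), inverted by hand.
        h00 = sum(mu)
        h01 = sum(m * a for a, m in zip(x, mu))
        h11 = sum(m * a * a for a, m in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 2000
x = [rng.uniform(0, 3) for _ in range(n)]
# Hypothetical truth: log(expected count) = -1 + 1.0 * x
y = [poisson_draw(rng, math.exp(-1.0 + 1.0 * a)) for a in x]

ols_intercept, ols_slope = fit_ols(x, y)
pois_b0, pois_b1 = fit_poisson(x, y)

print(f"OLS predicted count at x = 0:     {ols_intercept:.2f}")
print(f"Poisson predicted count at x = 0: {math.exp(pois_b0):.2f}")
print(f"Poisson slope on the log scale:   {pois_b1:.2f}")
```

With these settings the straight line predicts a negative count at the low end of x, which is impossible, while the Poisson model's prediction stays positive by construction.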

Ignoring distributional or other statistical assumptions is another common mistake. How much this matters falls into the “it depends” category. Statistical methods, in general, are quite robust to violations of their assumptions. An exception is when we apply a model which assumes a linear (straight line) relationship between the outcome and predictor (independent variable) when this association is distinctly non-linear. On the other hand, building a highly parameterized (i.e., complicated) model when the departure from linearity is at best modest is an example of overfitting and also a waste of precious time.
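One simple way to spot a distinctly non-linear relationship is to fit the straight line anyway and look at the residuals across the range of the predictor. The sketch below uses made-up U-shaped data (the curve, noise level, and cut points are all assumptions) and averages the residuals within the lower, middle, and upper thirds of x. If the linear form were adequate, all three averages would sit near zero.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is reproducible
n = 900

# Hypothetical data: a U-shaped relationship plus noise.
x = [rng.uniform(0, 10) for _ in range(n)]
y = [(a - 5) ** 2 + rng.gauss(0, 2) for a in x]

# Fit a straight line by least squares.
mx, my = sum(x) / n, sum(y) / n
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx
resid = [b - (intercept + slope * a) for a, b in zip(x, y)]


def mean_resid(lo, hi):
    """Average residual for observations with lo <= x < hi."""
    vals = [r for a, r in zip(x, resid) if lo <= a < hi]
    return sum(vals) / len(vals)


low = mean_resid(0, 10 / 3)
mid = mean_resid(10 / 3, 20 / 3)
high = mean_resid(20 / 3, float("inf"))

print(f"mean residual, lower third:  {low:+.2f}")
print(f"mean residual, middle third: {mid:+.2f}")
print(f"mean residual, upper third:  {high:+.2f}")
```

Here the line underpredicts at both ends and overpredicts in the middle, a pattern that a single summary statistic would hide. When the pattern is faint, adding curvature to the model may not be worth the extra parameters.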

Whether faults or not, everyone has their cognitive biases, including statisticians. “My Favorite Cognitive Biases” gives a snapshot of this elusive but important topic.

“Statistical Mistakes Even Scientists Make,” “Funny Things Happened on the Way to the Forum,” and “Costly Mistakes Marketing Researchers Make” probe these subjects in a bit more depth and from different angles. The first looks at science generally, the second at data science, and the last, as indicated by the title, at marketing research.

I hope you’ve found this interesting and helpful!
