What is Statistics?

The field of statistics is finally getting some long-overdue attention. I used to joke that statisticians had an image problem: We didn’t have an image. I recall sitting next to a fellow on a long-haul flight and chatting with him about work. He was an electrical engineer – a quant guy like me – yet didn’t know what a statistician was. He’d never even heard of statistics. This was in 2010, not exactly ancient history.

There is now a lot of talk about statistics but also a lot of confusion about what the word means. It sometimes refers to numbers or figures, as in “according to these statistics…” It is also used to describe complex mathematical modeling. The term itself, which I’ve never cared for, derives from the idea of the nation state and the professed need of states to base policy on demographic and economic data. Statistics is not a recent innovation and dates back at least as far as the 5th century BC. Some historical and other details are given in Vital Statistics You Never Learned…Because They’re Never Taught, an interview with eminent statistician Frank Harrell.

Statistics is used in many ways. One is to estimate a population quantity from a sample, for instance, what percentage of motorists in the US have heard of our brand of motor oil. Based on our sample, we may also want to know whether brand awareness and other marketing metrics vary by motorist characteristics in the population, and inferential methods such as t-tests, chi-square tests and ANOVA are helpful for this.
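
As a rough sketch of the first use, the snippet below estimates a population percentage from a sample and attaches a 95% confidence interval based on the normal approximation. The sample size and the number of aware motorists are made-up figures, and `proportion_ci` is a hypothetical helper name rather than a function from any particular package.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Point estimate and normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p_hat = successes / n                      # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error
    return p_hat, p_hat - z * se, p_hat + z * se

# Hypothetical survey: 320 of 1,000 sampled motorists have heard of the brand.
p_hat, low, high = proportion_ci(320, 1000)
print(f"Estimated awareness: {p_hat:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The normal approximation is only one of several interval methods and can behave poorly with small samples or proportions near 0 or 1, so treat this as an illustration rather than a recommendation.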

Another frequent use is to examine associations between pairs of variables, as in the scatterplots depicting the correlation between Y and X most of us have seen. The Pearson product-moment correlation coefficient may ring a bell. This basic notion can be extended to more than two variables with principal components and factor analysis and to groups of variables with canonical correlation. In marketing research, methods such as these are often used in mapping.
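
For readers who have not seen it spelled out, here is a minimal sketch of the Pearson correlation coefficient computed from its definition. The data are invented for illustration, and `pearson_r` is a hypothetical helper name; in practice most people would use a statistics package instead.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up paired observations of X and Y
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.3f}")
```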

Sometimes we want to explain or predict a variable from others, and regression analysis is an example of this application. Identifying variables that discriminate among groups is yet another, and discriminant analysis, logistic regression and probit analysis are three methods commonly used. Conjoint analysis is a variation on this theme in which product characteristics are used to explain and predict product choice.
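
To make the idea of predicting one variable from another concrete, the sketch below fits a simple linear regression with one predictor by ordinary least squares. The numbers are invented, and `fit_line` is a hypothetical helper name; real analyses usually involve several predictors and diagnostic checks that are omitted here.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: x is the predictor, y is the outcome we want to explain
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
intercept, slope = fit_line(x, y)

# Use the fitted line to predict y for a new value of x
prediction = intercept + slope * 6
print(f"y = {intercept:.2f} + {slope:.2f} * x; predicted y at x=6: {prediction:.2f}")
```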

Cluster analysis is often employed by marketing researchers to identify groups (clusters) of consumers who are more similar to one another than they are to other consumers. It is one of the primary tools in segmentation. There are also methods for studying data collected across time, and Time Series Analysis: A Primer and Multilevel, Longitudinal and Growth Modeling provide overviews of these approaches. Time-to-event modeling is known by many names and has many uses, analysis of customer churn being one. Causal analysis is a particular interest of mine and Structural Equation Modeling is one technique frequently used in this line of research.
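
As an illustration of the clustering idea, here is a bare-bones k-means sketch. K-means is only one of many clustering methods and is chosen here as an assumption for simplicity; the consumer data, the starting centroids and the `kmeans` helper are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Very small k-means: assign each point to its nearest centroid, then update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Index of the closest centroid by Euclidean distance
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up consumers described by two scores, e.g. price sensitivity and brand loyalty
consumers = [(1.0, 2.0), (1.5, 1.8), (1.0, 1.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(consumers, centroids=[consumers[0], consumers[3]])
for center, members in zip(centroids, clusters):
    print(f"segment centered near ({center[0]:.2f}, {center[1]:.2f}): {len(members)} consumers")
```

Real segmentation work involves choosing the number of clusters, scaling variables and checking stability, none of which this toy example attempts.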


Bayesian statistics is increasingly popular – this interview with Andrew Gelman, an authority on Bayesian methods, is a brief introduction to the topic. My examples have all cited “traditional” statistical methods. Most of these can be performed with Bayesian approaches too.

Machine learning is a vaguely used term that often refers to (mostly) newer approaches developed by computer scientists and academics in various disciplines. These approaches are also in the toolboxes of many statisticians nowadays. Personally, I am indifferent to the origins of a method provided it’s suited to the problem I’m working on. Students now learning statistics in university programs study these methods as well as Bayesian statistics and the more familiar frequentist statistics. There is a lot to learn if you want to become a statistician these days!

An Analytics Toolbox describes methods popular among statisticians and data scientists in more detail than I have here. If, like most people, you’ve had little or no formal education in this exotic discipline, Statistics in Plain English (Urdan) provides a good introduction. I should stress that new developments in the field are occurring continually and that the pace of innovation is increasing. Statistics was never very cut and dried and is decreasingly so. It is becoming harder to automate, not easier, as some might believe.

I should mention that many statisticians feel statistics is poorly understood and often misused or misrepresented by non-statisticians, especially in data science. Given the relative obscurity of the discipline even today this is understandable, and Myths and Misconceptions about Statistics is my take on this debate.

In the interview with Professor Harrell linked earlier, he refers to statistical thinking, which is at least as important as the mathematics and associated computer software. In Statistical Thinking and the Art of Lawnmower Maintenance and What to Look for in a Statistician I offer some thoughts about that topic. Hint: All statisticians are not the same.

The foregoing has only been a snapshot of a very big topic, but I hope you’ve found it useful!


