Statistical Modeling 101

Modeling 101

Mom used to joke that her son was a model. Why? Because I do modeling. Obviously, I’ve never done that sort of modeling, but it may not be obvious to everyone what I mean by modeling. Let me give you an illustration with a simple example of supervised learning.

Supervised learning refers to statistical models with one or more dependent variables which we try to understand or predict using one or more independent variables. Note that outcome and predictor are now often used in place of dependent and independent variable, and I will use these terms interchangeably in this article.

Regression is probably the most popular type of supervised learning. There are many others, including methods ambiguously called machine learning. (Regression itself is considered machine learning by many data scientists.)

Say, for instance, we have purchase data for a certain product category. We’d like to use these data to better understand why some people purchase this type of product more (or less) often than others do. Based on what we’ve learned from previous research, agegenderhousehold income and household size are our candidate predictors. Remember, this is only a hypothetical example and in a real setting there probably would be many more predictors we would consider.

Our basic regression model could be represented in English as follows:

Purchase Frequency = intercept + Age + Gender + Household Income + Household Size + error

Intercept is the value of purchase frequency when the predictors are all zero. Error denotes deviations of our model’s predicted purchase frequency from actual purchase frequency. It’s what the model is unable to explain and is a very important part of the equation.

Even uncomplicated regressions such as this hypothetical one aren’t simply a case of clicking and dragging variable names into boxes. We have many decisions to make, including those when we’re setting up and cleaning our data. Regression comes in many flavors and the nature of the dependent variable determines which type of regression we should use. This is a critical decision.

For example, purchase frequency might have been obtained from a consumer survey with the answer categories every day, 2-3 times a week, once a week, 2-3 times a monthonce a month, and less than once a month. This is an ordinal scale and our usual regression, which assumes interval data (equally-spaced categories) or continuous data, would be inappropriate. Some form of ordinal regression, and there are many, would be a better choice.

On the other hand, we might have a record of the number of times each consumer had bought the product over a three-month period, for example. Here, Poisson regression or another kind of regression designed for data of this type would be most suitable. There are more than twenty kinds of regressions for count data I know of.


There are other decisions that pertain to the right-hand side of the equation. We might need to recode gender as Female = 1, Male = 0, for example. This is usually called dummy coding, though other terms are also used. Likewise, we might collapse household income and household size into fewer categories and recode them as dummy variables – that is just one possibility and not a wise choice in every situation.

Age might need to be transformed in some way – it’s common to center variables such as age on the data mean or some value that makes intuitive sense. For example, if the mean age of the consumers in our data is 40, a 50 year-old would be recoded as 10 and a 30 year-old as -10. This way, the intercept would be meaningful since none of our consumers would be zero years of age. The relationship between age and purchase frequency could be curvilinear, in which case some variety of spine or polynomial regression might be used. There are many options.

There may also be interactions, moderated effects in which, for example, the association between age and purchase frequency depends on gender.

Multilevel, mixture or longitudinal modeling might also be considered. These are advanced topics but, in our example, multilevel modeling could be used to account for geographic patterns in our data. Mixture modeling is quite complex, but one type essentially blends regression and cluster analysis. We might find, for instance, that separate regression equations are needed for different segments of consumers. Longitudinal analysis would be used if we’d like to study how the behavior of consumers changes over time, perhaps in response to marketing activity.

Bayesian methods are becoming mainstream and often are alternatives to traditional maximum likelihood estimation. The pros and cons of the two approaches is a big topic and one on which authorities frequently clash, but this is another choice we must make.

Our final model might look quite different from the one shown earlier. We might conclude, for example, that some independent variables aren’t needed and can be dropped from the equation. How to decide what is the “best” model is a large topic and involves the use of fit indices and information criteria as well as other model diagnostics.

Cross-validation, in which part of the data is held out from the model building, may also be advisable. Most importantly, unless we’re only concerned with making predictions, our final model should be interpretable and make sense to those who will use our results.

Building “simple” models with just a few variables can become quite complicated, and not only what we do but the order in which we do it is important. Automated modeling will usually provide us with models which fit the data reasonably well but, in my experience, they are often hard to interpret and, therefore, useless beyond prediction. Models we cannot understand, even if they predict well, do not inform us about the Why.

Results will also vary depending on the automated procedure used – just as no two statisticians are exactly alike, no two automated procedures are exactly alike. I suspect it will take Artificial General Intelligence – true AI – before we can confidently turn over most statistical modeling to machines.

I haven’t mentioned unsupervised learning, factor and cluster analysis being two examples. There is no distinction between dependent and independent variables in unsupervised learning and no dependent variables to “supervise” the modeling. Generally speaking, more subjective judgment is required in unsupervised learning.

In the past couple of decades, innovation in statistics and machine learning has been increasing at a rapid pace and we are now able to do things unimaginable when I began my career. Progress has downsides too, though, and there is now much more to learn in the same space of time and many more decisions for modelers to make. The probability of serious errors has probably risen as a result.

This has just been a snapshot of statistical modeling, but I hope you’ve found it interesting and helpful.

Arrange a Conversation 


Article by channel:

Read more articles tagged: Analytics, Featured

Data & Analytics