Statistics Is Easy?

With today’s software, statistics is easy, right? Wrong. Since before the start of Data Mania, circa 2010, vendors have been suggesting that if we buy their easy-to-use statistical software, we don’t really need to know what we’re doing.

Since then, hogwash about automated machine learning and “AI” has populated the blogosphere in great quantity. What should populate the blogosphere instead are the true horror stories about costly errors people with little background in statistics are making with this easy-to-use software.

What about MOOCs and Master’s programs in Data Science and Analytics? Over time, the good ones will help, but typically they cover a wide range of subjects superficially. To illustrate my point, let’s have a look at K-means cluster analysis. Just in case your memory needs a quick refresher, cluster analysis is used to group objects (e.g., consumers) in such a way that they are more similar to each other than to objects in other groups (clusters). In marketing, clustering is often used for segmentation and anomaly detection, among other purposes.

Many in the Data Science community consider K-means a form of machine learning. Despite its age (about 60), it shows no signs of arthritis and is still widely used by Statisticians and Data Scientists. K-means is one of dozens of clustering methods but is popular because it is simple and fast and works well in a variety of circumstances. It is not what I use most often, but I do use it quite a bit. If you’d like to study cluster analysis in depth, several textbooks have been written about it, and two I can recommend are Cluster Analysis (Everitt et al.) and Data Clustering (Aggarwal and Reddy).
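To make the discussion concrete, here is a minimal sketch of running vanilla K-means with scikit-learn. The data are synthetic and purely illustrative; real segmentation work would start from cleaned, well-understood variables.

```python
# Minimal K-means sketch; the two "segments" below are artificial.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial groups of consumers in a 2-D feature space.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(50, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_)  # one centroid per cluster
print(km.labels_[:5])       # cluster assignments of the first five rows
```

Even here, note how many decisions were already made for us by defaults: the number of clusters, the distance measure, and the seeding method.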

I’ve been referring to K-means as if it comes in just one flavor. This is because I would like to demonstrate that even a “simple” procedure such as K-means is not that simple. There actually are other K-means flavors, for example, K-medians, K-medoids and Fuzzy C-Means, as well as other very different types of clustering methods. For the sake of illustration, though, let’s say you have decided to use vanilla flavor K-means. It’s easy – all you need to do is import the data and press start, right?

Nope. Again, for the sake of simplicity, let’s imagine you have the data you need, you’ve explored it and cleaned it, and you understand what the variables mean and how they logically (or at least intuitively) should interrelate. You need to decide how many clusters to examine. If you knew a priori how many there were, you would probably have little need for clustering in the first place.

I should digress slightly here and point out that, with cluster analysis, we are really partitioning multidimensional space in a way we find most meaningful. We are not actually “identifying” discrete groups. In the vast majority of settings, the clusters, or segments, are not real. They are summaries of patterns in the data. Cluster analysis will reduce subjectivity, but I have never encountered a situation in which it has eliminated it. How theory fits into segmentation, as well as the much broader subject of statistical thinking, are discussed in other posts.

There are important decisions you’ll need to make, and you won’t be able to pass the buck to your computer. Regarding the number of clusters, you can ask your software to run a range of cluster solutions, say from three to eight clusters. There are numerous indices we can use to help us decide which is the right number, but they are rarely in perfect agreement. At the end of the day, this is a human decision. For example, statistically, seven segments might be “better” than six but, from a business standpoint, six may be a better choice. Seven segments might be difficult to interpret or include a cluster that is too small to have business meaning.
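As a sketch of how one such index works, the snippet below scores K-means solutions from three to eight clusters with the silhouette coefficient, one of the many indices alluded to above. The data are made up, and in practice you would weigh several indices plus business judgment, not just the top score.

```python
# Comparing cluster counts with the silhouette index (illustrative data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three well-separated synthetic groups.
data = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
                  for c in ([0, 0], [3, 0], [0, 3])])

scores = {}
for k in range(3, 9):  # "say from three to eight"
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(data)
    scores[k] = silhouette_score(data, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On messier real-world data, indices like this frequently disagree with one another, which is exactly why the final call is a human one.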

Even before this point in the analysis, there are several other fundamental decisions to make. I will note a couple of them. First, what distance measure to use? Euclidean is common but Minkowski, Pearson and Mahalanobis are a few of the many others that may be more suitable for your data and purpose. There are also various matching coefficients intended for binary data, and Gower for mixed data.
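To see why this choice matters, the snippet below computes three of the distances named above for the same pair of points; the covariance matrix for Mahalanobis is an assumption for illustration. Different measures can rank the same pairs of objects differently, which in turn changes which objects end up clustered together.

```python
# The same two points under three distance measures (scipy implementations).
import math
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

d_euclid = distance.euclidean(x, y)          # sqrt((1-3)^2 + (2-1)^2)
d_minkowski = distance.minkowski(x, y, p=1)  # p=1 is city-block distance
cov = np.array([[2.0, 0.0], [0.0, 0.5]])     # assumed covariance matrix
d_mahal = distance.mahalanobis(x, y, np.linalg.inv(cov))

print(d_euclid, d_minkowski, d_mahal)
```

Mahalanobis, for instance, down-weights differences along high-variance dimensions, which may or may not be what your segmentation needs.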

You’ll also need to decide how to select the initial seeds to get the clustering process started. One option, which I suggest you avoid, is to use the first k observations in your data file. More commonly, seeds are randomly selected with the stipulation that they are separated from each other by a minimum distance. There are numerous other ways, for instance, using another procedure such as Hierarchical cluster analysis to suggest seeds, employing Principal Components Analysis to make an initial split of the data, or partitioning the data into groups of roughly equal size and using the means of these groups as the starting seeds.
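The second option above, random seeds subject to a minimum separation, can be sketched in a few lines. The threshold, the data, and the helper name are all assumptions for illustration; production code would also guard against a threshold too large to satisfy.

```python
# Sketch of random seed selection with a minimum-distance stipulation.
import numpy as np

def separated_seeds(data, k, min_dist, rng):
    """Draw k rows of `data` as seeds, each at least `min_dist` apart."""
    seeds = [data[rng.integers(len(data))]]
    while len(seeds) < k:
        candidate = data[rng.integers(len(data))]
        if all(np.linalg.norm(candidate - s) >= min_dist for s in seeds):
            seeds.append(candidate)
    return np.array(seeds)

rng = np.random.default_rng(2)
data = rng.uniform(0, 10, size=(200, 2))
seeds = separated_seeds(data, k=4, min_dist=2.0, rng=rng)
print(seeds)
```

Because K-means can converge to different solutions from different starting points, the seeding rule is another decision the software will not make for you.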

There are many other choices, including how to handle correlated variables, how to deal with response styles in survey data, and whether to use cluster ensembles. But I am not trying to give you a tutorial on K-means clustering in this brief article. I’m only mentioning K-means to illustrate a simple point:

Even with “easy” statistics, small decisions can make big differences. Statistics ain’t easy.
