What is Cluster Analysis?

July 16, 2019

Practically anyone working in marketing research and data science has heard of cluster analysis, but there are many misunderstandings about what it is.

This is not surprising since cluster analysis originated outside the business world and is frequently applied in ways we may not be familiar with.

Cluster analysis is not a single method but an umbrella term for a very large family of methods, which includes familiar approaches such as K-means and hierarchical agglomerative clustering (HAC).

For those of you interested in a detailed look at cluster analysis, below are some excellent, if technical, books on or related to the subject:

Cluster Analysis (Everitt et al.)
Data Clustering (Aggarwal and Reddy)
Handbook of Cluster Analysis (Hennig et al.)
Applied Biclustering Methods (Kasim et al.)
Finite Mixture and Markov Switching Models (Frühwirth-Schnatter)
Latent Class and Latent Transition Analysis (Collins and Lanza)
Advances in Latent Class Analysis (Hancock et al.)
Market Segmentation (Wedel and Kamakura)

Another excellent and very recent book is Handbook of Mixture Analysis (Frühwirth-Schnatter et al.). The chapter on model-based clustering contributed by Bettina Grün of Johannes Kepler University in Austria provides a succinct and useful definition of cluster analysis, which I have reproduced below.

As Professor Grün stresses, cluster analysis is not something that can be done mechanically by the numbers, at least not competently.

Any copy/paste and editing errors are mine.

“Cluster analysis – also known as unsupervised learning – is used in multivariate statistics to uncover latent groups suspected in the data or to discover groups of homogeneous observations.

The aim is thus often defined as partitioning the data such that the groups are as dissimilar as possible and that the observations within the same group are as similar as possible. The groups forming the partition are also referred to as clusters.

Cluster analysis can be used for different purposes. It can be employed (1) as an exploratory tool to detect structure in multivariate data sets such that the results allow the data to be summarized and represented in a simplified and shortened form, (2) to perform vector quantization and compress the data using suitable prototypes and prototype assignments and (3) to reveal a latent group structure which corresponds to unobserved heterogeneity. A standard statistical textbook on cluster analysis is, for example, Everitt et al. (2011).

Clustering is often referred to as an ill-posed problem which aims to reveal interesting structures in the data or to derive a useful grouping of the observations. However, specifying what is interesting or useful in a formal way is challenging.

This complicates the specification of suitable criteria for selecting a clustering method or a final clustering solution. Hennig (2015) also emphasizes this point. He argues that the definition of the true clusters depends on the context and on the aim of clustering.

Thus there does not exist a unique clustering solution given the data, but different aims of clustering imply different solutions, and analysts should in general be aware of the ambiguity inherent in cluster analysis and thus be transparent about their clustering aims when presenting the solutions obtained.

At the core of cluster analysis is the definition of what a cluster is. This can be achieved by defining the characteristics of the clusters which should emerge as output from the analysis. Often these characteristics can only be informally defined and are not directly useful for selecting a suitable clustering method.

In addition, some notion of the total number of clusters suspected or the expected size of clusters might be needed to characterize the cluster problem. Furthermore, domain knowledge is important for deciding on a suitable solution, in the sense that the derived partition consists of interpretable clusters that have practical relevance.

However, domain experts are often only able to assess the suitability of a solution once they are confronted with a grouping but are unable to provide clear characteristics of the desired clustering beforehand.”

Source: Bettina Grün, Johannes Kepler University, in Handbook of Mixture Analysis, CRC Press.
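To make the partitioning idea in the quoted definition concrete, here is a minimal sketch of K-means, one of the many methods mentioned earlier, written in plain Python. It is an illustration under stated assumptions and not a recommended implementation: the toy data, the function names, and the farthest-first starting rule are all made up for this example, and the within-cluster sum of squares is only one of many possible criteria for judging a solution.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def farthest_first_init(points, k):
    """Deterministic starting centroids (an assumption made for this sketch):
    begin with the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def k_means(points, k, n_iter=100):
    """Lloyd-style K-means: alternate between assigning points to the nearest
    centroid and recomputing each centroid as the mean of its members."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    # Within-cluster sum of squares: smaller means tighter groups.
    wss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wss


# Hypothetical toy data: three visually obvious groups of three points each.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.8),
]

labels2, centroids2, wss2 = k_means(points, k=2)
labels3, centroids3, wss3 = k_means(points, k=3)

print("k=2 labels:", labels2, "WSS:", round(wss2, 2))
print("k=3 labels:", labels3, "WSS:", round(wss3, 2))
```

Both runs return a valid partition of the same data. The three-cluster solution has a smaller within-cluster sum of squares, but that number tends to keep shrinking as more clusters are added, so it cannot decide on its own which grouping is useful. That is the ambiguity the passage above describes: the choice depends on the aim of the analysis and on domain knowledge.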
