However, k-means cluster analysis (and cluster analysis in general) has some disadvantages.

First, we need to specify the number of clusters.

But we don't know the true number of clusters.

And figuring out the correct number of clusters, one that represents the true number of
clusters in the population, is pretty subjective.

On top of that, your results can change depending on the location of

the observations that are randomly chosen as initial centroids.

K-means cluster analysis is also not recommended if
you have a lot of categorical variables.
In that case, you need to use
a different clustering algorithm that can handle them better.

K-means clustering assumes that the underlying clusters in the population
are spherical, distinct, and of approximately equal size.
As a result, it tends to identify clusters with these characteristics.

It won't work as well if clusters are elongated or not equal in size.

There are a few steps you can take to help you feel more confident about

the reliability and validity of your clusters.

First, conduct the k-means cluster analysis using a range of values of k.

This helps, but doesn't completely solve the cluster instability problem

related to the selection of initial centroids.
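As a rough sketch of what running the analysis over a range of k looks like, here is a tiny pure-Python k-means (the data set, the k range, and all function names are made up for illustration; in practice you would use a statistics package). It records the within-cluster sum of squares for each k, which you can scan for an "elbow" where adding clusters stops helping much:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Tiny k-means: random initial centroids drawn from the data,
    then alternating assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids

def wss(points, centroids):
    """Within-cluster sum of squares (lower = tighter clusters)."""
    return sum(min(dist2(p, c) for c in centroids) for p in points)

# Made-up 2-D data with two well-separated groups.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + \
       [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(50)]

for k in range(1, 6):
    print(k, round(wss(data, kmeans(data, k)), 1))
```

Because the initial centroids here come from a fixed seed, rerunning with different seeds is one way to see the instability mentioned above for yourself.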

Splitting your data into training and test data sets
will allow you to run more than one sample through your algorithm, and
can help you determine whether the clusters you find are reliable.

If you get the same results in different samples, you can be more confident that
the clusters are capturing the underlying subgroups in your population.
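One simple way to quantify "the same results" is pairwise agreement between two cluster assignments of the same observations, known as the Rand index: for example, test-set labels produced by a model trained on the training data versus labels from a model fit on the test data itself. A minimal sketch, with hypothetical label vectors (note that it doesn't matter if the cluster ID numbers differ between the two runs):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two clusterings agree:
    both put the pair in the same cluster, or both split it up."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)

# Hypothetical assignments of 8 test observations by two models.
labels_from_train = [0, 0, 0, 1, 1, 1, 2, 2]
labels_from_test  = [1, 1, 1, 0, 0, 2, 2, 2]  # IDs relabelled; that's fine

print(rand_index(labels_from_train, labels_from_test))  # 24/28, about 0.857
```

A value near 1 means the two samples produced essentially the same grouping; values well below 1 suggest the clusters are unstable.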

In addition,

validating the clusters by determining whether they are interpretable, and

whether they differ from each other on other variables not used in the cluster

analysis, can increase your confidence in the cluster solution that you choose.
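As a sketch of that external-validation idea (the cluster labels and the income variable below are entirely hypothetical), you might compare cluster means on a variable that was not used in the cluster analysis; clearly different means suggest the clusters reflect real subgroups:

```python
from statistics import mean

# Hypothetical cluster assignments and a variable that was NOT
# used in the cluster analysis (say, annual income in $1000s).
labels = [0, 0, 0, 1, 1, 1, 1, 2, 2]
income = [41, 38, 44, 72, 69, 75, 71, 55, 58]

# Group the external variable by cluster label.
by_cluster = {}
for lab, inc in zip(labels, income):
    by_cluster.setdefault(lab, []).append(inc)

for lab in sorted(by_cluster):
    print(lab, round(mean(by_cluster[lab]), 1))
```

In a real analysis you would follow this up with a formal test of the group differences (for example, an analysis of variance) rather than eyeballing the means.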

In this course, we've just scratched the surface of cluster analysis.

There are many different methods, distance algorithms, and

approaches to choosing initial centroids, and the number of clusters to retain.

Some of these methods may be better suited to the data you have, or

to the shapes and sizes of the clusters you think might exist in your population.

K-means cluster analysis is a good starting point

because its simplicity makes it easier to convey the concepts.

Hopefully you will have learned enough to feel confident about exploring other

methods.