
By now, you should have an understanding of what cluster analysis is.

And also a good grasp of the types of business questions that this

technique is able to answer.

Before we review the methods for

performing the analysis, we need to discuss data preparation.

And we also need to discuss how clustering methods

measure the similarity between observations.

While datasets can contain a wide variety of data types,

for the purpose of this discussion we will focus on the two most common data types,

numerical and categorical.

Numerical variables include quantities that may be continuous, such as time,

or integer, such as the number of purchases or the number of dependents.

Categorical variables may be ordinal or nominal.

An ordinal variable implies some sort of ranking.

For instance, a customer satisfaction rating stated as high,

medium, or low implies an order.

Therefore, a natural transformation to a numerical variable is to make

high equal to 3, medium equal to 2, and low equal to 1.

Note that this transformation implies that the difference between high and

medium is the same as the difference between medium and low.
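
The ordinal transformation described above can be sketched as a simple lookup; the sample ratings below are illustrative, not from the lecture's data.

```python
# Transform an ordinal satisfaction rating (high/medium/low) into
# numbers, preserving the implied order: high=3, medium=2, low=1.
rating_to_number = {"low": 1, "medium": 2, "high": 3}

ratings = ["high", "low", "medium", "high"]  # illustrative values
numeric_ratings = [rating_to_number[r] for r in ratings]
print(numeric_ratings)  # [3, 1, 2, 3]
```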

Nominal variables, on the other hand, can be thought of as representing choices.

Political party affiliations are nominal data.

In the United States for example, this nominal data would indicate Democrat,

Republican, or Independent voters.

These choices do not imply any particular order, and

therefore they cannot be transformed into a single numerical variable.

The transformation requires binary variables.

A binary variable has two possible values, 0 and 1.

The number of binary variables needed for

the transformation is equal to the number of categories minus 1.

Note that in the transformation for the political affiliation,

we use two binary variables, because there are three categories.

A Democrat is transformed into a value of 1 for variable 1, and a value of 0 for

variable 2.

A Republican is transformed into a value of 0 for

variable 1, and a value of 1 for variable 2.

And then Independent has a value of 0 for both variables.

A special case of nominal data occurs when there are only two categories.

For instance, yes or no options.

Yes is typically given the value of 1, and no is given the value of 0.
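
The binary-variable scheme just described can be sketched as follows, using the lecture's encoding of Democrat as (1, 0), Republican as (0, 1), and Independent as (0, 0).

```python
# Encode a three-category nominal variable with two binary variables
# (number of categories minus one), as described in the lecture.
def encode_party(party):
    return (1 if party == "Democrat" else 0,
            1 if party == "Republican" else 0)

voters = ["Democrat", "Independent", "Republican"]  # illustrative values
encoded = [encode_party(v) for v in voters]
print(encoded)  # [(1, 0), (0, 0), (0, 1)]

# The two-category special case needs only a single binary variable:
yes_no = {"yes": 1, "no": 0}
```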

Most software packages for data analysis include tools that can transform

categorical data that is given in the form of text to numerical variables.

Although software can perform these transformations automatically,

it is always good to verify how categorical variables are being

transformed, to avoid problems such as the software treating nominal data as ordinal.

Datasets may contain variables with values that are on very different scales.

Therefore, it is recommended to perform data analysis, such as clustering,

on normalized (also called standardized) data instead of the original data.

Normalization takes care of differences in scale by transforming each original value

to its standard value.

The operation consists of subtracting the mean and

dividing by the standard deviation.

In this example, we have age and income data for five people.

The average age of the sample is 42.20 years and the average income is 105,000.

The last two columns of the table show the normalized values, for

instance, the normalized age of Ann is -0.4948,

which is the result of subtracting the average age of the group,

which is 42.20 years, from Ann's age of 35 years,

and then dividing by the standard deviation of 14.55.

The normalized value means that Ann's

age is 0.4948 standard deviations below the mean.

The normalized values for age and income are now on the same scale.

That is with an average of 0 and a standard deviation of 1.
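
The normalization of Ann's age can be reproduced directly from the statistics quoted in the lecture (mean 42.20 years, standard deviation 14.55 years).

```python
# Normalize (standardize) Ann's age: subtract the mean, then divide
# by the standard deviation, using the values given in the lecture.
ann_age = 35.0
mean_age = 42.20
std_age = 14.55

z_ann = (ann_age - mean_age) / std_age
print(round(z_ann, 4))  # -0.4948, i.e. 0.4948 standard deviations below the mean
```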

The normalized values allow us to identify the outliers in our dataset.

And they eliminate biases from variables with relatively large original values.

Also, normalized values enable an easier interpretation of cluster

analysis results.

Since the mean of a normalized variable is zero, we are able to

easily detect values that are above the mean and those that are below the mean.

We're also able to know how far a value is from

the mean in terms of standard deviations.

A normalized value of 1.7, for instance, means

that the value is 1.7 standard deviations above the mean.

Using the normalized age and income values in our previous example,

we can compute the distances from each pair of persons in our set.

For instance, the distance between David and Ann, or

the distance between David and Clara.

Then a scatter plot can be used to create a graphical representation

of the distance between each pair of observations.

The plot shows that in terms of age and income,

David is at least three times closer or more similar to Ann, than he is to Clara.

This makes sense, since David is both closer in age and

income to Ann, than he is to Clara.
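
The pairwise distances above can be sketched with the straight-line (Euclidean) distance on the normalized values; the coordinates below are hypothetical, not the actual values from the lecture's table.

```python
import math

# Euclidean distance between two observations, each given as a
# tuple of normalized (age, income) values.
def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical normalized (age, income) points for illustration only.
david = (-0.1, 0.2)
ann = (-0.5, 0.1)
clara = (1.6, 0.2)

print(distance(david, ann) < distance(david, clara))  # True
```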

Now that we have a way to measure distances between observations,

we need to establish a measure of distance between clusters.

There are five distance measures between clusters.

And they are single linkage, complete linkage,

average linkage, average group linkage, and Ward's method.

In single linkage, the distance between two clusters is

the minimum distance over all pairs of objects that are not in the same cluster.

Complete linkage uses the maximum distance between objects that are not

in the same cluster.

Average linkage calculates the average of all distances across the two clusters.

Average group linkage is the distance from the center of one cluster

to the center of the other.

Ward's method uses a sum of squares criterion.

The sum of squares refers to the squared distance from each observation

to the centroid of the cluster to which it is assigned.
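
The first three linkage measures above can be sketched for two small clusters; the points are illustrative.

```python
import math

# Single, complete, and average linkage between two clusters of
# normalized observations (illustrative points).
def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

# All distances between pairs that are not in the same cluster.
pair_dists = [dist(a, b) for a in cluster_a for b in cluster_b]

single = min(pair_dists)                     # closest cross-cluster pair
complete = max(pair_dists)                   # farthest cross-cluster pair
average = sum(pair_dists) / len(pair_dists)  # mean of all cross-cluster pairs

print(round(single, 3), round(complete, 3), round(average, 3))  # 3.0 4.123 3.571
```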

Let's go through the calculations using the data in our simple example.

Suppose that Clara and Erin form a cluster,

the centroid of this cluster is the average normalized values.

So in our example we have 0.9828 and 0.0906 as the average of the normalized

age and the average of the normalized income for the members of the cluster.

We then calculate the squared distance from each cluster member to the centroid.

The squared distance is calculated by adding the squared differences between

each variable value and the corresponding centroid value.

For example, Clara has a normalized age of 1.567, and the centroid's normalized age is 0.9828.

So we square the difference between these two numbers,

and we do the same for the normalized income.

The result is a squared distance of 0.3495.

Then we go through the same calculations for Erin.

Because the cluster has only two members,

the centroid is exactly halfway between them.

The sum of the squares for this cluster is 0.699.
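
The sum-of-squares calculation for the Clara-Erin cluster can be reproduced as follows. Clara's normalized age (1.567) and the centroid (0.9828, 0.0906) are quoted in the lecture; the remaining coordinates below are back-calculated to be consistent with those quoted numbers, so treat them as reconstructed rather than taken from the original table.

```python
# Ward's-method sum of squares for a two-member cluster.
# Coordinates are normalized (age, income); income values are
# reconstructed to match the centroid quoted in the lecture.
clara = (1.567, 0.1812)
erin = (0.3986, 0.0)

# The centroid is the average of the members' normalized values.
centroid = tuple((c + e) / 2 for c, e in zip(clara, erin))

def squared_distance(point, center):
    return sum((x - c) ** 2 for x, c in zip(point, center))

sum_of_squares = (squared_distance(clara, centroid)
                  + squared_distance(erin, centroid))
print(round(sum_of_squares, 3))  # 0.699, as in the lecture
```

Because the cluster has only two members, the centroid is exactly halfway between them, so each member contributes the same squared distance (0.3495).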

The sum of squares for

our complete solution is the aggregate of the sums of squares for all the clusters.

Since each cluster method may generate a different outcome,

it is generally recommended to experiment and compare results.

In this slide, we show how single linkage does better than k-means and

Ward's method on these two-dimensional problems.

However, no single method will always outperform the others.

We have reviewed three concepts that are critical to performing

a valid cluster analysis.

First, data should be in the correct form

by taking into consideration what each variable represents.

Second, a proper metric should be established

to be able to measure the distance between every pair of observations.

And third, we must decide how distance between clusters is going to be measured.

We are now ready to review how the most common clustering methods operate.