By now, you should have an understanding of what cluster analysis is,
and a good grasp of the types of business questions that this
technique is able to answer.
Before we review the methods for
performing the analysis, we need to discuss data preparation.
And we also need to discuss how clustering methods
measure the similarity between observations.
Datasets can contain a wide variety of data types, but
for the purpose of this discussion we will focus on the two most common data types,
numerical and categorical.
Numerical variables include quantities that may be continuous, such as time,
or integer, such as number of purchases or number of dependents.
Categorical variables may be ordinal or nominal.
An ordinal variable implies some sort of ranking.
For instance, a customer satisfaction rating stated as high,
medium, or low implies an order.
Therefore, one way to transform it into a numerical variable is to make
high equal to 3, medium equal to 2, and low equal to 1.
Note that these transformations imply that the difference between high and
medium is the same as the difference between medium and low.
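As a minimal sketch of this ordinal transformation, the Python snippet below maps the three satisfaction levels to 3, 2, and 1. The dictionary name, the function name, and the sample ratings are hypothetical and chosen only for illustration.

```python
# Hypothetical mapping for the ordinal satisfaction rating described above.
SATISFACTION_SCORE = {"high": 3, "medium": 2, "low": 1}


def encode_satisfaction(ratings):
    """Convert text ratings to their numeric ranks (case-insensitive)."""
    return [SATISFACTION_SCORE[r.strip().lower()] for r in ratings]


# Illustrative input, not data from the lecture.
sample_ratings = ["High", "low", "medium", "high"]
encoded = encode_satisfaction(sample_ratings)
print(encoded)  # [3, 1, 2, 3]
```

Because the codes are equally spaced, any distance computed on them treats the gap between high and medium as equal to the gap between medium and low.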
Nominal variables, on the other hand, can be thought of as representing choices.
Political party affiliations are nominal data.
In the United States for example, this nominal data would indicate Democrat,
Republican, or Independent voters.
These choices do not imply any particular order, and
therefore they cannot be transformed into a single numerical variable.
The transformation requires binary variables.
A binary variable has two possible values, 0 and 1.
The number of binary variables needed for
the transformation is equal to the number of categories minus 1.
Note that in the transformation for the political affiliation,
we use two binary variables, because there are three categories.
A Democrat is transformed into a value of 1 for variable 1, and a value of 0 for
variable 2.
A Republican is transformed into a value of 0 for
variable 1, and a value of 1 for variable 2.
And then Independent has a value of 0 for both variables.
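The encoding just described can be sketched in Python as follows. This is only an illustration: the function name and the choice to treat the last category in the list as the all-zeros baseline are assumptions made here to match the Democrat, Republican, Independent example.

```python
# Hypothetical category list; the last category (Independent) is the
# baseline and is represented by 0 in every binary variable.
CATEGORIES = ["Democrat", "Republican", "Independent"]


def encode_nominal(value, categories=CATEGORIES):
    """Return k-1 binary variables for a nominal value with k categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # One binary variable per category except the last (baseline) one.
    return [1 if value == c else 0 for c in categories[:-1]]


for party in CATEGORIES:
    print(party, encode_nominal(party))
# Democrat [1, 0]
# Republican [0, 1]
# Independent [0, 0]
```

Three categories produce two binary variables, which agrees with the rule of number of categories minus 1.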
A special case of nominal data occurs when there are only two categories.
For instance, yes or no options.
Yes is typically given the value of 1, and no is given the value of 0.
Most software packages for data analysis include tools that can transform
categorical data that is given in the form of text to numerical variables.
Although software can perform these transformations automatically,
it is always good to verify how categorical variables are being
transformed, to avoid problems such as the software treating nominal data as ordinal.
Datasets may contain variables with values that are on very different scales.
Therefore, it is recommended to perform data analysis, such as clustering,
on normalized (also called standardized) data instead of the original data.
Normalization takes care of differences in scale by transforming each original value
to its standard value.
The operation consists of subtracting the mean and
dividing by the standard deviation.
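Written as a formula, the operation looks like this. The symbols are notation introduced here rather than taken from the lecture: $x$ is an original value, $\bar{x}$ is the mean of the variable, $s$ is its standard deviation, and $z$ is the normalized value.

```latex
z = \frac{x - \bar{x}}{s}
```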
In this example, we have age and income data for five people.
The average age of the sample is 42.20 years and the average income is 105,000.
The last two columns of the table show the normalized values. For
instance, the normalized age of Ann is -0.4948,
which is the result of subtracting the average age of the group,
which is 42.20 years, from Ann's age, which is 35 years,
and then dividing by the standard deviation of 14.55.
The normalized value means that Ann's
age is 0.4948 standard deviations below the mean.
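The arithmetic for Ann's age can be checked with a few lines of Python. The numbers are the ones quoted above, and the variable names are made up for this sketch.

```python
# Values quoted in the lecture for Ann's age.
ann_age = 35.0
mean_age = 42.20
std_age = 14.55

# Subtract the mean, then divide by the standard deviation.
ann_normalized_age = (ann_age - mean_age) / std_age
print(round(ann_normalized_age, 4))  # -0.4948
```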
The normalized values for age and income are now on the same scale,
that is, with an average of 0 and a standard deviation of 1.
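A small sketch can confirm this property. The ages and incomes below are hypothetical values, not the table from the lecture, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the lecture does not say which version it uses.

```python
import statistics


def normalize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical ages and incomes, not the table from the lecture.
ages = [25, 35, 40, 50, 60]
incomes = [40_000, 85_000, 90_000, 120_000, 200_000]

z_ages = normalize(ages)
z_incomes = normalize(incomes)

for name, z in (("age", z_ages), ("income", z_incomes)):
    # Both columns end up with mean 0 and standard deviation 1,
    # up to floating-point rounding error.
    print(name, statistics.mean(z), statistics.stdev(z))
```

Although the raw incomes are thousands of times larger than the raw ages, the two normalized columns have the same mean and spread.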
The normalized values allow us to identify the outliers in our dataset.
And they eliminate biases from variables with relatively large original values.
Also, normalized values enable an easier interpretation of cluster
analysis results.
Since the mean of a normalized variable is zero, we are able to
easily detect values that are above the mean and those that are below the mean.
We're also able to know how far a value is from
the mean in terms of standard deviations.
A normalized value for instance of 1.7 means