So, let's run a k-means cluster analysis in SAS. Following my LIBNAME statement and the data step we use to call in the data set, we can delete the observations with missing data on the clustering variables. We'll first create a data set that includes only my clustering variables and the GPA variable, which I'm going to use to externally validate the clusters. Then we will assign each observation a unique identifier, so that we can merge the cluster assignment variable back with the main data set later on. We will use the CMISS function to tell SAS to delete observations with missing data, and "of _all_" in parentheses to tell SAS to do this for every variable in the data set. I will also turn on ODS graphics with the statement ODS GRAPHICS ON. ODS stands for Output Delivery System, which manages output and displays such as those in HTML. SAS will not print any plots if ODS graphics is not turned on.

Then I will use the SURVEYSELECT procedure to randomly split my data set into a training data set consisting of 70% of the total observations and a test data set consisting of the other 30% of the observations. DATA= specifies the name of my input data set, called clust, and OUT= the name of the randomly split output data set, which I will call traintest. We also include the SEED option, which allows us to specify a random number seed to ensure that the data are randomly split the same way if I run the code again. The SAMPRATE option tells SAS to split the input data set so that 70% of the observations are designated as training observations and the remaining 30% are designated as test observations. METHOD=SRS specifies that the data are to be split using simple random sampling, and the OUTALL option tells SAS to include both the training and test observations in a single output data set that has a new variable called Selected.
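The data management and splitting steps just described might look something like this in SAS. This is a minimal sketch: the library path, the input data set name (schooldata), and the clustering variable names x1 through x4 and gpa are placeholders, not the actual variables from the course data.

```sas
* call in the data, keeping only the clustering variables plus GPA;
* (mylib, schooldata, x1-x4, and gpa are assumed names);
libname mylib "c:\mydata";

data clust;
    set mylib.schooldata (keep=x1 x2 x3 x4 gpa);
    * unique identifier so cluster assignments can be merged back later;
    idvar = _n_;
    * delete observations with missing data on any variable;
    if cmiss(of _all_) then delete;
run;

* turn on ODS graphics so SAS will print plots;
ods graphics on;

* randomly split into 70% training and 30% test observations;
* OUTALL keeps both groups in one data set with a Selected flag;
proc surveyselect data=clust out=traintest seed=123
     samprate=0.7 method=srs outall;
run;
```

Note that idvar is assigned before the CMISS check so that it is never counted as a missing value.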
The Selected variable indicates whether an observation belongs to the training data set or the test data set: training set observations are coded 1 on the Selected variable and test observations are coded 0. In cluster analysis, variables with large values contribute more to the distance calculations, so variables measured on different scales should be standardized prior to clustering so that the solution is not driven by variables measured on larger scales. We use the following code to standardize the clustering variables to have a mean of zero and a standard deviation of one. We use the PROC STANDARD procedure. DATA= is where we provide the name of our clus_train training data set, with the unstandardized clustering variables. OUT=clustvar produces a data set called clustvar that includes the standardized clustering variables. MEAN=0 and STD=1 tell SAS to standardize the clustering variables to have a mean of zero and a standard deviation of one. Then we list the clustering variables that we want to be standardized.

Next, because we don't know how many clusters actually exist in the population, we will run a series of cluster analyses for a range of values for the number of clusters. Rather than run new SAS code for each value of k, there's a macro called kmean that we can use to automate the process. The %MACRO statement here indicates that the code is part of a SAS macro. The name of the macro is kmean, and the K in parentheses indicates that the macro will run the procedure code for a number of different values of k, which we specify later. We will use the FASTCLUS procedure to conduct the k-means cluster analysis. The FASTCLUS procedure uses the standardized training data, DATA=clustvar, as input. OUT=outdata&K. creates an output data set called outdata for a range of values of k. This data set contains a variable for the cluster assignment for each observation,
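A sketch of the standardization step and the kmean macro described above, with the same placeholder variable names x1 through x4 carried over (clus_train is assumed to be the training subset of the split data):

```sas
* standardize the clustering variables to mean 0, sd 1;
proc standard data=clus_train out=clustvar mean=0 std=1;
    var x1 x2 x3 x4;
run;

* macro that runs PROC FASTCLUS for a given number of clusters K;
* &K. is resolved into the output data set names (outdata1, outdata2, ...);
%macro kmean(K);
    proc fastclus data=clustvar out=outdata&K. outstat=cluststat&K.
         maxclusters=&K. maxiter=300;
        var x1 x2 x3 x4;
    run;
%mend;

* run the cluster analysis for k = 1 to 9 clusters;
%kmean(1);
%kmean(2);
%kmean(3);
%kmean(4);
%kmean(5);
%kmean(6);
%kmean(7);
%kmean(8);
%kmean(9);
```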
and the distance of each observation from the cluster centroid. The &K. tells SAS to add a numeric value to the name of the data set. So, for example, the output data set for k=1 cluster will be called outdata1, the output data set for k=2 clusters will be called outdata2, and so on. OUTSTAT=cluststat&K. creates an output data set for the cluster analysis statistics for a range of values of k. MAXCLUSTERS=&K. tells SAS to run the cluster analysis specifying the maximum number of clusters for a range of values of k, and MAXITER=300 asks that up to 300 iterations be used to find the cluster solution. Then we list the standardized clustering variables, then type RUN to run the code. %MEND tells SAS to stop running the macro. Following that, we ask SAS to print the output and create the output data sets for k=1 to 9 clusters. We do this by typing %kmean, which is the name of the macro, and, in parentheses, the value of k. We do this for 1 to 9 clusters.

We can then create an elbow plot by plotting the r-square values for each of the k=1 to 9 cluster solutions, to help us determine how many clusters to retain and interpret. To do this, though, we first have to extract the r-square value from the output for each of the 1 to 9 cluster solutions and merge them together using the following code. DATA clus1 tells SAS to create a data set called clus1. SET cluststat1 tells SAS to use the cluster analysis statistics data set for k=1 to create this data set. We are then going to create a variable called nclust, which will be the variable that identifies the value of k for the r-square, so we will set nclust=1. Then we select the r-square statistic by subsetting on the _TYPE_ variable equal to 'RSQ', using quotes because it is a string variable. Finally, we'll keep the nclust variable and the variable labeled over_all,
which is the variable that contains the actual r-square value. Then we do the same for k=2 through 9. We'll then create one data set, called clusrsquare, that contains the r-square values for all nine cluster solutions by adding together these nine r-square data sets. DATA clusrsquare is the name of our new data set. We type SET and list the nine data sets that we want to add together, then type RUN to run the code.

Now we can plot the elbow curve using the clusrsquare data set with the GPLOT procedure. The first line of code provides some display parameters for the plot: COLOR=blue tells SAS to plot the r-square values in blue, and INTERPOL=join tells SAS to connect each of the plotted r-square values with a line. Then we type PROC GPLOT and the name of the data set, which is clusrsquare, followed by a semicolon. In the next line of code, we type the PLOT command to plot the variable that has the r-square values, over_all, on the y-axis and the variable that has the number of clusters, nclust, on the x-axis, followed by a semicolon. Then we type RUN to generate the plot.

What this plot shows is the increase in the proportion of variance in the clustering variables explained by each of the cluster solutions. We start with the k=1 r-square, which is zero because there's no clustering yet. Then we can see that the two-cluster solution accounts for about 20% of the variance. The r-square value increases as more clusters are specified. What we're looking for here is a bend in the elbow that shows where the r-square value might be leveling off, such that adding more clusters doesn't increase the r-square as much. We can see how subjective this is, though. There appear to be a couple of bends in the line, at 2 clusters, 4 clusters, and again at 8 clusters. To help us figure out which solution is best, we should further examine the results for the 2, 4, and 8 cluster solutions to see whether the clusters overlap,
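Putting the extraction, stacking, and plotting steps together, the code might look like the following sketch. Here over_all is the variable in the FASTCLUS OUTSTAT data set that holds the overall r-square, and the cluststat1 through cluststat9 data sets come from the macro runs described earlier.

```sas
* extract the r-square value from the k=1 statistics data set;
data clus1;
    set cluststat1;
    nclust = 1;               * identifies the value of k for this r-square;
    if _type_ = 'RSQ';        * keep only the r-square row;
    keep nclust over_all;     * over_all contains the r-square value;
run;
* the same data step is repeated for clus2 through clus9, setting
  nclust = 2 through 9 and reading cluststat2 through cluststat9;

* stack the nine r-square data sets into one;
data clusrsquare;
    set clus1 clus2 clus3 clus4 clus5 clus6 clus7 clus8 clus9;
run;

* elbow plot: r-square (over_all, y-axis) by number of clusters (nclust, x-axis);
symbol1 color=blue interpol=join;
proc gplot data=clusrsquare;
    plot over_all*nclust;
run;
```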
or whether the patterns of means on the clustering variables are unique and meaningful, and whether there are significant differences between the clusters on our external validation variable, GPA.
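As one possible way to carry out that last check, we could merge the cluster assignments back using the unique identifier and test for mean differences in GPA across clusters. This is a hedged sketch only: outdata4 is assumed to be the FASTCLUS output for the 4-cluster solution (which carries a CLUSTER variable), and idvar and gpa are the placeholder identifier and validation variables from earlier.

```sas
* merge the 4-cluster assignments back with the main data by id;
proc sort data=outdata4; by idvar; run;
proc sort data=clust; by idvar; run;

data merged;
    merge outdata4 (keep=idvar cluster) clust;
    by idvar;
run;

* test whether mean GPA differs significantly across the clusters;
proc anova data=merged;
    class cluster;
    model gpa = cluster;
    means cluster / tukey;   * post hoc pairwise comparisons;
run;
```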