Clustering is an integral part of both the descriptive and predictive tasks of data mining. Based on similarity, it divides objects into different groups and subsets. In Python, there are many third-party libraries and dedicated toolkits for cluster analysis. Let's first enjoy some of the charm of cluster analysis in Python. There are a lot of clustering algorithms. Among them, the K-means algorithm is widely used for its simplicity and speed. We may think the K-means algorithm is simple, but it suffices for many tasks. Let's look at its basic procedure. The first step is to randomly select k objects as the initial cluster centers. Then assign each point to a cluster center, which is actually a matter of calculating distances; the sum of squared errors is usually adopted as the measure function. Next, recompute the center of each new cluster, and repeat until convergence. That is to say, once the centers no longer change, the clustering is complete. We want each cluster to be as compact as possible, and different clusters to be as separated as possible. Next, let's look at two examples of cluster analysis with the K-means algorithm. Many common toolkits contain K-means as a basic algorithm, like the "vq" module in SciPy's cluster package; it is a vector quantization package which includes the K-means algorithm. Our example here is like this. It is known that Dameng is a good learner. We'll look for other potential good learners based on scores. There are six students: Xiaoming, Daming, Xiaopeng, Dapeng, Xiaomeng and Dameng. They all take four courses: Advanced Math, English, Python and Music. Their scores are as follows. Now, let's use K-means to cluster these data with the following method. First, put the scores into a list, and then use the "array()" function in NumPy to turn them into an array.
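The procedure just described can be sketched in plain NumPy. This is a minimal illustration of the idea, not the library implementation; the two-blob test points at the bottom are made up for demonstration:

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=None):
    """A minimal sketch of the K-means procedure described above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k objects as the initial cluster centers.
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest center
        # (squared distance as the measure).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each cluster's center as the mean of its members.
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(k)])
        # Convergence: the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two obvious blobs; the sketch should separate them into two groups.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
centers, labels = kmeans_sketch(pts, 2, seed=0)
print(labels)
```

Note that, as the lecture points out later, the random initialization in step 1 means the result is only a local optimum and can vary between runs.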
Next, use the "whiten()" function to normalize each column of elements by its standard deviation, forming a new array. There are two core functions here. One is "kmeans()" and the other is "vq()". "kmeans()" performs the clustering. This part, as we see, is the data. Next, look at the argument "2" after it. What does it mean? Let's think about it. Since we're looking for good learners, doesn't that suggest two groups: good learners and the rest? So we choose 2, i.e. cluster into 2 groups. The return value of the "kmeans()" function is a tuple, of which we only use the first element: an array of cluster centers. We may write it like this, a comma followed by an underscore for the second element, meaning we don't need it. Then, we pass the result into the "vq()" function. It is a vector quantization function which assigns each data point, i.e. everybody here, to a cluster, and returns the result. Have a look. The result is like this. As we see, the group Dameng belongs to is represented by 0. Well, let's consider this: who else are good learners? This value and this value are also 0, which means Daming, Xiaopeng and Dameng are in the same group, i.e. good learners. Then, let's look at the detailed scores. It seems this group of scores shows the potential of good learners. Let's look at the other three students. As we see, since Xiaoming's English is poor, he hasn't become a good learner yet. And Xiaomeng has three courses with similar scores; he has to work harder. How about Dapeng? He seems to have the potential to be a good learner. It's worth noticing that the K-means algorithm does not find a globally optimal solution, only a locally optimal one. Thus, the clustering result is likely to vary. For example, when you run this program, you might find that Dapeng and Dameng are classified into the same group. Whichever the result might be, we can see that K-means is quite simple, but it really works.
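Putting the steps together, the example might look like the sketch below. The transcript does not reproduce the actual score table from the lecture, so the numbers here are placeholders chosen to match the described pattern (Xiaoming weak in English, Xiaomeng with three similar scores, Dameng strong overall):

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

# Placeholder scores (the lecture's real numbers are not shown in the text).
# Rows: Xiaoming, Daming, Xiaopeng, Dapeng, Xiaomeng, Dameng
# Columns: Advanced Math, English, Python, Music
data = np.array([
    [88, 45, 80, 75],   # Xiaoming: English is poor
    [90, 85, 92, 70],   # Daming
    [87, 80, 88, 72],   # Xiaopeng
    [85, 78, 86, 68],   # Dapeng
    [60, 62, 58, 90],   # Xiaomeng: three similar (low) course scores
    [92, 88, 95, 80],   # Dameng: the known good learner
])

np.random.seed(42)              # K-means starts from a random choice of centers
w = whiten(data)                # normalize each column by its std deviation
centers, _ = kmeans(w, 2)       # cluster into 2 groups; keep only the centers
labels, _ = vq(w, centers)      # assign each student to the nearest center
print(labels)                   # which of the two groups each student fell into
```

Because of the random initialization, the 0/1 group numbering (and sometimes the borderline students) can differ between runs, just as the lecture warns.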
If the data were of a bigger quantity, we might see a more interesting result. This is a small case of finding good learners. Next, let's see how we can perform the same task of finding good learners with the renowned machine learning toolkit Scikit-learn. Scikit-learn is an open-source machine learning module in Python, built on the previously mentioned NumPy and SciPy libraries and the Matplotlib module. It provides interfaces to various machine learning algorithms, and it's quite convenient for the user to call these interfaces simply and efficiently for all kinds of mining and analysis tasks. Now, let's see how to solve this problem with Scikit-learn. As we see, the first part is the same: generate an array. The next part mainly involves two methods. The first one is "fit()", and the other is "predict()". What is "fit()" for? As we see, it clusters the dataset once K-means is given the number of groups. By contrast, the effect of "predict()" is to determine, based on the clustering result, which group each sample belongs to. Finally, the output of the code is like this. As we see, just as we mentioned before, this time Dapeng and Dameng are classified into the same group, right? Both good learners. Clustering is an important method in machine learning and data mining. Apart from clustering, classification is also an essential method. However, classification is different from clustering. To put it simply, classification works like this. First, it divides a dataset into two parts: the first part is the training set and the second part is the test set. Learn a model from the training set, and then give a definite label to each unknown object in the test set. For example, let's take a random example. Suppose there are data on my work attendance for one year, and the class label is "at work" or "absent". Suppose the attributes include the weather, my mood, the day of the week, and whether I'm full.
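The Scikit-learn version of the good-learner example might be sketched as follows, reusing the same placeholder score table as before (the lecture's real numbers are not shown in the transcript):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder scores; rows: Xiaoming, Daming, Xiaopeng, Dapeng, Xiaomeng,
# Dameng; columns: Advanced Math, English, Python, Music.
data = np.array([
    [88, 45, 80, 75],
    [90, 85, 92, 70],
    [87, 80, 88, 72],
    [85, 78, 86, 68],
    [60, 62, 58, 90],
    [92, 88, 95, 80],
])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(data)                # learn 2 cluster centers from the dataset
labels = km.predict(data)   # assign each student to one of the 2 groups
print(labels)
```

The `random_state` argument pins down the random initialization so the run is reproducible; leave it out and, as with the SciPy version, the grouping of borderline students may vary.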
And suppose we get such a rule from training: I go to work as long as I'm full. Then, we apply this rule to the test set. Can it mark off the days of my absence? That's roughly the idea of classification. There's another simple instance. In this instance, the renowned support vector machine algorithm is used to classify data. Similarly, two methods are used, namely, the "fit()" method and the "predict()" method. The "fit()" method learns from the n-1 training subsets, and then the remaining one is predicted as the test set. So, it's still quite simple, isn't it? We only need to understand these methods and see how their arguments are set up, and then we are able to perform some basic classification tasks. Of course, if you're not a computer major, it's unnecessary to focus too much on the technical details of these algorithms. Let's look at a more practical example: based on the pattern of rises and drops in closing prices on consecutive days, cluster 10 Dow Jones Industrial Average stocks over the recent year. Suppose these are the 10 companies. Then, can we use the previously introduced function, like this, to acquire the data of these companies? Acquire their closing prices, and then how do we find the pattern of rises and drops in the data? You might still remember that, as we mentioned before, there's a "diff()" function in NumPy to perform this task. Therefore, here we use that same function to process our data, and then use the previously introduced "fit()" and "predict()" methods in Scikit-learn to cluster these data. Here, we cluster them into 3 groups. That's the clustering result we want. Looking at these results, can we think about why these companies show similar patterns? Is that related to political and economic factors? There is a lot we may explore here. In this section, we mainly used the simple K-means algorithm to show how to apply it to data clustering. I hope I've unveiled some of the mysteries of machine learning and data mining for you.
Are you very interested in it now?