Let's continue talking about unsupervised learning techniques.

Now, we'll focus on a different question,

can we find any clusters in the data?

I think it's the time to discuss what clusters are.

Formally, what we usually understand by clusters is a group of

points which are more similar to each other than to other groups.

Among many definitions of what this similarity might mean,

we'll consider the most useful one.

A group of points is considered to be a cluster,

if the distances among them are much

smaller than the typical distance of points outside of the clusters.

To understand what I mean, look at this picture.

On the left, you can see set of points that

doesn't show any particular clustering structure,

while on the right, set of points clearly consist of two clusters.

Of course, if we could always just take a look at the picture,

there would be no need for any techniques,

but unfortunately, if we are dealing with multidimensional data,

they are absolutely required.

As I've told you previously,

all the unsupervised learning tasks need some formalization.

To formalize a clustering task,

we should first define the metric.

In other words, we need to say,

how we are measuring distances between our points in space.

I suggest to stick with simple Euclidean measure.

For now, though, you of course,

can use all other kinds of metrics.

Another simplification that will do for now,

is supposed that we somehow know the number of clusters in the data.

The parameter k will stand for this number.

Let's associate each cluster with its central point,

Mu with an index i.

To calculate the coordinates of its central points,

we just need to average the coordinates of all the points in the cluster.

On the other hand, a set of central points defines all clusters by a simple rule.

For each point, we find the closest center and assigned it to the corresponding cluster.

Imagine that we have some cluster candidates,

let's define the function that measures how good the clustering is in the following way.

Take for each cluster,

total sum of distances from points to their center,

and add up these values for all clusters.

In case the value of k equals to one,

we don't have any options and just assign all the points to one cluster.

In this case, the value of function V is the sum of length of green segments.

In case the value of k is equal to two,

and the clustering is done properly,

that is we distinguish the left group from the right.

The value of V is a sum of length of red segments.

So, at least in this example,

the function V seems to be much smaller for proper clustering compared to a wrong one.

Now, we are ready to formulate our task in case of a fixed and known k. Our task is

finding such centers of clusters that minimize the energy function V. Turns out,

the task is finding a real global minimum is np-hard.

We kind of have to try

all the possible clustering options which typically are of a large number.

The most reasonable alternative is a simple gradient Lloyd's

algorithm which is guaranteed to converge to a local minimum.

Let's see how it works.

First of all, we fixed the value of k,

and initial centers of k clusters.

The next steps, are we assigning all points to the clusters according to current centers?

Followed by recalculating the centers again.

These two steps are iterated,

until the centers stop changing from iteration to iteration.

Let's see it again on the picture.

First, we assign initial centers,

then assign each point to a cluster,

then recalculate the centers and so on.

Just in a few iterations here,

the cluster center stop changing,

and we found the desired clustering. Nice and easy.

Moreover, this algorithm is easily adapted to distributed settings.

When you start at points on Hadoop cluster, namely,

cluster assignment can be done in parallel on the map step,

while new centers can be calculated on the reduced step.

Let's address some important issues of the algorithm.

The first and the simplest one is scaling.

The recipe is straightforward.

Data should always be standardized before applying kmeans.

Otherwise, the distances are dominated by features with large values,

and the result of a clustering is usually useless.

Another issue is a strong dependence on the initial centers.

It's been shown over time that in many examples,

that a purely random initialization doesn't work well.

Fortunately, there is a good heuristic called kmeans++ that seems to work much better.

The heuristics makes use of the fact that closely selected centers tend to work poorly.

So the idea to make the process iterative,

picking one center at a time,

making sure that the next center is not close to already selected.

To achieve it, every time,

the probability of which point to become a center of a cluster is

proportional to square distance to the closest already chosen center.

So, the further the point,

the more the probability.

To make this process easier to execute in parallel.

There's been proposed the better version of kmeans++ called kmeans parallel.

The idea is to reasonably reduce a set of candidates for becoming a center,

and then to apply kmeans++ on the small set of candidates.

This heuristic is implemented in Spark MLlib.

Well, looks like there is an elephant in the room.

Of course, I completely forgot to tell you how to define the number of clusters.

This is the most tricky question as there is no right answer to it.

I'll tell you the simplest approach that is called,

The Elbow method which is similar to what we've discussed previously.

Remember, that we've been minimizing the V function,

when k equals to one,

it gets its maximum value which we'll call V total.

The idea is to try different values of k and see

how small the value of V function becomes as a fraction of V total.

So, we could build this simple graph showing reduction of V for various values of k,

and look at the so-called elbow point.

Just to remind you, this is the point where the marginal reduction becomes rather small.

In this example, the elbow point is k equal to three,

which indicates that the most reasonable number of clusters is three.

Actually, there are many other ways to answer this question,

and I encourage you to at least get familiar with them.

So, today we've learned about the task of finding clusters in the data.

We discussed formal definition of a cluster along with the formal definition of our task.

Then we talked about kmeans method,

with different heuristics to define initial sentence,

and to define the number of clusters.