Okay.

In this lesson,

you will study gradient boosted decision trees for solving the classification problem.

It is a bit harder than gradient boosted

decision trees for solving the regression problem.

I think this lesson can be considered advanced material.

The main difference here is that, in the training set, the targets y_i that you are going to predict now act as class labels, not real values as in the regression problem.

And your goal here is to find a function f(x), using the training set, such that the misclassification error on the test set is as small as possible.

How are you going to build such a function f(x)?

You use a probabilistic model: the probability that an object belongs to the first class is modeled by the expression 1 / (1 + exp(-(h_1(x) + ... + h_M(x)))).

Inside the exp, there is the sum of the h_m(x), and each h_m(x) is a decision tree.

You can easily check that this expression for the probability is always between zero and one, so it is a valid probability.

The function 1 / (1 + exp(-x)) is called the sigmoid function; it maps all real values into the range between zero and one.
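As a small illustrative sketch (in Python with NumPy; the raw score value is made up for the example), the sigmoid and the resulting class probability look like this:

```python
import numpy as np

def sigmoid(x):
    # maps any real value into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# the ensemble score f(x) is the sum of the trees' outputs;
# here we just use a made-up raw score for illustration
raw_score = 2.0
p_first_class = sigmoid(raw_score)  # probability of the first class
```

Whatever real value the tree ensemble produces, the sigmoid turns it into a number strictly between zero and one.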

Let us denote the sum of all the h_m(x) by f(x).

It is an ensemble of decision trees.

Then you can write the probability of belonging to the first class in a simple way using f(x).

And the main idea which is used here is called the principle of maximum likelihood.

What is it?

First of all, what is the likelihood?

The likelihood is the probability of observing some data given a statistical model.

If you have a data set with objects from one to n, then the probability of observing such a data set is the product of the probabilities for all the single objects.

This product is called the likelihood.

The principle of maximum likelihood states the following: you should find a function f(x) that maximizes the likelihood.

Thus, you want to find the underlying hidden statistical model that maximizes the likelihood of observing the data set.

It is easy to see that this is equivalent to finding a function f(x) that maximizes the logarithm of the likelihood, because the logarithm is a monotone function.

It is also easier, from a computational point of view, to deal with logarithms.

Okay. We will denote the logarithm of the likelihood by Q(f); now it is a sum of logarithms of probabilities, and you are going to maximize this function.

We will use the shorthand L(y_i, f(x_i)) for each of these logarithms.

It is the logarithm of a probability, and here I emphasize that each of these logarithms actually depends on the true label y_i and on your prediction f(x_i).

Now, Q(f) is the sum of L(y_i, f(x_i)) over all objects.
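To make this concrete, here is a small sketch (Python with NumPy, binary labels y_i in {0, 1}; the toy labels and scores are made up) of the log-likelihood Q(f) for the sigmoid model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_likelihood(y, f):
    """Q(f) = sum_i L(y_i, f(x_i)) for the sigmoid model.

    y: array of labels in {0, 1}; f: array of ensemble scores f(x_i).
    """
    p = sigmoid(f)
    # log P(y_i | x_i): log p_i for the first class, log(1 - p_i) otherwise
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy example: three objects with their true labels and current scores
y = np.array([1, 0, 1])
f = np.array([2.0, -1.0, 0.5])
q = log_likelihood(y, f)  # higher is better; each log-probability is negative
```

Scores that agree with the labels (large positive f for class one, large negative f for class zero) push each probability toward one and the sum toward zero, which is exactly what maximizing Q(f) rewards.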

In this slide, you can see the algorithm for training gradient boosted decision trees for classification.

You have the training data set Z, and capital M is the number of iterations.

And what's going on here?

At the first step, the initial approximation is f_0(x), and it equals log(p_1 / (1 - p_1)), where p_1 is the proportion of objects of the first class.

One can show that this is the optimal initial approximation for the sigmoid model.
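For example (a toy sketch in Python with NumPy; the labels are made up), the initial approximation from the class proportion looks like this:

```python
import numpy as np

y = np.array([1, 1, 1, 0])      # toy labels: three objects of the first class
p1 = y.mean()                    # proportion of the first class, here 0.75
f0 = np.log(p1 / (1 - p1))       # log-odds initial approximation

# passing f0 back through the sigmoid recovers p1, so the initial
# constant prediction matches the observed class proportion
p_back = 1.0 / (1.0 + np.exp(-f0))
```

This is why the log-odds constant is the natural starting point: before any trees are added, the model already predicts the first class with probability p_1.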

Then you repeat capital M iterations.

At step number three, you calculate the gradient of L. What does it mean?

You want to maximize Q(f), and Q(f) is the sum of L(y_i, f(x_i)).

To maximize a function, you should move in the direction of its gradient.

Thus, you calculate this gradient for all the objects in your data set.

Then, at step number four, you create an auxiliary training set (x_i, g_i).

The only difference between this data set and the original one is that the original labels y_i are replaced with the gradients g_i, and you fit a decision tree h_m(x) to this new target g_i.

After fitting, you calculate the optimal step size rho_m.

It is the step size such that the addition of the new decision tree h_m to the composition maximizes the likelihood.

And finally, at step number six, you add this new decision tree h_m(x) to the ensemble with a small multiplier mu, which improves convergence.

By taking such small steps, you also help guarantee the good predictive power of your model: the model is much less likely to overfit.
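Putting the whole algorithm together, here is a minimal sketch (assuming Python with scikit-learn's DecisionTreeRegressor as the base learner; for the sigmoid model the gradient at step three works out to g_i = y_i - sigmoid(f(x_i)), and the line search for rho_m at step five is simplified to a fixed rho_m = 1 for brevity):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_gbdt_classifier(X, y, M=100, mu=0.1, max_depth=2):
    """Gradient boosting for binary classification, labels y in {0, 1}."""
    p1 = y.mean()
    f0 = np.log(p1 / (1 - p1))           # step 1: log-odds initial approximation
    f = np.full(len(y), f0)
    trees = []
    for m in range(M):                   # step 2: repeat M iterations
        g = y - sigmoid(f)               # step 3: gradient of the log-likelihood
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, g)                   # step 4: fit a tree to the targets g_i
        # step 5 (simplified): take rho_m = 1 instead of a line search
        f += mu * tree.predict(X)        # step 6: add the tree with multiplier mu
        trees.append(tree)
    return f0, trees

def predict_proba(f0, trees, X, mu=0.1):
    f = np.full(len(X), f0)
    for tree in trees:
        f += mu * tree.predict(X)
    return sigmoid(f)                    # probability of the first class

# toy usage: one informative feature, made-up data
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])
f0, trees = fit_gbdt_classifier(X, y, M=50)
proba = predict_proba(f0, trees, X)
```

Each iteration nudges the ensemble score f(x) in the gradient direction of Q(f), and the small multiplier mu keeps every step modest, which is exactly the slow, regularized progress described above.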