Minimising this objective requires us to find the orthonormal basis that spans the subspace we will ignore. Once we have that basis, we take its orthogonal complement as the basis of the principal subspace.

Remember that the orthogonal complement of a subspace U consists of all vectors in the original vector space that are orthogonal to every vector in U.

Let us start with an example to determine these basis vectors.

And let's start in two dimensions where we wish to find

a one dimensional subspace such that the variance of

the data when projected onto that subspace is minimised.

So we're looking at two basis vectors b1 and b2 in R², where b1 spans the principal subspace and b2 its orthogonal complement, that is, the subspace that we will ignore.

We also have the constraint that b1 and b2 are orthonormal, which means that bi transpose times bj is δij, meaning that this dot product is one if i equals j and zero otherwise.

In our example with our two vectors, our loss function J is b2 transpose times S times b2, with the constraint that b2 transpose times b2 is one.

To solve this optimisation problem,

we write down the Lagrangian, which is b2 transpose times S times b2 plus λ times (1 minus b2 transpose times b2),

where λ is the Lagrange multiplier.

So now we compute the gradients of the Lagrangian with

respect to b2 and with respect to λ and set them to zero.

So dL/dλ is 1 minus b2 transpose times b2.

And this is zero,

if and only if b2 transpose times b2 is one.

So we recover our constraint.

So now let's have a look at the partial derivative of L with respect to b2.

So we get 2 b2 transpose times S from the first term and minus 2λ b2 transpose from the second term.

And this needs to be zero.

And that is zero if and only if S times b2 is λ times b2.

Here we end up with an eigenvalue problem: b2 is an eigenvector of the data covariance matrix, and the Lagrange multiplier plays the role of the corresponding eigenvalue.

If we now go back to our loss function,

we can use this expression.

We can write J, which was b2 transpose times S times b2. Now we know that S times b2 can be written as λ times b2, so we get b2 transpose times b2 times λ, and because we have an orthonormal basis, we end up with λ as our loss function.

Therefore the average squared reconstruction error is minimised if λ is the smallest eigenvalue of the data covariance matrix.

And that means we need to choose b2 as the corresponding eigenvector, and that one will span the subspace that we will ignore. b1, which spans the principal subspace, is then the eigenvector that belongs to the largest eigenvalue of the data covariance matrix.

Keep in mind that the eigenvectors of the covariance matrix are already orthogonal to each other because of the symmetry of the covariance matrix.
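The two-dimensional case above can be checked numerically. The following is a minimal sketch, assuming NumPy; the synthetic data and variable names are illustrative and not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 2D data with much more variance along the first axis
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
X = X - X.mean(axis=0)                # centre the data
S = X.T @ X / len(X)                  # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
b2 = eigvecs[:, 0]                    # smallest eigenvalue -> subspace we ignore
b1 = eigvecs[:, 1]                    # largest eigenvalue  -> principal subspace

assert np.isclose(b1 @ b2, 0.0)             # eigenvectors are orthogonal
assert np.isclose(b2 @ S @ b2, eigvals[0])  # loss J = b2' S b2 = smallest eigenvalue
```

Since S is symmetric, `np.linalg.eigh` is the appropriate routine, and its ascending ordering makes picking the smallest and largest eigenvalues straightforward.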

So if we look at a two-dimensional example: if this is our data, then the best projection we can get that retains the most information is the one onto the subspace spanned by the eigenvector of the data covariance matrix that belongs to the largest eigenvalue, and that is indicated by the long arrow here.

Now let's go to the general case.

If we want to find the M-dimensional principal subspace of a D-dimensional data set, we solve for the basis vectors bj, where j equals M+1 to D. When we optimise over these, we end up with the same kind of eigenvalue problem that we had earlier in the simple example.

We end up with S times bj equals λj times bj, for j equals M+1 to D.

And the loss function is given by the sum of the corresponding eigenvalues: we can write J as the sum from j equals M+1 to D of the λj.
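This general result can also be verified numerically: the average squared reconstruction error after projecting onto the M-dimensional principal subspace equals the sum of the D minus M smallest eigenvalues. A small sketch, again assuming NumPy and illustrative synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, N = 5, 2, 1000
X = rng.normal(size=(N, D)) * np.arange(1, D + 1)  # different variance per axis
X = X - X.mean(axis=0)
S = X.T @ X / N                                    # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)               # ascending eigenvalues
B = eigvecs[:, -M:]                                # basis of the principal subspace
X_rec = X @ B @ B.T                                # project onto that subspace

# average squared reconstruction error J equals the sum of the discarded eigenvalues
J = np.mean(np.sum((X - X_rec) ** 2, axis=1))
assert np.isclose(J, eigvals[:-M].sum())
```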

Also in the general case,

the average reconstruction error is minimised if we choose the basis vectors that span the ignored subspace to be the eigenvectors of the data covariance matrix that belong to the smallest eigenvalues.

This equivalently means that the principal subspace is spanned by the eigenvectors belonging to the M largest eigenvalues of the data covariance matrix.

This nicely aligns with properties of the covariance matrix.

The eigenvectors of the covariance matrix are orthogonal to each other because of symmetry, and the eigenvector belonging to the largest eigenvalue points in the direction of the data with the largest variance, where the variance in that direction is given by the corresponding eigenvalue.

Similarly, the eigenvector belonging to

the second largest eigenvalue points in the direction of

the second largest variance of the data and so on.
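This property, that the variance of the data along each eigenvector equals the corresponding eigenvalue, is easy to confirm numerically. A brief sketch with illustrative synthetic data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.5])
X = X - X.mean(axis=0)
S = X.T @ X / len(X)                  # data covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)
# variance of the data projected onto each eigenvector equals its eigenvalue
proj_var = np.var(X @ eigvecs, axis=0)
assert np.allclose(proj_var, eigvals)
```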

In this video we identified the orthonormal basis of the principal subspace as the eigenvectors of the data covariance matrix that are associated with the largest eigenvalues.

In the next video, we will put all the pieces

together and run through the PCA algorithm in detail.