So, this means that Xn tilde is

the orthogonal projection of Xn onto the subspace spanned by the M basis vectors,

bj, where j = 1 to M. Similarly,

we can write Xn as the sum j = 1 to

M of bj times bj transpose times Xn,

plus a second sum that runs from j = M + 1 to D of

bj times bj transpose times Xn.

So, we write Xn as a projection onto

the principal subspace plus a projection onto the orthogonal complement.
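This decomposition can be checked numerically. The following is a minimal sketch (all variable names are illustrative, not from the lecture): a vector x is split into its projection onto the principal subspace spanned by b_1 to b_M and the remainder in the orthogonal complement, here with D = 3 and M = 2.

```python
import numpy as np

# Illustrative setup: an orthonormal basis for R^3 and a random vector x.
rng = np.random.default_rng(0)
B = np.linalg.qr(rng.standard_normal((3, 3)))[0]  # columns are orthonormal b_1..b_3
x = rng.standard_normal(3)
M = 2

# sum_{j=1}^{M} b_j b_j^T x  -- projection onto the principal subspace
x_tilde = sum(np.outer(B[:, j], B[:, j]) @ x for j in range(M))
# sum_{j=M+1}^{D} b_j b_j^T x  -- projection onto the orthogonal complement
remainder = sum(np.outer(B[:, j], B[:, j]) @ x for j in range(M, 3))

assert np.allclose(x_tilde + remainder, x)  # the two parts recover x exactly
```

The assertion holds because the b_j form a complete orthonormal basis, so the two projection matrices sum to the identity.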

And this second term is the one that is missing in the expression for Xn tilde.

That's the reason why Xn tilde is only an approximation to Xn.

So if we now look at the difference vector between Xn tilde and Xn,

what remains is exactly this term.

So Xn minus Xn tilde is

the sum j = M + 1 to D of

bj times bj transpose times Xn.

So, now we can look at this displacement vector,

the difference between Xn and its projection,

and we can see that

the displacement vector lies exclusively in the subspace that we ignore.

That means the orthogonal complement to the principal subspace.
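We can verify this orthogonality claim directly. This is an illustrative check (names are assumptions): the displacement x minus x tilde has zero dot product with every basis vector of the principal subspace, so it lives entirely in the orthogonal complement.

```python
import numpy as np

# Illustrative setup: orthonormal basis for R^4, principal subspace of dimension M = 2.
rng = np.random.default_rng(1)
B = np.linalg.qr(rng.standard_normal((4, 4)))[0]
x = rng.standard_normal(4)
M = 2

x_tilde = B[:, :M] @ (B[:, :M].T @ x)  # orthogonal projection onto the principal subspace
displacement = x - x_tilde

# Dot product with each principal basis vector b_1..b_M is zero:
assert np.allclose(B[:, :M].T @ displacement, 0.0)
```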

Let's look at an example in two dimensions.

We have a data set in two dimensions, represented by these dots, and now we are

interested in projecting them onto the U1 subspace.

When we do this and then look at

the difference vector between the original data and the projected data,

we get these vertical lines.

That means they have no x component,

no variation in x.

That means they only have a component that lives in the subspace U2 which

is the orthogonal complement to U1 which is the subspace that we projected onto.
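The same two-dimensional picture can be sketched with toy data (not the lecture's data set): projecting onto the subspace spanned by u1 = (1, 0) leaves difference vectors with no x component, so they lie entirely along the u2 direction.

```python
import numpy as np

# Toy 2-D data, illustrative only.
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 2))   # five 2-D data points
u1 = np.array([1.0, 0.0])         # direction of the subspace we project onto

X_proj = np.outer(X @ u1, u1)     # project each point onto span(u1)
diff = X - X_proj                 # the "vertical lines" in the illustration

assert np.allclose(diff[:, 0], 0.0)  # no variation in x: diff lives in span(u2)
```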

So, with this illustration,

let's quickly rewrite this in a slightly different way.

We're going to write this as the sum j = M + 1 to D of

bj transpose Xn times bj, and we're going to call

this equation E. We looked at

the displacement vector between Xn and its

orthogonal projection onto the principal subspace, Xn tilde.

And now we're going to use this to reformulate our loss function.

So, from equation B,

we get that our loss function is 1 over N times

the sum n = 1 to N of Xn minus Xn tilde squared.

So, this is the average squared reconstruction error and now we're

going to use equation E for the displacement vector here.

So we rewrite this now using equation E as 1 over N times the sum n

= 1 to capital N. And now we're going

to use inside that squared norm this expression here.

So we get the sum j = M + 1 to

D of bj transpose times Xn times bj squared.

And now we're going to use the fact that the bjs form an

orthonormal basis and this will greatly simplify this expression,

and we will get 1 over N times the sum n = 1 to capital N of

the sum j = M + 1 to D of bj transpose times Xn squared.
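The orthonormality step can be checked numerically. In this sketch (illustrative values), the squared norm of the sum over j of bj transpose Xn times bj collapses to the sum of the squared coefficients, because all cross terms bi transpose bj vanish for i not equal to j.

```python
import numpy as np

# Illustrative setup: orthonormal basis for R^5, M = 2.
rng = np.random.default_rng(3)
B = np.linalg.qr(rng.standard_normal((5, 5)))[0]
x = rng.standard_normal(5)
M = 2

coeffs = B[:, M:].T @ x                        # b_j^T x for j = M+1..D
lhs = np.linalg.norm(B[:, M:] @ coeffs) ** 2   # ||sum_j (b_j^T x) b_j||^2
rhs = np.sum(coeffs ** 2)                      # sum_j (b_j^T x)^2

assert np.allclose(lhs, rhs)
```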

And now we're going to multiply this out

explicitly and we get 1 over N times the sum over

n of the sum over j of

bj transpose times Xn times Xn transpose times bj.

So, this expanded expression is identical to the squared expression from before.

And now I'm going to rearrange the sums.

So I'm going to move the sum over j outside.

So I'll have sum over

j = M + 1 to D of bj transpose.

So this is independent of n,

times 1 over N the sum

n = 1 to N of Xn times Xn transpose

and then the remaining bj from this expression, times bj.

So I'm going to bracket it now in this way.

And what we can see now is that if we look very carefully,

we can identify the expression 1 over N times the sum of Xn times Xn transpose as the data covariance matrix S,

because we assumed we have centred data.

So the mean of the data is zero.

This means now we can rewrite our loss function using the data covariance matrix,

and we get that our loss is the sum over j = M + 1 to D of bj

transpose times S times bj and we can

also use a slightly different interpretation

by rearranging a few terms and using the trace operator.

So, we can now also write this as the trace of the sum of j = M + 1 to D

of bj times bj

transpose, times S. And we

can now also interpret the sum of the bj bj transpose terms as a projection matrix.
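The whole chain can be tied together on centred toy data. This sketch (all names illustrative) checks that the average squared reconstruction error equals the sum over j of bj transpose S bj, which equals the trace of the projection matrix times S.

```python
import numpy as np

# Centred toy data in R^3, illustrative orthonormal basis, M = 1.
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 3))
X -= X.mean(axis=0)                       # centre the data: mean is zero
S = (X.T @ X) / len(X)                    # data covariance matrix
B = np.linalg.qr(rng.standard_normal((3, 3)))[0]
M = 1

# Average squared reconstruction error (equation B's form of the loss):
X_tilde = X @ B[:, :M] @ B[:, :M].T       # project every point onto the principal subspace
loss = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

# Quadratic-form version: sum_{j=M+1}^{D} b_j^T S b_j
loss_quadratic = sum(B[:, j] @ S @ B[:, j] for j in range(M, 3))

# Trace version: trace((sum_{j=M+1}^{D} b_j b_j^T) S)
P = sum(np.outer(B[:, j], B[:, j]) for j in range(M, 3))
loss_trace = np.trace(P @ S)

assert np.allclose(loss, loss_quadratic)
assert np.allclose(loss, loss_trace)
```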