We derived PCA from the perspective of minimising the average squared reconstruction error. However, PCA can also be interpreted from other perspectives, and in this video we'll have a brief look at some of these interpretations.

Let's start with a recap of what we have done so far. We took a high-dimensional vector x and projected it onto a lower-dimensional representation z using the matrix B transpose. The columns of this matrix B are the eigenvectors of the data covariance matrix that are associated with the largest eigenvalues. The z values are the coordinates of our data point with respect to the basis vectors that span the principal subspace; z is also called the code of our data point. Once we have that lower-dimensional representation z, we can map it back to the original data space by multiplying B onto z, which gives us the reconstruction. We found the PCA parameters such that the reconstruction error between x and the reconstruction x tilde is minimised.

We can also think of PCA as a linear autoencoder. An autoencoder encodes a data point x and tries to decode it to something similar to the original data point. The mapping from the data to the code is called the encoder; the mapping from the code back to the original data space is called the decoder. If the encoder and decoder are linear mappings, then we obtain the PCA solution when we minimise the squared autoencoding loss. If we replace the linear mappings of PCA with nonlinear mappings, we get a nonlinear autoencoder. A prominent example is a deep autoencoder, in which the linear mappings of the encoder and decoder are replaced with deep neural networks.

Another interpretation of PCA is related to information theory. We can think of the code as a smaller, compressed version of the original data point.
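As a small numerical sketch of this recap (using an assumed toy data set and NumPy): the encoder is z = B transpose x, the decoder is x tilde = B z, and the columns of B are the eigenvectors of the data covariance matrix with the largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # toy data set: 100 points in 5 dimensions
X = X - X.mean(axis=0)                 # centre the data

# Columns of B: eigenvectors of the data covariance matrix
# associated with the largest eigenvalues.
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)            # eigenvalues in ascending order
B = eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # 2-dimensional principal subspace

Z = X @ B            # encoder: code z = B^T x for every data point
X_tilde = Z @ B.T    # decoder: reconstruction x~ = B z

# The average squared reconstruction error that PCA minimises.
error = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
```

Because the columns of B are orthonormal, encoding a reconstruction and decoding it again changes nothing: the map is a projection onto the principal subspace.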
When we reconstruct our original data using the code, we don't get the exact data point back, but a slightly distorted or noisy version of it. This means that our compression is lossy. Intuitively, we want to maximise the correlation between the original data and its lower-dimensional code; more formally, this is captured by the mutual information, a core concept in information theory. Maximising the mutual information then yields the same solution as the PCA formulation we discussed earlier in this course.

When we derived PCA using projections, we reformulated minimising the average reconstruction error as minimising the variance of the data projected onto the orthogonal complement of the principal subspace. Minimising that variance is equivalent to maximising the variance of the data when projected onto the principal subspace. If we think of the variance in the data as the information contained in the data, PCA can also be interpreted as a method that retains as much information as possible.

We can also look at PCA from the perspective of a latent variable model. We assume that an unknown lower-dimensional code z generates the data x, and that there is a linear relationship between z and x. Generally, we can then write that x is B times z plus mu, plus some noise. We assume that the noise is isotropic with mean zero and covariance matrix sigma squared times I. We further assume that the distribution of z is a standard normal, so P of z is Gaussian with mean zero and the identity matrix as covariance. We can now write down the likelihood of this model: P of x given z is a Gaussian distribution in x with mean Bz plus mu and covariance matrix sigma squared I. We can also compute the marginal likelihood as P of x is the integral of P of x given z.
So, that is the likelihood times the distribution on z, integrated over z, and that turns out to be a Gaussian distribution in x with mean mu and covariance matrix B times B transpose plus sigma squared I. The parameters of this model are mu, B, and sigma squared. We can now determine these parameters using maximum likelihood estimation, and we will find that mu is the mean of the data and B is a matrix containing the eigenvectors that correspond to the largest eigenvalues. To get the low-dimensional code of a data point, we can apply Bayes' theorem to invert the linear relationship between z and x. In particular, we get P of z given x as the likelihood P of x given z, times the prior P of z, divided by the marginal likelihood P of x.

In this video, we looked at five different perspectives on PCA that lead to different objectives: minimising the squared reconstruction error, minimising the autoencoder loss, maximising the mutual information, maximising the variance of the projected data, and maximising the likelihood in a latent variable model. All these perspectives give us the same solution to the PCA problem. Their individual strengths and weaknesses become clearer and more important when we consider properties of real data.
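The latent variable view can be sketched numerically. A minimal example, assuming small illustrative dimensions and an arbitrarily chosen noise variance: we sample data from the generative model, check that the empirical moments match the marginal likelihood P of x, and compute the posterior P of z given x using the closed form that Bayes' theorem yields for this linear-Gaussian model.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Mdim, sigma2 = 4, 2, 0.1     # data dim, code dim, noise variance (illustrative)
B = rng.normal(size=(D, Mdim))  # linear map from code space to data space
mu = rng.normal(size=D)

# Generative model: z ~ N(0, I), x = B z + mu + eps, eps ~ N(0, sigma^2 I).
N = 200_000
Z = rng.normal(size=(N, Mdim))
eps = rng.normal(scale=np.sqrt(sigma2), size=(N, D))
X = Z @ B.T + mu + eps

# Marginal likelihood: p(x) is Gaussian with mean mu and
# covariance B B^T + sigma^2 I; the sampled data should match it.
C = B @ B.T + sigma2 * np.eye(D)
emp_mean = X.mean(axis=0)
emp_cov = np.cov(X, rowvar=False)

# Posterior p(z | x) via Bayes' theorem (linear-Gaussian closed form):
#   mean = M^{-1} B^T (x - mu),  cov = sigma^2 M^{-1},  M = B^T B + sigma^2 I
x = X[0]                        # any observed data point
M = B.T @ B + sigma2 * np.eye(Mdim)
post_mean = np.linalg.solve(M, B.T @ (x - mu))
post_cov = sigma2 * np.linalg.inv(M)
```

The posterior mean computed this way agrees with what direct Gaussian conditioning on the joint distribution of (z, x) gives, which is one way to sanity-check the Bayes' theorem inversion.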