So it turns out the second right singular vector also shows a similar behavior.

It's got a difference between the first five samples and the next five samples.

But then there's an oscillation in the pattern within the groups.

So what does this mean?

It means the singular value decomposition is finding patterns that

explain the most variation.

But it doesn't necessarily directly decompose the patterns due to variables

that you think you might care about.

And so it's not quite a perfect recapitulation

of the variables that generated the data set, but it does still give you some idea

of the patterns that you might see in the data set.

Again if you calculate the percentage of variance explained, so

here are the D values plotted from one to ten, one value per diagonal entry because D is a diagonal matrix.

You can also see the percentage of variance explained is still very high by

the first pattern and the second pattern, and then it drops off.
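
As a rough sketch of the percentage of variance explained, the D values can be squared and divided by their total. The simulated matrix, the row centering, and the function name `variance_explained` below are assumptions made for illustration, not the data set from the lecture.

```python
import numpy as np

def variance_explained(data):
    """Return the fraction of variance explained by each singular vector."""
    # Center each row (feature) so the SVD describes variation around the row means.
    centered = data - data.mean(axis=1, keepdims=True)
    # d holds the diagonal entries of D, sorted from largest to smallest.
    _, d, _ = np.linalg.svd(centered, full_matrices=False)
    return d**2 / np.sum(d**2)

# Hypothetical simulated data: 40 features (rows) by 10 samples (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(40, 10))

# Made-up pattern 1: a shift between the first five and the next five samples.
group = np.repeat([0.0, 1.0], 5)
data[:20] += 3.0 * group

# Made-up pattern 2: an alternating pattern across the samples.
alternating = np.tile([0.0, 1.0], 5)
data[20:] += 3.0 * alternating

pve = variance_explained(data)
print(np.round(pve, 3))
```

With two simulated patterns, most of the variance should land on the first two values, and the rest should be spread over the remaining ones.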

So again we're kind of getting some idea of the dimension of

the true underlying variables that are sort of contributing to that data set,

as well as what they look like.

But they're not exactly the same, because of this requirement of orthogonality.

So how is this applied?

I was going to show you one example from genetics here.

So in this example, they took a genetic matrix that consisted of,

in the rows they had many, many, many SNPs, so single nucleotide polymorphisms.

And in the columns they had many samples from people from different

places throughout Europe.

And, so they calculate the first two singular vectors which

are equivalent to the first two principal components, PC1 and PC2 here.

And when they plot them, you can see that if you plot each sample according

to these two principal components, you see that they cluster by geography.

So for example, here you see the Spanish and

Portuguese samples down here.

You see Italian samples over here and so forth.

So you get basically an identification of the structure in

the genetic data that corresponds to the geographic structure.

And that makes sense because genetics tend to be associated or have patterns that

are associated with population structure, which is then associated with geography.

Because people tend to have relationships and

children with people who live close to them.

So there's a relationship between geography and population structure.
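
Here is a minimal sketch of why the singular vectors and the principal components line up. It assumes each SNP (row) is centered first, and it uses made-up genotypes for three hypothetical populations rather than the European data from the example. The right singular vectors of the centered matrix are compared with the eigenvectors of the sample-by-sample cross-product matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n_snps, per_pop = 200, 10

# Hypothetical allele frequencies for three made-up populations.
freqs = rng.uniform(0.1, 0.9, size=(n_snps, 3))
labels = np.repeat([0, 1, 2], per_pop)

# Genotypes coded as 0, 1, or 2 copies of an allele.
# Rows are SNPs, columns are samples.
genotypes = rng.binomial(2, freqs[:, labels]).astype(float)

# Center each SNP (an assumption of this sketch).
centered = genotypes - genotypes.mean(axis=1, keepdims=True)

# SVD route: rows of vt are the right singular vectors (one value per sample).
u, d, vt = np.linalg.svd(centered, full_matrices=False)

# Eigen route: eigenvectors of the sample-by-sample cross-product matrix.
eigvals, eigvecs = np.linalg.eigh(centered.T @ centered)
order = np.argsort(eigvals)[::-1]  # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

for k in range(2):
    # Vectors are only defined up to sign, so compare the absolute dot product.
    match = abs(np.dot(vt[k], eigvecs[:, k]))
    print(f"PC{k + 1}: |v . eigenvector| = {match:.6f}, "
          f"d^2 = {d[k]**2:.2f}, eigenvalue = {eigvals[k]:.2f}")

# One (PC1, PC2) point per sample, ready to plot and color by population label.
pc_coords = vt[:2].T
```

Plotting `pc_coords` colored by `labels` would be the analogue of the geography plot, with the simulated populations standing in for countries.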

So another way this can be used is to identify patterns in a data set.

So again here I'm plotting PC1, or

Singular Vector One, versus Singular Vector Two.

And so what I'm trying to do is I'm trying to find distances between samples.

And I'm looking at the right singular vector that's looking at patterns in

the samples across rows.

And so here, each dot represents one sample and they're colored by

whether they're a human or a mouse sample from this specific study.

And then the symbol indicates which tissue they came from.

So, if you look at this data set,

the distance between any two points in the plot is supposed to be

an estimate of the distance between those two samples.

If these PCs, or singular vectors, explain a large percentage of the variation,

then that's a really close approximation of the distance between the two samples.

If they don't explain a large percentage of the variation, then it's not a very

good approximation.

So here you can see, for

example, that the testis samples from human and mouse are close to each other.

And the liver samples for human and mouse are also close to each other.

If you actually do a clustering you see that that's true.

You see the testis samples cluster close to each other, as do the liver samples.

And so what does this plot suggest?

This would suggest that samples from the same tissue are more closely related

than samples from the same species.
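
The distance approximation described above can be checked with a small sketch. It assumes each sample's coordinates are the singular vectors scaled by their singular values (the plot in the lecture may use unscaled vectors), and the tissue and species effects are simulated, not the real human and mouse data.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all rows of `points`."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs**2).sum(axis=-1))

rng = np.random.default_rng(2)

# Hypothetical design: 8 samples, two tissues crossed with two species.
tissue = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)
species = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# 100 features (rows) by 8 samples (columns).
data = rng.normal(scale=0.5, size=(100, 8))
data[:60] += 4.0 * tissue      # strong made-up tissue effect
data[60:80] += 1.0 * species   # weaker made-up species effect

centered = data - data.mean(axis=1, keepdims=True)
u, d, vt = np.linalg.svd(centered, full_matrices=False)

# Distances between samples using every feature.
full_dist = pairwise_distances(centered.T)

def reduced_distances(k):
    """Distances between samples using only the first k singular vectors."""
    coords = (d[:k, None] * vt[:k]).T  # one row of k coordinates per sample
    return pairwise_distances(coords)

for k in (2, len(d)):
    pve = np.sum(d[:k]**2) / np.sum(d**2)
    max_err = np.max(np.abs(full_dist - reduced_distances(k)))
    print(f"k = {k}: variance explained = {pve:.3f}, "
          f"largest distance error = {max_err:.3f}")
```

Keeping every singular vector reproduces the full distances, and keeping only two can only shrink them, by an amount that depends on how much variation the dropped vectors carried.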

And so, another way that you can use this is you can actually try to identify

effects that are different between groups.

So here, this is an actual example that comes from this book.

And so, in this example,

they've actually taken a real data set and made a subset of that data set.

And so, the subset of the data set that they've taken is from two

different batches.

But then, within those two different batches they've taken some samples from

men and some samples from women and

they've looked at genes on the Y chromosome.

And so, here you can see, here are the women and the men from batch one, and

here are the women and the men from batch two.

And so, you can see, for example,

that there are some genes that are very different between the two batches.

But there also are some genes that are different between the two sexes.

And so if you look at the first singular vector of this data set, or the first

principal component,

you actually see that the biggest effect that you see is the batch variable.

So you can see that batch one and batch two are very different from each other.

And so you can use that to detect different variables in the data set.

Whether it's batch effects or whether it's group differences

by decomposing the data into a smaller number of variables.

This is widely used, like I said, for batch effects.

This often comes up in technical artifact correction which we'll talk about later.
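
To make the batch example concrete, here is a minimal simulated sketch. The effect sizes, the gene counts, and the helper `abs_corr` are all assumptions for illustration, not the data set from the book. It checks which known variable the first right singular vector tracks.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical design: 20 samples, two batches, both sexes in each batch.
batch = np.repeat([0.0, 1.0], 10)
sex = np.tile(np.repeat([0.0, 1.0], 5), 2)

# 100 genes (rows) by 20 samples (columns).
data = rng.normal(size=(100, 20))
data[:50] += 2.0 * batch   # made-up batch effect on half of the genes
data[90:] += 3.0 * sex     # made-up sex effect on only a few genes

centered = data - data.mean(axis=1, keepdims=True)
_, d, vt = np.linalg.svd(centered, full_matrices=False)

def abs_corr(x, y):
    """Absolute Pearson correlation between two vectors."""
    return abs(np.corrcoef(x, y)[0, 1])

for k in range(2):
    print(f"singular vector {k + 1}: "
          f"|corr| with batch = {abs_corr(vt[k], batch):.2f}, "
          f"|corr| with sex = {abs_corr(vt[k], sex):.2f}")
```

Because the simulated batch effect touches many more genes than the sex effect, the first singular vector should mostly follow batch, which is the kind of check you would run against the variables you recorded.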

There are also many other decompositions people use.

They use multidimensional scaling,

independent component analysis, non-negative matrix factorization.

We're not going to cover those in this class,

because they're not as widely used as PCA and SVD, but they are other

matrix decompositions or ways to reduce the dimension of data that you might see.

If you want a lot more discussion of this you can see it in this

Advanced Statistics for Life Sciences course, where they go into pretty deep

detail about these different matrix decompositions.