Let's look at Francis Galton's data. I just want to note that he first used this data in 1885, and that you should read up on Francis Galton. He's a really interesting character in history in general, and definitely in the history of statistics. He had many quirky aspects to his personality, and he invented a lot of really great things and some things that are not so great.
You need to run install.packages("UsingR"). UsingR is the package for the book Using R for Introductory Statistics. It's a great book, and they've very kindly packaged all these data sets together in a single R package. So you need to install UsingR and then run library(UsingR) to get a lot of the data sets that we're going to be talking about.
So let's first look at the marginal distribution of the parents, in other words, the distribution of the parents disregarding the children, and the marginal distribution of the children, disregarding the parents. The parent distribution is all heterosexual couples, correcting for sex, I guess, by multiplying the female heights by 1.08. Here I give my ggplot commands for creating the two histograms. On the left, I have the children's heights. The x-axis is in inches; the scale goes from 60 inches to 75. The y-axis is the count, the number of children that fall in each bin of heights. On the right, in the more bluish teal color, I have the parents' heights. I've broken the association between the children and the parents by not doing a scatter plot, and only looking at the marginal distribution of the children and the marginal distribution of the parents by themselves.
I'd like to use these distributions to introduce least squares, and then we'll build on the bivariate association after that. So consider only the children's heights; forget for the moment about using the parents' heights to predict the children's heights. We just want to find the best prediction of the children's heights without any other information.
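The loading and plotting steps described above can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact commands from the slides: it assumes the UsingR and ggplot2 packages are installed, and the names long, height, and who are made up here for illustration. The data frame galton, with columns child and parent, comes from UsingR.

```r
# install.packages(c("UsingR", "ggplot2"))  # one-time setup, if needed
library(UsingR)    # provides the galton data frame (columns: child, parent)
library(ggplot2)
data(galton)

# Stack the two columns into one long data frame so each group gets its own panel.
# 'long', 'height', and 'who' are illustrative names, not from the lecture.
long <- data.frame(
  height = c(galton$child, galton$parent),
  who    = rep(c("child", "parent"), each = nrow(galton))
)

# One histogram per group, side by side; bins are one inch wide.
g <- ggplot(long, aes(x = height, fill = who)) +
  geom_histogram(colour = "black", binwidth = 1) +
  facet_grid(. ~ who) +
  labs(x = "height (inches)", y = "count")
print(g)
```

Because each panel looks at one variable on its own, nothing in this plot shows how a given child's height relates to that child's parents.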
Well, probably the best predictor would be the middle, and how could one define the middle? Let's give one specific definition. I'm going to go through a little bit of the mathematical notation, and if that isn't your thing, then wait just a minute and we'll go through a sort of physical experiment to explain it visually.
So for the mathematics, let y i be the height for child i, where y 1 is the height for child 1, y 2 is the height for child 2, and so on. My index i here goes from 1 to n, n being 928 children in this particular data set. The middle will be the value of mu that minimizes the sum of the squared distances between the observed data, the y i's, and this value mu. That's how we define the middle. It's also related to physics: it's the so-called physical center of mass of, for example, the histogram that we showed on the previous slide. If you thought of those bars as physical entities having weight, and tried to figure out where you would put your finger to balance it out, that would be the physical center of mass. You might have guessed that the answer has to be the mean, right? The center of the data has to be the mean. And that does turn out to be the answer. We'll even go through the proof, though of course, if that's not your thing, then you don't have to worry about the proof.
Let's use RStudio's manipulate function to experiment with trying to find that center of mass. I've already loaded manipulate, but maybe I can do it again. I've loaded the galton data set. And I've defined a histogram function here that executes the ggplot commands to draw my histogram. It takes as an argument mu, the parameter that we're trying to find. So let me try my manipulate command here. Oh, I don't have ggplot2 loaded. Let's try it again. There it is. Okay, now, because we're using manipulate, I can move my slider around.
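A setup along the lines just described might look like the sketch below. It assumes the UsingR, ggplot2, and manipulate packages are installed; the function names mse and myHist, the colors, and the slider range are illustrative choices rather than something confirmed by the transcript. The manipulate call only works inside RStudio, so it is guarded by a check on the RSTUDIO environment variable.

```r
library(UsingR)    # provides the galton data frame
library(ggplot2)
data(galton)

# Mean squared error between the children's heights and a candidate center mu.
# 'mse' is an illustrative helper name.
mse <- function(mu, y = galton$child) mean((y - mu)^2)

# Draw the histogram of children's heights with a vertical line at mu,
# and report mu and the MSE in the title. 'myHist' is an illustrative name.
myHist <- function(mu) {
  ggplot(galton, aes(x = child)) +
    geom_histogram(fill = "salmon", colour = "black", binwidth = 1) +
    geom_vline(xintercept = mu) +
    ggtitle(paste0("mu = ", mu, ", MSE = ", round(mse(mu), 2)))
}

# The slider only exists inside RStudio; elsewhere, draw a single static plot.
if (identical(Sys.getenv("RSTUDIO"), "1") &&
    requireNamespace("manipulate", quietly = TRUE)) {
  manipulate::manipulate(myHist(mu), mu = manipulate::slider(62, 74, step = 0.5))
} else {
  print(myHist(68))
}
```

Dragging the slider redraws the plot with a new mu, so the title shows how the error changes as the line moves.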
What I have up here is the value of mu and the mean squared error, which is based on the sum of the squared distances between the observed data points and that particular value of mu. Because I've added up n things, I might want the average value instead of the sum. So I divided by n, and I get a mean squared error there instead of the sum of the squared errors. Okay. So I can move it around, and notice that as I get toward the center of the histogram, the mean squared error is going down. And around there might be the best I can do, 6.34. Now it's going back up. And then notice that if I move it way up, it gets large again; there it is at 14. So what you can do is play around and try to find the place that actually minimizes the sum of the squared distances between the observed data points and mu. This would be the point that balances out this histogram. Now, we know the answer: it has to be exactly at the average of the children's heights. But you can actually experiment and see that it minimizes that function.
Here I just give the code again, and I want to remind everyone that all the lecture notes are in R Markdown format, where you can get the code. It's embedded in the actual R Markdown file. So there should be a file called index.Rmd. You can go into the index.Rmd file and get all the R code for all of the lectures in the entire specialization.
Here I'm just reiterating that the least squares estimate is the empirical mean. I'm showing exactly where that would be, and giving you the ggplot commands to create this same plot.
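For anyone who wants to check the claim without the slider, one standard argument writes each deviation as (y_i - ybar) + (ybar - mu), where ybar is the sample mean. The cross term vanishes because the deviations from the mean sum to zero, which gives

$$\sum_{i=1}^{n} (y_i - \mu)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - \mu)^2 \ge \sum_{i=1}^{n} (y_i - \bar{y})^2,$$

with equality exactly when mu equals ybar. The sketch below checks this numerically on the children's heights. It assumes UsingR and ggplot2 are installed; the names sse, ybar, fit, and mu_try are illustrative, and the trial value 65 is arbitrary.

```r
library(UsingR)    # provides the galton data frame
library(ggplot2)
data(galton)

y    <- galton$child
n    <- length(y)
ybar <- mean(y)

# Sum of squared distances between the data and a candidate center mu.
sse <- function(mu) sum((y - mu)^2)

# Numerically minimize the criterion over the observed range of heights.
fit <- optimize(sse, interval = range(y))
print(c(numeric_minimizer = fit$minimum, sample_mean = ybar))

# Check the decomposition at an arbitrary trial value of mu.
mu_try <- 65
lhs <- sse(mu_try)
rhs <- sse(ybar) + n * (ybar - mu_try)^2
print(all.equal(lhs, rhs))

# An intercept-only least squares fit estimates the same quantity.
print(coef(lm(child ~ 1, data = galton)))

# Histogram of the children's heights with the least squares center marked.
g <- ggplot(galton, aes(x = child)) +
  geom_histogram(fill = "salmon", colour = "black", binwidth = 1) +
  geom_vline(xintercept = ybar)
print(g)
```

The numeric minimizer should agree with the sample mean up to the tolerance of optimize, and the intercept from lm should match the mean as well.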