Estimation is something you do quite often. What budget should you save for vacation? How long will it take you to get to work in rush hour? Do you have enough gas to get home? These are all examples of estimation. You do not know the exact answer or value, but you can make an educated guess. In this video I want to explain the concept of estimation, because it is at the heart of many statistical analyses. We will also talk about the accuracy of an estimate, to try to answer the question: how precise is my estimate? For that, let's have a look at our coffee example again. We are interested in the caffeine percentage of all produced coffee and have collected a sample of 40 batches of coffee, and we measured the caffeine percentage for these 40. In the video on descriptive statistics, we already discovered that the mean caffeine percentage in the sample was equal to 0.083. However, this describes only these 40 measured batches. In the end, we want to draw conclusions about all produced coffee, not just these 40 batches. So, how does the sample mean of 0.083 relate to all produced coffee? Well, that is exactly the goal of statistics: to generalize from a sample distribution to a population distribution. Basically, we use our sample statistics, like the mean, the X with the bar, and the standard deviation, the S, to estimate our population parameters. These are often denoted by the Greek letters mu and sigma. To explain this in more detail, I first want to analyze the sample of caffeine percentages in Minitab, before we generalize the sample statistics to the entire population of produced coffee. I have copied the data into Minitab, and I have a column with batch numbers and a column with the caffeine percentages that we measured for the 40 batches. Okay, to analyze this data I will show you a function which you can find under Stat and basic descriptive statistics. 
We will make what's called the graphical summary, which puts all the different things about the caffeine percentage together. So we want to have a summary of the caffeine percentage. Okay, and this is the output provided by Minitab, where you get a histogram, some statistics, a box plot, and confidence intervals. Let's first take a look at the histogram, before we study all the other provided statistics. The blue bars are the histogram of our 40 values in this sample. The red line is the estimated, or fitted, population distribution. This represents all the coffee produced, and not just the batches in our sample. In the video on normal, lognormal and Weibull distributions, I will discuss these fitted distributions in more detail. But for now, just remember that we are interested in knowing what this distribution looks like, in the sense of where its location is and what its spread is. The location is denoted by the Greek letter mu, and it is an unknown value, because we have not measured all produced coffee. The spread is denoted by the Greek letter sigma, and it is also an unknown value, for the same reason. So, how can we use this sample to say something about this location and this spread? Let's have a look at the other output Minitab provided us. This bit on the Anderson-Darling test will not be discussed here. For that, see the video on probability plots. We already know these reported statistics from the video on descriptive statistics. The mean based on the sample can be used as an estimate for the location of the population. Alternatively, we could also use the median as an estimate for the unknown location of the population of produced coffee. So we can conclude that, based on the sample, we expect the population to have a location of 0.0832, the mean. Or should it be 0.0845, the median? Which one is it? Before I answer that question, let's recap what we just discussed. 
We wish to say something about the caffeine percentage in the total population of coffee. Because we do not know what that caffeine percentage is, we select a sample of size 40 and measure the caffeine content for those batches. We analyzed the data and made a histogram and descriptive statistics. Next, we wish to say something about the total population using this sample. The location of the population is denoted by the Greek letter mu, and the spread of the population is denoted by the Greek letter sigma. But which descriptive statistics should we use to estimate mu and sigma? For mu, the location, we often use the mean. An alternative is the median. So which estimator should we use? It depends on the situation. The mean is more precise, as it uses all data, but the median is more robust to outliers. So both can be useful. For sigma, the spread in the data, the most often used estimator is the sample standard deviation. But we can also use the interquartile range, which is the difference between Q3, the third quartile, and Q1, the first quartile. Or we can use the range, the difference between the maximum and the minimum. Again, it depends on the situation which estimator is most useful. The sample standard deviation is very precise, as it uses all data values in your sample. However, just like the range, it can be sensitive to outliers. The interquartile range is more robust to outliers. For more details on these statistics, see the video on descriptive statistics. For now, remember that you can use various estimators to obtain an estimate of the population parameters. Now that we have estimates for the population parameters, we can ask the question: how precise are these estimates? They are only based on 40 measurements, and that is not a very large sample. Well, statistics is about making inference under uncertainty. 
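To make the estimators for location and spread a bit more concrete, here is a minimal Python sketch. Note that the ten values are made up for illustration and are not the 40 measured caffeine percentages, and the function names are my own. Also, software packages use slightly different conventions for computing quartiles, so the interquartile range here may differ a little from what another tool reports.

```python
import statistics

# Hypothetical caffeine percentages, invented for illustration only.
# These are NOT the 40 measurements from the coffee example.
sample = [0.062, 0.071, 0.078, 0.080, 0.083, 0.085, 0.088, 0.091, 0.097, 0.105]


def location_estimates(values):
    """Two estimators for the population location mu."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


def spread_estimates(values):
    """Three estimators for the spread of the population."""
    # quantiles(n=4) returns the three cut points Q1, Q2 and Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "stdev": statistics.stdev(values),  # sample standard deviation
        "iqr": q3 - q1,                     # interquartile range
        "range": max(values) - min(values),
    }


if __name__ == "__main__":
    print(location_estimates(sample), spread_estimates(sample))
    # Add one extreme value to see which estimators react the most.
    with_outlier = sample + [0.250]
    print(location_estimates(with_outlier), spread_estimates(with_outlier))
```

If you run this, you should see that the single extreme value pulls the mean, the standard deviation and the range much more than it pulls the median and the interquartile range, which is what robust to outliers means in practice.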
To quantify the uncertainty in your estimates, we can calculate two boundaries, in between which we are fairly confident that the actual population mean will lie. This is called the confidence interval, and the most often selected confidence level is 95%. Minitab computed such a confidence interval for us. We see that we are 95% confident that the actual population location will lie in between 0.0779 and 0.0884. In other words, we are 95 percent confident that the mean caffeine percentage in the population lies between these two values. You can also construct confidence intervals for the median, the standard deviation, or any other statistic that you would like. Of course, if we had a larger sample, the estimates would be more precise and our interval would become narrower. In summary, we use estimates to approximate an unknown population parameter. These estimates will never be exactly equal to the population parameter, because they are based on a sample. Therefore, to quantify the uncertainty around the estimate, we calculate confidence intervals. A confidence interval gives you boundaries in which a population parameter will lie with a certain level of confidence.
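As a rough sketch of how such an interval for the mean can be computed, here is a small Python example. I am assuming the usual t-based interval, that is, the sample mean plus or minus a critical value times the standard error; the video does not state which formula Minitab uses. The data are again made-up values and not the coffee measurements, and the function names are my own. The critical value has to be looked up in a t-table: for 10 values (9 degrees of freedom) at 95% it is about 2.262, and for 40 batches (39 degrees of freedom) it would be about 2.023.

```python
import math
import statistics

# Hypothetical caffeine percentages, invented for illustration only.
# These are NOT the 40 measurements from the coffee example.
sample = [0.062, 0.071, 0.078, 0.080, 0.083, 0.085, 0.088, 0.091, 0.097, 0.105]


def margin_of_error(stdev, n, t_critical):
    """Half-width of the interval: critical value times the standard error."""
    return t_critical * stdev / math.sqrt(n)


def mean_confidence_interval(values, t_critical):
    """Assumed t-based confidence interval for the population mean."""
    mean = statistics.mean(values)
    margin = margin_of_error(statistics.stdev(values), len(values), t_critical)
    return mean - margin, mean + margin


if __name__ == "__main__":
    # 2.262 is the 95% two-sided t critical value for 9 degrees of freedom.
    low, high = mean_confidence_interval(sample, t_critical=2.262)
    print(f"95% interval for the mean: {low:.4f} to {high:.4f}")

    # Effect of sample size, holding the spread and critical value fixed:
    # four times as many measurements gives half the margin of error.
    s = statistics.stdev(sample)
    print(margin_of_error(s, 40, 2.023), margin_of_error(s, 160, 2.023))
```

The last two lines illustrate why a larger sample gives a narrower interval: the margin of error shrinks with the square root of the sample size. In reality the critical value also gets slightly smaller as the sample grows, so the interval narrows a little more than this simplified comparison shows.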