Does eating nuts help you live longer? Does Zika virus cause a serious birth defect? Does regular exercise improve memory? Does burning fossil fuels accelerate global warming? We can use regression to answer questions like these. In the last few lessons we focused on building confidence intervals and comparing different populations. That is what we call inferential statistics: using sample information to get some insight about the population. Now we can go one step further.
Think about it. Look at the first question on this slide. From what we have learned so far, I know I can do a study comparing people who eat nuts to those who don't, and then find whether the mean life expectancy of the two groups is different or not. While this is an insight worth having, you may face the next set of questions, like: how much of the increase is due to eating nuts? Maybe one group was healthier than the other to begin with. Or how many nuts does one need to eat to get this benefit? In other words, do you have a way of explaining how consuming nuts impacts longevity, among all the other factors that could be impacting longevity? We can use regression to find the answer, and once we know how nut consumption and longevity are related, we can go ahead and start predicting. For example, we can say that eating 100 grams per day will reduce the risk of dying. By the way, this is a result of a real study.
Regression builds on everything we have learned so far in the class and will allow us to move to prediction based on sample information. The main objectives of regression analysis are, first, prediction: making inferences about possible cause-and-effect relationships and extrapolating them into the future. Second, gaining the ability to address questions of "why so much?" or "why so many?", that is, what causes a difference. And third, taking the new understanding and improving our business.
One caveat here is that we expect the future behavior of the process to be similar to the past. At some time in the future, the relationship that we have discovered may weaken or cease to exist. So watch how well the predictions are turning out and adjust the model when needed.
Regression analysis will generate an equation to describe the statistical relationship between one or more predictors and the response variable, and then we can use the regression equation to predict the outcome of new observations. In the field of statistics, regression is a big topic on its own and there are many models available for regression analysis. The simplest model is simple linear regression, and that is what I will introduce you to in this module.
There are two different variables that make up a simple linear regression model: the dependent variable and the independent variable. The independent variable is also known as the explanatory or predictor variable. This is the variable we are going to use to try to understand how it impacts the dependent variable. In a business example, this might be advertising level or time period. We can control the independent variable; for example, one can decide how much to spend on advertising. Simple linear regression most often starts with a scatter plot, which will allow us to see visually whether a relationship exists between the two variables, and whether or not the relationship has a linear form. For the scatter plot, we place the explanatory variable on the x-axis.
The dependent variable is also known as the response variable. It is the variable that we wish to understand or predict. In a business example, this might be sales or demand. For the scatter plot, we place the response variable on the y-axis.
A regression model is considered a simple linear regression when there is only one dependent variable and one independent variable, and the relationship between these two has a linear form.
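To make the idea of a regression equation concrete, here is a minimal sketch in Python of fitting a straight line and using it to predict a new observation. It assumes ordinary least squares as the fitting method, which is a common choice but is not stated in the lecture. The advertising and sales numbers are invented for illustration, and the names `fit_line` and `predict` are hypothetical helpers, not part of any library.

```python
# Hypothetical data, invented for illustration only:
# advertising spend in $1000s (explanatory, x) and sales in units (response, y).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 15.0, 19.0, 22.0, 27.0]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(advertising, sales)


def predict(x_new):
    """Apply the fitted regression equation to a new value of the explanatory variable."""
    return intercept + slope * x_new


print(f"sales = {intercept:.2f} + {slope:.2f} * advertising")
print(f"predicted sales at advertising = 6: {predict(6.0):.2f}")
```

With these made-up numbers the fitted slope is about 3.7, so each extra $1,000 of advertising goes with roughly 3.7 more units of sales. The explanatory variable goes in as `x` and the response variable as `y`, which matches the x-axis and y-axis placement on the scatter plot.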
You have to be very careful not to confuse the explanatory variable with the response variable. When you step on a weight scale, do you ever wonder whether, according to the weight chart, you have a height problem, and that you should be seven feet tall for the number you see on your scale? Of course not. We understand that height is used to estimate our weight. So height is the predictor and weight is the response variable.
While in this example the roles the variables play may be easy to see, there are many times when you can confuse the two. It is not always easy to detect which variable is the explanatory variable and which is the response variable. Imagine a study that wants to establish a connection between K-12 spending and economic growth. If you get a report that says states that spend more money on K-12 education have higher rates of economic growth than states that spend less, then the study is treating K-12 spending as the explanatory variable and the state's rate of economic growth as the response variable. Meaning, investing in K-12 causes economic growth. On the other hand, I could argue that states that have a good rate of growth have more money and thus spend more on K-12. In this understanding, the economic growth rate is driving how much money is spent on K-12.
What you see here is an example where the direction of cause and effect is not that clear. The two variables are hopelessly tangled, since each can affect the other at different times. In the first case, a struggling state decides to increase investment in K-12 so that its residents are ready to be employed. At some time in the future, when this investment pays off, we see the second case occurring: since the state is doing really well and has money, it will start spending more on K-12. Thus the variables switch roles. Avoid these types of studies. When you pick your variables, you should have reason to believe that the explanatory variable affects the dependent variable and not the other way around.
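The choice of which variable plays which role also matters numerically. The sketch below, in Python, fits weight on height and then height on weight, assuming ordinary least squares and using invented height and weight values. The helper names `slope` and `pearson_r` are hypothetical, written out by hand so the example needs nothing beyond the standard library.

```python
import math

# Hypothetical measurements, invented for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weights_kg = [52.0, 61.0, 64.0, 79.0, 84.0]


def _deviation_sums(x, y):
    """Return (sxy, sxx, syy): sums of cross-deviations and squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy, sxx, syy


def slope(x, y):
    """Least-squares slope when y is the response and x is the explanatory variable."""
    sxy, sxx, _ = _deviation_sums(x, y)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient; it does not depend on which variable is called x."""
    sxy, sxx, syy = _deviation_sums(x, y)
    return sxy / math.sqrt(sxx * syy)


b_weight_on_height = slope(heights_cm, weights_kg)   # height explains weight
b_height_on_weight = slope(weights_kg, heights_cm)   # roles swapped
r = pearson_r(heights_cm, weights_kg)

print(f"slope, weight on height: {b_weight_on_height:.3f} kg per cm")
print(f"slope, height on weight: {b_height_on_weight:.3f} cm per kg")
print(f"correlation (same either way): {r:.3f}")
```

The correlation is the same whichever way you label the variables, but the two fitted slopes are not simply reciprocals of each other; their product equals the squared correlation. So swapping the roles gives you a different regression line, which is why the roles have to be decided from your understanding of the problem and not from the data alone.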
Just because a strong relationship exists between two variables doesn't necessarily mean that they are functionally related. Any two sequences, y and x, in which y steadily increases or decreases as x increases will always show a strong statistical relationship, even though a functional relationship may not exist. Look at this graph, which shows that cheese consumption is highly correlated with the number of people who died by becoming tangled in their bed sheets. No kidding. The data is from National Geographic. So make sure that you pick variables that are functionally related.
So now let's practice. Imagine conducting these studies. In the first study, we are collecting data on the number of surviving fish in a tank and the water temperature in the tank. In the second study, we are recording the year and the bushels of corn harvested in Illinois. And in the third study, we are collecting data on medication dosage and time elapsed for total relief. Identify the response variable for each study.
The response variable in each study is shown in red font. In the first study, the number of surviving fish in the tank is responding to the water temperature in the tank. In the second study, the bushels of corn harvested in Illinois are being tracked year after year, so the harvest is the response variable. And in the third study, time elapsed for total relief is the response variable, and it depends on medication dosage.
We learned about scatter plots early on. The scatter plot is a great visual tool for detecting whether or not the variables we have identified show any special relationship with one another. For instance, here is a scatter plot of income versus educational attainment for all 50 states. Looking at the scatter plot, where each diamond represents a state, we can see that there is a linear relationship between the predictor variable, education level, on the x-axis and the response variable, income, on the y-axis.
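Going back to the warning about sequences that merely move together: the short Python sketch below shows how two series that both drift upward produce a high correlation even when nothing connects them. Both series are invented for illustration and are not the cheese or bed sheet figures from the graph. The `pearson_r` helper is a hypothetical hand-written function, and using the Pearson correlation coefficient to measure the statistical relationship is an assumption of this sketch.

```python
import math

# Two invented series that both happen to rise over the same eight periods.
# They are made up for illustration and have no functional connection.
series_a = [10.0, 11.0, 13.0, 14.0, 16.0, 17.0, 19.0, 20.0]
series_b = [200.0, 230.0, 240.0, 275.0, 290.0, 330.0, 340.0, 370.0]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r_trending = pearson_r(series_a, series_b)
print(f"correlation between the two unrelated trending series: {r_trending:.3f}")
```

The correlation comes out very high simply because both series rise together. A number like this tells you the two series move together; it cannot tell you that one drives the other, so the functional link has to come from your knowledge of the subject.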
In the education and income scatter plot, income goes up as the education level goes up, so we conclude that these two variables are positively correlated. We can even look at the spread of the points and see that they are fairly tightly clustered, which suggests a strong relationship. So it is well worth doing the full analysis to learn in more detail by how much education level impacts income level.
By plotting the two variables, we can detect whether we have identified two variables that don't have any correlation at all, in which case you don't need to pursue building a predictive model. Or we can detect a positive correlation or a negative correlation. These two cases will be suitable for simple linear regression analysis.
So again, let's practice. This scatter plot shows the correlation between per capita income in a state and the percent of the 2012 presidential election votes that went to Romney. What type of correlation do you see here? State median income is negatively correlated with the proportion of the state's votes in the 2012 election that went to Romney.
When you have a scatter plot, the first thing to look for is the direction of the association. A pattern that runs from the upper left to the lower right is said to be negative. A pattern that runs from the lower left to the upper right is called positive.
The second thing to look for in a scatter plot is its form. If there is a straight-line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent straight form. This is called linear form. Sometimes the relationship curves gently while still increasing or decreasing steadily; sometimes it curves sharply up and down. If you see this kind of pattern, then a linear regression model will not be suitable.
The third feature to look for in a scatter plot is the strength of the relationship.
Do the points appear tightly clustered in a single stream, or do the points seem to be so variable and spread out that we can barely discern any trend or pattern? As the points get more and more spread out, the relationship becomes weaker and weaker. A good model will have most of its points closely clustered, indicating a strong relationship.
Finally, always look for the unexpected. An outlier is an unusual observation, standing away from the overall pattern of the scatter plot. Here's the graph that shows income versus percent of votes for Romney. In this graph, this point can be defined as an outlier. Sometimes we remove the outlier from the data before doing the regression, but other times we may want to know why the outlier is there before deciding what to do next.
Scatter plots are useful in visualizing the relationship between the two variables you're studying. Once we have a sense that a relationship exists, we can move to the next step, which involves defining a mathematical equation that describes the relationship between these two variables. This will be called the regression equation.
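Direction, strength, and the effect of an outlier can also be put into numbers. Here is a rough Python sketch that assumes the Pearson correlation coefficient as the measure: its sign gives the direction and its size gives the strength. The data points are invented, the `describe` helper is hypothetical, and the 0.7 and 0.3 cutoffs are arbitrary rules of thumb chosen for this example, not values from the lecture.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def describe(r):
    """Label direction and strength. The 0.7 / 0.3 cutoffs are assumed, not standard."""
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    size = abs(r)
    strength = "strong" if size >= 0.7 else "moderate" if size >= 0.3 else "weak"
    return direction, strength


# Invented points that fall from upper left to lower right.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [16.0, 14.5, 13.0, 11.0, 10.0, 8.5, 6.0, 5.0]
r_clean = pearson_r(x, y)

# Add one invented point that stands far away from the overall pattern.
r_with_outlier = pearson_r(x + [9.0], y + [17.0])

print(f"without outlier: r = {r_clean:.3f} {describe(r_clean)}")
print(f"with outlier:    r = {r_with_outlier:.3f} {describe(r_with_outlier)}")
```

A single point standing away from the pattern pulls the measured strength down noticeably, which is one reason to investigate an outlier before deciding whether to keep it. Keep in mind that this coefficient only measures straight-line association, so the form of the relationship still has to be checked on the scatter plot itself.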