Hi, welcome to this subunit, where we'll talk about how we analyze structured data. Now, we're not going to actually do any analysis right now, but I'll give you some examples of what kind of analysis we can do and what kind of things we can derive from analyzing the data. I'll be going through a few examples and, in the process, introducing some of the concepts. We'll revisit those concepts later when we actually do hands-on exercises with these analyses.
So the most basic thing you can do is very simple: when you have data, you can create what's called descriptive statistics. What it means is, you're basically just describing what you have, the data that you're holding. And to do so, you tend to do things like distribution analysis: a frequency table, or a chart, where we represent how many times something is occurring. It's a very simple thing, just counting.
The other thing that is very common is describing what's called central tendency, so things like mean, which is also called average, or median. You may have heard about this when people talk about things like the median income of a community, of a state, of a country. So what is the middle income for people in this population, the value that half of the people fall below and half fall above? That's what median represents. Mean is average, so what is the average income? Both are descriptions of the data, and they fall under central tendency.
And then there's dispersion, which shows how things are spread out. You could find the average income, but that may not tell you things like how much inequality there is. Just because there is some average, that doesn't mean everybody or most people have that income. There could be huge variation: some people could be making a lot of money, and many people may not be making as much. So the dispersion, or deviation, gives us an idea of how things are distributed. Now of course, in addition to numbers, we can also visualize these things.
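To make those three ideas a little more concrete, here is a minimal sketch in Python. It is not part of the lecture material: the income numbers are made up for illustration, and `describe` is a hypothetical helper name. It only uses Python's standard library to count values, find the mean and median, and measure dispersion.

```python
from collections import Counter
from statistics import mean, median, pstdev

# Hypothetical yearly incomes (in thousands of dollars) for ten people.
# These values are invented for illustration; one person earns far more.
incomes = [30, 32, 35, 35, 38, 40, 42, 45, 50, 400]


def describe(values):
    """Return a few basic descriptive statistics for a list of numbers."""
    return {
        "frequency": Counter(values),  # distribution: how often each value occurs
        "mean": mean(values),          # central tendency: the average
        "median": median(values),      # central tendency: the middle value
        "std_dev": pstdev(values),     # dispersion: spread around the mean
    }


summary = describe(incomes)
print("frequency of 35:", summary["frequency"][35])
print("mean:", summary["mean"])
print("median:", summary["median"])
print("standard deviation:", summary["std_dev"])
```

With this made-up data, the single very high income pulls the mean well above the median, and the standard deviation comes out large. That is the kind of inequality an average alone would hide.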
So here's an example where there is a pie chart, and you may have seen this kind of chart, where Facebook users are represented by age. Different colors here indicate different age ranges, and here's an accumulation of those things. These percentages, of course, add up to 100%. And you can see this green slice: it's 26%, and that's age 26 to 34. So about a quarter of Facebook users are in that 26 to 34 range. So again, this is a description of the data, and to do so we are using this visualization.
Here's another example of descriptive statistics, and in this case we are also using some visualization. This is a bar graph, and we're using it to represent where things fall. I won't go into details, but it's a distribution of the number of retweets and favorites per retweet. This is simply a description of the data, so it's called descriptive statistics. We're not doing anything more than using these charts or these numbers to describe the data that we have.
But of course, we want to do something more than that. Once we have some description of the data, we start to get some idea about how some things could be related or connected. And so we do something called correlation analysis. Here's an example. This is the x axis, and this is the y axis, and we have plotted height and weight. Each dot represents a person, and for that person you can see their corresponding height and weight in pounds. What we can see, and what we would expect to find, is that there is a positive correlation between height and weight. So what does that mean? It means height and weight are connected positively: as the height goes up, the weight goes up. If there were a negative correlation, which of course doesn't really make sense here, then as the height goes up, the weight would go down. So there are two questions. One is, are these two variables related? And if they are related, then which way is the direction of the relation?
Is it positive or negative? A positive relation means that as one goes up, the other also goes up. Here's another example, where we have a negative correlation. Here we have the grade point average, the GPA, on the x axis, and on the y axis we have the hours of video games played. These two variables are actually highly correlated, but in a negative way, which means that as the hours of video games played go up, the GPA goes down. So this is another example of a relationship that is called correlation. Correlation essentially describes whether two variables are connected, or related, and if they are related, which direction the relationship goes, positive or negative.
Now, once we discover that two things or multiple things are related, we can do something more. In addition to just saying, okay, these two are related, we could also figure out how exactly they are related. Because once we learn that, we can use that information to do some prediction. This analysis is called regression. So what does it involve? Here I'm showing this graph, and it doesn't matter what the variables are. You can see that we could draw a line, approximately, that shows the relationship between the x and y axes. This line is called the regression line. What we have done here, and I'm not going to go into details, is essentially try to figure out how this y, which is called the dependent variable, is related to x, which is the independent variable. Say, for instance, this could be height versus weight, so on the x axis we have height and on the y axis we have weight. The question is, well, those two are connected, but exactly how? This equation and this line tell us about that relationship. And again, we're not going into details.
I'm not showing you how we can derive this; we'll look at it later. This is just to introduce the concept that regression allows us to find the exact relationship, or at least an approximate relationship, between two or more variables. And that analysis, that relationship, that model, can then be used to predict one variable using the others. So if we know the actual equation of the relationship between height and weight, what it means is that if we know somebody's height, which is on the x axis, we could predict their weight. That's the regression analysis, and this is just to introduce you to it. Later, we'll see that we can actually use this analysis to do those predictions. So that is just to introduce the concept here, but these are the kinds of things we will actually be doing.
So, just to give a summary of how we analyze structured data, which often tends to be numerical in nature, like height and weight and income and GPA and SAT scores. What we typically do is this: first we have the data, so we create some kind of descriptive statistics. That's simply to describe what this data is, and it often helps us create our path for the analysis. One of the things we often do next is see how different variables are connected, and that's done using correlation analysis. If we find a correlation, then we can do a number of things, and one common thing to do is regression, through which we find the actual relationship between those variables. So correlation just tells us whether things are related, and if they are related, whether they are positively or negatively related. Regression analysis actually gives us a mathematical model, a regression model, that can be used to do predictions. Because once we have that model, that representation, we can use knowledge about one variable to predict the other variable.
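As a rough illustration of that last step, here is a small sketch in plain Python. It is not from the lecture: the heights, weights, video game hours and GPAs are made-up numbers, and `pearson_r` and `fit_line` are hypothetical helper names. It assumes we measure correlation with the Pearson correlation coefficient and fit a straight line by ordinary least squares, which is one common way to do a simple regression.

```python
from statistics import mean

# Hypothetical data, invented for illustration only.
heights = [60, 62, 64, 66, 68, 70, 72]          # inches
weights = [110, 120, 128, 140, 152, 160, 175]   # pounds
game_hours = [0, 5, 10, 15, 20]                 # hours of video games per week
gpas = [3.9, 3.5, 3.1, 2.6, 2.2]                # grade point average


def pearson_r(xs, ys):
    """Pearson correlation: near +1 is positive, near -1 is negative."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fit_line(xs, ys):
    """Least-squares regression line y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


# Correlation: are the variables related, and in which direction?
r_height_weight = pearson_r(heights, weights)   # expected to be positive
r_games_gpa = pearson_r(game_hours, gpas)       # expected to be negative

# Regression: a model we can use to predict weight (y) from height (x).
slope, intercept = fit_line(heights, weights)
predicted_weight = slope * 69 + intercept       # prediction for a 69-inch person

print("height vs weight r:", round(r_height_weight, 3))
print("game hours vs GPA r:", round(r_games_gpa, 3))
print("predicted weight at 69 inches:", round(predicted_weight, 1))
```

The correlation values only say whether and in which direction the variables move together. The slope and intercept are the model, and they are what lets us turn a known height into a predicted weight.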
So we'll come back to some of these things as we get hands-on experience with data from social media. So that's it for this subunit.