One of the key steps in data analysis is understanding the data by examining how it is distributed, or how it's spread out. Now, there are two primary ways of looking at the distribution of data: one is by its frequency and the other is by its probability. Both of these approaches are fundamental tools that we use in statistical data analysis.
If we're looking at a frequency distribution, we're considering how frequently a particular outcome occurs for a random variable. Now, that variable could be either a discrete variable or a continuous variable. If we're dealing with a discrete variable, this will usually be used to represent categories of information. For example, on the slide here, you can see that we have a chart showing fruit frequency. We've got several different categories of fruit here, and these categories could be represented by a discrete variable in our data set, where one value represents apples, another bananas, and so on. On the left, we have the quantity of each type of fruit that has been sold, so we're recording the actual count for each type of fruit. So here you can see that we've sold about 90 apples, and here we've sold maybe about 19 pears.
The type of chart that is good for visualizing categorical variables like that is called a bar chart, and the bar chart will have one bar for each category. We don't want to have too many categories or it can become cluttered and very difficult to read. One thing you'll note is that each category is distinct from the others, so there's a definite gap between each of the bars. This makes it easier to read and shows that each one of these is an independent measurement of the absolute quantity for that category, rather than a direct comparison to the other values. Now, in some cases, we may not want to plot a discrete variable like a category, but instead, we may want to plot a continuous variable as shown here.
Now, in this case, because we have a continuous variable, we have a challenge. While you could maybe plot individual data points and form a curve, a continuous variable can take any value in its range, so we can't have a bar for each of those values. Instead, what is normally done here is to bin the range of the values, so we've created bins that each hold any value within an interval. A value that falls between 40 and 45 will land in the first bin. A value that falls between 65 and 70 will fall in this middle bin, and so on. This type of chart, because the variable is continuous, is called a histogram. The histogram shows how frequently this continuous variable's values occur. So how many people are between 65 and 70 inches in height? Well, in our sample set, there were, I don't know, somewhere around 95 people that were between 65 and 70 inches in height. How many people were between 85 and 90 inches in height? Well, that's less common, and so you can see there were only 40 people in our sample set that were of that height, and, of course, very few people that were between 95 and 100 inches, maybe only about eight people, something like that.
A histogram is a good way of representing the frequency of a continuous variable, but remember that what we're measuring there is just the actual count of values that fall in each bin within our sample data set. This is useful to see how much of something you've got going on in a sample. But if we want to try to extrapolate this to a larger data set, then it may be more useful to use a probability distribution.
If we go back to our example with the fruit, the fruit probability distribution can also use a bar chart. Why? Well, because again, it's categorical, isn't it? When we do this, you'll notice that the axis values are different from what we had previously. Why? Well, because now we're not mapping the actual number of each fruit that existed in the sample set, but instead, we're recording the probability of each one of those fruits within the sample set. Here you can see that we've got a value of 0.3, perhaps. Here we've got a value that's probably 0.25, and here we have a probability that maybe is 0.17. This probability for oranges might be 0.2, and our pears might be 0.08, we'll say. Now, these values are different from the values we saw before, which simply represented how many apples, how many bananas. Now we're looking at the percentage, and the way that these are generated is, of course, by determining the proportion of apples in the sample set: the number of apples divided by the total number of fruits. Because of that, if we add up 0.3 plus 0.25 plus 0.17 plus 0.2 plus 0.08, that is always going to equal one. In other words, these probabilities will always add up to 100 percent. If we have a 30 percent probability of an apple in our sample set, then we could extrapolate that to the larger collection and say that probably about 30 percent of the fruits in that larger collection will be apples. It's a lot more likely that we're going to sell an apple than a pear. That's borne out by these probability distribution numbers.
Of course, we can apply this also to our height problem, where we were measuring the height of people. If we look at the height probability here, you'll see now that instead of recording the actual number of people of a particular height, we're recording the percentage of people within the sample set of that height. Whereas we used to have about, I think it was 95 people that were between 65 and 70 inches in height, the probability of that perhaps is 0.18, or 18 percent. Over here, we might have a four percent probability of this height. For 80 to 85, we have an eight percent probability of that particular height. We can now take those values, extrapolate them to a larger data set, and make predictions.
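Going back to the fruit example, here is a minimal sketch of how the probabilities come out of the raw counts. The counts below are made up for illustration so that they reproduce the proportions mentioned above; they are not the slide's actual data, and "grapes" is a placeholder name for the category that isn't named.

```python
from collections import Counter

# Hypothetical sales counts (assumption, chosen so the proportions match
# the values quoted in the lecture; "grapes" is a placeholder category).
fruit_counts = Counter(
    {"apples": 90, "bananas": 75, "grapes": 51, "oranges": 60, "pears": 24}
)


def to_probabilities(counts):
    """Convert raw frequencies into proportions of the total."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


fruit_probs = to_probabilities(fruit_counts)

for fruit, p in fruit_probs.items():
    print(f"{fruit:8s} {p:.2f}")

# The proportions of all categories together always add up to one.
print("sum:", round(sum(fruit_probs.values()), 10))
```

The same bar chart can be drawn from either `fruit_counts` or `fruit_probs`; only the scale of the vertical axis changes.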
You can see that both the frequency and the probability distributions are good ways of looking at our data and how it is distributed. Remember that we'll normally use bar charts for categorical data and histograms for continuous data, but either type of chart can be used to show either frequencies or probabilities.
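To make the histogram side concrete, the binning described for the height example might be sketched like this. The heights are randomly generated stand-ins (an assumption, not the lecture's sample), and `bin_counts` is a hypothetical helper written for this illustration; the 5-inch bins from 40 to 100 follow the ranges mentioned above.

```python
import random


def bin_counts(values, start, stop, width):
    """Count how many values fall in each half-open bin [lo, lo + width)."""
    n_bins = (stop - start) // width
    counts = [0] * n_bins
    for v in values:
        if start <= v < stop:  # values outside the range are ignored
            counts[int((v - start) // width)] += 1
    return counts


random.seed(0)
# Hypothetical sample of heights in inches (assumption for illustration).
heights = [random.gauss(67, 4) for _ in range(500)]

start, stop, width = 40, 100, 5
counts = bin_counts(heights, start, stop, width)

# Frequency -> probability: divide each bin count by the number of binned values.
in_range = sum(counts)
proportions = [c / in_range for c in counts]

for i, (c, p) in enumerate(zip(counts, proportions)):
    lo = start + i * width
    print(f"{lo}-{lo + width}: {c:4d}  {p:.3f}")
```

The `counts` list is what a frequency histogram would plot, and `proportions` is what the probability version would plot; the bins themselves stay the same.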