A histogram is a bar chart which shows the frequency of a given phenomena. A great example are probability distributions. For instance, in the first course in this specialization, we touched on the difference between the random, uniform, normal, and chi squared distributions. Probability function can be visualized as a curve, where the y-axis holds the probability a given value would occur, and the x-axis is the value itself. This is called a probability density function. The y-axis values are limited to between zero and one, where zero means there's no chance of a given value occurring and one means that the value will always occur. The x-axis values are labeled in terms of the distribution function. In the case of the normal distribution, this is usually in terms of standard deviations. So a histogram is just a bar chart where the x-axis is a given observation and the y-axis is the frequency with which that observation occurs. So we should be able to plot a given probability distribution by sampling from it. Now, recall that sampling means that we just pick a number out of the distribution, like rolling a die or pulling a single card out of a deck. As we do this over and over again, we get a more accurate description of the distribution. Let's pull some samples from the normal distribution and plot four different histograms as subplots. First I'll create our 2 x 2 grade of axis objects. In this case, we don't want to share the y-axis between the plots since we're intentionally looking at a number of different sizes of samples. We're mostly interested in how uniform the distribution looks. Then we can iterate through a list of four different values, 10, 100, 1,000 and 10,000. And we'll pull samples from the NumPy. Remember that the normal function of random just creates a list of numbers based on the underlying normal distribution. We can then plot these to a given axis object using the hist function. And set the title as appropriate. Well, there we go. The first plot only has ten samples, so it looks pretty jagged. And in my version here I don't think anyone would say this is obviously a normal distribution. When we jump to 100 samples, it gets better, but still quite jagged. Then it seems to smooth out a bit on the plots for 1,000 and 10,000 samples. But if we look closely, we can see that the bar of the 10,000 plots are actually wider than those of the 10 or the 100 plot. What's going on here? By default, the histogram in Matplotlib uses ten bins, that is ten different bars. Here we created a shared x-axis, and as we sample more from the distribution, we're more likely to get outlier values further from our mean. Thus, ten bins for n=10 is at best capturing ten unique values, while for n=10,000, many values have to be combined into a single bin. Let's do the same function with the bin set to 100 Now we see that the 10,000 sampled plot looks much smoother than all of the others. And the 10 sample plot shows that each sample is basically in its own bin. So I think this brings up an important question of how many bins you should plot when using a histogram. I'm afraid that the answer isn't really clear. Both of these plots are true, one is a visual of the data at a coarse granularity, and one at a more fine grain granularity. When we look at the finest granularity in our data, plotting with 10,000 bins then the histograms became basically useless for decision making. Since they aren't showing trends between samples as much as they're just showing the sample size themselves. This is similar to using aggregate statistics like the mean in standard deviation to describe a sample of a population. These values are coarse and whether they are appropriate depends highly on your questions and interests. I've linked here to an optional reading which considers the issue of how many bins are appropriate from a couple of different statistical stances. But keep in mind this understanding that there's just no right or wrong, just useful and useless for a given need. I want to use our newfound knowledge of histograms and subplots to introduce you to a more flexible GridSpec layout for subplots. The GridSpec allows you to map axes over multiple cells in a grid. For instance, let's create a scatter plot where the y values come from the normal distribution, and the x values come from a random distribution. It's not totally clear from looking at this plot what the actual distributions are for each axis, but we could add two smaller plots, each histograms, to make this a bit more clear. I'm going to define a 3x3 grid, nine cells in total. I want the first histogram to take up the top right space, and the second histogram to take up the far left bottom two spaces, rotated on its side. The original scatter plot can take up a two by two square in the bottom right. To use the GridSpec, we first import it, then create a new GridSpec, the overall shape that we want. When we add new items with the subplot, instead of specifying the three numbers of row, column and position, we pass in the elements of the GridSpec object which we wish to cover. And very important here. Because we are using the elements of a list, all of the indexing starts at zero, and is very reasonable to use slicing for the beginning or ends of lists. The GridSpec is indexed as rows and columns using the indexing operator, or square brackets. So we'll create the first subplot and histogram in row 0, covering off the first element, the middle position, and going to the end of the row. We'll create the side_histogram starting in row 1, and continuing to the end of row elements, but limited to the 0 column. Finally we'll create the scatter plot in the lower_right. When we execute this, we see our basic set of plots, great. Now, let's fill this with some data. First the scatter plot, then the top histogram, which is for the x values, then the side histogram for the y values. And we want the side histogram to be lined up, so let's rotate the layout by setting the orientation. There are actually a few ways of doing this, but since it's so common, Matplotlib has an orientation parameter which we can set to horizontal. Okay, this is close. We don't really care about the value of the y-axes for the top histogram or the value of the x-axes for the bottom histogram. Since these are probability density histograms, we care only about the relative values. One handy function of an axes object is that you can clear it. So we don't have to create a new figure and do all the GridSpec work all over again. So let's just call the clear function on the two histograms. Then we can call the hist function with the normed=True argument. Which tells Matplotlib to scale the frequency data in the histogram between 0 and 1. Also we can flip the axes of the left-hand side histogram using the invert_xaxis function. This gives us a rough solution. We can see that the axes values are not aligned. Unfortunately, Matplotlib requires that you share axes when creating plots, and we can't do that post hoc sharing. We can however, set the range of the axis to clear things up a little bit more. And we do this using the axes set_xlim and set_ylim values. I'll just hard code some of these in here to demonstrate. And there we go. We can see that our top histogram is random, pretty uniform, and that the histogram for the y values indeed looks like the normal distribution. In fact, this image looks sort of crudely like the image we looked at in the first module of this course, from the MIT and Harvard study on MOOC engagement. And it should be pretty clear to you now how we might adjust various aspects of this plot so that if we wanted to look even more like that engagement image, we could. In the next lecture, we're going to take a look at another kind of plot, the box plot.