Hi, I'm Senthil Veeraraghavan. It's a great pleasure to welcome you to week three of Modeling Risk and Realities. We often look at historical data to model future realities. In such cases, we often use distributions of random variables to model the future. This week, we will look at some approaches to understanding how to choose distributions to model the future. How well does a distribution fit our data? Can we use it for our model? The answers to these questions depend on a mixture of theory and practice. In session one of this week, we will think about examining data and its graphical interpretations. Let's take a look.

>> Hi, welcome to Modeling Risk and Realities. This is week three of our course. I'm Senthil Veeraraghavan, an Associate Professor in the Operations, Information, and Decisions Department at the Wharton School. In this week, week three, we're going to look at choosing distributions that fit your data. As a first step, we're going to look at data and its visualization, then at choosing among families of distributions, which could be discrete or continuous. And then, how well does a certain distribution fit? We'll do hypothesis testing and goodness-of-fit analysis in the last session. So in Session 1 we'll do visualization, in Session 2 we'll look at distributions, and in Session 3 we're going to look at goodness of fit. In Session 1, we're looking at data and visualization: we'll look at graphical representations of data, and then at how to fit distributions to your data. Fitting distributions is important because many future events in the world are inherently uncertain. You can ask many questions like: at what value will a firm be acquired? How many units of a new smartphone will be sold when it's introduced? How much dividend will a firm pay its shareholders in the next quarter? Or we could ask, will a firm meet its growth target next year? If not, by how much will it fall short?
As long as the uncertainty in the world is stationary, we can use past data to build models of future realities. What do I mean by stationary? The future looks fundamentally not very different from the past. So how do we model uncertainty? Real-world uncertainty can be modeled using random variables. Some real-world uncertainties may include profit values from an investment, percentage growth in revenues, customer orders for a new product, or service time, for example, the time to process a loan application. A random variable is a numerical description of such outcomes. Typically, we use a capital letter such as X or Y to denote a random variable. Random variables can be discrete or continuous, but all random variables can be characterized by their probability distribution. So, what's a discrete probability distribution? What's a continuous probability distribution? You'll see soon. How do we model uncertainty? Let's continue. Before we model random variables using distributions, let's ask the following question: how important is this task, actually? How important is it to use the right distribution? I'll make the case for choosing the right distribution by using two example data sets. These are very simple and straightforward data sets, but quite powerful in illustrating the idea. You can find these datasets; they're called Dataset1_Template and Dataset2_Template, and they're both available on the course website. In the Excel video, we will generate histograms and descriptive parameters for these two Excel files and save them as Dataset1_histogram and Dataset2_histogram. One good way to understand and describe uncertainty is to visualize it. Generating a histogram is a good method to visualize data. A histogram is a graphical representation of the distribution of numerical data.
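As a quick illustration of characterizing a random variable by its probability distribution, here is a minimal Python sketch (the course itself works in Excel; the outcomes and probabilities below are made-up numbers for a hypothetical discrete random variable, daily orders for a new product):

```python
import random

# Hypothetical discrete random variable X: daily customer orders
# for a new product, characterized by its probability distribution.
outcomes = [0, 1, 2, 3, 4]                 # possible order counts
probabilities = [0.1, 0.2, 0.4, 0.2, 0.1]  # must sum to 1

random.seed(42)  # reproducible draws for illustration
draws = random.choices(outcomes, weights=probabilities, k=1000)

# The long-run average of the draws approaches the mean of X:
# E[X] = sum of x * p(x) = 2.0 for this made-up distribution.
expected = sum(x * p for x, p in zip(outcomes, probabilities))
```

Each draw is one realization of the random variable; the histogram of many draws approximates the distribution, which is exactly what we do in reverse when we fit a distribution to observed data.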
For instance, a histogram helps us to understand the shape and the scale of the distribution, and thus to understand the random variable. As we go forward, we will examine families of distributions and their parameters. Now, let's look at the Dataset1_Template file. In the template file, I have a data set with 250 data points that came from some distribution, and let's try to understand how to fit a distribution to it. Our first step is to visualize the data, and we're going to generate a histogram. As part of the first step, I will calculate the minimum, maximum, mean, and standard deviation. These are easy to calculate. The minimum is the MIN function applied to the entire column. To fix a reference, all you have to do is select it and press F4, which introduces a dollar sign in front of the column letter and in front of the row number; that fixes the reference. Similarly, you can calculate the max: MAX gives the maximum value in this data set. The AVERAGE function gives you the sample mean from the data set, and the standard deviation is given by the STDEV function. So you have min, max, mean, and standard deviation, and these few numbers summarize the data. However, there is an easier way to calculate the mean, standard deviation, and other descriptive statistics. We do that by clicking on the Data tab and looking for the Data Analysis link. This tool may not be available right away. If you don't see it, go to File > Options and look at Add-ins; if the Analysis ToolPak is inactive, you can bring it back by adding it in. Once you add it in and go to the Data tab, you should be able to see the Data Analysis link. So let's click on the Data Analysis link; it gives me a bunch of options, and I'm going to pick Descriptive Statistics.
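The four Excel formulas above (MIN, MAX, AVERAGE, STDEV) have direct counterparts in Python's standard library. Here is a minimal sketch on a small made-up sample rather than the course dataset; note that Excel's STDEV is the sample standard deviation (n-1 in the denominator), which matches `statistics.stdev`:

```python
import statistics

def describe(data):
    """Summary statistics matching the Excel formulas:
    MIN, MAX, AVERAGE, and STDEV (sample standard deviation)."""
    return {
        "min": min(data),
        "max": max(data),
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),  # n-1 denominator, like Excel's STDEV
    }

# Toy sample for illustration (not the 250-point course dataset)
sample = [12.0, 35.5, 47.0, 63.2, 88.9]
summary = describe(sample)
```

If you wanted the population standard deviation instead (Excel's STDEVP), you would use `statistics.pstdev`.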
I say OK, and I want statistics for the input range right here, going up to row 251. Picking up the input range, I also click on Summary Statistics and the confidence interval, and I want the output range at this location in the same file. So I say OK, and you'll see the Mean, Standard Deviation, Minimum, and Maximum are all provided here. This is a quick and easy way to calculate what I just calculated using formulas. Once we have the statistics, we can start visualizing our dataset. One way to visualize it, as I said earlier, is to look at the histogram. For the histogram, we just have to do an initial setup. Here's the initial setup: I'm going to divide this data, which goes from 0.09 (almost 0) to 100, into bins, and I'm going to choose bins of width 10. You can pick the size of the bins; you don't want them to be too narrow and you don't want them to be too wide, and you can use trial and error to arrive at the bins you want for your visualization. So, you have the bins here. Now, let's click on the Data Analysis link, go down, and you'll see Histogram; click on Histogram > OK. The input range is your data, and the bin range is the one you just created. You want the output to be presented at some location of your choosing. I accidentally clicked on the input range, so let me reset it quickly: the input range should be those 250 data points, the bin range should be the ten bins we created, and the output range can be here. I also want to click on Cumulative Percentage and Chart Output, so let's do that and say OK. It gives me the frequency histogram and the cumulative chart. The blue columns form the frequency chart: they show how many data points are in each bin, between 0 and 10, between 10 and 20, and so on.
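The Histogram tool's counting can be mimicked in plain Python. This is a sketch under the assumption (matching Excel's Histogram behavior) that each value is counted in the first bin whose upper edge it does not exceed, with the cumulative percentage as a running total; the data below is a toy sample, not the course dataset:

```python
def histogram(data, bin_edges):
    """Count observations per bin, like Excel's Histogram tool.
    bin_edges are upper bounds: a value lands in the first bin
    whose edge it does not exceed."""
    counts = [0] * len(bin_edges)
    for x in data:
        for i, edge in enumerate(bin_edges):
            if x <= edge:
                counts[i] += 1
                break
    return counts

def cumulative_percent(counts):
    """Running total of counts as a percentage: the red line in the chart."""
    total = sum(counts)
    running, out = 0, []
    for c in counts:
        running += c
        out.append(100.0 * running / total)
    return out

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # bins of width 10
data = [5, 12, 18, 33, 47, 55, 61, 78, 84, 95]    # toy data for illustration
freq = histogram(data, bins)
cum = cumulative_percent(freq)
```

The `freq` list corresponds to the blue frequency columns, and `cum` to the red cumulative-percentage line, which always ends at 100%.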
The red curve, which we will define very soon in the next session, is the cumulative distribution from the data. It tells you how many data points are below 10, below 20, below 30, and so on. So we have the entire data visualized as a histogram, and you can see, just by inspection, that the probability of being in the bin at 20 is very similar to the probability of being in the bin at 40, or at 80, or in the bin from 90 to 100, and so on. That tells you a little bit about how the random variable that is generating this data behaves. Let's save this file as Dataset1_Histogram. Using the Excel sheets and the histogram analysis, we can now look at histograms from the two datasets. Here is Dataset1 with 250 data points. The blue chart shows the frequency of the observations at low and high values; the red line shows the cumulative distribution. What's a cumulative distribution? We will formally define it in the next session. For now, let's look at the blue chart, which is the frequency chart. Looking at it, you can see the low and high outcomes are more or less comparable: the number of outcomes at 20 is comparable to the number of outcomes at, let's say, 80. Next, we open the Dataset2_Template file, and we again have a history of past outcomes, 250 data points. They are slightly different data points, and you will see why in a bit. Again, we first want to derive some of the statistics for this dataset, and we can do this very quickly, as we saw for dataset one. Click on the Data tab > Data Analysis; make sure you add it in if you don't have it. You can add it in very quickly by going to File > Options, looking at Add-ins, and checking whether the Analysis ToolPak is added in or not. Once you add it in, it should appear as Data Analysis right here under the Data tab. Click on Data > Data Analysis > Descriptive Statistics > OK, and pick the input range, which is your data.
You want an output range in the same worksheet, for example, and that could be... sorry, you want to make sure you're selecting the correct input range. Then you go to the output range and make sure you have an output range close by. You want to click some of these statistics, and maybe the confidence level and so on, and say OK, and that will give you the summary statistics. You have the mean, the standard deviation, the standard error, everything, and if it's appealing to you, we can reduce the number of decimals to two, which makes reading the numbers a bit easier. So the standard deviation is 15.7, the mean is 47, the minimum is 3, and the maximum is 95. Let's visualize how this distribution looks. Again, to visualize this distribution, you go to Data > Data Analysis > Histogram and select Histogram. Before we do that, we need to define our bins. To define your bins, you need to think about how your data ranges from the minimum to the maximum. We see the minimum is 3, pretty close to 0, and the maximum is 95, close to 100. To keep things simple, I pick bins of width ten, so I get zero, ten, all the way up to 100. Once I've selected the bins, I can generate a histogram. I click on Data Analysis > Histogram > OK, and the input range is my dataset. The bin range is the bins we selected. The selection of bins can be done by trial and error; you don't want them too narrow, and you don't want them too broad. For the output range, you want it to be close by here, and again we want Cumulative Percentage and the chart output to come out too. Say OK, and we have the chart output. Let me make it bigger for visual appeal. We can see it's quite different from the earlier data that you saw: dataset one was pretty flat, while here the shape is like a bell curve. We will look at the same figure in our slides, and we understand these two data sets look very different.
So the underlying random variables generating these data might be behaving differently, and we need different distributions. Let's do that in the next session, but for now let's save this file as Dataset2_Histogram. Unlike in histogram one, you will see here there's quite a bit of difference between low, medium, and high outcomes. It looks quite different. We have the same 250 data points, but you can see the low occurrences and very high occurrences are much less frequent than the outcomes in the middle. In fact, the histogram is shaped like a bell curve. So, fundamentally speaking, we have 250 data points generated from different uncertainties that look quite different. The idea of choosing a distribution is to find the distribution that best explains the dataset. The datasets we saw are different; how can we model them differently? Now, recall that real-world uncertainty can be modeled using random variables, denoted by capital letters such as X and Y. We can characterize these random variables by their probability distributions. Using the visualizations that we saw in the last two slides, we can easily argue, and I hope strongly argue, that these two data sets have two different distributions. As we saw, random variables can be quite different in their shapes, as seen in the data sets. In fact, random variables can be discrete or continuous. Discrete random variables are countable. Think of the number of M&M candies in a box: we don't know how many there are, but we can count them, and since we do not know the exact number in a box, that's a random variable; it can change from box to box. The number of shares of a firm sold in a day is a discrete random variable. The number of investors who attended a meeting, the number of cars that went by a toll booth in an hour: all of these are countable and, therefore, discrete random variables. On the other hand, random variables can be continuous.
For example, the exact rainfall that falls on a particular region during the monsoon season, the heights of men and women in a region, the speeds of vehicles passing by on a road, or the length of fabric that's spun from a ton of cotton: all of these can take any value and, therefore, can be continuous random variables. A random variable is typically described by parameters. Some descriptive parameters are as follows. The most common descriptive parameter is the mean. The mean represents the average value of a random variable over a number of repetitions. Standard deviation is another parameter that's often used to describe a random variable. It measures the spread of outcomes: how far the random variable could be from its average. Variance is another measure of spread; variance is nothing but the square of the standard deviation. Other common descriptors are the median and the mode. The median gives the midpoint of the distribution: the value of the random variable that separates the upper half from the lower half. It's the 50th percentile. What's the mode? The mode is the most likely value of a random variable, the most likely outcome that can happen. All of these descriptors are useful in describing a random variable. In the next session, we will look at families of distributions that are often used to model realities, and try to parameterize them. In session one of week three of our course, we examined how to look at data and examine graphical interpretations of data. This gives us a good springboard to go further and explore several families of distributions used to model realities.
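All of the descriptors just listed (mean, standard deviation, variance, median, mode) can be computed with Python's `statistics` module. Here is a small sketch on a made-up sample, not the course data, which also confirms that variance is the square of the standard deviation:

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 8, 8, 10]  # toy sample for illustration

mean = statistics.mean(data)        # average value
stdev = statistics.stdev(data)      # sample spread around the mean
var = statistics.variance(data)     # the square of the standard deviation
median = statistics.median(data)    # 50th percentile: splits upper and lower halves
mode = statistics.mode(data)        # the most frequent (most likely) value
```

For this sample the median (7) and mode (8) differ from the mean (6), a quick reminder that these three descriptors coincide only for symmetric, unimodal distributions.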