This week, we're going to think about visualizing data in sports. Looking at data by drawing graphs, charts, and plots is a really useful thing to do, and anybody who studies data will tell you that it's the first thing you should think about doing with any data set you're examining. Look at it first to see if you can spot any particular patterns in the data. We're going to look at the data for three different sports this week: basketball, baseball, and, to begin with today, cricket.
In terms of visualizing data, the first thing to think about is whether you can spot correlations in the data, whether you can see relationships between particular variables. That's something that plots are really useful for. The second thing is to look at data ranges, and particularly to see where the data fits in the range of possible values. So think about the range that the data can take and what kinds of values are possible. And the third thing, which relates back to the second, is that you want to look at outliers. You want to see whether particular values stand out, and then try to understand why they might differ from the ordinary pattern in the data.
So we're going to look at three different examples. There are endless ways to graph and plot data, so we're really only going to look at a limited subset. But hopefully, by looking at this subset, you will get an idea of the sorts of things you can do with sports data in Python, and then go on to develop your own graphs and plots to visualize the data in ways that you find interesting and useful.
Okay, so let's start looking at cricket. If you're not familiar with the game of cricket, then you need to know some basics about the rules. It's a bat and ball game.
The idea is to score the most runs possible. You have a number of wickets, which are like outs in baseball. Think of it as each team having ten lives that it can lose, and when you've lost all your lives, that's it, your innings is over. We're going to look at data for the Indian Premier League, which we looked at in Week 1, so we're going to revisit that data. The format of this competition is something called Twenty20, which is a relatively recent invention. In Twenty20 cricket, each side gets up to 120 balls, or pitches if you are familiar with baseball terminology, to score the most runs possible. One team bats first and so, in essence, sets a target for the other team. The other team then tries to reach that target as quickly as they can. So the whole issue in the game is: can the team batting second reach the target? And how do they progress with each ball in order to get closer and closer to the target that's been set for them?
Okay, so if you're not familiar with cricket, you should read the explanation, and hopefully that will give you some idea of what's going on. I'm not going to explain it in detail now myself. What we're going to do is start by looking at the performance of the teams in terms of the run totals that they score, and make some comparisons of run totals between the teams. Then we're going to graph the run chase itself and compare how each team in a game performs according to the runs scored and the wickets lost during the innings. That way we can make comparisons and see how the game evolves just by graphing it.
Let's begin by looking at the run totals for teams in the games in the Indian Premier League. We're going to look at the 2018 season. And as you might recall from Week 1, there were 60 games in total played in that season.
So our first step, as ever, is to import the packages we need to work with the data. And here is the data that we are going to use. It lists each of the games played and includes the results of each of those games. If you want to see what is in the data, we can just print out the column names, and here you can see the variables that are included. I won't spend a lot of time looking at each of those for the time being. Let's just look at the runs scored in each innings for each team.
We're going to do that by creating a histogram. We're going to look at innings 1, so that's the first innings in each game, and how many runs were scored. We create the histogram using .hist, and we set the number of bins on the histogram to 10. If we run that, you can see it looks something like a bell curve. In some ways, you've got a peak somewhere around a total of 175 runs, and then you have variation on both sides, with some teams scoring fewer and some teams scoring more runs in the innings.
So now let's look at the histogram for the second innings. This is the team chasing the total set in the first innings of the game. Now the first thing to say is that if we compare these two histograms, looking at them at the same time, we might be tempted to think that there's something strange going on, in the sense that the peak for innings 2 seems to be to the right of the peak for innings 1. That should already make you stop and wonder whether there's a problem here. Because, clearly, in innings 2 you only have to score one run more than the other team scored in innings 1 in order to win. So it's not obvious why you would see a total that would be much higher.
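As a minimal sketch of the histogram step, assuming the data sits in a pandas DataFrame: the column names `innings1_runs` and `innings2_runs` are hypothetical stand-ins for whatever the real file uses, and the totals are made up for illustration, not actual IPL results.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up totals for illustration only; these are NOT real IPL results.
# The column names are hypothetical stand-ins for the ones in the data file.
games = pd.DataFrame({
    "innings1_runs": [165, 176, 202, 118, 153, 187, 174, 143, 219, 168],
    "innings2_runs": [166, 151, 205, 119, 140, 188, 160, 144, 188, 169],
})

# Series.hist draws a histogram; bins=10 splits the observed range
# of totals into 10 equal-width bins.
ax = games["innings1_runs"].hist(bins=10)
ax.set_xlabel("Runs scored in innings 1")
ax.set_ylabel("Number of games")
```

Because no axis limits are given, matplotlib picks the x-axis range from the data, which is why two separately drawn histograms can end up on different scales.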
But if we stop for a moment to look at the two charts, we can see that the horizontal x-axis actually has two different scales. On the top chart, for innings 1, the scale runs from 80 to 240 runs, whereas on the bottom chart it runs from 60 to 220. So in some sense, the second chart has been shifted to the right, and that's giving a misleading impression of the data. So again, it's always important to check the axes and ranges of the data.
So why don't we write a command which requires each graph to have the same range? We can do that and plot them alongside each other. We specify the range on each axis with the commands plt.xlim and plt.ylim, and we set those values to be the same for the two charts. If we run that now, you can see the two charts, one underneath the other. And now you can see that, in fact, the highest point is really the same for the two innings. So the most likely thing is that the two scores are going to be close together at the end.
What you can also see is that in innings 2 there's more weight to the left, which means scores lower than the mean score in the innings, and less to the right. That reflects the nature of a run chase. Teams batting second either fail to reach the total, in which case they're going to be to the left of the peak, or they succeed. But if they succeed, they only exceed the innings 1 total by a small amount, so there's less weight to the right. And, of course, when the score in innings 1 is very high and to the right of the chart, teams in innings 2 almost never succeed in reaching that total. So there's very little weight towards the far right-hand end of the chart for innings 2.
Okay, so we can see there the comparison between the two histograms. In fact, we can also draw this in a single chart, superimposing the two distributions on top of each other. And that's what we do in the next chart.
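A minimal sketch of forcing both histograms onto the same scale with `plt.xlim` and `plt.ylim`. The run totals and the chosen limits here are made up for illustration and are not the values used in the lecture.

```python
import matplotlib.pyplot as plt

# Made-up totals for illustration only; these are NOT real IPL results.
innings1 = [165, 176, 202, 118, 153, 187, 174, 143, 219, 168]
innings2 = [166, 151, 205, 119, 140, 188, 160, 144, 188, 169]

# Hypothetical limits, chosen wide enough to cover both innings.
x_range = (50, 250)
y_range = (0, 5)

fig = plt.figure(figsize=(6, 6))

# Top chart: innings 1.
plt.subplot(2, 1, 1)
plt.hist(innings1, bins=10)
plt.xlim(*x_range)  # same horizontal scale on both charts
plt.ylim(*y_range)  # same vertical scale on both charts
plt.title("Innings 1")

# Bottom chart: innings 2, with identical limits.
plt.subplot(2, 1, 2)
plt.hist(innings2, bins=10)
plt.xlim(*x_range)
plt.ylim(*y_range)
plt.title("Innings 2")

plt.tight_layout()
```

With the limits fixed, a bar at the same horizontal position means the same run total in both charts. Passing `range=x_range` to `plt.hist` as well would also make the bin edges line up between the two.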
And again, you can see how we can write both innings into the same chart using plot.hist. When we look at that, we can see the two distributions together. What you can see is that, in reality, the innings 2 score is slightly shifted to the left of the innings 1 score. And that reflects the nature of the run chase. If you like, there's a truncation here in the distribution of the innings 2 scores, because teams stop once they pass the innings 1 total.
Okay, so we can also look at the distribution of runs according to which team won the game and which team lost the game. First we were looking at the order of the innings. But now let's define a variable which refers to whether the score of the team in the innings was a winning score or a losing score. Clearly, that's just defining a variable where, if the score in the innings is higher than the other team's, then it's a winning score, and if it's lower, then it's a losing score. So we define those two variables here, and now we plot them.
And again, not surprisingly at all, the losing score is really a leftward shift of the winning score, since within any one game the winning score is always the higher of the two. Note that, of course, you can win a game with a low score if the opposing team has an even lower score. So there are observations on the left-hand side of the chart for the winning score, but a low total is clearly more likely to be a losing score than a winning score. Likewise, you can get a very high score and the other team can still beat you, but that's less likely. Generally, when the score is high, it tends to be more likely to be a winning score than a losing score.
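One way to sketch the winning and losing score variables, not necessarily how the lecture's notebook defines them: take the higher of the two innings totals in each game as the winning score and the lower as the losing score. The column names are hypothetical, the totals are made up for illustration, and tied games are ignored.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up totals for illustration only; these are NOT real IPL results.
# The column names are hypothetical stand-ins for the ones in the data file.
games = pd.DataFrame({
    "innings1_runs": [165, 176, 202, 118, 153, 187, 174, 143, 219, 168],
    "innings2_runs": [166, 151, 205, 119, 140, 188, 160, 144, 188, 169],
})

totals = games[["innings1_runs", "innings2_runs"]]

# Assumption: the higher total in a game is the winning score and the
# lower total is the losing score (ties are not handled in this sketch).
games["winning_score"] = totals.max(axis=1)
games["losing_score"] = totals.min(axis=1)

# DataFrame.plot.hist overlays one histogram per column on shared bins;
# alpha makes the bars translucent so both distributions stay visible.
ax = games[["winning_score", "losing_score"]].plot.hist(bins=10, alpha=0.5)
ax.set_xlabel("Runs scored in the innings")
```

The same `plot.hist` call applied to the two innings columns gives the superimposed innings 1 and innings 2 chart.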