Continuing on, we're now in the data modeling notebook that we created. Just to touch real quick on a nice feature of Deepnote: similarly to if you were developing on your local machine using Jupyter, for instance, you can see active notebooks that are running. Right here with the green dot, I can see that the data acq clean notebook is still running. If I wanted to cycle back to that and do something, it's still there and easy to access. Just another nice little feature of Deepnote. Jumping into that batters CSV that we created at the end of the data acquisition and cleaning, this is every season for every player in Major League Baseball who had at least 100 hits for their career. That was the limit we used, partly to speed up the data processing, but also because for players that don't have at least 100 hits, you could argue the bar could go a lot higher than that before they would even realistically be considered for the Hall of Fame. Regardless, we used 100 as the limit. You can see there are 43,278 unique seasons captured in our dataset, along with all those features that we spent time collecting and cleaning before. After having that, we then had to make a function that would accumulate career stats. We also wanted this to be dynamic enough that we could take segments of a player's career, so as we pass in a DataFrame, it performs a groupby as it iterates through. In that groupby, it takes the player field and aggregates those rows together, so if we pass in a DataFrame that's limited to only the first five seasons of a player's career, it takes the DataFrame created from that batters CSV and filters out every season that isn't one through five. The same applies for any number of seasons we want to pass in. This particular function is extremely important for the final visualization we'll show, which lets you walk through a player's career year by year. Essentially, as you walk through a player's career year by year, we're running these career stats and iterating through, so as they get to season five, seven, nine, for instance, it creates a new summary, and that new summary is what's ultimately used in the model. Then this next function recreates a similarity score, for the most part following how Bill James did it. Because we have these overall career stats and this batting DataFrame, the first thing we do, so we can train and create our model, is take that and run career stats over the entirety of player careers, no filters or anything like that. Then we also grab, from this link, the Hall of Fame data, whether or not each player made the Hall of Fame, and merge it here. If we take a quick look, we can see the end result. We have the seasons field; Hank Aaron played 23 seasons, so it shows the max number of seasons played, and during the aggregation each column is handled as either a summation or a mean. Scrolling over, we can see this field here: Hall of Fame. It indicates whether the player made the Hall of Fame, which is extremely important for building the model, obviously. After that, we want to look at the value counts on that Hall of Fame field and see how each player made the Hall of Fame.
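To make that accumulation function concrete, here's a minimal sketch of what it might look like. This is an illustration, not the project's actual code: the column names (playerID, season_num, and the stat columns) and the sum-versus-mean split are assumptions.

```python
import pandas as pd

SUM_COLS = ["H", "HR", "RBI", "WAR"]   # counting stats accumulate
MEAN_COLS = ["BA"]                     # rate stats average out

def career_stats(df, max_season=None):
    """Collapse season-level rows into one (partial) career row per player."""
    if max_season is not None:
        # Keep only seasons 1..max_season to summarize a career segment.
        df = df[df["season_num"] <= max_season]
    agg = {c: "sum" for c in SUM_COLS}
    agg.update({c: "mean" for c in MEAN_COLS})
    agg["season_num"] = "max"  # how many seasons this summary covers
    return df.groupby("playerID").agg(agg).reset_index()

# Full careers for modeling, then the induction-method breakdown:
# careers = career_stats(batters)
# careers["hall_of_fame"].value_counts()
```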
As I mentioned earlier in the slides, looking at those value counts, we can see that 85 players were elected by the Baseball Writers Association of America, and some of the other methods of being inducted, whether it's the Veterans Committee, old-timers, and other special elections, are also shown here. We filter out all the players that weren't inducted by the Baseball Writers Association of America, which ultimately gets us our split: 4,136 in the negative class, so not Hall of Famers, and 85 Hall of Famers. The next aspect was the feature engineering we had to perform to create new features that didn't exist statically in the data already. With this, we were looking at All-Star game appearances and MVP awards. With the MVP feature, we also built in the ability to specify the ranking: if you want to look at whether a player was in the top 25 of MVP voting, that's just an argument you pass in, and if you want to filter it down to the top 10 in MVP voting, which is ultimately what we did, that argument is already there and easy to use. To get an idea of the additional statistics and what they are, I've attached a link to the mlb.com glossary, which gives a more in-depth explanation of what every individual stat represents and means. The functions we created to engineer new features involve looking at the awards, a rolling window, taking maxes of different columns, and then "best of" stats. One thing we wanted to capture was the peak, or prime, of a player's career, and we wanted to do it via a rolling window: it would show that from your seventh through your 13th season, for instance, you had this amount of WAR, this was your offensive WAR, this was your runs above replacement, things like that, and also whether or not you were near the top of the league in hits, home runs, runs, RBIs, or batting average. The downside of doing it this way is that it can't take into consideration player injuries, or time served in the military, which was highly prevalent among World War I era players as well as World War II era players. So we also created an additional statistic which looks at the best seven years, or the best five years, of a player's career, so that if you do have an injury-plagued season sandwiched between seven amazing years, the "best" category is still going to factor in your best performances. Running this ultimately creates a new DataFrame with the additional features added in. For ease, and for exploration by anyone watching this who wants to play around, I've also created a CSV of this so that you don't have to perform all the feature engineering yourself. If we look at this and scroll, we can see some of these additional features which were created: offensive WAR best five, best seven, then the individual WAR and WAR best. This war7 represents a rolling window, whereas this war_best7 represents the seven best years. If we use Hank Aaron as an example, it makes sense that his best seven is higher than his rolling seven. Early iterations of this had a really silly error in our code, in that it was hardcoded to accept only five seasons, so this war_best7 actually showed up lower than war7. That's just part of the debugging, and of the collaborative environment where you're working on things concurrently with someone else.
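As a hedged sketch of how those two peak features might be computed, with the window size parameterized rather than hardcoded, the WAR column name and playerID grouping key here are assumptions:

```python
import pandas as pd

def add_peak_features(df, n=7):
    """Add rolling-window and best-n WAR summaries per player."""
    df = df.sort_values(["playerID", "season_num"]).copy()
    war = df.groupby("playerID")["WAR"]
    # warN: best total WAR over any n *consecutive* seasons.
    df[f"war{n}"] = war.transform(
        lambda s: s.rolling(n, min_periods=1).sum().max()
    )
    # war_bestN: total WAR over the n best seasons anywhere in the
    # career, so an injury-plagued year can't drag the peak down.
    df[f"war_best{n}"] = war.transform(
        lambda s: s.nlargest(min(n, len(s))).sum()
    )
    return df

# By construction war_best7 >= war7, which is the sanity check that
# caught the hardcoded-to-five bug described above.
```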
Anthony, one of the things that I think is really interesting about the new features you've created is that the process you're modeling isn't as simple as games won or games lost. It's really a social process inside of this writers' voting. We saw this in the class also when we looked at the Hart Trophy for the NHL: it's really a bunch of journalists who get together and have some criteria, which they may or may not share with one another, then they put it to a vote and decide who's in. So over 100 or 120 years, you're looking at this rich dataset and trying to think about what signals we might use that reflect some of those journalists' ideas as to who really deserves to be in the Hall of Fame. Yeah, so there are a lot of different opinions on the Hall of Fame. As I mentioned at the beginning of the slides, I'm a huge fan of Bill James's work, and he's done a lot of exploration into the Hall of Fame itself. Some of the stats we zeroed in on, with respect to WAR and other things, build on additional statistical measures he created, which are captured on Baseball-Reference.com as well. Black ink, for instance, is another metric for trying to quantify whether a player is comparable to other Hall of Famers: it takes similar players and estimates how likely a Hall of Famer someone is, based on the players that are already in the Hall of Fame. That was our starting point. Then we ultimately ended up creating far more features than this and iterating through them. As I get into the grid search, there are additional things we did for hyperparameter tuning. But it was a fun process to try to get into the heads of the pundits and the talking heads that discuss all of this and ultimately make the decisions, and to try to factor in what we thought would be relevant and important. Not wanting to overfit or underfit was another key concern of ours, because if you use 1,000 features, you could probably get a really nice predictive model on the train portion, but when you actually try to put that to scale or into production, it's probably not going to bode well for you. Especially when you're modeling with grid search, where you're tuning your hyperparameters by looking at many different models, you have this chance of fitting too tightly to your data if you haven't segmented it correctly. That's something we tried to make sure we accounted for. Given the short nature of the course, everything can be improved upon, and this is something we plan on improving, but for the most part we were happy with it. Moving on, we now have all the different features that we think we're going to care about and that we're going to need for our model, and at this point we're narrowing down that list of features. Ultimately we settled on these features and put them into our "Hall of Fame final" DataFrame, which I wrote out to a CSV as well after initially creating it, so that it could be played around with and explored more. This involves almost exclusively the features that we created and engineered, plus our All-Star game appearances and MVP features. We also grabbed a fielding statistic so that we had something from fielding standing on its own; obviously with WAR, wins above replacement, fielding is taken into consideration there too. Then if we move on from here, we're getting into actually running and creating our model.
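Before that walkthrough, here's a rough sketch of the split, scale, and grid search flow being described. The variable names (hof_df, the hall_of_fame label column) and the parameter grid are illustrative assumptions, not the project's tuned values.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stratified 80/20 split keeps the ~2% Hall of Famer rate in both sets.
train, test = train_test_split(
    hof_df, test_size=0.2, stratify=hof_df["hall_of_fame"], random_state=42
)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(train.drop(columns=["hall_of_fame"]))
y_train = train["hall_of_fame"]

# Cross-validation inside the grid search guards against tuning the
# hyperparameters to one lucky split.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(probability=True), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```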
In order to do the train test split, as mentioned earlier, we use stratify, passing in the column that holds our Hall of Fame labels. That creates our Hall of Fame train split. If we look at the value counts for the train and test splits, we can see it's an equal distribution: an 80/20 split with respect to being in the Hall of Fame and not being in the Hall of Fame for our train and our test set. Moving on, this cell contains the parameters that we used in our grid search and our hyperparameter tuning, but I've commented them out, and what you see here are the ultimate parameters that we leveraged for our SVM and our random forest classifier. We had also initially explored KNN, but its performance versus these two was worse, so ultimately we decided to further tune the SVM and RF. As we go through here, we're creating our X and ys, and then we're using a MinMaxScaler to standardize all of our data, so that features on larger scales don't end up with higher weight than those on smaller scales, and everything is normalized; throughout everything we're using the MinMaxScaler. After we create the scaler, we run the fit_transform method. Then here, we're iterating through the different models and expanding our DataFrame so that we can add in the different predictions and scores. When this is run, it holds that information, and because we're using k-fold cross-validation, we're essentially taking our training data, randomly splitting it into different segments, and running on each. We see our results here: collectively, we have an F1 of just under 0.8 for the SVM and just over 0.8 for the RF, and the AUC for both is right around 0.88. In a couple of these instances, if you look at the individual F1 scores and AUC scores, you can see some variation, like a 0.6 here, a 0.66, and then a 0.88. Depending on the actual split, sometimes the score is different, but that's one of the benefits of cross-validation: you can see the different splits and try to account for that. Then inside here, we included a snapshot of one of the trees from the random forest. There will be a full image of this included, so you can actually zoom in and see what it represents; it's basically just walking you through the decision tree. Then here we're iterating through the models and the model keys to add the prediction and prediction probability into the DataFrame. Ultimately, these are the results. In our Hall of Fame final train DataFrame, if we scroll over, we can see a DataFrame where we know whether each player made the Hall of Fame, then the SVM binary prediction of whether they're in that class or not, then the probability prediction, then the same for random forest, and then the player name. If we move on to some of the false negatives, for instance: these players are in the Hall of Fame, however the SVM predicted they were not, which is this score right here. This is just using different filtering within Pandas to filter this out and sort it. One thing that's in common with a lot of these players is that they either had shorter careers or were very, very good players; in some cases they were great players. You can't possibly predict every player, but one in particular that comes to mind is Roy Campanella.
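That kind of slicing is plain Pandas boolean filtering. A hedged sketch, assuming illustrative column names (hall_of_fame, svm_pred, rf_pred, rf_proba) in a results DataFrame:

```python
# False negatives: actually in the Hall of Fame, but predicted out.
fn_svm = results[(results["hall_of_fame"] == 1) & (results["svm_pred"] == 0)]

# Cases where both models agree with each other and with the label.
both_right = results[
    (results["svm_pred"] == results["rf_pred"])
    & (results["svm_pred"] == results["hall_of_fame"])
]

# Near misses: binary prediction is 0 but the probability sits close
# to the 0.5 decision boundary, like Jackie Robinson's 0.500 below.
near_miss = results[
    (results["rf_pred"] == 0) & (results["rf_proba"] >= 0.45)
].sort_values("rf_proba", ascending=False)
```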
Campanella was a Negro League player, and his stats from the Negro Leagues, as I mentioned earlier, are not included in any of this, so it makes sense that within our model he isn't predicted as being in the Hall of Fame: a good portion of his career doesn't exist with respect to what the model is interpreting. Other players with shorter careers that were fully within Major League Baseball can still be captured. After this is where you can see some of the differences: this is all the players that were not predicted to be in the Hall of Fame by the Random Forest classifier. Jackie Robinson is another amazing example of the external factors that can come into play with being inducted into the Hall of Fame. He was a phenomenal player. I wasn't alive to witness him playing, but I have seen some clips and his stats are unreal; he also spent time in the Negro Leagues before joining Major League Baseball. With respect to Jackie Robinson, if we scroll over, because we're including the probability prediction here, we can see that even though Random Forest didn't predict from a binary classification that he would be in, his probability is 0.5, and had it been 0.501, the binary classification would instead have been one: it would have predicted him as a Hall of Famer. You can see that for every individual player, and this information is used in the visualizations we'll show later. Moving on, this shows the players for which the SVM and Random Forest predict the same values. In this case, all the players shown here are predicted to be in the Hall of Fame by the SVM model and the RF model, and are actually in the Hall of Fame; of the 68 cases, 50 of them fall into that bucket. The next set shows where the SVM and RF predictions agree with each other but don't necessarily match the Hall of Fame label. There are some players where both SVM and RF predict that they will not be in the Hall of Fame, like Roy Campanella, however they are in the Hall of Fame. This is just another way of looking at some of the data that comes out of the model we created. Then here the two models are predicting opposite values, and here we get into some of the false positives. In the train set, some of these false positives, whether it be steroid allegations or that they were just very good players who ultimately didn't get elected, fall into that particular bin. Then this is another look at additional players: similarly to just above, these are all very good players in their own right, and for circumstances that aren't exactly known, they're not in the Hall of Fame; there are no set criteria from the Baseball Writers' Association of America for what you have to achieve in order to be inducted, and there could be a myriad of reasons why these players haven't been elected yet, if they're still eligible. After that, we can move into some additional false positives for the other model, the Random Forest classifier. Again, prominent names are listed here, but there could be many different reasons why they're listed. Then here we're looking at the actual performance, the precision and recall: we're seeing a precision of around 0.85 and 0.81, and recall around 0.8. After that, we can jump into actually looking at the important features and visualizing those, which is what this is doing.
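For a fitted scikit-learn Random Forest, that view can come straight from the feature_importances_ attribute. A minimal sketch, assuming rf is the fitted classifier and feature_cols the list of feature names:

```python
import pandas as pd

importances = (
    pd.Series(rf.feature_importances_, index=feature_cols)
      .sort_values()
)
# Horizontal bars put the most important features at the top;
# RAR, All-Star appearances, oWAR, and WAR led in our case.
importances.plot(kind="barh", title="Random Forest feature importance")
```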
Based on our model, this is analyzing all the different features included in the model, and we can see their importance. Runs above replacement, All-Star appearances, offensive WAR, and WAR are the four most critical features, and then as we tail off, the importance of the other ones lessens a little bit, but we did find that having these additional features added some clarity for some players, so ultimately we felt it was good to keep them in. After finalizing the train set and getting to a model that we felt was good, we then moved on to the test set. With the test set, we're taking the classifier we built above and passing in our test data. When we do that, our ultimate scores for F1 and AUC are 0.87 and 0.92. Obviously, the test set is smaller than the train set, which accounts for some increase in the variance of these scores, but basically how we judged the model was this: the players from the test set that were put in the Hall of Fame class, we agreed with and thought were good, and for the players that were not put in that class but were close, we ultimately agreed they should at least be in the conversation, in our eyes; obviously we have no bearing on whether or not they get into the Hall of Fame. This is what that data frame looks like. Looking at some of the false negatives in the test set, we have Willie McCovey, George Sisler, and Willie Keeler, and they show up in both the SVM and the RF predictions. Then this looks at all the false cases here, so we can see Hank Greenberg also added in. At this point we're looking at all the players where the SVM and RF matched, and here we can see that all of them belong and are listed appropriately. For false positives, the only player listed here is Shoeless Joe Jackson, who is banned from baseball due to the Chicago Black Sox scandal, so it's not possible for him to be elected to the Hall of Fame even though there's public support for that. Moving on from that, we can now look at the final results of the model. Looking at the test set with our precision and recall, again the small number of positives and the class imbalance make these statistics a little harder to take as perfect or ideal, but overall we felt pretty good about the model. Then we wanted to get into looking at current players: everyone who is either still in the game or has retired within the last five years. We created this data_prep function so we could add all the additional features we engineered above into a current DataFrame with the players that are in the game now or have recently retired. Then we run a little bit of additional cleaning. In order to avoid re-running the data_prep function, I wrote the results out to a CSV so I could just pull them in. Here we're passing the same information into the classifier that we used above; it's just redone here to keep everything in one cell, but it's the exact same specs we used above, and we're only running the random forest at this point. This essentially goes and creates this final data frame here, which represents all the players that are either projected to be in the Hall of Fame based on their statistics, or are very close: by using the filter of 0.47, we're looking at those edge cases.
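A minimal sketch of that scoring-and-filtering step, assuming current_df holds the engineered features for active and recently retired players, and reusing the hypothetical scaler, rf, and feature_cols names from the earlier sketches:

```python
# Score every current/recently retired player with the trained forest.
current_X = scaler.transform(current_df[feature_cols])
current_df["rf_proba"] = rf.predict_proba(current_X)[:, 1]
current_df["rf_pred"] = (current_df["rf_proba"] >= 0.5).astype(int)

# 0.47 instead of 0.5 deliberately pulls in the on-the-cusp cases.
on_the_cusp = current_df[current_df["rf_proba"] >= 0.47].sort_values(
    "rf_proba", ascending=False
)
```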
If we scroll over a little bit here in the RF prediction, we can see that up until this row, which is Adrian Beltre, all the players above are projected to be Hall of Famers based on the model that we created. My personal opinion is that all of these players have cases to be in the Hall of Fame for sure: they're either amazing ballplayers and the best of the current era, or they were just phenomenal baseball players in general. If we scroll down a little bit, after Adrian Beltre we see a long list of players that are all very good. Then we see the 2016 World Series MVP, Ben Zobrist, who is about the best utility player you could ever ask for, in my opinion. All of the players listed here are on the cusp; again, I personally think they're worthy of being in the conversation, and the model itself picks up on that and deems them as such. Then the most fun part, the part we wanted to get to and that I spent the most time looking at and analyzing myself, was predicting Hall of Fame worthiness throughout a player's career. With this, we start with that original full DataFrame with the engineered features in it, and we run this function here, which iterates through player careers given different inputs. It ultimately creates a DataFrame which gives you a full binary or probabilistic view of a player's career, and then a visualization like this. If we look at this visualization, we can look at Craig Biggio, for instance: through years 1 to 9 he's colored red, which is not a Hall of Famer based on the model, and in year 10 it flips to blue, which is a Hall of Famer. You can see, as you scroll through, the exact year in which a player became a Hall of Famer, or in some cases, like Edgar Martinez, you can see it flip to Hall of Famer and then, as he continued playing, flip the other way for a couple of years and then flip back. In my opinion, it's really cool to walk through and see. Then you've got a player like Alan Trammell, who has one year where, according to the model, he's a Hall of Famer, and then essentially the model is saying it was a detriment for him to continue playing. He was an awesome baseball player and ultimately isn't in the Hall of Fame. The other view of this, which currently is only looking at Hall of Famers, is a nice alpha view: this takes the probability of them being elected to the Hall of Fame, where deep red means the probability is zero, and as it approaches one the shading gets more blue; there are no players at exactly one. As it gets bluer, the odds of them being inducted into the Hall of Fame by the Baseball Writers Association are higher. As you go through here, you can see a lot of different players, and sometimes it takes until the very end of a career before the model predicts them as a Hall of Famer, but ultimately you're judged on your entire career. The thing I like about this particular visualization is that you can look back and imagine a player retiring early. Take Tony Gwynn, for instance: if Tony Gwynn had retired after his seventh year, based on the model, he may not be in the Hall of Fame. Although again, with those external factors, he may ultimately have gotten into the Hall of Fame anyway.
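A rough sketch of the per-season loop behind this view, chaining together the hypothetical career_stats helper, scaler, rf, and feature_cols from the earlier sketches:

```python
import pandas as pd

rows = {}
for n in range(1, int(batters["season_num"].max()) + 1):
    # Summarize every career as if it had ended after season n.
    partial = career_stats(batters, max_season=n)
    X = scaler.transform(partial[feature_cols])
    rows[n] = pd.Series(rf.predict_proba(X)[:, 1],
                        index=partial["playerID"])

# Players as rows, season cutoffs as columns; a player's value
# simply plateaus once their career has ended.
career_matrix = pd.DataFrame(rows)

# A red-to-blue gradient centered on 0.5 reproduces the flip shown above:
# career_matrix.style.background_gradient(cmap="RdBu", vmin=0, vmax=1)
```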
But it's just nice to go back and try to quantify the impact a player had throughout their entire career, and then highlight when they became a Hall of Famer. That particular view is really nice when we look at every player in Major League Baseball that has had at least 25 career WAR. A lot of these players the model will never predict as being in the Hall of Fame; however, you can see, as they trend, the opacity within the red shading dip down, and there are a lot of players like that, so you can tell there are a lot of fringe players that are very good baseball players. To reach 25 career WAR, you have to be a good baseball player and have had a pretty good career. Ultimately, it's just another way of looking through it, and there are a ton of different players included in this. This is one aspect of Deepnote that I'm not super keen on, in that I would love for this to be an interactive visualization where you can select a set number of players and have it dynamically display just them, so you could do some additional exploration; and to give the ability to change the features you're using and create a brand new model, for instance, if you wanted to, or just to explore the players you think are similar and how good you are at predicting how similar a player is. This is one way you can visualize that. This is one aspect of our capstone, where we're designing a website around this that will give users more functionality and features than is possible within Deepnote. On my personal workstation, as I was developing all of this, I created a Plotly dashboard and also leveraged IPython widgets so that I could interactively explore all of it. This is basically how we ended our project for this section: it's a key piece of our second milestone within the MADS program, and we enjoyed it a lot. We also had an unsupervised portion which focused on player similarity; combining all of that, along with our work on umpire visualization and umpire exploration, all of it is going to make its way into our capstone, and ultimately I will probably share that code within the specialization as well, though there won't be a walk-through of it. Anthony, one of the things that I love about your project is that it brings up a critical issue, and you and I have talked behind closed doors about this, but let's talk about steroids. There are some amazing players on the list of non Hall of Famers, and I grew up collecting ball cards in the '80s and '90s, and I remember the news reports talking about why there were so many home runs and so forth. This is really a social process that you're modeling, but you're trying to use analytics to do it. What other datasets do you think you might be able to bring in, to capture some of the other signals that maybe the baseball writers are giving off? One thing that we didn't actually leverage in this particular milestone, because of time constraints, is the actual year-by-year voting for players by the Baseball Writers Association of America. All of that can be collected and utilized, and you can see if a player is trending one way or the other. They do have a five percent cutoff: if you're garnering no support, you're just going to be left off the ballot for future consideration, but if you are garnering support, you can leverage that. Then you can also start looking into things like Twitter, because social media is huge.
You know who the Baseball Writers Association members are. You could try to look up their personal pages and scrape information about them. Maybe they're a fan of the Red Sox and maybe they aren't as impartial as others might be; you might see some additional votes for a player like David Ortiz, for instance, although in my opinion he should be in the Hall of Fame anyway. So leveraging social media, and the actual voting statistics, are a couple of things that I think would be nice expansion opportunities. I think that insight about Twitter is really interesting, because we've actually started to look at that with the NHL MVP, the Hart Trophy, as well. Again, it's baseball writers, or in that case hockey writers, who are making those votes. I think about the longitudinal nature of your dataset here too, which is really interesting, being over 100 years old. What was the Twitter equivalent of the 1940s and '50s in newspapers? How were people in the eligible voting class aligned with one another and others? It's really just a phenomenal opportunity, I think. What about salary information? When this whole course, this whole specialization, started, Stefan and others looked at some of the issues from Soccernomics and so forth around salary and what a signal salary is. Do you think salary is a useful piece of information to bring in here? Yeah. We initially scraped salary, so we do have it as part of our dataset. But given the condensed timeline, it wasn't something we explored heavily, because salaries have exploded at a much different rate than inflation, and a lot of that is due to collective bargaining and the players' union getting the players more of what they deserve; that's my opinion only, not a statement of fact. Because of that, along with the baseball luxury tax and the arbitration system, there are so many other extraneous factors that you can't just do an apples-to-apples comparison of, say, somebody who made $30,000 in one year back in 1920 against Mike Trout, who has the richest baseball contract in history at north of $40 million a year. There's absolutely no way to do an apples-to-apples comparison because there isn't a salary cap. It's not like football, basketball, or hockey, where based on the cap you can say a player is taking up this percentage of the salary. Creating a method to analyze that would take more finesse; to do it right and do it justice would be a time-intensive process, and it wasn't something we had the luxury of doing. But for future work, it's something we wanted the flexibility to explore, which is why, in collecting the data, we didn't want to have to go back and re-scrape it: we wanted to collect everything we could possibly imagine wanting in the future. It's there, and it's now on our AWS instance; we can query it whenever we want, and we're happy about that at least. But to your point, I think salary could be a very good indicator; a method to actually explore it and take it into consideration appropriately would be needed, though. Well, thank you very much for sharing this project with us. I found it enlightening. I don't follow baseball, as you know, very closely; I'm definitely more on the hockey side of things.
Maybe that has to do with losing the Expos in the '80s and just never having come back from that. But I really appreciated this; I think it's a very thoughtful approach. A lot of the things that we've talked about in this course, the different kinds of models, grid searching, building ensembles, and so forth, are all built in there. But you went further and did much more, and we didn't have much of a chance to talk in this course about scaling data and how important scaling data can be, and rotation of data and so forth. Of course, there are also some of the more high-performance or parallel computing aspects, especially when you're scraping or dealing with really large datasets. Anthony, thanks again for sharing this great project with us. Thank you. It was a pleasure, and it's been a fun ride working on the course with you as well in a support role. But thanks for having me.