[Historian's Department] What makes a HoF career?
Canadice
After posting this thread it was revealed that Taylor McDavid and Ace Redding are in the HoF but were missing from the thread I scraped. Aidan Richan has also been mentioned as not being in the HoF when they have in fact been inducted!
Before delving into this thread, I would recommend reading my previous Historian's post where I introduced the decision tree method used and expanded upon in this analysis.
Introduction
Many of the players in the SHL strive towards greatness. One of the more prestigious rewards you can achieve for a great career is being selected into the Hall of Fame. As of writing this post, only 136 of more than 2000 players have been honored this way, making it a very exclusive group. So what makes a Hall of Fame career? Is it longevity? Multiple individual awards? High productivity? Or is it something intangible that is difficult to predict? In this post I hope to find a model that can shed some light on different aspects of a great career and correctly predict whether or not a player is destined for the Hall of Fame.
Data set
The data used throughout this analysis is the career data from both STHS and FHM, aggregated by Luke and available for download via the SHL Github under csv/history_skaters.csv. The same data was used in the earlier post on predicting the position of a skater. The list of HoF players is found via the HOF Players by Draft thread. I have also scraped the Awards History thread to acquire the number of wins and nominations for every individual trophy the league awards. This results in a final data set comprising 65 variables, 29 of which are awards information, and one class variable that is of interest to predict: whether or not the player is in the Hall of Fame.
As mentioned earlier, only 136 players have been inducted into the Hall of Fame, and only 112 of these are skaters. Compared to the number of players (2014) that have played at least one season in the SHL as a forward or defenseman up until season 56, this is only around 5.5%. This small number of HoF skaters might prove troublesome to predict, but more on that later.
Basic visualizations
The initial assumption is that the longer a skater's career is, the higher the chances of them being inducted into the HoF, so we should see an association between high values for the career statistics and presence in the HoF. The first statistic to look at is the number of games played:
Fig. 1 - Violin plot showing difference in distribution of games played between non-HoF and HoF skaters. The white markers are the quartiles of each group of observations.
The violin plot shows the distribution of observations along the y-axis, where a wider form means a larger percentage of observations in that area. The bottom triangle shows the first quartile, meaning that 25 percent of the observations in the respective group lie below this value. The circle shows the median, or midpoint, of the observations: 50 percent lie below and 50 percent above this value. Finally, the top triangle shows the third quartile, which indicates that 75 percent of the observations lie below this value.
As we would expect, the first quartile of the HoF group lies around 100 games above the third quartile of the non-HoF group, so more than 75 percent of the skaters in the HoF have played more games than 75 percent of the skaters not in the HoF. There are a few outliers in both groups: in the HoF, Turd Ferguson, Sergei Karpotsov and Brandon Holmes have all played fewer than 200 games, while Brandon Cant, Brendan Gibbon and Taylor McDavid have all played more than 1100 games without being inducted into the HoF. The outliers in the HoF class might be attributed to a lack of data from the first seasons. Another possible reason is that the first seasons of the league did not have as many games, meaning that players from these draft classes have accumulated fewer games in the same number of seasons compared to skaters from later seasons.
Fig. 2 - Violin plot showing quartiles and difference in distribution of points earned between non-HoF and HoF skaters.
The distribution of points also gives a clear indication that long and consistent point production should be one of the more important aspects for a HoF induction. As was the case for the number of games played, the number of points also has some outliers: Turd Ferguson, Brandon Holmes and Alex Reay from the HoF class have fewer than 150 points, while Chester Cunningham, Trevor Wilson, Ace Redding and Taylor McDavid have surpassed 700 points without being in the HoF.
Fig. 3 - Violin plot showing quartiles and difference in distribution of number of award wins and nominations between non-HoF and HoF skaters.
When it comes to awards, there is once again a clear difference in the number of nominations and wins between the two groups of skaters. There are around 20 skaters in the HoF who have never won an award, but only Jed Lloren, Niklas Stryker, Pavol Skvoznak, Nikolaus Scholz, Nicholas Pedersen, John Falkirk and Dave Smith have never received any nominations for awards. In the other group, Theo Morgan (currently active), Damian Littleton and Danny Foster have won more than 5 awards each and are not currently in the HoF.
Methodology
The method that will be used for this analysis is decision trees, as explained in the earlier post. One important aspect of this method is that the algorithm will only include the variables that it deems best at splitting the data into homogeneous groups. What this means in practice is that if you include multiple variables that in essence explain the same thing, the resulting model will still be relatively simple and only select one, or a few, of the variables with the best predictive power.
Unbalanced labels
I mentioned earlier that problems will arise on account of the very small number of skaters in the HoF. The goal of the classifier is to create a model that predicts as few erroneous labels as possible, but if the original data set already has a very small share of one label, the classifier starts off with a very low misclassification rate. In this case, if the root node is used as the final model, all skaters will be predicted as not being in the HoF, resulting in a misclassification rate equal to the ratio of Hall of Famers to skaters, around 5.5% (an accuracy of around 94.5%). Any split that the model makes will have a hard time improving on this number unless the data contain very specific relationships that can easily be modeled, which is rare.
One way to handle this type of problem is to change the way we value different kinds of false predictions. By default we value both kinds of errors the same: predicting a non-HoF skater as being in the HoF is just as bad as predicting a HoF skater as not being one. We can instead say that one type of error is worse than the other, for instance:
Tab. 1 - Example of a loss matrix
This loss matrix defines the cost of predicting a HoF skater (row) as not being in the HoF (column) as 10 times worse than vice versa (non-HoF as HoF).
Another way is to look at a measure other than the accuracy (1 - misclassification rate) when evaluating the model, as the accuracy heavily favors the majority label. We still base the evaluation on the predictions from the model, which can be presented in a confusion matrix containing the number of observations in every combination of true and predicted labels.
Tab. 2 - Example of a confusion matrix
The accuracy is calculated as the sum of the diagonal frequencies (f_00 and f_11) divided by the sum of all frequencies. As is the case here, if one of the labels is much more frequent than the other, the accuracy will be dominated by that label's term.
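To illustrate the two points above, here is a minimal sketch in plain Python. It computes the accuracy of the "predict everyone as non-HoF" root node from the counts given in the post, and shows how a 10:1 loss matrix changes which label a leaf would get. The dictionary layout and the function names are made up for this illustration, and the sketch only covers the labelling of a finished leaf; how a particular tree implementation uses the loss matrix while growing the tree is not covered here.

```python
# Counts taken from the post: 2014 skaters, 112 of them in the HoF.
N_SKATERS = 2014
N_HOF = 112


def majority_baseline_accuracy(n_total, n_minority):
    """Accuracy of a model that labels every skater as the majority class."""
    return (n_total - n_minority) / n_total


# Loss matrix with the 10:1 ratio described in the text
# (first key = true label, second key = predicted label).
# The dict layout is an assumption made for illustration only.
LOSS = {
    ("non-HoF", "non-HoF"): 0,
    ("non-HoF", "HoF"): 1,
    ("HoF", "non-HoF"): 10,
    ("HoF", "HoF"): 0,
}


def expected_loss(p_hof, predicted):
    """Expected loss of labelling a leaf `predicted` when a share p_hof is HoF."""
    return (p_hof * LOSS[("HoF", predicted)]
            + (1 - p_hof) * LOSS[("non-HoF", predicted)])


def leaf_label(p_hof):
    """Label that minimises the expected loss for a leaf."""
    return min(("non-HoF", "HoF"), key=lambda label: expected_loss(p_hof, label))


baseline = majority_baseline_accuracy(N_SKATERS, N_HOF)
print(f"baseline accuracy: {baseline:.3f}")  # baseline accuracy: 0.944

print(leaf_label(0.36))  # HoF
print(leaf_label(0.05))  # non-HoF
```

With equal costs a leaf is labelled HoF only when more than half of its skaters are in the HoF. With the 10:1 costs the break-even point drops to 1/11 (around 9%), which is why a leaf with a clear non-HoF majority can still be labelled HoF.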
Instead we can look at the label-specific accuracies, called sensitivity and precision. The sensitivity looks at the rows of the confusion matrix, whereas the precision looks at the columns of the matrix.
sensitivity of label 1 = f_11 / (f_10 + f_11)
i.e. the share of observations truly belonging to the label that have been predicted as that label.
precision of label 1 = f_11 / (f_01 + f_11)
i.e. the share of observations predicted as the label that actually belong to it.
A model suffering from unbalanced labels will have very different values of these measures for the two labels, one much higher than the other. So instead of trying to maximize the overall accuracy of the model, the model evaluation can focus on getting values as high as possible for the sensitivity or precision of both labels.
Redefining the loss matrix can help the algorithm focus on evening out the values of one of these measures (e.g. sensitivity) over the different labels, but unfortunately it will also negatively impact the other measure (e.g. precision). If one were to use the example loss matrix in table 1, the sensitivity of the minority label will become larger as the cost of predicting a HoF skater as non-HoF is increased, i.e. more predictions will correspond to the true label. On the other hand, the sensitivity of the majority class and the precision of the minority class risk becoming worse, as the cost of predicting a non-HoF skater as HoF is now weighted lower in the evaluation. So if we still use the accuracy as the main measure for evaluating a model's performance, neither of the two methods above will completely solve the problem of unbalanced labels.
However, this is enough methodology for this thread. One thing to note before we get into the results is that the data has been split into a 70% training set and a 30% test set, containing 1423 and 591 skaters respectively.
Results
Given the initial look at some of the data, the initial assumption is that the model will be a simple one that hopefully can perform well, but you never know what type of hidden relations exist.
Career statistics
I started by modelling just the aggregated career statistics from the regular season against the response variable.
Fig. 4 - Tree diagram showing the result of the decision tree model based on aggregated career statistics from the regular season. The darker the color of the respective side, the more homogeneous the node.
This model was surprising in two ways. Firstly, the number of earned points and games played only occur twice as a splitting criterion, and not on a large section of the data. The first split is actually done on power play points, which separates 7% of the data to the right, a group in which 62% of the skaters are in the HoF. One can argue that points and power play points are of a similar nature, and the Pearson correlation coefficient between these two variables is 0.97, indicating that they do in fact have a very strong (positive linear) relationship and in essence describe the same information.
Another interesting note is that the number of penalty minutes and fights won are used as splitting criteria in a few nodes, where lower values lead to the model predicting the skater as not being in the HoF. Only two of the nine total splits that the model makes say that larger values make a HoF skater more probable, which relates back to the initial assumption that a long career gives a player a good chance of being inducted.
Tab. 3 - The confusion matrix from the model based solely on career statistics from the regular season.
As can be seen in the confusion matrix, not that many skaters are predicted wrong and the overall accuracy is around 97.9%; however, the model is only able to predict around 76.8% of the HoF skaters as such. Examples of the 19 skaters that are in the HoF but were not predicted as such are Aidan Richan, Jeff Dar, Sergei Karpotsov, and Turd Ferguson. Many of them are from the first couple of seasons of the league and haven't amassed the same number of games or production as the majority of the HoF.
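To make the numbers above concrete, here is a small sketch that rebuilds the confusion matrix from the figures quoted in the text and recomputes the measures from the methodology section. The count of 82 HoF skaters in the training set is an assumption: it is the number implied by 19 misses at 76.8% sensitivity, and it also matches 112 HoF skaters in total minus the 30 in the test set. The cells are reconstructed, not copied from Tab. 3, and the precision is a derived value that is not stated in the post.

```python
# Figures quoted in the post.
N_TRAIN = 1423      # skaters in the training set
f_10 = 19           # true HoF, predicted non-HoF
f_01 = 11           # true non-HoF, predicted HoF

# Assumption: number of HoF skaters in the training set (112 total - 30 in test).
HOF_IN_TRAIN = 82

# Remaining cells follow from the totals.
f_11 = HOF_IN_TRAIN - f_10             # true HoF, predicted HoF
f_00 = N_TRAIN - HOF_IN_TRAIN - f_01   # true non-HoF, predicted non-HoF

accuracy = (f_00 + f_11) / (f_00 + f_01 + f_10 + f_11)
sensitivity = f_11 / (f_10 + f_11)     # share of true HoF skaters that were found
precision = f_11 / (f_01 + f_11)       # share of predicted HoF skaters that are HoF

print(f"accuracy:    {accuracy:.3f}")     # accuracy:    0.979
print(f"sensitivity: {sensitivity:.3f}")  # sensitivity: 0.768
print(f"precision:   {precision:.3f}")    # precision:   0.851
```

The recomputed accuracy and sensitivity agree with the 97.9% and 76.8% reported for this model, which suggests the reconstructed counts are at least consistent with the original table.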
Looking at the 11 skaters who were erroneously predicted to be in the HoF, skaters such as Brandon Cant, Taylor McDavid, Lord Vader and The Dude have all amassed around or far more than 700 games and have relatively good production.
Award winners and nominees
The second model I tested used only the award data, both the individual awards and the aggregated number of wins and nominations for each skater.
Fig. 5 - Resulting decision tree based only on the awards data.
Once again the resulting model is quite simple with few splits. As expected, the number of wins is a very important splitting criterion, with three of the first five splits using it. Considering the individual awards, being nominated for the Aidan Richan trophy actually reduces the chance of being inducted into the HoF, and this might be connected to the namesake not being in the HoF.
Tab. 4 - Confusion matrix for the awards data model.
Given the simple nature of the data, only counting the number of wins and nominations, the relatively good performance of the model surprised me. Still, the overall accuracy (96.5%) and sensitivity (62.2%) are worse than those of the first model.
Combine the data
The third and final model that I tested with the default loss matrix used the two data sets combined. As they could independently produce good-quality models with some difficulties in predicting the minority label, maybe combining the two could explain the minor discrepancies found in the respective models.
Fig. 6 - Resulting decision tree from the combined data set using both career statistics and awards.
The resulting model performs as expected, using variables from both types of data. Specifically, the number of awards won is an important aspect, being used quite early in the splits. Surprisingly, a larger number of penalty minutes and penalty majors gives a skater a larger chance of being in the HoF, but once again the interpretation of these splits is that they are an effect of a longer career.
Tab. 5 - Confusion matrix for the model with combined data.
The confusion matrix in table 5 shows that the model actually outperforms the others by a mile. The accuracy is around 98.7%, but the largest improvement can be seen in the sensitivity of the minority label: a whopping 93.9% of the HoF skaters have been predicted as such. The HoF skaters that the model failed to predict were Daniel Merica, Jackson Rogers-Tanaka, Lucas Smith, Phil Schenn and Turd Ferguson. The model still had issues with some of the same non-HoF skaters as before, mislabelling Taylor McDavid, Brandon Cant, and Tor Tuck, but also current players such as Tony Pepperoni, Ola Wagstrom, Lil' Manius, and Piotr Czerkawski.
Changing the loss matrix
Initially I thought that the loss matrix had to be revised in order to get a good prediction of the minority class, but after correctly sorting out the data set and combining the two types of data, the model presented just above performed above expectations. Nonetheless, I wanted to test whether the result could be improved by changing the cost of different false predictions, so I used the loss matrix defined in table 1. The model at once becomes more complex:
Fig. 7 - Resulting model when using the adjusted loss matrix on the same data as figure 6.
The splits that are made initially in the model are more conservative compared to the earlier model, resulting in larger parts of the data being moved to the right hand side of the figure. This conservative approach means that the model requires multiple splits to end up with leaves that are more or less perfect, at least for the majority class. We can see that the leaves have different predictions compared to what we are used to; for example, the third leaf from the left contains only 36% HoF skaters but the entire node has been predicted as HoF. This is because of the adjusted loss matrix, where missing a HoF skater costs 10 times more than mislabelling a non-HoF skater, so the model prioritizes predicting HoF skaters correctly.
Tab. 6 - Confusion matrix for the model with adjustments made to the loss matrix.
Looking at the confusion matrix, it is very clear that the model predicts the minority label perfectly, with 100% sensitivity; however, the certainty that a predicted HoF skater actually is in the HoF is much worse than before. Only two thirds of the predicted HoF skaters are predicted correctly, and this was one of the aforementioned drawbacks of using this proposed solution for unbalanced labels. Overall the model also became worse, with only around 97.1% accuracy.
Using per game statistics
One aspect that was prominent in the earlier models was that HoF players from the earlier seasons, when comparably fewer games were played, had some difficulty being predicted correctly. One way to improve the model is to transform all of the career statistics (except games played) to a per game basis, i.e. every value is divided by the number of games the skater has played in their career. This transformation is applied only to the career statistics; awards and games played are kept as sums over the career.
Fig. 8 - Decision tree with per game statistics instead of sums.
Tab. 7 - Confusion matrix from model with per game statistics.
The resulting model is a bit more complex than the earlier one, and performs worse when it comes to the sensitivity of the minority label (81.7%) and overall accuracy (98.4%), but better on the precision of the minority class (89.3%). Only eight non-HoF skaters have been predicted as HoF caliber, and they include Taylor McDavid (once again), Buster Killington, Raven Silverwing and Yuri Boyka.
Checking overfitting on the test set
Before we delve into the conclusions of all this modelling, it is imperative to check whether or not the chosen model(s) have been overfitted to the data. During the fitting of the model a validation process has been used to define when to stop splitting nodes, so the evaluation metrics on the test set should be similar to the earlier presented values.
Tab. 8 - Summary of evaluation metrics on the test set
The two models perform more similarly on the test set than the values and differences seen in training would suggest. This result indicates that the models generally explain the same relationships in the data and have similar difficulties in correctly predicting the minority label. Both models failed to predict 8 out of 30 skaters that are in the HoF, and similarly 11 and 12 skaters respectively were mislabeled as being in the HoF.
Conclusions
So what can we learn from all of this modelling? Well, what I found very interesting was that a long career with a good production pace does not make you a shoo-in for the HoF; you also need seasons where you win some individual hardware. A high number of games played and points were present in both of the chosen models, but for the summed data special teams and penalty minutes also had a positive relation with the chance of being inducted into the HoF. No specific award was deemed important for this model, just that the skater had won something, but in the per game model specifically winning the Jeff Dar Trophy increased your chances of being in the HoF. A nomination for the Lance Uppercut Trophy was also seen as important for increasing the chances.
Looking at the individual skaters, one can argue that the models have been fooled by the presence of current players: they haven't had the chance of being inducted into the HoF, so the model will not be able to predict them correctly if they are currently on a HoF career pace. Theo Morgan (@Otrebor13), Andreas Kvalheim (@raymond3000), Ola Wagstrom (@StamkosFan), Tony Pepperoni (@"TommySalami"), Lil' Manius (@Bonk), Oliver Cleary (@Buster), Rex Kirkby (@Acsolap), and Piotr Czerkawski (@majesiu) have been predicted as HoF inductees based on these models. In order to create a proper model, more filtering of the data should have been done, removing skaters who haven't had a chance of being inducted.
A couple of players stood out as having Hall of Fame careers but alas have not had the privilege. Aidan Richan (who has a trophy named after them), Taylor McDavid, Corey Bearss, Maxim Horvat and Jack Tanner are predicted by both models as being in the Hall of Fame when they are not.
One reflection I had while writing all of this is that I should have used the historical data for training the model and the current players as the data set to test who has the highest chance of being inducted into the Hall of Fame when given the chance. This way it would be clear that the true labels are present in the training data, and what the model can be used for becomes much more well-defined.
Words: ~3800 x2 week