Goaltenders are important, but volatile. How can statistical forecasts overcome considerable uncertainty in future performance? By embracing it.
Why Project Performance?
The first instinct upon seeing goaltending projections should be: why? Goaltending is a notoriously volatile position that makes little sense even to those with a deep understanding of the game. In any given season, high-pedigree and also-ran goaltenders are seemingly equally likely to deliver top performances. A perceived star having a poor year can sink a promising season.
But that magnitude of impact makes it an interesting and useful exercise. Goaltending, though volatile, exerts an outsized influence on games and seasons, for better or worse. If your goaltender is on, the game is easy, and if they are off, everyone invested in the team is just waiting for something to go wrong.
And that’s the crux: a volatile but important position is still important. It is often useful to use data to project future results, no matter how difficult and frustrating the process can be.
In the Business of Results
It’s important to preface that this analysis deals with a goaltender’s statistical profile rather than true goaltender ability. In a perfect world, we could successfully derive a metric that aligned the two in a meaningful way, and current methods do their best to isolate goaltender performance by adjusting for the quality of shots their team allows. However, there are latent variables characteristic of certain teams. For example, some teams may allow a higher rate of screened shots or cross-ice passes than the recorded shot attributes might suggest. I’ve estimated that team-level latent effects on shot quality can be 0.2% at even strength and about 0.6% on the powerplay.
Therefore, all projections suggest which goalies will likely return the best results rather than which goalies are definitively better. Results are influenced by ability, health, age, opportunity, coaching, and team-effects, all contributing to the difficulty of prediction.
What Result Do We Care About?
In order to create a projection, we first must decide what to measure over the upcoming season. Some fantasy websites might project games played, wins, or standard save percentage. However, we want a metric that best isolates goaltender performance, given the available data. Publicly, this is currently best done by adjusting save percentage by expected save percentage using an expected goal (xG) model. Each shot is weighted by the probability of it being a goal given what we know about the shot, and the total is measured against actual goals against.
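As a minimal sketch of that calculation, the snippet below computes save % over expected from per-shot xG values. The xG numbers and the function name are made up for illustration; a real xG model would supply the probabilities.

```python
# Each shot is (xg, is_goal). The xG values are hypothetical model outputs.
shots = [
    (0.03, 0), (0.12, 0), (0.30, 1), (0.05, 0),
    (0.08, 0), (0.42, 0), (0.02, 0), (0.10, 1),
]


def save_pct_over_expected(shots):
    """Return actual save % minus expected save % for (xg, is_goal) pairs."""
    n = len(shots)
    expected_goals = sum(xg for xg, _ in shots)
    actual_goals = sum(is_goal for _, is_goal in shots)
    expected_sv = 1 - expected_goals / n
    actual_sv = 1 - actual_goals / n
    return actual_sv - expected_sv


# 1.12 expected goals vs 2 actual goals on 8 shots gives a negative lift.
lift = save_pct_over_expected(shots)
```

A positive value means the goaltender conceded fewer goals than the shot quality implied, and a negative value means more.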
However, rebound shots weigh heavily in expected save percentage calculations, and rightfully so. Rebound shots are about 4 times more dangerous than initial shots (shooting percentages of 26% and 6%, respectively). But rebounds are not necessarily independent of the goaltender: in theory, the goaltender has some control over rebound opportunities against them.
Results are often complicated by sample size. More shots mean more information. While we could only include goalie-seasons with X number of shots, these cut-offs can be arbitrary and can be fiddled with to create spurious results (a 1,000 shot minimum looks like this, but 1,200 looks like that). My approach is to add a regressor to observed results, pulling the goaltender back toward a single-number prediction made heading into the season from a simple linear model. The model inputs are last season’s results, shots against, partner performance, age, and whether it was a rookie season. This prediction acts as the prior (prior probability distribution, red line below), our best guess of how the goaltender will perform that particular season before allowing evidence (results) to pile up.
If a 25-year-old rookie is brought up from the AHL, we will probably expect below average results, say an extra goal against every 100 shots (-1%). If they post a 30-save shutout in their first game, the evidence (the shutout) wouldn’t necessarily overwhelm the prior. Combining our prior beliefs and the evidence (realized save percentage) into a posterior (posterior probability distribution, blue line below), our updated estimate of their results will be better than the prior of -1%, but not by much. However, after 10 games of superb results the posterior will begin to move into positive territory.
Piling on the evidence
How quickly does the evidence overwhelm the prior? That depends on the prior strength. We can imagine the prior as a synthetic goalie put in net for a set number of shots recording the same results as the prior expectation of them. So if we have a strong prior, we might ‘simulate’ close to a season of data before considering actual results. A weak prior might only be a hundred shots. The weaker the prior, the quicker the actual results and posterior results converge, as seen above.
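The synthetic-goalie description can be sketched as a shot-weighted blend of the prior and the observed results. This is a simplified illustration, not necessarily the exact update used here: the 1,000-shot prior strength, the 3 expected goals in the shutout, and the 300 shots at +4% are all assumed numbers.

```python
def regressed_lift(prior_lift, prior_shots, observed_lift, observed_shots):
    """Blend prior and observed save % lift, weighting each by its shot count."""
    total_shots = prior_shots + observed_shots
    return (prior_lift * prior_shots + observed_lift * observed_shots) / total_shots


PRIOR_LIFT = -0.01    # rookie prior from the text: one extra goal per 100 shots
PRIOR_SHOTS = 1000    # hypothetical prior strength

# A 30-save shutout where we assume 3 goals were expected: lift = 3 / 30 = +10%.
after_shutout = regressed_lift(PRIOR_LIFT, PRIOR_SHOTS, 0.10, 30)

# Roughly 10 superb games: assume 300 shots at +4% lift.
after_ten_games = regressed_lift(PRIOR_LIFT, PRIOR_SHOTS, 0.04, 300)

# With a weak 100-shot prior the same shutout moves the estimate much further.
weak_prior_after_shutout = regressed_lift(PRIOR_LIFT, 100, 0.10, 30)
```

Under these assumptions the shutout only nudges the estimate above -1%, ten strong games push it slightly positive, and a weak prior lets one game dominate.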
What prior strength best stabilizes results over the season in order to best use in prediction? We will test that out later.
Regressed – using a Bayesian approach we will test various prior strengths in order to create a metric with a good balance between efficiency and workload.
Rebound Adjusted – Removing some of the noise that rebounds can add when using expected goal models to measure shot quality faced by a goalie.
This metric satisfies both philosophically and statistically. Philosophically, we are measuring goaltender performance by what they do with each initial (non-rebound) shot against, based on the features we know about it, and indexing their results to league average.
Statistically, when trying to predict future results, this metric performs better than raw save %, save % over expected unadjusted for rebounds, and unregressed save % over expected adjusted for rebounds. Though this isn’t always a high bar to clear.
The easiest way to forecast the future is to look at the past. But how far into the past? Is yesterday more relevant than the day before it, and by how much?
A standard method to forecast athlete performance uses the marcel framework, which has its roots in baseball and has been adapted for hockey numerous times. Results from prior seasons are aggregated and given less weight the further in the past they are.
A two-season marcel projecting 2018-19 results might weight 2017-18 results by 75% and 2016-17 results by 25%, totalling 100%. If we wanted a feature to represent goaltender shots faced, and in 2017-18 they faced 2,000 and in 2016-17 they faced 1,000 shots, using the 75-25 weights, our representation of shots faced would be 1,750 ((2000 * 0.75) + (1000 * 0.25)).
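The weighting above can be sketched in a few lines. The function name is illustrative, and dividing by the sum of the weights is an added convenience so that weights need not total exactly 100%.

```python
def marcel_weighted(values, weights):
    """Weighted average of per-season values, most recent season first."""
    if len(values) != len(weights):
        raise ValueError("values and weights must be the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# 2017-18 (2,000 shots) weighted 75%, 2016-17 (1,000 shots) weighted 25%.
shots_faced = marcel_weighted([2000, 1000], [0.75, 0.25])
```

The same function applies to any per-season input, such as save % over expected.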
Like our prior strength parameter, the parameters that best capture history (lookback seasons) and recency (how to weight each season) can be tested.
Building the Grid
The goal of the analysis is to best predict future performance, and we have a few parameters we want to test to best generate model inputs and targets – prior strength, marcel lookback seasons, and relative weighting of lookback seasons. For each parameter, we can test various values (e.g. 100, 400… 3000 prior shots, 1…5 lookback seasons, 10 different weighting configurations) and then test model performance for each of the 350 unique combinations of parameters.
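A sketch of how such a grid could be enumerated follows. Only the prior strengths 100, 400, and 3000 come from the text; the intermediate prior values and the weighting labels are placeholders chosen so the counts (7 x 5 x 10) multiply to 350.

```python
from itertools import product

# 100, 400 and 3000 appear in the text; the values in between are placeholders.
prior_shots_grid = [100, 400, 800, 1200, 1800, 2400, 3000]
lookback_grid = [1, 2, 3, 4, 5]
# Ten hypothetical labels standing in for the actual weighting configurations.
weighting_grid = [f"weights_{i}" for i in range(1, 11)]

# Every combination of prior strength, lookback window, and weighting scheme.
grid = list(product(prior_shots_grid, lookback_grid, weighting_grid))
n_combinations = len(grid)
```

Each tuple in `grid` would then drive one pass of feature generation and model fitting.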
Under the Hood
Each parameter combination is used to create the:
Target variable – regressed, rebound adjusted save % over expected
Marcel-weighted regressed, rebound adjusted save % over expected
Marcel-weighted shots against
Marcel-weighted even-strength rebound adjusted save % over expected
Marcel-weighted rebound adjusted save % over expected of partner goaltenders
For each test season, we calculate the target variable and aggregate the input metrics from prior seasons. We can then train a few different models exploring the relationship between marcel-weighted prior metrics and unseen future results.
Each model splits out 80% of the 576 goalie-seasons since 2010-11 for training. The caret package is used to create a cross-validated model by splitting the training data into 5 folds and repeating the process 5 times in order to find the optimal tuning parameters. The remaining 20% of the data is held out and the model performance is measured on that unseen data. Four models are fit.
Random Forest Model (4 inputs) – input features of regressed results, shots against, prior even-strength results, and age. This ensemble of decision trees looks for splits in the data that might be useful in predicting future performance.
Linear Model (3 inputs) – input features of regressed results, shots against, and prior even-strength results. Simple model solely based on prior results.
Linear Model (4 inputs) – input features of regressed results, shots against, prior even-strength results, and age. The model hopes to balance performance with age.
Linear Model (5 inputs) – input features of regressed results, shots against, prior even-strength results, age, and performance of partner goalies.
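The 80/20 hold-out and the 5-fold cross-validation repeated 5 times can be sketched with plain index shuffling. This is not the caret implementation, only an illustration of the resampling scheme; the function names and seeds are arbitrary.

```python
import random


def train_test_split(n_rows, train_share=0.8, seed=0):
    """Shuffle row indices and split them into training and hold-out sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(n_rows * train_share)
    return indices[:cut], indices[cut:]


def repeated_kfold(train_idx, k=5, repeats=5, seed=0):
    """Yield (fit, validation) index lists for k folds, repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = train_idx[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, validation in enumerate(folds):
            fit = [row for j, fold in enumerate(folds) if j != i for row in fold]
            yield fit, validation


train_idx, test_idx = train_test_split(576)
splits = list(repeated_kfold(train_idx))  # 5 folds x 5 repeats = 25 resamples
```

Tuning parameters would be chosen by averaging validation error over the 25 resamples, and the hold-out rows are touched only once at the end.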
Each model is then applied to the roughly 60 goaltenders with NHL experience likely to be on opening day rosters. For each of the 4 models, we have predictions from 350 different parameter combinations, and only those with good out-of-sample testing scores are considered. Those out-of-sample scores are then used to take a weighted average of the predictions, along with each of their confidence intervals. Finally, the 4 model predictions and confidence intervals are averaged together to represent a reasonable forecast for the upcoming season.
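One way to picture the score-weighted averaging is sketched below. The choice of score (something like an out-of-sample R-squared), the cut-off of zero, and all the numbers are assumptions made for illustration.

```python
from statistics import mean


def score_weighted_average(predictions, scores, min_score=0.0):
    """Average predictions weighted by score, dropping runs at or below min_score."""
    kept = [(p, s) for p, s in zip(predictions, scores) if s > min_score]
    if not kept:
        raise ValueError("no parameter combination passed the score cut-off")
    total_score = sum(s for _, s in kept)
    return sum(p * s for p, s in kept) / total_score


# Hypothetical predictions for one goalie from three parameter combinations,
# with hypothetical out-of-sample scores. The third run is filtered out.
predictions = [0.002, 0.004, -0.001]
scores = [0.10, 0.30, -0.05]
one_model_forecast = score_weighted_average(predictions, scores)

# Final step: a simple average across the four models (values are made up).
final_forecast = mean([one_model_forecast, 0.0030, 0.0025, 0.0040])
```

The same weighting can be applied to the lower and upper bounds of each confidence interval.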
Each goaltender has a forecast presented with a range of results, given their statistical profile and the modelling process. A lower peak and wider plot distribution represent a more uncertain prediction. It appears that age and prior inconsistency generally increase the uncertainty, which makes intuitive sense. However, due to the nature of the modelling process, the exact relationship is a bit obfuscated.
It’s also important to note that this metric represents both efficiency (per shot) and workload. Goaltenders that have demonstrated the ability to handle a heavy schedule, like Frederik Andersen, are given more credit since their above average results will likely be across more shots (overcoming the regressor). Taking extra starts from a back-up or replacement-level goaltender will likely benefit the team.
Thinking About Uncertainty
There’s obviously a lot of overlap between many goalies, which might make it unclear how exactly a decision-maker might glean information from the analysis. It might be more helpful to simulate seasons by ‘drawing’ results from the calculated distribution and comparing results to peers like we would in the card game ‘War.’ If we sample from the distributions of Braden Holtby and Peter Budaj 1000 times, Budaj would post superior results about 3% of the time.
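A rough sketch of that head-to-head draw is below. It assumes each forecast can be approximated by a normal distribution, and the means and standard deviations are invented rather than taken from the actual forecasts.

```python
import random


def share_b_beats_a(mean_a, sd_a, mean_b, sd_b, n_sims=1000, seed=42):
    """Share of simulated seasons in which goalie B outperforms goalie A."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        draw_a = rng.gauss(mean_a, sd_a)
        draw_b = rng.gauss(mean_b, sd_b)
        wins += draw_b > draw_a
    return wins / n_sims


# Hypothetical forecasts: a strong starter (A) against a weaker veteran (B).
upset_share = share_b_beats_a(0.004, 0.003, -0.004, 0.003)
```

With distributions this far apart the weaker goalie still wins a small share of simulated seasons, which is the point of the exercise.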
This exercise can be done for each team with veteran goalies in their system against 2 veteran free-agent goalies, Kari Lehtonen and Steve Mason. While goalies like Greiss and Darling are projected to only outplay Steve Mason in about 20% of simulated seasons, this apparent gamble could also factor in things like contract status, age, or injury risk. In any event, we can capture the uncertainty and provide the opportunity to make a calculated decision.
An alternative calculation is to simulate absolute goals prevented over expected for each team. Based on the rostered goaltenders’ forecasted outcomes, we can create a distribution of possible outcomes by simulating their season thousands of times. As a point of reference, last season that range was about +/- 40 goals, representing about a 15 point swing in the standings. There are no certain outcomes, but you can maximize the probability of ending up in positive territory.
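The team-level simulation can be sketched the same way, converting each goalie's per-shot lift into goals by multiplying by an expected shot count. The normal approximation, the roster, and every number below are assumptions for illustration.

```python
import random


def simulate_team_goals_prevented(goalies, n_sims=5000, seed=7):
    """Simulate season totals of goals prevented over expected.

    goalies: list of (mean_lift, sd_lift, expected_shots) tuples.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        season = sum(rng.gauss(m, sd) * shots for m, sd, shots in goalies)
        totals.append(season)
    return totals


# Hypothetical starter and backup.
roster = [(0.003, 0.004, 1800), (-0.002, 0.005, 700)]
totals = simulate_team_goals_prevented(roster)
share_positive = sum(t > 0 for t in totals) / len(totals)
```

`share_positive` estimates how often the tandem finishes in positive territory under the assumed forecasts.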
Every season brings its own hard lessons on how difficult it can be to predict goaltender performance. It makes sense, then, that any forecast shouldn’t avoid uncertainty, but rather try to embrace it.
Teams and decision-makers are best aided by understanding that future performance is only probabilistic. Carey Price might be one of the most talented goaltenders in the league, but how likely was his poor performance last season? Unlikely, but certainly not zero. That’s true of every goalie heading into the 2018-19 season.
The universe of goaltenders is more talented than ever, so it’s no surprise that the top talents in the world, when indexed to each other, are not separated by much. This means that as the upcoming season unfolds, the results we observe will quickly deviate from what is expected in many cases. In some of those, they will reconverge, but others might see that opportunity lost to injury or an opportunistic teammate.
But it is important to know what to expect from goaltenders. Evaluators might have an easier time forecasting bottom-6 skater performance, but the impact on the outcome of the season is considerably less.
Teams only get to bet a few chips a season on goaltenders. The edge might be small, but the payoffs compound over the course of the season and are often season-defining. A statistical forecasting approach that incorporates uncertainty can help them quantify that bet.
Thanks for reading! For any custom requests, ping me at @crowdscoutsprts or firstname.lastname@example.org. Code for this and other analyses can be found on my Github.
Are goaltenders’ big-time playoff performances best explained by their regular season or historical playoff results? Can some goalies just turn it on when it counts more?
Clutch Off the Bench
The 2018 1st round series featuring the Columbus Blue Jackets and Washington Capitals was an interesting case study in playoff goaltending performance.
Starring for Columbus was Sergei Bobrovsky. The reigning Vezina Trophy winner was coming off another very good season, hoping to continue rolling in the playoffs. However, despite Bobrovsky’s accolades he has never advanced past the 1st round and has been uncharacteristically below average preventing goals in all of his prior playoff appearances.
In another subplot, Washington actually began the series with Philipp Grubauer. He had been excellent in the regular season but relatively untested in the playoffs (he played parts of two clean-up games, neither went great). This decision put Braden Holtby on the bench, who had a very pedestrian regular season but had been at or above average in all 5 of his previous playoff appearances. Cumulatively, playoff Holtby has prevented about 1.1% more goals than expected, 2nd only to Jonathan Quick among goalies entering the playoffs with at least 1,000 playoff shots to their names.
Everyone knows what happened next. Grubauer wasn’t great, and Washington dropped their first two games at home. Holtby came in and delivered 4 straight above average performances, while Bobrovsky ended the series with 3 straight below average games. Washington took the series 4-2.
A few interesting questions come out of this series that I hope to explore. Was Washington coach Barry Trotz right to go with the ‘hot-hand’ over the ‘proven-vet’ by starting Grubauer? Is it likely that a goalie might be good in the regular season, but below average in the playoffs? More generally, if we are trying to explain goaltending performance in the playoffs, what matters more? Past playoff performances, regular season results, or just career results?
Can someone simply turn it on in the playoffs after a below-average regular season?
High Stakes Noise
Let me preface most of this exploration with the understanding that the idea of ‘clutch’ or ‘performance-when-it-matters’ is problematic from a statistical perspective. A few bounces over a playoff series might dictate whether the outcome is perceived as ‘clutch’ or ‘choking.’ In reality, a good game or bad game doesn’t have much effect on the outcome of the next game, but if you flip a few heads in a row (bad games) you are out of the playoffs, while a few tails mean you advance. Someone has to advance, so a ‘clutch’ narrative might be created from chance outcomes alone.
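To make the coin-flip point concrete, here is a small sketch that treats every game as a fair coin and asks how often at least one of 16 playoff starters opens with 4 straight good games purely by chance. The field size, streak length, and 50% success rate are illustrative assumptions.

```python
import random


def share_of_fields_with_hot_start(n_goalies=16, streak=4, n_sims=10000, seed=1):
    """Share of simulated playoff fields where some goalie opens on a lucky streak."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        hot = any(
            all(rng.random() < 0.5 for _game in range(streak))
            for _goalie in range(n_goalies)
        )
        hits += hot
    return hits / n_sims


simulated = share_of_fields_with_hot_start()
# Closed-form check: 1 minus the chance that all 16 goalies miss the streak.
exact = 1 - (1 - 0.5 ** 4) ** 16
```

Under these assumptions a ‘clutch’ opening run shows up in most playoff years even though no goalie in the simulation has any clutch ability.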
With a small sample like a playoff series, a bounce or two can change the narrative of those outcomes. Analysts can deal with this by framing the outcome with a range of uncertainty. Fewer shots or games mean more uncertainty. Ultimately, we can’t be too sure the outcome of a series reflects the ‘truth.’ Holtby could have come into the playoffs with his game in top condition and his vitals in the optimal range to deliver a clutch performance, but if a few Seth Jones’ shots bounced off of someone’s ass in game 3 or 4, the narrative is completely different. Drilling down further (tied in the 3rd period only, etc) only compounds the problem of insufficient sample size.
Is Winning a Skill?
It’s important to the scientific process that we start from the null hypothesis, that there is no effect, and then see whether the data can reject it. A ‘clutchness’ factor is no different: we should assume it doesn’t exist. It might not exist as a differentiator at the NHL level for good reason, since a propensity to fold in critical moments would likely prevent a goaltender from making it that far.
Regardless of whether you think being clutch might be an innate skill some have or whether those differences are incredibly tiny at the NHL level, we have to acknowledge that the finite and imperfect nature of the data will likely be a limiting factor.
What Does the Data Look Like?
The objective of this analysis is to explain goaltender playoff performances using data available prior to round 1, game 1. The target of interest is playoff goal prevention per shot, save % less expected save %. If a goaltender faced 25 expected goals on 250 shots, but only conceded 20 actual goals, this would be a 2% lift (5 / 250 or 92% – 90%). Actual save % may deviate wildly from expected save % in small sample sizes like the playoffs. A few bad goals and/or unlucky bounces against will likely prevent a chance of redemption.
To help explain the selected measure of playoff performance for each season, the save % lift can be calculated for:
the regular season performance prior to that playoffs
entire career regular season performance prior to that playoffs
entire career playoff performance prior to that playoffs
a proxy for goaltender workload at the onset of the playoffs
Visualizing the relationship between the save % differences, we see a small relationship and correlation between each. As predicted, the variance in playoff results (y-axis) is higher than that of the explanatory variables (x-axis), which have larger sample sizes. Initially, it appears regular season results are most correlated with playoff success (a perfect correlation would be equal to 1 with each point falling along the grey diagonal line). Career regular season results have the least variance and lowest correlation.
Do any or all of these metrics matter when explaining playoff performance?
The Weight of the Playoffs
In order to understand how each of the explanatory inputs matter we can use a multiple linear regression. This helps us quantify the direction and strength of the relationship between the explanatory variables and playoff performance.
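For readers who want to see the mechanics, below is a small stdlib-only sketch of a multiple linear regression solved through the normal equations. The data are synthetic and the coefficients (0.001, 0.5, 0.1) are invented so the fit can be checked; they are not the coefficients estimated in this analysis.

```python
import random


def ols(X, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot_row = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Synthetic goalie-seasons: (regular season lift, prior playoff lift).
rng = random.Random(0)
X = [(rng.gauss(0, 0.01), rng.gauss(0, 0.01)) for _ in range(40)]
# Noise-free target built from made-up coefficients, so the fit should recover them.
y = [0.001 + 0.5 * regular + 0.1 * playoff for regular, playoff in X]
intercept, b_regular, b_playoff = ols(X, y)
```

In practice a statistics package would also report standard errors and p-values, which is what the significance discussion relies on.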
Running a regression of 122 goalie-seasons facing at least 100 shots in the playoffs and 1000 shots in the respective regular season results in the model below.
Residual standard error: 0.01536 on 124 degrees of freedom
Multiple R-squared: 0.1342, Adjusted R-squared: 0.09341
F-statistic: 4.292 on 4 and 124 DF, p-value: 0.002752
Notably, this is a pretty weak model, confirming the intuition that playoff performance is tough to explain. But directionally, regular season results are more significant and their coefficient is larger than that of career playoff results. Also noteworthy, career regular season results have no significant effect (though directionally positive) on the playoff results once the current season and career playoff results are controlled for. Workload has no significant effect, though it is directionally negative. Being a playoff rookie also has no significant effect, but is directionally negative too.
Formula For Success
Dropping the insignificant variables and re-running the regression creates the formula below to (loosely) calculate the expected playoff results.
So, for example, Holtby entered the 2018 playoffs with a save % lift of +1.14% in prior playoffs, but only -0.18% in an uncharacteristically mediocre regular season. The regular season results are weighted about 4 times more heavily in the formula, resulting in an expected -0.2% save % lift in these playoffs, which he’s exceeded to date.
Bobrovsky’s prior playoff results (-2.27%) pulled down his regular season results (+0.5%), expecting a -0.33% performance. He finished with a -1.22%.
Despite a great regular season, Grubauer’s expected save % lift was about 0% due to poor prior playoff appearances pulling it down.
If there’s anything to take away from this analysis, it is that explaining playoff performances is difficult. This was likely obvious to anyone who’s watched playoff hockey. Small sample sizes, survivor bias, and out of control narratives: playoff hockey has everything to confound a good analysis.
That said, some things do matter directionally. Entering the playoffs after a good regular season is probably more important than a good playoff track record. Braden Holtby may have bucked this trend playoffs-to-date, but it was probably more likely his regular season results were lower than his true talent suggests.
The results also suggest that waiting for a goalie’s playoff results to regress to a career average is generally fruitless. This makes sense intuitively: a goaltender may change teams and systems, and they develop and regress. Regular season results likely give enough of a snapshot of where their game is at that entire career regular season results are unnecessary. Marc-Andre Fleury entered the 2018 playoffs with excellent regular season results, average career regular season results, and below average playoff results. This was a recipe for success based on the basic model (expected +0.5%, chart above) and he’s subsequently delivered with excellent results (he currently has the best save % lift of goalies with over 500 shots in the dataset going back to 2011).
With all of these considerations, there is nothing to suggest a goalie can simply turn it on for the playoffs. Proven experience certainly helps, but it’s more important to have posted good results with the most current team and defensive conditions.
Was Trotz right to start Grubauer? Probably. Playoff series are short and Grubauer had played excellently during the regular season. However, past playoff results do have a partial explanatory effect, partly because there are other considerations in the playoffs. Playing styles can change, physicality around the net can increase, and facing a well game-planned opposition for 4 to 7 games means that tendencies and tempers can amplify. Holtby had experience in those situations, not enough to completely offset the difference in their regular seasons, but close.
Bobrovsky can take comfort in the fact that his playoff results should have been better than they turned out this season. There’s likely no use in him re-visiting these playoff letdowns; his best bet is to look forward, focusing on another big season and carrying that performance forward. Either the results will come naturally or maybe he will be carried up by some positive unexplained variance.
Goaltending tactics have evolved considerably in the last 30 years, confirmed by rising save percentages. The Reverse Vertical Horizontal (RVH) is a relatively new goaltender tactic, now widely adopted. Is there a meaningful impact in the data?
What is the RVH?
Growing up, almost every coach I had wanted me to stand up ‘more’ – more being a relative term. Most made peace with the fact that I was going to try to make the same type of saves as Patrick Roy or Dominik Hasek, but since I was still a kid, it would probably help if I stood up once in a while. Still, they had to choose their battles wisely, and most picked the same hill to die on – bad angle shots. It was simple geometry really: with the right stick positioning an adolescent goalie could stand there and cover 100% of the net. However, this just led to the terrible experience of having people hack away at your feet waiting for either a goal or a teammate to save you – glued to the post you couldn’t cover the puck, and dropping to your knees with the puck that tight would create a hole anybody could hit.
By the time I got to junior and had a goalie coach, we worked in the Vertical-Horizontal (VH) tactic to deal with shots from sharp angles. The short-side pad would seal the post (vertical) and the back leg would drop, sealing the ice (horizontal). There was always a risk of getting your stick tied up and/or getting beat between the post and skate, but used properly it was pretty tough to beat from range. However, there were trade-offs. Leading with the pad tied up the hands a bit, meaning rebounds were more difficult to control. If there was a rebound, the VH was configured to push off the post, but only in one direction. If you had kept your knee tight to the goal line, but needed to push to the top of the crease, too bad, you were pushing across the goal line.
The Reverse Vertical Horizontal (RVH) flipped the configuration of the pads, so the strong pad seals the ice (horizontal) and back leg remains anchored (vertical), freeing up the hands and stick more to make plays and allowing rotation with the back leg and push off with the post leg (I would have never dreamed of this, most nets growing up were easy to knock off whenever you needed a convenient whistle). The back leg can anchor or drop into a butterfly quickly which gives the RVH more flexibility when repelling a play originating from a sharp angle compared to the VH.
This added flexibility has meant RVH has mostly supplanted VH as a tactic for sharp angle shots, but it’s not perfect either since it leaves a few holes along the post above the pad, particularly over the shoulder. Additionally, because of its flexibility, some goaltenders become too reliant on it, defaulting to it prematurely or in situations that don’t call for it. Shooters are also able to pick up on trends. After all, throughout the VH and RVH eras it was always an option to play sharp angle shots more passively by standing up as long as possible (perhaps anticipating a pass or change of angle) or more aggressively by moving off the post and squaring up. The RVH is a great tactic, but it’s up to the goalie to assess the shooter’s speed, handedness, passing options, and defensive support and make a read rather than simply defaulting to the RVH.
Looking at NHL play-by-play from 2010-2018, we can isolate shots where the RVH has presumably been used properly and possibly improperly to see if there are any patterns in the:
Share of shots resulting in goals (obvious why this matters)
Number of shots attempted per game (perhaps RVH has encouraged or discouraged bad angle shots)
Share of shots resulting in rebounds (are some tactics more prone to rebounds than others)
Shooting % on rebounds, or calculated expected goals on rebounds if the sample size is too small (are some tactics more prone to bad rebounds that are more likely to be converted into goals or possibly leave the goalie less likely to make the rebound save)
Observing these metrics over the last 8 seasons might reveal a meaningful change in success rates, but it is important to caution that while this might appear to be a testable tactic, in a complex game like hockey, effects can be hard to pin down. We don’t have passing data to reveal if, for example, a more aggressive tactic led to more passing from the sharp angle and consequently more dangerous locations, though the number of attempts per game might lend a hint.
Either way, it’s possible macro trends don’t reveal anything meaningful since there’s much that is unobserved and the data itself is imperfect (though the coordinate data has been adjusted to hopefully improve the accuracy of shot location). That said, there may be potentially meaningful and interesting information in the data that might inform a more concentrated deep-dive later.
What does the data look like?
For this analysis, we will focus on bad angle shots where a goalie might select the RVH tactic, either properly or improperly. To do this we can limit to shots taken from a 45° angle or less and within 10 feet from the goal line (visualized below). Further, we will want to break out combinations of:
‘Close’ vs ‘Long’ Shots – using the cut-off of 12 feet from the net, look at how goalies have dealt with shots where they wouldn’t have time to react, and compare to longer shots.
‘Poor Angle’ vs ‘Decent Angle’ Shots – the RVH is generally recommended on poor angle shots (0°-22.5° from the goal line) but could be over-used when the puck is at a decent angle (22.5°-45°).
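The bucketing above could be sketched as follows. It assumes the common play-by-play coordinate grid with the goal line 89 feet from centre ice and the net centred at y = 0; that convention, the function name, and the handling of boundary values are assumptions, not the exact code behind the charts.

```python
import math

GOAL_LINE_X = 89.0  # assumed: feet from centre ice to the goal line


def classify_bad_angle_shot(x, y):
    """Return (distance_label, angle_label), or None if not a bad angle shot."""
    depth = GOAL_LINE_X - abs(x)  # feet out from the goal line
    lateral = abs(y)              # feet from the centre of the net
    if depth < 0 or depth > 10:
        return None               # behind the net or more than 10 feet out
    # Angle measured up from the goal line: 0 is flat along the goal line.
    angle = math.degrees(math.atan2(depth, lateral))
    if angle > 45:
        return None
    distance = math.hypot(depth, lateral)
    distance_label = "close" if distance < 12 else "long"
    angle_label = "poor" if angle < 22.5 else "decent"
    return distance_label, angle_label


examples = [
    classify_bad_angle_shot(85, 10),   # 4 ft out, 10 ft wide
    classify_bad_angle_shot(80, -15),  # 9 ft out, 15 ft wide
    classify_bad_angle_shot(70, 5),    # 19 ft out: outside the region
]
```

Each retained shot lands in one of the four distance-by-angle buckets used in the charts.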
Identifying these combinations of angle types for analysis can be visualized on a rink (cumulative shooting percentage labelled). The average shooting percentage across all shots is about 6.6%, so shots closer than 12 feet from a poor angle are about as dangerous as the average shot, while getting a few feet out to a decent angle improves the shooting percentage by 2%. Another important consideration is that I crudely bucketed this data, which is generally not ideal, but is helpful for the purpose of the analysis. The coordinate data itself isn’t perfect either, but some home-rink bias adjustment has been applied, so hopefully it won’t be systemically biased across zones or time.
Trends By Season
A quick note about the charts below. They focus on shots at 5v5 and 5v4, shown separately, since the mix of shot types we might see from each of the zones above differs by game state. On a 5v4, we’d expect to see more one-timers as a share of total shots from these zones, increasing the expected shooting percentage, and trends there might be the result of changing powerplay tactics. Shots while gameplay is 4v4 or 3v3 are also more likely to be dangerous. If a shooter shoots from a sharp angle in 3v3 overtime, for example, it’s probably because they expect to score.
We must also deal with both signal and noise in the data: are fluctuations in shooting percentage caused by anything material or just randomness? Our default assumption is that the RVH likely hasn’t had any impact on bad angle shots, and the burden of proof would be on the analysis to discover a statistically significant difference in the data. Ideally, we’d have some sort of intervention period where all NHL goalies adopted the RVH. Unfortunately, this would never be the case, so we can only observe loose trends over time at the macro-level.
Without a clear way to compare a “before and after” period for all goalies, we can create uncertainty bars for each season by considering sample size. Say we had observed 10 goals on 100 shots from a particular area. We wouldn’t be too sure of that 10% shooting percentage: a post here or there, and it may have been 6% or 14%. What if we observed 100 goals on 1,000 shots from the same spot? We can be increasingly sure of that 10%. To reflect the impact sample size has on certainty, the analysis will use the standard deviation of a beta distribution, drawing error bars at +/- 1 standard deviation.
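The error bars can be sketched with the closed-form standard deviation of a beta distribution. The parameterization below, Beta(goals, saves) with no additional prior, is an assumption, since the exact choice isn't spelled out here.

```python
import math


def beta_mean_sd(goals, shots):
    """Mean and standard deviation of Beta(goals, shots - goals)."""
    a = goals
    b = shots - goals
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


# Same 10% shooting percentage, very different certainty.
small_mean, small_sd = beta_mean_sd(10, 100)     # sd is roughly 0.030
large_mean, large_sd = beta_mean_sd(100, 1000)   # sd is roughly 0.0095

# Error bars at +/- 1 standard deviation around the observed rate.
small_bar = (small_mean - small_sd, small_mean + small_sd)
large_bar = (large_mean - large_sd, large_mean + large_sd)
```

Ten times the shots shrinks the bar to roughly a third of its width, which is the usual square-root relationship between sample size and uncertainty.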
The primary job of a save or tactic selection is to stop the puck, so naturally, the first trend to look at is shooting percentage from each segment of the ice.
Starting with 5v4 shots, the first trend that jumps out is the rise in shooting % on shots within 12 feet up until after the lockout-shortened season, falling dramatically, and then slowly rising again. This presumably reflects a cat and mouse game between shooters and goalies, but may also involve powerplay and penalty kill strategies countering each other. Shooting percentages from longer bad angle shots followed a more muted version of this trend.
At 5v5, trends are less pronounced. Interestingly, shooting percentage on close shots from a poor angle jumps above that from a decent angle in 2014-15, which is strange, before normalizing again.
Looking at just 5v5, rebounds from these areas have generally been less prevalent than on the average shot (3.4% of shots in 2017-18 resulted in rebounds), but the rate has been trending upward in all 4 areas of the ice. It’s tough to infer a definite trend because we have some more uncertainty (rebounds are rarer than goals), but it seems the rebound rate is not falling.
Rebound Shooting Percentage
Rebounds are a problem because they are very dangerous: they are converted to goals about a quarter of the time, making them 4 times as dangerous as a non-rebound shot. Are rebounds on shots from poor angles getting more dangerous?
At 5v5, two things are apparent. First, rebounds from poor angles are generally ‘safer’ than average: the shooting percentage is about 10% lower than average. This suggests goalies have generally done a good job of keeping rebounds on the strong side, preventing pucks from getting to the middle of the ice or the weak side, where extra chances against are more dangerous.
Secondly, because we are dealing with a fraction of a fraction, our sample size is quite small and the error bars are large.
Alternatively, we could look at the expected goal value of the rebounds to reclaim some of this sample, where we can calculate factors such as the total distance the puck travels between shots and the angular velocity the goalie might have to deal with.
Both of these views suggest that there isn’t really a definitive trend, since we are working with 3% of the original data (already limited to bad angle shots), which makes the results pretty noisy. An interesting finding is that rebounds on shots from poor angles can be more dangerous than rebounds on shots from slightly better angles, possibly due to goalies not being square to the initial shot in these cases.
Bad Angle Attempts as Share of Total
It’s also important to check whether shooters are attempting more shots from bad angles as a share of total shots. This might be the result of defensive pressure, but it also might signal shooters seeing and testing holes.
It appears most 5v4s are moving away from bad angle shots, notably on shots over 12 feet. However, at 5v5 there had been a modest increase in attempted bad angle shots from further than 12 feet away (until this season). Some players are definitely happy to test goalies from the seemingly impossible angle – and why not? They don’t have to chase down the puck if they miss.
We can also look at shooter handedness and how it has impacted goal, rebound, and attempt rates over time. Generally, shooters are more trigger happy when they are on their strong side (meaning the shooter’s stick blade is closer to the centre of the ice if they are on their forehand), though success rates for shooters on their weak side are in the same range. Shooters have become less successful on their weak side on the powerplay, but attempts haven’t fallen considerably to reflect this.
Goaltender Specific Trends
These general trends might have some interesting nuggets and reveal things we might want to explore further, but they can’t reveal much regarding tactical usage of the RVH because each goaltender will have implemented it at different times, if at all. While it would be nice to have a definitive list of when goalies adopted the RVH, that might be a little simplistic. Early adopters might have had an advantage since shooters hadn’t picked up on its relative weakness. It’s also possible (and pertinent to our analysis) that goalies have become over-reliant on it in more recent seasons, defaulting to it in improper situations, which could have an undesirable effect.
Without a completely clean solution, one way to test individual goaltender effectiveness on poor angle shots is to treat each off-season (where tactical changes would normally be implemented) as a divider between a ‘before’ and ‘after’ period. We can calculate save percentage in each period and compare them, testing for statistical significance in each sample. Where the save percentage differs significantly from the before period to the after period, it might draw interest and warrant a deeper look. Was there a tactic change or something else causing a meaningful change in results?
For this part of the analysis, we will limit to the 24 goaltenders that have faced at least 100 bad angle shots a season in at least 5 of the 8 seasons we have data for. Each off-season will be treated as a ‘split’ or intervention period. We’ll only focus on save percentage since rebounds are rarer, making the task of finding meaningful differences tougher.
Sergei Bobrovsky had quite the change in the 2012 off-season going from the Philadelphia Flyers to the Columbus Blue Jackets (along with some time in the KHL waiting for the lockout to end), and ultimately winning the 2013 Vezina Trophy. In Columbus, he also had a new goalie coach, Ian Clark (I both attended and worked at Ian’s goalie schools in the past, full disclosure). Among other things, Clark helped Bobrovsky implement the RVH. If we look at Bobrovsky’s 5v5 save percentage on all bad angle shots 2010 – 2012 and compare it to 2013 – 2018 is it materially different?
In the ‘before’ period, Bobrovsky allowed 14 goals on 194 shots, for a 92.8% save percentage. Since then, he’s conceded 25 goals on 743 shots, his save % rising to 96.6%. If we run a test of statistical significance to check whether there is enough evidence (shots) to determine that these proportions are meaningfully different, we get a p-value of 0.029. Stated otherwise, if nothing had really changed, a difference at least this large would happen by chance alone about 2.9% of the time (showing my work below).
2-sample test for equality of proportions with continuity correction
data: c(180, 718) out of c(194, 743)
X-squared = 4.7966, df = 1, p-value = 0.02852
alternative hypothesis: two.sided
95 percent confidence interval:
prop 1 prop 2
In the soft sciences, convention suggests the ‘cutoff’ for statistical significance is a p-value of 0.05, so we can say with some certainty that this difference is likely not due to chance. We can never be sure from the data alone, but it seems likely that some combination of the move to Columbus, a new goalie coach, and adoption of the RVH had a positive effect on his save percentage from bad angles.
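For anyone who wants to check the arithmetic without the original tooling, here is a minimal standard-library sketch of a two-sample test for equality of proportions with a continuity (Yates) correction, applied to Bobrovsky’s saves and shots from above. The function name is hypothetical and this is a re-derivation of the quoted test, not the code from the notebook.

```python
import math

def two_proportion_test(successes, totals):
    """Chi-squared test (df = 1) for equal proportions, with Yates correction.

    successes: (saves_before, saves_after); totals: (shots_before, shots_after).
    Returns (chi-squared statistic, two-sided p-value).
    """
    s1, s2 = successes
    n1, n2 = totals
    f1, f2 = n1 - s1, n2 - s2          # goals against in each period
    n = n1 + n2
    # Cross-product difference of the 2x2 table, shrunk by the continuity correction.
    diff = abs(s1 * f2 - s2 * f1)
    corrected = max(diff - n / 2, 0.0)
    stat = n * corrected ** 2 / (n1 * n2 * (s1 + s2) * (f1 + f2))
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p_value = two_proportion_test(successes=(180, 718), totals=(194, 743))
print(f"X-squared = {stat:.4f}, p-value = {p_value:.5f}")
```

The result should land very close to the X-squared = 4.7966 and p-value = 0.02852 quoted above.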
Complete Goalie Splits and Save Percentage Results
We can do the same thing with Bobrovsky’s other 6 off-seasons and all 163 unique goalie-offseason splits. The results below will label any goalie-offseason where the p-value is less than 0.05. Goalies that experienced a significant change received their own color, the rest of the 24 qualifying goalies are represented by the ‘Other’ green.
Holtby, Rask, Elliott, Miller, and Bobrovsky all saw a notable rise in save percentage occurring somewhere between 2011 and 2014. Some of this might be attributable to tactical changes, though without talking to goalies, their coaches, and/or grinding on video we can’t necessarily assert for sure. However, it’s possible that RVH adoption helped drive some of this effect.
Luongo, Varlamov, and Price have all experienced a notable drop in success rate. Luongo stands out because he likely adopted the RVH in 2013, but saw a drop in the 2014 split.
Price particularly struggled this past season on shots >12 feet and <= 22.5°. He gave up more goals from that area last season (5) than from 2010-2017 (4). This helps identify a particular pain-point in Price’s poor season. Bad angle goals are easily preventable, and going back to the tape would help identify if tactics, luck, and/or laziness were at fault, which can help inform the proper adjustments.
If you are stricter and insist on a p-value of 0.01, only 2012 Brian Elliott in the ‘Close-Decent Angle’ area, 2012 Jonathan Quick in the ‘Long-Decent Angle’ area and aforementioned 2017 Price in the ‘Long-Poor Angle’ area saw significant changes at that level.
A weakness of this analysis is the ‘bucketing’ of data into specific areas, so it is possible a lot of borderline goals or shots ended up in one area or another by chance. Unlikely, but something to keep in mind.
Capturing the full impact of the RVH is a near impossible task since we don’t observe when it is actually deployed. But we can look at a proxy for when it might be deployed and investigate whether there was any meaningful impact on results. While incomplete, it might help us ask smarter questions and focus video work where it matters. Carey Price struggled from poor angles last season? Those clips can be isolated and analyzed to re-affirm the trend and possibly reveal why.
Analyzing tactical usage in hockey is often frustrating since everyone on the ice is basically playing a complicated version of rock-paper-scissors on skates trying to gain an advantage. We can observe goals, rebounds (kind of), and total attempts, but what if shooters have adapted by making more passes that lead to even more dangerous shots? There are rarely clean test and control cases we can use to attribute some change in results to a specific tactic.
We can, however, attempt to use data to help guide a more informed approach and use the framework above to begin to create and explore additional questions. Often looking from just a video perspective misses part of the equation. If you looked at all bad angle goals against when goalies were using the VH and compared them to the RVH, you wouldn’t have the complete picture. You want to look at all bad angle shots using each tactic, then look at the success rate of each. Of course, we can’t do that easily, so identifying proxies and exploring the data can help paint a more comprehensive view and sharpen the focus on where meaningful differences may exist.
Goaltending can frustrate fans and coaches alike because results from game to game can be inconsistent. Goalies can’t necessarily dictate the game, rather have to let the game come to them while employing tactics that give them the best chance to succeed – ‘playing the percentages.’ The evolution of goaltending tactics has largely been positive, as save percentages suggest. It appears the RVH has probably helped goaltenders deal with bad angle attacks, but this isn’t a one-way effect. There is evidence to suggest that rebound rates are rising and some goaltenders have had notable falls in save percentage from poor angles. Shooters will always adapt, so it’s important for goaltenders to critically assess the tactics they employ and continue to stay a step ahead.
Thanks for reading! A notebook with code for the analysis can be found here. Any custom requests ping me at @crowdscoutsprts or email@example.com.
The Stanley Cup Playoffs magnify all the frustrations people have with the goaltending position. Performances are tough to predict yet have an outsized impact on game outcomes, so what can you expect over a playoff series?
Playoffs, Pads, and Positional Paradoxes
Goaltending is a volatile position. With so much out of the goaltender’s explicit control, it’s extremely difficult to consistently deliver positive results. This can be true over the course of a season, but it is especially true over the course of a playoff series. Generally starting goaltenders in the playoffs have had good seasons, so at the margins, there isn’t much separation between most starters, usually not enough to predictably manifest itself over 4 to 7 games.
But this helps frame the paradox of goaltending. Tough to project, but few positions have more control of the outcome of the game. Looking at 2017-18 Wins Above Replacement, thanks to Corsica Hockey, 11 of the top 30 contributors were goaltenders. That’s in aggregate, too: goaltenders don’t play every game of the regular season like they normally do in the playoffs. When normalizing by games played, goalies make up the majority of the top 30 (17), and 10 rank above the most impactful skater, Connor McDavid.
So to the cynical and casual fan alike, the playoffs can simply appear to be a competition in waiting to see which goaltender gets hot at the right time. And endless frustration and soul-searching when the opposite happens.
What’s the best (healthiest) way to think about what you’ll get from your goalie in the playoffs?
Taking It One Game At a Time (TIOGAT)
Most starting goalies in the playoffs have a pretty good body of work during the regular season. Some good games, some bad, but probably more good than bad if their team made the playoffs. We can calculate a game-level performance by taking the difference between expected goals (the number of goals an average goalie historically would concede given what we know about those shots against, adjusting for rebounds, which goalies have some control over) and actual goals against, then normalizing by total shots. The result is sometimes referred to as deltaSv% or Save % Lift Over Expected. So, if a goalie faced 50 shots against, totalling 4 expected goals, but only gave up 3, that game would have been 1 goal prevented on 50 shots, or 2 per 100 shots (2% better than expected). It’s also important to note that, unlike save %, expected goals attempts to weight shots by situation, so a 5v5 shot and a 5v3 shot can each be compared to their relative, historical probabilities of being a goal (though not perfectly). Using all-situation results, as opposed to just even-strength, creates a more reliable metric.
We can use a histogram to visualize the distribution of John Gibson’s 2017-18 performances, where one game is placed into each bin. I highlighted his first 2 playoff performances (as of 4/15/2018). About 63% of the time he made more saves than an average goalie would (a positive Sv% Lift Over Expected). In his 2 playoff games, he’s had 1 game where his Sv% was the same as we’d expect from an average goalie, given what we can quantify about shot quality against and another about 1.7% better than expected ((3.96 xG – 3 GA) / 57 shot attempts).
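The game-level calculation is simple enough to write down directly. This is a small sketch using the two worked examples from the text (the 50-shot illustration and Gibson’s second playoff game); the function name is my own shorthand rather than anything from the notebook.

```python
def sv_lift_over_expected(expected_goals, goals_against, shots):
    """Goals prevented per shot: (xG - GA) / shots. Positive means better than expected."""
    return (expected_goals - goals_against) / shots

# Illustration from the text: 4 xG, 3 goals against, 50 shots.
example = sv_lift_over_expected(expected_goals=4.0, goals_against=3, shots=50)

# Gibson's playoff game from the text: 3.96 xG, 3 goals against, 57 shot attempts.
gibson = sv_lift_over_expected(expected_goals=3.96, goals_against=3, shots=57)

print(f"Example game: {example:.1%} better than expected")
print(f"Gibson game:  {gibson:.1%} better than expected")
```

One value like this per game is all the later steps need as input.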
However, having to bin each game is a little awkward, and to compare across multiple goalies the y-axis might need some scaling. Since most playoff goalies have 50-plus games this season, we can smooth and scale the distribution using a density curve, showing the probability of each outcome without losing too much information. Doing this smooths over Gibson’s lack of games with only slightly (~1-2%) better than expected results, which is partially strangeness and partially a result of binning, since plenty of his games landed just to the left and right of that range.
Shuffling the Deck
Armed with game-level performances for each goalie, we turn to the playoffs where each game is critical. We can use each goalie’s regular season results (partially attributed to team defensive performance) as a template of what to expect in the playoffs. Think of it as a deck of cards: we draw one card from a deck of that goalie’s performances and place it on the table. Do that again for the opposing goalie and their results. What does that look like after 4 to 7 games? What is the probability your goalie outplays the opposing goalie in a series?
With game-level performances, we can attempt to answer that. Below are Connor Hellebuyck’s and Devan Dubnyk’s regular season performances. When Dubnyk was good he was about as good as Hellebuyck, but when he was bad he was worse. In sum, Hellebuyck was better than Dubnyk on the year.
We can play this like you would the card game ‘War’ (with replacement, meaning a card or game goes back into the pile randomly and can be picked again). Tracking who ‘wins’ and by about how much a few thousand times, we can figure out what percentage of the time we might expect Hellebuyck to outplay Dubnyk, or vice versa.
Using Hellebuyck’s and Dubnyk’s results, Hellebuyck outplays Dubnyk about 57% of the time. In a short series we likely wouldn’t notice the difference, and it’s entirely possible Dubnyk outplays Hellebuyck (as I’m writing this, that appears to be the case in Game 3), but in a game where the marginal probability of winning is small and possibly with an upper bound of 62% accuracy, this is probably a welcome advantage to Winnipeg. I’m assuming management and others with skin in the game would be interested in that edge.
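The ‘War’ comparison can be sketched in a few lines. The version below draws one game per goalie with replacement and counts how often goalie A’s card beats goalie B’s, then repeats the idea over a 7-game stretch. The game values are made up purely for illustration (they are not Hellebuyck’s or Dubnyk’s results), ties are counted as ‘not a win’ for A, and the function names are hypothetical.

```python
import random

def head_to_head(games_a, games_b, n_draws=10_000, seed=0):
    """Share of single-game draws (with replacement) where A's lift beats B's."""
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_draws):
        if rng.choice(games_a) > rng.choice(games_b):
            wins_a += 1
    return wins_a / n_draws

def series_edge(games_a, games_b, n_games=7, n_series=10_000, seed=0):
    """Share of simulated series where A's summed lift over n_games beats B's."""
    rng = random.Random(seed)
    better = 0
    for _ in range(n_series):
        total_a = sum(rng.choice(games_a) for _ in range(n_games))
        total_b = sum(rng.choice(games_b) for _ in range(n_games))
        if total_a > total_b:
            better += 1
    return better / n_series

# Hypothetical game-level Sv% lift over expected (in %), NOT real goalie data.
goalie_a = [3.1, 1.2, -0.5, 2.4, -2.0, 0.8, 4.0, -1.1]
goalie_b = [2.9, 0.3, -3.5, 1.9, -4.2, 0.1, 3.6, -2.4]

print(f"A outplays B in a single game: {head_to_head(goalie_a, goalie_b):.0%}")
print(f"A outplays B over 7 games:     {series_edge(goalie_a, goalie_b):.0%}")
```

Summing lift over the series is one simple way to define ‘outplayed’; counting games won head-to-head would be an equally reasonable choice.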
Looking across all series we can calculate the same probability. We can also overlay the 1 or 2 playoff games performances over the distribution of season results. We can see Matt Murray and Brian Elliott oscillate between their season’s best and worst and Frederik Andersen pull a card he didn’t even know he had.
What’s likely and what actually happens are 2 different things. But it helps to understand how likely something is, which can give some important context to the results of a game or two, even if those results might put your season in peril.
There are a few assumptions to address with this analysis:
The Black Swan Game – Just because something isn’t in the data doesn’t mean it can’t happen. In game 2 against Boston, Frederik Andersen was pulled 12 minutes into the game, posting a save% 40% below expected, which he hadn’t done during the season. Part of this is artificial, he likely would have worked himself back into something less extreme by finishing the game. However, other games during the regular season where Andersen or any other goalie was pulled, would functionally look the same, whether it was -40% or -20%. A loss is quite likely. We’re more interested in: how often do they shit the bed?
Independence of Sampling – This also assumes goalies compartmentalize game performances, as opposed to some sort of lagged effect where a bad game leads to a higher probability of a bad game the next time out. In the playoffs, it certainly feels this way, because if you draw 2 – 3 bad games in a row, that’s usually the end of the season. However, in aggregate the last game has little effect on the current game. Even controlling for workload, a simple linear model found no effect from one game to the next. Each season for the 16 starters looks pretty flat.
However, this is still a little naive. Confidence, health, team play (with or without your best players) might mean some stretches are more favourable than others and have less relevance to the series or game at hand. Additionally, matching up against a single team might result in some otherwise minor details being exploited, perhaps creating an even wider distribution of outcomes (to the frustration of all).
The Playoffs and Regular Season are Comparable – Do teams really tighten up defensively in the playoffs? Do goalies generally step up and play better? Maybe, and if so it would be unwise to sample from regular season games, where shots were more likely to be dangerous due to pre-shot passing plays or screens, and the goalie hadn’t really locked in yet. While goals actually do come a little harder in the playoffs (about 1 less goal than expected per 380 shots, or about 12 games), some of this is because the remaining goalies are, almost by definition, getting good results. Comparing each goalie-season’s regular season results to playoff results, there’s generally no lift from regular season performance to playoffs, but stronger goalies are likely to make it a little further.
On the other side of the ledger, for every goalie that is perceived to raise their game in the playoffs, another will struggle, due to some combination of luck, health, and psychology, but they don’t last long in the sample.
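The independence assumption above is the easiest of the three to sanity-check yourself. A minimal sketch: regress each game’s lift on the previous game’s lift and look at the slope, where a slope near zero is consistent with a ‘flat’ relationship. This is a simplification of what I described (it does not control for workload), and the function name is hypothetical.

```python
def lag1_slope(lifts):
    """Least-squares slope of game t's lift on game t-1's lift.

    lifts: game-level Sv% lift values in chronological order.
    A slope near 0 suggests the previous game says little about the next one.
    """
    prev, curr = lifts[:-1], lifts[1:]
    if len(prev) < 2:
        raise ValueError("need at least 3 games")
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    cov = sum((x - mean_prev) * (y - mean_curr) for x, y in zip(prev, curr))
    var = sum((x - mean_prev) ** 2 for x in prev)
    if var == 0:
        raise ValueError("previous-game values are all identical")
    return cov / var

# Toy sequences to show the two extremes (not real goalie data).
print(lag1_slope([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]))  # strict alternation
print(lag1_slope([1.0, 2.0, 3.0, 4.0, 5.0]))          # steady trend
```

Strict alternation gives a strongly negative slope and a steady trend gives a strongly positive one; real goalie seasons sitting near zero is what ‘pretty flat’ refers to.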
Like most problems, this one won’t go away by ignoring it. If we create a distribution of game-level results from non-rookie goalies with fewer than 1,000 career shots (a replacement-level type goalie) and compare to a goalie with very average results (Devan Dubnyk this season), Dubnyk’s edge is about 60%. Against a good season of results, like Jonathan Quick this season, it increases to about 75%. Not impossible odds, but it seems unlikely that the $4M saved could be put to better use and make up that margin.
Goaltending will continue to frustrate and mystify. But teams and their playoff fates (and possibly reactionary franchise decisions) will always be linked. With so few games and the marginal difference between high-end goaltenders’ true ability being so small, differences in results will rarely manifest themselves in a series like they do on paper. However, most winning in hockey comes on the margins, so these edges are important.
And they are highly visible. Everyone in the rink knows when a star goalie with a big contract has a bad game in the playoffs. It’s not as clear when a 2nd line center has a net negative game, since goalies deal in the currency of goals directly, while for skaters measuring goals over a small sample is best avoided.
I also think this is a good opportunity to re-frame how we think about goaltending results. In the past, I’ve erred and misled by posting projections as some sort of point-estimate (i.e. based on past results, my model expects a 1% save% lift over expected) but it’s fairer to frame projections as a distribution of possible outcomes. Carey Price is a supremely talented goalie, and there were no on-ice results to suggest a poor year, but it was possible. I don’t have the data to interact physiological factors with Price’s age, but that would have helped. Understanding and framing this uncertainty would be helpful when locking someone into an 8-year deal.
Goaltending analysis often gets ignored because of these projection issues, but if we can properly quantify and convey uncertainty it would be a helpful step forward. Skater projections might offer some more certainty (though it’s rarely presented with uncertainty bounds), but their impact is generally one degree removed from the actual goals. Watching the playoffs, it will be clear goalies often have actual goals, wins, and losses hanging around their neck, so understanding these edges and how they might manifest themselves in a playoff series seems warranted.
Edmonton played him 86 games last season and though I don’t know his exact physiological profile, it stands to reason it may have been too many games. Could this burnout be impacting his performance this season? Impossible to say, but his performance has been down (albeit with a smaller sample size, represented by the wider error bars above) and he has battled some injury issues. Regardless, this prediction has been wrong so far.
All of this is to say the ability to create a ‘true,’ comprehensive, and accurate xG model is limited by the nature of the public data available to analysts. Any evaluation and subsequent projection is best viewed through a more sceptical lens.
With this in mind, back to Cam Talbot. Talbot has underachieved this season for some combination of the following:
His ‘true’ ability or performance has been worse. This is probably the case directionally, but it’s important to explore by how much worse he has been. Is this an age-related decline that can help inform future projections? Is the decline small enough to be considered luck? Do we expect a bounce back? and so on.
The random nature of outcomes has made him look worse this season (or alternatively, really good in prior years). Using the beta distribution, we can calculate the standard deviation we might expect for each sample size. More shots mean more certainty in the outcome. Simulating seasons and treating each expected goal as a weighted coin flip accomplishes something similar. Understanding and quantifying this uncertainty is an important aspect of any analysis. As you can see below, there is a (small) chance Talbot is a completely average goalie who concedes as many goals as an average goalie would (his ‘true’ talent puts him along the grey vertical line) but just had some improbable results.
Edmonton is giving up tougher chances against this year that go undetected by an xG model. They might be allowing more cross-ice passes, moving net-front players out of the front of the crease, or employing unfavourable strategies. How this might manifest itself at the team, goalie, or shooter level is discussed below.
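To make the ‘weighted coin flip’ idea in the second point concrete, here is a minimal simulation sketch. Each shot is scored as a goal with probability equal to its xG, a full season is replayed many times, and we ask how often an exactly average goalie would concede at least as many goals as some observed total. The shot list and the observed total are invented for illustration; they are not Talbot’s numbers, and the function names are hypothetical.

```python
import random

def simulate_goals(xg_per_shot, n_seasons=2000, seed=0):
    """Replay a season n_seasons times, flipping a weighted coin for every shot."""
    rng = random.Random(seed)
    return [
        sum(1 for p in xg_per_shot if rng.random() < p)
        for _ in range(n_seasons)
    ]

def share_at_least(simulated_goals, observed_goals):
    """Share of simulated seasons with at least observed_goals conceded."""
    hits = sum(1 for g in simulated_goals if g >= observed_goals)
    return hits / len(simulated_goals)

# Hypothetical season: 1,500 shots, each worth 0.08 xG (120 expected goals).
shots = [0.08] * 1500
sims = simulate_goals(shots)

# Hypothetical observed total of 135 goals against.
print(f"Average simulated goals: {sum(sims) / len(sims):.1f}")
print(f"Chance an average goalie concedes 135+: {share_at_least(sims, 135):.1%}")
```

In practice each shot would carry its own xG rather than a flat 0.08, but the mechanics are the same.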
One Night in Edmonton, A Deep-Dive
While east coasters, like I currently am, were out for New Year’s Eve 2017, Edmonton was hosting Winnipeg. NYE is overrated, but it had to be better than watching the Jets win 5-0. Talbot was in for 5 goals against, not even getting a mercy pull in the 3rd period to start planning the night.
Giving up 5 goals is almost certainly going to look bad next to the expected goals, and this game is no exception. Winnipeg had 52 unblocked shot attempts that totalled 3.3 xG. Removing the xG from 2 Talbot rebounds and replacing it with expected rebounds drops that to about 3 xG. This game demonstrates specific weaknesses of the model, but it can also show how the model compensates for that lack of information.
The goal highlights are pretty much a list of things an expected goal model can miss:
Goal 1: A 3-on-1 off the rush, a pass from the middle of the ice to a man wide-open to the side of the net. Unfortunately, play-by-play picks up no prior events and the xG model scores it as a seemingly low 0.1 xG, recognizing only a wrist-shot 9ft away from a sharper angle. Frankly, this was a slam dunk: even if Talbot adjusted his depth to play a pass, the shooter had time and space in the low slot.
Goal 2: A turnover turns into a cross-ice pass that is immediately deposited for a goal. Fortunately, the turnover is recorded and the angular velocity can be calculated. This is scored as 0.19 xG, which seems low watching this highlight, but it has to be representative of all chances under these circumstances, and most aren’t executed so cleanly. Still probably low.
Goal 3: A powerplay point shot is deflected. Both of these factors put this shot at 0.17 xG. This is decently accurate, the deflection had to skip off the ice and miss Talbot’s block, if that point shot is taken 100 times, scoring on that deflection about 20 times sounds about right. Also note there’s some traffic, which in a perfect world would be quantified properly and incorporated in the xG model.
Goal 4: A pass from the half wall to a man wide open in front of the net. This is scored as a 0.16 xG, a deflection tight to the net after an offensive zone turnover. However, this is more of a pass than true deflection, neither passer nor shooter was contested and I would think that they would convert more than 16 times if given 100 opportunities.
Goal 5: A rush play results in a cross-ice pass and a shot from the hashmarks. Talbot makes the original save, but the rebound is immediately deposited into the net. The rebound shot, changing angle rapidly, is scored as 0.43 xG. However, that rebound is conditional on Talbot not deflecting the puck into the corner, so the play is scored from the original shot: 0.15 xG on the original shot plus 0.02 xG on a potential rebound – a 6% chance of a rebound times the observed goal probability of a rebound, 0.27 (27%).
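The rebound accounting on Goal 5 is worth spelling out, since it is the one place in the list where the credited xG is built from two pieces. A small sketch with the numbers quoted above (the function name is mine):

```python
def shot_value_with_expected_rebound(shot_xg, rebound_prob, rebound_goal_prob):
    """xG credited to the original shot plus the value of a *potential* rebound.

    The actual rebound shot (0.43 xG here) is not counted against the goalie;
    it is replaced by the rebound an average goalie would be expected to allow.
    """
    expected_rebound_xg = rebound_prob * rebound_goal_prob
    return shot_xg + expected_rebound_xg

goal_5 = shot_value_with_expected_rebound(
    shot_xg=0.15,            # original shot from the hashmarks
    rebound_prob=0.06,       # chance the shot produces a rebound
    rebound_goal_prob=0.27,  # observed goal probability on rebound shots
)
print(f"Goal 5 credited xG: {goal_5:.3f}")  # 0.15 + 0.06 * 0.27
```

So the play is worth roughly 0.17 xG against Talbot rather than the 0.15 + 0.43 you would get by counting both shots at face value.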
On the surface, the expected goals assigned to each of these goals are low. A lot of this is confirmation bias – we see the goal, so the probability of a goal must have been higher, but in reality, pucks bounce, goalies make great saves, and shots miss the net.
So what does the model do without that helpful information? In the specific goals discussed, the model is conservative, under-estimating the ‘true’ probability of a goal, but across all shots, the number of expected goals roughly equals the number of goals. This means that impactful, but latent events, like cross-ice passes and screens, are effectively dispersed over all shots.
We assume an unscreened shot from the blueline will be saved 999 times out of 1000 (implying 0.001 xG), but the model, not knowing if there’s a screen or not, might assign an xG of 0.03, an average of all scenarios where there may be a screen. This might increase to 0.05 xG if the shooter has a powerplay, or 0.08 xG if there’s a 2-person advantage, since the probability of a sufficient screen increases. Note that these model adjustments for powerplay shots are applied evenly to all shots on the powerplay, unable to determine which shots are specifically more dangerous on the powerplay, though a powerplay indicator may be interacted with specific discrete factors (i.e. PP+1 – slap shot, PP+1 – wrist shot, etc).
This is fine if the probability of a screen (or other latent factors) is about even across all teams. However, it doesn’t take much video analysis to know that this is almost certainly not the case. All teams gameplan to generate traffic and cross-ice passing plays, but some have the personnel and talent to execute better than others, while some teams have the ability to counter those strategies better. Some teams will over or underperform their expected goals partially due to these latent variables.
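A toy calculation shows how this averaging creates team-level bias. The 0.001 unscreened value and the 0.03 blended value come from the example above; the league screen rate (25%), the xG of a screened point shot (0.117), and the two team screen rates are hypothetical numbers I picked so the league blend reproduces 0.03. None of them are measured.

```python
def blended_xg(p_screen, xg_screened, xg_unscreened):
    """xG a model assigns when it cannot see screens: a weighted average of both cases."""
    return p_screen * xg_screened + (1 - p_screen) * xg_unscreened

XG_UNSCREENED = 0.001  # from the text: saved 999 times out of 1000
XG_SCREENED = 0.117    # hypothetical, chosen so the league blend is 0.03
LEAGUE_SCREEN_RATE = 0.25  # hypothetical

model_xg = blended_xg(LEAGUE_SCREEN_RATE, XG_SCREENED, XG_UNSCREENED)

# Two hypothetical teams that allow screens at different rates on the same shot.
for label, screen_rate in [("allows few screens", 0.15), ("allows many screens", 0.35)]:
    true_xg = blended_xg(screen_rate, XG_SCREENED, XG_UNSCREENED)
    print(f"Team that {label}: model {model_xg:.3f} xG vs 'true' {true_xg:.3f} xG")
```

The model hands both teams the same value for the same recorded shot, so the team allowing more screens will look like it has worse goaltending than it really does.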
Unfortunately, there isn’t necessarily the data available to quantify how much of that effect is repeatable at a team or player level. Some xG models do factor in individual player shooting talent. In my model, borrowing from Dawson Sprigings, shooting talent is represented by regressed shooting percentage indexed to player position. A shot from Erik Karlsson, who is more adept at controlling the puck and setting up and shooting through screens, would have a higher xG than a similar shot from another defender. Shots from Shayne Gostisbehere might receive a similar (though smaller) adjustment if consistently aided by the excellent net-front play of Wayne Simmonds.
So while a well-designed xG model can implicitly capture some effect of pre-shot movement and other latent factors, it’s safe to say team-level expected goal biases exist. By how much and how persistently is ultimately up for discussion.
Talbot’s 99 Problems
Focusing on Talbot again, the big thing holding him back is performance while Edmonton is shorthanded. While his save percentage is about what we’d expect given his shot quality at even-strength, it’s an abysmal 5% below expected on the penalty-kill.
What kind of shots is Edmonton giving up? On the penalty kill, non-rebound shots are about 1% less likely to be saved than shots conceded by other penalty-kills (rebound shots are about 8% more dangerous, but some of that may be a function of Talbot himself, so let’s focus on non-rebound shots).
So it seems like Talbot just is poor while down a man. Let’s compare his penalty-kill numbers side by side with his past performance and that of his recently replaced backup, Laurent Brossoit.
Much like team shot metrics wisely moved from ‘score-close’ to score-adjusted measures, it’s always desirable to use the entirety of data available. About a quarter of goals (23.5%) are scored on the man advantage (only 17% of the shots), and when we talk about measuring goaltender performance and finding advantages on the very slim margins, including that data becomes very important.
However, we’re still fairly certain there are latent team-level biases in our expected goal model, and Edmonton’s penalty-kill hints at that. But we don’t know to what extent, if at all.
Fixing the Penalty Kill
One approach to quantifying the relative impact of coaching and tactics is a fixed-effects model, basically holding constant the effect of a Todd McLellan penalty-kill or Jon Cooper powerplay as they relate to the probability of a goal being scored on any given shot. (Note: I begin to use team-level xG bias and coaching impact a little interchangeably here. I’m interested in team-level bias, but trying to get there by using coaches as a factor.)
This method is rather crude (and obviously not perfect) and wouldn’t necessarily differentiate between marginal goals that result from net-front play and those that result from effective pre-shot passing, though an inquiring team might break out discrete sequences they’re interested in, like how their penalty-kill performs on shots from the point or against 4 Forward-1 Defender powerplays. However, quantifying specific strategic factors like this is not necessarily easy to do, since both teams are simultaneously optimizing and countering each other’s tactics, so for simplicity, it’s probably preferable to consider the holistic penalty-kill or powerplay for now.
Creating fixed-effects for a Todd McLellan penalty-kill will attempt to capture goal probability relative to other teams’ penalty-kills, presumably capturing some of the effect of pre-shot passes or screens. Because Cam Talbot has played the majority of the games for Edmonton recently, the current model is goaltender agnostic to avoid potential multi-collinearity, rather than assigning a numerical value to the quality of goaltender playing, hopefully properly apportioning systemic strengths or weaknesses to the penalty-kill variable rather than Talbot or Brossoit specifically.
One limitation of this approach is the effect will be fixed for whatever time period we select, unable to capture tactical changes that may or may not have worked or changes in personnel available to the coach.
It’s also important to reiterate attempts are made to adjust for home rink scorer bias, but any of this uncaptured systemic bias would possibly be wrongfully attributed to coaching (on home-ice at least).
Running the model on the 2016-17 and 2017-18 seasons, the coefficients can be transformed to represent the relative probability of a goal then indexed to the league average. McLellan’s penalty-kill appears to concede penalty-kill shots about 2.4% more dangerous than the basic expected goal model might expect, putting him 32 out of 38 coaches in the 1.5 season period. Using just the most recent season McLellan finds himself last with Dave Hakstol for conceding the toughest shots in the league while down a player. Cam Talbot’s save percentage on the penalty-kill is about 5% below his expected save percentage, but it appears about half of that (2.4%) might be attributable to McLellan’s tactics – which is a nice, clean compromise.
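One way to write down a model of this kind is sketched below. This is an illustration under stated assumptions, not necessarily the exact specification behind the numbers above. The symbols are assumptions introduced here: x_i is the vector of shot features from the base xG model, c(i) is the coach whose penalty-kill (or powerplay) is on the ice for shot i, gamma_c is that coach’s fixed effect, and gamma-bar is the average coach effect across the league.

```latex
\operatorname{logit} P(\text{goal}_i)
  = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_i + \gamma_{c(i)},
\qquad
\text{index}_c \approx \exp\!\left(\gamma_c - \bar{\gamma}\right)
```

Because a goal on any single shot is a fairly rare event, the exponentiated difference (an odds ratio) is roughly the goal probability relative to league average, so an index of 1.024 would read as shots about 2.4% more dangerous than the base model expects.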
More important than an indictment of specific coaches (this just reflects shot quality; it doesn’t factor in shot quantity or counter-attack measures on the penalty-kill), it might provide a reasonable team-level margin of error for expected goal models. A distribution of results from the last 4 seasons suggests that the effect of coaching on even-strength shot quality is smaller per shot than on special teams by a factor of about 2 to 3. Some of this is definitely the result of smaller sample sizes, but it makes some intuitive sense: special teams are more reliant on tactics than relatively free-flowing even-strength play.
This provides a decent rule of thumb: an expected goal percentage might vary by about 0.2% at even strength and about 0.6% on the powerplay due to coaching and tactics. This, of course, is over a few seasons; over a shorter time frame that number might fluctuate more.
Model testing suggests that even a good projection of goaltender performance would have a healthy margin of error. Misses will happen, but they should be as small as possible and be accompanied by a margin of error.
Of course, misses can be a helpful guide to improvement, looking at things with a different perspective. Results like Edmonton’s penalty-kill can elicit a deep-dive.
None of this is meant to totally reallocate blame from Talbot to McLellan’s penalty-kill, but rather to explore how you might quantify that reallocation, and how that ties into a more general discussion of the limitations of shot quality metrics.
‘True’ shot quality will likely never be completely captured, but there will be incremental improvements, feeding incremental improvements in predicting the future performance of goaltenders and other useful insights into performance.
But the job of the analyst will be the same: miss as small as possible and quantify uncertainties. With a complex game like hockey and limited available data those uncertainties can be daunting, but hopefully this has illuminated some of the weaknesses of current xG models (and maybe revealed a few strengths) and how much of a concern those limitations are. Expected goals are a helpful tool and are part of the incremental improvement hockey analytics needs, but acknowledging their limitations will ultimately make them more powerful.
 In my opinion, the marginal difference between goaltender skill is very tight at the NHL-level, making identifying and predicting performance very difficult, some might argue pointless. But important trends do manifest themselves over time, even if there are aberrations in smaller, but notable periods (like a season or playoff series).
 One of my concerns is trackers being biased toward certain players or teams, or worse yet, being more likely to record passing series that lead to goals. Imagine out-of-sample model scoring being ‘tipped off’ to predict a goal if a passing series exists for that shot; that data is not truly ‘unseen.’
Pictured: Dominik Hasek, who made 70 saves in a 1994 playoff game, beating the New Jersey Devils 1-0 in the 4th overtime. Hasek didn’t receive goal support for the equivalent of 2 full regulation games, but he won anyway. What is the probability of Hasek winning this game and what does it tell us about his contribution to winning?
A Chance to Win
I was lucky enough to attend (and later work at) the summer camps of Ian Clark, who went on to coach Luongo in Vancouver and most recently Bobrovsky in Columbus. Part of the instruction included diving into the mental side of the game. A simple motto that stuck with me was: “just give your team a chance to win.” You couldn’t do it all, and certainly couldn’t do it all at once, it was helpful to focus on the task at hand.
You might give up a bad goal, have a bad period, or two or three, but if you can make the next save to keep things close, a win would absolve all transgressions. Conversely, you might play well, receive no goal support, and lose. Being a goalie leaves little in your control. The goal support a goalie receives is (largely) independent of their ability and outside of rebounds, so are most chances they face. Pucks take improbable bounces (for and against) and 60 minutes is a very short referendum on who deserves to win or lose.
Think of being a hitter in baseball and seeing some mix of fastballs down the middle and absolute junk and the chance to demonstrate marginal ability relative to peers on every 20th pitch.
Smart analysis largely throws away what’s out of the goalie’s control, focusing on their ability to make saves. This casts wins, whatever they are worth, as only a team stat.
Taking a step back, there are two problems with this:
A central purpose of hockey analytics is to figure out and quantify what drives winning, and removing wins from the equation to focus on save efficiency feels like cruising through your math test and handing it in, only to realize you missed the last page. So close, yet so far.
Goalies, coaches, and fans primarily care about winning, so it’s illuminating to create a metric that reflects that. Aligning what’s measured and what matters can be helpful and interesting, and at the very least deserves some more advanced exploration.
Analysis is at its strongest when we can isolate what is in the goaltender’s control, holding external factors constant the best we can. For example, some teams may give up more dangerous chances than others, so it is beneficial to adjust goaltender save metrics by something resembling aggregate shot quality, such as expected goals. Building on this we can evaluate a goaltender’s ability to win games as a function of the quality of chances they face and the goal support they receive.
To do this we can calculate the expected points based on the number of goals a team scores and the number of chances they give up. Because goalies are partially responsible for rebounds, we can strip out rebound shots and replace them with less chaotic, more stable expected rebounds. The result weighs every initial shot as a probability of a goal plus a probability of a rebound, converting expected rebounds to expected goals using the historical shooting % on rebounds, 27%.
A visual representation of the interaction between these factors supports the expectation – scoring more goals and limiting chances (expected goals) against increases expected points gained. Summed to team-level this information could be used to create a Wins Threshold metric, identifying which goalies need to stand on their heads regularly to win games.
The expected points gained based on goal support and chances against will be used to compare to the actual points gained in games started by a goaltender. How does this look in practice? Earlier this season, on November 4th, Corey Crawford faced non-rebound shots that totaled 2.4 expected goals against, while Chicago only scored 1 goal in regulation. Simulating this scenario 1,000 times suggests that with an average goaltending performance Chicago could expect about 0.5 points (the average of all simulations, see below). However, Crawford pitched a shutout and Chicago won in regulation, earning Chicago 2 points. This suggests Crawford’s performance was worth about 1.5 points to Chicago, or 1.5 Points Above Expected (PAX).
Tracking each of Crawford’s starts (ignoring relief efforts) game-by-game shows he’s delivered a few wins against the odds (dark green), while really only costing Chicago one game, against New Jersey (dark red).
The biggest steal of the 2017-18 season so far using this framework? Curtis McElhinney on December 10th faced Edmonton shots worth about 5 expected goals (!) and received 1 goal in support. A team might expect 0.05 points under usual circumstances, but McElhinney pitched a shutout and Toronto got the 2 points.
Other notable performances this season are a mixed bag of big names and backups.
Summing to a season-level reveals which goalies have won more than expected. Goalies above the diagonal line (where points gained = points expected) have delivered positive PAX; goalies below the line have negative PAX.
For simplicity, games that go to overtime will be considered to be gaining 1.5 points for each team, reflecting the less certain nature of the short overtime 3-on-3 and shootout. This removes the higher probability of a goal and quality chances against associated with overtime, which is slightly confounding, bringing the focus to regulation time goal support.
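The expected-points calculation can be sketched in a few lines. This is a minimal illustration, not the exact simulation behind the numbers above: it assumes each non-rebound shot is an independent coin flip weighted by its xG, awards 2 points for a regulation win, 1.5 for a game tied after regulation and 0 for a regulation loss, and uses a made-up workload of 30 shots at 8% each to stand in for a 2.4 xG night. The function and variable names are hypothetical.

```python
import random

REG_WIN_POINTS = 2.0   # regulation win
OT_POINTS = 1.5        # game tied after regulation (the simplification described above)
LOSS_POINTS = 0.0      # regulation loss


def simulate_expected_points(shot_xg, goals_for, n_sims=1000, seed=0):
    """Average points an average goalie would earn given this workload.

    shot_xg   : goal probability for each non-rebound shot faced.
    goals_for : regulation goal support actually received.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        # Assumption: every shot is an independent trial with its own xG.
        goals_against = sum(rng.random() < p for p in shot_xg)
        if goals_against < goals_for:
            total += REG_WIN_POINTS
        elif goals_against == goals_for:
            total += OT_POINTS
        else:
            total += LOSS_POINTS
    return total / n_sims


def points_above_expected(actual_points, shot_xg, goals_for, **kwargs):
    """PAX for one start: actual points minus simulated expected points."""
    return actual_points - simulate_expected_points(shot_xg, goals_for, **kwargs)


# Hypothetical workload resembling the Crawford game: 2.4 xG against,
# one goal of support, and a regulation win worth 2 points.
crawford_like = [0.08] * 30
expected = simulate_expected_points(crawford_like, goals_for=1)
pax = points_above_expected(2.0, crawford_like, goals_for=1)
print(round(expected, 2), round(pax, 2))
```

With only 1,000 simulations the expected points will wobble by a few hundredths from seed to seed, which is small next to the size of a single-game PAX.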
This brings up an assumption the analysis originally builds on: that goal support is independent of goaltender performance. We know that score effects suggest a team that is trailing will likely generate more shots and as a result is slightly more likely to score. A bad goal against might create a knock-on effect where the goaltender receives additional goal support. While it is possible that goaltender performance and goal support aren’t completely independent (as we might expect in a complex system like hockey), the effect is likely very marginal. But in this scenario a win would be considered more probable, further discrediting any potential win or loss. Generally, the relationship between goaltender performance and goal support is weak to non-existent.
However, great puckhandling goalies might directly or indirectly aid their own goal support by helping the transition out of their zone, keeping their defensemen from extra contact, and other actions largely uncaptured by publicly available data. Piecemeal analysis suggests goalies have little ability to help create offense, but absence of evidence does not equal evidence of absence. This is an assumption the analysis will have to live with; any boost to goal support would likely be very small.
Taking the Leap – Icarus?
The goal here is to measure what matters: direct contributions to winning. This framework ties together the accepted notion that the best way for a goaltender to help his team win is to make more saves than expected with the contested idea that some are more likely to make those saves in high-leverage situations than others, albeit in an indirect way. To most analysts, being clutch or being a choker is just a random process with some narrative applied.
However, once again, absence of evidence does not equal evidence of absence. I imagine advanced biometrics might reveal that some players experience a sharper rise in stress hormones, which might affect performance (positively or negatively), during a tie game than if down by a handful of goals. I know I felt it at times, but would have difficulty quantifying its marginal effect on performance, if any. A macro study across all goalies would likely be inconclusive as well. Remember NHL goalies are a sample of the best in the world; those wired weakly very likely didn’t make it (like me).
But winning is important, so it is worth making the jump from puck-stopping ability to game-winning ability. The tradeoff (there’s always tradeoffs) is we lose sample size by a factor of about 30, since the unit of measure is now a game, rather than a shot. This invites less stable results if a game or two have lucky or improbable outcomes. On the other hand, it builds in the possibility some guys are able to raise their level of play based on the situation, rewarding a relatively small number of timely saves, while ignoring goals against when the game was all but decided. I can think of a few games that got out of control where the ‘normal circumstances’ an expected goals model assumes begin to break down.
All hockey followers know goalies can go into brick-wall mode and win games by themselves. The best goalies do it more often, but is it a more distinguishable skill than the raw ability to prevent goals? Remember, we are chasing the enigmatic concept of clutch-ness or ability to win at the expense of sample size, threatening statistically significant measures that give analysis legs.
To test this we can split each goalie season into random halves and calculate PAX in each half, looking at the correlation between the two. For example, goalie A might have 20 of their games with a total PAX of 5 end up in ‘split 1’ and their other 20 games with a PAX of 3 in ‘split 2.’ Doing this for each goalie season we can look at the correlations between the 2 splits.
Using goalie games from 2009 – 2017 we randomly split each goalie season 1,000 times at minimum game cutoffs ranging from 20 to 50, checking the Pearson correlation between each random split. Correlations consistently above 0 suggest the metric has some stability and contains a non-random signal. As a baseline we can compare to the intra-season correlation of a save efficiency metric, goals prevented over expected, which has the advantage of being a shot-level split.
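The split-half test can be sketched as follows. This is a minimal illustration on synthetic data, not the original analysis: the goalie seasons are simulated with a made-up per-game skill and noise level, each season is cut into two equal-sized random halves (a simplification), and all names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def split_half_correlation(seasons, n_splits=1000, min_games=20, seed=0):
    """Mean correlation between total PAX in two random halves of each season.

    seasons: dict mapping a goalie-season label to a list of per-game PAX.
    """
    rng = random.Random(seed)
    eligible = [games for games in seasons.values() if len(games) >= min_games]
    correlations = []
    for _ in range(n_splits):
        first, second = [], []
        for games in eligible:
            shuffled = games[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            first.append(sum(shuffled[:half]))
            second.append(sum(shuffled[half:]))
        correlations.append(pearson(first, second))
    return sum(correlations) / len(correlations)


# Synthetic goalie seasons: a hypothetical per-game skill plus game-level noise.
demo_rng = random.Random(1)
seasons = {}
for i in range(40):
    skill = demo_rng.gauss(0.0, 0.3)
    seasons[f"goalie_{i}"] = [skill + demo_rng.gauss(0.0, 0.8) for _ in range(40)]

mean_r = split_half_correlation(seasons, n_splits=200)
print(round(mean_r, 2))
```

Raising `min_games` keeps only heavier workloads, which is the knob the cutoffs from 20 to 50 games are turning.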
The test reveals that goals prevented per shot carries relatively more signal, which was expected. However, the wins metric also contains stability, losing relative power as sample size drops.
Goalies that contribute points above expected in a random handful of games in any given season are more likely to do the same in their other games. Not only does a wins based metric make sense to the soul, statistical testing suggests it carries some repeatable skill.
Goalie wins as an absolute number are a fairly weak measure of talent, but they do contain valuable information. Like most analyses, if we can provide the necessary context (goal support and chances against) and apply fair statistical testing, we can begin to learn more about what drives wins. While the measure isn’t vastly superior to save efficiency, it does contain some decent signal.
Exploring goaltender win contributions with more advanced methods is important. Wins are the bottom line, they drive franchise decisions, and frame the narrative around teams and athletes. Smart deep dives may be able to identify cases in which poor win-loss records are bad luck and cases which have more serious underlying causes.
A quick look at season-level total goals prevented and PAX (the metrics we compared above) shows an additional goal prevented is worth about 0.37 points in the standings, which is supported by the 3-1-1 rule of thumb, or more precisely, 2.73 goals per point calculated in Vollman’s Hockey Abstract. Goal prevention explains about 0.69 of the variance in PAX, so the other 0.31 of the variance may include randomness and (in)ability to win. Saves are still the best way to deliver wins, but there’s more to the story.
When I was a goalie, it was helpful to constantly reaffirm my job: give my team a chance to win. I couldn’t score goals, I couldn’t force teams to take shots favorable to me, so removing that big W from the equation helped me focus on what I could control: maximizing the probability of winning regardless of the circumstances.
This is what matters to goalies, their contribution to wins. Saves are great, but a lot of them could be made by a floating chest protector. While the current iteration of the ‘Goalie Points Above Expected’ metric isn’t perfect, hopefully it is enlightening. Goalies flip game probabilities on their head all the time, creating a metric to capture that information is an important step in figuring out what drives those wins.
Thanks for reading! I hope to make data publicly available and/or host an app for reference. Any custom requests ping me at @crowdscoutsprts or firstname.lastname@example.org.
 I personally averaged 1 point/season, so this assumption doesn’t always hold.
 Adequately screaming at defensemen to cover the slot or third forwards to stay high in the offensive zone is also assumed.
 If a goalie makes a huge save late in a tie game and subsequently wins in overtime, the overtime goal was conditional on the play of the goalie, making the win (with an extra goal in support) look easier than it would have otherwise.
 Despite it partially delegitimizing my offensive production in college.
 Note that the split of PAX is at the game-level, which makes it kind of clunky. Splitting randomly will mean some splits will have more or less games, possibly making it tougher to find a significant correlation. This isn’t really a concern with thousands of shots.
Pictured: Marc-Andre Fleury makes an amazing save at the end of game 7 to win the 2009 Stanley Cup, moments after giving up a rebound. Did he need to make this dramatic save? Should he be credited for it? Looking at the probability of a rebound on the original shot can help lend context.
A few years ago I was a seasoned collegiate goaltender and a raw undergrad Economics major. This was a dangerous combination. When my save percentage fell from something that was frankly pretty good to below average, I turned to an overly theoretical model to help explain this slip in measured performance, for my own peace of mind and general curiosity. The goal was to measure goaltending performance by controlling for the things out of their control, like team defense. Specifically, this framework would properly account for shot quality (of course) and adjust for rebounds, by not giving goalies credit for saves made on preventable rebounds. The former considers things out of the goalie’s control; the latter considers what is actually in the goalie’s control. Discussing the model with my professor, it was soon clear that I included a lot of components that didn’t have available data, such as pre-shot puck movement and/or some sort of traffic index. However, this hasn’t stopped analysts, including myself, from creating expected goals models with the data available publicly. But a public and comprehensive expected goal model remains elusive.
One of the first iterations and applications of an expected goals model was Michael Schuckers’ Defense Independent Goalie Rating (DIGR). This framework has been borrowed by other analysts, myself included. The idea is that the shots goalies face are largely out of their control; they can’t help it if they face 3 breakaways in a period or Ovechkin one-timers from the slot. However, goalies can assert some control over rebounds. How much, and whether this makes a difference, is something we will explore.
Regardless of the outcome of the analysis, logic would suggest we discount credit we give goaltenders for facing shots that they could have or should have prevented. Bad rebounds that turn into great saves should be evaluated from the original shot, rather than taking any follow-up shots as a given.
Rebounds Carry Weight
It’s important to note that rebound shots result in a higher observed probability of a goal, which makes sense, and expected goal models generally reflect this. However, this disproportionate share of expected goals can be confounding when ‘crediting’ a goalie for facing a rebound opportunity that could have been prevented. Looking at my own expected goal model, rebounds account for about 3.2% of all shots, but 13% of total expected goals. This ratio of rebounds being about 4 times as dangerous is supported by observed data as well. Shooting percentage on rebounds is about 27%, while it is 5.8% on original shots.
In the clip above and using hypothetical numbers, Luongo (one of my favorite goalies, so not picking on him here) gives up a bad rebound on a wrist shot from just inside the blueline, with an expected goal (xG) value of ~3%, but the rebound shot, due to the calculated angular velocity of the puck, results in a goal historically ~30% of the time. Should this play be scored as Luongo preventing about 1/3 of a goal (~3% + ~30%)?
What if I told you the original shot resulted in a rebound ~2% of the time and that the average rebound is converted to a goal ~25% of the time? Wouldn’t it make more sense to ignore the theatrical rebound save and focus in on the original shot? That’s why I’d rather calculate that Luongo faced a 3.5% chance of a goal, rather than ~33% chance of goal. An xG of 3.5% is based on the 3% of the original shot going in PLUS 0.5% chance of a rebound going in (2% chance of rebound times ~25% chance of goal conditional on rebound), and no goal was scored.
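Calculation | Goals Given Up | Total xG Faced | xG 1st Shot | xG 2nd Shot
Raw xG Calculation | 0 | ~33% | ~3% | ~30%
Rebound-Adjusted xG Calculation | 0 | 3.5% | ~3% | 0.5% = 25% * 2% (historical probability of goal *given* rebound occurred, times the probability of a rebound)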
Removing Credit Where Credit Isn’t Due
So as not to give goaltenders credit for saves made on ‘bad’ rebound shots, we can do the following:
Strip out all xG on shots immediately after a rebound (acknowledging the actual goals that occur on any rebounds, of course)
Assign a probability of a rebound to each shot
Convert the probability of a rebound to a probability of a goal (xG) by multiplying the expected rebound (xRebound) by the probability of a goal on rebound shots, about 27%. This punishes ‘bad’ or preventable rebounds more than shots more likely to result in rebounds. Using similar logic to an expected goals model, some goalies might face shots more likely to become rebounds than others. By converting expected rebounds (xRebounds) to xG, we still expect the total number of expected goals to equal the total number of actual goals scored even after removing xG from rebounds.
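The three steps above can be sketched as a small function. This is an illustration only: the dictionary keys and function names are hypothetical, and the per-shot `xg` and `x_rebound` values are assumed to come from separately fitted goal and rebound models.

```python
REBOUND_GOAL_RATE = 0.27  # historical shooting % on rebound shots, from the text


def raw_xg(shots):
    """Unadjusted xG: every shot counts, rebound shots included."""
    return sum(shot["xg"] for shot in shots)


def rebound_adjusted_xg(shots, rebound_goal_rate=REBOUND_GOAL_RATE):
    """Rebound-adjusted xG for a sequence of shots.

    Each shot is a dict with hypothetical keys:
      'xg'         probability the shot itself is a goal
      'x_rebound'  probability the shot produces a rebound shot
      'is_rebound' True if the shot was itself taken off a rebound
    """
    total = 0.0
    for shot in shots:
        if shot["is_rebound"]:
            continue  # step 1: strip xG from shots taken off rebounds
        # steps 2 and 3: shot xG plus the expected rebound converted to xG
        total += shot["xg"] + shot["x_rebound"] * rebound_goal_rate
    return total


# The Luongo illustration: a 3% point shot with a 2% chance of a rebound,
# followed by a rebound shot worth 30%. The prose rounds the conversion
# rate to 25%, so that value is passed in to reproduce the 3.5% figure.
play = [
    {"xg": 0.03, "x_rebound": 0.02, "is_rebound": False},
    {"xg": 0.30, "x_rebound": 0.00, "is_rebound": True},
]
print(round(raw_xg(play), 3))                     # 0.33
print(round(rebound_adjusted_xg(play, 0.25), 3))  # 0.035
```

Actual goals scored on rebound shots are still charged to the goalie elsewhere in the calculation; only the expected side of the ledger is changed here.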
To do this we can create a rebound probability model using logistic regression and a similar set of features as an xG model. My most recent model has an out-of-sample area under the ROC curve of 0.68, where 0.50 is random guessing (or assuming every shot has a 3.2% chance of rebound, which is the historical rate). Compare this to the current xG model’s out-of-sample ROC AUC of 0.78, suggesting rebounds are tougher to reliably predict than goals (and we’re not sure there either). A weak rebound model is fine, reflecting the idea that any given shot has some probability of turning into a dangerous rebound, maybe through a bad bounce, a goaltender mishap, or a fortunate forward; we just have a tough time knowing when.
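To make the AUC comparison concrete, here is a small pure-Python sketch of the statistic itself, using made-up labels and scores rather than real shot data. It reads AUC as the chance that a randomly chosen rebound shot was given a higher predicted probability than a randomly chosen non-rebound shot, with ties counted as half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 marks a shot that produced a rebound.
labels = [0, 0, 1, 0, 1, 0]
constant = [0.032] * len(labels)                    # every shot gets the historical rate
weak_model = [0.02, 0.03, 0.05, 0.04, 0.03, 0.01]   # hypothetical model output

print(roc_auc(labels, constant))    # 0.5
print(roc_auc(labels, weak_model))  # 0.8125
```

Assigning every shot the same 3.2% ties every pair, which is why the constant predictor lands on exactly 0.50. The pairwise loop is quadratic, so on a full season of shots a rank-based implementation would be the practical choice.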
This does make some sense though. Unlike goals, where the target is very clear (put the puck in the net), rebounds are less straightforward: they require the puck to hit the goalie and find an opposing player’s stick before the defense can knock it away. Some defensemen might be able to generate rebounds from point shots more than random, but despite what they might tell you after the fact, players are generally trying to score on the original shot, not create a rebound specifically.
It is also true that goals are targeted, defined events (the game stops, lights go on, goalie feels shame, and the score keeper records it), whereas rebounds escape an obvious definition. Hockey analytics have generally used shots <= 2 seconds from the shot prior, so let’s explore the data behind that reasoning now.
Quickly: What is a rebound?
It’s important to go back and establish what a rebound actually is, without the benefit of watching every shot from every game. We would expect the average shot off of a rebound to have a higher chance of being a goal than a non-rebound shot (all else being equal) since we know the goalie has less time to be able to get set for the shot. And just hypothesizing, it probably takes the goalie and defenders a couple seconds to recover from a rebound. To test the ‘time since last shot’ hypothesis, we can look in the data to see where the observed probability of a goal begins to normalize.
Shots within 2 seconds of the original shot are considerably more likely to result in goals than shots taken otherwise. There is some effect at a 3 second lag, and certainly some slow-fingered shot recorders around the league might miss a ‘real’ rebound here and there, but the naive classifier of 0-2 seconds between shots is probably the best we can do with limited public data. At 3 seconds, we have lost about half of the effect.
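A minimal sketch of the naive classifier and the lag check described above. The record layout (`game_id`, `period`, `seconds`, `is_goal`) is a hypothetical simplification of play-by-play data, shots are assumed to be sorted in game order, and the only stoppages handled are goals and period changes. Whistles between shots would need the full event stream.

```python
from collections import defaultdict

REBOUND_WINDOW_SECONDS = 2  # naive cutoff discussed in the text


def flag_rebounds(shots, window=REBOUND_WINDOW_SECONDS):
    """Add seconds since the previous shot and a rebound flag to each shot."""
    flagged = []
    previous = None
    for shot in shots:
        continuous = (
            previous is not None
            and previous["game_id"] == shot["game_id"]
            and previous["period"] == shot["period"]
            and not previous["is_goal"]  # a goal stops play
        )
        lag = shot["seconds"] - previous["seconds"] if continuous else None
        is_rebound = lag is not None and lag <= window
        flagged.append({**shot, "lag": lag, "is_rebound": is_rebound})
        previous = shot
    return flagged


def goal_rate_by_lag(flagged, max_lag=5):
    """Observed goal rate at each whole-second lag up to max_lag."""
    goals, counts = defaultdict(int), defaultdict(int)
    for shot in flagged:
        if shot["lag"] is not None and shot["lag"] <= max_lag:
            counts[shot["lag"]] += 1
            goals[shot["lag"]] += shot["is_goal"]
    return {lag: goals[lag] / counts[lag] for lag in sorted(counts)}


# Tiny made-up sequence of shots.
shots = [
    {"game_id": 1, "period": 1, "seconds": 100, "is_goal": 0},
    {"game_id": 1, "period": 1, "seconds": 101, "is_goal": 1},  # rebound goal
    {"game_id": 1, "period": 1, "seconds": 300, "is_goal": 0},  # follows a goal
    {"game_id": 1, "period": 2, "seconds": 10, "is_goal": 0},   # new period
    {"game_id": 1, "period": 2, "seconds": 14, "is_goal": 0},   # 4 second lag
]
flagged = flag_rebounds(shots)
print([s["is_rebound"] for s in flagged])
print(goal_rate_by_lag(flagged))
```

Run over several seasons, the dictionary from `goal_rate_by_lag` is the kind of table that shows where the goal rate settles back toward its baseline.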
Can your favorite goalie prevent more rebounds than would be expected? If so, great: they will be credited with the extra xG (xRebounds multiplied by the observed probability of rebound goals, 27%) without having to face a bunch of chaotic and dangerous rebound shots. If they give up more rebounds than average, their xG won’t be inflated by a bunch of juicy rebounds, but rather replaced by a more modest xG amount indicative of league average goaltending considering what we know about the shots they’re facing.
Which goalies are best at consistently preventing rebounds according to the model? Looking at expected rebound rates compared to actual rebound rates (below) suggests maybe Pekka Rinne, Petr Mrazek, and Tuukka Rask have a claim at consistently being able to prevent rebounds. Rinne has been well documented to have standout rebound control, so we are at least directionally reaching the same conclusions as prior analyses and observations. However, adding error bars consistent with +/- 2 standard deviations dulls this claim a little.
Generally, the number of rebounds given up by a goalie over the season loosely reflects what the model predicts. The ends of the spectrum are Rinne, with great rebound control in 2011-12, and Marc-Andre Fleury, who gave up almost 40 more rebounds than expected in 2016-17. Interestingly, Pittsburgh had some of the worst xGA/60 metrics in the league that year and ended up winning the Cup anyway. High rebound rates by both goalies (Murray’s rebound rate was about 1% higher than expected himself) definitely contributed to the high xGA/60 number, perhaps making their defense look worse than it was.
Goal Probability Assumptions
I’ll admit we’re making a pretty big assumption that if an errant puck is controlled and a rebound shot is taken the probability of a goal will be 27%. Maybe some goalies are better at consistently making rebound saves than other goalies, either through skill or ability to put rebounds in relatively low danger areas. The plot below shows, with +/- standard deviation error bars, observed goal % (1 – save %) on rebound shots for goalies with at least 5 seasons since 2010-11.
Devan Dubnyk and Carey Price have been consistent in conceding fewer than 27% (the average for the entire sample) of rebound shots as goals. However, considering the standard deviation we can expect from this distribution given the sample size, this may not be ‘skill.’ It’s also important to explore if their rebound shots are less dangerous than average, whether due to skill, luck, or team defensive structure. This appears to be the case, when adjusted for the xG model, they perform about as well as the model predicts in some seasons, and exceed it in others. Certainly not by enough to suggest their rebounds should be treated any differently going forward.
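To get a feel for how wide those error bars are at realistic sample sizes, here is a rough sketch using the binomial standard deviation. Treating every rebound shot as an independent trial with the same 27% chance of scoring is an assumption made here for illustration and may not match how the plotted error bars were built. The rebound-shot counts are made up.

```python
import math

LEAGUE_REBOUND_GOAL_RATE = 0.27  # sample-wide rate quoted in the text


def binomial_band(n_shots, rate=LEAGUE_REBOUND_GOAL_RATE, n_sd=1.0):
    """Range of observed goal % that chance alone would plausibly produce.

    Assumes independent rebound shots that all share the same goal probability.
    """
    sd = math.sqrt(rate * (1.0 - rate) / n_shots)
    return rate - n_sd * sd, rate + n_sd * sd


for n in (50, 100, 400):  # hypothetical rebound-shot counts
    low, high = binomial_band(n)
    print(n, round(low, 3), round(high, 3))
```

At 100 rebound shots the one standard deviation band is already more than four percentage points on either side of 27%, so a goalie sitting a few points under the league rate is not yet clearly separated from luck.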
Looking at intra-goalie performance correlation supports the idea that making saves on rebounds is a less repeatable skill than making saves on the original shots. From 2014-2017, splitting each goalie’s shots faced into random halves, the correlation between split 1 performance and split 2 performance is about 0.43. On rebound shots, this correlation falls to 0.24, suggesting that there is considerably less signal. While there is some repeatable skill, it’s not enough to treat any goalies differently in our model post-rebound due to remarkable ability (or inability) to make saves on rebounds.
Controlling Rebounds, Summary
To reiterate, the problem:
Expected goal models are valuable in measuring goaltending performance, but rebounds are responsible for a disproportionate share of expected goals, which the goalie has some control over.
Remove all expected goals credited to the goalie on rebound shots.
Develop a logistic regression model predicting rebounds, the output of which can be interpreted as each shot’s probability of a rebound.
Explore goalie-level ability to make saves on rebound shots, to support the assumption that 27% of rebound shots will result in a goal, regardless of goalie.
Replace ‘raw’ expected goals with an expected goal amount based on the probability of a goal PLUS the probability of a rebound shot multiplied by the historical observed goal % on rebound shots (27%), considering initial, non-rebound shots only.
Finally it’s important to ask: does this framework help predict future performance? Or is it just extra work for nothing?
The answer appears to be yes. My RITHAC work attempted to project future goaltender performance by testing different combinations of metrics (xG raw, xG adjusted for rebounds, xG with a Bayesian application, raw save %) and parameters (age regressors, Bayesian priors, lookback seasons). Back testing past seasons, the metrics adjusted for rebounds performed better than the same metrics using a raw expected goal metric as its foundation.
This supports the idea that rebounds, particularly in expected goals models, can confound goaltender analysis by crediting goaltenders disproportionately for chances that they have some control over. In order to reward goalies for controlling rebounds and limiting subsequent chances, goalies can be measured against the amount of goals AND rebounds a league average goalie would concede – which is truer to the goal of creating a metric that controls for team defense and focuses on goaltender performance independent of team quality. Layering in this rebound adjustment increases the predictive power of expected goal metrics.
The limitations of this analysis include the unsatisfactory definition of a rebound and the need for an expected rebound model (alternatively, a naive assumption that 3.2% of shot attempts result in rebounds can be used). Another layer of complexity might lose some fans and fanalysts. But initial testing suggests that the rebound adjustment adds enough incremental predictive power to justify its inclusion in advanced goaltending analysis, where the goal is to measure goaltender performance independent of team defense with the publicly available data.
But ask yourself, your coach, your goalie, whoever: should a goalie get credit for a save he makes on a rebound, if he should have controlled it? Probably not.
Thanks for reading! Goalie-season xRebound/Rebound data is updated often and can be downloaded. Any custom requests ping me at @crowdscoutsprts or email@example.com.
 Rebound xG actually can’t be added to the original shot like this since we are basically saying the original shot has a 3% chance of going in, so the rebound will only happen 97% of the time. The probability of the rebound goal in that case is 97% * 30%, or 29.1%. But for simplicity I’ll consider the entire play to be a goal 33.3% of the time. The original work and explainer by Danny Page: (https://medium.com/@dannypage/expected-goals-just-don-t-add-up-they-also-multiply-1dfd9b52c7d0)
If you’re reading this, you’re likely familiar with the idea behind expected goals (xG), whether from soccer analytics, early work done by Alan Ryder and Brian MacDonald, current models by DTMAboutHeart and Asmean, Corsica, and Moneypuck, or things I’ve put up on Twitter. Each model attempts to create a probability of each shot being a goal (xG) given the shot’s attributes like shot location, strength, shot type, preceding events, shooter skill, etc. There are also private companies supplementing these features with additional data (most importantly pre-shot puck movement on non-rebound shots and some sort of traffic/sight-line metric), but this is not public or generated in real time so will not be discussed here.
To assign a probability (between 0% and 100%) to each shot, most xG models likely use logistic regression – a workhorse in many industry response models. As you can imagine the critical aspect of an xG model, and any model, becomes feature generation – the practice of turning raw, unstructured data into useful explanatory variables. NHL play-by-play data requires plenty of preparation to properly train an xG model. I have made the following adjustments to date:
Adjust for recorded shot distance bias in each rink. This is done by building a cumulative distribution function for shots taken in games where the team is away and applying that distribution to the home rink in case their home scorer is biased. For example (with totally made up numbers), when Boston is on the road their games see 10% of shots within 5 feet of the goal, 20% of shots within 10 feet of the goal, etc. We can adjust the shot distance in their home rink to be the same since the combined bias of 29 data-recorders should be smaller than the bias of a single Boston data-recorder. If at home in Boston, 10% of the shots were within 10 feet of the goal, we might suspect that the scorer in Boston is systematically recording shots further away from the net than other rinks. We assume games with that team result in similar event coordinates both home and away and we can transform the home distribution to match the away distribution. Below is a demonstration of how distributions can differ between home and away games, highlighting the probable bias of the Boston and NY Rangers scorers that season, which was adjusted for. Note we also don’t necessarily want to transform by an average, since the bias is not necessarily uniform throughout the spectrum of shot distances.
Figure out what events lead up to the shot, what zone they took place in, and the time elapsed between these events and the eventual shot while ensuring stoppages in play are caught.
Limit to just shots on goal. Misses include information, but like shot distance contain scorer bias. Some scorers are more likely to record a missed shot than others. Unlike shots where we have a recorded event, and it’s just biased, adjusting for misses would require ‘inventing’ occurrences in order to adjust biases in certain rinks, which seems dangerous. It’s best to ignore misses for now, particularly because the majority of my analysis focuses on goalies. Splitting the difference between misses caused by the goalie (perhaps through excellent positioning and reputation for not giving up pucks through the body) and those caused by recorder bias seems like a very difficult task. Shots on goal test the goalie directly hence will be the focus for now.
Clean goalie and player names. Annoying but necessary – both James and Jimmy Howard make appearances in the data, and they are the same guy.
Determine the strength of each team (powerplay for or against or if the goaltender is pulled for an extra attacker). There is a tradeoff here. The coefficients for the interaction of states (i.e. 5v4, 6v5, 4v3 modeled separately) pick up interesting interactions, but show significant instability from season to season. For example, 3v3 went from a penalty-box filled improbability to a common occurrence to finish overtime games. Alternatively, shooter strength and goalie strength can be modeled separately; this is more stable but less interesting.
Determine the goaltender and shooter handedness and position from look-up tables.
Determine which end of the ice and what coordinates (positive or negative) the home team is based, using recordings in any given period and rink-adjusting coordinates accordingly.
Calculate shot distance and shot angle. Determine what side of the ice the shot is from, and whether or not it is the shooter’s off-wing based on handedness.
Tag shots as rushes or rebounds, and if a rebound, how far the puck travelled and the angular velocity of the puck from shot 1 to shot 2.
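The rink adjustment in the first item above is the least intuitive step, so here is a minimal sketch of the idea: look up where a recorded home distance sits in the home distribution, then return the away distance at the same rank. The function name and the distances are made up for illustration, and this nearest-rank lookup is an assumption rather than the exact transformation used in my model.

```python
from bisect import bisect_right


def adjust_home_distance(distance, home_sorted, away_sorted):
    """Map a home-rink shot distance onto the away-game distribution (quantile matching)."""
    # Number of home shots recorded at or inside this distance.
    count = bisect_right(home_sorted, distance)
    # Integer ceiling of count * n_away / n_home gives the matching rank in away games.
    rank = -(-count * len(away_sorted) // len(home_sorted))
    index = min(len(away_sorted) - 1, max(0, rank - 1))
    return away_sorted[index]


# Made-up example: the home scorer records every shot 5 feet further out than on the road.
away = sorted([5, 10, 15, 20, 25, 30, 35, 40, 45, 50])
home = sorted(d + 5 for d in away)

for recorded in (10, 30, 55):
    print(recorded, "->", adjust_home_distance(recorded, home, away))
```

Because the mapping works rank by rank rather than subtracting an average, it can handle a bias that is larger for close shots than for long ones.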
All of this is to say there is a lot going on under the hood, and the results are reliant on the data being recorded, processed, adjusted, and calculated properly. Importantly, the cleaning and adjustments to the data will never be complete; there are only issues that haven’t been discovered or adjusted for yet. There is no perfect xG model, nor is it possible to create one from the publicly available data, so it is important to concede that there will be some errors. The goal is to prevent systemic errors that might bias the model. These models do add useful information that regular shot attempt models cannot, creating results that are more robust and useful, as we will see.
Current xG Model
The current xG model does not use all developed features. Some didn’t contain enough unique information, perhaps over-shadowed by other explanatory variables. Some might have been generated on sparse or inconsistent data. Hopefully, current features can be improved or new features created.
While the xG model will continue to be optimized to maximize out-of-sample performance, the discussion below captures a snapshot of the model. All cleanly recorded shots from 2007 to present are included, randomly split into 10 folds. Each of the 10 folds was then used as a testing dataset (checking to see if the model correctly predicted a goal or not by comparing it to actual goals) while the other 9 folds were used to train the model. In this way, all reported performance metrics consist of comparing model predictions on the unseen data in the testing dataset to what actually happened. This is known as k-fold cross-validation and is fairly common practice in data science.
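For anyone who hasn’t seen k-fold cross-validation written out, here is a minimal sketch of the fold mechanics. The data is simulated, and the "model" is a stand-in (the goal rate in the training folds) rather than my actual xG model; all names are hypothetical.

```python
import random


def k_fold_splits(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs so each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Simulated shots: 1 = goal, 0 = save (made-up 9% goal rate).
rng = random.Random(1)
goals = [1 if rng.random() < 0.09 else 0 for _ in range(200)]

out_of_fold = [None] * len(goals)
for train, test in k_fold_splits(len(goals), k=10):
    # Stand-in "model": predict the training goal rate for every held-out shot.
    rate = sum(goals[i] for i in train) / len(train)
    for i in test:
        out_of_fold[i] = rate

print(f"mean out-of-fold prediction: {sum(out_of_fold) / len(out_of_fold):.3f}")
```

Every prediction in `out_of_fold` comes from a model that never saw that shot, which is what makes the reported metrics honest.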
When we rank-order the predicted xG from highest to lowest probability we can compare the share of goals that occur to shots ordered randomly. This gives us a gains chart, a graphic representation of how good the model is at finding actual goals relative to selecting shots randomly. We can also calculate the Area Under the Curve (AUC), where 1 is a perfect model and 0.5 is a random model. Think of the random model in this case as shot attempt measurement, treating all shots as equally likely to be a goal. The xG model has an AUC of about 0.75, which is good, and safely in between perfect and random. The most dangerous 25% of shots as selected by the model make up about 60% of actual goals. While there’s irreducible error and model limitations, in practice it is an improvement over unweighted shot attempts and accumulates meaningful sample size quicker than goals for and against.
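Both measures are simple to compute by hand. Below is a minimal sketch on eight made-up shots; the function names are hypothetical, and the pairwise AUC calculation is fine for a toy example but too slow for hundreds of thousands of shots.

```python
def auc(probs, goals):
    """Share of (goal, non-goal) pairs where the goal had the higher xG; ties count half."""
    pos = [p for p, g in zip(probs, goals) if g == 1]
    neg = [p for p, g in zip(probs, goals) if g == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gains(probs, goals, top_share):
    """Share of all goals found in the top_share most dangerous shots by xG."""
    ranked = sorted(zip(probs, goals), key=lambda t: t[0], reverse=True)
    cutoff = int(round(top_share * len(ranked)))
    return sum(g for _, g in ranked[:cutoff]) / sum(goals)


# Made-up xG values and outcomes for eight shots.
xg = [0.30, 0.20, 0.10, 0.08, 0.05, 0.05, 0.03, 0.02]
goal = [1, 0, 1, 0, 0, 0, 0, 0]

print(f"AUC: {auc(xg, goal):.3f}")
print(f"goals captured in top 25% of shots: {gains(xg, goal, 0.25):.0%}")
print(f"AUC when every shot is treated equally: {auc([0.1] * 8, goal):.1f}")
```

Treating every shot as equally dangerous, as raw shot attempts do, lands exactly on the 0.5 of a random model.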
Hockey is also a zero-sum game. Goals (and expected goals) only matter relative to league average. Original iterations of the expected goal model built on a decade of data show that goals were becoming dearer compared to what was expected. Perhaps goaltenders were getting better, or league data-scorers were recording events to make things look harder than they were, or defensive structures were impacting the latent factors in the model or some combination of these explanations.
Without the means to properly separate these effects, each season receives its own weights for each factor. John McCool had originally discussed season-to-season instability of xG coefficients. Certainly this model contains some coefficient instability, particularly in the shot type variables. But overall these magnitudes adjust to equate each season’s xG to actual goals. Predicting a 2017-18 goal would require additional analysis and smartly weighting past models.
xG in Action
Every shot has a chance of going in, ranging from next to zero to close to certainty. Each shot in the sample is there because the shooter believed there was some sort of benefit to shooting, rather than passing or dumping the puck, so we don’t see a bunch of shots from the far end of the rink, for example. xG then assigns a probability to each shot of being a goal, based on the explanatory variables generated from the NHL data listed above: shot distance, shot angle, whether the shot is a rebound, and so on.
Modeling each season separately, total season xG will be very close to actual goals. This also grades goaltenders on a curve against other goaltenders each season. If you are stopping 92% of shots, but others are stopping 93% of shots (assuming the same quality of shots) then you are on average costing your team a goal every 100 shots. This works out to about 7 points in the standings assuming a 2100 shot season workload and that an extra 3 goals against will cost a team 1 point in the standings. Using xG to measure goaltending performance makes sense because it puts each goalie on equal footing as far as what is expected, based on the information that is available.
We can normalize the number of goals prevented by the number of shots against to create a metric, Quality Rules Everything Around Me (QREAM), Expected Goals – Actual Goals per 100 Shots. Splitting each goalie season into random halves allows us to look at the correlation between the two halves. A metric that captures 100% skill would have a correlation of 1. If a goaltender prevented 1 goal every 100 shots, we would expect to see that hold up in each random split. A completely useless metric would have an intra-season correlation of 0, picking numbers out of a hat would re-create that result. With that frame of reference, intra-season correlations for QREAM are about 0.4 compared to about 0.3 for raw save percentage. Pucks bounce so we would never expect to see a correlation of 1, so this lift is considered to be useful and significant.
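Here is a minimal sketch of the QREAM calculation and the random split-half test. Everything in it is simulated: the 40 goalie seasons, the flat 9% xG per shot, and the spread in goalie skill are all assumptions chosen for illustration, so the correlation it prints says nothing about the real figure.

```python
import random


def qream(shots):
    """Expected goals minus actual goals per 100 shots; shots is a list of (xg, goal)."""
    total_xg = sum(xg for xg, _ in shots)
    total_goals = sum(goal for _, goal in shots)
    return 100.0 * (total_xg - total_goals) / len(shots)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half(shots, rng):
    """Randomly split one goalie season in two and return QREAM for each half."""
    shuffled = list(shots)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return qream(shuffled[:half]), qream(shuffled[half:])


rng = random.Random(7)
first_half, second_half = [], []
for _ in range(40):                      # 40 hypothetical goalie seasons
    skill = rng.gauss(0.0, 0.005)        # assumed true goals prevented per shot
    season = [(0.09, 1 if rng.random() < 0.09 - skill else 0) for _ in range(1500)]
    a, b = split_half(season, rng)
    first_half.append(a)
    second_half.append(b)

r = pearson(first_half, second_half)
example = qream([(0.10, 1)] * 90 + [(0.10, 0)] * 910)   # 100 xG, 90 goals, 1,000 shots
print(f"split-half correlation: {r:.2f}, example QREAM: {example:.2f}")
```

The last line checks the arithmetic on a single season: 100 expected goals against 90 actual goals on 1,000 shots is one goal prevented per 100 shots.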
Crudely, each goal prevented is worth about 1/3 of a point in the standings. Knowing how many goals a goalie prevents compared to average allows us to compute how many points a goalie might create for or cost their team. However, a more sophisticated analysis might compare the goal support the goalie receives to the expected goals faced (a bucketed version of that analysis can be found here). Using a win probability model, the impact the goalie had on winning or losing can be framed as actual wins versus expected wins.
goaltending update YTD. Bobrovsky has added almost 20 points & 9 spots in standings to CBJ. Vezina fav & Hart candidate. Talbot valuable too pic.twitter.com/l019js71U8
Expected goals are also important because they begin to frame the uncertainty that goes along with goals, chance, and performance. What does the probability of a goal represent? Think of an expected goal as a coin weighted to represent the chance that shot is a goal. Historically, a shot from the blueline might end up a goal only 5% of the time. After 100 shots (or coin flips) will there be exactly 5 goals? Maybe, but maybe not. Same with a rebound from in tight to the net that has a probability of a goal equal to 50%. After 10 shots, we might not see 5 goals scored, like ‘expected.’ 5 goals is the most likely outcome, but anywhere from 0 to 10 is possible on only 10 shots (or coin flips).
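The coin-flip framing can be made exact with the binomial distribution. A minimal sketch, assuming every shot in each example has the same goal probability (the function name is hypothetical):

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k goals on n shots that each score with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Ten in-tight rebounds at 50% each: how often do we see exactly the 'expected' 5 goals?
p_five_of_ten = binomial_pmf(5, 10, 0.50)
# One hundred blueline shots at 5% each: how often exactly 5 goals?
p_five_of_hundred = binomial_pmf(5, 100, 0.05)

print(f"exactly 5 goals on 10 rebounds:  {p_five_of_ten:.3f}")      # roughly 0.246
print(f"exactly 5 goals on 100 point shots: {p_five_of_hundred:.3f}")  # roughly 0.180
```

So even the single most likely outcome happens less than a quarter of the time in both cases.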
We can see how actual goals and expected goals might deviate in small sample sizes, from game to game and even season to season. Luckily, we can use programs like R, Python, or Excel to simulate coin flips or expected goals. A goalie might face 1,000 shots in a season, giving up 90 goals. With historical data, each of those shots can be assigned a probability of being a goal. If the average probability of a goal is 10%, we expect the goalie to give up 100 goals. But using xG, there are other possible outcomes. Simulating 1 season based on expected goals might result in 105 goals against. Another simulation might be 88 goals against. We can simulate these same shots 1,000 or 10,000 times to get a distribution of outcomes based on expected goals and compare it to the actual goals.
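A minimal sketch of that simulation in Python is below. It assumes every one of the 1,000 shots carries the same 10% xG, which real shots would not, and it runs 2,000 simulated seasons to keep it quick; the function and variable names are hypothetical.

```python
import random


def simulate_goals_against(shot_xg, n_sims=2000, seed=42):
    """Replay the same shots n_sims times, flipping one weighted coin per shot."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        totals.append(sum(1 for p in shot_xg if rng.random() < p))
    return totals


# Made-up season from the example: 1,000 shots at 10% each, 90 actual goals against.
shot_xg = [0.10] * 1000
actual_goals_against = 90

sims = simulate_goals_against(shot_xg)
mean_sim = sum(sims) / len(sims)
share_as_good = sum(1 for g in sims if g <= actual_goals_against) / len(sims)

print(f"average simulated goals against: {mean_sim:.1f}")
print(f"range of simulated seasons: {min(sims)} to {max(sims)}")
print(f"share of simulations with {actual_goals_against} or fewer goals against: {share_as_good:.1%}")
```

The last number is the useful one: it shows how often an exactly average goalie would post a season this good on these shots through chance alone.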
In our example, the goalie possibly prevented 10 goals on 1,000 shots (100 xGA – 90 actual GA). But they also may have prevented 20 or prevented 0. With expected goals and simulations, we can begin to visualize this uncertainty. As the sample size increases, the uncertainty decreases but never evaporates. Goaltending is a simple position, but the range of outcomes, particularly in small samples, can vary due to random chance regardless of performance. Results can vary due to performance (of the goalie, teammates, or opposition) as well, and since we only have one season that actually exists, separating the two is painful. Embracing the variance is helpful and expected goals help create that framework.
It is important to acknowledge that results do not necessarily reflect talent or future or past results. So it is important to incorporate uncertainty into how we think about measuring performance. Expected goal models and simulations can help.
Luckily, Bayesian analysis can also deal with weighting uncertainty and evidence. First, we set a prior – a probability distribution of expected outcomes. Brian MacDonald used mean Even Strength Save Percentage as a prior, based on the distribution of ESSV% of NHL goalies. We can do the same thing with Expected Save Percentage, (shots – xG) / shots, to create a unique prior distribution of outcomes for each goalie season depending on the quality of shots faced and the sample size we’d like to see. Once the prior is set, evidence (saves in our case) is layered on to the prior, creating a posterior outcome.
Imagine a goalie facing 100 shots to start their career and, remarkably, making 100 saves. They face 8 total xG against, so we can set the Prior Expected Save% as a distribution centered around 92%. The current evidence at this point is 100 saves on 100 shots, and Bayesian Analysis will combine this information to create a Posterior distribution.
Goaltending is a binary job (save/goal) so we can use a beta distribution to create a distribution of the goaltender’s expected (prior) and actual (evidence) save percentage between 0 and 1, like a baseball player’s batting average will fall between 0 and 1. We also have to set the strength of the prior – how robust the prior is to the new evidence coming in (the shots and saves of the goalie in question). A weak prior would concede to evidence quickly, so a hot streak to start a season or career may lead the model to think this goalie may be a Hart candidate or future Hall-of-Famer! A strong prior would assume every goalie is average and require prolonged over or under achieving to convince the model otherwise. Possibly fair, but not revealing any useful information until it has been common knowledge for a while.
Every time a reported result activates your small sample size spidey senses, remember Bayesian analysis is thoroughly unimpressed, dutifully collecting evidence, one shot at a time.
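Here is a minimal sketch of the beta update for the 100-saves-on-100-shots goalie. The prior strengths of 50 and 2,000 "pseudo-shots" are assumptions picked to show the contrast between a weak and a strong prior, not values I have tuned, and the function names are hypothetical.

```python
def beta_update(prior_mean, prior_strength, saves, shots):
    """Combine a beta prior (mean and pseudo-shot strength) with observed saves and goals."""
    alpha = prior_mean * prior_strength + saves                 # prior saves + actual saves
    beta = (1.0 - prior_mean) * prior_strength + (shots - saves)  # prior goals + actual goals
    return alpha, beta


def beta_mean(alpha, beta):
    """Mean of a beta distribution, read here as the posterior save percentage."""
    return alpha / (alpha + beta)


# 8 xG on 100 shots gives a prior expected save percentage of 92%.
prior_mean = 1.0 - 8 / 100

weak = beta_mean(*beta_update(prior_mean, prior_strength=50, saves=100, shots=100))
strong = beta_mean(*beta_update(prior_mean, prior_strength=2000, saves=100, shots=100))

print(f"weak prior (50 pseudo-shots):     posterior save% = {weak:.3f}")
print(f"strong prior (2000 pseudo-shots): posterior save% = {strong:.3f}")
```

The weak prior is swept up by the hot start while the strong prior barely moves off 92%, which is exactly the tradeoff in choosing the strength.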
Perfect is often the enemy of the good. Expected goal models fail to completely capture the complex networks and inputs that create goals, but they do improve on current results-based metrics such as shot attempts by a considerable amount. Their outputs can be conceptualized by fans and players alike; everybody understands a breakaway has a better chance of being a goal than a point shot.
The math behind the model is less accessible, but people, particularly the young, are becoming more comfortable with prediction algorithms in their daily life, from Spotify generating playlists to Amazon recommender systems. Coaches, players, and fans on some level understand not all grade A chances will result in a goal. So while out-chancing the other team in the short term is no guarantee of victory, doing it over the long term is a recipe for success. Removing some of the noise that goals contain and the conceptual flaws of raw shot attempts helps smooth the short-term disconnect between performance and results.
My current case study using expected goals is to measure goaltending performance since it’s the simplest position – we don’t need to try to split credit between linemates. Looking at xGA – GA per shot captures more goalie specific skill than save percentage and lends itself to outlining the uncertainty those results contain. Expected goals also allow us to create an informed prior that can be used in a Bayesian hierarchical model. This can quantify the interaction between evidence, sample size, and uncertainty.
Almost the entire conceptual arsenal that we use today to describe and study football consists of on-the-ball event types, that is to say it maps directly to raw data. We speak of “tackles” and “aerial duels” and “big chances” without pausing to consider whether they are the appropriate unit of analysis. I believe that they are not. That is not to say that the events are not real; but they are merely side effects of a complex and fluid process that is football, and in isolation carry little information about its true nature. To focus on them then is to watch the train passing by looking at the sparks it sets off on the rails.
Armed with only ‘outcome data’ rather than comprehensive ‘inputs data’, most analysts’ models will be best served with a logistic regression. Logistic regression often bests complex models, often generalizing better than machine learning procedures. However, it will become important to lean on machine learning models as reliable ‘input’ data becomes available in order to capture the deep networks of effects that lead to goal creation and prevention. Right now we only capture snapshots, thus logistic regression should perform fine in most cases.
 Most people readily acknowledge some share of results in hockey are luck. Is the number closer to 60% (given the repeatable skill in my model is about 40%), or can it be reduced to 0% because my model is quite weak? The current model can be improved with more diligent feature generation and adding key features like pre-shot puck movement and some sort of traffic metric. This is interesting because traditionally logistic regression models see diminishing marginal returns from adding more variables, so while I am missing 2 big factors in predicting goals, the intra-seasonal correlation might only go from 40% to 50%. However, deep learning networks that can capture deeper interactions between variables might see an overweight benefit from these additional ‘input’ variables (possibly capturing deeper networks of effects), pushing the correlation and skill capture much higher. I have not attempted to predict goals using deep learning methods to date.
I’ve recently attempted to measure goaltending performance by looking at the number of expected goals a goaltender faces compared to the goals they actually allow. Expected goals are ‘probabilistic goals’ based on what we have data for (which isn’t everything): if that shot were taken 1,000 times on the average goalie that made the NHL, how often would it be a goal? Looking at one shot there is variance, the puck either goes in or doesn’t, but over the course of a season summing the expected goals gives a little better idea of how the goaltender is performing because we can adjust for the quality of shots they face, helping isolate their ‘skill’ in making saves. The metric, which I’ll refer to as QREAM (Quality Rules Everything Around Me), reflects goaltender puck-saving skill more than raw save percentage, showing more stability within a goalie season.
Good stuff. We can then use QREAM to break down goalie performance by situations, tactical or circumstantial, to reveal actionable trends. Is goalie A better on shots from the left side or right side? Left shooters or right shooters? Wrist shots, deflections, etc? Powerplay? Powerplay, left or right side? etc. We can even visualise it, and create a unique descriptive look at how each goaltender or team performed.
This is a great start. The next step in confirming the validity of a statistic is looking at how it holds up over time. Is goalie B consistently weak on powerplay shots from the left side? Is it something that can be exploited by looking at the data? Predictivity is important to validate a metric, showing that it can be acted upon and some sort of result can be expected. Unfortunately, year over year trends by goalie don’t hold up in an actionable way. There might be a few persistent trends below, but nothing systemic that would be more prevalent than just luck. Why?
Game Theory (time for some)
In the QREAM example, predictivity is elusive because hockey is not static and all players and coaches in question are optimizers trying their best to generate or prevent goals at any time. Both teams are constantly making adjustments, sometimes strategically and sometimes unconsciously. As a data scientist, when I analyse 750,000 shots over 10 seasons, I only see what happened, not what didn’t happen. If in one season, goalie A underperformed the average on shots from left shooters from the left side of the ice, that would show up in the data, but it would be noticed by players and coaches quicker and in a much more meaningful and actionable way (maybe it was the result of hand placement, lack of squareness, cheating to the middle, defenders who let up cross-ice passes from right to left more often than expected, etc.). The goalie and defensive team would also pick up on these trends and understandably compensate, maybe even slightly over-compensate, which would open up other options for shooters attempting to score, which the goalie would adjust to, and so on until the game reaches some sort of multi-dimensional equilibrium (actual game theory). If a systemic trend did continue then there’s a good chance that that goalie will be out of the league. Either way, trying to capture a meaningful actionable insight from the analysis is much like trying to capture lightning in a bottle. In both cases, finding a reliable pattern in a game where both sides are constantly adjusting and counter-adjusting is very difficult.
This isn’t to say the analysis can’t be improved. My expected goal model has weaknesses and will always have limitations due to data and user error. That said, I would expect the insights of even a perfect model to be arbitraged away. More shockingly (since I haven’t looked at this in-depth, at all), I would expect the recent trend of NBA teams fading the use of mid-range shots to reverse in time as more teams counter that with personnel and tactics, then a smart team could probably exploit that set-up by employing slightly more mid-range shots, and so on, until a new equilibrium is reached. See you all at Sloan 2020.
Data On Ice
The role of analytics is to provide a new lens to look at problems and make better-informed decisions. There are plenty of examples of applications at the hockey management level to support this; data analytics have aided draft strategy and roster composition. But bringing advanced analytics to on-ice strategy will likely continue to chase adjustments players and coaches are constantly making already. Even macro-analysis can be difficult once the underlying inputs are considered.
An analyst might look at strategies to enter the offensive zone, where you can either forfeit control (dump it in) or attempt to maintain control (carry or pass it in). If you watched a sizable sample of games across all teams and a few different seasons, you would probably find that you were more likely to score a goal if you tried to pass or carry the puck into the offensive zone than if you dumped it. Actionable insight! However, none of these plays occurs in a vacuum – a true A/B test would have the offensive players randomise between dumping it in and carrying it. But the offensive player doesn’t randomise, they are making what they believe to be the right play at that time considering things like offensive support, defensive pressure, and their own shift length and that of their teammates. In general, when they dump the puck, they are probably trying to make a poor position slightly less bad and get off the ice. A randomised attempted carry-in might be stopped and result in a transition play against. So, the insight of not dumping the puck should be changed to ‘have the 5-player unit be in a position to carry the puck into the offensive zone,’ which encompasses more than a dump/carry strategy. In that case, this isn’t really an actionable, data-driven strategy, rather an observation. A player who dumps the puck more often likely does so because they struggle to generate speed and possession from the defensive zone, something that would probably be reflected in other macro-stats (i.e. the share of shots or goals they are on the ice for). The real insight is the player probably has some deficiencies in their game. And this is where the underlying complexity of hockey begins to grate at macro-measures of hockey analysis: there are many little games within the game, player-level optimisation, and second-order effects that make capturing true actionable, data-driven insight difficult.
It can be done, though in a round-about way. Like many, I support the idea of using (more specifically, testing) 4 or even 5 forwards on the powerplay. However, it’s important to remember that analysis showing a 4F powerplay is more effective is more a representation of the personnel of the teams that elect to use that strategy than of the effectiveness of that particular strategy in a vacuum. And teams will work to counter by maximising their chance of getting the puck and attacking the forward on defence by increasing aggressiveness, which may be countered by a second defenseman, and so forth.
Game Theory (revisited & evolved)
Where analytics looks to build strategic insights on a foundation of shifting sand, there’s an equally interesting force at work – evolutionary game theory. Let’s go back to the example of the number of forwards employed on the powerplay: teams can use 3, 4, or 5 forwards. In game theory, we look for a dominant strategy first. While self-selected 4 forward powerplays are more effective, a team shouldn’t necessarily employ one if up by 2 goals in the 3rd period, since a marginal goal for is worth less than a marginal goal against. And because 4 forward powerplays, intuitively, are more likely to concede chances and goals against than 3F-2D, it’s not a dominant strategy. Neither are 3F-2D or 5F-0D.
Thought experiment. Imagine in the first season, every team employed 3F-2D. In season 2, one team employs a 4F-1D powerplay 70% of the time. They would have some marginal success because the rest of the league is configured to oppose 3F-2D, and in season 3 this strategy replicates: more teams run a 4F-1D, in line with evolutionary game theory. Eventually, say in season 10, more teams might run a 4F-1D powerplay than 3F-2D, and some even 5F-0D. However, penalty kills will also adjust to counter-balance and the game will continue. There may or may not be an evolutionarily stable strategy where teams are best served mixing strategies like you would playing rock-paper-scissors. I imagine the proper strategy would depend on score state (primarily), and respective personnel.
You can imagine a similar game representing the function of the first forward in on the forecheck. They can go for the puck or hit the defenseman – always going for the puck would let the defenseman become too comfortable, letting them make more effective plays, while always hitting would take the forechecker out of the play too often, conceding too much ice after a simple pass. The optimal strategy is likely randomising, say, hitting 20% of the time factoring in gap, score, personnel, etc.
A More Robust (& Strategic) Approach
Even if it seems a purely analytic-driven strategy is difficult to conceive, there is an opportunity to take advantage of this knowledge. Time is a more robust test of on-ice strategies than p-values. Good strategies will survive and replicate, poor ones will (eventually and painfully) die off. Innovative ideas can be sourced from anywhere and employed in minor-pro affiliates where the strategy’s effects can be quantified in a more controlled environment. Each organisation has hundreds of games a year in their control and can observe many more. Understanding that building an analytical case for a strategy may be difficult (coaches are normally sceptical of data, maybe intuitively for the reasons above), analysts can sell the merit of experimenting and measuring, giving the coach major ownership of what is tested. After all, it pays to be first in a dynamic game such as hockey. Bobby Orr changed the way the blueliners played. New blocking tactics (and equipment) led to improved goaltending. Hall-of-Fame forward Sergei Fedorov was a terrific defenseman on some of the best teams of the modern era. Teams will benefit from being the first to employ (good) strategies that other teams don’t see consistently and don’t devote considerable time preparing for.
The game can also improve using this framework. If leagues want to encourage goal scoring, they should encourage new tactics by incentivising goals. I would argue that the best and most sustainable way to increase goal scoring would be to award AHL teams 3 points for scoring 5 goals in a win. This will encourage offensive innovation and heuristics that would eventually filter up to the NHL level. Smaller equipment or bigger nets are susceptible to second order effects. For example, good teams may slow down the game when leading (since a marginal goal for is now worth less than a marginal goal against), making the on-ice product even less exciting. Incentives and innovation work better than micro-managing.
The primary role of analytics in sport and business is to deliver actionable insights using the tools at their disposal, whether it is statistics, math, logic, or whatever. With current data, it is easier for analysts to observe results than to formulate superior on-ice strategies. Instead of struggling to capture the effect of strategy in biased data, they should use this to their advantage and look at these opportunities through the prism of game theory: testing and measuring and letting the best strategies bubble to the top. Even the best analysis might fail to pick up on some second order effect, but thousands of shifts are less likely to be fooled. The data is too limited in many ways to paint the complete picture. A great analogy came from football (soccer) analyst Marek Kwiatkowski:
Almost the entire conceptual arsenal that we use today to describe and study football consists of on-the-ball event types, that is to say it maps directly to raw data. We speak of “tackles” and “aerial duels” and “big chances” without pausing to consider whether they are the appropriate unit of analysis. I believe that they are not. That is not to say that the events are not real; but they are merely side effects of a complex and fluid process that is football, and in isolation carry little information about its true nature. To focus on them then is to watch the train passing by looking at the sparks it sets off on the rails.
Hopefully, there will soon be a time where every event is recorded, and in-depth analysis can capture everything necessary to isolate things like specific goalie weaknesses, optimal powerplay strategy, or best practices on the forecheck. Until then there are underlying forces at work that will escape detection. But it’s not all bad news: the best strategy is to innovate and measure. This may not be groundbreaking to the many innovative hockey coaches out there but can help focus the smart analyst, delivering something actionable.
 Is hockey a simple or complex system? When I think about hockey and how to best measure it, this is a troubling question I keep coming back to. A simple system has a modest amount of interacting components and they have clear relationships to other components: say, when you are trailing in a game, you are more likely to out-shoot the other team than you would otherwise. A complex system has a large number of interacting pieces that may combine to make these relationships non-linear and difficult to model or quantify. Say, when you are trailing the pressure you generate will be a function of time left in the game, respective coaching strategies, respective talent gaps, whether the home team is line matching (presumably to their favor), in-game injuries or penalties (permanent or temporary), whether one or both teams are playing on short rest, cumulative impact of physical play against each team, ice conditions, and so on.
Fortunately, statistics are such a powerful tool because a lot of these micro-variables even out over the course of the season, or possibly the game, to become net neutral. Students learning about gravitational force don’t need to worry about molecular forces within an object; the system (e.g. a block sliding down an incline) can be separated from the complex and simplified. By making the right simplifying assumptions we can do the same in hockey, but we do so at the risk of losing important information. More convincingly, we can also attempt to build out the entire state-space (e.g. different combinations of players on the ice) and use machine learning to find patterns between those features and winning hockey games. This is likely being leveraged internally by teams (who can generate additional data) and/or professional gamblers. However, with machine learning techniques applied, there appeared to be a theoretical upper bound on single-game prediction accuracy of only about 62%. The rest, presumably, is luck. Even if this upper bound softens with more data, such as biometrics and player tracking, prediction in hockey will still be difficult.
It seems to me that hockey is suspended somewhere between the simple and the complex. On the surface, there’s a veneer of simplicity and familiarity, but perhaps there’s much going on underneath the surface that is important but can’t be quantified properly. On a scale from simple to complex, I think hockey is closer to complex than simple, but not as complex as the stock market, for example, where upside and downside are theoretically unlimited and not bound by the rules of a game or a set amount of time. A hockey game may be 60 on a scale of 0 (simple) to 100 (complex).
Spoiler alert: if you perform the same thought experiment with rock-paper-scissors you arrive at the right answer – randomise between all 3, each 1/3 of the time – unless you are a master of psychology and can read those around you. This obviously has a closed form solution, but I like visuals better:
This likely speaks more to personnel than tactics; Fedorov could have been peerless. However, I think of football, where position changes are more common: Richard Sherman, a forgettable college receiver at Stanford, switched to defence halfway through his college career and became a top player in the NFL. Julian Edelman was a college quarterback and is now a top receiver on the Super Bowl champions. Test and measure.
Their desires and effort are justified: a single metric, when properly used, can be used to analyze salaries, trades, roster composition, draft strategy, etc. Though it should be noted that WAR, or any single number rating, is not a magic elixir since it can fail to pick up important differences in skill sets or role, particularly in hockey. There is also a risk that it is used as a crutch, which may be the case with any metric.
Targeting the Head
Prior explorations into answering the question have been detailed and involved, and rightfully so, aggregating and adjusting an incredible amount of data to create a single player-season value. However, I will attempt to reverse engineer a single metric based on in-season data from a project.
For the 2015-16 season, the CrowdScout project aggregated the opinions of individual users. The platform uses the Elo formula, a memoryless algorithm that constantly adjusts each player’s score with new information. In this case, the information is the user’s opinion, which is hopefully guided by the relevant on-ice metrics (provided to the user, see below). Hopefully, the validity of this project is closer to Superforecasting than the NHL awards, and it should be: the ‘best’ users or scouts are given increasingly more influence over the ratings, while the worst are marginalized.
The CrowdScout platform ran throughout the season with over 100 users making over 32,000 judgments on players, creating a population of player ratings ranging from Sidney Crosby to Tanner Glass. The system has largely worked as intended, but needs to continue to acquire an active, smart, and diverse user base – this will always be the case when trying to harness the ‘wisdom of the crowd.’ Hopefully, as more users sign up and smarter algorithms emphasize the opinions of the best, the Elo rating will come closer to answering the question posed to scouts as they are prompted to rank two players: if the season started today, which player would you choose if the goal were to win a championship?
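The post does not spell out the exact update rule CrowdScout uses, so the following is only a minimal Python sketch of the textbook Elo update for one head-to-head judgment. The function names are hypothetical, and the `k=32` step size and `scale=400` are conventional chess defaults assumed here for illustration, not values taken from the platform. It shows why Elo is memoryless: the new ratings depend only on the two current ratings and the latest judgment.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Chance that A is preferred over B under the standard Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(preferred: float, other: float, k: float = 32.0) -> Tuple[float, float]:
    """Return new (preferred, other) ratings after one scout judgment.

    k is an assumed step size; a larger k means a single opinion moves
    the ratings further.
    """
    surprise = 1.0 - expected_score(preferred, other)
    return preferred + k * surprise, other - k * surprise


# Two equally rated players: the preferred one gains what the other loses.
new_a, new_b = elo_update(1500.0, 1500.0)
```

One way to give better scouts more influence, as the post describes, would be to scale `k` by a per-user weight, but that is a guess at the mechanism rather than a description of it.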
Each player’s Elo is adjusted by the range of ratings within the population. The result, ranging from 0 to 100, generally passes the sniff test, at times missing on players due to too few or poor ratings. However, this player-level rating provides something more interesting – a target variable from which to build an empirical model. Whereas WAR is, in theory, a cumulative metric representing incremental wins added by a player, the CrowdScout Score, in theory, represents a player’s value to a team pursuing a championship. Both are desirable outcomes, and neither will work perfectly in practice, but this is hockey analytics: we can’t let perfect get in the way of good.
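Adjusting each Elo "by the range of ratings" to land on a 0 to 100 scale reads like a min-max rescaling. The sketch below assumes exactly that, which the post does not confirm. The function name and the three ratings are made up for illustration.

```python
def scale_to_100(elo_by_player):
    """Min-max rescale raw Elo ratings onto a 0-100 score (assumed method)."""
    lowest = min(elo_by_player.values())
    highest = max(elo_by_player.values())
    span = highest - lowest
    if span == 0:
        # Everyone is rated the same, so place them all mid-scale.
        return {name: 50.0 for name in elo_by_player}
    return {name: 100.0 * (elo - lowest) / span
            for name, elo in elo_by_player.items()}


# Illustrative ratings only, not real CrowdScout data.
scores = scale_to_100({"Player A": 1700.0, "Player B": 1500.0, "Player C": 1300.0})
```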
Why is this analysis useful or interesting?
Improve the CrowdScout Score – a predicted CrowdScout Score based on on-ice data could help identify misvalued players and reinforce properly valued players. In sum, a proper model would be superior to the rankings sourced from the inaugural season with a small group of scouts.
Validate the CrowdScout Score – Is there a proper relationship between CrowdScout Score and on-ice metrics? How large are the residuals between the predicted score and actual score? Can the CrowdScout Score or predicted score be reliably used in other advanced analyses? A properly constructed model that reveals a solid relationship between crowdsourced ratings and on-ice metrics would help validate the project. Can we go back in time to create a predicted score for past player seasons?
Evaluate Scouts – The ability to reliably predict the CrowdScout Score based on on-ice metrics can be used to measure the accuracy of a scout’s ratings in real-time. The current algorithm can only infer correctness in the future – time needs to pass to determine whether the scout has chosen a player preferred by the rest of the crowd. This could be the most powerful result, constantly increasing the influence of users whose ratings agree with the on-ice results. This, in turn, would increase the accuracy of the CrowdScout Score, leading to a stronger model and continuing a virtuous circle.
Fun – Every sports fan likes a good top 10 list or something you can argue over.
Reverse Engineering the Crowd
We are lucky enough to have a shortcut to a desirable target variable, the end of season CrowdScout Score for each NHL player. We can then merge on over 100 player-level micro stats and rate metrics for the 2015-16 season, courtesy of puckalytics.com. There are 539 skaters that have at least 50 CrowdScout games and complete metrics. This dataset can then be used to fit a model using on-ice data to explain CrowdScout Score, then we use the model output to predict the CrowdScout Score, using the same player-level on-ice data. Where the crowd may have failed to accurately gauge a player’s contribution to winning, the model can use additional information to create a better prediction.
The strength of any model lies in proper feature selection and prevention of overfitting. Hell, with over 100 variables and over 500 players, you could explain the number of playoff beard follicles with spurious statistical significance. To prevent this, I performed a couple of operations using the caret package in R.
Find Linear Combination of Variables – using the findLinearCombos function in caret, variables that were mathematically identical to a linear combination of another set of variables were dropped. For example, you don’t need to include goals, assists, and points, since points are simply assists plus goals.
Recursive Feature Elimination – using the rfe function in caret and a 10-fold cross-validation control (10 subsets of data were considered when making the decision, and all decisions were made on the model’s performance on unseen, or holdout, data), the remaining 80-some skater variables were considered from most powerful to least powerful. The RFE plot below shows the model reaching maximum strength at 46 features, but most of the gains are achieved by about the 8 to 11 most important variables.
Correlation Matrix – create a matrix to identify and remove features that are highly correlated with each other. The final model had 11 variables listed below.
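The correlation step in the list above can be sketched in plain Python. This is a simple greedy variant, not caret's own procedure: it keeps a feature only if its absolute Pearson correlation with every feature already kept is below a cutoff. The 0.9 cutoff, the function names, and the toy numbers are all assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def drop_correlated(features, cutoff=0.9):
    """Greedily keep features that are not highly correlated with any kept one."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < cutoff for k in kept):
            kept.append(name)
    return kept


# Toy per-player columns: shots track goals closely, hits do not.
toy = {
    "goals": [10, 20, 30, 25, 5],
    "shots": [100, 210, 290, 260, 60],
    "hits": [80, 20, 50, 10, 90],
}
kept = drop_correlated(toy)
```

Because the filter is greedy, the order of the columns decides which member of a correlated pair survives, so listing the metrics you trust most first is a reasonable design choice.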
The remaining variables were placed into a Random Forest model targeting each skater’s CrowdScout Score. Random Forest is a popular ensemble model: it randomly subsets variables and observations (random) and creates many decision-trees to explain the target variable (forest). Each observation or player is assigned a predicted score based on the aggregate results of the many decision-trees.
Using the caret package in R, I created a Random Forest model controlled by a 10-fold cross-validation, not necessarily to prevent overfitting, which is not a large concern with Random Forest, but to cycle through all data and create predicted scores for each player. I gave the model the flexibility to try 5 different tuning combinations, allowing it to test the ideal number of variables randomly sampled at each split and the number of trees to use. The result was a very good fitting model, explaining over 95% of the CrowdScout Score out of sample. Note that the variation explained, rather than the variance explained, was closer to 70%.
Note the slope of the best-fit relationship between actual and predicted scores is a little less than 1. The model doesn’t want to credit the best players too much for their on-ice metrics, or penalize the worst players too much, but otherwise does a very good job.
Let’s return to the original intent of the analysis. We can predict about 95% of CrowdScout Score using vetted on-ice metrics. This suggests the score is reliable, but that doesn’t necessarily mean the CrowdScout Score is right. In fact, we can assume that the actual score is often wrong. How does a simpler model do? A Generalized Linear Model (GLM) using the same on-ice metrics performs fairly well out of sample, explaining about 70% of the variation. The larger error terms of the GLM represent larger deviations of the predicted score from the actual. While these larger deviations result in a poorer-fitting model, they may also contain some truth. The worse-fitting linear model has more flexibility to be wrong, perhaps allowing a more accurate prediction.
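The idea of using 10-fold cross-validation to "cycle through all data" can be sketched in plain Python: every player is held out exactly once and scored by a model fit on the other nine folds. To keep the example dependency-free, a one-variable least-squares line stands in for the Random Forest, and the data are synthetic. Neither reflects the actual model or the puckalytics metrics, and the function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def out_of_fold_predictions(xs, ys, k=10, seed=0):
    """Predict each observation from a model that never saw it."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint holdout sets
    preds = [None] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in order if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = intercept + slope * xs[i]
    return preds


def r_squared(actual, predicted):
    """Share of the variation in actual explained by predicted."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data: one rate metric and a noisy 0-100 style score.
rng = random.Random(1)
metric = [rng.uniform(0, 3) for _ in range(50)]
score = [40 + 15 * x + rng.gauss(0, 3) for x in metric]
preds = out_of_fold_predictions(metric, score)
fit_quality = r_squared(score, preds)
```

Computing `r_squared` on these held-out predictions, rather than on in-sample fits, is what makes a figure like "explained out of sample" meaningful.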
How do the player-level residuals between the two models compare? They are largely the same directionally, but the GLM residuals are about double in magnitude. So, for example, the Random Forest model predicts Sean Monahan’s CrowdScout Score to be 64 instead of his current 60, giving a residual of +4 (residual = predicted – actual). Not to be outdone, the Generalized Linear Model doubles that residual predicting a 68 score (+8 residual). It appears that both models generally agree, with the GLM being more likely to make a bold correction to the actual score.
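The Monahan example works out as follows in a few lines of Python, using only the three numbers quoted above and the stated convention residual = predicted minus actual. The variable names are illustrative.

```python
# Figures quoted in the text for Sean Monahan.
actual_score = 60
predicted = {"random_forest": 64, "glm": 68}

# residual = predicted - actual, so a positive value means the model
# thinks the crowd is underrating the player.
residuals = {model: score - actual_score for model, score in predicted.items()}

# Both models agree on direction when the residuals share a sign.
same_direction = residuals["random_forest"] * residuals["glm"] > 0

# How much bolder the GLM correction is than the Random Forest one.
glm_to_rf_ratio = residuals["glm"] / residuals["random_forest"]
```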
The development of an accurate single comprehensive metric to measure player impact will be an iterative process. However, it seems the framework exists to fuse human input and on-ice performance into something that can lend itself to more complex analysis. Our target variable was not perfect, but it provided a solid baseline for this analysis and will be improved. To recap the original intent of the analysis:
Both models generally agree when a player is being overrated or underrated by the crowd, though by different magnitudes. In either case, the predicted score is directionally likely to be more accurate than the current score. This makes sense since we have more information (on-ice data). If it wasn’t obvious, it appears on-ice metrics can help improve the CrowdScout Score.
Fortunate, because our models fail to explain between 5% and 30% of the score and vary more from the true ability. Some of the error will be justified, but often it will signal that the CrowdScout Score needs to adjust. Conversely, a beta project with relatively few users was able to create a comprehensive metric that can be mostly engineered and validated using on-ice metrics.
Being able to calculate a predicted CrowdScout Score more accurate than the actual score gives the platform an enhanced ability to evaluate scouting performance in real-time. This will strengthen the virtuous circle of giving the best scouts more influence over Elo ratings, which will help create a better prediction model.
Your opinion will now be held up against people, models, and your own human biases. Fun.
Huge thanks to asmean for contributing to this study, specifically advising on machine learning methods.
The Wins Above Replacement problem is not unlike the attribution problem my Data Science marketing colleagues deal with. We know there was a positive event (a win or conversion), but how do we attribute that event to the input actions of hockey players or marketing channels? It’s definitely a problem I would love to circle back to.
What determines the ‘best’ scout? Activity is one component, but picking players that continue to ascend is another. I actually have plans to make this algorithm ‘smarter’, and an explanation is long overdue on my end.
 The CrowdScout platform and ensemble models have similar philosophies – they synthesize the results of models or opinions of users into a single score in order to improve their accuracy.