Expected Goals (xG), Uncertainty, and Bayesian Goalies

All xG model code can be found on GitHub.

Expected Goals (xG) Recipe

If you’re reading this, you’re likely familiar with the idea behind expected goals (xG), whether from soccer analytics, early work done by Alan RyderBrian MacDonald, or current models by DTMAboutHeart and Asmean, Corsica, Moneypuck, or things I’ve put up on Twitter. Each model attempts to create a probability of each shot being a goal (xG) given the shot’s attributes like shot location, strength, shot type, preceding events, shooter skill, etc. There are also private companies supplementing these features with additional data (most importantly pre-shot puck movement on non-rebound shots and some sort of traffic/sight-line metric) but this is not public or generated in the real-time so will not be discussed here.[1]

To assign a probability (between 0% and 100%) to each shot, most xG models likely use logistic regression – a workhorse in many industry response models. As you can imagine the critical aspect of an xG model, and any model, becomes feature generation – the practice of turning raw, unstructured data into useful explanatory variables. NHL play-by-play data requires plenty of preparation to properly train an xG model. I have made the following adjustments to date:

  • Adjust for recorded shot distance bias in each rink. This is done by using a cumulative density function for shots taken in games where the team is away and apply that density function to the home rink in case their home scorer is biased. For example (with totally made up numbers), when Boston is on the road their games see 10% of shots within 5 feet of the goal, 20% of shots within 10 feet of the goal, etc. We can adjust the shot distance in their home rink to be the same since the biases of 29 data-recorders should be less than a single Boston data-recorder. If at home in Boston, 10% of the shots were within 10 feet of the goal, we might suspect that the scorer in Boston is systematically recording shots further away from the net than other rinks. We assume games with that team result in similar event coordinates both home and away and we can transform the home distribution to match the away distribution. Below demonstrates how distributions can differ between home and away games, highlighting the probable bias Boston and NY Rangers scorer that season and was adjusted for. Note we also don’t necessarily want to transform by an average, since the bias is not necessarily uniform throughout the spectrum of shot distances.
home rink bias

No Place Like Home

  • Figure out what events lead up to the shot, what zone they took place in, and the time lapsed between these events and the eventual shot while ensuring stoppages in play are caught.
  • Limit to just shots on goal. Misses include information, but like shot distance contain scorer bias. Some scorers are more likely to record a missed shot than others. Unlike shots where we have a recorded event, and it’s just biased, adjusting for misses would require ‘inventing’ occurrences in order to adjust biases in certain rinks, which seems dangerous. It’s best to ignore misses for now, particularly because the majority of my analysis focuses on goalies. Splitting the difference between misses caused by the goalie (perhaps through excellent positioning and reputation for not giving up pucks through the body) and those caused by recorder bias seems like a very difficult task. Shots on goal test the goalie directly hence will be the focus for now.
  • Clean goalie and player names. Annoying but necessary – both James and Jimmy Howard make appearances in the data, and they are the same guy.
  • Determine the strength of each team (powerplay for or against or if the goaltender is pulled for an extra attacker). There is a tradeoff here. The coefficients for the interaction of states (i.e. 5v4, 6v5, 4v3 model separately) pick up interesting interactions, but should significant instability from season to season. For example, 3v3 went from a penalty-box filled improbability to a common occurrence to finish overtime games. Alternatively, shooter strength and goalie strength can be model separately, this is more stable but less interesting.
  • Determine the goaltender and shooter handedness and position from look-up tables.
  • Determine which end of the ice and what coordinates (positive or negative) the home team is based, using recordings in any given period and rink-adjusting coordinates accordingly.
  • Calculate shot distance and shot angle. Determine what side of the ice the shot is from, whether or not it is the shooters off-wing based on handedness.
  • Tag shots as rushes or rebound, and if a rebound how far the puck travelled and the angular velocity of the puck from shot 1 to shot 2.
  • Calculate ‘shooting talent’ – a regressed version of shooting percentage using the Kuder-Richardson Formula 21, employed the same way as in DTMAboutHeart and Asmean‘s xG model.

All of this is to say there is a lot going on under the hood, the results are reliant on the data being recorded, processed, adjusted, and calculated properly. Importantly, the cleaning and adjustments to the data will never be complete, only issues that haven’t been discovered or adjusted for yet. There is no perfect xG model, nor is it possible to create one from the publicly available data, so it is important to concede that there will be some errors, but the goal is to prevent systemic errors that might bias the model. But these models do add useful information regular shot attempt models cannot, creating results that are more robust and useful as we will see.

Current xG Model

The current xG model does not use all developed features. Some didn’t contain enough unique information, perhaps over-shadowed by other explanatory variables. Some might have been generated on sparse or inconsistent data. Hopefully, current features can be improved or new features created.

While the xG model will continue to be optimized to better maximize out of sample performance, the discussion below captures a snapshot of the model. All cleanly recorded shots from 2007 to present are included, randomly split into 10 folds. Each of the 10 folds were then used a testing dataset (checking to see if the model correctly predicted a goal or not by comparing it to actual goals) while the other 9 corresponding folders were used to train the model. In this way, all reported performance metrics consist of comparing model predictions on the unseen data in the testing dataset to what actually happened. This is known as k-fold cross-validation and is fairly common practice in data science.

When we rank-order the predicted xG from highest to lowest probability we can compare the share of goals that occur to shots ordered randomly. This gives us a gains chart, a graphic representation of the how well the model is at finding actual goals relative to selecting shots randomly. We can also calculate the Area Under the Curve (AUC), where 1 is a perfect model and 0.5 is a random model. Think of the random model in this case as shot attempt measurement, treating all shots as equally likely to be a goal. The xG model has an AUC of about 0.75, which is good, and safely in between perfect and random. The most dangerous 25% of shots as selected by the model make up about 60% of actual goals. While there’s irreducible error and model limitations, in practice it is an improvement over unweighted shot attempts and accumulates meaningful sample size quicker than goals for and against.

gains chart

Gains, better than random

Hockey is also a zero-sum game. Goals (and expected goals) only matter relative to league average. Original iterations of the expected goal model built on a decade of data show that goals were becoming dearer compared to what was expected. Perhaps goaltenders were getting better, or league data-scorers were recording events to make things look harder than they were, or defensive structures were impacting the latent factors in the model or some combination of these explanations.

Without the means to properly separate these effects, each season receives it own weights for each factor. John McCool had originally discussed season-to-season instability of xG coefficients. Certainly this model contains some coefficient instability, particularly in the shot type variables. But overall these magnitudes adjust to equate each seasons xG to actual goals. Predicting a 2017-18 goal would require additional analysis and smartly weighting past models.

Coefficient Stability

Less volatile than goalies?

xG in Action

Every shot has a chance of going in, ranging from next to zero to close to certainty.  Each shot in the sample is there because the shooter believed there was some sort of benefit to shooting, rather than passing or dumping the puck, so we don’t see a bunch of shots from the far end of the rink, for example. xG then assigns a probability to each shot of being a goal, based on the explanatory variables generated from the NHL data – shot distance, shot angle, is the shot a rebound?, listed above.

Modeling each season separately, total season xG will be very close to actual goals. This also grades goaltenders on a curve against other goaltenders each season. If you are stopping 92% of shots, but others are stopping 93% of shots (assuming the same quality of shots) then you are on average costing your team a goal every 100 shots. This works out to about 7 points in the standings assuming a 2100 shot season workload and that an extra 3 goals against will cost a team 1 point in the standings. Using xG to measure goaltending performance makes sense because it puts each goalie on equal footing as far as what is expected, based on the information that is available.

We can normalize the number of goals prevented by the number of shots against to create a metric, Quality Rules Everything Around Me (QREAM), Expected Goals – Actual Goals per 100 Shots. Splitting each goalie season into random halves allows us to look at the correlation between the two halves. A metric that captures 100% skill would have a correlation of 1. If a goaltender prevented 1 goal every 100 shots, we would expect to see that hold up in each random split. A completely useless metric would have an intra-season correlation of 0, picking numbers out of a hat would re-create that result. With that frame of reference, intra-season correlations for QREAM are about 0.4 compared to about 0.3 for raw save percentage. Pucks bounce so we would never expect to see a correlation of 1, so this lift is considered to be useful and significant.[2]

intra-season correlations

Goalies doing the splits

Crudely, each goal prevented is worth about 1/3 of a point in the standings. Implying how many goals a goalie prevents compared to average allows us to compute how many points a goalie might create for or cost their team. However, a more sophisticated analysis might compare goal support the goalie receives to the expected goals faced (a bucketed version of that analysis can be found here). Using a win probability model the impact the goalie had on win or losing can be framed as actual wins versus expected.


xG’s also are important because they begin to frame the uncertainty that goes along with goals, chance, and performance. What does the probability of a goal represent? Think of an expected goal as a coin weighted to represent the chance that shot is a goal. Historically, a shot from the blueline might end up a goal only 5% of the time. After 100 shots (or coin flips) will there be exactly 5 goals? Maybe, but maybe not. Same with a rebound from in tight to the net that has a probability of a goal equal to 50%. After 10 shots, we might not see 5 goals scored, like ‘expected.’ 5 goals is the most likely outcome, but anywhere from 0 to 10 is possible on only 10 shots (or coin flips).

We can see how actual goals and expected goals might deviate in small sample sizes, from game to game and even season to season. Luckily, we can use programs like R, Python, or Excel to simulate coin flips or expected goals. A goalie might face 1,000 shots in a season, giving up 90 goals. With historical data, each of those shots can be assigned a probability of a being a goal. If the average probability of a goal is 10%, we expect the goalie to give up 100 goals. But using xG, there are other possible outcomes. Simulating 1 season based on expected goals might result in 105 goals against. Another simulation might be 88 goals against. We can simulate these same shots 1,000 or 10,000 times to get a distribution of outcomes based on expected goals and compare it to the actual goals.

In our example, the goalie possibly prevented 10 goals on 1,000 shots (100 xGA – 90 actual GA). But they also may have prevented 20 or prevented 0. With expected goals and simulations, we can begin to visualize this uncertainty. As the sample size increases, the uncertainty decreases but never evaporates. Goaltending is a simple position, but the range of outcomes, particularly in small samples, can vary due to random chance regardless of performance. Results can vary due to performance (of the goalie, teammates, or opposition) as well, and since we only have one season that actually exists, separating the two is painful. Embracing the variance is helpful and expected goals help create that framework.

It is important to acknowledge that results do not necessarily reflect talent or future or past results. So it is important to incorporate uncertainty into how we think about measuring performance. Expected goal models and simulations can help.

simulated seasons

Hackey statistics

Bayesian Analysis

Luckily, Bayesian analysis can also deal with weighting uncertainty and evidence. First, we set a prior –probability distribution of expected outcomes. Brian MacDonald used mean Even Strength Save Percentage as prior, the distribution of ESSV% of NHL goalies. We can do the same thing with Expected Save Percentage (shots – xG / shots), create a unique prior distribution of outcome for each goalie season depending on the quality of shots faced and the sample size we’ll like to see. Once the prior is set, evidence (saves in our case) is layered on to the prior creating a posterior outcome.

Imagine a goalie facing 100 shots to start their career and, remarkably, making 100 saves. They face 8 total xG against, so we can set the Prior Expected Save% as a distribution centered around 92%. The current evidence at this point is 100 saves on 100 shots, and Bayesian Analysis will combine this information to create a Posterior distribution.

Goaltending is a binary job (save/goal) so we can use a beta distribution to create a distribution of the goaltenders expected (prior) and actual (evidence) save percentage between 0 and 1, like a baseball players batting average will fall between 0 and 1. We also have to set the strength of the prior – how robust the prior is to the new evidence coming in (the shots and saves of the goalie in question). A weak prior would concede to evidence quickly, a hot streak to start a season or career may lead the model to think this goalie may be a Hart candidate or future Hall-of-Famer! A strong prior would assume every goalie is average and require prolonged over or under achieving to convince the model otherwise. Possibly fair, but not revealing any useful information until it has been common knowledge for a while.

bayesian goalie

Priors plus Evidence

More research is required, but I have set the default prior strength of equivalent to 1,000 shots. Teams give up about 2,500 shots a season, so a 1A/1B type goalie would exceed this threshold in most seasons. In my goalie compare app, the prior can be adjusted up or down as a matter of taste or curiosity. Research topics would investigate what prior shot count minimizes season to season performance variability.

Every time a reported result actives your small sample size spidey senses, remember Bayesian analysis is thoroughly unimpressed, dutifully collecting evidence, once shot at a time.


Perfect is often the enemy of the good. Expected goal models fail to completely capture the complex networks and inputs that create goals, but they do improve on current results-based metrics such as shot attempts by a considerable amount.  Their outputs can be conceptualized by fans and players alike, everybody understands a breakaway has a better chance of being a goal than a point shot.

The math behind the model is less accessible, but people, particularly the young, are becoming more comfortable with prediction algorithms in their daily life, from Spotify generating playlists to Amazon recommender systems. Coaches, players, and fans on some level understand not all grade A chances will result in a goal. So while out-chancing the other team in the short term is no guarantee of victory, doing it over the long term is a recipe for success. Removing some the noise that goals contain and the conceptual flaws of raw shot attempts helps the smooth short-term disconnect between performance and results.

My current case study using expected goals is to measure goaltending performance since it’s the simplest position – we don’t need to try to split credit between linemates. Looking at xGA – GA per shot captures more goalie specific skill than save percentage and lends itself to outlining the uncertainty those results contain. Expected goals also allow us to create an informed prior that can be used in a Bayesian hierarchical model. This can quantify the interaction between evidence, sample size, and uncertainty.

Further research topics include predicting goalie season performance using expected goals and posterior predictive distributions.


[1]Without private data or comprehensive tracking data technology analysts are only able to observe outcomes of plays – most importantly goals and shots – but not really what created those results. A great analogy came from football (soccer) analyst Marek Kwiatkowski:

Almost the entire conceptual arsenal that we use today to describe and study football consists of on-the-ball event types, that is to say it maps directly to raw data. We speak of “tackles” and “aerial duels” and “big chances” without pausing to consider whether they are the appropriate unit of analysis. I believe that they are not. That is not to say that the events are not real; but they are merely side effects of a complex and fluid process that is football, and in isolation carry little information about its true nature. To focus on them then is to watch the train passing by looking at the sparks it sets off on the rails.

Armed with only ‘outcome data’ rather than comprehensive ‘inputs data’ analyst most models will be best served with a logistic regression. Logistic regression often bests complex models, often generalizing better than machine learning procedures. However, it will become important to lean on machine learning models as reliable ‘input’ data becomes available in order to capture the deep networks of effects that lead to goal creation and prevention. Right now we only capture snapshots, thus logistic regression should perform fine in most cases.

[2] Most people readily acknowledge some share of results in hockey are luck. Is the number closer to 60% (given the repeatable skill in my model is about 40%), or can it be reduced to 0% because my model is quite weak? The current model can be improved with more diligent feature generation and adding key features like pre-shot puck movement and some sort of traffic metric. This is interesting because traditionally logistic regression models see diminishing marginal returns from adding more variables, so while I am missing 2 big factors in predicting goals, the intra-seasonal correlation might only go from 40% to 50%. However, deep learning networks that can capture deeper interactions between variables might see an overweight benefit from these additional ‘input’ variables (possibly capturing deeper networks of effects), pushing the correlation and skill capture much higher. I have not attempted to predict goals using deep learning methods to date.

Goaltending and Hockey Analytics – Linked by a Paradox?

There may be an interesting paradox developing within hockey. The working theory is that as advanced analysis and data-driven decision-making continue to gain traction within professional team operations and management, the effect of what can be measured as repeatable skill may be shrinking. The Paradox of Skill suggests as absolute skill levels rise, results become more dependent on luck than skill. As team analysts continue (begin) to optimize player deployment, development, and management there should theoretically be fewer inefficiencies and asymmetries within the market. In a hypothetical league of more equitable talent distribution, near perfect information and use of optimal strategies, team results would be driven more by luck than superior management.

Goaltenders Raising the Bar

Certainly forecasting anything, let alone still-evolving hockey analytics, is often a fool’s errand – so why discuss? Well, I believe that the paradox of skill has already manifested itself in hockey and actually provides a loose framework of how advanced analysis will become integrated into the professional game. Consider the rise of modern goaltending.

Absolute NHL goaltender ability has continually increased for the last 30 years. However, differential ability between goaltenders has tightened. It has become increasingly difficult to distinguish long-term, sustainable goaltender ability while variations in results are increasingly owed to random chance. Goalies appear ‘voodoo’ when attempting to measure results (read: ability + luck) using the data currently available – much like the paradox of skill would predict.[1] More advanced ways of measuring goaltending performance will be developed (say, controlling for traffic and angular velocity prior to release), but that will just further isolate and highlight the effect of luck.[2]

Spot the Trend Data courtesy of hockey-reference.com

Spot the Trend
Data courtesy of hockey-reference.com

Will well-managed teams create a similar paradox amongst competing professional teams in the future? Maybe. Consider such a team would maximize the expected value talent acquired, employ optimal on-ice strategies, and employ tactics to improve player development. Successful strategies could be reverse engineered and replicated, cascading throughout the league – in theory. Professional sports leagues are ‘copycat’ leagues and there is too much at stake not to adopt a superior strategy, despite a perceived coolness to new and challenging ideas.

Dominant Strategies“I don’t care what you do, just stop the puck”

How did goaltending evolve to dominate the game of hockey? And what parallel pathways need to exist in hockey analytics to do the same?

  1. Advances in technology – equipment became lighter and more protective.[3] This allowed goaltenders to move better, develop superior blocking tactics (standing up vs butterfly), cover more net, and less worry of catching a painful shot. The growth of hockey analytics has been dependent on web scraping, automation, and increasing processing power and will soon come to rely on data derived from motion-tracking cameras. Barriers to entry and cost of resources are negligible lending all fanalysts the opportunity to contribute to the game.
  2. Contributions from independent practitioners – The ubiquitous goaltending coach position is a relatively new one compared to most professional leagues. In the early 2000s, I was lucky enough to cross paths with innovative goaltending instructors who distributed new tactics, strategies, and training methods available to young goaltenders. Between their travel, camps, and clinics (and later their own development centers) they diffused innovative approaches to the position, setting the bar higher and higher for students. A few of these coaches went on become NHL goalie coaches – effectively capturing a position that didn’t exist 30 years prior. Now the existence of goalie coach cascade down to all levels of competitive hockey.[4]  Similarly, the most powerful contributions to the hockey analytics movement have been by bright individuals exposing their ideas and studies to the judicious public. The best ideas were built upon and the rest (generally) discarded. Will hockey analytics evolve (read: become accepted widely among executives) faster than goaltending? I don’t know – a goaltending career takes well over a decade to mature, but they play many games providing feedback on new strategies rather quickly.[5] Comparatively, ideas develop quicker but might take longer to demonstrate their value – not only are humans hard-wired to reject new ideas there are fewer managerial opportunities to prove a heavy data-driven approach to be a dominant strategy.
  3. Existence of a naïve acceptance – The art (and science) of goaltending is not especially well understood among many coaches, particularly with relative skill levels converging. However, managers and coaches do understand results. Early in my career, I had a coach who was only comfortable with stand-up goaltenders, his own formative experiences occurring when goaltender predominately remained erect (in order to keep their poorly padded torso and head from constant danger). However, he saw a dominant strategy (more net coverage) and placed faith in my ability without a comprehensive understanding or comfort of modern goaltending. Analytics will have to be accepted the same way – gradual but built on demonstrated effectiveness. Not everyone is comfortable with statistics and probabilities, but like goaltenders, the job of analysts is to produce results. That means rigorous and actionable work that offers a superior strategy to the status quo. This will earn the buy-in from owners and senior management who understand that they can’t be at a competitive disadvantage.

Forecasting Futility

Clearly the arc of the analytics evolution will differ from the goaltender evolution, primary reasons being:

  • Any sweeping categorization of two-decade-plus ‘movement’ is prone to simplification and revisionist history.
  • While goaltending as a whole has improved substantially, incremental differences in ability still obviously exist between goaltenders. In the same way, not all analysts or teams of analysts will be created equal. A non-zero advantage in managerial ability may compound over time. However, the signal will likely be less significant than variation in luck over that extended timeframe. In both disciplines, that rising ability may give way to a paradox of not being able to decipher their respective skills, muddying the waters around results.
  • Goaltending results occur immediately and visibly. Fair or not, an outlier goaltender can be judged after a quarter of a season, managerial results will take longer to come to fruition. Not only that, we only observe the one of many alternative histories for the manager, while we get to observe thousands of shots against a goaltender. Managerial decisions will almost always operation under a fog of uncertainty.

Alternatively, it important to consider the distribution of athlete talent against those of those in the knowledge economy. Goaltenders are bound by normally distributed deviations of size, speed, and strength. Those limitations don’t exist for engineers and analysts, but they do operate in a more complex system, leaving most decisions to be subjected to randomness. This luck is compounded by the negative feedback loops of the draft and salary cap, it is unlikely a masterfully designed team would permanently dominate, but it suggests some teams will hold an analytical advantage and the league won’t turn into some efficient-market-hypothesis-all-teams-50%-corsi-50%-goals-coin-flip game. But if a superstar analyst team could consistently and handily beat a market of 29 other very good analyst teams in a complex system, they should probably take their skills to another more profitable or impactful industry.



Other Paradoxes of Analytics

Because these are confusing times we live in, I’d be remiss if I didn’t mention two other paradoxes of hockey analytics.

    • Thorough, rigorous work is often difficult to understand and not easily understood by senior decision-makers. This is a problem in many data-intensive industries – analytical tools outpace the general understanding of how they work. It seems that (much like the goaltending framework available to us) once data-driven strategies are employed and succeed, all teams will be forced to buy-in and trust that they have hired competent analysts that can deliver actionable insights from a complex question. Hopefully.

  • With more and more teams buying into analytics, the some of the best work is taken private. The best work is taken in-house seemingly overnight, sometimes burying a lot of foundational work and data. That said, these issues are widely understood and there is a noble and concerted effort to maintain transparency and openness. We can only hope that these efforts are appreciated, supported, and replicated.


Final Thoughts

The best hockey analysis has borrowed empiricism and data-driven decision-making from the scientific method, creating an expectation that as hockey analytics gain influence at the highest levels, we (collectively) will know more about the game.[7] However, assuming the best hockey analysts end up influencing team behavior, it is possible much of the variation between NHL teams[8] will be random chance – making future predictive discoveries less likely and weakening the relationship of current discoveries.

Additionally, when it feels like the analytical approach to hockey is receiving unjustified push back or skepticism, it is important to remember that the goaltender evolution, initiated by fortuitous circumstance, eventually forced buy-ins from traditionalists by offering a superior approach and results. However, increasing absolute skill in a field can have unintended consequences – relative differences in skill will decrease, possibly causing results to become more dependent on luck than skill. Something to consider next time you try to make sense of the goaltender position.


[1] This is not to say all goalies in 2016 are of equal skill levels, but they are absolutely more talented than their ancestors and fall within a smaller range of abilities. That said, outside of a top 2 or 3 guys, the top 5-10 list of goalies is a game of musical chairs, quarter to quarter, season to season.

[2] Goaltenders don’t get a chance to ‘drive the play,’ so it is very important to control for external factors. This can’t be done comprehensively with current data. Even with complete data, it may be futile.

[3] And cooler, possibly attracting better athletes to the position, your author notwithstanding.

[4] Another feature of the paradox of rising skill levels: to fail to improve is the same as getting worse. Hence, employing a goalie coach is necessary in order to prevent a loss of competitiveness. The result: plenty of goalie coaches of varying ability, but likely without a strong effect on their goaltender’s performance. This likely causes some skepticism toward their necessity. This is probably a result of their own success, they are indirectly represented by an individual whose immediate results might owe more to luck than incremental skill aided by the goalie coach.

[5] For example, a strategy devised at 6 years old of lying across the goal line forcing other 6 year-olds to lift the puck proved to be inferior and was consequently dropped from my repertoire.

[7] Maybe even understanding the link between shot attempts and goals (you can read this sarcastically if you like).

[8] And other leagues that are able to track and provide accurate and useful data.