How do you evaluate a team in Overwatch? It’s a problem that fans and experts alike have tackled throughout the inaugural season of the Overwatch League. Do you judge a team by their potential, only to be disappointed when they fail to reach it? Or do you evaluate a team based on previous performances, good and bad? It’s a question I will try to answer by introducing an Overwatch League Elo rating system (OLE).

The Elo rating system was originally created by Arpad Elo to measure the strength of chess players over time. It measures the relative strength of opponents in a way that takes their current level and prior results into account. Elo systems are zero-sum: the average and total amount of Elo in the system never change. This means that no matter the rating gap between two competing teams, the amount of Elo the winner gains is equal to the amount of Elo the loser surrenders.

A basic Elo system is simple math: all it requires is a starting point* and a scaling factor, K. K is a multiplier that determines how much Elo you gain or lose after a given match. The larger the K, the faster Elo can be gained and lost, and the more volatile the system becomes. Sometimes you want a volatile system; sometimes you want less volatility. For example, some chess systems use a tiered approach that applies different Ks to different skill tiers. This helps new players reach their actual skill rating faster, while grandmaster-level players with established competitive histories aren’t as susceptible to big swings.

*For OLE, all teams start at 1,000 Elo.
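To make that concrete, here’s a minimal sketch of a basic Elo update in Python. It assumes the standard chess-style expected-result curve (a logistic function on a 400-point scale), so treat the numbers as an illustration of the mechanics rather than the exact OLE implementation.

```python
def expected_result(rating, opp_rating):
    """Standard chess-style win expectancy on a 400-point scale
    (an illustrative assumption, not something OLE itself dictates)."""
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

def elo_delta(rating, opp_rating, actual, k):
    """Rating change for one result (actual = 1 for a win, 0 for a loss)."""
    return k * (actual - expected_result(rating, opp_rating))

# Two evenly matched teams at the OLE starting point of 1,000, with K = 10
delta_winner = elo_delta(1000, 1000, actual=1, k=10)  # +5.0
delta_loser = elo_delta(1000, 1000, actual=0, k=10)   # -5.0

# Zero-sum: the winner's gain exactly offsets the loser's loss
assert abs(delta_winner + delta_loser) < 1e-9
```

A larger K simply scales both deltas up, which is exactly the volatility trade-off described above.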

This is just one example of how Elo systems are very customizable. FiveThirtyEight, the sports and elections statistics site, has custom Elo systems for the National Football League and National Basketball Association that attempt to improve simple Elo models to more accurately portray team strength. These are the Elo systems that I’ve based OLE on.

I’ve already spoken about one parameter for Elo systems—K—but let’s consider a few other customizations.

Stage Carry-Through

The Overwatch League has stages and, more importantly, each stage is played on an entirely different patch. The NBA and NFL have to deal with player turnover and trades both during and between seasons; this was the motivation for FiveThirtyEight to make seasonal corrections. Overwatch not only has similar trades, acquisitions, and releases, but underneath it all, over time, the game evolves. Thus, OLE must account for stage changes.

For my model, I found that FiveThirtyEight’s solution was a perfect fit. Whenever the season changes, they carry over only a fraction of each team’s rating; for their NBA model, it’s 75% of the previous season’s Elo. For this exercise, I’m going to leave this fraction as an unknown variable called the “Stage Carry-Through Ratio,” or CT.
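As a quick worked example using OLE’s own 1,000-point baseline (the full formula appears a little later): with CT = 0.75, a team that finished a stage at 1,200 would start the next one at 0.75 × 1,200 + 0.25 × 1,000 = 1,150, keeping most of its earned rating while being pulled part of the way back toward the pack. With CT = 1, nothing resets at all; with CT = 0, every stage starts from scratch.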

What Granularity?

When calculating new OLE ratings, I had to decide how often to update them. The NBA, NFL, and chess have games, while the Overwatch League has matches, which are made up of maps. Should I update a team’s OLE only after a match has completed, or after every map? I took the following into consideration:

  1. There were not enough matches in the season for match-level updates alone to produce a meaningful OLE.
  2. Not every match is created equal—a 4-0 is not the same as a 3-2.
  3. Even if I chose maps, the zero-sum nature of Elo makes it impossible for the loser of a 3-2 match to come away with more Elo than the winner.

Therefore, each team’s OLE is recalculated after every map played.
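As a rough illustration of why this matters (ignoring MoV and the small rating drift between maps): with two equally rated teams and K = 10, each map is worth about ±5 Elo, so a 4-0 sweep nets the winner roughly +20 while a 3-2 nail-biter nets only about +5. A single match-level update would have treated both results identically.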

Margin of Victory (MoV)

Not every match is created equal, nor is every map win. FiveThirtyEight’s Elo systems both account for margin of victory as a multiplier to K, effectively creating a dynamic K that reflects how big of an upset (or stomp) a given win was. FiveThirtyEight’s MoV calculation seems to be a secret sauce: it is defined, but poorly explained. What I can say for sure is that they created a logarithmic function that rewards teams less and less the more they run up the score, though a blowout win is still worth more MoV than a close one.

This is where things get tricky for Overwatch. Basketball and football always have set, understandable MoVs because, short of overtime situations, every game lasts the same amount of time. During the inaugural season of the Overwatch League, maps lasted anywhere from a minimum of six minutes, 22 seconds, to more than 37 minutes. Also, depending on which team attacked first, scores can be unreliable: one could argue that a defend-first 1-0 win is just as impressive as an attack-first 3-0 win, if not more so. Therefore, I couldn’t use capture-point differential as my MoV, and I needed an MoV statistic that accurately measured how big of a stomp a win was, regardless of how long the map lasted.

I didn’t succeed, exactly. I was able to determine a set of time-invariant statistics that predicted the winner in 94% of maps played in the Overwatch League. There were still some outliers where the eventual loser outperformed the winner on paper. Since Overwatch is a game of objectives, I reasoned that this is bound to happen from time to time.

To address this, any time the winner’s MoV was lower than the loser’s, I flipped the values: if you lost on paper but won the map, you deserve the spoils of your opponent. I then fit my “secret sauce” of MoV stats to a logarithmic function that multiplies close matches by 0.75, “blowout” wins by approximately 1.5, and absolutely ridiculous stomps (quite rare) by approximately 3. Around 79% of matches ended up with an MoV multiplier of 1, plus or minus 0.25.
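To give a feel for the shape of that curve, here’s a toy version in Python: a logarithmic multiplier over a hypothetical normalized margin, with constants chosen only to hit the rough anchor points above. The real fit uses different (undisclosed) margin statistics and coefficients.

```python
import math

def mov_multiplier(margin):
    """Toy logarithmic MoV multiplier over a hypothetical normalized margin.

    margin = 0 represents a dead-even map; larger values represent bigger stomps.
    The constants are illustrative only; the real OLE fit uses a different
    (undisclosed) margin statistic and different coefficients.
    """
    return 0.75 + 0.75 * math.log2(1 + margin)

print(mov_multiplier(0))   # 0.75 -> close map
print(mov_multiplier(1))   # 1.5  -> blowout
print(mov_multiplier(7))   # 3.0  -> rare, absolutely ridiculous stomp
```

Because the curve is logarithmic, running up the score yields diminishing returns, which is exactly the behavior described above.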

The Fun Part: Picking Parameter Values

My final OLE formula ended up being:

New Elo = Old Elo + MoV * K * (Actual Result – Expected Result)

Unless there was a stage change, in which case the new stage’s starting OLE was calculated as:

New Elo = Old Elo * CT + 1000 * (1-CT)
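Putting the pieces together, here’s a compact Python sketch of both formulas, reusing the same illustrative expected-result curve as the earlier snippet; it’s a sketch of the math above, not the production OLE code.

```python
def expected_result(rating, opp_rating):
    # Standard chess-style win expectancy on a 400-point scale (illustrative).
    return 1 / (1 + 10 ** ((opp_rating - rating) / 400))

def update_after_map(rating, opp_rating, won, k, mov):
    # New Elo = Old Elo + MoV * K * (Actual Result - Expected Result)
    actual = 1.0 if won else 0.0
    return rating + mov * k * (actual - expected_result(rating, opp_rating))

def stage_reset(rating, ct, baseline=1000):
    # New Elo = Old Elo * CT + 1000 * (1 - CT)
    return rating * ct + baseline * (1 - ct)

# Example: a 1,050-rated team upsets a 1,100-rated team on one map,
# with K = 10 and an MoV multiplier of 1.5 for a decisive win.
new_winner = update_after_map(1050, 1100, won=True, k=10, mov=1.5)   # ~1058.6
new_loser = update_after_map(1100, 1050, won=False, k=10, mov=1.5)   # ~1091.4

# Example stage rollover with CT = 0.5: a 1,200 team starts the next stage at 1,100.
print(stage_reset(1200, ct=0.5))
```

In effect, MoV just makes K dynamic on a per-map basis, as described in the previous section.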

But how do I know which K or CT to use? Until now, I had left them open-ended. Rather than choose values at random, I tried them all! Specifically, I focused on a K range from 5 to 24, but I tried K values from 25 to 50 as well. Below, you can see how different K and CT values changed the progression of OLE over the course of the Overwatch League inaugural season.
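Mechanically, the sweep is just a nested loop: replay the whole season once per (K, CT) pair and keep the resulting OLE histories and win expectancies. A simplified sketch, where replay_season is a placeholder for the real map-by-map replay and the CT grid is shown only for illustration:

```python
from itertools import product

def replay_season(k, ct):
    """Placeholder for the real replay: walk the season map by map in order,
    applying update_after_map and stage_reset from above, and return each
    team's OLE history plus the pre-map win expectancies."""
    return {"histories": {}, "win_expectancies": []}

K_VALUES = range(5, 51)                                # the 5-24 focus range, plus 25-50
CT_VALUES = [round(0.1 * i, 1) for i in range(1, 11)]  # 0.1 through 1.0 (illustrative grid)

# One full season replay per (K, CT) combination
sweep = {(k, ct): replay_season(k, ct) for k, ct in product(K_VALUES, CT_VALUES)}
```

Only K and CT change between runs; the underlying season data stays the same.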

First, a “middle-of-the-pack” combination of K = 10 and CT = 0.5:

[Figure: Overwatch League Elo progression by match count, K = 10, CT = 0.5]

The graph plots OLE progression by match count throughout the season, with each OLE calculated at the map level. We can see general seasonal trends that match the eye test. For example, until Stage 3 the league was dominated by NYXL and London. Stage 3 was NYXL’s peak, but they dropped off quickly and even lost their No. 1 status to the Valiant. The Dragons, Mayhem, and Fuel dropped quickly in each stage, though the Fuel managed to climb back and end Stage 4 near the middle of the standings, still weighed down by their earlier performances.

Now compare this to K = 24 and CT = 1 (a more volatile K and no stage correction at all):

[Figure: Overwatch League Elo progression by match count, K = 24, CT = 1]

A higher K means more volatility, which is best reflected in the rise and fall of the Fuel and NYXL, respectively, in Stage 4. Also, teams like Shanghai found themselves further and further below their upper-tier peers, which had a noticeable effect on their win expectancy.

It’s fun to play around with cool graphs, but how can I tell how accurate my OLE model is? To evaluate the different parameters, I turned to Brier scores, which grade the accuracy of probabilistic predictions. Since Elo systems inherently calculate win expectancy, I already had probabilities to grade. With Brier scores, lower is better.
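Concretely, a map’s Brier score is the squared gap between the pre-map win expectancy and what actually happened (1 for a win, 0 for a loss), averaged over every prediction being graded. A quick sketch with made-up numbers:

```python
def brier_score(win_expectancies, outcomes):
    """Mean squared error between predicted win probabilities and actual
    results (1 = win, 0 = loss). Lower is better."""
    assert len(win_expectancies) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(win_expectancies, outcomes)) / len(outcomes)

# Made-up example: three maps where the favorite was given 70%, 55%, and 90%
predictions = [0.70, 0.55, 0.90]
results = [1, 0, 1]   # the 55% favorite got upset on the second map
print(brier_score(predictions, results))  # ~0.134
```

Averaging this over every map prediction for each (K, CT) pair and keeping the combination with the lowest score is all the parameter selection boils down to.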

I plotted the average Brier score for each K/CT combination:

[Figure: Average Brier scores for every K/CT combination]

The best Brier score came from the minimum K (5) and CT (0.1), in the upper-left corner of the chart above. Additionally, here is what each team’s average Brier score looked like for K = 24 and CT = 1 (more volatile, with no stage correction) and for the minimum-Brier combination:

[Figure: Team Brier scores, K = 24, CT = 1]

[Figure: Team Brier scores, minimum-Brier parameters]

Judging by Brier scores, the lower the K, the more accurate the predictions seemed to be. Comparing these two graphs, we can see the difference came primarily from New York and the upper two-thirds of the league. A high K quickly and accurately sent the Dragons, Mayhem, and Fuel into the basement, but the occasions when those teams took maps off teams like NYXL hurt the league’s average Brier score; the lower-K models, with their more conservative win expectancies, paid a smaller penalty for those upsets. Let’s now look at OLE progression through the season with our “ideal” parameters:

[Figure: Overwatch League Elo progression by match count, “ideal” parameters]

The same peaks and valleys appear, but in each stage the teams started off much closer together, gaining and losing OLE a bit more slowly. Because the ideal K and CT are so low, the Overwatch League must have been incredibly unpredictable at the beginning of each stage, even though by the end of each stage the teams’ relative rankings were fairly set. This is good: it matches the eye test! This unpredictability has led me to my final conclusion:

No one could have predicted London winning the inaugural Overwatch League championship

Heading into the playoffs, if I had used my OLE model to choose the champion, I would have picked a Valiant vs. NYXL final, with the Valiant winning. Instead, Stage 4’s sixth-strongest team (the Fusion) faced off against the fourth-weakest (the Spitfire), and the lower-OLE team won it all. Both teams had reached highs in the first half of the season, so we knew how good they could be, but they languished as the season went on, beset by injuries, a lack of direction, or tilt.

In the end, the only people who knew how good London and Philadelphia were on the new patch were… London and Philadelphia.

No matter how many bells and whistles I add to the OLE system, it cannot reveal the reason behind a lack of performance across multiple stages, only that there was a drop in performance in the first place. It also cannot predict which teams are going to suddenly slump after three stages of dominance, only how much they slumped, and when the slump began. Systems like this cannot grade what a team is capable of, because there is no way to correct for “team potential.”

However, Elo systems do a great job of applying numbers to the history of leagues. The inaugural season had its ups and downs, from the peak of NYXL dominance to the improbable comeback by the Spitfire. Some teams peaked early, some in the middle, and some late. It was this unpredictability that made each stage unique, compelling, and memorable. Now, we have the numbers to prove it.

Ben "CaptainPlanet" Trautman is the statistics producer for the Overwatch League Global Broadcast. Follow him on Twitter!