Last season, in my never-ending quest to quantify stuff in Overwatch, I created an Elo system to track the strength of our Overwatch League teams over time. Unfortunately, my implementation, while it made for pretty graphs, was flawed because I incorrectly graded its accuracy. Rather than dwell on the past, though, let’s look to the future with my newly resurrected and improved system, Overwatch League Elo 2.0. In OLE 2.0, I’ve fixed all the bugs of the past and created a system that is a Frankenstein of FiveThirtyEight’s NBA Elo and NFL Elo systems, with specific tweaks to fit the Overwatch League ecosystem.
Elo models are zero-sum: when two teams play each other, the winner gains the same amount of Elo that the loser drops. The amount of Elo exchanged between the two teams is governed in all Elo systems by a volatility constant K and the winner’s expected win chance, which is based on the difference between the two teams’ Elo ratings:
Elo exchanged = K(1 – expected win chance)
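To make the exchange concrete, here is a minimal Python sketch. It assumes the standard logistic Elo curve with a 400-point scale for the expected win chance (this article does not state the curve OLE 2.0 actually uses), and it plugs in K = 47, the value chosen later in this article. The function names are made up for illustration.

```python
def expected_win_chance(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Chance that team A beats team B (assumed: standard logistic Elo curve)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


def elo_exchanged(winner_elo: float, loser_elo: float, k: float = 47.0) -> float:
    """Elo exchanged = K * (1 - winner's expected win chance)."""
    return k * (1.0 - expected_win_chance(winner_elo, loser_elo))


def update(winner_elo: float, loser_elo: float, k: float = 47.0) -> tuple[float, float]:
    """Zero-sum update: the winner gains exactly what the loser drops."""
    delta = elo_exchanged(winner_elo, loser_elo, k)
    return winner_elo + delta, loser_elo - delta


# Two evenly matched teams: each had a 50% chance, so half of K changes hands.
print(update(1000.0, 1000.0))  # (1023.5, 976.5)
```

Under this curve an upset moves more Elo than an expected result, because the winner's expected win chance was lower going in.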
FiveThirtyEight’s Elo systems take a couple of steps beyond this simple calculation to improve their accuracy. The first is their Year-to-Year Carry-Over (CT), where some, but not all, of a team’s Elo rating from the previous season is carried into the next one. In traditional sports, this helps adjust the system’s interpretation of team strength after an offseason full of trades, player drafts, and other shakeups.
In OLE 2.0, I tested many combinations of K and CT and chose the pair that gave the best combination of Brier score (where lower is better), map prediction accuracy, and match prediction accuracy. The K value that I settled on was 47, and the CT value was 60%. This 60% carry-over is applied upon every new patch, rather than every season, because in the Overwatch League we have to care not only about team roster changes but also about changes to the way the game itself functions. As an example, the Vancouver Titans ended Stage 1 with 1183.6 Elo, meaning they began Stage 2 with a (1183.6 * 0.6) + (1001 * 0.4) = 1110.6 rating.
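The carry-over arithmetic is simple enough to write as a one-line function. This is only a sketch of the calculation shown above; the function and parameter names are made up for illustration.

```python
def carry_over(elo: float, ct: float = 0.60, target: float = 1001.0) -> float:
    """Keep `ct` of the old rating and pull the rest toward `target`."""
    return elo * ct + target * (1.0 - ct)


# The Vancouver Titans example from the text.
print(round(carry_over(1183.6), 1))  # 1110.6
```

A team sitting exactly at 1001 is left unchanged, while every other team moves 40% of the way back toward that value.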
FiveThirtyEight also accounts for new teams entering the pool by assigning them a lower-than-average Elo. The Overwatch League just grew by eight expansion teams, so I felt it would be wise to do so too. In OLE 2.0, new teams start with a rating of only 990 Elo (the average is 1000) to represent the relative uncertainty of their strength coming into the league. This drives down the system average, but I’ve also adapted FiveThirtyEight’s method for maintaining the system average over time: pulling each team’s Elo toward a value slightly above average (1001, seen above in the CT calculation) upon each new patch.
The final element of FiveThirtyEight’s systems that I’ve appropriated and tweaked for OLE 2.0 is a margin of victory multiplier (MoV). The goal of MoV is to reward winning teams for stomping their opponents in spectacular fashion, since a 1-0 full hold on a map is usually more indicative of team disparity than a close 5-4 battle.
OLE 2.0 uses a two-tiered MoV multiplier based on teamfight win differential over time and team death differential over time. Not every team that wins a map ends up winning in teamfight differential (draws are fairly common), so team death differential is used when teamfight differential fails.
MoV must be set up in a way that reduces autocorrelation, which occurs when teams that are expected to win do so by large score differentials. Teams that are expected to win are expected to win for a reason, so an untuned MoV tends to inflate winning teams’ Elo ratings over time. FiveThirtyEight’s systems address autocorrelation in their MoV calculation by granting a larger multiplier to an underdog’s win than to a favorite’s win at the same score differential. The score differential (SD) for maps in my system ranges roughly from 0.3 to 5. These margins of victory are fed into the following formula (adapted from FiveThirtyEight’s NFL MoV calculation):
log(1 + SD) * 1/(elo difference * 0.001 + 1)
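Here is a short Python sketch of that formula. Two readings are assumptions on my part: that log is the natural logarithm, and that the Elo difference is the winner's rating minus the loser's. Together they reproduce the multipliers quoted in this article for a 2.5 score differential. The function name is made up for illustration.

```python
import math


def mov_multiplier(score_diff: float, winner_elo: float, loser_elo: float) -> float:
    """log(1 + SD) * 1 / (elo difference * 0.001 + 1)."""
    elo_diff = winner_elo - loser_elo  # assumed sign: negative when an underdog wins
    return math.log(1.0 + score_diff) / (elo_diff * 0.001 + 1.0)


# Same 2.5 score differential, won by a 100 Elo underdog and then by a 100 Elo favorite.
print(round(mov_multiplier(2.5, 1000.0, 1100.0), 2))  # 1.39
print(round(mov_multiplier(2.5, 1100.0, 1000.0), 2))  # 1.14
```

The denominator shrinks when the winner was rated lower, which is what gives an underdog the larger multiplier for the same margin.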
In OLE 2.0, a 2.5 score differential will grant a MoV of around 1.4 to a winning team rated 100 Elo below its opponent, or around 1.15 to a winning team rated 100 Elo above its opponent. Now that all the bells and whistles of OLE 2.0 have been defined, let’s see how this year’s teams have progressed thus far:
Stage 1 presented a landscape of opportunity: eight new teams joined the league at 990 Elo each, positioned perfectly to steal Elo from existing high-rated teams. However, only one expansion team ended significantly above its starting position: Vancouver. The Titans’ destructive path to a Stage 1 Finals victory is well known, and it could easily have been derailed had they faced the NYXL in the playoffs, but the Seoul Dynasty got in the way. Notice the NYXL’s precipitous loss of Elo, and Seoul’s meteoric rise:
This is the visual representation of the Excelsior’s playoff loss to the Dynasty. Coming into that match, casual fans and experts alike had treated the Dynasty’s early exit as a sure thing. So many predictors got it wrong, but just how wrong?
Completing OLE 2.0 has opened a new analytical door: predictions. If I have two Elo ratings, a database full of match outcomes, and a system for updating those Elo ratings, I can use the past to predict the future. To do so, I used a Monte Carlo simulation.*
* The origin of the Monte Carlo method is extremely cool, and involves the original ENIAC computer and the Manhattan Project. If you have extra time, I suggest checking it out.
Monte Carlo simulations are a way of numerically estimating the probable outcomes of a system given randomized inputs. Pretend that you have a pair of six-sided dice and you want to know the likelihood of the pair adding up to 7 when you roll them. You could either do a bunch of math to find the solution analytically, or you could roll those two dice 10,000 times and see how many times out of 10,000 the sum came to 7. Monte Carlo sims are the latter.
My Monte Carlo uses the OLE 2.0 framework to roll virtual dice based on Elo differences, map draw rates, and margins of victory:
1. Begin with two starting Elo ratings from the OLE 2.0 system.
2. Play through a hypothetical map: randomly determine the map winner based on Elo difference, randomly sample from known margins of victory, and include potential draws if the map type supports draws.
3. Update both teams’ Elo according to the map outcome in the OLE 2.0 system.
4. Repeat steps 2 and 3 until a match winner is determined (4 maps, or 5 if the match is tied after 4).
5. Output the match score.
6. Repeat steps 1 through 5 10,000 times.
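The steps above can be sketched roughly as follows. This is not the actual OLE 2.0 code, and several pieces are placeholders I have assumed for illustration: the 400-point logistic win curve, the flat draw rate, which maps allow draws, the pool of score differentials, multiplying K by the MoV, and leaving Elo unchanged on a drawn map. Only K = 47 and the MoV formula come from this article.

```python
import math
import random
from collections import Counter

K = 47.0                                    # from the article
DRAW_RATE = 0.05                            # hypothetical flat draw rate
DRAW_ELIGIBLE = (False, True, True, True)   # hypothetical: which of the 4 maps can draw
SCORE_DIFFS = (0.3, 1.0, 2.5, 5.0)          # hypothetical pool of "known" margins


def win_chance(elo_a: float, elo_b: float) -> float:
    """Assumed standard logistic Elo curve with a 400-point scale."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def play_map(elo_a, elo_b, rng, allow_draw):
    """Step 2 and 3: play one map and return (result, new_elo_a, new_elo_b)."""
    if allow_draw and rng.random() < DRAW_RATE:
        return "draw", elo_a, elo_b  # assumed: a draw moves no Elo
    a_wins = rng.random() < win_chance(elo_a, elo_b)
    win_elo, lose_elo = (elo_a, elo_b) if a_wins else (elo_b, elo_a)
    mov = math.log(1.0 + rng.choice(SCORE_DIFFS)) / ((win_elo - lose_elo) * 0.001 + 1.0)
    delta = K * mov * (1.0 - win_chance(win_elo, lose_elo))  # assumed: K scaled by MoV
    if a_wins:
        return "a", elo_a + delta, elo_b - delta
    return "b", elo_a - delta, elo_b + delta


def simulate_match(elo_a, elo_b, rng):
    """Step 4 and 5: four maps, plus a no-draw fifth map if tied; return the score."""
    wins_a = wins_b = 0
    for allow_draw in DRAW_ELIGIBLE:
        result, elo_a, elo_b = play_map(elo_a, elo_b, rng, allow_draw)
        wins_a += result == "a"
        wins_b += result == "b"
    if wins_a == wins_b:
        result, elo_a, elo_b = play_map(elo_a, elo_b, rng, allow_draw=False)
        wins_a += result == "a"
        wins_b += result == "b"
    return wins_a, wins_b


def simulate(elo_a, elo_b, iterations=10_000, seed=None):
    """Step 6: repeat from the same starting ratings and count each match score."""
    rng = random.Random(seed)
    return Counter(simulate_match(elo_a, elo_b, rng) for _ in range(iterations))


scores = simulate(1100.0, 1000.0)
print(scores.most_common(3))
```

Dividing each count by the number of iterations turns the tally into the kind of win and score percentages discussed in this article, though the placeholder inputs here will not reproduce its exact figures.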
After 10,000 iterations of this simulation, not only do I have a raw count of match wins by each team, I also have a raw count of unique match scores from 4-0 to 3-2. I tested the predictions of this method on last week’s matches, and they were eerily good: the expected winner won in 13 out of 16 matches, and a 50.5% favorite (Boston, against Hangzhou) lost a close 3-2. It even predicted the Valiant’s first win of the season over Atlanta! On the flip side, it whiffed on a 60% win prediction for Chengdu over Shanghai, and a 56% win prediction for Toronto over Philly.
The model’s true accuracy is likely lower than last week’s hot streak suggests. I further tested this method on all matches from Stage 4 last year, which turned out to be one of the most volatile stages for OLE 2.0: Brigitte was released, the NYXL began to fall in power, and the Dallas Fuel went from the basement to the Stage Playoffs, among other unpredictable results. Even with all of this volatility, the model correctly predicted the winner in 40 out of 60 matches, or about 67%.
Back to the question that spawned this effort in the first place: how big was Seoul's upset of New York in the Stage 1 Playoffs? Going into that match, Seoul had an Elo of 1015.2 and NYXL had 1183.6, the highest in the league at the time. Plugging those into the simulator (with a modified map progression for the playoff format) returns these results:
Not only did NYXL emerge as the winner in 81.86% of outcomes, they were predicted to clean-sweep the Dynasty in 40.73% of those. The Dynasty’s ultimate winning score, 3-1, only occurred in 6.82% of simulation outcomes.
NYXL’s playoff choke was incredibly unexpected, and that’s OK. If a match prediction algorithm correctly chose the outcome of every match, then what would be the point of even playing those matches?
Now that I have this new toy to play with, let’s look ahead to Week 3 and play a game: are you smarter than a simulation? Here are three matches from next week that are exciting for various reasons. How does the algorithm stack up against your own predictions?
Match 1: Vancouver Titans (1132.9) vs. Dallas Fuel (1054.5)
Reason: Dallas faces their first big test of Stage 2.
Match 2: Los Angeles Valiant (939.7) vs. Washington Justice (903.7)
Reason: Both teams are looking for their second match win this season.
Match 3: Hangzhou Spark (998.3) vs. Guangzhou Charge (882.3)
Reason: Guangzhou’s last chance to avoid setting a map-loss streak record.
Ben "CaptainPlanet" Trautman is the statistics producer for the Overwatch League global broadcast. Follow him on Twitter!