So far, rating models have not received the coverage they deserve on my blog. This is mostly because until now I have put more emphasis on technical strategies than on fundamental / handicapping ones, and building and following a rating model certainly falls in the second category. I have covered technical strategies more extensively, since I believe them to be more accessible to the average punter. Handicapping is a hard task mastered only by a few. It requires resources and dedication most people just don’t have.
That doesn’t change the fact that betting markets are being shaped by handicapping models. Traders and punters take positions on certain outcomes based on the probabilities they assign to those outcomes. One cannot understand betting markets without developing at least a basic understanding for handicapping models. And rating models have a huge role to play in this.
Rating? What is that?
What exactly is a rating? A rating is a numerical assessment of a team’s or a player’s relative strength as compared to the competition. Ratings can be a powerful tool for assessing the outcome probabilities in a game between two opponents.
How does one calculate ratings and translate them to probabilities? What are the best rating types for different kinds of sports? Today I will try to answer those questions by looking into ratings from several different angles. First I will briefly explain the basic maths behind ratings. Then I will look at some of the more famous types of ratings, their practical advantages and drawbacks. Finally, I will give my point of view on the practical usability of ratings in different sports.
How does a Rating Model work?
Now, before diving into the details, let’s recall: ratings are an assessment of a team’s or player’s relative strength. Ratings don’t make sense out of that context. You can only use them in conjunction with the ratings of one or more opponents.
The technical details
Ratings are dynamic numbers, meaning the rating of a certain team or player is constantly changing. Generally, after a team or player wins a game their rating goes up, and after they lose one it goes down. The number of points gained or lost depends primarily on the strength of the team or player they were playing against. Winning against a stronger team gives a lot of points and winning against a weaker one adds relatively few. On the other hand, losing against a much weaker opponent should cost you a lot of points, while losing to a stronger opponent will only cost you a few.
Here you can already see the beauty of ratings. Ratings are an ever-changing strength indicator. They are based on a complex network of participants in a given sport and deliver a relatively accurate strength-estimation based on an elegant rating-calculating formula.
Finally, you can translate the strength-estimations of the two opponents into winning probabilities (adding up to 1), compare them to the market and eventually find an edge to bet on.
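To make that last step concrete, here is a minimal sketch of how a model probability can be compared to a market price. The 55% probability and the odds of 1.95 are made-up numbers for illustration, and the function names are my own.

```python
def fair_odds(probability: float) -> float:
    """Decimal odds at which a bet on this outcome has zero expected value."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return 1.0 / probability


def expected_edge(probability: float, market_odds: float) -> float:
    """Expected profit per unit staked, assuming the model probability is right."""
    return probability * market_odds - 1.0


# Hypothetical example: the ratings give a player a 55% chance,
# while the bookmaker offers decimal odds of 1.95 on that player.
model_probability = 0.55
offered_odds = 1.95

print(round(fair_odds(model_probability), 3))
print(round(expected_edge(model_probability, offered_odds), 4))
```

A positive edge only means something if the probability behind it is trustworthy, which is what the rest of this article is about.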
Make your assumptions
Even the simplest rating model, that only takes into account the outcome of the game (win/lose) and the difference in strength between the teams, already needs some arbitrarily chosen input factors to work.
A main question that needs to be answered is how you redistribute the rating points after each game. How many games in the past would you consider relevant for your rating? Do you take the last 5, 10 or 50 games? Do you weigh them the same? After all, the last game supposedly says more about a team’s strength than a game from a year ago. Do you only care about who won, or also about the winning margin? Throw in the draw as a possible outcome and that complicates the probability calculations further.
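One possible answer to the weighting question is exponential decay: every game counts, but its weight halves after a fixed number of days. The sketch below is only an illustration of that idea. The 90-day half-life is an arbitrary assumption, exactly the kind of input factor you would have to tune.

```python
def recency_weights(ages_in_days, half_life_days=90.0):
    """Weight of each game, halving every `half_life_days` (assumed value)."""
    return [0.5 ** (age / half_life_days) for age in ages_in_days]


def weighted_form(results, ages_in_days, half_life_days=90.0):
    """Recency-weighted average of results (1 = win, 0 = loss)."""
    weights = recency_weights(ages_in_days, half_life_days)
    return sum(w * r for w, r in zip(weights, results)) / sum(weights)


# Hypothetical record: a win today, losses 90 and 180 days ago.
results = [1, 0, 0]
ages = [0, 90, 180]

print(recency_weights(ages))
print(round(weighted_form(results, ages), 3))
```

With equal weights this record would be worth one win in three. With the decay, the recent win counts for more than half of the total weight.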
It is all those considerations (and more) you need to keep in mind when designing your rating model. Luckily, the decision making approach behind those different factors is pretty stable. You simply adjust your rating model to real-world data by picking the components and factors that result in the highest predictability.
Sounds complicated. Where to start?
Ratings in real life
How about using ratings someone else already calculated for you?
There are a number of ratings out there, many of them available on the web for free. Different ratings are made with different goals in mind. In sports, they are mostly used to match players of similar strength. A popular example of such a rating is the ATP Rating in tennis. By matching players with similar ATP ratings, the tennis association ensures more competitive games, which attract greater interest.
There is plenty of information on the internet on the calculation of the ATP ratings. However, they were obviously developed for a different purpose than sports betting. So the question arises: how good are they actually for calculating winning probabilities?
Can we use ratings to compile odds?
Luckily, someone has already answered that question for us. The predictive power of the ATP ratings turns out to be quite strong, even if not as strong as that of Pinnacle’s closing line. Of course, one shouldn’t expect the ATP ratings to outperform the market: they are publicly available information, so they are probably already factored into the price. But they still do a decent job of predicting match outcomes. Therefore, they might serve as a basis for, or a component of, a predictive model.
When designing a rating model for a certain sport, such ratings can be used as an input parameter. Although they are not perfect, they are good enough for the purpose. Such ratings are designed specifically for the sport at hand and are readily available. Even if you make some adjustments in the calculation of your ratings (as you should), you need some starting rating value for your teams or players. This way you can calibrate your rating model more accurately within a shorter sample.
Well-Known Rating Models
Now let us have a look at the most well-known rating systems and their usability in assessing teams’ and players’ strength. Of course, we start with the…
Elo is the mother of all ratings. It was developed for chess by the Hungarian-American physics professor Arpad Elo and is loved for its simplicity and elegance. It was the first (as far as I am aware) dynamic and self-correcting model working in a network of opponents, using only the game outcomes within the network as an input.
Since then Elo has become pretty much the standard for sports rating systems, even if it fits some sports better than others. As an example, clubelo.com publishes a current club football ranking calculated with Elo ratings.
Elo ratings use a K-factor to account for the fact that recent games should have a higher influence on one’s rating. The choice of K-factor is important, since it determines how you weigh past games relative to recent ones when updating a rating after a game. It has a direct impact on the rating. However, there are no strict rules for determining the K-factor, so its value is often chosen arbitrarily.
Converting Elo Ratings to winning probabilities
When two opponents face each other, their Elo ratings can be used to estimate the probability of each of them winning the clash. The one with the higher rating has higher chances to win. After the game, the winning player takes a certain number of rating points from his opponent and adds them to his own rating. The number of points depends on the K factor (higher K factor means more rating points transferred), and on how surprising the outcome was (if the favourite wins the transferred rating points are less than if the outsider wins).
Let us say player 1 has a rating of R1 and player 2 has a rating of R2. The winning probability of player 1 using Elo ratings is calculated as:
10^(R1/400) / (10^(R1/400)+10^(R2/400))
… and the one of player 2 as:
10^(R2/400) / (10^(R1/400)+10^(R2/400))
Finally, the losing player loses K * his chance of winning prior to the game. The winning player wins the same amount.
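The two formulas and the update rule above fit in a few lines of Python. This is a minimal sketch: the function names and the K-factor of 20 are my own choices for illustration, not a recommended setting.

```python
def win_probability(r1: float, r2: float) -> float:
    """Probability that player 1 beats player 2, using the formula above."""
    q1 = 10 ** (r1 / 400)
    q2 = 10 ** (r2 / 400)
    return q1 / (q1 + q2)


def update_ratings(winner: float, loser: float, k: float = 20.0) -> tuple[float, float]:
    """Move K * (loser's pre-game win probability) points from loser to winner."""
    transfer = k * win_probability(loser, winner)
    return winner + transfer, loser - transfer


favourite, outsider = 1600.0, 1400.0

print(round(win_probability(favourite, outsider), 3))  # favourite's chance
print(update_ratings(favourite, outsider))  # favourite wins: small transfer
print(update_ratings(outsider, favourite))  # outsider wins: large transfer
```

Note that the points are only moved around: the sum of the two ratings is the same before and after the game.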
What is missing?
In general, the simplicity of the Elo ratings means they do not account for many factors, including:
- The probability of a draw
- In the case of a team, the contribution of each team member to the team’s success
- The margin of victory
- The reliability of the measured rating
Now, those are strictly speaking not necessarily drawbacks of the Elo ratings. The ratings were not designed with the ability to account for draw probability or for individual contribution to a team in mind. Furthermore, generally speaking there is no margin of victory in chess – unless you count the number of moves a player needed to win, which I’m not sure is theoretically sound.
The primary purpose of Elo Ratings was to match chess players of similar strength, which it does pretty well. However, those are drawbacks in terms of the usability of Elo ratings for betting models.
Still, due to their simplicity, Elo ratings represent a perfect starting point for building your own model. You can employ them “out of the box” for individual sports with no draw outcome (like tennis). Matthew Trenhaile from the ‘Inside Betting’ podcast did an episode on ratings, where he talks about the admin of Tennis Abstract, who developed simple surface-adjusted tennis Elo ratings that seemed to do quite well in betting terms. If you’d like to listen to that bit of the episode, it is around the 01:23:00 mark.
The Glicko rating was invented by Mark Glickman. It tries to improve the Elo ratings in the domain of rating reliability. In practice, Glicko ranks chess players and introduces a second measurable element to the ratings, namely “ratings reliability”. Hereby the system factors in the accuracy of the measured rating.
An adjusted Glicko ratings system called Glicko-2 includes the rating volatility as a measurable factor.
A clear advantage of the Glicko ratings is that, unlike Elo, they measure the reliability of a rating. In general the rating of a player reflects their strength at the time they have finished their last game. You don’t know how close their strength today would be to that number. That is an effect that is more apparent at the beginning of a new season after a long break. Certainly, it is an important factor to consider when compiling a rating.
In a betting context, this is not a trivial detail. Zero rating reliability of both opponents would mean fair odds of 2.00 for each of them, regardless of their ratings. Obviously, ratings reliability should have a significant influence over the fair odds, so it must be included in the calculation.
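The effect is easy to see in the expected-outcome formula from Glickman’s description of the Glicko system, where each rating comes with a rating deviation (RD). The sketch below is my own transcription of that formula, so treat it as an illustration rather than a reference implementation. The ratings and RD values are made up.

```python
import math

Q = math.log(10) / 400  # scaling constant from the Glicko formulas


def g(rd: float) -> float:
    """Dampening factor: 1 for a perfectly known rating, towards 0 as RD grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q ** 2 * rd ** 2 / math.pi ** 2)


def glicko_win_probability(r1: float, rd1: float, r2: float, rd2: float) -> float:
    """Expected outcome for player 1, taking both rating deviations into account."""
    combined_rd = math.sqrt(rd1 ** 2 + rd2 ** 2)
    return 1.0 / (1.0 + 10 ** (-g(combined_rd) * (r1 - r2) / 400))


# Same 200-point gap, with increasingly unreliable ratings (hypothetical RDs).
for rd in (0.0, 30.0, 350.0, 1e6):
    p = glicko_win_probability(1600.0, rd, 1400.0, rd)
    print(rd, round(p, 3), round(1.0 / p, 2))
```

With an RD of zero the result is the plain Elo probability. As the RD grows, the probability moves towards 0.5 and the fair odds towards 2.00.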
On the other hand, introducing further elements to a rating model complicates the whole system quite a bit. It might be challenging to incorporate all those elements in a playable system. But if you find a way to do it sensibly, you would definitely improve your model.
TrueSkill is a relatively young member of the ratings family, but it is also a more sophisticated one. Microsoft initially developed TrueSkill for matching players in Xbox Live. The main purpose of TrueSkill is to measure ratings for team games. In doing so it addresses another aspect of Elo’s unsuitability for applications outside of chess. Moreover, much like Glicko, TrueSkill accounts for the uncertainty of the measured rating.
TrueSkill is based on Bayesian statistics. Bayesian statistics are a statistics domain that relies on a prior assumption about a distribution before analysing gathered observations.
How does TrueSkill work? Much like Glicko, it tracks two numbers per player: your expected skill μ and the standard deviation σ, which represents the uncertainty about μ. TrueSkill gives an initial skill rating and an initial uncertainty to each new player (the so-called ‘priors’). Afterwards Bayesian inference updates those figures as the player finishes more and more games. Since TrueSkill is essentially a matching algorithm (as most ratings are), it aims at maximising the draw probability between potential opponents.
Same as Glicko, TrueSkill factors in rating reliability, which as said above is an important improvement over the Elo ratings. Furthermore, unlike Elo and Glicko, which were developed for chess, TrueSkill is adapted to team games as well. Finally, TrueSkill does reliably measure the probability of a draw. After all, measuring the draw probability is a central function of the TrueSkill concept. With this, TrueSkill addresses three of the four problems of Elo listed above (or rather, problems of Elo’s adaptability to sports other than chess).
Still, much like Glicko and Elo, TrueSkill doesn’t care about the margin of victory. Furthermore, in case of team games, TrueSkill adds up the individual skills of the participating players to estimate a team’s skill, without accounting for dependencies within the team.
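To illustrate the “adding up” of individual skills, here is a rough sketch of how a team win probability is commonly derived from TrueSkill-style ratings. It is an approximation that ignores draws, not the full Bayesian update. The defaults (μ = 25, σ = 25/3, β = 25/6) are the ones usually quoted for TrueSkill, sigma is treated as a standard deviation, and the players are made up.

```python
import math

BETA = 25.0 / 6.0  # assumed per-player performance noise (usual TrueSkill default)


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def team_win_probability(team_a, team_b, beta: float = BETA) -> float:
    """Chance that team A outperforms team B; teams are lists of (mu, sigma)."""
    mean_diff = sum(mu for mu, _ in team_a) - sum(mu for mu, _ in team_b)
    variance = sum(s ** 2 for _, s in team_a) + sum(s ** 2 for _, s in team_b)
    variance += (len(team_a) + len(team_b)) * beta ** 2
    return normal_cdf(mean_diff / math.sqrt(variance))


# Hypothetical players: (mu, sigma)
strong_known = (30.0, 1.5)
weak_known = (22.0, 1.5)
strong_unknown = (30.0, 25.0 / 3.0)  # same mu, but barely any games played

print(round(team_win_probability([strong_known] * 2, [weak_known] * 2), 3))
print(round(team_win_probability([strong_unknown] * 2, [weak_known] * 2), 3))
```

The second team has the same expected skill as the first, but the larger σ pulls its win probability closer to 0.5. The simple sum also shows the limitation mentioned above: nothing in it captures how well the players work together.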
Even so, TrueSkill is one of the most sophisticated publicly available rating systems. It accounts for many of the deficiencies of the older ones. It is also useful for formats with many players/teams all competing against each other. That makes it well suited for horse racing, or any type of racing sports for that matter.
If you want to read more about the model just check this very detailed and informative article on the topic. It explains the concept and the underlying maths better than I ever could.
There are countless different rating systems on the Internet, but I want to give an honourable mention to a particular one I came across a few months ago. That one has to do with FIFA (the computer game, not the organisation).
Bradley Grantham has posted an article about him developing a football rating model based on player ratings from the EA Sports’ FIFA computer game. What sounds like a crazy idea isn’t that crazy anymore when you find out the amount of time, money and resources in general that EA spends in order to calculate those ratings. As it turns out, EA Sports has a special team dedicated to collecting player data from just about every league featured in the game to calculate the most accurate rating for each player.
Obviously, you only get an update from EA Sports on those ratings once a year. So, the author starts with those ratings as an input and adjusts them throughout the season with the help of a machine learning model. He claims that the resulting model beats the closing line. It is up to you whether you accept this claim at face value. I found the idea fascinating, so I put it in the list as well. Maybe it can provide inspiration for someone to do something similar.
Which rating model is best?
Those systems are fit for different purposes. Furthermore, not all of them aim at calculating the most correct winning probabilities. Sometimes even a more simplistic system might be the better one for you if that is what you feel comfortable working with at the moment, but in general you would be looking for the highest predictive accuracy in a rating system.
There is a lot of literature on this topic, and here I will briefly mention just a few of the more interesting works I have come across.
The first one is “Ranking rankings: an empirical comparison of the predictive power of sports ranking methods” by Daniel Barrow, Ian Drayer, Peter Elliott, Garren Gaut and Braxton Osting. Here the authors compare the predictive power of eight sports ranking methods for US college football and basketball. They find that models incorporating score differential data, as opposed to a simple win/loss variable, are more accurate in their predictions at a statistically significant level (remember that the rating models explained above did not account for score differential). Two methods, least squares and random walker, stand out from the rest when it comes to predictive ability for college football.
Another paper by the same authors, named “Sports Rankings REU Final Report 2012: An Analysis of Pairwise-Comparison Based Sports Ranking Methods and a Novel Agent-Based Markovian Basketball Simulation”, again compares rating systems, and this one is actually available for free. The compared methods are least squares, random walker (a.k.a. Google’s PageRank), Elo, TrueSkill, Rating Percentage Index (RPI) and some variations. The data used is from professional and college baseball, basketball and American football. Again, the least squares method shows strong predictive power.
In any case it is worth remembering that the results of such comparisons are only as good as the used statistical methods and the underlying data. Furthermore, it could well be that certain ratings outperform in certain sports and underperform in others. The specifics of a sport are important in determining the best fitting rating model.
If you want to read some practical tips on calculating a least squares and some other ratings yourself, you can refer to “Statistical Models Applied to the Rating of Sports Teams” by Kenneth Massey. Furthermore, the article “Rating Sports Teams — Elo vs. Win-Loss” by Blake Atkinson contains useful code for calculating Elo scores.
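To give a feeling for what a least squares rating looks like, here is a small sketch in the spirit of Massey’s approach: find ratings whose differences match the observed winning margins as closely as possible. It is a simplified illustration in plain Python, not a transcription of the paper. The teams and scores are made up.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("schedule is not connected; ratings are not unique")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (a[row][n] - known) / a[row][row]
    return x


def least_squares_ratings(teams, games):
    """games: list of (team_a, team_b, margin) with margin = score_a - score_b."""
    index = {team: i for i, team in enumerate(teams)}
    n = len(teams)
    matrix = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for team_a, team_b, margin in games:
        i, j = index[team_a], index[team_b]
        matrix[i][i] += 1.0
        matrix[j][j] += 1.0
        matrix[i][j] -= 1.0
        matrix[j][i] -= 1.0
        rhs[i] += margin
        rhs[j] -= margin
    # Ratings are only defined up to a constant, so the last equation is
    # replaced by the condition that all ratings sum to zero.
    matrix[-1] = [1.0] * n
    rhs[-1] = 0.0
    return dict(zip(teams, solve(matrix, rhs)))


# Hypothetical mini-league: A beat B by 2, B beat C by 1, A beat C by 3.
games = [("A", "B", 2), ("B", "C", 1), ("A", "C", 3)]
print(least_squares_ratings(["A", "B", "C"], games))
```

The difference between two ratings can be read as the expected margin between the two teams. Unlike Elo, this uses the score differential directly, which is the property the papers above found to matter.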
Speaking of which it is time to move on to…
Building your model
After familiarizing yourself with the famous rating models and the theory behind them, the question that logically follows is, does it make sense to build your own model? And what kind of model should it be?
You could either take one of the above models and use it as is, apply some adjustments to it, or build your own model from scratch. There are a few practical considerations that need to be addressed in taking this decision.
Sport and league
You need to decide which sport and league you are going to focus on. The choice of sport is decisive for the type of rating model you would want to use. As mentioned above, different rating systems are better suited for different kinds of sports.
What league should you focus on? Of course, the availability of free and paid data is better for larger leagues. However, smaller leagues and sports are a less explored area, where the achievable edges are higher.
Therefore, if you are just getting started it is recommended to focus on a smaller league and/or a more exotic sport. Many sharp punters would avoid those simply because the limits are too low to be worth their time. That means there will be some edge for you to exploit. If the money is good enough for you, you might find yourself a niche there. Once you master a smaller league you can slowly start improving your rating model and data set and try your luck at a bigger playground.
Data for model building and backtesting
To get started you will need data to backtest your rating model.
On the one hand, it pays to have a long series, so that you can backtest your rating model on a large sample. Of course, sports change over time, and so do their rules, and edges tend to disappear from the betting markets. All those factors mean you should treat old data with caution. Then again, overweighting recent results could also be a problem. In general, more data never hurts; you just need to interpret it correctly.
On the other hand, it is useful to collect diverse data. A Premier League rating model using only goals, shots and shots on target as an input is very unlikely to find you any edge. This is widely available data, so it is surely already factored into the betting markets. Now, adding to that things like player assists, mileage, pass quality, tackles, you name it, is a whole different story.
The more factors – the better?
More exotic factors might improve your model and help you find an angle that others might have missed. Be careful though, since too many factors in a rating model may lead to overparameterisation. Overparameterisation does not only twist your tongue (have you tried saying that out loud?), but even worse, it makes you see relationships in the data where there are none. Make sure you understand the mathematical basics of the kind of model you are trying to build. There is a lot of literature trying to answer the question of what the ideal number of parameters for a model is, so make sure to make use of it.
A really important type of data for building a rating model (or any kind of model for betting) is historical odds data. If you get your hands on it, you can compare your model estimates with the market closing line, which gives you a much better idea of the viability of your model. In fact, using the closing line as an estimator, you need a much shorter sample in order to confirm or reject the profitability of your bets. You could build a model without odds data as well, only it will be much harder to prove that you have an edge in this case.
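As an illustration, comparing a bet to the closing line could look like the sketch below. The bookmaker margin is removed here by simple proportional normalisation, which is a simplifying assumption (there are other ways to do it), and the odds are made up.

```python
def remove_margin(odds):
    """Turn a full set of decimal odds into probabilities that sum to 1.

    Uses proportional normalisation, a simplifying assumption.
    """
    implied = [1.0 / o for o in odds]
    total = sum(implied)
    return [p / total for p in implied]


def closing_line_value(odds_taken, closing_odds, outcome_index):
    """Expected return per unit staked if the closing line is the true price."""
    closing_probability = remove_margin(closing_odds)[outcome_index]
    return odds_taken * closing_probability - 1.0


# Hypothetical tennis bet: taken at 2.10, the market closed at 1.95 / 1.95.
print(round(closing_line_value(2.10, [1.95, 1.95], 0), 4))
```

Averaging this number over all the bets your model would have placed tells you much sooner than the actual results whether the model is onto something.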
Choosing a rating model
Once you have secured the data, you need to decide on a model you are going to use. You can take the famous rating models listed above as a starting point or build something from scratch. The important thing is to understand the premises behind those models and evaluate which would work best for your data set and the goal you are after.
After you have decided on a model you need to backtest your data in order to calibrate it. At this point you should find answers to questions such as:
- How long back in time should games influence a rating and by how much?
- What role should the margin of victory play in determining a power rating?
- Should long periods with no rating-relevant games adjust the rating and by how much?
… as well as other questions about the importance and weighting of factors that might be relevant for your rating.
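For a simple Elo model, answering the first of these questions comes down to choosing the K-factor. A backtest can do that for you: replay the games in chronological order, record the probability the ratings gave to the actual winner before each game, and keep the K that gives the lowest log loss. The sketch below assumes a starting rating of 1500 and uses a made-up list of results. A real calibration would need far more games and a separate sample for validation.

```python
import math


def win_probability(r1: float, r2: float) -> float:
    """Elo probability that the player rated r1 beats the player rated r2."""
    return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400))


def backtest_log_loss(games, k: float, start_rating: float = 1500.0) -> float:
    """games: chronological list of (winner, loser). Returns the mean log loss."""
    ratings = {}
    total = 0.0
    for winner, loser in games:
        rw = ratings.get(winner, start_rating)
        rl = ratings.get(loser, start_rating)
        p = win_probability(rw, rl)  # forecast made before the game
        total += -math.log(p)
        transfer = k * (1.0 - p)     # same update rule as plain Elo
        ratings[winner] = rw + transfer
        ratings[loser] = rl - transfer
    return total / len(games)


def best_k(games, candidates):
    """Candidate K-factor with the lowest backtested log loss."""
    return min(candidates, key=lambda k: backtest_log_loss(games, k))


# Hypothetical results, oldest first.
games = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"),
         ("C", "B"), ("A", "C"), ("B", "C"), ("A", "B")]

for k in (5.0, 10.0, 20.0, 40.0):
    print(k, round(backtest_log_loss(games, k), 4))
print(best_k(games, [5.0, 10.0, 20.0, 40.0]))
```

The same loop works for any other parameter: change one thing at a time and see whether the forecasts get better or worse.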
All those considerations basically boil down to just two questions. First, what are the factors that determine a good player or team in a given sport. And second, how do you measure and implement the uncertainty around the ratings you have collected.
Placing your bets
Placing your bets is always a challenge, especially if your bets contain an edge. In this case you will consistently beat the closing line and soft bookmakers will chase you away. What is left for you then are the sharp books and the exchanges.
You could automate the placement of the tips your model produces. Whether that is worth the effort depends mostly on the number of bets your model is producing. You could use Betfair’s or Pinnacle’s API to build a bot or use an external service provider to do that for you.
If, on the other hand, the number of bets is relatively small, your main task will be to secure the best odds. In this case it should not be an issue to place your bets manually. Using an agent would be a good approach in this case. Two of the big players on that market and the ones I work with are AsianConnect and Sportmarket, but there are others as well.
Once you are up and running, you would need to re-calibrate your rating model every once in a while. The importance and weights of certain factors in your model will change in time, so you cannot use the same numbers forever, even if they made sense in the beginning. Furthermore, you will probably need to update your ratings after every game, so there is no way around finding a way to regularly update your data.
To do this you will probably need some sort of scraping mechanism or a good API. Even if you have a good source of data in a user-friendly format, downloading and inserting your data manually every week would be burdensome. A scraping script itself tends to require maintenance, as web pages change their layout so you might need to adjust your script accordingly every now and then. You must also comply with the rules of the website hosting the data so they don’t ban you.
The quality of the data you are working with is an important aspect to take care of, especially if you get the data for free. There might be gaps or straight-up wrong numbers in your data, and you need to find a way to validate and correct such data points.
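What such a validation step looks like depends entirely on your data, but even a few basic checks catch a lot. The sketch below assumes each match is stored as a dictionary. The field names (date, home, away, home_score, away_score, closing_odds) are hypothetical and only serve the example.

```python
def validate_match(row):
    """Return a list of problems found in one match record (empty if clean)."""
    problems = []
    for field in ("date", "home", "away", "home_score", "away_score"):
        if row.get(field) in (None, ""):
            problems.append(f"missing {field}")
    for field in ("home_score", "away_score"):
        value = row.get(field)
        if isinstance(value, int) and value < 0:
            problems.append(f"negative {field}")
    if row.get("home") and row.get("home") == row.get("away"):
        problems.append("team plays itself")
    odds = row.get("closing_odds")
    if odds is not None and any(o <= 1.0 for o in odds):
        problems.append("decimal odds must be above 1.0")
    return problems


def find_duplicates(rows):
    """Return the (date, home, away) keys that appear more than once."""
    seen, duplicates = set(), []
    for row in rows:
        key = (row.get("date"), row.get("home"), row.get("away"))
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


# Hypothetical records: one clean, one with several problems.
clean = {"date": "2019-03-02", "home": "Team A", "away": "Team B",
         "home_score": 2, "away_score": 0, "closing_odds": [1.9, 3.5, 4.2]}
broken = {"date": "2019-03-02", "home": "Team A", "away": "Team A",
          "home_score": -1, "away_score": None, "closing_odds": [0.95, 3.4, 4.1]}

print(validate_match(clean))
print(validate_match(broken))
print(find_duplicates([clean, clean, broken]))
```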
Having done all this, you can re-calibrate your rating model on preset intervals, or even better, dynamically update your numbers after each game. In this way you will have your ratings and the rest of the model parameters always up to date.
At the end of the day, as you see above, building a rating model might require huge effort in:
- Scraping, storing, validating and updating data
- Educating yourself on mathematics and statistics to design your model
- Mastering a programming language that allows you to apply your model using your data in practice
- Ensuring optimal execution, which might require the development of a bot
And even after you do all this, there is still no guarantee you will have positive return! If you are looking for fast results, you can simply follow a Technical Value Betting or an Arbitrage strategy. However, rating models belong to the family of Fundamental Value Betting models and they carry not only the typical drawbacks, but the typical advantages of those as well. Namely,
- Even though you must fight hard for every percentage point of expected ROI, your upside in terms of possible turnover is (virtually) unlimited. This depends on the sport and league you focus on, but for most people the possibilities you have to get down in all the sharp books, exchanges and agents are more than sufficient.
- You have a steep learning curve, which means you can gradually improve your ROI as you go. That goes together with smart bank management, since burning through your bank before having perfected your model won’t get you anywhere.
- In any case, in Elo, Glicko, TrueSkill and others you already have some pretty powerful rating models you could use as a basis. You will need quite some programming skills and understanding of statistics to use those, build on them and get going. However, even if you don’t have those skills, acquiring them certainly wouldn’t hurt you, since they are in high demand in many domains aside from sports betting.
Is it worth it?
So is it worth it? The call is yours. There are many risks and many potential benefits. In the end it all depends on your skills, resources and motivation. Clearly, there is a significant risk that you never reach the point where your model becomes profitable. So what’s important is to enjoy the journey and manage the financial risk with caution.
As for me, I am planning to start developing player and team ratings for League of Legends together with a friend who is a fan of the eSport. Using this article (among others) as a reference, hopefully we will manage to produce something valuable from a betting perspective. I will report on the progress of this project at a later point.
Thanks for reading and I would love to hear your thoughts in the comment section below. Furthermore, if you’d like to stay updated on my new content, make sure to follow me on Twitter and/or subscribe to my e-mail list in the top-right corner of this page. See you around!