Is there anything better than the Elo system?
Although, as explained in more detail elsewhere, the Elo system is the best system used in practice to measure playing strength, and thus well suited to producing meaningful rankings, it nevertheless has a few weaknesses. Some of these are well known; others have perhaps rarely been discussed.
All this only becomes relevant if one has suggestions on how to remedy the weaknesses, or can even present a better system in which they do not arise in the first place. To make the whole thing palatable to the reader, we will proceed as follows:
First, the Elo system is introduced. Those who are already comfortable with it can happily skim this part. Next, the individual weaknesses of the current system are examined. They are listed here only briefly to make the later reading easier: the arbitrariness of the numbers, Elo inflation, the black-white problem, the draw problem (yes, it exists!), and the adjustment of the numbers on the basis of results achieved. The last of these is a problem in every system. Here, however, a method is presented for finding an adjustment that is as realistic as possible. Ultimately, this serves a self-imposed requirement: to be really good, a system has to be suitable for predicting game outcomes.
After all the problems have been pointed out and discussed, an alternative system will be presented which copes (better) with these problems.
2) The Elo system presented
Professor Elo assumed that the playing strengths of players were normally distributed (and have remained so until today). The normal distribution, i.e. the so-called Gaussian bell curve, simply asserts that there are a few very good players, a few very weak players and in between many more or less average players.
This results in an expectation of points when comparing two numbers. Please do not confuse this with a probability of winning. The points are made up of draws and wins.
I have simply copied the explanation of the system from the internet. I can explain it on the basis of this.
Note: If there were no draws, the expected score would be just the probability that A wins. Since a game of chess can also end in a draw, the expected score is equal to the probability of winning plus half the probability of drawing. The probabilities of winning, drawing and losing are not used in the Elo system, only the expected values.
Here now is the copied section:
EA = 1 / (1 + 10^((RB - RA) / 400))
EA: Expected score for player A. For a series of 5 games, you can also multiply EA by 5.
RA: Player A’s Elo rating to date.
RB: previous Elo rating of player B
Expressed as a percentage, the expected value for A is EA · 100 %. The new Elo number R'A of player A is
R'A = RA + k · (SA - EA)
k: is usually 15, for top players (Elo > 2400) 10, for less than 30 rated games 25.
SA: score actually played (1 for each win, 0.5 for each draw, 0 for each defeat).
Note 1: The number 400 included in the formula as well as the original k-factor were chosen by Arpad Elo to make the Elo numbers as compatible as possible with the scoring numbers of the Kenneth Harkness rating system used earlier. In fact, the Harkness model can be seen as a piecewise linear approximation to the Elo model.
Note 2: It is easy to show mathematically that EA + EB = 1.
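As a minimal sketch, the two formulas can be written out in Python. The function names `expected_score` and `updated_rating` are illustrative choices, not part of any official FIDE tool, and the update rule assumed here is the standard one: new rating = RA + k · (SA - EA).

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score EA of player A against player B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating_a: float, rating_b: float, score_a: float, k: float = 15.0) -> float:
    """New rating of A after one game; score_a is 1, 0.5 or 0."""
    return rating_a + k * (score_a - expected_score(rating_a, rating_b))


# Hypothetical example: a 2300 player draws against a 2400 player.
e_a = expected_score(2300, 2400)
e_b = expected_score(2400, 2300)
print(round(e_a, 3), round(e_b, 3), round(e_a + e_b, 3))  # the two expectations sum to 1
print(round(updated_rating(2300, 2400, 0.5), 1))
```

The weaker player scores more than expected with a draw and therefore gains points, even though he did not win.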
Here you can see what can happen to you when you try to educate yourself on the internet.
The two formulas and their explanations are decisive here. The shape of the first formula guarantees that a number between 0 and 1 will come out. This is not only desirable but necessary, since we are dealing with point expectations, which in chess must logically lie between 0 and 1 because a defeat scores 0, a draw 1/2 and a victory 1 (in other sports they can also be read as probabilities). The guarantee comes from the form of the expression: 1 divided by 1 plus a positive number always lies between 0 and 1, and 10 to the power of something, written 10^x, is always positive, even if x is negative.
RB - RA measures the difference in playing strength. It can be positive or negative. Dividing the difference by 400 merely sets the scale. As explained above, this value was chosen only for compatibility with Kenneth Harkness' earlier numbers. That was a good idea, but it already hints at the problem of arbitrariness explained further below.
To give you an idea of what the Elo formula roughly produces, I have shown the whole thing in a diagram. The curve represents a player’s points expectation against opponents who have up to 600 Elo points less or up to 720 Elo points more. Here is the diagram:
The curve looks beautiful, of course. In principle, it is also realistic. Against very weak opponents, the expectation approaches 100% (or 1). That is absolutely realistic. Against a certain class of opponents you are so superior that you almost certainly win, while against far better ones you eventually have no chance. Whether such opponents exist, however, depends on your own rating. The world ranking list contains players between 2000 and about 2800 (number 1 at the moment, on 27.10.2008: Anand). Weaker players do not receive Elo numbers. So within the list maintained by FIDE, rating differences as large as those in the diagram do not occur for any individual player in both directions. An average player with 2300 Elo would have a maximum positive Elo difference of 300 and a maximum negative one of 500. For a single game, the formula above would therefore always give him an expectation of between about 85% (or 0.85 points) and about 5% (or 0.05 points).
The S-shaped course of the curve is also easy to understand. The expectations come closer and closer to 0 and 1 as the difference in playing strength increases, but these values are never reached. This is also intuitively obvious. One has to assume that every participant at least knows the rules, i.e. can recognise all possible moves in a position and could therefore also execute them. So he could also play the very move that Kasparov would have made (more on this topic in the chapter “A few number games”).
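To get a feeling for the numbers behind the curve, the expectation can be tabulated for a range of rating differences. This is only a sketch: the step size of 120 points is an arbitrary choice, and `expected_score` is an illustrative name for the first formula, here written in terms of the rating difference.

```python
def expected_score(rating_diff: float) -> float:
    """Expected score for a player rated rating_diff points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


# From 720 points weaker to 600 points stronger than the opponent, in steps of 120.
for diff in range(-720, 601, 120):
    print(f"{diff:+5d}  {expected_score(diff):.3f}")
```

The values rise steadily with the rating difference and stay strictly between 0 and 1 at both ends of the table.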
But now comes the second formula. It is used to calculate the new Elo numbers; it is the “playing strength update function”. You have to react to results, that’s for sure. Players achieve good or bad results. Players actually improve or get worse. Besides, the point of the system is to show developments and changes. It spurs you on: you don’t just want to win the game or the match, but to climb up the rankings.
Here, we calculate with factors of 25, 15 and 10. If we use 25, we react more quickly. It is obvious that one must assume a faster development as long as a player has only played a few (i.e. less than 30) games. Likewise, it seems to make sense to react somewhat more slowly when the numbers are high (>2400). It expresses the fact that a very good player tends to be less subject to large fluctuations in general. Also correct so far. One could also express it in such a way that with weaker players many coincidences determine the outcome. Because of this, one reacts a little more quickly.
At least the criterion I mentioned appears here: there is talk of an expectation of points. An expectation is a forecast of a future event, and that at least indicates an attempt to forecast with this system. Or is it even a forecasting system? This is examined a little in the following:
a) Weaknesses of the system
The formulas all work quite well. The Elo system is, in my estimation, the best rating system in existence and in use in practice, just as a reminder.
Most of the weaknesses listed here are probably already known. Nevertheless, I collect them here and explain them in detail. The system presented later will cope better with some of them.
i) Arbitrariness of the Elo numbers
This is a minor problem. But the scale is in fact completely arbitrary. If you tell a non-chess player that you have an Elo number of 2300, they can only make sense of it if they know other figures for comparison.
Nevertheless, I find it intuitively plausible that one would like to have a measure with which one can possibly do something just by naming the number, possibly even across sports. As I said, this only becomes really interesting if one has a better suggestion.
ii) The speed of adaptation
Factors of 25, 15 and 10 are used. They cause a faster or slower response to results. That it makes sense to differentiate in this way has already been explained above. But how different the reactions should be has probably not yet been investigated, let alone has the existing system been verified.
Elsewhere (see chapter “Comparability of predictions”) it is mentioned that there are ways to verify a system in itself. That would also be possible here, but the necessity is missing. The system runs, you let it run, everyone collects their points. Why improve it? What’s more, everyone has learnt the rules and has probably accepted them. “I gained 14 Elo points in the last period, what about you?” “I lost 18 points, I had a very bad tournament.” That’s just the way it is. Play poorly and you lose points, play well and you gain points. That’s how it’s calculated, that’s how FIDE does it, end of story. But the question nobody asks could be: “How realistic was it that I lost exactly 18 points?” Here “realistic” always refers to what the actual points expectation would be in the next game against that opponent.
So whether the change in the number is suitable to make the best possible prediction for the following game has not really been considered relevant so far….
First of all, here again is the statement of what the weakness is: the adjustment speed is not verified. It seems well-considered, but still intuitively determined. Is there a method that could be used to investigate and, if necessary, improve the speed of adjustment? Yes! Patience….
iii) Elo inflation
The problem of arbitrariness makes itself felt a little here. That inflation can occur is obvious. Simply reasoned: There is no normalisation or anything. The numbers are somewhere between 2000 and 2800, in the current Elo list. FIDE only takes into account players with a playing strength of 2000 or more. If you now add up all the numbers and divide them by the number of players, you get an average value. This is, let’s say, 2285.
That would be the current average player, so to speak. But the list is not normalised to this number; it is 2285 purely “by chance”. In the next period (the periods are always half-yearly) there are quite a few “newcomers”, and some players drop out of the list. The newcomers develop faster, with a factor of 25, so point gains and point losses do not balance. If the newcomers are also good, then they win more points (they win anyway, just think about it; those who don’t win don’t even make it into the list) than are lost elsewhere. This must cause the average to rise in the next period. This is a typical inflation effect. It may only go up to 2286, but it goes up.
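The imbalance between gains and losses can be made concrete with a single game. The following sketch assumes the standard update rule (rating change = k · (score - expectation)); the function name `rating_changes` and the two ratings of 2300 are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rating_changes(rating_a, k_a, rating_b, k_b, score_a):
    """Return (change for A, change for B) after one game, each with its own k."""
    e_a = expected_score(rating_a, rating_b)
    change_a = k_a * (score_a - e_a)
    change_b = k_b * ((1.0 - score_a) - (1.0 - e_a))
    return change_a, change_b


# Hypothetical game: a newcomer (k = 25) beats an established player (k = 15),
# both rated 2300.
gain, loss = rating_changes(2300, 25, 2300, 15, 1.0)
print(gain, loss, gain + loss)  # the sum is the net number of points added to the pool
```

With equal k-factors the two changes would cancel exactly; with different factors the total number of points in the list changes.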
It is also only a small problem. But it exists. If you meet a player with an Elo number of 2420 today, it might mean: 20 years ago it was worth something, today he is one of many.
Note: One could argue that it is correct for inflation to occur. That would be the case if a player rated 2420 twenty years ago, who was “a giant” then, would in fact have exactly equal chances against a player rated 2420 today. That is a philosophical question. In any case, the truth is that the game itself is also evolving. In plain language: “Everyone is getting better.” Standing still is a step backwards. What previous generations painstakingly achieved is now common knowledge; anyone can do it. And this development is in no way “typical of chess”. It is true in virtually all sports.
Professor Elo certainly neither intended nor foresaw this. Apart from that, it would be a miracle if the general development of chess itself happened to be represented by the (unintended) inflation effect.
iv) The Black-White Problem
This problem is already much more serious. According to the database I have, of all games that end in a win, about 70% are won with the white pieces. White has an advantage, the first-move advantage (in German “Anzugsvorteil”, which has nothing to do with a suit or a dress code). The advantage exists, objectively and undisputedly. It is not taken into account in the formulas used.
In the past, only entire tournaments were evaluated. There, the problem was of minor importance. Nowadays, individual games are evaluated. Sure, you sometimes have White and sometimes Black, so the problem is neutralised over time. But if you play a single game and it is evaluated, then you can be a victim, or equally a beneficiary, of this injustice.
Apart from that, the suitability for prediction breaks down at this point at the latest. If one calculates the chances for a game from the Elo formula, the formula does not ask who has White. One of the two does, and his chances are guaranteed to be better than the formula assumes.
v) The draw problem
To repeat: the whole system is good and works.
The draw problem is simply that the result you get does not answer the question “How likely is a draw?”. The section above on the Elo system says quite succinctly: “The probabilities of winning, drawing and losing are not used in the Elo system, only the expected values.”
That is true in a way. But could it not still be interesting to know how probable a draw is? Just like that, as a question. Besides, thinking further: if there were to be betting on chess games at some point, this probability would be needed in any case, at least by the provider (rumour has it that betting has already taken place?!).
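Why the expected score alone cannot answer the draw question can be shown with a few made-up numbers. In this sketch, the three (win, draw) profiles are purely hypothetical; the only fact used is that a win counts 1, a draw 1/2 and a loss 0.

```python
def expected_score(p_win: float, p_draw: float) -> float:
    """Expected score from win and draw probabilities (a loss scores 0)."""
    return p_win + 0.5 * p_draw


# Three hypothetical (win probability, draw probability) profiles.
profiles = [(0.6, 0.0), (0.4, 0.4), (0.2, 0.8)]
for p_win, p_draw in profiles:
    print(p_win, p_draw, round(expected_score(p_win, p_draw), 3))
```

All three profiles lead to the same expected score, so the draw probability cannot be recovered from the expectation alone.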
vi) The Normal Distribution
The basic assumption of Professor Elo, which is thus also the basis for the function used until today, is that the playing strengths are normally distributed. And this “normally distributed” is already a highly mathematical concept. The Gaussian normal distribution, the famous bell curve. Nor is there any direct fault to be found in this assumption. One can see:
The mean value here is 2200 Elo. So that’s where you theoretically meet most players. Very few players have a very low number, very few a very high number; the number of players rises towards the mean, reaches a maximum there and then falls again. About two thirds (more precisely, about 68%) of the values lie within one standard deviation of the mean. This is the area between the two inflection points. Inflection points are where, when tracing the curve, one would have to change from a left-hand bend to a right-hand bend and later back again. That’s a little mathematical digression.
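The share of values within one standard deviation can be checked quickly with the error function from Python's standard library. The helper name `share_within` is an illustrative choice.

```python
import math


def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


print(round(share_within(1.0), 4))  # share within one standard deviation
print(round(share_within(2.0), 4))  # share within two standard deviations
```

The result does not depend on the mean or on the size of the standard deviation, so it holds for any normally distributed rating scale.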
The two assumptions that the playing strengths are normally distributed and that one can deduce from this that one would have to have certain expectations for certain differences in the numbers are not entirely confirmed by practice. There is evidence for this, and certainly some good reasoning.
One proof that it does not quite correspond to reality in the case of large Elo differences: players with high Elo numbers are very reluctant to play Open tournaments. The reasoning is quite simple: “I’m not going to mess up my number.” So the statement goes. And they are right! If they do not achieve the points yield calculated according to the Elo formula, it is not due to their own weakness of form, but to the weakness of the formula.
A friend and grandmaster of chess, Robert Rabiega, has been a professional chess player for many years. He has a special talent when it comes to rapid and blitz chess, so he can make a pretty good living at it. He has a wife and two children. However, he also has to travel frequently to Open tournaments, which are rated Elo (as opposed to rapid and blitz). A grandmaster doesn’t get that many invitations these days. He says that participation in an Open costs him an average of two Elo points. A very good calculation. And another proof: the Elo system is good. But it has weaknesses. And it is only good enough until a better system is found.
b) Suitability as a forecasting system
The system is not suitable as a forecasting system for the reasons mentioned above, mainly because it is not designed for that purpose. One reacts intuitively to the results. Factors of 10, 15 and 25 are set. This is how movement in the rankings is achieved. Intuitively, it also makes sense. Suppose a player with a previous number of 2340 achieves a “performance” of 2260 in 9 games (the performance is the rating that corresponds to the result achieved in a tournament; it cannot be read directly from the above formula, but chess players calculate with it). The new number should then lie somewhere between 2260 and 2340, but probably closer to 2340, since that number was earned over a longer period of time, whereas the current result is a snapshot, which should nevertheless have an effect precisely because it is current. Calculated according to the Elo formula with a k-factor of 15, this results in a loss of (funnily enough, since the example was chosen arbitrarily) 15 Elo points. One has the feeling that this is adequate, sure. Before, you had 2340, you played badly, now you have 2325. Not 2260, not 2340, but closer to 2340, so 2325. It was never checked whether this gives a good prognosis for the future.
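The loss of roughly 15 points can be reproduced with a short calculation. The text does not say how strong the nine opponents were, so this sketch assumes, purely for illustration, that all of them were rated 2300, and it treats a performance of 2260 as the score a 2260 player would be expected to achieve against them. The function name `rating_change` is hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rating_change(own, performance, opponent_avg, games, k):
    """Approximate change when the actual score equals what a player rated
    `performance` would be expected to score against the same opposition."""
    expected = games * expected_score(own, opponent_avg)
    actual = games * expected_score(performance, opponent_avg)
    return k * (actual - expected)


# Assumption (not stated in the text): nine opponents, all rated 2300.
print(round(rating_change(2340, 2260, 2300, 9, 15), 1))
```

Under these assumptions the change comes out at a loss of about 15 points, in line with the figure quoted above.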
Moreover, the draw is not predicted at all and the Black-White considerations prove anyway that the expectations related to a single game do not meet the criterion of prediction, of prognosis.
If one were allowed to bet on this against FIDE, supposing they presented their system as a forecasting system, then the bets would have to be formulated in such a way that FIDE calculates its point expectation and one can bet on or against this expectation in the manner of a spread (see also the chapter “The Betting Market”, subsection “spread betting”). A bet would look something like this: the spread in a game Anand – Topalov is >0.53 or <0.50. You could bet on Anand to score more than 0.53 points, or bet against him to score less than 0.50 points. If you bet on him, you would have a small loss on a draw. Based on the Elo expectation, he would be the favourite as the player with the higher Elo (assumption here: he has the better number). This way of betting would, however, still not be based on a predicted draw probability. And since Black and White are not taken into account, bettors would still have an advantage.
c) The unwieldiness
The formulas are quite unwieldy. If you ask around in the chess scene, you will find that most people cannot produce the formulas off the cuff, quite the opposite. Most do not know how to calculate their own performance, nor what their new number will be. Recently it has become possible on the internet, since everything is available there, to have the number calculated directly. One enters one’s own number, the opponent’s number(s) and the result, and sees everything of interest. On the other hand, it might be desirable to keep the whole set of formulas comprehensible. As it is, you believe the results, but you cannot calculate them yourself.
Certainly, this problem too goes unnoticed only as long as there is no better proposal. But a better proposal is precisely what is supposed to follow here and, surprise, surprise …
1) The better system
As long announced, there is a system that copes better with these weaknesses. Most of the proposed improvements could also be applied to the Elo system. But there are some points that make the system presented here actually superior. But first things first…
a. The formula for calculation
The playing strengths are always expressed as percentages. So each participant receives a playing strength between 0 and 1, or 0% and 100%. (How to get them and how to maintain them will be dealt with later.) Now, when two playing strengths, two players, two participants, meet, their playing strengths must be offset against each other. The calculation rule is first derived intuitively.
It is immediately obvious that, against the 50% participant, each participant’s expectation is simply his own playing strength. So if you have 64% and your opponent has 50%, then you have 64% in this duel. This is, so to speak, the definition of playing strength: it describes the expectation against the average participant.
The second, immediately obvious, prerequisite is that one has exactly 50% expectation against a participant of one’s own playing strength. That is, 64% for oneself, 64% for the opponent, 50% for the match.
Now it is explained intuitively and by example how one arrives at the simple calculation formula. My own intuition helped me with this and sometimes it really is simple sentences that you only have to say to make a problem manageable. So I say to myself the following sentence: If I win twice as often as I lose, but my opponent only wins half as often as he loses, I immediately find it reasonable that I should win against him four times as often as I lose. Is that convincing?
You still don’t know what percentage of games you win, but at least you know the ratio. Maybe the percentage can be found from there?
Note: This idea comes from tennis. There is always only one winner and one loser. That’s how the game is played. Transferred to, for example, chess, one would have to express it in such a way that it already loses its vividness, but nevertheless, here you go: “If I score twice as many points as points I give away and my opponent only scores half as many as he gives away, then I score against him four times as many as I give away”.
That makes it much easier to find the formula. Let us simply work through this example and then another one, for the sake of simplicity and illustration in a sport where there are only winners and losers and no draws:
If you win twice as many games as you lose, then that corresponds to a playing strength of 66.66%. You win 66.66% of your games and lose the remaining 33.33%. 66.66% divided by 33.33% results in a factor of 2, i.e. double, in the ratio of wins to losses.
My opponent has a playing strength of 33.33%. He wins 33.33% and loses 66.66%, so he wins only half as often as he loses. His ratio of wins to losses is 1/2. Now divide your own win/loss ratio of 2 by his of 1/2, or 0.5: 2/0.5 = 4. So, calculated in this way, you win against him four times as often as you lose.
Now all that’s missing is the answer to the question: what does this mean in percent? What remains is really just a rule-of-three problem. We are looking for two numbers, two percentages, that express how often one wins and how often one loses in this specific match or game. The sum of the two numbers must be 1 and their quotient must be 4: p1 + p2 = 1 and p1/p2 = 4.
Let’s get to work: eliminate p2 from the second expression. Remember how? From expression one you get p2 = 1 - p1. Substituting this into expression two gives p1/(1 - p1) = 4. Then move the annoying fraction to the other side by multiplication: p1 = 4 * (1 - p1). Multiplying out gives p1 = 4 - 4 * p1. Then bring the p1 term back to the other side, but change the sign! This results in p1 + 4 * p1 = 4. Adding up gives 5 * p1 = 4. But we want p1, so we divide both sides by 5 and get p1 = 4/5, or p1 = 0.8.
Now only one question remains: why are these darned variables always called p? And there’s an answer to that, too: mathematicians think they’re more scientific if they express everything in English. That’s why we have pi and epsilon. But, joking aside for this moment, the p stands for “probability” and that means something like “likelihood”.
Of course, that was not the relevant question. We have now calculated 0.8 or 80% as the probability of victory (transferred to chess: as the expectation of points). So the opponent has the remaining 20%. 80% is four times as much as 20%. So the condition is also fulfilled. Witchcraft?
The validity of the formula can also be briefly checked with the two standard examples: Against a 50% player, everyone has his expectation. Because the 50% player himself has a quotient of 1 as far as the ratio of victories to defeats is concerned (50%/50% = 1). Then the expectation for this game/match is one’s own playing strength, since the expectation is not changed by a factor of 1.
Against a player of the same playing strength, one always divides a quotient by the same quotient. So you would have to win 1 times as many games against your opponent as you lose. And 1 times as many is still 1. So both win the same number of times, each wins 50%, so that is also correct.
If a participant with 82% meets another participant with 64%, then it is not immediately obvious how one calculates the expectation of the two against each other. Nevertheless, it should be explained here: One makes the two playing strengths comparable with each other by representing them as quotients. The quotient expresses the following question: how many times more often does the participant win than lose?
The first participant wins 82% of his games, so to speak, which is what the playing strength tries to express. He wins 82% and loses 18%. The other, with 64%, has a win/loss ratio of 64/36 because he loses 36% of his games. We have made the two playing strengths comparable by expressing them as ratios. Player 1 has a ratio of 82/18, player 2 has a ratio of 64/36. Once you have these two ratios, you divide one by the other to get the ratio for this specific match: (82/18) / (64/36) is about 2.56. Then you apply the rule of three as above and you have found the expectation for player 1, about 72% (and therefore also for player 2, about 28%).
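The calculation rule described in this section can be sketched in a few lines of Python. The function names `win_loss_ratio` and `match_expectation` are illustrative, and the sketch assumes playing strengths strictly between 0 and 1 so that no division by zero occurs.

```python
def win_loss_ratio(strength: float) -> float:
    """Ratio of wins to losses for a playing strength strictly between 0 and 1."""
    return strength / (1.0 - strength)


def match_expectation(strength_a: float, strength_b: float) -> float:
    """Expectation of A against B from the quotient of the two win/loss ratios."""
    q = win_loss_ratio(strength_a) / win_loss_ratio(strength_b)
    # Solving p / (1 - p) = q for p gives p = q / (1 + q).
    return q / (1.0 + q)


# The examples from the text: 2/3 vs 1/3, the 50% opponent, equal strengths, 82% vs 64%.
for a, b in [(2 / 3, 1 / 3), (0.64, 0.50), (0.64, 0.64), (0.82, 0.64)]:
    print(f"{a:.2f} vs {b:.2f}: {match_expectation(a, b):.3f}")
```

The two consistency checks fall out automatically: against a 50% opponent the result is one's own playing strength, and against an equal opponent it is 50%.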
b. The superiority of this system
The system is superior in that it is universally applicable. It does not matter whether it is a team sport or an individual sport and it does not matter how the winners are determined. It can be about points or goals, but also times or distances. Even draws, as in chess, football, handball, are included. It is suitable for any sport or game where there are two parties.
Now it is best to show point by point which improvements could be suggested. In doing so, it also becomes clear point by point which of the suggestions could also be applied to the Elo system.
i. The arbitrariness of the numbers
The Elo numbers are purely arbitrary. One can gradually develop a feeling for what is a good number and what is a bad one. But even there, inflation would stand in the way.
This problem would be solved by the alternative system, which always provides comparable numbers across sports and games, which are not arbitrary.
One can even go further: The level of the numbers that could and would be achieved at all would even make the games comparable with each other. We will try to make this clear with the help of a few games:
The first game would be Mensch-Ärger-Dich-Nicht. To what extent has it actually been researched? In any case, one can pursue certain strategies. For example: either bring one piece around as quickly as possible, or let several pieces advance together. But should you overtake opponents when pursuing the first strategy? There is the danger of being knocked out. Well, there are strategies, and they will not all be equally good. But it can be assumed that even the greatest experts, simply because of the nature of the game, will never be able to achieve a playing strength much higher than 55%. The luck factor remains too high.
The situation is a little different in chess: there is perhaps, but so far only in theory, one best move in every position. The player who finds this move, always and in every position, must become world champion. The only exception would be if he met someone who also always finds the best move. In that case, according to the currently generally accepted view, the game would end in a draw. Chess is a drawn game: the advantage of the player who moves first is not sufficient to win. So the game differs from Mensch-Ärger-Dich-Nicht in two ways: there are no (obvious) luck factors and there is the draw.
Together, these two factors mean that it is possible to achieve very high levels of play. But certainly not 100%. Maybe the world number 1 would be at 90% playing strength at some point. But never higher. That is out of the question. He makes the best moves, the opponent perhaps sometimes only the second best. Nevertheless, even then the advantage is not always enough to win. That’s the game.
In tennis, things could look different again. There is no draw and only a little luck. And least of all is there a perfection that could be judged conclusively. There could, God forbid, be players bred as pure tennis monsters, with the whole physique geared towards tennis. Or should I even say computers, as in chess? So it would be conceivable that there would be a quasi-unbeatable player.
But the proposed system is designed in such a way, through permanent adaptation, that even then 100% would never be reached. The playing strength always changes by a proportion of the difference between what was achieved and what was predicted. So if you have a 99.99% expectation in a game because you are so superior and you actually win, then the playing strength improves only by a proportion of the gap between itself and 100%. The way the playing strength changes would thus guarantee that 100% could never be reached. But one could come very close, which would then also be realistic and correct. Didn’t Federer recently have a winning streak of 46 games?
It would also be conceivable to use the system in backgammon. The best players are pretty much agreed that you can hardly win more than 65% of your games against any opposition (including good ones). It is a strategy game. But there is the luck factor of the dice.
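The text does not spell out the update rule at this point, so the following is only one possible reading, sketched under explicit assumptions. The hypothetical function `update_strength` moves the playing strength by a fraction `rate` of the surprise (result minus prediction), scaled by the remaining distance to 100% after a better-than-expected result, or by the distance to 0% after a worse one. The name, the scaling and the value of `rate` are illustrative assumptions, not the author's stated method.

```python
def match_expectation(a: float, b: float) -> float:
    """Expectation of strength a against strength b (closed form of the ratio quotient)."""
    return a * (1.0 - b) / (a * (1.0 - b) + b * (1.0 - a))


def update_strength(strength: float, expected: float, result: float, rate: float = 0.1) -> float:
    """Assumed rule: move by a share of the surprise, scaled by the room that is left."""
    surprise = result - expected
    room = (1.0 - strength) if surprise >= 0 else strength
    return strength + rate * surprise * room


# A dominant player keeps beating a 50% opponent, game after game.
strength = 0.90
for _ in range(1000):
    expected = match_expectation(strength, 0.5)
    strength = update_strength(strength, expected, 1.0)
print(strength < 1.0, round(strength, 4))
```

Because each step covers only part of the remaining gap, the playing strength creeps towards 100% without ever reaching it, which is the property the text asks for.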
You just can’t play the game that superiorly, even if you always make the best move. Of course, it depends on the length of the match. In longer matches, it is clear that the better player has a greater chance of winning, in shorter ones the luck factor increases, just as naturally. But nevertheless, one could use the system for backgammon without further ado. The playing strength numbers would also represent the character of the game a little. Here, playing strengths of over 60% would certainly be very good.
In short, the system could be applied to all these games and sports. The maximum possible or achieved numbers would give an impression of both the character of the game and the quality of the player. In any case, arbitrariness no longer exists, rather the exact opposite.
ii. The speed of adaptation
Here a few remarks should be made in advance, which are in any case relevant to this sub-chapter:
It was only briefly mentioned above: the main quality feature of a system is its suitability as a forecasting system. That is always the overriding idea. There is, moreover, a method, studied elsewhere, for checking the quality of forecasts (to be read in the chapter of the same name). It can be used to compare two different forecasts of the same events in the long term, but also to test a single system for plausibility. In this respect, it would be natural to run the system proposed here against Elo for a while and compare the results.
An essential, almost overriding criterion seems to be traceability. The same, simple rules should apply to all participants. That is understandable. This criterion would be lost if the system were geared towards forecasting, and that would have to be solved in practice.
Another, also overriding, could be that one wants to have (a lot of) movement in the ranking. That could create a kind of tension. In this case, the consideration that it should be “as realistic as possible” takes a back seat.
The only thing to keep in mind with all the improvements suggested here is the goal being pursued. If someone considers one of the other criteria to be more valuable, please feel free to do so. The orientation towards prognostics is taken as a basis here. The practical problems can be dealt with later.
So much for now. Now let’s get down to business:
What we are looking for is the perfect adjustment speed. How must one react as precisely as possible to a result in order to make the best possible prognosis for the next game of the two participants (against themselves or against others)? Here, too, the answer is already in the (well-formulated) question. Firstly, one sees that it can be individually different and secondly, one sees how one can determine that.
Elo has already recognised that it can and even must vary from person to person. It is calculated with the factors 25 (fast adaptation), 15 (slower) and 10 (slowest). Two opponents with different factors can even meet. Then one of them would gain more than the other loses (see inflation). In the system proposed here, too, one has to react individually.
How one can determine the (different) adjustment factors can in principle also be seen in the question: In order to determine the quality of the forecast for the following game, one can simply use past results again.
Now you can once again learn something about the way my brain works and, by extension, the resulting implementation for the creation of a text: I have revised the whole chapter here several times (you don’t notice, you say?). I wrote and wrote. And then read and read again. Then deleted and deleted again. Then rewrote and rewrote. And then I wrote and wrote again. Do you notice anything? Yes? Me too: why have I never actually thought? Aha, lack of ability; but at least I can read minds. Anyway, I found the following text passage from an old, long-forgotten or (unfortunately not finally) deleted text. And although it doesn’t quite fit here, I wanted to preserve it.
BEGINNING OF THE INSERTED TEXT PASSAGE
With the Pauli system, I have solutions for all these problems, of course (somehow I always vacillate between megalomania and insanity, modesty, which can also be wrong, and absolute cluelessness, if only I still knew the noun of “submissive”, I don’t know the passive subjunctive perfect of “know”, if there is such a thing at all. But then it would have to be invented, so why not me). So good old Pauli has once again thought (a few) thoughts (too many).
Well, actually I only transferred old ideas. But still. So you have an expectation for a game. This is, as in the example above, 71.93% of the possible points for player 1. Then you have a result in the game. Be it a draw. Then you have a deviation from the forecast result. This would be 0.2193 points. Player 1, the favourite, has scored too few points. Player 2 has exceeded his expectation by this number of points. So you should now correct the playing strengths of both players in the right direction (and what can be wrong with a DIRECTION? Oh yes, the direction!). The best way to do this is to use a factor, which you then place in the denominator, thus making it a quotient. — I’m just occasionally puzzling over whether I should always use a different font for the lousy (that’s what I said!) calques in the text. Your view? — So, for example, in football I calculate with a playing strength update factor of 30. That is the “best” value determined over years.
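The correction step just described can be written down in a few lines of Python. This is only a minimal sketch under assumptions: the function name `update_strengths` is hypothetical, and applying the quotient deviation/factor additively to the percentage playing strengths (so that one player gains exactly what the other loses) is one reading of the passage, not a confirmed specification. The numbers are those of the example: playing strengths 82% and 64%, an expectation of 71.93%, a draw, and the football factor 30.

```python
def update_strengths(strength_1, strength_2, expectation_1, result_1, update_factor):
    """Return the corrected playing strengths after one game (hypothetical helper)."""
    # Deviation from player 1's point of view: positive if he scored more than forecast.
    deviation = result_1 - expectation_1
    # The update factor sits in the denominator: a larger factor means a slower reaction.
    change = deviation / update_factor
    # Assumption: what player 1 gains, player 2 loses, and vice versa.
    return strength_1 + change, strength_2 - change


new_1, new_2 = update_strengths(0.82, 0.64, expectation_1=0.7193, result_1=0.5,
                                update_factor=30)
print(f"player 1: {new_1:.4f}, player 2: {new_2:.4f}")
```

With these inputs the favourite drops from 82% to roughly 81.27% and the outsider rises from 64% to roughly 64.73%: a draw moves both numbers by 0.2193 / 30, about 0.0073.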
In chess, of course, one could similarly create an optimisation function that calculates the best possible “update factor”. In analogy to football, one tries out all possible update factors with a set of known results, which must be chronologically ordered, and takes the one that has produced the smallest deviation from the forecast result over all results.
To do this, it may be necessary to explain that there are guaranteed to be different deviations. After all, the forecast for a game depends on the current playing strength of the two players (even if one analyses retrospectively, it would of course make no sense to simply forecast the already known result and thus lie in one’s pocket; this would logically produce a deviation of 0). So: The playing strengths are different for all the different update factors after each game that has been evaluated. So for the following game, where one of the two players participates, there is also a different forecast again.
However, even this would not be quite sufficient. For there are obviously players for whom one would have to react more quickly and those for whom one would have to react more slowly. However, this is not a purely individual problem, but is determined by the number of games played. A 40-year-old, of whom I already have 1000 games in the database, is less set back by two defeats in a row than a 17-year-old with 5 games so far. So far, that makes sense.
So age and the number of games should (and must) be taken into account. This would already be a small challenge for the optimisation programme, as it would require a fair amount of artificial intelligence to adjust several parameters in order to optimise both at the same time. But it would still have to be done and one would definitely not do worse than with the Elo system used so far.
However, another question remains unanswered: should the parameter “playing strength update factor” also be individually designed or allowed? This raises two problems: First, it seems relatively obvious that there are different characters of players. There is the so-called “solid” player who shies away from risk anyway and also plays very consistently. That’s just the way it is. And there is the risk-taker, who is also often enough exposed to large fluctuations in playing strength. On the one hand, this can be due to the risks he takes, which then occasionally “backfire”, but also due to the basic character itself, which carries him along in a winning streak and makes him keep winning, but unfortunately also in a losing streak.
But one would have the problem of acceptance in particular. Imagine that two players of the same level beat the same player one after the other in a tournament. And one would gain more than the other. “Yes, that’s because you play too consistently. You need to build bigger fluctuations into your results.” A somewhat weak rationale. However, to reassure: After all, it would have to happen anyway due to the consideration of age and number of games. Any change in playing strength on a result would be individual.
Of course, I also have suggestions to make for the calculation of the draw probability. Just this much in advance: this kind of prophecy would be pure gimmickry. It is not decisive for the feasibility of the system. But I have to mention it here in order to do justice to the claim of the “suitability of Pauli as a prediction system”.
Draw frequencies obviously depend on both the character and the level of a person’s playing strength. Generally speaking, the weaker the player, the fewer draws occur; the stronger the player, the more draws. But here, too, there are individual differences. If so, one would have to carry this parameter individually as well.
Of course, all these parameters would have to be maintained and serviced. So a player who has been risk-loving up to now and suddenly (due to age?) becomes solid would have to experience an increase in his individual draw factor. Likewise, a player who has played rather consistently up to now and suddenly allows greater fluctuations to occur would also be “rewarded” there individually with a greater factor for the reaction.
Likewise, the general parameters must be maintained and serviced. So the average draw value, for example, can continue to rise or fall again.
A few notes of clarification:
The example I referred to in the passage was calculated for two players of playing strengths 82% and 64%. So you calculate the win/loss ratio for player 1 as 82%/18% = 4.56. So he wins 4.56 times as often as he loses. Player 2 has the ratio 64%/36% = 1.78.
The ratio of player 1 to player 2 is, as it is pronounced, a ratio. Ratios are, mathematically speaking, quotients (i.e. fractions; but don’t break now, and especially not because of this, with your partner, please!). So we divide 4.56/1.78 and get the winning ratio of player 1 to player 2 as a 2.56. Player 1 wins against player 2 2.56 times as often as player 2 wins against player 1. We have to calculate that back into a percentage, so we divide 2.56 by (2.56+1), that is 2.56/3.56 and get the expected point yield for player 1 in this game. That is 71.93%. I only say “point yield” here because in chess exactly 1 point is awarded per game. So 1 in total, just as probabilities for an event must add up to 1. And the points yield is made up of a certain proportion of draws and another from victories. How large these are in each case is an as yet unanswered question.
As a check: 71.93% / 28.07% = 2.56. So that is also true. And 71.93% + 28.07% = 100%. Player 1 wins against player 2 2.56 times as often as he loses.
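The calculation above (win/loss ratio per player, quotient of the two ratios, conversion back into a percentage) can be sketched as a small Python function. The names `win_loss_ratio` and `expected_score` are hypothetical and chosen only for illustration; the arithmetic follows the steps in the text.

```python
def win_loss_ratio(strength):
    """Win/loss ratio of a player, e.g. 0.82 / 0.18 = 4.56 for a strength of 82%."""
    return strength / (1.0 - strength)


def expected_score(strength_1, strength_2):
    """Expected point yield of player 1 against player 2 (between 0 and 1)."""
    # Ratio of the two win/loss ratios: how much more often player 1 wins than loses.
    ratio = win_loss_ratio(strength_1) / win_loss_ratio(strength_2)
    # Convert the ratio back into a share of the single point at stake.
    return ratio / (ratio + 1.0)


expectation = expected_score(0.82, 0.64)
print(f"player 1: {expectation:.2%}, player 2: {1.0 - expectation:.2%}")
```

For the strengths 82% and 64% this gives the 71.93% and 28.07% used in the example. Two equally strong players get 50% each, whatever their common strength.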
A few problems raised in the text passage did not even arise anymore. For example, it had already been clarified that in the Pauli system as well as in the Elo system, individual reactions have to be made, and in the case of Elo they even are.
In football, I did not individualise the system that finds the optimal adjustment for results. It is a team sport. Slightly different laws apply there. But still, people often talk about the “moody diva” at Eintracht Frankfurt. So does that exist there as well?
iii. The inflation of numbers
The inflation of numbers no longer exists. In fact, my system even offers the possibility of making games and sports comparable with each other. In backgammon, for example, the top player in the world would have a playing strength of 65%. He doesn’t manage any more, backgammon is and remains a game of chance (with a fair amount of skill thrown in). You can see from the numbers what character the game has.
In tennis, there might be someone who gets to 92%, maybe even higher later, or the world’s top players get back together (in my tennis database, I think Sampras was once at 92%; later, only Federer was around that). But the highest numbers express something. Not only a win/loss ratio (which would also be a lot; unlike Elo) but also a game character.
In chess, due to the rules of the game (chess is a draw game!), one could not get much higher than 85% or so. Even the computers that are now conquering the world’s best would not get any higher. There would be too many draws. Again: the level of play reflects the character of the game.
One thing is certain: no more inflation. Desirable?
iv. Black and white issue
Most of this issue is already covered in the inserted text passage. To summarise again: There is a White advantage, which must be determined anyway for the sake of good prognoses. There is a general and an individual White advantage. A good white player would at some point have a higher factor than the average, the successful black player a lower one (thus a higher black value, which would be the equivalent of the white advantage).
The individual care and maintenance of the parameters would be a certain administrative effort. In addition, the traceability would suffer somewhat. One would no longer have to maintain one value for each participant but three or four values (playing strength, white advantage, adjustment factor, draw factor). In order to calculate what all players/participants always want to calculate for themselves, they would have to have all these values at hand, even if only as estimates.
I readily admit that in the explanations here I waver a little between applying the Pauli system to chess and to all games/sports, but almost all the points mentioned can be transferred 1:1 to another game/sport. For example, the black-white problem is identical to the home advantage in football. And should this not exist in a game/sport, then all parameters in this category would be 1 and would not change anything.
v. Draw problem
The draw problem is already explained above in the inserted text passage. The draw factor depends both generally on the strength of the game and individually on the character of the player. However, the possibility of predicting a draw probability is irrelevant for the rating of all participants.
This problem also exists in a certain relationship in other sports/games. In football, for example.
i. The normal distribution
The normal distribution is hereby abolished. At least with the introduction of the Pauli system. I have plotted here the curve for a player of quality 65% against all others:
Pretty and aesthetic, isn’t it? The blue line indicates the point expectation. The purple line is the loss point expectation. The two added together give 1. The opponent strength varies along the x-axis from 1% – 99%.
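The data behind such a curve can be regenerated with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and the point expectation is computed from the quotient of the two win/loss ratios, as in the 82% versus 64% example earlier in the chapter. It tabulates the point expectation and the loss point expectation of a 65% player against opponents from 1% to 99%.

```python
def expected_score(strength_1, strength_2):
    """Expected point yield of player 1, from the quotient of the win/loss ratios."""
    odds_1 = strength_1 / (1.0 - strength_1)
    odds_2 = strength_2 / (1.0 - strength_2)
    ratio = odds_1 / odds_2
    return ratio / (ratio + 1.0)


def expectation_curve(strength, step_percent=1):
    """List of (opponent strength, point expectation, loss point expectation)."""
    curve = []
    for percent in range(1, 100, step_percent):
        opponent = percent / 100.0
        points = expected_score(strength, opponent)
        # Both expectations always add up to the single point at stake.
        curve.append((opponent, points, 1.0 - points))
    return curve


for opponent, points, lost in expectation_curve(0.65, step_percent=14):
    print(f"opponent {opponent:.0%}: expectation {points:.3f}, loss expectation {lost:.3f}")
```

The two columns cross at 0.5 exactly where the opponent also has 65%, and both stay strictly between 0 and 1 at the ends of the range.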
The difference to the Elo diagram: In the Pauli system there is a natural limit to the numbers. This is the playing strength 1 (100%). More than winning all games is not possible. Then there is a limit internal to each game. In chess, this would be the expected yield of a player who always makes the best move. Even the perfect computer would never win all its games against the currently best player in the world. It would definitely not lose any, but it would certainly not win them all. That is internal to the game. Chess is a draw game; that much is widely recognised.
The diagram does not reflect that, but it expresses the difference to the Elo system. With Elo, there is no such limit. The normal distribution does not allow for that. So in theory there are arbitrarily strong and arbitrarily weak (yes, the normal distribution is also not limited downwards) players. Which of the two assumptions, Elo or Pauli, is more realistic, I leave to you.
ii. Again, the speed of adjustment
Somehow I got the impression that there is still a need for clarification regarding adjustment and adjustment speed, so back to this point:
There has to be a reaction to results. That was immediately clear. There must be movements within the rankings. So the question remains how much to react. I have tried to express this in words. I probably didn’t succeed, like many other times before. But that’s why, fortunately, there are the practical examples. Not only do I hope to bring it even closer to you, it has brought me some clarity myself.
So here is the practical example: First, I had two players play against each other. These two have their playing strengths before the first game. The playing strength is adjusted by a certain factor based on the result in the first game. Then the second game is played. The forecast for the second game is based on the changed playing strength due to the result of game 1. Now comes the result of game 2. The procedure continues. A total of 10 games are played. In each game there is a deviation between forecast and result. This deviation is added up in absolute terms (i.e. the amounts; negative numbers become positive) and gives the total error. The total error measures the quality of the prediction. As long as you don’t have a comparison, you only have an error sum. Now the comparison is created. An alternative factor is used for adjustment. Here, too, there is a total error that is not identical to the first total error. One of the two values was better. Now you just look for the best one by trying other factors. So what we are looking for is the minimum error of the forecast that was determined with the help of a certain factor. Here are the results:
|Game||1||2||3||4||5||6||7||8||9||10|
|Deviation from the point of view of S1||-0.66||-0.52||-0.41||-0.33||0.235||0.191||0.154||0.623||0.492||0.388|
|Total expectation so far S1||0.656||1.176||1.586||1.913||2.178||2.487||2.834||3.211||3.719||4.331|
|Achieved so far S1||0||0||0||0||0.5||1||1.5||2.5||3.5||4.5|
|Current total deviation S1||-0.66||-1.18||-1.59||-1.91||-1.68||-1.49||-1.33||-0.71||-0.22||0.169|
|S1 current gain/loss||-0.07||-0.05||-0.04||-0.03||0.023||0.019||0.015||0.062||0.049||0.039|
|S2 current gain/loss||0.066||0.052||0.041||0.033||-0.02||-0.02||-0.02||-0.06||-0.05||-0.04|
That’s how complex it seems to get right away, even with a very simple example. But I am happy to explain: S1before and S2before are the playing strengths of the two players before the following game. They already include the change caused by the result of the previous game, which can be found in the line “Result”. The update factor, here 10, regulates how strongly the result is reacted to. The 10 used here causes a fast reaction. The deviation from the point of view of S1 is the difference between the result in the game and the point expectation. This deviation is divided by the update factor and added to the playing strength of S1 (and subtracted from that of S2). The entry S1after reflects the new playing strength after the change. This is transferred to the line S1before, but there in the next column. The changed playing strength is therefore the basis for the forecast of the following game. The line “achieved so far” only adds up the points from the line “result”. The line “total deviation” adds up the errors of each individual forecast as absolute amounts. The current total deviation shows how far player 1 is behind his expectation (in the more favourable case also how far he has exceeded it).
Now player 1, who started as the stronger one, has lost the first 4 games. This had an extremely negative effect on his playing strength. But then, from the 5th game onwards, he started to score. First with 3 draws, then finally with 3 wins. That is, of course, a somewhat unusual course of events. But practically possible in any case. Now, up there were the results for the update factor 10. The total deviation forecast-result here was 3.995.
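The ten-game procedure can be replayed with a short Python sketch. Several things here are assumptions and not taken from the text: the function names are hypothetical, the forecast uses the quotient of the win/loss ratios as in the 82% versus 64% example, and the starting strengths of 70% and 55% are illustrative values chosen because they give a first-game expectation of about 0.656, as in the table. Only the sequence of results (four losses, three draws, three wins for player 1) and the update factor 10 come from the example.

```python
def expected_score(strength_1, strength_2):
    """Expected point yield of player 1, from the quotient of the win/loss ratios."""
    odds_1 = strength_1 / (1.0 - strength_1)
    odds_2 = strength_2 / (1.0 - strength_2)
    ratio = odds_1 / odds_2
    return ratio / (ratio + 1.0)


def total_forecast_error(strength_1, strength_2, results_1, update_factor):
    """Replay the games in order and sum the absolute forecast deviations."""
    total_error = 0.0
    for result_1 in results_1:
        # The forecast uses the strengths as they stand BEFORE this game.
        deviation = result_1 - expected_score(strength_1, strength_2)
        total_error += abs(deviation)
        # React to the result: player 1 gains what player 2 loses.
        # Note: a very small factor could push a strength outside (0, 1);
        # this sketch does not guard against that.
        change = deviation / update_factor
        strength_1 += change
        strength_2 -= change
    return total_error


RESULTS_S1 = [0, 0, 0, 0, 0.5, 0.5, 0.5, 1, 1, 1]  # results from player 1's view
START_S1, START_S2 = 0.70, 0.55                     # assumed starting strengths

print(f"total error: {total_forecast_error(START_S1, START_S2, RESULTS_S1, 10):.3f}")
```

Under these assumptions the sum of absolute deviations for update factor 10 comes out close to the 3.995 quoted above.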
Now I have shown the results for the update factors 30 and 25. See and be amazed, if possible:
Update factor 30:
|Game||1||2||3||4||5||6||7||8||9||10|
|Deviation from the point of view of S1||-0.66||-0.61||-0.57||-0.53||0.008||0.008||0.007||0.507||0.471||0.437|
|Total expectation so far S1||0.656||1.268||1.838||2.367||2.859||3.351||3.844||4.337||4.866||5.429|
|Achieved so far S1||0||0||0||0||0.5||1||1.5||2.5||3.5||4.5|
|Current total deviation S1||-0.66||-1.27||-1.84||-2.37||-2.36||-2.35||-2.34||-1.84||-1.37||-0.93|
|S1 current gain/loss||-0.02||-0.02||-0.02||-0.02||0.0003||0.0003||0.0002||0.017||0.016||0.015|
|S2 current gain/loss||0.022||0.02||0.019||0.018||-0||-0||-0||-0.02||-0.02||-0.01|
Update factor 25:
|Game||1||2||3||4||5||6||7||8||9||10|
|Deviation from the point of view of S1||-0.66||-0.6||-0.55||-0.51||0.038||0.034||0.032||0.529||0.484||0.443|
|Total expectation so far S1||0.656||1.259||1.812||2.317||2.779||3.245||3.713||4.185||4.701||5.258|
|Achieved so far S1||0||0||0||0||0.5||1||1.5||2.5||3.5||4.5|
|Current total deviation S1||-0.66||-1.26||-1.81||-2.32||-2.28||-2.24||-2.21||-1.68||-1.2||-0.76|
|S1 current gain/loss||-0.03||-0.02||-0.02||-0.02||0.002||0.001||0.001||0.021||0.019||0.018|
|S2 current gain/loss||0.026||0.024||0.022||0.02||-0||-0||-0||-0.02||-0.02||-0.02|
The sequence of results is the same. I am only looking for the answer to the question of how best to respond to the results. The calculations in the second list are done in exactly the same way. But the adjustment speed of 1/30 produced a lower total deviation: only 3.805, which is less than 3.995.
In the following diagram I have shown the deviations for the different update factors. I first varied in steps of 5, from 10 to 55. That was sufficient for illustration:
There is a somewhat curious movement here in the sense that the error initially increases when you go from 10 to 15. I don’t have a direct explanation for such a thing. But in any case there is a clearly recognisable minimum. This is (as it also happens in football) 30. But this is only the random value for this small example. Now you just have to imagine that you apply the procedure to a database in which thousands of results are available chronologically. One reacts to the results with a given value and adds up the total deviation between forecast and result per run. The lowest total deviation gives the best value.
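The search for the best value can be sketched as a plain trial of candidate factors in Python. As before, this is a sketch under stated assumptions and not a definitive implementation: the function names are hypothetical, the forecast uses the quotient of the win/loss ratios, and the starting strengths of 70% and 55% are assumed values that are not given in the text. The candidates 10 to 55 in steps of 5 and the sequence of results are those of the small example.

```python
def expected_score(strength_1, strength_2):
    """Expected point yield of player 1, from the quotient of the win/loss ratios."""
    ratio = (strength_1 / (1.0 - strength_1)) / (strength_2 / (1.0 - strength_2))
    return ratio / (ratio + 1.0)


def total_forecast_error(strength_1, strength_2, results_1, update_factor):
    """Replay chronologically ordered results and sum the absolute deviations."""
    total_error = 0.0
    for result_1 in results_1:
        deviation = result_1 - expected_score(strength_1, strength_2)
        total_error += abs(deviation)
        change = deviation / update_factor
        strength_1 += change
        strength_2 -= change
    return total_error


def best_update_factor(strength_1, strength_2, results_1, candidates):
    """Try every candidate factor and return the one with the smallest total error."""
    errors = {
        factor: total_forecast_error(strength_1, strength_2, results_1, factor)
        for factor in candidates
    }
    return min(errors, key=errors.get), errors


RESULTS_S1 = [0, 0, 0, 0, 0.5, 0.5, 0.5, 1, 1, 1]  # results from player 1's view
best, errors = best_update_factor(0.70, 0.55, RESULTS_S1, range(10, 60, 5))
for factor, error in errors.items():
    print(f"factor {factor}: total error {error:.3f}")
print(f"best factor among the candidates: {best}")
```

For a real database the same loop would run over thousands of chronologically ordered results and many players at once; the brute-force trial of candidates stays the same, only the bookkeeping of the playing strengths grows.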