I've read the discussions in this thread and numerous threads in the cheating forum on chess.com. I find it quite amazing that most defenders of the top-3 method take so little notice of what the expert on this matter, Kenneth Regan, has stated on his website (http://www.cse.buffalo.edu/~regan/chess/fidelity/) about matchup methods:
"The main statistical principle which these pages show has been misunderstood by the chess world is that a move that is given a clear standout evaluation by a program is much more likely to be found by a strong human player. And a match to any engine on such a move is much less statistically significant than one on a move given slight but sure preference over many close alternatives."
He himself gives an example of how to cope with this difficulty in his analysis of the Mamedyarov-Kurnosov incident. Although his method is not fully explained, I think it goes something like this:
- first you need a large set of positions with evaluations (say, for the 10 best moves in every position) and the move that was actually played;
- you have to take care that the set consists of positions from games played under the same conditions as the positions you want to examine for cheating; Regan's figures are therefore not applicable to correspondence chess, since he used 10,000 positions from OTB games by strong grandmasters;
- from this data you can estimate the a priori probability p that a player plays the move with the highest evaluation in a position where there is a difference delta between the best and second-best evaluations;
- to test a game, you sum these a priori probabilities over all the moves you want to take into consideration. That gives a model score for the game. You could view it as the average score the players in the data set would have achieved given the same positions, although even that is a disputable statement;
- compare this model score with the actual score in the game (see the sketch right after this list).
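To make the bookkeeping concrete, here is a minimal Python sketch of that procedure. The delta-to-p table, the fallback value of 0.3 for very small deltas, and the function names are my own illustrative placeholders; Regan's actual procedure is not fully published, so this only mirrors the outline above.

    # Minimal sketch of the model described above. The delta -> p table and
    # the fallback value are illustrative placeholders, not calibrated data.
    P_BY_DELTA = {0.1: 0.39, 0.2: 0.58, 0.3: 0.65, 0.4: 0.70}

    def p_for_delta(delta):
        """Look up p for the largest tabulated delta not exceeding this one."""
        keys = sorted(k for k in P_BY_DELTA if k <= delta)
        return P_BY_DELTA[keys[-1]] if keys else 0.3  # guess for tiny gaps

    def model_score(deltas, matches):
        """Compare the actual number of engine matches with the model.

        deltas  -- evaluation gap (best minus second best) for each move
        matches -- how many of those moves matched the engine's first choice
        """
        ps = [p_for_delta(d) for d in deltas]
        expected = sum(ps)                        # model score for the game
        sd = sum(p * (1 - p) for p in ps) ** 0.5  # Bernoulli noise per move
        return expected, sd, (matches - expected) / sd

The last value returned is the number of standard deviations by which the actual match count exceeds the model score, which is exactly the comparison in the final step above.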
OK, so far nothing really different from the top-3 matchup method promoted by Zygalski and others. But the big difference is that you can now calculate variances and confidence intervals.
I do not have a good data set to use, but in one of the chess.com threads there were some interesting figures (http://www.chess.com/groups/forumview/titled-player-banned, message #291) from which you can derive estimates for the a priori probabilities (the source was the 8th ICCF World Championship). I did it quite quickly, but this is what I got out of it:
delta   positions   No. 1 played       p     var
 0.1       1202          442       0.388    0.14
 0.2        291          169       0.581    0.34
 0.3        169          110       0.651    0.43
 0.4         83           58       0.699    0.49
 0.6         93           77       0.828    0.69
 0.8         63           46       0.730    0.54
 1.0         46           42       0.913    0.85
So in a position where the engine gives a difference of 0.4 between the best and second-best move, you would expect the best move to be played in 69.9% of the cases. The last column is an indication of how well the p-values are determined by this sample.
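To put numbers on that precision, one could attach a standard binomial confidence interval to each estimate. This is my own addition, not something from the original thread; the sketch below applies the Wilson score interval to the counts from the table:

    from math import sqrt

    def wilson_interval(hits, n, z=1.96):
        """95% Wilson score interval for a binomial proportion."""
        p = hits / n
        denom = 1 + z * z / n
        centre = (p + z * z / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        return centre - half, centre + half

    # (delta, positions, times the No. 1 move was played), from the table
    rows = [(0.1, 1202, 442), (0.2, 291, 169), (0.3, 169, 110),
            (0.4, 83, 58), (0.6, 93, 77), (0.8, 63, 46), (1.0, 46, 42)]
    for delta, n, hits in rows:
        lo, hi = wilson_interval(hits, n)
        print(f"delta={delta:.1f}  p={hits / n:.3f}  95% CI [{lo:.3f}, {hi:.3f}]")

Note how wide the intervals get for the sparsely populated rows (delta 0.4 and up); that sampling uncertainty is exactly the problem discussed below.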
Now I did an analysis of the game mentioned earlier (message #68), kingboy-dembo, moves 8-34, with Houdini in MultiPV mode with 5 lines at depth 17, using Arena (forward analysis, hash cleared at the start, and all those issues).
Results from the game: Dembo played the engine's number 1 choice 23 times out of 27, which is 85%. The model with the a priori probabilities gives 16 hits on average, but the standard deviation is about 4! So the score of 23 out of 27 is high, roughly (23 - 16)/4 = 1.75 standard deviations above the expectation, but within margins you might expect.
And I have to say that with better model values I would expect the average to be higher. For instance, where the delta was 0.23, I took as p the value for delta = 0.2 from the table above. And as there is no value for delta = 0, I simply assumed p = 0.3 there (with variance 0); that is just a guess.
A bigger sample of positions is of course needed to draw conclusions about cheating. What chance of false positives would you consider acceptable before blaming someone for cheating? 1 in a million? (Under a normal approximation that is roughly a 4.75-sigma threshold.)
My main point is that the variances are far bigger than many might think, and that you will produce a lot of false positives if you don't use a very solid model (and probably a lot of games are needed for a check). The variance comes from two sources: the p-values themselves are derived by sampling, and using the p-values in the model (even if they were known exactly) adds variance as well.
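A quick way to see both sources at once is to simulate: draw the p-values from a Beta posterior (capturing the sampling error in the table) once per trial, and then draw the moves themselves as coin flips. The counts below come from the table above, but the delta buckets for the hypothetical game are made up:

    import random

    # counts from the table: delta -> (positions, times No. 1 was played)
    ROWS = {0.1: (1202, 442), 0.2: (291, 169), 0.3: (169, 110), 0.4: (83, 58)}

    def simulate(deltas, trials=100_000):
        scores = []
        for _ in range(trials):
            # one p per delta bucket per trial, so the estimation error
            # is shared by all moves in that bucket
            p = {d: random.betavariate(h + 1, n - h + 1)
                 for d, (n, h) in ROWS.items()}
            scores.append(sum(random.random() < p[d] for d in deltas))
        mean = sum(scores) / trials
        sd = (sum((s - mean) ** 2 for s in scores) / trials) ** 0.5
        return mean, sd

    # e.g. a 27-move game:
    # mean, sd = simulate([0.1]*15 + [0.2]*6 + [0.3]*4 + [0.4]*2)

Dropping the Beta draw (using the point estimates h/n directly) isolates the second variance source, which makes it easy to see how much each of the two contributes.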
There are issues with this method, which can be discussed. But to me it looks far better than the simple matchup method most 'cheathunters' use.