Zygalski wrote on 09/22/10 at 18:27:19:
I was one of the 4 analysts who fairly recently looked at 20 of WGM Yelena Dembo's chess.com games.
These games were objectively chosen (insofar as is possible) in that they were the then most recently completed games vs 2200+ rated chess.com opponents. All the games had 35 or more moves.
I found only 18 games that fulfilled these criteria, so I then selected the 2 most recently completed games against near-2200-rated opponents which also had 35+ moves.
I would like to point out that there seems to be a commonly held misconception that chess.com closed Yelena's account purely on the back of top 3 or top 4 engine match-up results. This simply is not the case - the t3/t4 analysis was used by a group of players simply to suggest members who were possible engine users.
As a result of engine match-up analysis of the 20 games by 4 separate analysts using 4 different systems & 3 different engines, full ply-by-ply analysis of the games was forwarded to site staff for them to peruse.
Chess.com staff have refused to reveal what methodology they use, but I would assume that if you're looking for possible engine users, you would probably look at how frequently a player's moves match an engine's once the game in question goes outside a multi-million-game database.
Either that, or you'd use average error/blundercheck analysis, for which I personally have no data.
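To make the counting idea concrete, here's a rough sketch of how a top-N match-up tally could be produced once the database moves are stripped out. This is only my own illustration of the general approach, not chess.com's method and not the exact scripts any of us used; the data structures are hypothetical.

# Rough illustration of top-N "match-up" counting. Assumes we already have,
# for every non-database position, the move actually played and the engine's
# candidate moves ranked best-first (e.g. from 30s-per-ply infinite analysis).

def match_up(moves, top_n=3):
    """moves: list of (played_move, engine_candidates) pairs.
    Returns (matches, total): how often the played move appears
    among the engine's top_n candidates, and the number of moves."""
    matches = sum(1 for played, candidates in moves if played in candidates[:top_n])
    return matches, len(moves)

# Tiny made-up example: three non-database moves, two in the engine's top 3.
sample = [
    ("Nf3",  ["Nf3", "d4", "c4"]),
    ("h4",   ["Re1", "Qd2", "Bf4"]),
    ("Rxe8", ["Rxe8", "Qf3", "g3"]),
]
m, n = match_up(sample, top_n=3)
print(f"Top 3 Match: {m}/{n} ({100 * m / n:.1f}%)")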
You would then need a set of control data: benchmarks of how the highest-quality players can perform with regard to engine match-up, using both pre-computer-era correspondence (CC) World Championship finalists at long time controls & OTB super-GMs, where you are pretty certain no powerful engines were used. Logically, you would then look for any consistency within the benchmarks relating to upper-end thresholds.
Either that consistency seems to exist or it doesn't.
Let me tell you now that using a 3000-3200+ Elo rated engine on a reasonably powerful PC with 30 seconds per ply of infinite analysis, there does indeed seem to be a consistency at the extreme upper end for the unassisted players.
The benchmarks I tested strongly suggest an extreme upper limit of engine-like play, over 15-20+ games with a minimum of around 500 non-database moves, of:
Top 1 Match: 60%
Top 2 Match: 75%
Top 3 Match: 85% (all 3 figures =)
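In case it helps to see what I mean by "extreme upper limit": in this framing it is simply the highest match-up rate observed in any of the unassisted benchmark sets analysed so far, nothing cleverer than that. A tiny sketch, with invented benchmark figures purely for illustration (not my actual benchmark data):

# Illustrative only: top-3 match-up rates (as fractions of non-database moves)
# for unassisted benchmark sets, e.g. pre-computer CC finalists and OTB
# super-GMs. These numbers are invented for the example.
benchmark_top3_rates = [0.79, 0.83, 0.85, 0.81, 0.77]

# The "extreme upper limit" is just the highest rate seen across the benchmarks.
print(f"Upper limit, Top 3: {max(benchmark_top3_rates):.0%}")   # -> 85%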
I won't list my personal benchmark tests here, but they have been posted on chess.com.
Other analysts have posted their own, and they are also consistent with my findings.
I was taught top 3 match-up methodology by a FIDE 2300-rated player, who has been a games mod on a site other than chess.com for several years.
He also said that my benchmarks (which were independently generated) were "exactly what I'd expect you to find".
Now for the controversial bit!
As in everything in life, you have to make compromises.
Ideally, you'd analyse every single pre-computer CC match (with a rolled-back, 400+ non-database move sample size) for every single player who's ever played the game. You'd ideally also do the same for every OTB super-GM.
Ideally, you'd use every single strong engine available to analyse these many thousands of games, presumably analysing at a very slow pace.
You'd also like to have many different systems, because some could possibly give quite different results to others.
So, you spend the next 50 or so years creating the benchmarks...
I'm sure you can see that you need to balance practicality with reliability. There will be some compromise!
The next controversial point is that many analysts use a +5% buffer on each of the t3 upper-end threshold stats.
So, now you have
Top 1 Match: 65%
Top 2 Match: 80%
Top 3 Match: 90% (all 3 figures =)
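So in practice the screening rule amounts to something like the check below. Again, this is only a sketch of the idea as I understand it; individual analysts may apply the figures differently. Passing this check is only what puts a player on the "possible engine user" list mentioned above; it isn't a verdict.

# Buffered thresholds: benchmark upper limits plus the 5% margin.
THRESHOLDS = {1: 0.65, 2: 0.80, 3: 0.90}   # Top 1, Top 2, Top 3

def exceeds_all_thresholds(counts, total):
    """counts: dict mapping N (1, 2, 3) to the number of top-N matches
    over `total` non-database moves. True only if every buffered
    threshold is exceeded simultaneously."""
    return all(counts[n] / total > THRESHOLDS[n] for n in THRESHOLDS)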
Does this mean that anyone who exceeds these figures, in games which fulfil the selection criteria, is 100% guilty of engine use?
No - of course not!
What it means is that in the non-database moves of many objectively chosen games over time, the player has out-performed all the benchmarks tested so far by quite some margin.
Often the match up results returned are significantly higher than these new (admittedly seemingly rather arbitrary) thresholds.
This can never be an exact science & as I say, to remain at all practical you must make at least some compromises.
I was the analyst who returned the following stats for 20 of Dembo's games' non-database moves:
Deep Rybka 3 x64, Hash: 256, Time: 30s, Depth: 12-20 ply
AMD Phenom X4 2.30GHz, 4GB DDR2 RAM
YelenaDembo (Games: 20)
Top 1 Match: 530/723 ( 73.3% )
Top 2 Match: 638/723 ( 88.2% )
Top 3 Match: 676/723 ( 93.5% )
Top 4 Match: 698/723 ( 96.5% )
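For anyone checking the arithmetic, plugging those raw counts into the buffered 65/80/90 thresholds looks like this (just a sanity check on the figures above, using the same sketch style as before):

# Dembo sample as reported above: matches out of 723 non-database moves,
# checked against the buffered 65/80/90 thresholds.
total = 723
counts = {1: 530, 2: 638, 3: 676}
thresholds = {1: 0.65, 2: 0.80, 3: 0.90}
for n in (1, 2, 3):
    rate = counts[n] / total
    print(f"Top {n} Match: {counts[n]}/{total} ({rate:.1%}), "
          f"threshold {thresholds[n]:.0%}, exceeded: {rate > thresholds[n]}")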
You can draw your own conclusions as to why Yelena's match-up rate was so high when the average opponent was rated around 2500 or so on chess.com.
I've been doing this analysis for about 3 years now, both creating benchmarks & analysing/submitting evidence on suspects.
Many suspects I've analysed & submitted evidence on have had lower match-up rates than those I found for WGM Dembo, yet this has still resulted in the players in question being removed from the site(s).
You can never say that WGM Dembo cheated in those games, simply that she played incredibly error-free, engine-like chess.
WGM Dembo also had a chess.com record of 155 games played, 140 won, 15 drawn and 0 lost
against, no doubt, some suspected engine users, unless you believe that the top 0.1% highest rated on chess.com are all there as a result of not using engines!
Another thing that's been bandied around as a means to discredit this approach is why 20 of Dembo's recent OTB games which met the selection criteria were analysed. Well, we just wanted to check that Yelena didn't have a particularly engine-like game when unassisted. That was all. Her results were far lower than the top-end unassisted benchmarks, so we felt we had at least some basis to rule out the possibility that Dembo has a particularly engine-like unassisted style.
If she was at the very top of the 60/75/85 benchmarks mentioned earlier when playing OTB, it would be rational to expect her to play even more top engine moves when playing at long time controls in online chess. This would discredit the benchmark thresholds.
Finally, I do hope you realise I'm not frothing at the mouth, hoping all users I analyse are cheats! It's a nice feeling when results come back well below the thresholds; it makes you realise there are some good folk out there at or near the top of their game.
That really is a load of numerical mumbo-jumbo. Do you, or does anyone else at chess.com, understand that this is a statistical problem? You don't seem to.
What is the data? What is the model? What are the statistics? What are the tests?
I keep asking these questions, and I receive no answer.
It is simply the work of ignorami to point to a set of arbitrary numbers and say, "There, you see?" That's all I have seen so far, and it stinks.
You simply must understand that this is a statistical problem, or admit that you are a fool.
You don't have an objective, replicable method, do you get that? All you have is a mishmash of verbiage, boiled potatoes and overcooked broccoli, topped off with some numerical dressing, and concluding with "There, you see?"
Just for example, what is the exact provenance of the percentages that you allege are the "extreme upper limit of engine-like play in 15-20 games"? What does "extreme upper limit" mean? What percentile of a distribution is that? What statistic is that? What the hell is that, other than mumbo-jumbo?
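Even the crudest attempt at an actual test would look something like the toy sketch below: state a model, state the benchmark probability, compute a tail probability. I am not proposing this as a serious model - the independence assumption is heroic and the benchmark figure of p = 0.80 is simply invented - but at least it would be a statistic, which is more than anything presented so far.

# Toy sketch of one possible statistical framing. Assume (heroically) that
# each non-database move of an unassisted strong player independently matches
# the engine's top 3 with probability p; p = 0.80 below is invented purely
# for illustration, not an established benchmark.
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# How surprising would 676 top-3 matches in 723 moves be under that assumption?
print(binomial_tail(676, 723, 0.80))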
Let me ask you this also: do you agree that in Dembo's game with kingboy, 34...f5 is a notably stronger move than Dembo's 34...Nf2? I mean, notably? So how does that square with the account that Dembo is cheating? She cheats, but sometimes she forgets to cheat?