I finally took a look at Bicknell's video. At 12:31 he crosses off the "Blame the NNUE Evaluation Function" hypothesis, but his data don't support that conclusion. He cites as evidence Stockfish 7 (12:18 - 12:30), which doesn't use NNUE, and Stockfish 17 cluster version (12:07 - 12:18), which only has a single data point.
But it's probably not NNUE. It's probably the supervisor for the multi-threading and/or some other hardware bottleneck. The cluster has special hardware and instructions for the supervisor. Stockfish compiled for the cluster no doubt uses those, whereas the non-cluster version is by necessity running different supervisor code. That makes the supervisor a good place to look for any flaky performance.
What jumps out at me, besides the "lack of trend" in the Stockfish 17 threading results, are these two points:
- For all engines, including the old ones, there is a performance hit going from one thread to two; only after that do the benefits of more threads start to kick in. This clearly shows that a supervisor has started at two threads. Some supervisors use a dedicated thread, in which case asking for one thread uses one, while asking for two threads uses three. I don't know how Stockfish handles it.
- In the Stockfish 17 results (for example at 5:19) there is a huge degradation going from 32 threads to 64. I assume 64 is all the cores on the test machine, in which case it might be interesting to see the result for 63 cores.
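To make the thread arithmetic in those two points concrete, here is a minimal sketch. It is not how Stockfish does it (I don't know that); it just models the "dedicated supervisor thread" case described above. The function names and the 64-core figure are my own assumptions for illustration.

```python
def os_threads_used(requested: int, dedicated_supervisor: bool) -> int:
    """Total OS threads needed for a requested number of search threads.

    Assumption (hypothetical model, not Stockfish's actual design):
    a supervisor only exists when there is more than one search
    thread to coordinate, and it occupies its own thread.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    if dedicated_supervisor and requested > 1:
        return requested + 1
    return requested


def oversubscribed(requested: int, cores: int, dedicated_supervisor: bool) -> bool:
    """True if the engine would need more OS threads than there are cores."""
    return os_threads_used(requested, dedicated_supervisor) > cores


if __name__ == "__main__":
    CORES = 64  # assumed core count of the test machine
    for n in (1, 2, 32, 63, 64):
        used = os_threads_used(n, dedicated_supervisor=True)
        print(n, used, oversubscribed(n, CORES, dedicated_supervisor=True))
```

Under this model, asking for 64 threads on a 64-core machine needs 65 OS threads, so something has to share a core, while asking for 63 fits exactly. That is the reason a 63-thread run would be an informative data point.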
I'm no expert, but I have read some chess engine programmer blogs and change logs. Multi-threaded code is a nightmare and there are many ways to get it wrong. Even when you get it right, it doesn't always improve performance the way you hoped.
One last hurdle is getting the other settings to mesh well with the number of threads. Cache is particularly tricky to get right, both in code and in settings.
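As a rough sketch of keeping the settings under control during a thread sweep, the snippet below builds the UCI commands for each run with the hash table size held fixed, so that only the thread count varies. `Threads` and `Hash` are standard UCI options that Stockfish exposes. Treating "cache" as the engine's hash table is my assumption, and the 1024 MB value and the function names are arbitrary choices for illustration.

```python
def uci_setup(threads: int, hash_mb: int) -> list[str]:
    """UCI commands to configure one benchmark run."""
    if threads < 1 or hash_mb < 1:
        raise ValueError("threads and hash_mb must be at least 1")
    return [
        f"setoption name Threads value {threads}",
        f"setoption name Hash value {hash_mb}",  # hash table size in MB
        "isready",  # wait for the engine to finish applying the options
    ]


def sweep(thread_counts: list[int], hash_mb: int) -> dict[int, list[str]]:
    """One command list per thread count, with the hash size held constant."""
    return {n: uci_setup(n, hash_mb) for n in thread_counts}


if __name__ == "__main__":
    # 63 is included to check the "one core left free" case.
    for n, commands in sweep([1, 2, 32, 63, 64], hash_mb=1024).items():
        print(f"# run with {n} thread(s)")
        print("\n".join(commands))
```

Holding the hash size constant is only one reasonable choice. Scaling it with the thread count is another, but then two settings change at once and the results are harder to read.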