sourcepluck 4 days ago

> Since gpt-3.5-turbo-instruct has been measured at around 1800 Elo

Where's the source for this? What's the reasoning? I don't see it. I've just looked again, and still can't see it.

Is it 1800 lichess "Elo", or 1800 FIDE, that's being claimed? And 1800 at what time control? Different time controls have different ratings, as one would imagine/hope the author knows.

I'm guessing it's not 1800 FIDE, as the quality of the games seems far too bad for that. So any clarity here would be appreciated.

og_kalu 3 days ago | parent [-]

https://github.com/adamkarvonen/chess_gpt_eval

sourcepluck 3 days ago | parent [-]

Thank you. I had seen that, and had browsed through it, and thought: I don't get it, the reason for this 1800 must be elsewhere.

What am I missing? Where does it show there how the claim of "1800 ELO" is arrived at?

I can see various things that might be relevant, for example, the graph where it (GPT-3.5-turbo-instruct) is shown as going from mostly winning to mostly losing when it gets to Stockfish level 3. It's hard, if not impossible, to estimate the Lichess or FIDE Elo of the different Stockfish levels, but Lichess's Stockfish on level 3 is miles below 1800 FIDE, and it seems to me very likely to be below Lichess 1800.

I invite any FIDE 1800s and (especially) any Lichess 1800s to play Stockfish level 3 and report back. Years ago, when I played a lot on Lichess, I was low 2000s in rapid, and I won comfortably up until Stockfish level 6, where I could still win but also lost sometimes. Basically, I only really had to start paying attention at level 6.

Level 3 seems like it must be below Lichess 1800, but that's just my anecdotal feeling for the strengths. Seeing as how the article is chock-a-block with unfounded speculation and bias though, maybe we can indulge ourselves too.

So: could someone please explain the 1800 thing to me? And would any Lichess 1800s like to play guinea pig, play a series of games against Stockfish level 3, and report back to us?
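For anyone who does take this up, here is a rough sketch of how a match score converts into a rating gap under the standard logistic Elo model. The function names are my own, purely for illustration. Note the caveats: it only gives the gap to the opponent, not an absolute rating, and Lichess ratings are Glicko-2 rather than Elo, so treat the result as a ballpark at best.

```python
import math


def expected_score(rating_diff: float) -> float:
    """Expected score (0..1) for a player rated `rating_diff` points above
    the opponent, under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))


def rating_diff_from_score(wins: int, draws: int, losses: int) -> float:
    """Invert the Elo expectation: estimate how many points above (+) or
    below (-) the opponent a player performed, given a match result.

    Hypothetical helper for illustration; draws count as half a point.
    """
    games = wins + draws + losses
    if games == 0:
        raise ValueError("need at least one game")
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        # A 0% or 100% score implies an unbounded rating gap.
        raise ValueError("rating difference is undefined for a 0% or 100% score")
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    # Example: 15 wins, 0 draws, 5 losses (a 75% score) against one opponent.
    print(round(rating_diff_from_score(15, 0, 5), 1))
```

With a small number of games the estimate is very noisy, which is one more reason a single "1800" figure needs a stated opponent pool and time control to mean much.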

og_kalu 3 days ago | parent [-]

In Google's paper, then titled "Grandmaster level chess without search", they evaluate turbo-instruct to have a lichess Elo of 1755 (vs bots)

https://arxiv.org/abs/2402.04494

Admittedly, this isn't really "the source" though. The first people to break the news on turbo-instruct's chess ability all pegged it around 1800. https://x.com/GrantSlatton/status/1703913578036904431

sourcepluck 3 days ago | parent [-]

Thank you, I do appreciate it. I had a quick search through the paper, and can at least confirm for myself that it's a Lichess Elo, and one of 1755, that is found in that arxiv paper. As for the tweet that says 1800 without specifying that it's a Lichess rating, I can't see where he gets it from (but I don't have Twitter, so I could be missing something).

At least the arxiv paper is serious:

> A direct comparison between all engines comes with a lot of caveats since some engines use the game history, some have very different training protocols (i.e., RL via self-play instead of supervised learning), and some use search at test time. We show these comparisons to situate the performance of our models within the wider landscape, but emphasize that some conclusions can only be drawn within our family of models and the corresponding ablations that keep all other factors fixed.