Remix.run Logo
DaKevK 3 hours ago

Genuinely one of the more interesting model evals I've seen described. The sunk cost framing makes sense -- 4.5 doubles down, 4.6 cuts losses faster. 9 days vs 59 is a wild result. Makes me wonder how much of the regression complaints are from people hitting 4.6 on tasks where the first approach was obviously correct.

MrCheeze 41 minutes ago | parent [-]

Notably 45 out of the 50 days of improvement were in two specific dungeons (Silph Co and Cinnabar Mansion) where 4.5 was entirely inadequate and was looping the same mistaken ideas with only minor variation, until eventually it stumbled by chance into the solution. Until we saw how much better it did in those spots, we weren't completely sure that 4.6 was an improvement at all!

https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQDvsy5D...