mxwsn a day ago

Gemini has beaten it already, but using a different and notably more helpful harness. The creator has said they think harness design is the most important factor right now, and that the results don't mean much for comparing Claude to Gemini.

throwaway314155 a day ago | parent [-]

Way off-topic to TFA now, but isn't using an improved harness a bit like saying "I'm going to hardcode as many priors as possible into this thing so it succeeds regardless of its ability to strategize, plan and execute"?

silvr a day ago | parent | next [-]

While true to a degree, I think this is largely wrong. Wouldn't it still count as a "harness" if we provided these LLMs with full robotic control of two humanoid arms, so that they could hold a Gameboy and play the game that way? I don't think the lack of that level of human-ness takes away from the demonstration of long-context reasoning that the GPP stream showed.

Claude got stuck reasoning its way through one of the more complex puzzle areas. Gemini took a while on it also, but made it through. I don't think that difference can be fully attributed to the harnesses.

Obviously, the best thing to do would be to run a SxS of the two models in the same harness. Maybe that will happen?

throwaway314155 a day ago | parent [-]

I can appreciate that the model is likely still highly capable with a good harness. Still, I think this is more in line with ideas from, say, speedrunning (or hell, even reinforcement learning), where you want to prove something profound is possible, and to do so before others do, you need to accumulate a series of "tricks" (refining exploits/hacking rewards) in order to achieve the goal. But if you use too many tricks, you're no longer proving something as profound as originally claimed. In speedrunning this tends to splinter into multiple categories.

Basically, the game being completed by Gemini was in an inferior category (however minuscule) of experiment.

I get it though. People demanded these types of changes in the CPP Twitch chat, because the pain of watching the model fail in slow motion is simply too much.

samrus a day ago | parent | prev | next [-]

it is. the benchmark was somewhat cheated, from the perspective of finding out how the model adjusts and plans within a dynamic reactive environment

11101010001100 a day ago | parent | prev [-]

They asked Gemini to come up with another word for cheating and it came up with 'harness'.