▲ | throwaway314155 a day ago | |||||||
Way offtopic to TFA now, but isn't using an improved harness a bit like saying "I'm going to hardcode as many priors as possible into this thing so it succeeds regardless of its ability to strategize, plan and execute"? | ||||||||
▲ | silvr a day ago | parent | next [-] | |||||||
While true to a degree, I think this is largely wrong. Wouldn't it still count as a "harness" if we provided these LLMs with full robotic control of two humanoid arms, so that they could hold a Game Boy and play the game that way? I don't think the lack of that level of human-ness takes away from the demonstration of long-context reasoning that the GPP stream showed. Claude got stuck reasoning its way through one of the more complex puzzle areas. Gemini took a while on it also, but made it through. I don't think that difference can be fully attributed to the harnesses. Obviously, the best thing to do would be to run a SxS of the two models in the same harness. Maybe that will happen? | ||||||||
▲ | samrus a day ago | parent | prev | next [-] | |||||||
it is. the benchmark was somewhat cheated, from the perspective of finding out how the model adjusts and plans within a dynamic reactive environment | ||||||||
▲ | 11101010001100 a day ago | parent | prev [-] | |||||||
They asked Gemini to come up with another word for cheating and it came up with 'harness'.