krackers 6 hours ago

> this uses a harness

This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.

fermentation 3 hours ago | parent | next [-]

Right, fair, but look at the prompt. For the purpose of testing general intelligence, this seems kind of pointless.

UltraSane 3 hours ago | parent | prev [-]

It isn't arbitrary. They want to measure the capability of the general LLM.

fc417fc802 11 minutes ago | parent [-]

So if I say "I want to measure your capability as a mechanic" but then also "to ensure an accurate score you're forbidden to use any tools", how are you, the human mechanic, planning to diagnose and fix the engine problem without wrenches, jack stands, and the like? It makes no sense.

That said, their harness isn't generic. It includes a ridiculously detailed prompt for how to play this specific game. Forbidding tool use is arbitrary and, above all, pointless hoop jumping, but that doesn't make the linked "achievement" any less fraudulent.