vintermann 14 hours ago

They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?

Also, from what I understand of the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.

lostmsu 5 hours ago

With T=0 on the same model you should get exactly the same output text. If they are not getting that, other environmental factors invalidate the test result.
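
To make the T=0 point concrete, here is a toy sketch of why temperature 0 is usually taken to mean deterministic output: it is conventionally implemented as greedy (argmax) decoding, so no random draw is involved. The `sample_token` helper and the logit values are made up for illustration and are not any provider's actual API. Real serving stacks can still differ run to run for reasons outside this sketch, which is the "environmental factors" caveat.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits (hypothetical helper, not a real API)."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, no randomness used.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax weights (shifted by the max for stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.2, 3.4, 3.1, 0.5]  # made-up scores for four candidate tokens

# Same logits, 100 different RNG seeds.
greedy = {sample_token(logits, 0, random.Random(seed)) for seed in range(100)}
sampled = {sample_token(logits, 1.0, random.Random(seed)) for seed in range(100)}

print(greedy)        # one index only: the seed never matters at T=0
print(len(sampled))  # more than one index shows up at T=1
```

If the same prompt at T=0 gives different text across runs, the variation is coming from somewhere other than the sampler in this toy model.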

Spivak 12 hours ago

If it was random luck, wouldn't you expect about half the answers to be better? Assuming the OP isn't lying, I don't think there's much room for luck when you get all the questions wrong on a T/F test.
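
A quick back-of-the-envelope for the luck argument, as a sketch: if each answer were an independent coin flip, the chance of getting every one wrong shrinks geometrically with the number of questions. The question count `n = 10` is an assumption picked for illustration; the thread does not say how many prompts were tested.

```python
from math import comb

def prob_at_most_k_correct(n, k, p=0.5):
    """Binomial probability of k or fewer correct answers out of n,
    assuming each answer is independently correct with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

n = 10  # hypothetical number of true/false questions (not from the thread)
p_all_wrong = prob_at_most_k_correct(n, 0)

print(p_all_wrong)  # 0.0009765625, i.e. 1 in 1024
```

Under that coin-flip assumption, a clean sweep of wrong answers is itself strong evidence that something systematic is going on.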