krisoft | 4 days ago
> authors should have done a parallel comparison study against humans on the same question bank as if the study authors had set out to investigate whether humans or LLMs reason better in this situation.

Only if they want to make statements about humans. The paper would have worked perfectly fine without those assertions. They are, as you are correctly observing, just a distraction from the main thrust of the paper.

> maybe some would and some wouldn't that could be debated

It should not be debated. It should be shown experimentally, with data. If they want to talk about human performance they need to show what the human performance really is, with data. (Not what the study authors, or people on HN, imagine it is.) If they don't want to do that they should not talk about human performance. Simples.

I totally understand why an AI scientist doesn't want to get bogged down with studying human cognition. It is not their field of study, so why would they undertake the work to study it?

It would be super easy to rewrite the paper to omit the unfounded speculation about human cognition. In the introduction, instead of "The triggers are not contextual so humans ignore them when instructed to solve the problem," they could write "The triggers are not contextual so the AI should ignore them when instructed to solve the problem."

And in the conclusions, where they write "These findings suggest that reasoning models, despite their structured step-by-step problem-solving capabilities, are not inherently robust to subtle adversarial manipulations, often being distracted by irrelevant text that a human would immediately disregard," just write "These findings suggest that reasoning models, despite their structured step-by-step problem-solving capabilities, are not inherently robust to subtle adversarial manipulations, often being distracted by irrelevant text."

That's it. That's all they should have done, and there would be no complaints on my part.
bee_rider | 3 days ago
> It would be super easy to rewrite the paper to omit the unfounded speculation about human cognition. In the introduction, instead of "The triggers are not contextual so humans ignore them when instructed to solve the problem," they could write "The triggers are not contextual so the AI should ignore them when instructed to solve the problem."

Another option would be to mark it more explicitly as speculation: "The triggers are not contextual, so we expect most humans would ignore them."

Anyway, it is a small detail that is almost irrelevant to the paper… actually, there seems to be something meta about that. Maybe we wouldn't ignore the cat facts!
disconcision | 3 days ago
i feel it's not quite that simple. certainly the changes you suggest make the paper more straightforwardly defensible. i imagine the reason they included the problematic assertion is that they (correctly) understood the question would arise. while inserting the assertion unsupported is probably the worst of both worlds, i really do think the question is worth addressing.

while it is not realistic to insist every study account for every possible objection, i would argue that for this kind of capability work it is in general worth at least modest effort to establish a human baseline. i can understand why people might not care about this, for example if their only goal is assessing whether an llm-based component can achieve a certain level of reliability as part of a larger system. but i also think there is similar, and perhaps even more pressing, broad applicability in considering the degree to which llm failure patterns approximate human ones. this is because, at this point, humans are essentially the generic all-purpose subsystem used to fill gaps in larger systems that cannot be filled (practically, or in principle) by simpler deterministic systems. so when it comes to a problem domain like this one, it is hard to avoid the conclusion that humans provide a convenient universal benchmark against which comparison is strongly worth considering.

(that said, i acknowledge the authors probably cannot win here. if they provided even a modest-scale human study, i am confident commenters would criticize their sample size.)