disconcision | 3 days ago

i feel it's not quite that simple. certainly the changes you suggest would make the paper more straightforwardly defensible. i imagine the authors included the problematic assertion because they (correctly) anticipated the question would arise. inserting the assertion unsupported is probably the worst of both worlds, but i really do think the question is worth addressing.

it is not realistic to insist that every study account for every possible objection, but for this kind of capability work i would argue it is generally worth at least modest effort to establish a human baseline. i can understand why some people might not care, for example if their only goal is assessing whether an llm-based component can reach a certain level of reliability as part of a larger system. but i think there is a similar, and perhaps even more pressing, broad reason to consider the degree to which llm failure patterns approximate human ones: at this point, humans are essentially the generic all-purpose subsystem used to fill the gaps in larger systems that cannot be filled (practically, or in principle) by simpler deterministic components. so for a problem domain like this one, it is hard to avoid the conclusion that humans provide a convenient universal benchmark, and that comparison against them is strongly worth considering.

(that said, i acknowledge the authors probably cannot win here. if they had provided even a modest-scale human study, i am confident commenters would criticize their sample size.)
i feel it's not quite that simple. certainly the changes you suggest make the paper more straightforwardly defensible. i imagine the reason they included the problematic assertion is that they (correctly) understood the question would arise. while inserting the assertion unsupported is probably the worst of both worlds, i really do think it is worthwhile to address. while it is not realistic to insist every study account for every possible objection, i would argue that for this kind of capability work, it is in general worth at least modest effort to establish a human baseline. i can understand why people might not care about this, for example if their only goal is assessing whether or not an llm-based component can achieve a certain level of reliability as part of a larger system. but i also think that there is similar, and perhaps even more pressing broad applicability for considering the degree to which llm failure patterns approximate human ones. this is because at this point, human are essentially the generic all-purpose subsystem used to fill gaps in larger systems which cannot be filled (practically, or in principle) by simpler deterministic systems. so when it comes to a problem domain like this one, it is hard to avoid the conclusion that humans provide a convenient universal benchmark to which comparison is strongly worth considering. (that said, i acknowledge that authors probably cannot win here. if they provided even a modest-scale human study, i am confident commenters would criticize their sample size) |