Since the agents were instructed to not explain their answer, you can't know if their answer was reasonable or not.

The reason for the "No explanations, no qualifiers" in the prompt was to force the models to put the claim in one of the four buckets and answer with the bucket name only. It's a purely quantitative analysis (the first in a series), and it does indeed lack the qualitative aspect.

	▲	hombre_fatal an hour ago | parent [-]
		Sure, but people are drawing conclusions beyond "LLMs said different words" and trying to use it to analyze whether the LLMs were wrong about the underlying facts. That information isn't available to us.