theamk 12 hours ago

So they use LLMs to evaluate LLMs: one LLM writes the questions, another LLM writes the country-specific answers, and yet another LLM infers the country from an answer. The only manual step seems to be that the questions were "manually reviewed [questions] to remove repetitions or accidental location references."

This seems like a pretty lazy methodology: if there are LLM-specific country biases, they could be introduced at any stage of the process.