dom96 5 hours ago

Why do none of the benchmarks test for hallucinations?

tedsanders 3 hours ago | parent | next

In the text, we did share one hallucination benchmark: on a set of error-prone ChatGPT prompts we collected, claim-level errors fell by 33% and responses containing an error fell by 18% (though of course the rate will vary a lot across different types of prompts).

Hallucinations are the #1 problem with language models and we are working hard to keep bringing the rate down.

(I work at OpenAI.)
