dom96 5 hours ago:
Why do none of the benchmarks test for hallucinations?
tedsanders 3 hours ago:
In the text, we did share one hallucination benchmark: claim-level errors fell by 33% and responses with an error fell by 18%, on a set of error-prone ChatGPT prompts we collected (though of course the rate will vary a lot across different types of prompts). Hallucinations are the #1 problem with language models, and we are working hard to keep bringing the rate down. (I work at OpenAI.)