This is specifically a consumer-model (or specifically ChatGPT) issue. E.g., IME Codex does not do this and will just tell you when you're missing something or are somehow wrong, while Gemini does this weird thing where it tells you you're a genius and then immediately starts correcting everything you said.

solid_fuel an hour ago

Sycophancy is just one aspect of the problems I mentioned, though. Another huge one is hallucination, which is actually far worse than I thought:

> It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination score on the AA-Omniscience benchmark, meaning on questions that it couldn’t figure out, it only stated that it didn’t know around 6% of the time, and the rest it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.

https://arrowtsx.dev/bigger-models/
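To make the quoted metric concrete, here is a minimal sketch of the arithmetic as the quote describes it: of the questions the model could not answer correctly, the share where it gave a wrong answer instead of saying it didn't know. The function name and the counts are hypothetical, picked only to reproduce the 94% figure, and the benchmark's official formula may differ in detail.

```python
def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of missed questions answered wrongly rather than declined.

    Assumption: this follows the quote's description of the metric,
    not necessarily the benchmark's exact published definition.
    """
    missed = incorrect + abstained
    if missed == 0:
        raise ValueError("no missed questions to compute a rate from")
    return incorrect / missed


# Hypothetical counts: out of 100 missed questions, 94 wrong answers
# and 6 "I don't know" responses.
rate = hallucination_rate(incorrect=94, abstained=6)
print(f"{rate:.0%}")  # prints "94%"
```

Note that the rate says nothing about how many questions the model got right; it only describes what happens on the ones it missed.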

I think even a 5% hallucination rate would be terrible for a teacher, who should generally be comfortable with saying "I don't know off the top of my head but here is how to find resources to answer your question".

---

So, just to drive the point home: Codex has an 86.9% hallucination rate on AA-Omniscience according to this index: https://benchlm.ai/models/gpt-5-3-codex. If you ask it something that wasn't sufficiently covered in its training data, it will confidently make up an answer nearly 87% of the time.

While you might think it is happy to correct you when you are wrong, you don't know that for sure since you don't know when you're wrong. Codex may have been happily agreeing with you about things you had completely backwards.

	ndriscoll a minute ago
		Except I generally do know when I'm wrong, because I'm working in a domain I am familiar with, and it will often create experiments on the fly to check, unprompted (well, prompted, but generically in AGENTS.MD). My experience actually using it for software is that it almost never makes up answers.