HugoDias a day ago

Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer-context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them.

tifa2up a day ago | parent [-]

For large contexts (up to 100K tokens in some cases). We found that GPT-5: a) has worse instruction following and doesn't stick to the system prompt; b) produces very long answers, which made for a bad UX; c) has a 125K context window, so extreme cases resulted in an error.
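For anyone hitting the same wall, a rough sketch of one way to guard against the window: count tokens and trim retrieved chunks before the generation call. It assumes the OpenAI Python SDK and tiktoken's o200k_base encoding as an approximation of the tokenizer; the model name, token budgets, and chunk list are placeholders, not our production code.

    import tiktoken
    from openai import OpenAI

    client = OpenAI()
    enc = tiktoken.get_encoding("o200k_base")  # rough stand-in for the model's tokenizer

    MODEL = "gpt-5"              # placeholder model name
    CONTEXT_LIMIT = 125_000      # the window we kept overflowing
    RESERVED_FOR_OUTPUT = 4_000  # headroom for the answer itself

    def build_context(system_prompt: str, question: str, chunks: list[str]) -> str:
        # Greedily pack retrieved chunks (already sorted by relevance) into the budget.
        budget = CONTEXT_LIMIT - RESERVED_FOR_OUTPUT
        budget -= len(enc.encode(system_prompt)) + len(enc.encode(question))
        picked = []
        for chunk in chunks:
            cost = len(enc.encode(chunk))
            if cost > budget:
                break
            picked.append(chunk)
            budget -= cost
        return "\n\n".join(picked)

    def answer(system_prompt: str, question: str, chunks: list[str]) -> str:
        context = build_context(system_prompt, question, chunks)
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content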

internet_points a day ago | parent | next [-]

Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use-case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?

tifa2up a day ago | parent [-]

I think it varies by use case. It didn't do well with long context in our case.

Shank a day ago | parent | prev | next [-]

ChatGPT, when using 5 or 5-Thinking, doesn't even follow my “custom instructions” on the web version. It's a serious downgrade compared to the prior generation of models.

cj a day ago | parent [-]

It does “follow” custom instructions, but more as a suggestion than a requirement (compared to other models).

Xmd5a a day ago | parent | prev [-]

Ah, 100K out of a 125K window, I believe that's what causes the problems. GPT-5's scores should go up if you process contexts that are 10 times shorter.
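Something like a map-reduce over ~10K-token slices could test that: answer each slice separately, then merge the partial answers. A rough sketch, assuming the OpenAI Python SDK and tiktoken; the slice size, prompts, and model name are made up:

    import tiktoken
    from openai import OpenAI

    client = OpenAI()
    enc = tiktoken.get_encoding("o200k_base")  # rough stand-in for the tokenizer
    SLICE_TOKENS = 10_000  # ~10x shorter than the 100K contexts mentioned above

    def split_into_slices(text: str, size: int = SLICE_TOKENS) -> list[str]:
        tokens = enc.encode(text)
        return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), size)]

    def map_reduce_answer(question: str, long_context: str) -> str:
        # Map: answer the question against each small slice independently.
        partials = []
        for part in split_into_slices(long_context):
            r = client.chat.completions.create(
                model="gpt-5",  # placeholder model name
                messages=[
                    {"role": "system", "content": "Answer only from the provided context."},
                    {"role": "user", "content": f"Context:\n{part}\n\nQuestion: {question}"},
                ],
            )
            partials.append(r.choices[0].message.content)
        # Reduce: merge the partial answers into one concise answer.
        merged = client.chat.completions.create(
            model="gpt-5",
            messages=[
                {"role": "system", "content": "Combine these partial answers into one concise answer."},
                {"role": "user", "content": "\n\n".join(partials)},
            ],
        )
        return merged.choices[0].message.content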