tifa2up a day ago

We tried GPT-5 for a RAG use case, and found that it performs worse than 4.1. We reverted and didn't look back.

sigmoid10 a day ago | parent | next [-]

4.1 is such an amazing model in so many ways. It's still my No. 1 choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x GPT-5's). Definitely the best non-reasoning model out there for real-world tasks.

HugoDias a day ago | parent | prev | next [-]

Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them

tifa2up a day ago | parent [-]

For large context (up to 100K tokens in some cases). We found that GPT-5: a) has worse instruction following and doesn't follow the system prompt; b) produces very long answers, which resulted in a bad UX; c) has a 125K context window, so extreme cases resulted in an error.
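The overflow error in (c) is avoidable by budgeting retrieved chunks against the window before calling the model. A minimal sketch, assuming a crude 4-characters-per-token heuristic (a real pipeline would count with an actual tokenizer such as tiktoken); the limits and chunk data are illustrative, not from the comment:

```python
MAX_TOKENS = 125_000  # the window the commenter reports hitting
RESERVED = 4_000      # headroom for the system prompt and the answer

def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return len(text) // 4

def fit_context(chunks: list[str], budget: int = MAX_TOKENS - RESERVED) -> list[str]:
    """Keep the highest-ranked chunks, in order, until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # dropping lower-ranked chunks beats a hard API error
        kept.append(chunk)
        used += cost
    return kept

# Hypothetical ranked retrieval results; only what fits is sent to the model.
chunks = ["chunk-a " * 1000, "chunk-b " * 1000]
context = fit_context(chunks)
```

Since retrieval results are usually ranked by relevance, truncating from the tail loses the least useful evidence first.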

internet_points a day ago | parent | next [-]

Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use-case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?

tifa2up a day ago | parent [-]

I think it varies by use case. It didn't do well with long context.

Shank a day ago | parent | prev | next [-]

ChatGPT when using 5 or 5-Thinking doesn’t even follow my “custom instructions” on the web version. It’s a serious downgrade compared to the prior generation of models.

cj a day ago | parent [-]

It does “follow” custom instructions, but more as a suggestion than a requirement (compared to other models).

Xmd5a a day ago | parent | prev [-]

Ah, 100K contexts against a 125K window — I believe that's what poses problems. GPT-5's scores should go up if you process contexts that are 10 times shorter.

mbesto a day ago | parent | prev | next [-]

How do you objectively tell whether a model "performs" better than another?

belval a day ago | parent [-]

Not the original commenter, but I work in the space. We have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.
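For readers unfamiliar with this kind of eval: each query is annotated with the gold evidence ids it should surface, and the retriever is scored on how many it actually returns. A small sketch with made-up data and a hypothetical `recall_at_k` metric (one of several standard choices, alongside MRR and nDCG):

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold evidence ids found among the top-k retrieved ids."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in gold)
    return hits / len(gold) if gold else 0.0

# Illustrative annotated dataset: gold ids plus the retriever's ranked output.
dataset = [
    {"query": "q1", "gold": {"d1", "d7"}, "retrieved": ["d1", "d3", "d7", "d9"]},
    {"query": "q2", "gold": {"d2"},       "retrieved": ["d5", "d8", "d1", "d2"]},
]

scores = [recall_at_k(ex["retrieved"], ex["gold"], k=3) for ex in dataset]
mean_recall = sum(scores) / len(scores)
print(mean_recall)  # → 0.5: q1 finds both gold ids in the top 3, q2 finds none
```

Comparing two models then reduces to running both retrievers (or both generators, with answer-level gold) over the same annotated set and comparing the aggregate numbers.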

mbesto 18 hours ago | parent [-]

> but I work in the space

Ya, the original commenter likely does not work in the space - hence the ask.

> the evaluation of new models is actually very quantitative.

While you may be able to derive a % correct (and hence something quantitative), the underlying judgments are by their nature subjective: grading Q&A on written subjects involves judgment calls. Example benchmark: https://llm-stats.com/benchmarks/gpqa Even though there are techniques to reduce overfitting, it still isn't eliminated. So it's very much subjective.

teekert a day ago | parent | prev [-]

So… You did look back then didn’t look forward anymore… sorry couldn’t resist.