tifa2up, a day ago:
We tried GPT-5 for a RAG use case and found that it performs worse than 4.1. We reverted and didn't look back.

sigmoid10, a day ago:
4.1 is such an amazing model in so many ways. It's still my no. 1 choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x GPT-5's). Definitely the best non-reasoning model out there for real-world tasks.

HugoDias, a day ago:
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer-context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them.
tifa2up, a day ago:
For large contexts (up to 100K tokens in some cases). We found that GPT-5:
a) has worse instruction following; it doesn't follow the system prompt
b) produces very long answers, which resulted in a bad UX
c) has a 125K context window, so extreme cases resulted in an error

internet_points, a day ago:
Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?

tifa2up, a day ago:
Think it varies by use case. It didn't do well with long context.
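
If the hard failure mode is contexts that overflow the window, as in tifa2up's point (c), one common mitigation is to cap retrieved context to a token budget before the call. A minimal sketch, assuming chunks arrive ranked by retrieval score; the 100K budget, the tokenizer choice, and the function name are illustrative assumptions, not details from this thread:

```python
# Sketch: trim ranked RAG chunks to a fixed token budget so oversized
# contexts degrade gracefully instead of erroring out at the API.
# The 100K budget and o200k_base tokenizer are assumptions, not thread facts.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def fit_to_budget(ranked_chunks: list[str], budget: int = 100_000) -> list[str]:
    """Keep the highest-ranked chunks that fit within `budget` tokens."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        n = len(enc.encode(chunk))
        if used + n > budget:
            break  # drop this and all lower-ranked chunks
        kept.append(chunk)
        used += n
    return kept
```

Greedy truncation like this trades recall for reliability: everything below the cut is silently dropped, so ranking quality before the trim matters.
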
Shank, a day ago:
ChatGPT when using 5 or 5-Thinking doesn't even follow my “custom instructions” on the web version. It's a serious downgrade compared to the prior generation of models.

cj, a day ago:
It does “follow” custom instructions, but more as a suggestion than a requirement (compared to other models).

Xmd5a, a day ago:
Ah, 100K of context in a 125K window: that's what poses problems, I believe. GPT-5's scores should go up if you process contexts that are ten times shorter.
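
One way to act on Xmd5a's suggestion is a map-reduce pattern: answer over several short contexts, then synthesize the partial answers. A rough sketch using the OpenAI Python client; the model identifier, prompts, and helper names are assumptions, not anything the thread specifies:

```python
# Sketch: map-reduce answering over short context windows instead of one
# huge prompt. Model name, prompts, and window contents are illustrative.
from openai import OpenAI

client = OpenAI()

def ask(context: str, question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5",  # assumed identifier; substitute whatever you deploy
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

def windowed_answer(windows: list[str], question: str) -> str:
    # Map: one short (e.g. ~10K-token) context per call.
    partials = [ask(w, question) for w in windows]
    # Reduce: synthesize the partial answers into a final one.
    return ask("\n---\n".join(partials), question)
```
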
mbesto, a day ago:
How do you objectively tell whether a model "performs" better than another?
belval, a day ago:
Not the original commenter, but I work in the space, and we have large annotated datasets with "gold" evidence that we want to retrieve; the evaluation of new models is actually very quantitative.

mbesto, 18 hours ago:
> but I work in the space

Ya, the original commenter likely does not work in the space, hence the ask.

> the evaluation of new models is actually very quantitative.

While you may be able to derive a % correct (and hence something quantitative), the results are by their nature not objective: Q&As on written subjects are very much subjective. Example benchmark: https://llm-stats.com/benchmarks/gpqa

Even though there are techniques to reduce overfitting, it still isn't eliminated. So it's very much subjective.
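
Concretely, the kind of evaluation belval describes often reduces to retrieval metrics such as recall@k over the gold evidence annotations. A minimal sketch; the dataset shape, field names, and document IDs are assumptions for illustration:

```python
# Sketch: score a retriever against annotated "gold" evidence. The metric is
# fully quantitative even when, as mbesto notes, producing the gold labels
# required subjective judgment. Dataset shape and IDs are made up.
def recall_at_k(retrieved: list[str], gold: set[str], k: int = 10) -> float:
    """Fraction of gold evidence docs that appear in the top-k retrieved."""
    if not gold:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in gold) / len(gold)

dataset = [  # toy examples standing in for a large annotated set
    {"retrieved": ["d3", "d7", "d1"], "gold": {"d1", "d9"}},
    {"retrieved": ["d2", "d9"], "gold": {"d2"}},
]
mean = sum(recall_at_k(ex["retrieved"], ex["gold"]) for ex in dataset) / len(dataset)
print(f"mean recall@10: {mean:.2f}")  # 0.75 on the toy data above
```

Averaging the per-example scores gives the single number used to compare models, which is where the quantitative/subjective distinction in this subthread actually lives.
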
teekert, a day ago:
So… you did look back, then didn't look forward anymore… Sorry, couldn't resist.