briga 14 hours ago
I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e. expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is: just because a model produces nine excellent responses doesn't mean the tenth response won't be garbage.
vintermann 14 hours ago
They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line? Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.
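The kind of machine-checkable regression suite described here might look roughly like the sketch below. The `call_model` stub and the expected JSON keys are purely illustrative (the article doesn't publish its harness); a real harness would call the provider's API with temperature 0 and the same fixed prompts on every run, then track the pass rate over time.

```python
import json


def call_model(prompt: str, temperature: float = 0.0) -> str:
    # Hypothetical stub standing in for a real LLM API call.
    # A real harness would send `prompt` to the same model with
    # temperature=0 and fixed inputs, as described in the thread.
    return '{"status": "ok", "items": [1, 2, 3]}'


def conforms(output: str) -> bool:
    # Machine-checkable format test: the output must be valid JSON
    # with the required keys. No subjective quality judgment involved.
    try:
        data = json.loads(output)
    except ValueError:
        return False
    return isinstance(data, dict) and {"status", "items"} <= data.keys()


def run_suite(prompts: list[str]) -> float:
    # Fraction of prompts whose output passes the format check.
    # Logging this number per run is what would reveal drift
    # (or rule it out) over weeks and months.
    results = [conforms(call_model(p)) for p in prompts]
    return sum(results) / len(results)
```

Because the check is binary and the inputs are fixed, a declining pass rate across runs would be hard to explain by "expanding sample size" alone.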
nothrabannosir 14 hours ago
TFA is about someone running the same test suite with temperature 0, fixed inputs, and fixtures on the same model over months on end. What's missing is the actual evidence, which I would of course love to see. But assuming they're not actively lying, this is not as subjective as you suggest.
chaos_emergent 14 hours ago
Yes, exactly. My theory is that the novelty of a new generation of LLMs' performance tends to inflate people's perceptions of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change.
zzzeek 14 hours ago
Your theory does not hold up for this specific article: they carefully explained that they are sending identical inputs to the model each time and observing progressively worse results with all other variables unchanged. (Though to be fair, others have noted they provided no replication details as to how they arrived at these results.)
gtsop 12 hours ago
I see your point, but no, it's getting objectively worse. I have a similar experience from casually using ChatGPT for various use cases: when 5 dropped, I noticed it was very fast but oddly got some details wrong. As time moved on, it became both slower and the output deteriorated.
yieldcrv 14 hours ago
FTA: "I am glad I have proof of this with the test system." I think they have receipts, but did not post them there.
colordrops 14 hours ago
Did any of you read the article? They have a test framework that objectively shows the model getting worse over time.