jjani 5 days ago
That's a very roundabout way to phrase "you're completely making all of this up", which is quite disappointing tbh. Are you familiar with evals? As in automated testing using multiple runs? It's simple regression testing, just like for deterministic code. Doing multiple runs smooths out any stochastic differences, and the change I explained isn't explainable by stochasticity regardless.

There is no evidence that would satisfy you then, as it would be exactly what I showed. You'd need a time machine.

https://www.reddit.com/r/ClaudeAI/comments/1gxa76p/claude_ap...

Here's just one thread.
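The multi-run eval approach described above can be sketched roughly as follows. This is a minimal illustration, not either commenter's actual harness: `run_model` is a hypothetical stand-in for a real LLM API call, and the baseline and tolerance numbers are made-up assumptions.

```python
import random

def run_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a real model call. In practice this would
    # hit an LLM API with temperature > 0, so each run is stochastic;
    # here a seeded RNG simulates a model that answers correctly ~90% of the time.
    rng = random.Random(seed)
    return "4" if rng.random() < 0.9 else "5"

def pass_rate(prompt: str, expected: str, n_runs: int = 50) -> float:
    """Score the same eval case over many runs to smooth out stochasticity."""
    passes = sum(run_model(prompt, seed=i) == expected for i in range(n_runs))
    return passes / n_runs

# Regression check: compare against a pass rate recorded on an earlier model
# snapshot, and only flag drops well beyond normal run-to-run noise.
BASELINE = 0.90   # assumed pass rate from the earlier snapshot (made up)
TOLERANCE = 0.10  # assumed allowance for stochastic wobble (made up)

rate = pass_rate("What is 2 + 2?", expected="4")
regressed = rate < BASELINE - TOLERANCE
print(f"pass rate: {rate:.2f}, regressed: {regressed}")
```

The point of averaging over many runs is that a single flaky completion proves nothing, but a pass rate that drops far outside the tolerance band on a fixed eval set is the kind of signal regression tests exist to catch.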
simonw 5 days ago | parent
I don't think you're making it up, but without a lot more details I can't be convinced that your methodology was robust enough to prove what you say it shows.

There IS evidence that would satisfy me, but I'd need to see it. I will have a high bar for that though. A Reddit thread of screenshots from nine months ago doesn't do the trick for me.

(Having looked at that thread, it doesn't look like a change in model weights to me; it looks more like a temporary capacity glitch in serving them.)