jjani 5 days ago
That's a very roundabout way to phrase "you're completely making all of this up", which is quite disappointing tbh. Are you familiar with evals? As in automated testing using multiple runs? It's simple regression testing, just like for deterministic code. Doing multiple runs smooths out any stochastic differences, and the change I explained isn't explainable by stochasticity regardless.

There is no evidence that would satisfy you then, as it would be exactly what I showed. You'd need a time machine.

https://www.reddit.com/r/ClaudeAI/comments/1gxa76p/claude_ap...

Here's just one thread.
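The multi-run eval approach described above can be sketched roughly as follows. This is a minimal illustration, not either commenter's actual harness: `run_model` is a hypothetical stand-in for a real LLM API call, and the baseline and tolerance numbers are made-up assumptions.

```python
import random

def run_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a real model call. In practice this would
    # hit an LLM API with temperature > 0, so each run is stochastic;
    # here a seeded RNG simulates a model that answers correctly ~90% of the time.
    rng = random.Random(seed)
    return "4" if rng.random() < 0.9 else "5"

def pass_rate(prompt: str, expected: str, n_runs: int = 50) -> float:
    """Score the same eval case over many runs to smooth out stochasticity."""
    passes = sum(run_model(prompt, seed=i) == expected for i in range(n_runs))
    return passes / n_runs

# Regression check: compare against a pass rate recorded on an earlier model
# snapshot, and only flag drops well beyond normal run-to-run noise.
BASELINE = 0.90   # assumed pass rate from the earlier snapshot (made up)
TOLERANCE = 0.10  # assumed allowance for stochastic wobble (made up)

rate = pass_rate("What is 2 + 2?", expected="4")
regressed = rate < BASELINE - TOLERANCE
print(f"pass rate: {rate:.2f}, regressed: {regressed}")
```

The point of averaging over many runs is that a single flaky completion proves nothing, but a pass rate that drops far outside the tolerance band on a fixed eval set is the kind of signal regression tests exist to catch.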
simonw 5 days ago | parent
I don't think you're making it up, but without a lot more details I can't be convinced that your methodology was robust enough to prove what you say it shows.

There IS evidence that would satisfy me, but I'd need to see it. I will have a high bar for that though. A Reddit thread of screenshots from nine months ago doesn't do the trick for me.

(Having looked at that thread, it doesn't look like a change in model weights to me; it looks more like a temporary capacity glitch in serving them.)