data-ottawa 5 days ago

Benchmarks are too expensive for ordinary users to run, but it would be useful if they published benchmark results from prod over time. That would expose degradations in a more objective manner.

Of course, there’s always the problem of teaching to the test and of out-of-test degradations, but presumably bugs would be independent of that.

rapind 5 days ago

A few weeks ago Reddit was on fire with reports of outages and timeouts, and yet the Anthropic Jira status page was showing everything as green. So even if they had benchmarks, I'm not sure they'd be transparent with them.