Majromax 2 hours ago

If I'm reading that page correctly, the benchmark results don't cover the interesting "mid February" inflection point noted in the article/report. The numbers appear to start after the quality drop had already begun. Moreover, the daily confidence interval seems stupidly wide: between 42% and 69%?
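For a sense of what an interval that wide implies, here is a rough back-of-the-envelope sketch. It assumes the daily score is a simple pass rate over independent tasks with a 95% binomial interval, which the page may or may not actually use; the function names are made up for illustration.

```python
import math

def approx_sample_size(lo, hi, z=1.96):
    """Estimate n from an interval via the normal approximation.

    Assumes half-width = z * sqrt(p * (1 - p) / n), with p taken as the
    midpoint of the interval. This is an assumption, not the page's method.
    """
    p = (lo + hi) / 2
    half_width = (hi - lo) / 2
    return z * z * p * (1 - p) / (half_width * half_width)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical check: how many tasks per day would give 42%..69%?
n_est = approx_sample_size(0.42, 0.69)
lo, hi = wilson_interval(28, 50)  # e.g. 28 passes out of 50 tasks
print(round(n_est), round(lo, 2), round(hi, 2))
```

Under those assumptions the interval corresponds to something on the order of 50 tasks per day, and 28 passes out of 50 gives a Wilson interval close to the one quoted. Halving the width would take roughly four times as many tasks.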

The "Other metrics" graphs extend for a longer period, and those do seem to correlate with the report. Notably, the 'input tokens' (and consequently API cost) roughly halve (from 120M to 60M) between the beginning of February and mid-March, while the number of output tokens remains similar. That's consistent with the report's observation that new!Opus is more eager to edit code and skips reading/research steps.