I suspect something similar (or even worse) is going on with Terminal-Bench [1].

Like, seriously, how come all these agents are beating Claude Code? In practice, they are shitty and not even close. Yes. I tried them.

cma 5 days ago | parent | next [-]

Claude Code was severely degraded over the last few weeks; very simple terminal prompts that it never had problems with were failing for me.

Follow the money. Or how much comes from your pocket vs. VC and big tech speculators.

cma 5 days ago | parent [-]

They did a big fundraising round right after so it's easy to suspect they were manipulating profitability growth for it.

Bolwin 5 days ago | parent | prev [-]

They're all using Claude, so idk. Claude Code is just a program; the magic is mainly in the model.