I've long believed those numbers were faked by Anthropic/OpenAI to serve as a form of advertisement. The estimates are impossible to verify and their ability to do "2 days of work" in 10 minutes will presumably make the user go "Wow, I just saved SO much time!" Plus, the unnecessary text eats up the users' tokens so it helps the companies on the backend, as well.

▲

leodavi 2 hours ago | parent | next [-]

I agree with you that labs are benefiting from those outputs but I'm skeptical that labs are purposefully training the models to produce those outputs.

Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.

I believe the outputs are a training coincidence with consequences that are opportunistic for the labs.

▲

Terretta an hour ago | parent | prev | next [-]

> the estimates

It doesn't estimate.

It generates tokens that read like estimates associated with the context in its training material.

What would you expect the generator to output instead?

	▲	ghshephard 5 minutes ago | parent [-]
		I think people are continuing to view these systems as pure LLMs - when that ship sailed 6+ months ago. Between being able to review memory, using agent harnesses and sub agents and skills to go out and discover information - modern systems (Codex, Claude Code, Cursor) - use LLMs - but the LLM is only a small component of it. Compare what you get from sending a request to a chatbot like ChatGPT - to what you can get from a modern harness. The output is influenced by the LLM, but it's no longer a "model making a token prediction based on training material and RLHF" - that's a very 2025 way of looking at these systems. Even Gary Marcus is starting to come around and realize that his priors are no longer as relevant as they once were.

▲

AgentMasterRace an hour ago | parent | prev | next [-]

All the models have broken estimates. They're trained heavily on Jira and GitHub tasks and issues. That's why their estimates are human.

	▲	esperent 16 minutes ago | parent [-]
		Even for humans the estimates are way off, unless it's based on data that has some serious padding. That said, it'll often say "2 days of work" and then complete the coding in 30 minutes, and while that's amusing, afterwards, I'll need to manually test, or send to other people for review, or realize the agent only actually did half the work and I need to do a second pass (or a third etc.) and then often getting the feature in does genuinely take two days.

▲

dizhn an hour ago | parent | prev [-]

All models do it. It's their training. They didn't have "a person does this in a week but an LLM could in a minute" in their training yet. They also don't have the concept of elapsed time unless you ask them how long something has taken.