| ▲ | daxfohl 11 hours ago |
| I still find in these instances there's at least a 50% chance it has taken a shortcut somewhere: created a new, bigger bug in something that just happened not to have a unit test covering it, or broken an "implicit" requirement that was so obvious to any reasonable human that nobody thought to document it. These can be subtle because you're not looking for them, because no human would ever think to do such a thing. Then even if you do catch it, AI: "Ah, now I see exactly the problem. Just insert a few more coins and I'll fix it for real this time, I promise!" |
|
| ▲ | einrealist 5 minutes ago | parent | next [-] |
| And there is this paradox where it becomes harder to detect the problems as the models 'improve'. |
|
| ▲ | gtowey 10 hours ago | parent | prev | next [-] |
| The value extortion plan writes itself. How long before someone pitches the idea that the models should deliberately keep almost solving your problem, just to keep you spending? Would you even know? |
| |
| ▲ | password4321 7 hours ago | parent | next [-] | | First time I've seen this idea; I have a tingling feeling it might become reality sooner rather than later. | |
| ▲ | sailfast 9 hours ago | parent | prev | next [-] | | That’s far-fetched. It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. High value to user + lower compute costs = better pricing power and better margins overall. | | |
| ▲ | d0mine 8 hours ago | parent | next [-] | | > far-fetched
Remember Google? Once it was far-fetched that they would make the search worse just to show you more ads. Now, it is a reality. With tokens, it is even more direct. The more tokens users spend, the more money for providers. | | |
| ▲ | retsibsi 4 hours ago | parent | next [-] | | > Now, it is a reality. What are the details of this? I'm not playing dumb, and of course I've noticed the decline, but I thought it was a combination of losing the battle with SEO shite and leaning further and further into a 'give the user what you think they want, rather than what they actually asked for' philosophy. | | | |
| ▲ | throwthrowuknow 7 hours ago | parent | prev [-] | | Only if you are paying per token on the API. If you are paying a fixed monthly fee, then they lose money when you need to burn more tokens, and they lose customers when you can’t solve your problems within that month, max out your session limits, and end up with idle time, which you use to check whether the other providers have caught up with or surpassed your current favourite. | | |
| ▲ | layla5alive 2 hours ago | parent [-] | | Indeed, an unlimited plan seems like the only arrangement that isn't all but guaranteed to be abused by the provider. |
|
| |
| ▲ | xienze 8 hours ago | parent | prev [-] | | > It’s in the interest of the model builders to solve your problem as efficiently as possible token-wise. Unless you’re paying by the token. |
| |
| ▲ | Fnoord 7 hours ago | parent | prev | next [-] | | I was thinking more of a deliberate backdoor in the code. RCE is an obvious example, but another one could be bias. "I'm sorry ma'am, computer says you are ineligible for a bank account." These ideas aren't new. They were already around in the 90s, back when we still thought about privacy and accountability in technology, and dystopian novels described them long, long ago. | |
| ▲ | fragmede 10 hours ago | parent | prev | next [-] | | The free market proposition is that competition (especially with the Chinese labs and Grok) means Anthropic is welcome to do that. They're even welcome to illegally collude with OpenAI such that ChatGPT is similarly gimped. But switching costs are pretty low. If it turns out I can one-shot an issue with Qwen or DeepSeek or Kimi thinking, Anthropic loses not just my monthly subscription, but everyone else's that I show it to. So no, I think that's some grade-A conspiracy theory nonsense you've got there. | | |
| ▲ | coffeefirst 9 hours ago | parent | next [-] | | It’s not that crazy. It could even happen by accident in pursuit of another unrelated goal. And if it did, a decent chunk of the tech industry would call it “revealed preference” because usage went up. | | |
| ▲ | hnuser123456 9 hours ago | parent [-] | | LLMs became sycophantic and effusive because those responses were rated higher during RLHF, until it became newsworthy how obviously eager-to-please they got, so yes, being highly factually correct and "intelligent" was already not the only priority. |
| |
| ▲ | bandrami 6 hours ago | parent | prev | next [-] | | > But switching costs are pretty low
Switching costs are currently low. Once you're committed to the workflow, the providers will switch to making you prepay for a year's worth of tokens. | |
| ▲ | daxfohl 8 hours ago | parent | prev | next [-] | | To be clear, I don't think that's what they're doing intentionally. Especially on a subscription basis, they'd rather I maximize my value per token, or just not use them. Lulling users into using tokens unproductively is the worst possible option. The way agents work right now, though, just sometimes feels that way; they don't have a good way of saying "You're probably going to have to figure this one out yourself". | |
| ▲ | 7 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | jrflowers 9 hours ago | parent | prev | next [-] | | This is a good point. For example if you have access to a bunch of slot machines, one of them is guaranteed to hit the jackpot. Since switching from one slot machine to another is easy, it is trivial to go from machine to machine until you hit the big bucks. That is why casinos have such large selections of them (for our benefit). | | |
| ▲ | krupan 8 hours ago | parent | next [-] | | "for our benefit" lol! This is the best description of how we are all interacting with LLMs now. It's not working? Fire up more "agents" à la gas town or whatever. | |
| ▲ | robotmaxtron 4 hours ago | parent | prev [-] | | Last time I was at a casino I checked to see which company built the machines; imagine my surprise that it was (by my observation) a single vendor. |
| |
| ▲ | thunderfork 9 hours ago | parent | prev [-] | | As a rational consumer, how would you distinguish between some intentional "keep pulling the slot machine" failure rate and the intrinsic failure rate? I feel like saying "the market will fix the incentives" handwaves away the lack of information on internals. After all, look at the market response to Google making their search less reliable - sure, an invested nerd might try Kagi, but Google's still the market leader by a long shot. In a market for lemons, good luck finding a lime. | | |
| |
| ▲ | chanux 3 hours ago | parent | prev [-] | | Is this a page from the dating apps' playbook? |
|
|
| ▲ | wvenable 9 hours ago | parent | prev | next [-] |
| > These can be subtle because you're not looking for them
After any agent run, I always look at the git diff between the new version and the previous one. This helps catch things that you might otherwise not notice. |
| |
| ▲ | teaearlgraycold 4 hours ago | parent [-] | | And after manually coding I often have an LLM review the diff. 90% of the problems it finds can be discounted, but it’s still a net positive. |
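A minimal sketch of that kind of diff-review step, assuming the official OpenAI Python client and a checked-out git repo; the model name, diff range, and prompt are placeholders, not a recommendation of any specific setup:

    # Grab the most recent diff and ask an LLM to review it for subtle problems.
    # Assumes the `openai` package (>= 1.0) and OPENAI_API_KEY in the environment.
    import subprocess
    from openai import OpenAI

    # Diff of the last commit, e.g. the agent's most recent run (range is a placeholder).
    diff = subprocess.run(
        ["git", "diff", "HEAD~1"],
        capture_output=True, text=True, check=True,
    ).stdout

    client = OpenAI()
    review = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a careful code reviewer."},
            {"role": "user", "content": f"Review this diff for subtle bugs, broken implicit requirements, and untested behavior changes:\n\n{diff}"},
        ],
    )
    print(review.choices[0].message.content)

Most of what comes back can be discounted, as the parent says, but skimming it alongside your own read of the diff is cheap.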
|
|
| ▲ | charcircuit 10 hours ago | parent | prev [-] |
| You are using it wrong, or are using a weak model if your failure rate is over 50%. My experience is nothing like this. It very consistently works for me. Maybe there is a <5% chance it takes the wrong approach, but you can quickly steer it in the right direction. |
| |
| ▲ | testaccount28 10 hours ago | parent [-] | | you are using it on easy questions. some of us are not. | | |
| ▲ | meowface an hour ago | parent | next [-] | | A lot of people are getting good results using it on hard things. Obviously not perfect, but > 50% success. That said, more and more people seem to be arriving at the conclusion that if you want a fairly large-sized, complex task in a large existing codebase done right, you'll have better odds with Codex GPT-5.2-Codex-XHigh than with Claude Code Opus 4.5. It's far slower than Opus 4.5 but more likely to get things correct, and complete, in its first turn. | |
| ▲ | mikkupikku 9 hours ago | parent | prev | next [-] | | I think a lot of it comes down to how well the user understands the problem, because that determines the quality of instructions and feedback given to the LLM. For instance, I know some people have had success with getting Claude to do game development. I have never bothered to learn much of anything about game development, but have been trying to get Claude to do the work for me. Unsuccessful. It works for people who understand the problem domain, but not for those who don't. That's my theory. | | |
| ▲ | samrus 8 hours ago | parent [-] | | It works for hard problems when the person has already solved them and just needs the grunt work done. It also works for problems that have been solved a thousand times before, which impresses people and makes them think it is actually solving those problems. | |
| ▲ | daxfohl 8 hours ago | parent | next [-] | | Which matches what they are. They're first and foremost pattern recognition engines extraordinaire. If they can identify some pattern that's out of whack in your code compared to something in the training data, or a bug that is similar to others that have been fixed in their training set, they can usually thwack those patterns over to your latent space and clean up the residuals. On pattern matching alone, they are significantly superhuman. "Reasoning", however, is a feature that has been bolted on with a hacksaw and duct tape. Their ability to pattern match makes reasoning seem more powerful than it actually is. If your bug is within some reasonable distance of a pattern they have seen in training, reasoning can get them over the final hump. But if your problem is too far removed from what they have seen in their latent space, they're not likely to figure it out by reasoning alone. | |
| ▲ | charcircuit 7 hours ago | parent [-] | | > "Reasoning", however, is a feature that has been bolted on with a hacksaw and duct tape.
What do you mean by this? Especially for tasks like coding, where there is a deterministic correct-or-incorrect signal, it should be possible to train. |
| |
| ▲ | thunky 5 hours ago | parent | prev [-] | | > It also works for problems that have been solved a thousand times before
So you mean it works on almost all problems? |
|
| |
| ▲ | baq 10 hours ago | parent | prev [-] | | Don’t use it for hard questions like this, then; you wouldn’t use a hammer to cut a plank, you’d try to make a saw instead. |
|
|