“Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”

One major weakness of this study is that the authors didn't fully test frontier models, for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion, that models degrade when both behavior and architecture must be correct, is interesting and something to keep an eye on.

▲

qsort 5 hours ago | parent | prev | next [-]

I think it's downstream of "you can't optimize for two different objectives".

If you only have functional requirements, then in effect you're doing some form of program synthesis, and RL can optimize that very hard.

If you have a mixture of functional and non-functional requirements, you are basically giving the model an incomplete specification, and it must in some way guess at the user's intent to fill in the blanks. This is also why adding to the prompt examples of the style of code you want (hats off to antirez for this particular tip ;)) is phenomenally powerful.

▲

apsurd 5 hours ago | parent [-]

Would you mind sharing antirez' suggestion?

▲

qsort 4 hours ago | parent [-]

I am obviously paraphrasing, but the general idea is that trying to synthesize style from a codebase into e.g. a markdown guide generally doesn't work very well. What achieves style transfer is providing the model with a lot of examples of the style, conventions, patterns you want.

To put it in practice: if you point claude/codex to a repository and you ask it to implement feature X using style guide Y, the code will probably work, but you can usually get better results by saying "do it in the style of this file, it was done well there".
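A minimal sketch of that idea, assuming you assemble the prompt yourself before handing it to the agent. The helper names `build_style_prompt` and `load_exemplars` are hypothetical and not part of any tool's API; the point is only that the example files go into the prompt verbatim instead of being summarized into a style guide.

```python
from pathlib import Path


def build_style_prompt(task: str, exemplars: dict[str, str]) -> str:
    """Build a prompt that shows example files rather than describing a style.

    `exemplars` maps a file name to its source text (hypothetical structure).
    """
    parts = [
        f"Task: {task}",
        "",
        "Match the style, conventions, and patterns of these files:",
    ]
    for name, source in exemplars.items():
        # Include each exemplar verbatim so the model can copy its conventions.
        parts.append(f"\n--- {name} ---\n{source.rstrip()}")
    parts.append("\nDo it in the style of the files above; it was done well there.")
    return "\n".join(parts)


def load_exemplars(paths: list[str]) -> dict[str, str]:
    """Read exemplar files from disk, keyed by the path that was given."""
    return {p: Path(p).read_text(encoding="utf-8") for p in paths}
```

In use, something like `build_style_prompt("implement feature X", load_exemplars(["path/to/good_file.py"]))` would produce the text to paste or send; which files count as "done well" is still a human judgment call.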

▲

brandensilva 4 hours ago | parent | next [-]

Right, more simply put, it's great at being a copycat, exploring similar data points that match your token needs.

It is not great at decision making or judgment calls that don't have a well defined spec or plan in place yet; like unofficial or unapproved tokens if you will. A lot of this stuff simply never has had specs as it has been internal to how companies work and their secret sauce.

The closest things we have are governance and compliance policies, which legal and business needs require, so they are far better documented than the operational side of how we work. It is more about the how versus the what here, I guess, is what I'm saying.

But yeah this is why it does great when there are tests, design systems, evals, and other artifacts to mirror. Far more reckless and unpredictable without these things, but still great for exploration and finding the data output you seek.

▲

withinboredom 2 hours ago | parent [-]

Doesn't that make sense? It's text prediction. If you give it examples, it can predict. Synthesizing "put semi-colons on new lines" requires it to generate its own examples 'in its head' (so to speak) and remember that. It won't.

It's like when I see people feeding it a whole bunch of "best practices" and expect it to follow them. It won't. But you could ask it questions about the best practices all day long.

	▲	brandensilva an hour ago | parent [-]
		Yes, exactly. Any engineer deep in this stuff right now understands that it is a grounded predictive engine sprinkled with RL training, and is discovering what that means in terms of its strengths and weaknesses for company use.

▲

mikeyouse 4 hours ago | parent | prev | next [-]

I ran into similar issues as we started to roll out LLM-generated financials in our org. I’m so used to the old SQL workflow of “grab this data from this table, that data from that table, combine it into a final result that looks like xxxx”, where the tables were outputs from reports in our ERP, but I was having terrible results.

Ended up pointing Claude at a few sample files from our existing reporting, gave it read-only oauth access to the ERP and said “build a new report showing the cash by project as calculated by xxxx - yyyy + zzzz in the style of the existing reports” and it basically one-shot from there.

Kind of crazy and I built a bunch of redundant check-sums because I honestly didn’t think it would be able to replace like 6 workdays of effort for the 2 FTEs who generate that kind of thing manually every month but so far so good..
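A rough sketch of what such a redundant check could look like, assuming the source rows can be pulled independently of the generated report. The field names `xxxx`, `yyyy`, and `zzzz` are the placeholders from the comment above, kept as-is; the row layout, the function names, and the tolerance are all hypothetical.

```python
from decimal import Decimal


def recompute_cash(rows: list[dict]) -> dict[str, Decimal]:
    """Independently recompute cash per project as xxxx - yyyy + zzzz."""
    totals: dict[str, Decimal] = {}
    for row in rows:
        value = row["xxxx"] - row["yyyy"] + row["zzzz"]
        totals[row["project"]] = totals.get(row["project"], Decimal("0")) + value
    return totals


def find_mismatches(
    report: dict[str, Decimal],
    rows: list[dict],
    tolerance: Decimal = Decimal("0.01"),  # assumed rounding allowance
) -> dict[str, tuple]:
    """Return {project: (expected, reported)} wherever the report disagrees."""
    expected = recompute_cash(rows)
    mismatches = {}
    for project in sorted(expected.keys() | report.keys()):
        want = expected.get(project)
        got = report.get(project)
        # A project missing on either side counts as a mismatch too.
        if want is None or got is None or abs(want - got) > tolerance:
            mismatches[project] = (want, got)
    return mismatches
```

An empty result from `find_mismatches` only says the generated report agrees with this one recomputation; it does not prove the underlying formula is the right one.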

▲

zdragnar an hour ago | parent | prev | next [-]

I've noticed something similar with AI-assisted books as well. Early on it does alright, but after some chapters the beginning of each chapter repeats the end of the previous one, and obvious LLM tells become more frequent.

The more it has to go on, the more it relies on repetition of what came before. It's also possible that authors start paying much less attention and put less effort into editing later chapters.

Despite the sheer volume on Amazon, LLMs are not at the point of writing well.

▲

piker an hour ago | parent [-]

Holy crap, are you reading books that somehow advertised they were written with LLM assistance? Hard no here in 2026.

	▲	zdragnar 8 minutes ago | parent [-]
		Oh no, they were not advertised as such. It's rather painfully obvious in the worst cases.

▲

nijave 4 hours ago | parent | prev | next [-]

Hmm, I have some anecdotal evidence this is true. While interactively working out a plan with Opus, on multiple occasions it came up with an incompatible solution; I'd add additional context/requirements, and it tended to "anchor" on its original architecture and struggle to adapt. Sometimes it tries to sneak in changes for the original plan anyway.

	▲	whstl 3 hours ago | parent | next [-]
		Opus does this waaaay too much for my taste. It works fine for vibe-coders but for technical work it is infuriating.
	▲	UncleEntity 2 hours ago | parent | prev [-]
		I think the problem is they take the shortest path to the goal... which may or may not coincide with what you have planned. Oh, and they generally think instructions are merely suggestions, that what you really want is this totally different thing and not the one in the plan you handed them, plus, as a stroke of good luck, this other system is a lot easier to implement as well. I mean, I spend more tokens having them clean up all the places they didn't follow the plan (if I catch it) or implementing what came out of a 'complete and tested' previous plan, where they just stop as soon as all the pathetic new tests pass and you discover half of it isn't even there when trying to implement the next thing on top of it. Though... I have been conducting an experiment, of sorts, where we've been cooking on these fairly complicated projects and I don't ever touch a single line of code, just yell at them a lot, and with suitable amounts of marijuana (they are very frustrating most of the time) it's been going pretty well. It also helps that they need to explain what they're doing to somebody fairly baked -- maybe not such an HR-friendly plan?

▲

Animats 2 hours ago | parent | prev | next [-]

That may be the same problem seen when prompts try to force "alignment" or "guardrails". There's a performance drop. Seemingly, a big chunk of the potential solution space has been made unreachable.

For example, if you apply "guardrails" to an image generator of about a year ago, all the people start looking alike. Story generators start using only a few standard names.

That was last year. Is it happening with the frontier models?

▲

jeremyjh 5 hours ago | parent | prev | next [-]

Even the strongest frontier model they used - GPT 5.2 - I would consider barely usable for agentic programming.

I’m not really interested in analysis of the weaknesses of such models because in my experience many weaknesses disappear entirely as models get stronger and reasoning effort is turned up. Especially if you tell them what you want them to do.

Also, it’s not surprising to learn that when more acceptance criteria are added the failure rate increases.

	▲	sigbottle 4 hours ago | parent [-]
		Wait, isn't GPT 5.2 good? Or is it not thinking / not codex? 5.2 was what sparked the late 2025 OpenAI agentic programming revolution.

▲

xienze 4 hours ago | parent | prev [-]

> their performance drops when forced to navigate explicit architectural rules

Even the best models have trouble adhering to stuff as mundane as rules for how to style generated code (indent this much, name things with these patterns, etc.). Even the most die-hard AI-first coder will admit to that kind of stuff being not unheard-of. Yet they still delude themselves into thinking that these models will follow a sufficiently detailed spec to the letter, every time.