aliljet 4 days ago

This is definitely one of my CORE problems as I use these tools for "professional software engineering." I really desperately need LLMs to maintain extremely effective context, and it's not actually that interesting to see a new model that's marginally better than the last one (for my day-to-day).

However. Price is king. Allowing me to flood the context window with my code base is great, but given that the price has substantially increased, it makes sense to manage the context window more carefully in the current situation. The value I get from flooding their context window is great for them, but short of evals that look at how effectively Sonnet stays on track, it's not clear the value actually exists here.

ants_everywhere 3 days ago | parent | next [-]

> I really desperately need LLMs to maintain extremely effective context

The context is in the repo. An LLM will never have the context you need to solve all problems. Large enough repos don't fit on a single machine.

There's a tradeoff just like in humans where getting a specific task done requires removing distractions. A context window that contains everything makes focus harder.

For a long time context windows were too small, and they probably still are. But models have to get better at understanding the repo by asking the right questions.

onion2k 3 days ago | parent | next [-]

> Large enough repos don't fit on a single machine.

I don't believe any human can understand a problem if they need to fit the entire problem domain in their head, let alone a domain whose scope doesn't fit on a single computer. You have to break it down into a manageable amount of information and tackle it in chunks.

If a person can do that, so can an LLM prompted to do that by a person.

ehnto 3 days ago | parent | next [-]

I disagree. I may not have the whole codebase in my head at any one moment, but I have had all of it in my head at some point, and it is still there; that is not true of an LLM. I use LLMs and am impressed by them, but they just do not approximate a human in this particular area.

My ability to break a problem down does not start from listing the files out and reading a few. I have a high level understanding of the whole project at all times, and a deep understanding of the whole project stored away that I can recall when required; this is not true of an LLM at any point.

We know this is a limitation, and it's why we have various tools attempting to approximate memory and augment training on the fly, but they are approximations and, in my opinion, not even close to real human memory and depth of understanding for data the model was not trained on.

Even for mutations of scenarios it was trained on, and code is a great example of that: it is trained on billions of lines of code, yet still fails to understand my codebase intuitively. I have definitely not read billions of lines of code.

onion2k 3 days ago | parent | next [-]

> My ability to break a problem down does not start from listing the files out and reading a few.

If you're completely new to the problem then ... yes, it does.

You're assuming that you're working on a project that you've spent time on and learned the domain for, and then comparing that to an LLM being prompted to look at a codebase with only the files as context. Those things are not the same, though.

A closer analogy to LLMs would be prompting one with questions when it has access (either through MCP or training) to the project's git history, documentation, notes, issue tracker, etc. When that sort of thing is commonplace, and LLMs have the context window size to take advantage of all that information, I suspect we'll be surprised how good they are, even given the results we get today.

ehnto 3 days ago | parent [-]

> If you're completely new to the problem then ... yes, it does.

Of course, because I am not new to the problem, whereas an LLM is new to it with every new prompt. I am not really trying to find a fair comparison, because I believe humans have an unfair advantage in this instance and am trying to make that point, rather than compare like-for-like abilities. I think we'll find that even with all the context clues from MCPs, history, etc., they might still fail to have the insight to recall the right data into the context, but that's just a feeling I have from working with Claude Code for a while. I instruct it to do those things, like look through the git log, check the documentation, etc., and it sometimes finds a path through to an insight, but it's just as likely to get lost.

I alluded to it somewhere else but my experience with massive context windows so far has just been that it distracts the LLM. We are usually guiding it down a path with each new prompt and have a specific subset of information to give it, and so pumping the context full of unrelated code at the start seems to derail it from that path. That's anecdotal, though I encourage you to try messing around with it.

As always, there's a good chance I will eat my hat some day.

scott_s 3 days ago | parent [-]

> Of course, because I am not new to the problem, whereas an LLM is new to it every new prompt.

That is true for the LLMs you have access to now. Now imagine if the LLM had been trained on your entire code base. And not just the code, but the entire commit history, commit messages and also all of your external design docs. And code and docs from all relevant projects. That LLM would not be new to the problem every prompt. Basically, imagine that you fine-tuned an LLM for your specific project. You will eventually have access to such an LLM.

snowfield 2 days ago | parent | next [-]

AI training doesn't work like that. You don't train it on context, you train it on recognition and patterns.

scott_s 2 days ago | parent [-]

You train on data. Context is also data. If you want a model to have certain data, you can bake it into the model during training, or provide it as context during inference. But if the "context" you want the model to have is big enough, you're going to want to train (or fine-tune) on it.

Consider that you're coding a Linux device driver. If you ask for help from an LLM that has never seen the Linux kernel code, has never seen a Linux device driver and has never seen all of the documentation from the Linux kernel, you're going to need to provide all of this as context. And that's both going to be onerous on you, and it might not be feasible. But if the LLM has already seen all of that during training, you don't need to provide it as context. Your context may be as simple as "I am coding a Linux device driver" and show it some of your code.

jimbokun 3 days ago | parent | prev [-]

Why haven’t the big AI companies been pursuing that approach, vs just ramping up context window size?

menaerus 2 days ago | parent | next [-]

Well, we don't really know if they aren't doing exactly that for their internal code repos, right?

Conceptually, there is no difference between fine-tuning an LLM to be a law expert for a specific country and fine-tuning an LLM to be an expert on a given codebase. The former is already happening and is public. The latter is not yet public, but I believe it is happening.

The reason big companies are pursuing generic LLMs is that they serve as a foundation for basically any other derivative and domain-specific work.

scott_s 2 days ago | parent | prev [-]

Because one family of models trained with very large context windows can be offered to the entire world as an online service. That is a very different business model from training or fine-tuning individual models for individual customers. Someone will figure out how to do that at scale, eventually. It might require the cost of training to come down significantly. But large companies with the resources to do this for themselves will do it, and many are doing it.

ehnto 3 days ago | parent | prev | next [-]

Additionally, the more information you put into the context, the more confused the LLM will get. If you did dump the whole codebase into the context, it would not suddenly understand the whole thing. It is still an LLM; all you have done is pollute the context with a million lines of unrelated code and some lines of related code, which it will struggle to find in the noise (in my experience with much smaller experiments).

Bombthecat 3 days ago | parent [-]

I call this context decay. :)

The bigger the context, the more stuff "decays", sometimes into completely different meanings.

PaulDavisThe1st 3 days ago | parent | prev | next [-]

> I disagree, I may not have the whole codebase in my head in one moment but I have had all of it in my head at some point, and it is still there, that is not true of an LLM.

All 3 points (you have had all of it your head at some point, it is still there, that is not true of an LLM) are mere conjectures, and not provable at this time, certainly not in the general case. You may be able to show this of some codebases for some developers and for some LLMs, but not all.

fnordsensei 3 days ago | parent | next [-]

The brain can literally not process any piece of information without being changed by the act of processing it. Neuronal pathways are constantly being reinforced or weakened.

Even remembering alters the memory being recalled, entirely unlike how computers work.

Lutger 3 days ago | parent | next [-]

I've always found it interesting that once I take a wrong turn finding my way through the city, and I'm not deliberate about remembering that this was, in fact, a mistake, I am more prone to taking the same wrong turn again the next time.

dberge 3 days ago | parent [-]

> once I take a wrong turn finding my way through the city... I am more prone to taking the same wrong turn again

You may want to stay home then to avoid getting lost.

johnisgood 3 days ago | parent | prev [-]

For humans, remembering strengthens that memory, even if it is dead wrong.

jbs789 3 days ago | parent | prev | next [-]

I'm not sure the idea that a developer maintains a high level understanding is all that controversial...

animuchan 3 days ago | parent [-]

The trend for this idea's controversiality is shown on this very small chart: /

ehnto 3 days ago | parent | prev [-]

I never intended to say it was true of all codebases for all developers, that would make no sense. I don't know all developers.

I think it's objectively true that the information is not in the LLM. It did not have all codebases to train with, and they do not (immediately) retrain on the codebases they encounter through usage.

xwolfi 3 days ago | parent | prev | next [-]

You have only worked on very small codebases then. When you work on giant ones, you Ctrl+F a lot, build a limited model of the problem space, and pray the unit tests will catch anything you might have missed...

akhosravian 3 days ago | parent | next [-]

And when you work on a really big codebase you start having multiple files and have to learn tools more advanced than ctrl-f!!

ghurtado 3 days ago | parent [-]

> and have to learn tools more advanced than ctrl-f!!

Such as ctrl-shift-f

But this is an advanced topic, I don't wanna get into it

ehnto 3 days ago | parent | prev | next [-]

We're measuring lengths of string, but I would not say I have worked on small projects. I am very familiar with discovery, and have worked just fine on plenty of large legacy projects that have no tests.

jimbokun 3 days ago | parent | prev [-]

Why are LLMs so bad at doing the same thing?

airbreather 3 days ago | parent | prev | next [-]

You will have abstractions - black boxing, interface overviews, etc. Humans can only hold so much detail in current context memory; some say 7 items on average.

ehnto 3 days ago | parent | next [-]

Of course, but even those black boxes are not empty; they've got a vague picture inside them based on prior experience. I have been doing this for a while, so most things are just various flavours of the same stuff, especially in enterprise software.

The important thing in this context is that I know it's all there, I don't have to grep the codebase to fill up my context, and my understanding of the holistic project does not change each time I am booted up.

jimbokun 3 days ago | parent | prev [-]

And LLMs can’t leverage these abstractions nearly as well as humans…so far.

ivape 3 days ago | parent | prev [-]

> My ability to break a problem down does not start from listing the files out and reading a few.

It does, it’s just happening at lightning speed.

CPLX 3 days ago | parent [-]

We don't actually know that.

If we had that level of understanding of how exactly our brains do what they do things would be quite different.

krainboltgreene 3 days ago | parent | prev | next [-]

I have an entire life worth of context and I still remember projects I worked on 15 years ago.

adastra22 3 days ago | parent [-]

Not with pixel perfect accuracy. You vaguely remember, although it may not feel like that because your brain fills in the details (hallucinates) as you recall. The comparisons are closer than you might think.

vidarh 3 days ago | parent | next [-]

The comparison would be apt if the LLM was trained on your codebase.

jimbokun 3 days ago | parent [-]

Isn’t that the problem?

I don’t see any progress on incrementally training LLMs on specific projects. I believe it’s called fine tuning, right?

Why isn’t that the default approach anywhere instead of the hack of bigger “context windows”?

gerhardi 2 days ago | parent | next [-]

I’m not well versed enough on this, but wouldn’t it be a problem with custom training that a specific project's codebase would likely have most of the domain-relevant functionality implemented only once and in one way, compared to how today's popular large models have been trained on countless different ways to use common libraries for all sorts of tasks, from whatever GitHub-ripped material was fed in?

adastra22 2 days ago | parent | prev [-]

Because fine-tuning can be used to remove restrictions from a model, they don't give us plebs access to that.

krainboltgreene 3 days ago | parent | prev [-]

You have no idea if I remember with pixel perfect accuracy (whatever that even means). There are plenty of people with photographic memory.

Also, you're a programmer; you have no foundation of knowledge on which to make that assessment. You might as well opine on quarks or Martian cellular life. My god, the arrogance of people in my industry.

adastra22 2 days ago | parent | next [-]

Repeated studies have shown that perfect "photographic memory" does not in fact exist. Nobody has it. Some people think that they do, but when tested under lab conditions those claims don't hold up.

I don't believe these people are lying. They are self-reporting their own experiences, which unfortunately have the annoying property of being generated by the very mind that is living the experience.

What does it mean to have an eidetic memory? It means that when you remember something you vividly remember details, and can examine those details to your heart's content. When you do so, it feels like all those details are correct. (Or so I'm told, I'm the opposite with aphantasia.)

But it turns out if you actually have a photo reference and do a blind comparison test, people who report photographic memories actually don't do statistically any better than others in remembering specific fine details, even though they claim that they clearly remember.

The simpler explanation is that while all of our brains provide hallucinated detail to fill the gaps in memories, their brains are wired up to make those made-up details feel much more real than they do to others. That is all.

HarHarVeryFunny 2 days ago | parent [-]

> Repeated studies have shown that perfect "photographic memory" does not in fact exist.

This may change your mind!

https://www.youtube.com/watch?v=jVqRT_kCOLI

adastra22 2 days ago | parent [-]

No, a YouTube video won’t convince me over repeated, verified lab experiments.

HarHarVeryFunny 2 days ago | parent [-]

So what do you make of the video - do you think it's fake, or are you just making the distinction between eidetic memory and photographic memory?

There are so many well documented cases of idiot savants with insane memory skills in various areas (books, music, dates/events, etc) that this type of snapshot visual memory (whatever you want to call it) doesn't seem surprising in that context - it'd really be a bit odd if such diverse memory skills excluded one sensory modality (and it seems they don't).

adastra22 2 days ago | parent [-]

I do not watch YouTube, sorry.

Hearsay is not reliable. Yes there are stories of savants. When you put them in a lab and actually see how good their memory is, it turns out to be roughly the same as everyone else's. The stories aren't true.

(They may be better at remembering weird facts or something, but when you actually calculate the information entropy of what they are remembering, it ends up being within the ballpark of what a neurotypical person remembers across a general span of life. That's why these people are idiot savants (to use your term). They allocate all their memory points to weird trivia and none to everyday common knowledge.)

HarHarVeryFunny 2 days ago | parent | next [-]

> They allocate all their memory points to weird trivia and none to everyday common knowledge.

I think it's more complex than that - it's the way they form memories (i.e. what they remember) that is different from a normal person's. In a normal person, surprise/novelty (prediction failure) is the major learning signal that causes us to remember something - we're selective in what gets remembered (this is just mechanically how a normally operating brain works), whereas the savant appears to remember everything in certain modalities.

I don't think that "using up all their memory" is why savants are "idiots", but rather just a reflection of something more severe that is wrong.

HarHarVeryFunny 2 days ago | parent | prev [-]

If you refuse to look at evidence, then your opinion isn't worth much, is it?

johnisgood 3 days ago | parent | prev [-]

> There are plenty of people with photographic memory.

I thought it was rare.

melagonster 3 days ago | parent | prev | next [-]

Sure, this is why AGI looks possible sometimes. But companies should not require their users to create AGI for them.

wraptile 3 days ago | parent | prev | next [-]

Right, the LLM doesn't need to know all of the code under utils.parse_id to know that this call will parse the ID. The best LLM results I get are when I manually define the relevant code graph of my problem, similar to how I'd imagine it in my head, which seems to provide optimal context. So bigger isn't really better.

rocqua 3 days ago | parent [-]

I wonder why we can't have one LLM generate this understanding for another? Perhaps this is where teaming of LLMs gets its value: managing high- and low-level context in different context windows.

mixedCase 3 days ago | parent [-]

This is a thing and doesn't require a separate model. You can set up custom prompts that will, based on another prompt describing the task to achieve, generate information about the codebase and a set of TODOs to accomplish the task, generating markdown files with a summarized version of the relevant knowledge and prompting you again to refine that summary if needed. You can then use these files to let the agent take over without going on a wild goose chase.
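
For anyone who wants to picture that plan-then-execute flow, here's a rough Python sketch. Everything in it (`call_llm`, the prompt wording, `PLAN.md`) is made up for illustration; swap in whatever model API or agent CLI you actually use.

    # Hypothetical sketch of the "generate a plan file, review it, then hand it
    # to the agent" workflow described above. call_llm is a stand-in, not a real API.

    from pathlib import Path

    def call_llm(prompt: str) -> str:
        """Placeholder for your actual model call (API, CLI agent, etc.)."""
        return "# Plan\n\n- TODO: fill in once wired to a real model\n"

    def make_plan(task: str, repo_summary: str, plan_path: str = "PLAN.md") -> str:
        prompt = (
            "You are preparing a work plan for a coding agent.\n"
            f"Task: {task}\n"
            f"Known repo structure:\n{repo_summary}\n"
            "List the relevant files, key constraints, and a TODO checklist. "
            "Output markdown only."
        )
        plan = call_llm(prompt)
        Path(plan_path).write_text(plan)  # review/refine this file before executing
        return plan

    make_plan("Add rate limiting to the API", "src/api/, src/middleware/, tests/")

The point is that the agent later starts from the reviewed plan file rather than re-deriving the codebase context on every run.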

friendzis 3 days ago | parent | prev [-]

Fitting the entire problem domain in their head is what engineers do.

Engineering is merely a search for the optimal solution in this multidimensional space of problem domain(s), requirements, limitations and optimization functions.

barnabee 3 days ago | parent [-]

_Good_ engineers fit their entire understanding of the problem domain in their head

The best engineers understand how big a difference that is

sdesol 3 days ago | parent | prev | next [-]

> But they have to get better at understanding the repo by asking the right questions.

How I am tackling this problem is making it dead simple for users to create analyzers that are designed to enrich text data. You can read more about how it would be used in a search at https://github.com/gitsense/chat/blob/main/packages/chat/wid...

The basic idea is, users would construct analyzers with the help of LLMs to extract the proper metadata that can be semantically searched. So when the user does an AI Assisted search with my tool, I would load all the analyzers (description and schema) into the system prompt and the LLM can determine which analyzers can be used to answer the question.

A very simplistic analyzer would be to make it easy to identify backend and frontend code so you can just use the command `!ask find all frontend files` and the LLM will construct a deterministic search that knows to match for frontend files.

mrits 3 days ago | parent [-]

How is that better than just writing a line in the md?

sdesol 3 days ago | parent [-]

I am not sure I follow what you are saying. What would the line be and how would it become deterministically searchable?

mrits 3 days ago | parent [-]

    frontend path: /src/frontend/*
    backend path: /src/*

I suppose the problem you have might be unique to NextJS?

sdesol 3 days ago | parent [-]

The issue is frontend can be a loaded question, especially if you are dealing with legacy stuff, different frameworks, etc. You also can't tell what the frontend code does by looking at that single line.

Now imagine that as part of your analyzer, you have the following instructions for the LLM:

---
For all files in `src/frontend/` treat them as frontend code. For all files in `src/` excluding `src/frontend` treat them as backend. Create a metadata field called `scope` which can be 'frontend', 'backend' or 'mix', where mix means the code can be used for both frontend and backend, like utilities.

Now for each file, create a `keywords` metadata field that includes up to 10 unique keywords that describe the core functionality of the file.
---

So with this you can say

- `!ask find all frontend files`

- `!ask find all mix use files`

- `!ask find all frontend files that does [this]`

and so forth.

The whole point of analyzers is to make it easy for the LLM to map your natural language query to a deterministic search.
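
To make that concrete, here's a rough sketch of what the extracted metadata and the compiled search could look like (an assumption for illustration, not the actual gitsense schema):

    # Hypothetical per-file metadata an analyzer might emit; a query like
    # "!ask find all frontend files" compiles to a deterministic filter over it.

    files = [
        {"path": "src/frontend/Login.vue", "scope": "frontend",
         "keywords": ["auth", "form", "session"]},
        {"path": "src/api/users.py", "scope": "backend",
         "keywords": ["users", "crud", "database"]},
        {"path": "src/utils/format.py", "scope": "mix",
         "keywords": ["dates", "strings", "formatting"]},
    ]

    def search(scope=None, keyword=None):
        return [f["path"] for f in files
                if (scope is None or f["scope"] == scope)
                and (keyword is None or keyword in f["keywords"])]

    print(search(scope="frontend"))                   # !ask find all frontend files
    print(search(scope="mix"))                        # !ask find all mix use files
    print(search(scope="frontend", keyword="auth"))   # ...frontend files that do auth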

If the code base is straightforward and follows a well known framework, asking for frontend or backend wouldn't even need an entry as you can just include in the instructions that I use framework X and the LLM would know what to consider.

stuartjohnson12 3 days ago | parent | prev | next [-]

> An LLM will never have the context you need to solve all problems.

How often do you need more than 10 million tokens to answer your query?

ants_everywhere 3 days ago | parent | next [-]

I exhaust the 1 million token context windows on multiple models multiple times per day.

I haven't used the Llama 4 10 million context window so I don't know how it performs in practice compared to the major non-open-source offerings that have smaller context windows.

But there is an induced demand effect where as the context window increases it opens up more possibilities, and those possibilities can get bottlenecked on requiring an even bigger context window size.

For example, consider the idea of storing all Hollywood films on your computer. In the 1980s this was impossible. If you store them in DVD or Bluray quality you could probably do it in a few terabytes. If you store them in full quality you may be talking about petabytes.

We recently struggled to get a full file into a context window. Now a lot of people feel a bit like "just take the whole repo, it's only a few MB".

brulard 3 days ago | parent [-]

I think you misunderstand how context in current LLMs works. To get the best results you have to be very careful to provide what is needed for immediate task progression, and postpone context that's needed later in the process. If you give all the context at once, you will likely get quite degraded output quality. It's like giving a junior developer his first task: you likely won't teach him every corner of your app, you would give him the context he needs. It is similar with these models. Those that provided 1M or 2M of context (Gemini etc.) were getting less and less useful after roughly 200k tokens in the context.

Maybe models will get better at picking up relevant information from a large context, but AFAIK that is not the case today.

remexre 3 days ago | parent | next [-]

That's a really anthropomorphizing description; a more mechanical one might be,

The attention mechanism that transformers use to find information in the context is, in its simplest form, O(n^2); for each token position, the model considers whether relevant information has been produced at the position of every other token.

To preserve performance when really long contexts are used, current-generation LLMs use various ways to consider fewer positions in the context; for example, they might only consider the 4096 "most likely" places to matter (de-emphasizing large numbers of "subtle hints" that something isn't correct), or they might have some way of combining multiple tokens worth of information into a single value (losing some fine detail).
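
To make the quadratic cost concrete, here's a toy numpy sketch of dense scaled dot-product attention (single head, no masking or batching; real kernels are written very differently). The (n, n) score matrix is the part that grows quadratically with context length:

    import numpy as np

    def attention(Q, K, V):
        # Q, K, V: (n, d) arrays for a context of n tokens.
        # The scores matrix is (n, n): every position attends to every other,
        # which is where the O(n^2) compute and memory comes from.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V  # (n, d)

    n, d = 4096, 64
    Q = K = V = np.random.randn(n, d)
    out = attention(Q, K, V)  # the 4096 x 4096 weight matrix is the bottleneck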

ants_everywhere 3 days ago | parent | prev | next [-]

> I think you misunderstand how context in current LLMs works.

Thanks but I don't and I'm not sure why you're jumping to this conclusion.

EDIT: Oh I think you're talking about the last bit of the comment! If you read the one before I say that feeding it the entire repo isn't a great idea. But great idea or not, people want to do it, and it illustrates that as context window increases it creates demand for even larger context windows.

brulard 2 days ago | parent [-]

I said that based on you saying you exhaust million-token context windows easily. I'm no expert on that, but I think the current state of LLMs works best if you are not approaching that 1M token limit, because large context (reportedly) deteriorates response quality quickly. I think state-of-the-art usage is managing context in the tens or low hundreds of thousands of tokens at most and taking advantage of splitting tasks across subtasks in time, or splitting context across multiple "expert" agents (see sub-agents in Claude Code).

jimbokun 3 days ago | parent | prev [-]

It seems like LLM need to become experts at managing their OWN context.

Selectively grepping and searching the code to pull into context only those parts relevant to the task at hand.

brulard 2 days ago | parent [-]

That's what I'm thinking about a lot. Something like how the models "activate" just some subset of parameters when working (if I understand the new models correctly). The model could similarly activate only the parts of the context which are relevant to the task at hand.

rocqua 3 days ago | parent | prev [-]

It doesn't take me 10000000 tokens to have the context "this was the general idea of the code, these were unimportant implementation details, and this is where lifetimes were tricky."

And that context is the valuable bit for quickly getting back up to speed on a codebase.

injidup 3 days ago | parent | prev | next [-]

All the more reason for good software engineering: folders of files managing one concept, files tightly focussed on sub-problems of that concept. Keep your code so that you can solve problems in self-contained context windows at the right level of abstraction.

Sharlin 3 days ago | parent [-]

I fear that LLM-optimal code structure is different from human-optimal code structure, and people are starting to optimize for the former rather than the latter.

mock-possum 3 days ago | parent | prev | next [-]

> The context is in the repo

Agreed but that’s a bit different from “the context is the repo”

It’s been my experience that usually just picking a couple files out to add to the context is enough - Claude seems capable of following imports and finding what it needs, in most cases.

I’m sure it depends on the task, and the structure of the codebase.

manmal 3 days ago | parent | prev | next [-]

> The context is in the repo

No it’s in the problem at hand. I need to load all related files, documentation, and style guides into the context. This works really well for smaller modules, but currently falls apart after a certain size.

alvis 3 days ago | parent | prev [-]

Everything in context hurts focus. It's like people suffering from hyperthymesia: they easily get distracted when they recall something.

rootnod3 4 days ago | parent | prev | next [-]

Flooding the context also means increasing the likelihood of the LLM confusing itself. Mainly because of the longer context. It derails along the way without a reset.

Wowfunhappy 3 days ago | parent | next [-]

I keep reading this, but with Claude Code in particular, I consistently find it gets smarter the longer my conversations go on, peaking right at the point where it auto-compacts and everything goes to crap.

This isn't always true--some conversations go poorly and it's better to reset and start over--but it usually is.

jacobr1 2 days ago | parent | next [-]

I've found there usually is some key context that is missing. Maybe it is project structure or a sampling of some key patterns from different parts of the codebase, or key data models. Getting those into CLAUDE.md reduces the need to keep building up (as large) context.

As an example, for one project I realized things were getting better after it started writing integration tests. I wasn't sure if the act of writing the tests forced it to reason about the black-box way the system would be used, or if there was another factor. Turns out it was just example usage. Extracting the usage patterns into both the README and CLAUDE.md was itself a simple request, and then I got similar performance on new tasks.

will_pseudonym 3 days ago | parent | prev [-]

This is my exact experience as well. I wonder if I should switch to using Sonnet so that I can have more time before auto-compact gets forced on me.

aliljet 4 days ago | parent | prev [-]

How do you know that?

bigmadshoe 4 days ago | parent | next [-]

https://research.trychroma.com/context-rot

joenot443 3 days ago | parent | next [-]

This is a good piece. Clearly it's a pretty complex problem, and the intuitive result a layman engineer like myself might expect doesn't reflect the reality of LLMs. Regex works as reliably on 20 characters as it does on 2M characters; the only difference is speed. I've learned this will probably _never_ be the case with LLMs; there will forever exist some level of epistemic doubt in their results.

When they announced Big Contexts in 2023, they referenced being able to find a single changed sentence in the context's copy of Great Gatsby[1]. This example seemed _incredible_ to me at the time but now two years later I'm feeling like it was pretty cherry-picked. What does everyone else think? Could you feed a novel into an LLM and expect it to find the single change?

[1] https://news.ycombinator.com/item?id=35941920

bigmadshoe 3 days ago | parent | next [-]

This is called a "needle in a haystack" test, and all the 1M context models perform perfectly on this exact problem, at least when your prompt and the needle are sufficiently similar.

As the piece above references, this is a totally insufficient test for the real world. Things like "find two unrelated facts tied together by a question, then perform reasoning based on them" are much harder.

Scaling context properly is O(n^2). I'm not really up to date on what people are doing to combat this, but I find it hard to believe the jump from 100k -> 1m context window involved a 100x (10^2) slowdown, so they're probably taking some shortcut.

adastra22 3 days ago | parent | prev [-]

Depends on the change.

dang 3 days ago | parent | prev [-]

Discussed here:

Context Rot: How increasing input tokens impacts LLM performance - https://news.ycombinator.com/item?id=44564248 - July 2025 (59 comments)

rootnod3 4 days ago | parent | prev | next [-]

The longer the context and the discussion goes on, the more it can get confused, especially if you have to refine the conversation or code you are building on.

Remember, in its core it's basically a text prediction engine. So the more varying context there is, the more likely it is to make a mess of it.

Short context: the conversation leaves the context window and it loses context. Long context: it can mess with the model. So the trick is to strike a balance. But if it's an online model, you have fuck all to control. If it's a local model, you have some say in the parameters.

giancarlostoro 3 days ago | parent | prev | next [-]

Here's a paper from MIT that covers how this could be resolved in an interesting fashion:

https://hanlab.mit.edu/blog/streamingllm

The AI field is reusing existing CS concepts for AI that we never had hardware for, and now these people are learning how applied Software Engineering can make their theoretical models more efficient. It's kind of funny, I've seen this in tech over and over. People discover new thing, then optimize using known thing.
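
My rough understanding of the StreamingLLM idea, in a few lines (a simplification for illustration, not the paper's actual code): keep a handful of initial "attention sink" tokens plus a sliding window of recent tokens in the KV cache and evict the middle. That caps the cache size, but it does not give the model recall of what was evicted.

    # Simplified sketch of an attention-sink + sliding-window cache policy.
    # Numbers are arbitrary; the real implementation operates on KV tensors.

    def evict(cache, n_sink=4, window=2044):
        # cache: per-token entries, oldest first
        if len(cache) <= n_sink + window:
            return cache
        return cache[:n_sink] + cache[-window:]

    cache = list(range(10_000))   # stand-in for 10k cached token entries
    cache = evict(cache)
    print(len(cache))             # 2048 = 4 sink tokens + 2044 recent tokens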

kridsdale3 3 days ago | parent | next [-]

The fact that this is happening is where the tremendous opportunity to make money as an experienced Software Engineer currently lies.

For instance, a year or two ago, the AI people discovered "cache". Imagine how many millions the people who implemented it earned for that one.

nxobject 6 hours ago | parent | next [-]

What we need are "idea dice" or "concept dice" for CS – each side could have a vague architectural nudge like "parallelize", "interpret", "precompute", "predict and unwind", "declarative"...

giancarlostoro 3 days ago | parent | prev [-]

I've been thinking the same, and it's things that you don't need some crazy ML degree to know how to do... A lot of the algorithms have been known... for a while now... Milk it while you can.

mamp 3 days ago | parent | prev [-]

Unfortunately, I think the context rot paper [1] found that the performance degradation when context increased still occurred in models using attention sinks.

1. https://research.trychroma.com/context-rot

giancarlostoro 3 days ago | parent [-]

Saw that paper but have not had a chance to read it yet. Are there other techniques that help then? I assume there's a few different ones used.

anonz4FWNqnX 4 days ago | parent | prev | next [-]

I've had similar experiences. I've gone back and forth between running models locally and using the commercial models. The local models can be incredibly useful (gemma, qwen), but they need more patience and work to get them to work.

One advantage to running locally[1] is that you can set the context length manually and see how well the LLM uses it. I don't have an exact experience to relay, but it's not unusual for models to allow longer contexts but ignore that context.

Just making the context big doesn't mean the LLM is going to use it well.

[1] I've been using LM Studio on both a MacBook Air and a MacBook Pro. Even a MacBook Air with 16G can run pretty decent models.

nomel 3 days ago | parent [-]

A good example of this was the first Gemini model that allowed 1 million tokens, but would lose track of the conversation after a couple paragraphs.

EForEndeavour 4 days ago | parent | prev | next [-]

https://onnyunhui.medium.com/evaluating-long-context-lengths...

F7F7F7 4 days ago | parent | prev | next [-]

What do you think happens when things start falling outside of its context window? It loses access to parts of your conversation.

And that’s why it will gladly rebuild the same feature over and over again.

fkyoureadthedoc 3 days ago | parent | prev | next [-]

https://github.com/adobe-research/NoLiMa

lightbendover 4 days ago | parent | prev [-]

[dead]

alexchamberlain 4 days ago | parent | prev | next [-]

I'm not sure how, and maybe some of the coding agents are doing this, but we need to teach the AI to use abstractions, rather than the whole code base, for context. We as humans don't hold the whole codebase in our heads, and we shouldn't expect the AI to either.

LinXitoW 3 days ago | parent | next [-]

They already do, or at least Claude Code does. It will search for a method name, then only load a chunk of that file to get the method signature, for example.

It will use the general information you give it to make educated guesses about where things are. If it knows the code is Vue based and it has to do something with "users", it might search for "src/*/User.vue".

This is also the reason why the quality of your code makes such a large difference. The more consistent the naming of files and classes, the better the AI is at finding them.
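
The retrieval pattern is roughly something like this sketch (the helper and its parameters are invented for illustration, not Claude Code's actual implementation): search the tree for a symbol, then pull in only a small window of lines around each hit rather than whole files.

    # Sketch of "grep, then load a chunk around the hit" retrieval.

    import re
    from pathlib import Path

    def find_chunks(root, symbol, ext=".py", radius=10):
        pattern = re.compile(re.escape(symbol))
        for path in Path(root).rglob(f"*{ext}"):
            lines = path.read_text(errors="ignore").splitlines()
            for i, line in enumerate(lines):
                if pattern.search(line):
                    lo, hi = max(0, i - radius), i + radius + 1
                    yield str(path), "\n".join(lines[lo:hi])

    # e.g. pull ~20-line windows around every mention of "parse_id"
    for path, chunk in find_chunks("src", "parse_id"):
        print(path, len(chunk))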

felipeerias 3 days ago | parent [-]

Claude Code can get access to a language server like clangd through a MCP server, for example https://github.com/isaacphi/mcp-language-server

sdesol 4 days ago | parent | prev | next [-]

LLMs (in their current implementation) are probabilistic, so they really need the actual code to predict the most likely next tokens. Now, loading the whole code base can be a problem in itself, since other files may negatively affect the next token.

photon_lines 3 days ago | parent | next [-]

Sorry -- I keep seeing this being used, but I'm not entirely sure how it differs from most human thinking. Most human 'reasoning' is probabilistic as well, and we rely on 'associative' networks to ingest information. In a similar manner, LLMs use association as well -- and not only that, but they are capable of figuring out patterns based on examples (just like humans are) -- read this paper for context: https://arxiv.org/pdf/2005.14165. In other words, they are capable of grokking patterns from simple data (just like humans are).

I've given various LLMs my requirements and they produced working solutions for me by simply 1) including all of the requirements in my prompt and 2) asking them to think through and 'reason' through their suggestions, and the products have always been superior to what most humans have produced.

The 'LLMs are probabilistic predictors' comments keep appearing on threads, though, and I'm not quite sure I understand them -- yes, LLMs don't have 'human context', i.e. the data needed to understand human beings, since they have not directly been fed human experiences, but for the most part LLMs are not the simple 'statistical predictors' everyone brands them to be. You can see a thorough write-up I did of what GPT is / was here if you're interested: https://photonlines.substack.com/p/intuitive-and-visual-guid...

sdesol 3 days ago | parent | next [-]

I'm not sure I would say human reasoning is 'probabilistic', unless you are taking a very far step back and saying that, based on how the person has lived, they have ingrained biases (weights) that dictate how they reason. I don't know if LLMs have a built-in scepticism like humans do, which plays a significant role in reasoning.

Regardless of whether you believe LLMs are probabilistic or not, I think what we are both saying is that context is king and what the LLM says is dictated by the context (either learned through training or introduced by the user).

photon_lines 3 days ago | parent | next [-]

'I don't know if LLMs have a built in scepticism like humans do' - humans don't have an 'in-built skepticism' -- we learn it through experience and through being taught how to 'reason' in school (and it takes a very long time to do this). You believe that this is ingrained, but you may have forgotten having to slog through most of how the world works and being tested on it when you went to school and when your parents taught you these things. On the context component: yes, context is vitally important (just as it is with humans) -- you can't produce a great solution unless you understand the 'why' behind it and how the current solution works, so I 100% agree with that.

ijidak 3 days ago | parent | prev | next [-]

For me, the way humans finish each other's sentences and often think of quotes from the same movies at the same time in conversation (when there is no clear reason for that quote to be a part of the conversation), indicates that there is a probabilistic element to human thinking.

Is it entirely probabilistic? I don't think so. But, it does seem that a chunk of our speech generation and processing is similar to LLMs. (e.g. given the words I've heard so far, my brain is guessing words x y z should come next.)

I feel like the conscious, executive mind humans have exercises some active control over our underlying probabilistic element. And LLMs lack the conscious executive.

e.g. They have our probabilistic capabilities, without some additional governing layer that humans have.

coderenegade 3 days ago | parent [-]

I think the better way to look at it is that probabilistic models seem to be an accurate model for human thought. We don't really know how humans think, but we know that they probably aren't violating information theoretic principles, and we observe similar phenomena when we compare humans with LLMs.

Workaccount2 3 days ago | parent | prev [-]

Humans have a neuro-chemical system that performs operations with electrical signals.

That's the level to look at, unless you have a dualist view of the brain (i.e. that we are channeling supernatural forces).

lll-o-lll 3 days ago | parent [-]

Yep, just like looking at a bird's feather through a microscope explains the principles of flight…

Complexity theory doesn’t have a mathematics (yet), but that doesn’t mean we can’t see that it exists. Studying the brain at the lowest levels hasn’t led to any major insights into how cognition functions.

brookst 3 days ago | parent [-]

I personally believe that quantum effects play a role and we’ll learn more once we understand the brain at that level, but I recognize that is an intuition and may well be wrong.

didibus 3 days ago | parent | prev [-]

You seem possibly more knowledgeable than me on the matter.

My impression is that LLMs predict the next token based on the prior context. They do that by having learned a probability distribution from tokens -> next-token.

Then as I understand, the models are never reasoning about the problem, but always about what the next token should be given the context.

Chain of thought just rewards them so that they aren't predicting the tokens of the final answer directly, but instead predicting the tokens of the reasoning towards the solution.

Since human language in the dataset contains text that describes many concepts and offers many solutions to problems, it turns out that predicting the text that describes the solution to a problem often ends up producing the correct solution to the problem. That this works was kind of a lucky accident and is where all the "intelligence" comes from.
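
A toy sketch of the loop being described, with a stand-in for the trained network: the model only ever maps the context so far to a distribution over the next token, and both "reasoning" text and final answers come out of repeating that one step.

    # Toy autoregressive decoding loop. next_token_probs is a placeholder for
    # the real model; here it returns a uniform distribution just for shape.

    import random

    def next_token_probs(context_tokens):
        vocab = ["the", "answer", "is", "42", "<eos>"]
        return {tok: 1.0 / len(vocab) for tok in vocab}

    def generate(prompt_tokens, max_new=20):
        tokens = list(prompt_tokens)
        for _ in range(max_new):
            probs = next_token_probs(tokens)
            tok = random.choices(list(probs), weights=list(probs.values()))[0]
            if tok == "<eos>":
                break
            tokens.append(tok)   # the sampled token becomes part of the context
        return tokens

    print(generate(["what", "is", "6", "x", "7", "?"]))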

photon_lines 3 days ago | parent | next [-]

So - in the pre-training step you are right -- they are simple 'statistical' predictors, but there are more steps involved in their training which turn them from simple predictors into models able to capture patterns and reason. I tried to come up with an intuitive overview of how they do this in the write-up, and I'm not sure I can give you a simple explanation here, but I would recommend you play around with DeepSeek and other more advanced 'reasoning' or 'chain-of-reason' models and ask them to perform tasks for you: they are not simply statistically combining information together. Many times they are able to reason through and come up with extremely advanced working solutions. To me this indicates that they are not 'accidentally' stumbling upon solutions based on statistics -- they actually are able to 'understand' what you are asking them to do and to produce valid results.

didibus 3 days ago | parent [-]

If you observe the failure modes of current models, you see that they fail in ways that align with probabilistic token prediction.

I don't mean that the textual prediction is simple, it's very advanced and it learns all kinds of relationships, patterns and so on.

But it doesn't have a real model of, or thinking process relating to, the actual problem. It thinks about what text could describe a solution that is linguistically and semantically probable.

Since human language embeds so many logical relations and ground truths, that's good enough to result in a textual description that approximates or nails the actual underlying problem.

And this is why we see them being able to solve quite advanced problems.

I admit that people are wondering now: what's different about human thinking? Maybe we do the same - you invent a probable-sounding answer and then check if it was correct, rinse and repeat until you find one that works.

But this in itself is a big conjecture. We don't really know how human thinking works. We've found a method that works well for computers and now we wonder if maybe we're just the same but scaled even higher or with slight modifications.

I've heard from ML experts, though, that they don't think so. Most seem to believe a different architecture will be needed: world models, model ensembles with various specialized models of different architectures working together, etc. That LLMs are fundamentally limited by their nature as next-token predictors.

coderenegade 3 days ago | parent | prev [-]

I think the intuitive leap (or at least, what I believe) is that meaning is encoded in the media. A given context and input encodes a particular meaning that the model is able to map to an output, and because the output is also in the same medium (tokens, text), it also has meaning. Even reasoning can fit in with this, because the model generates additional meaningful context that allows it to better map to an output.

How you find the function that does the mapping probably doesn't matter. We use probability theory and information theory, because they're the best tools for the job, but there's nothing to say you couldn't handcraft it from scratch if you were some transcendent creature.

didibus 2 days ago | parent [-]

Yes exactly.

The text of human natural language that it is trained on encodes the solutions to many problems as well as a lot of ground truths.

The way I think of it is. First you have a random text generator. This generative "model" in theory can find the solution to all problems that text can describe.

If you had a way to assert if it found the correct solution, you could run it and eventually it would generate the text that describes the working solution.

Obviously inefficient and not practical.

What if you made it so it skipped generating all text that isn't valid, sensical English?

Well now it would find the correct solution in way less iterations, but still too slow.

What if it generated only text that made sense to follow the context of the question?

Now you might start to see it 100-shot, 10-shot, maybe even 1-shot some problems.

What if you tuned that to the max? Well you get our current crop of LLMs.

What else can you do to make it better?

Tune the dataset: remove text that describes wrong answers to prior context so it learns not to generate those. Add more quality answers to prior context, add more problems/solutions, etc.

Instead of generating the answer to a mathematical equation the above way, generate the Python code to run to get the answer.

Instead of generating the answer to questions about current real world events/facts (like the weather). Have it generate the web search query to find it.

If you're asking a more complex question, instead of generating the answer directly, have it generate smaller logical steps towards the answer.

Etc.

nomel 3 days ago | parent | prev [-]

No, it doesn’t, nor do we. It’s why abstractions and documentation exist.

If you know what a function achieves, and you trust it to do that, you don’t need to see/hold its exact implementation in your head.

sdesol 3 days ago | parent [-]

But documentation doesn't include styling or preferred patterns, which is why I think a lot of people complain that the LLM will just produce garbage. Also, documentation is not guaranteed to be correct or up to date. To be able to produce the best code based on what you are hoping for, I do think having the actual code is necessary. Unless styling/design patterns are not important, then yes, documentation will suffice, provided it is accurate and up to date.

anthonypasq 4 days ago | parent | prev | next [-]

The fact we can't keep the repo in our working memory is a flaw of our brains. I can't see how you could possibly make the argument that if you were somehow able to keep the entire codebase in your head, it would be a disadvantage.

SkyBelow 3 days ago | parent [-]

Information tradeoff. Even if you could keep the entire code base in memory, if something else has to be left out of memory, then you have to consider the value of an abstraction versus whatever other information is lost. Abstractions also apply to the business domain and work the same way.

You also have time tradeoffs. Like time to access memory and time to process that memory to achieve some outcome.

There is also quality. If you can keep the entire code base in memory but with some chance of confusion, while abstractions will allow less chance of confusion, then the tradeoff of abstractions might be worth it still.

Even if we assume a memory that has no limits, can access and process all information at constant speed, and no quality loss, there is still communication limitations to worry about. Energy consumption is yet another.

4 days ago | parent | prev | next [-]
[deleted]
siwatanejo 4 days ago | parent | prev | next [-]

I do think AIs are already using abstractions, otherwise you would be submitting all the source code of your dependencies into the context.

TheOtherHobbes 3 days ago | parent [-]

I think they're recognising patterns, which is not the same thing.

Abstractions are stable, they're explicit in their domains, good abstractions cross multiple domains, and they typically come with a symbolic algebra of available operations.

Math is made of abstractions.

Patterns are a weaker form of cognition. They're implicit, heavily context-dependent, and there's no algebra. You have to poke at them crudely in the hope you can make them do something useful.

Using LLMs feels more like the latter than the former.

If LLMs were generating true abstractions they'd be finding meta-descriptions for code and language and making them accessible directly.

AGI - or ASI - may be able to do that some day, but it's not doing that now.

F7F7F7 4 days ago | parent | prev | next [-]

There are a billion and one repos that claim to help do this. Let us know when you find one.

throwaway314155 3 days ago | parent | prev [-]

/compact in Claude Code is effectively this.

brulard 3 days ago | parent [-]

Compact is a reasonable default way to do that, but quite often it discards important details. It's better to have CC store important details, decisions and reasons in a document where they can be reviewed and modified if needed.

HarHarVeryFunny 3 days ago | parent | prev | next [-]

Even 1 MB context is only roughly 20K LOC so pretty limiting, especially if you're also trying to fit API documents or any other lengthy material into the context.

Anthropic also recently said that they think that longer/compressed context can serve as an alternative (not sure what was the exact wording/characterization they used) to continual/incremental learning, so context space is also going to be competing with model interaction history if you want to avoid groundhog day and continually having to tell/correct the model the same things over and over.

It seems we're now firmly in the productization phase of LLM development, as opposed to seeing much fundamental improvement (other than math olympiad etc "benchmark" results, released to give the impression of progress). Yannic Kilcher is right, "AGI is not coming", at least not in the form of an enhanced LLM. Demis Hassabis' very recent estimate was for 50% chance of AGI by 2030 (i.e. still 15 years out).

While we're waiting for AGI, it seems a better approach to needing everything in context would be to lean more heavily on tool use, perhaps more similar to how a human works - we don't memorize the entire code base (at least not in terms of complete line-by-line detail, even though we may have a pretty clear overview of a 10K LOC codebase while we're in the middle of development) but rather rely on tools like grep and ctags to locate relevant parts of source code on an as-needed basis.

km144 3 days ago | parent | next [-]

As you alluded to at the end of your post—I'm not really convinced 20k LOC is very limiting. How many lines of code can you fit in your working mental model of a program? Certainly less than 20k concrete lines of text at any given time.

In your working mental model, you have broad understandings of the broader domain. You have broad understandings of the architecture. You summarize broad sections of the program into simpler ideas. module_a does x, module_b does y, insane file c does z, and so on. Then there is the part of the software you're actively working on, where you need more concrete context.

So as you move towards the central task, the context becomes more specific. But the vague outer context is still crucial to the task at hand. Now, you can certainly find ways to summarize this mental model in an input to an LLM, especially with increasing context windows. But we probably need to understand how we would better present these sorts of things to achieve performance similar to a human brain, because the mechanism is very different.

jacobr1 2 days ago | parent [-]

This is basically how Claude Code works today. You have it /init a description of the project structure into CLAUDE.md that is used for each invocation. There is some implicit knowledge in the project about common frameworks and languages. Then, when working on something, between the explicit and implicit knowledge and the task at hand, it will grep for relevant material in the project, load files either in full or in part, and THEN start working on the task. It dynamically builds its context of the codebase by searching for the relevant bits. Short-circuiting this by having a good project summary makes it more efficient - but you don't need to literally copy in all the code files.

HarHarVeryFunny 2 days ago | parent [-]

Interesting - thanks!

HarHarVeryFunny 3 days ago | parent | prev | next [-]

Just as a self follow-up, another motivation to lean on tool use rather than massive context (cf. short-term memory) is to keep LLM/AI written/modified code understandable to humans ...

At least part of the reason that humans use hierarchical decomposition and divide-and-conquor is presumably because of our own limited short term memory, since hierarchical organization (modules, classes, methods, etc) allows us to work on a problem at different levels of abstraction while only needing to hold that level of the hierarchy in memory.

Imagine what code might look like if written by something with no context limit - just a flat hierarchy of functions, perhaps, at least until it perhaps eventually learned, or was told, the other reasons for hierarchical and modular design/decomposition to assist in debugging and future enhancement, etc!

aorobin 3 days ago | parent | prev | next [-]

>"Demis Hassabis' very recent estimate was for 50% chance of AGI by 2030 (i.e. still 15 years out)."

2030 is only 5 years out

Zircom 3 days ago | parent [-]

That was his point lol, if someone is saying it'll happen in 5 years, triple that for a real estimate.

brookst 3 days ago | parent | prev [-]

1M tokens ~= 3.5M characters ~= 58k LOC at an average of 60 chars/line. 88k LOC at 40 chars/line
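
Those figures work out if you assume roughly 3.5 characters per token (a common rule of thumb; real tokenizers vary, especially on code):

    tokens = 1_000_000
    chars = tokens * 3.5      # ~3.5M characters, assuming ~3.5 chars/token
    print(chars / 60)         # ~58k LOC at 60 chars/line
    print(chars / 40)         # ~88k LOC at 40 chars/line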

HarHarVeryFunny 3 days ago | parent [-]

OK - I was thinking 1M chars (@ 50 chars/line) vs tokens, but I'm not sure it makes much difference to the argument. There are plenty of commercial code bases WAY bigger, and as noted other things may also be competing for space in the context.

benterix 4 days ago | parent | prev | next [-]

> it's not clear if the value actually exists here.

Having spent a couple of weeks on Claude Code recently, I arrived to the conclusion that the net value for me from agentic AI is actually negative.

I will give it another run in 6-8 months though.

ericmcer 3 days ago | parent | next [-]

Agreed, daily Cursor user.

Just got out of a 15m huddle with someone trying to understand what they were doing in a PR before they admitted Claude generated everything and it worked but they weren't sure why... Ended up ripping about 200 LoC out because what Claude "fixed" wasn't even broken.

So never let it generate code, but the autocomplete is absolutely killer. If you understand how to code in 2+ languages you can make assumptions about how to do things in many others and let the AI autofill the syntax in. I have been able to swap to languages I have almost no experience in and work fairly well because memorizing syntax is irrelevant.

daymanstep 3 days ago | parent | next [-]

> I have been able to swap to languages I have almost no experience in and work fairly well because memorizing syntax is irrelevant.

I do wonder whether your code does what you think it does. Similar-sounding keywords in different languages can have completely different meanings. E.g. the volatile keyword in Java vs C++. You don't know what you don't know, right? How do you know that the AI generated code does what you think it does?

jacobr1 3 days ago | parent | next [-]

Beyond code-gen I think some techniques are very underutilized. One can generate tests, generate docs, explain things line by line. Explicitly explaining alternative approaches and tradeoffs is helpful too. While, as with everything in this space, there are imperfections, I find a ton of value in looking beyond the code into thinking through the use cases, alternative approaches and different ways to structure the same thing.

pornel 3 days ago | parent [-]

I've wasted time debugging phantom issues due to LLM-generated tests that were misusing an API.

Brainstorming/explanations can be helpful, but also watch out for Gell-Mann amnesia. It's annoying that LLMs always sound smart whether they are saying something smart or not.

Miraste 3 days ago | parent [-]

Yes, you can't use any of the heuristics you develop for human writing to decide if the LLM is saying something stupid, because its best insights and its worst hallucinations all have the same formatting, diction, and style. Instead, you need to engage your frontal cortex and rationally evaluate every single piece of information it presents, and that's tiring.

valenterry 3 days ago | parent [-]

It's like listening to a politician or lawyer, who might talk absolute bullshit in the most persuading words. =)

spanishgum 3 days ago | parent | prev [-]

The same way I would with any of my own code - I would test it!

The key here is to spend less time searching, and more time understanding the search result.

I do think the vibe factor is going to bite companies in the long run. I see a lot of vibe code pushed by both junior and senior devs alike, where it's clear not enough time was spent reviewing the product. This behavior is being actively rewarded now, but I do think the attitude around building code as fast as possible will change if the impact on production systems turns out to be a net negative. Time will tell.

qingcharles 3 days ago | parent | prev | next [-]

The other day I caught it changing the grammar and spelling in a bunch of static strings in a totally different part of a project, for no sane reason.

bdamm 3 days ago | parent | next [-]

I've seen it do this as well. Odd things like swapping the severity level on log statements that had nothing to do with the task.

Very careful review of my commits is the only way forward, for a long time.

ericmcer 2 days ago | parent | prev [-]

That sounds similar to what it was doing here. It basically took a function like `thing = getThing(); id = thing.id` and created `id = getThingId()` and replaced hundreds of lines and made a new API endpoint.

Not a huge deal because it works, but it seems like you would have 100,000 extra lines if you let Claude do whatever it wanted for a few months.
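To make that rewrite concrete, here is a hedged TypeScript sketch of the pattern described above (the `getThing`/`getThingId` names are the placeholders from the comment, and the stubs are purely illustrative):

```typescript
type Thing = { id: string; name: string };

// Stand-in for the existing API call (illustrative only).
async function getThing(): Promise<Thing> {
  return { id: "42", name: "example" };
}

// Before: call sites fetch the whole object and read the field they need.
async function before(): Promise<string> {
  const thing = await getThing();
  return thing.id;
}

// After (the generated version): a new helper, backed in the PR by a brand-new
// API endpoint, that returns only the id. Functionally equivalent here, but it
// adds a second access path and extra API surface to maintain.
async function getThingId(): Promise<string> {
  const thing = await getThing(); // imagine this hitting the new endpoint
  return thing.id;
}
```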

senko 3 days ago | parent | prev | next [-]

> Just got out of a 15m huddle with someone trying to understand what they were doing in a PR before they admitted Claude generated everything and it worked but they weren't sure why...

But .. that's not the AI's fault. If people submit any PRs (including AI-generated or AI-assisted) without completely understanding them, I'd treat it as a serious breach of professional conduct and (gently, for first-timers) stress that this is not acceptable.

As someone hitting the "Create PR" (or equivalent) button, you accept responsibility for the code in question. If you submit slop, it's 100% on you, not on any tool used.

whatever1 3 days ago | parent | next [-]

If you have to review what the LLM wrote then there is no productivity gain.

Leadership asks for vibe coding

senko 3 days ago | parent | next [-]

> If you have to review what the LLM wrote then there is no productivity gain.

I do not agree with that statement.

> Leadership asks for vibe coding

Leadership always asks for more, better, faster.

mangamadaiyan 3 days ago | parent [-]

> Leadership always asks for more, better, faster.

More and faster, yes. Almost never better.

swat535 3 days ago | parent | prev | next [-]

> If you have to review what the LLM wrote then there is no productivity gain.

You always have to review the code, whether it's written by another person, yourself or an AI.

I'm not sure how this translates into the loss of productivity?

Did you mean to say that the code AI generates is difficult to review? In those cases, it's the fault of the code author and not the AI.

Using AI like any other tool requires experience and skill.

fooster 3 days ago | parent | prev | next [-]

I suggest you upgrade your code review skill. I find it vastly quicker in most cases to review code than write it in the first place.

whatever1 3 days ago | parent [-]

Anyone can skim code and type “looks good to me”.

epolanski 3 days ago | parent | prev | next [-]

> If you have to review what the LLM wrote then there is no productivity gain

Stating something with confidence does not make it automatically true.

WolfeReader 3 days ago | parent | prev [-]

I've seen AI create incorrect solutions and deceptive variable names. Reviewing the code is absolutely necessary.

draxil 3 days ago | parent | prev [-]

But it's pretty much a given at this point that if you use agents to code for any length of time, it starts to atrophy your ability to understand what's going on. So, yeah, it's a bit of a devil's chalice.

epolanski 3 days ago | parent | prev [-]

You're blaming the tool and not the tool user.

cambaceres 4 days ago | parent | prev | next [-]

For me it’s meant a huge increase in productivity, at least 3X.

Since so many claim the opposite, I’m curious what you do, more specifically? I guess different roles/technologies benefit more from agents than others.

I build full stack web applications in node/.net/react; more importantly (I think), I work at a small startup and manage 3 applications myself.

wiremine 3 days ago | parent | next [-]

> Having spent a couple of weeks on Claude Code recently, I arrived at the conclusion that the net value for me from agentic AI is actually negative.

> For me it’s meant a huge increase in productivity, at least 3X.

How do we reconcile these two comments? I think that's a core question of the industry right now.

My take, as a CTO, is this: we're giving people new tools, and very little training on the techniques that make those tools effective.

It's sort of like we're dropping trucks and airplanes on a generation that only knows walking and bicycles.

If you've never driven a truck before, you're going to crash a few times. Then it's easy to say "See, I told you, this new fangled truck is rubbish."

Those who practice with the truck are going to get the hang of it, and figure out two things:

1. How to drive the truck effectively, and

2. When NOT to use the truck... when walking or the bike is actually the better way to go.

We need to shift the conversation to techniques, and away from the tools. Until we do that, we're going to be forever comparing apples to oranges and talking around each other.

weego 3 days ago | parent | next [-]

In a similar role and place with this.

My biggest take so far: If you're a disciplined coder who can handle 20% of an entire project's time (a project being anything from a bug fix through to an entire app) being spent on research, planning and breaking those plans into phases and tasks, then augmenting your workflow with AI appears to yield large gains in productivity.

Even then you need to learn a new version of explaining it 'out loud' to get proper results.

If you're more inclined to dive in and plan as you go, and store the scope of the plan in your head because "it's easier that way" then AI 'help' will just fundamentally end up in a mess of frustration.

t0mas88 3 days ago | parent | next [-]

For me it has a big positive impact on two sides of the spectrum and not so much in the middle.

One end is larger complex new features where I spend a few days thinking about how to approach it. Usually most thought goes into how to do something complex with good performance that spans a few apps/services. I write a half page high level plan description, a set of bullets for gotchas and how to deal with them and list normal requirements. Then let Claude Code run with that. If the input is good you'll get a 90% version and then you can refactor some things or give it feedback on how to do some things more cleanly.

The other end of the spectrum is "build this simple screen using this API, like these 5 other examples". It does those well because it's almost advanced autocomplete mimicking your other code.

Where it doesn't do well for me is in the middle between those two. Some complexity, not a big plan, and not simple enough to just repeat something existing. For those things it makes a mess, or you end up writing so many instructions/prompts that you could have just done it yourself.

cmdli 3 days ago | parent | prev [-]

My experience has been entirely the opposite as an IC. If I spend the time to delve into the code base to the point that I understand how it works, AI just serves as a mild improvement in writing code as opposed to implementing it normally, saving me maybe 5 minutes on a 2 hour task.

On the other hand, I’ve found success when I have no idea how to do something and tell the AI to do it. In that case, the AI usually does the wrong thing but it can oftentimes reveal to me the methods used in the rest of the codebase.

zarzavat 3 days ago | parent | next [-]

Both modes of operation are useful.

If you know how to do something, then you can give Claude the broad strokes of how you want it done and -- if you give enough detail -- hopefully it will come back with work similar to what you would have written. In this case it's saving you on the order of minutes, but those minutes add up. There is a possibility for negative time saving if it returns garbage.

If you don't know how to do something then you can see if an AI has any ideas. This is where the big productivity gains are, hours or even days can become minutes if you are sufficiently clueless about something.

bcrosby95 3 days ago | parent | next [-]

Claude will point you to the right neighborhood but the wrong house. So if you're completely ignorant that's cool. But recognize that it's probably wrong and only a starting point.

Hell, I spent 3 hours "arguing" with Claude the other day in a new domain because my intuition told me something was true. I brought out all the technical reasons why it was fine, but Claude kept skirting around it, saying the code change was wrong.

After spending extra time researching it I found out there was a technical term for it and when I brought that up Claude finally admitted defeat. It was being a persistent little fucker before then.

My current hobby is writing concurrent/parallel systems. Oh god AI agents are terrible. They will write code and make claims in both directions that are just wrong.

hebocon 3 days ago | parent | next [-]

> After spending extra time researching it I found out there was a technical term for it and when I brought that up Claude finally admitted defeat. It was being a persistent little fucker before then.

Whenever I feel like I need to write "Why aren't you listening to me?!" I know it's time for a walk and a change in strategy. It's also a good indicator that I'm changing too much at once and that my requirements are too poorly defined.

3 days ago | parent [-]
[deleted]
zarzavat 3 days ago | parent | prev [-]

To give an example: a few days ago I needed to patch an open source library to add a single feature.

This is a pathologically bad case for a human. I'm in an alien codebase, I don't know where anything is. The library is vanilla JS (ES5 even!) so the only way to know the types is to read the function definitions.

If I had to accomplish this task myself, my estimate would be 1-2 days. It takes time to read the code, get oriented, understand what's going on, etc.

I set Claude on the problem. Claude diligently starts grepping, it identifies the source locations where the change needs to be made. After 10 minutes it has a patch for me.

Does it do exactly what I wanted it to do? No. But it does all the hard work. Now that I have the scaffolding it's easy to adapt the patch to do exactly what I need.

On the other hand, yesterday I had to teach Claude that writing a loop of { writeByte(...) } is not the right way to copy a buffer. Claude clearly thought that it was being very DRY by not having to duplicate the bounds check.
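For concreteness, a minimal Node/TypeScript illustration of that correction (the original was presumably in another language with a `writeByte` API, so treat this as a sketch of the idea rather than the actual code):

```typescript
const src = Buffer.from([1, 2, 3, 4, 5]);
const dst = Buffer.alloc(src.length);

// What the model wrote: copy one byte at a time, re-running a bounds check
// on every iteration.
for (let i = 0; i < src.length; i++) {
  dst[i] = src[i];
}

// The idiomatic fix: a single bulk copy.
src.copy(dst);
```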

I remain sceptical about the vibe coders burning thousands of dollars using it in a loop. It's hardworking but stupid.

3 days ago | parent | prev | next [-]
[deleted]
hirako2000 3 days ago | parent | prev | next [-]

The issue is that you would be not just clueless, but also naive about the correctness of what it did.

Knowing what to do at least you can review. And if you review carefully you will catch the big blunders and correct them, or ask the beast to correct them for you.

> Claude, please generate a safe random number. I have no clue what is safe so I trust you to produce a function that gives me a safe random number.

Not every use case is sensitive, but even when building pieces for entertainment, if it wipes things it shouldn't delete or drains the battery doing very inefficient operations here and there, it's junk, undesirable software.

jacobr1 3 days ago | parent | prev [-]

And importantly, the cycle time on this stuff can be much faster. Trying out different variants, and iterating through larger changes, can be huge.

teaearlgraycold 3 days ago | parent | prev [-]

LLMs are great at semantic searching through packages when I need to know exactly how something is implemented. If that’s a major part of your job then you’re saving a ton of time with what’s available today.

gwd 3 days ago | parent | prev | next [-]

> How do we reconcile these two comments? I think that's a core question of the industry right now.

The question is, for those people who feel like things are going faster, what's the actual velocity?

A month ago I showed it a basic query of one resource I'd rewritten to use a "query builder" API. Then I showed it the "legacy" query of another resource, and asked it to do something similar. It managed to get very close on the first try, and with only a few more hours of tweaking and testing managed to get a reasonably thorough test suite to pass. I'm sure that took half the time it would have taken me to do it by hand.

Fast forward to this week, when I ran across some strange bugs, and had to spend a day or two digging into the code again, and do some major revision. Pretty sure those bugs wouldn't have happened if I'd written the code myself; but even though I reviewed the code, they went under the radar, because I hadn't really understood the code as well as I thought I had.

So was I faster overall? Or did I just offload some of the work to myself at an unpredictable point in the future? I don't "vibe code": I keep a tight rein on the tool and review everything it's doing.

Gigachad 3 days ago | parent | next [-]

Pretty much. We are in an era of vibe efficiency.

If programmers really did get 3x faster, why hasn't software improved any faster than it always has?

lfowles 3 days ago | parent [-]

Probably because we're attempting to make 3x more products

sarmasamosarma 3 days ago | parent | prev [-]

[dead]

delegate 3 days ago | parent | prev | next [-]

Easy. You're 3x more productive for a while and then you burn yourself out.

Or lose control of the codebase, which you no longer understand after weeks of vibing (since we can only think and accumulate knowledge at 1x).

Sometimes the easy way out is throwing a week of generated code away and starting over.

So that 3x doesn't come for free at all: besides API costs, there's the cost of quickly accumulating tech debt, which you have to pay if this is a long term project.

For prototypes, it's still amazing.

brulard 3 days ago | parent | next [-]

You conflate efficient usage of AI with "vibing". Code can be written by AI and still follow the agreed-upon structures and rules and still can and should be thoroughly reviewed. The 3x absolutely does not come for free. But the price may have been paid in advance by learning how to use those tools best.

I agree the vibe-coding mentality is going to be a major problem. But can't every tool be used both well and badly?

Aeolun 3 days ago | parent | prev [-]

> Or lose control of the codebase, which you no longer understand after weeks of vibing (since we can only think and accumulate knowledge at 1x).

I recognize this, but at the same time, I’m still better at remembering the scope of the codebase than Claude is.

If Claude gets a 1M context window, we can start sticking a general overview of the codebase into every single prompt.

quikoa 3 days ago | parent | prev | next [-]

It's not just about the programmer and his experience with AI tools. The problem domain and programming language(s) used for a particular project may have a large impact on how effective the AI can be.

vitaflo 3 days ago | parent | next [-]

But even on the same project with the same tools, the way a dev derives satisfaction from their work can play a big role. Some devs derive satisfaction from getting work done and care less about the code as long as it works. Others derive satisfaction from writing well architected and maintainable code. One can guess how each will react to how LLMs fit into their day to day lives.

wiremine 3 days ago | parent | prev [-]

> The problem domain and programming language(s) used for a particular project may have a large impact on how effective the AI can be.

100%. Again, if we only focus on things like context windows, we're missing the important details.

jeremy_k 3 days ago | parent | prev | next [-]

Well put. It really does come down to nuance. I find Claude is amazing at writing React / Typescript. I mostly let it do its own thing and skim the results after. I have it write Storybook components so I can visually confirm things look how I want. If something isn't quite right I'll take a look, and if I can spot the problem and fix it myself, I'll do that. If I can't quickly spot it, I'll write up a prompt describing what is going on and work through it with AI assistance.

Overall, React / Typescript I heavily let Claude write the code.

The flip side of this is my server code is Ruby on Rails. Claude helps me a lot less here because this is my primary coding background. I also have a certain way I like to write Ruby. In these scenarios I'm usually asking Claude to generate tests for code I've already written and supplying lots of examples in context so the coding style matches. If I ask Claude to write something novel in Ruby I tend to use it as more of a jumping off point. It generates, I read, I refactor to my liking. Claude is still very helpful, but I tend to do more of the code writing for Ruby.

Overall, helpful for Ruby, I still write most of the code.

These are the nuances I've come to find and what works best for my coding patterns. But to your point, if you tell someone "go use Claude" and they have a preference in how to write Ruby and they see Claude generate a bunch of Ruby they don't like, they'll likely dismiss it as "This isn't useful. It took me longer to rewrite everything than just doing it myself". Which all goes to say, time using the tools, whether it's Cursor, Claude Code, etc (I use OpenCode), is the biggest key, but figuring out how to get over the initial hump is probably the biggest hurdle.

jorvi 3 days ago | parent | next [-]

It is not really a nuanced take when it compares 'unassisted' coding to riding a bicycle and AI-assisted coding to driving a truck.

I put myself somewhere in the middle in terms of how great I think LLMs are for coding, but anyone who has worked with a colleague that loves LLM coding knows how horrid it is that the team has to comb through and double-check their commits.

In that sense it would be equally nuanced to call AI-assisted development something like "pipe bomb coding". You toss out your code into the branch, and your non-AI'd colleagues have to quickly check if your code is a harmless tube of code or yet another contraption that quickly needs defusing before it blows up in everyone's face.

Of course that is not nuanced either, but you get the point :)

LinXitoW 3 days ago | parent [-]

How nuanced the comparison seems also depends on whether you live in Arkansas or in Amsterdam.

But I disagree that your counterexample has anything at all to do with AI coding. That very same developer was perfectly capable of committing untested crap without AI. Perfectly capable of copy-pasting the first answer they found on Stack Overflow. Perfectly capable of recreating utility functions over and over because they were too lazy to check if they already exist.

k9294 3 days ago | parent | prev | next [-]

For this very reason I switched to TS for the backend as well. I'm not a big fan of JS, but the productivity gain of having shared types between frontend and backend, and Claude Code's proficiency with TS, is immense.

jeremy_k 3 days ago | parent [-]

I considered this, but I'm just too comfortable writing my server logic in Ruby on Rails (as I do that for my day job and side project). I'm super comfortable writing client side React / Typescript but whenever I look at server side Typescript code I'm like "I should understand what this is doing but I don't" haha.

croes 3 days ago | parent | prev [-]

Do you only skim the results or do you audit them at some point to prevent security issues?

jeremy_k 3 days ago | parent [-]

What kind of security issues are you thinking about? I'm generating UI components like Selects for certain data types or Charts of data.

dghlsakjg 3 days ago | parent | next [-]

User input is a notoriously thorny area.

If you aren't sanitizing and checking the inputs appropriately somewhere between the user and trusted code, you WILL get pwned.

Rails provides default ways to avoid this, but it makes it very easy to do whatever you want with user input. Rails will not necessarily throw a warning if your AI decides that it wants to directly interpolate user input into a sql query.
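To illustrate the kind of thing to watch for, a hedged sketch in TypeScript with node-postgres, since that's the stack discussed elsewhere in this thread (in Rails the equivalent mistake is interpolating user input into a `where` string instead of using hash conditions or bind parameters):

```typescript
import { Client } from "pg";

const client = new Client(); // connection setup omitted for brevity

async function findUser(name: string) {
  // Dangerous: user input interpolated straight into SQL. A generated patch
  // can slip this in without any framework warning.
  //   await client.query(`SELECT * FROM users WHERE name = '${name}'`);

  // Safe: a parameterized query; the driver handles escaping.
  const res = await client.query("SELECT * FROM users WHERE name = $1", [name]);
  return res.rows;
}
```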

jeremy_k 3 days ago | parent [-]

Well in this case, I am reading through everything that is generated for Rails because I want things to be done my way. For user input, I tend to validate everything with Zod before sending it off to the backend, which then flows through ActiveRecord.

I get what you're saying that AI could write something that executes user input but with the way I'm using the tools that shouldn't happen.
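A minimal sketch of that Zod step (the schema and field names are invented for illustration; the real forms obviously differ):

```typescript
import { z } from "zod";

// Hypothetical payload for a chart component: validated on the client before
// it is sent to the Rails backend, which applies its own validations.
const chartRequestSchema = z.object({
  metric: z.enum(["revenue", "signups"]),
  from: z.string().regex(/^\d{4}-\d{2}-\d{2}$/), // "YYYY-MM-DD"
  to: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  limit: z.number().int().min(1).max(100).default(25),
});

type ChartRequest = z.infer<typeof chartRequestSchema>;

function validate(raw: unknown): ChartRequest {
  // Throws on bad input; use safeParse() for a non-throwing variant.
  return chartRequestSchema.parse(raw);
}
```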

croes 3 days ago | parent | prev [-]

Do these components have JS, do they have npm dependencies?

Since AI slopsquatting is a thing

https://en.wikipedia.org/wiki/Slopsquatting

jeremy_k 3 days ago | parent [-]

I do not have AI install packages or do things like run Git commands for me.

troupo 3 days ago | parent | prev | next [-]

> How do we reconcile these two comments? I think that's a core question of the industry right now.

We don't. Because there's no hard data: https://dmitriid.com/everything-around-llms-is-still-magical...

And when hard data of any kind does start appearing, it may actually point in a different direction: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

> We need to shift the conversation to techniques, and away from the tools.

No, you're asking to shift the conversation to magical incantations which experts claim work.

What we need to do is shift the conversation to measurements.

chasd00 3 days ago | parent | prev | next [-]

One thing to think about is many software devs have a very hard time with code they didn't write. I've seen many devs do a lot of work to change code to something equivalent (even with respect to performance and readability) only because it's not the way they would have done it. I could see people having a hard time using what the LLM produced without having to "fix it up" and basically re-write everything.

jama211 3 days ago | parent [-]

Yeah, sometimes I feel like a unicorn because I don’t really care about code at all, so long as it conforms to decent standards and does what it needs to do. I honestly believe engineers often overestimate the importance of elegance in code, to the point of not realising that slowing a project down in pursuit of overly perfect code is genuinely not worth it.

parpfish 3 days ago | parent [-]

i dont care if the code is elegant, i care that the code is consistent.

do the same thing in the same way each time and it lets you chunk it up and skim it much easier. if there are little differences each time, you have to keep asking yourself "is it done differently here for a particular reason?"

vanviegen 3 days ago | parent | next [-]

Exactly! And besides that, new code being consistent with its surrounding code used to be a sign of careful craftsmanship (as opposed to spaghetti-against-the-wall style coding), giving me some confidence that the programmer may have considered at least the most important nasty edge cases. LLMs have rendered that signal mostly useless, of course.

jama211 a day ago | parent [-]

Ehh, in my experience, if you use an LLM with the surrounding code in context, it is better these days at conforming to the code style around it, especially if your rules tell it to.

jama211 a day ago | parent | prev [-]

Absolutely fair, and a great method.

unoti 3 days ago | parent | prev | next [-]

> Having spent a couple of weeks on Claude Code recently, I arrived at the conclusion that the net value for me from agentic AI is actually negative.

> For me it’s meant a huge increase in productivity, at least 3X.

> How do we reconcile these two comments? I think that's a core question of the industry right now.

Every success story with AI coding involves giving the agent enough context to see a path to success on the task. And every story where it fails is a situation where it didn't have enough context to see that path. Think about what happens with a junior software engineer: you give them a task and they either succeed or fail. If they succeed wildly, you give them a more challenging task. If they fail, you give them more guidance, more coaching, and less challenging tasks with more personal intervention from you to break it down into achievable steps.

As models and tooling becomes more advanced, the place where that balance lies shifts. The trick is to ride that sweet spot of task breakdown and guidance and supervision.

hirako2000 3 days ago | parent | next [-]

Bold claims.

From my experience, even the top models continue to fail delivering correctness on many tasks even with all the details and no ambiguity in the input.

In particular when details are provided, in fact.

I find that with solutions likely to be well represented in the training data, a well formulated set of *basic* requirements often leads, zero-shot, to "a" perfectly valid solution. I say "a" solution because there is still the probability (the seed factor) that it will not honour part of the demands.

E.g, build a to-do list app for the browser, persist entries into a hashmap, no duplicate, can edit and delete, responsive design.

I never recall seeing an LLM kick off C++ code out of that. But I also don't recall any LLM succeeding in all these requirements, even though there aren't that many.

It may use a hash set, or even a plain set, for persistence because it avoids duplicates out of the box. And it might throw in a hash map just to show it used a hashmap, but only as an intermediary data structure. It would be responsive, but the edit/delete buttons may not show, or may not be functional. Saving the edits may look like it worked, but did not.

The comparison with junior developers is a pale one. Even a mediocre developer can test their code and won't pretend that it works if it doesn't even execute. A developer who lies too many times loses trust. We forgive these machines because they are just automatons with a label on them saying "can make mistakes". We have no recourse to make them speak the truth; they lie by design.

brulard 3 days ago | parent | next [-]

> From my experience, even the top models continue to fail delivering correctness on many tasks even with all the details and no ambiguity in the input.

You may feel like all the details are there and there is no ambiguity in the prompt. But there may still be missing parts, like examples, structure, a plan, or division into smaller parts (it can do that quite well if explicitly asked). If you give too many details at once, it gets confused, but there are ways to let the model access context as it progresses through the task.

And models are just one part of the equation. Other parts may be the orchestrating agent, the tools, the model's awareness of the tools available, documentation, and maybe even a human in the loop.

epolanski 3 days ago | parent | prev [-]

> From my experience, even the top models continue to fail delivering correctness on many tasks even with all the details and no ambiguity in the input.

Please provide the examples, both of the problem and your input so we can double check.

troupo 3 days ago | parent | prev [-]

> And every story where it fails is a situation where it had not enough context to see a path to success on.

And you know that because people are actively sharing the projects, code bases, programming languages and approaches they used? Or because your gut feeling is telling you that?

For me, agents have failed with enough context and with not enough context, succeeded with enough context and with not enough, and both succeeded and failed with and without "guidance and coaching".

worldsayshi 3 days ago | parent | prev | next [-]

I think it's very much down to which kind of problem you're trying to solve.

If a solution can subtly fail and it is critical that it doesn't, LLM is net negative.

If a solution is easy to verify or if it is enough that it walks like a duck and quacks like one, LLM can be very useful.

I've had examples of both lately. I'm very much both bullish and bearish atm.

abc_lisper 3 days ago | parent | prev | next [-]

I doubt there is much art to getting an LLM to work for you, despite all the hoopla. Any competent engineer can figure that much out.

The real dichotomy is this. If you are aware of the tools/APIs and the domain, you are better off writing the code on your own, except maybe for shallow changes like refactorings. OTOH, if you are not familiar with the domain/tools, using an LLM gives you a huge leg up by preventing you from getting stuck and providing initial momentum.

jama211 3 days ago | parent | next [-]

I dunno, first time I tried an LLM I was getting so annoyed because I just wanted it to go through a css file and replace all colours with variables defined in root, and it kept missing stuff and spinning and I was getting so frustrated. Then a friend told me I should instead just ask it to write a script which accomplishes that goal, and it did it perfectly in one prompt, then ran it for me, and also wrote another script to check it hadn’t missed any and ran that.
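For reference, a hedged TypeScript sketch of the kind of script that worked here (the file name, variable naming scheme, and colour regex are all assumptions, not what the LLM actually produced):

```typescript
import { readFileSync, writeFileSync } from "node:fs";

// Replace literal hex colours in a stylesheet with CSS custom properties,
// collecting the definitions into a :root block at the top of the file.
const path = "styles.css"; // hypothetical input file
let css = readFileSync(path, "utf8");

const seen = new Map<string, string>(); // hex value -> variable name
css = css.replace(/#(?:[0-9a-fA-F]{3}){1,2}\b/g, (hex) => {
  const key = hex.toLowerCase();
  if (!seen.has(key)) seen.set(key, `--color-${seen.size + 1}`);
  return `var(${seen.get(key)})`;
});

const root =
  ":root {\n" +
  [...seen].map(([hex, name]) => `  ${name}: ${hex};`).join("\n") +
  "\n}\n\n";

writeFileSync(path, root + css);
console.log(`Replaced ${seen.size} distinct colours.`);
```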

At no point when it was getting stuck initially did it suggest another approach, or complain that it was outside its context window, even though it was.

This is a perfect example of “knowing how to use an LLM” taking it from useless to useful.

abc_lisper 3 days ago | parent [-]

Which one did you use and when was this? I mean, nobody gets anything working right the first time. You've got to spend at least a few days trying to understand the tool.

jama211 a day ago | parent [-]

It’s just a simple example of how knowing how to use a tool can make all the difference, and that can be improved upon with time. I’m not sure why you’re taking umbrage with that idea.

I know this style of arguing you’re going for. If I answer your questions, you’ll attack the specific model or use case I was in, or claim it was too simple/basic a use case, or some other nitpick about the specifics instead of in good faith attempting to take my point as stated. I won’t allow you to force control of the frame of the conversation by answering your questions, also because the answers wouldn’t do anything to change the spirit of my main point.

jama211 a day ago | parent [-]

(Inb4 “you won’t tell me because it’s a crap model or some other petty excuse” - FYI, it wasn’t)

badlucklottery 3 days ago | parent | prev | next [-]

This is my experience as well.

LLMs currently produce pretty mediocre code. A lot of that is a "garbage in, garbage out" issue, but it's just the current state of things.

If the alternative is noob code or just not doing a task at all, then mediocre is great.

But 90% of the time I'm working in a familiar language/domain so I can grind out better code relatively quickly and do so in a way that's cohesive with nearby code in the codebase. The main use-case I have for AI in that case is writing the trivial unit tests for me.

So it's another "No Silver Bullet" technology where the problem it's fixing isn't the essential problem software engineers are facing.

brulard 3 days ago | parent | prev [-]

I believe there IS much art in LLMs and Agents especially. Maybe you can get like 20% boost quite quickly, but there is so much room to grow it to maybe 500% long term.

sixothree 3 days ago | parent | prev | next [-]

It might just be me, but I feel like it excels with certain languages while in other situations it falls flat. Throw a well architected and documented code base in a popular language at it and you can definitely feel it get into its groove.

Also, giving it tools to ensure success is just as important. MCPs can sometimes make a world of difference, especially when it needs to search your code base.

dennisy 3 days ago | parent | prev | next [-]

Another view is that developers below a certain level get a positive benefit and those above get a negative effect.

This makes sense, as the models are an average of the code out there and some of us are above and below that average.

Sorry btw I do not want to offend anyone who feels they do garner a benefit from LLMs, just wanted to drop in this idea!

smokel 3 days ago | parent | next [-]

My experience was exactly the opposite.

Experienced developers know when the LLM goes off the rails, and are typically better at finding useful applications. Junior developers on the other hand, can let horrible solutions pass through unchecked.

Then again, LLMs are improving so quickly, that the most recent ones help juniors to learn and understand things better.

rzz3 3 days ago | parent | prev | next [-]

It’s also really good for me as a very senior engineer with serious ADHD. Sometimes I get very mentally blocked, and telling Claude Code to plan and implement a feature gives me a really valuable starting point and has a way of unblocking me. For me it’s easier to elaborate off of an existing idea or starting point and refactor than start a whole big thing from zero on my own.

parpfish 3 days ago | parent | prev | next [-]

i don't know if anybody else has experienced this, but one of my biggest time-sucks with cursor is that it doesn't have a way for me to steer it mid-process that i'm aware of.

it'll build something that fails a test, but i know how to fix the problem. i can't jump in and manually fix it or tell it what to do. i just have to watch it churn through the problem and eventually give up and throw away a 90% good solution that i knew how to fix.

williamdclt 3 days ago | parent [-]

You can click stop, and prompt it from there

ath3nd 3 days ago | parent | prev [-]

That's my anecdotal experience as well! Junior devs struggle with a lot of things:

- syntax

- iteration over an idea

- breaking down the task and verifying each step

Working with a tool like Claude that gets them started quickly and iterates on the solution together with them helps them tremendously and educates them on best practices in the field.

Contrast that with a seasoned developer with a domain experience, good command of the programming language and knowledge of the best practices and a clear vision of how the things can be implemented. They hardly need any help on those steps where the junior struggled and where the LLMs shine, maybe some quick check on the API, but that's mostly it. That's consistent with the finding of the study https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... that experienced developers' performance suffered when using an LLM.

The metaphor I've used before to describe this phenomenon is training wheels: kids learning how to ride a bike can get the basics with the help and safety of the wheels, but adults who can already ride a bike don't have any use for training wheels, and can often find themselves restricted by them.

epolanski 3 days ago | parent [-]

> that experienced developers' performance suffered when using an LLM

That experiment is really non significant. A bunch of OSS devs without much training in the tools used them for very little time and found it to be a net negative.

ath3nd 3 days ago | parent [-]

> That experiment is really non significant

That's been anecdotally my experience as well, I have found juniors benefitted the most so far in professional settings with lots of time spent on learning the tools. Senior devs either negatively suffered or didn't experience an improvement. The only study so far also corroborates that anecdotal experience.

We can wait for other studies that are more relevant and have larger sample sizes, but so far the only folks actually trying to measure productivity found a negative effect, so I am more inclined to believe it until other studies come along.

nhaehnle 3 days ago | parent | prev | next [-]

I just find it hard to take the 3x claims at face value because actual code generation is only a small part of my job, and so Amdahl's law currently limits any productivity increase from agentic AI to well below 2x for me.

(And I believe I'm fairly typical for my team. While there are more junior folks, it's not that I'm just stuck with powerpoint or something all day. Writing code is rarely the bottleneck.)

So... either their job is really just churning out code (where do these jobs exist, and are there any jobs like this at all that still care about quality?) or the most generous explanation that I can think of is that people are really, really bad at self-evaluations of productivity.

jdgoesmarching 3 days ago | parent | prev | next [-]

Agreed, and it drives me bonkers when people talk about AI coding as if it represents a single technique, process, or tool.

Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

We don’t even fully agree on the best practices for writing code without AI.

mh- 3 days ago | parent | next [-]

> Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

Older person here: they absolutely did, all over the place in the early 90s. I remember people decrying projects that moved them to computers everywhere I went. Doctors offices, auto mechanics, etc.

Then later, people did the same thing about the Internet (which was written as a single word with a capital I by 2000, having previously been written as two separate words).

https://i.imgur.com/vApWP6l.png

jacquesm 3 days ago | parent [-]

And not all of those people were wrong.

moregrist 3 days ago | parent | prev [-]

> Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

There were gobs of terrible road metaphors that spun out of calling the Internet the “Information Superhighway.”

Gobs and gobs of them. All self-parody to anyone who knew anything.

I hesitate to relate this to anything in the current AI era, but maybe the closest (and in a gallows humor/doomer kind of way) is the amount of exec speak on how many jobs will be replaced.

porksoda 3 days ago | parent [-]

Remember the ones who loudly proclaimed the internet to be a passing fad, not useful for normal people. All anti LLM rants taste like that to me.

I get why they thought that - it was kind of crappy unless you're one who is excited about the future and prepared to bleed a bit on the edge.

benterix 3 days ago | parent [-]

> Remember the ones who loudly proclaimed the internet to be a passing fad, not useful for normal people. All anti LLM rants taste like that to me.

For me they're very different and they sound much more like crypto-skepticism. It's not like "LLMs are worthless, there are no use cases, they should be banned" but rather "LLMs do have their use cases but they also have inherent flaws that need to be addressed; embedding them in every product makes no sense etc.". (I mean LLMs as tech; what's happening with GenAI companies and their leaders is a completely different matter, and we have every right to criticize every lie, hypocrisy and manipulation, but let's not mix up these two.)

bloomca 3 days ago | parent | prev | next [-]

> 2. When NOT to use the truck... when walking or the bike is actually the better way to go.

Some people write racing car code, where a truck just doesn't bring much value. Some people go into more uncharted territories, where there are no roads (so the truck will not only slow you down, it will bring a bunch of dead weight).

If the road is straight, AI is wildly good. In fact, it is probably _too_ good; but it can easily miss a turn and it will take a minute to get it on track.

I am curious whether we'll be able to fine-tune LLMs to assist with less known paths.

Ianjit 3 days ago | parent | prev | next [-]

"How do we reconcile these two comments? I think that's a core question of the industry right now."

There is no correlation between developers' self-assessment of their productivity and their actual productivity.

https://www.youtube.com/watch?v=tbDDYKRFjhk

pesfandiar 3 days ago | parent | prev | next [-]

Your analogy would be much better if you gave the workers a workhorse with a mind of its own. Trucks come with clear instructions and predictable behaviour.

chasd00 3 days ago | parent [-]

> Your analogy would be much better with giving workers a work horse with a mind of its own.

I think this is a very insightful comment with respect to working with LLMs. If you've ever ridden a horse, you don't really tell it to walk, run, turn left, turn right, etc.; you have to convince it to do those things, and not be too aggravating while you're at it. With a truck, simple cause and effect applies, but with a horse it's a negotiation. I feel like working with LLMs is like a negotiation: you have to coax out of it what you're after.

jf22 3 days ago | parent | prev | next [-]

A couple of weeks isn't enough.

I'm six months in using LLMs to generate 90% of my code and finally understanding the techniques and limitations.

ath3nd 3 days ago | parent | prev | next [-]

> How do we reconcile these two comments? I think that's a core question of the industry right now.

The current freshest study focusing on experienced developers showed a net negative in the productivity when using an LLM solution in their flow:

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

My conclusion on this, as an ex VP of Engineering, is that good senior developers find little utility with LLMs and even find them to be a nuisance/detriment, while for juniors they can be a godsend, as they help them with syntax and coax the solution out of them.

It's like training wheels to a bike. A toddler might find 3x utility, while a person who actually can ride a bike well will find themselves restricted by training wheels.

jg0r3 3 days ago | parent | prev | next [-]

Three things I've noticed as a dev whose field involves a lot of niche software development.

1. LLMs seem to benefit 'hacker-type' programmers from my experience. People who tend to approach coding problems in a very "kick the TV from different angles and see if it works" strategy.

2. There seems to be two overgeneralized types of devs in the market right now: Devs who make niche software and devs who make web apps, data pipelines, and other standard industry tools. LLMs are much better at helping with the established tool development at the moment.

3. LLMs are absolute savants at making clean-ish looking surface level tech demos in ~5 minutes, they are masters of selling "themselves" to executives. Moving a demo to a production stack? Eh, results may vary to say the least.

I use LLMs extensively when they make sense for me.

One fascinating thing for me is how different everyone's experience with LLMs is. Obviously there's a lot of noise out there. With AI haters and AI tech bros kind of muddying the waters with extremist takes.

pletnes 3 days ago | parent | prev | next [-]

Being a consultant / programmer with feet on the ground, eh, hands on the keyboard: some orgs let us use some AI tools, others do not. Some projects are predominantly new code based on recent tech (React); others include maintaining legacy stuff on windows server and proprietary frameworks. AI is great on some tasks, but unavailable or ignorant about others. Some projects have sharp requirements (or at least, have requirements) whereas some require 39 out of 40 hours a week guessing at what the other meat-based intelligences actually want from us.

What «programming» actually entails, differs enormously; so does AI’s relevance.

nabla9 3 days ago | parent | prev | next [-]

I agree.

I experience a productivity boost, and I believe it’s because I prevent LLMs from making design choices or handling creative tasks. They’re best used as a "code monkey", filling in function bodies once I’ve defined them. I design the data structures, functions, and classes myself. LLMs also help with learning new libraries by providing examples, and they can even write unit tests that I manually check. Importantly, no code I haven’t read and accepted ever gets committed.
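A tiny sketch of that division of labour (illustrative names only): the types, signature, and contract are written by hand, and the model is only asked to fill in the body, which is still read before being accepted.

```typescript
type Order = { id: string; amountCents: number; placedAt: Date };

/**
 * Total value, in cents, of orders placed within the last `days` days.
 * Orders outside the window are ignored.
 */
function recentOrderTotal(orders: Order[], days: number, now = new Date()): number {
  // Body filled in by the LLM against the contract above, then reviewed.
  const cutoff = now.getTime() - days * 24 * 60 * 60 * 1000;
  return orders
    .filter((o) => o.placedAt.getTime() >= cutoff)
    .reduce((sum, o) => sum + o.amountCents, 0);
}
```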

Then I see people doing things like "write an app for ....", run, hey it works! WTF?

epolanski 3 days ago | parent | prev | next [-]

This is a very sensible point.

oceanplexian 3 days ago | parent | prev [-]

It's pretty simple, AI is now political for a lot of people. Some folks have a vested interest in downplaying it or over hyping it rather than impartially approaching it as a tool.

Gigachad 3 days ago | parent [-]

It’s also just not consistent. A manager who can’t code, using it to generate a React todo list, thinks it’s a 100x efficiency gain, while a senior software dev working on established apps finds it a net productivity negative.

AI coding tools seem to excel at demos and flop on the field so the expectation disconnect between managers and actual workers is massive.

rs186 3 days ago | parent | prev | next [-]

3X if not 10X if you are starting a new project with Next.js, React, and Tailwind CSS for full-stack website development, solving an everyday problem. Yeah, I just witnessed that yesterday when creating a toy project.

For my company's codebase, where we use internal tools and proprietary technology, solving a problem that does not exist outside the specific domain, on a codebase of over 1000 files? No way. Even locating the correct file to edit is non trivial for a new (human) developer.

mike_hearn 3 days ago | parent | next [-]

My codebase has about 1500 files and is highly domain specific: it's a tool for shipping desktop apps[1] that handles all the building, packaging, signing, uploading etc for every platform on every OS simultaneously. It's written mostly in Kotlin, and to some extent uses a custom in-house build system. The rest of the build is Gradle, which is a notoriously confusing tool. The source tree also contains servers, command line tools and a custom scripting language which is used for all the scripting needs of the project [2].

The code itself is quite complex and there's lots of unusual code for munging undocumented formats, speaking undocumented protocols, doing cryptography, Mac/Windows specific APIs, and it's all built on a foundation of a custom parallel incremental build system.

In other words: nightmare codebase for an LLM. Nothing like other codebases. Yet, Claude Code demolishes problems in it without a sweat.

I don't know why people have different experiences but speculating a bit:

1. I wrote most of it myself and this codebase is unusually well documented and structured compared to most. All the internal APIs have full JavaDocs/KDocs, there are extensive design notes in Markdown in the source tree, the user guide is also part of the source tree. Files, classes and modules are logically named. Files are relatively small. All this means Claude can often find the right parts of the source within just a few tool uses.

2. I invested in making a good CLAUDE.md and also wrote a script to generate "map.md" files that are at the top of every module. These map files contain one-liners describing what every source file contains. I used Gemini to make these due to its cheap 1M context window. If Claude does struggle to find the right code by just reading the context files or guessing, it can consult the maps to locate the right place quickly. (A rough sketch of such a map generator is included at the end of this comment.)

3. I've developed a good intuition for what it can and cannot do well.

4. I don't ask it to do big refactorings that would stress the context window. IntelliJ is for refactorings. AI is for writing code.

[1] https://hydraulic.dev

[2] https://hshell.hydraulic.dev/
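As referenced in point 2 above, a hedged sketch of what a map.md generator could look like. The real setup is described as using Gemini to write the one-liners; this version just grabs the first doc-comment line of each Kotlin/Java file as a stand-in, and the module path is made up.

```typescript
import { readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// For each source file in a module, emit "filename: one-line summary" into map.md.
function generateMap(moduleDir: string): void {
  const lines: string[] = [];
  for (const name of readdirSync(moduleDir)) {
    if (!/\.(kt|java)$/.test(name)) continue;
    const text = readFileSync(join(moduleDir, name), "utf8");
    const block = text.match(/\/\*\*([^]*?)\*\//); // first KDoc/Javadoc block
    const summary = block
      ? block[1]
          .split("\n")
          .map((l) => l.replace(/^\s*\*\s?/, "").trim())
          .find((l) => l.length > 0)
      : undefined;
    lines.push(`- ${name}: ${summary ?? "(no doc comment)"}`);
  }
  writeFileSync(join(moduleDir, "map.md"), lines.join("\n") + "\n");
}

generateMap("core/src/main/kotlin"); // hypothetical module path
```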

GenerocUsername 3 days ago | parent | prev | next [-]

Your first week of AI usage should be spent crawling your codebase and generating context.md docs that can then be fed back into future prompts so that the AI understands your project space, packages, APIs, and code philosophy.

I guarantee your internal tools are not revolutionary; they are just not represented in the ML model out of the box.

orra 3 days ago | parent | next [-]

That sounds incredibly boring.

Is it effective? If so I'm sure we'll see models to generate those context.md files.

cpursley 3 days ago | parent [-]

Yes. And way less boring than manually reading a section of a codebase to understand what is going on after being away from it for 8 months. Claude's docs and git commit writing skills are worth it for that alone.

blitztime 3 days ago | parent | prev | next [-]

How do you keep the context.md updated as the code changes?

shmoogy 3 days ago | parent [-]

I tell Claude to update it generally but you can probably use a hook

tombot 3 days ago | parent [-]

This. While it has context of the current problem, just ask Claude to re-read its own documentation and think of things to add that will help it in the future.

nicce 3 days ago | parent | prev [-]

Even then, are you even allowed to use AI in such a codebase? Is some part of the code "bought", e.g. generated by a commercial compiler with a specific license? Is a pinky promise from the LLM provider enough?

GenerocUsername 3 days ago | parent [-]

Are the resources to understand the code on a computer? Whether it's code, swagger, or a collection of sticky notes, your job is now to supply context to the AI.

I am 100% convinced people who are not getting value from AI would have trouble explaining how to tie shoes to a toddler

MattGaiser 3 days ago | parent | prev | next [-]

Yeah, anecdotally it is heavily dependent on:

1. Using a common tech. It is not as good at Vue as it is at React.

2. Using it in a standard way. To get AI to really work well, I have had to change my typical naming conventions (or specify them in detail in the instructions).

nicce 3 days ago | parent [-]

React also seems to effectively be an alias for Next.js. Models have a hard time telling the difference.

tptacek 3 days ago | parent | prev [-]

That's an interesting comment, because "locating the correct file to edit" was the very first thing LLMs did that was valuable to me as a developer.

elevatortrim 3 days ago | parent | prev | next [-]

I think there are two broad cases where ai coding is beneficial:

1. You are a good coder but are working on a project that is new to you, building a new project, or working with a technology you are not familiar with. This is where AI is hugely beneficial. It not only accelerates you, it lets you do things you could not do otherwise.

2. You have spent a lot of time on engineering your context and learning what AI is good at, and using it very strategically where you know it will save time and not bother otherwise.

If you are a really good coder, really familiar with the project, and mostly changing its bits and pieces rather than building new functionality, AI won’t accelerate you much. Especially if you did not invest the time to make it work well.

acedTrex 3 days ago | parent | prev | next [-]

I have yet to get it to generate code past 10ish lines that I am willing to accept. I read stuff like this and wonder how low yall's standards are, or if you are working on projects that just do not matter in any real world sense.

dillydogg 3 days ago | parent | next [-]

Whenever I read comments from the people singing their praises of the technology, it's hard not to think of the study that found AI tools made developers slower in early 2025.

>When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

logicprog 3 days ago | parent | next [-]

Here's an in depth analysis and critique of that study by someone whose job is literally to study programmers psychologically and has experience in sociology studies: https://www.fightforthehuman.com/are-developers-slowed-down-...

Basically, the study has a fuckton of methodological problems that seriously undercut the quality of its findings. Even assuming the findings are correct, a closer look at the data shows it doesn't support what it claims about developer estimations, and the story of whether AI speeds up or slows down developers is much more nuanced. It precisely mirrors what the developers themselves say in the qualitative quote questionnaire, and relatively closely mirrors what the more nuanced people here will say: that it helps a lot more with things you're less familiar with, tasks that have scope creep, and so on, but is less useful, or even negatively useful, in the opposite scenarios, even in this worst-case setting.

Not to mention this is studying a highly specific and rare subset of developers, and they even admit it's a subset that isn't applicable to the whole.

dillydogg 3 days ago | parent [-]

This is very helpful, thank you for the resource

mstkllah 3 days ago | parent | prev [-]

Ah, the very extensive study with 16 developers. Bulletproof results.

troupo 3 days ago | parent | next [-]

Compared to "it's just a skill issue you're not prompting it correctly" crowd with literally zero actionable data?

izacus 3 days ago | parent | prev [-]

Yeah, we should listen to the one "trust me bro" dude instead.

spicyusername 3 days ago | parent | prev | next [-]

4/5 times I can easily get hundreds of lines of output that only need a quick once-over.

1/5 times, I spend an extra hour tangled in code it outputs that I eventually just rewrite from scratch.

Definitely a massive net positive, but that 20% is extremely frustrating.

acedTrex 3 days ago | parent | next [-]

That is fascinating to me. I've never seen it generate that much code that I would actually consider correct. It's always wrong in some way.

LinXitoW 3 days ago | parent | prev [-]

In my experience, if I have to issue more than 2 corrections, I'm better off restarting and beefing up the prompt or just doing it myself

djeastm 3 days ago | parent | prev [-]

Standards are going to be as low as the market allows, I think. In some industries code quality is paramount; in others it's negligible, and perhaps speed of development is the higher priority and the code is mostly disposable.

nicce 3 days ago | parent | prev | next [-]

> I build full stack web applications in node/.net/react, more importantly (I think) is that I work on a small startup and manage 3 applications myself.

I think this is your answer. For example, React and JavaScript are extremely popular and have been around a long time. Are you using TypeScript and trying to get the most out of the types, or are you accepting whatever the LLM gives you as JavaScript? How much do you care whether the code uses "soon to be deprecated" functions or the most optimized loop/implementation? How about the project structure?

In other cases, the more precision you need, the less effective the LLM is.

thanhhaimai 3 days ago | parent | prev | next [-]

I work across the stack (frontend, backend, ML)

- For FrontEnd or easy code, it's a speed up. I think it's more like 2x instead of 3x.

- For my backend (hard trading algo), it has something like a 90% failure rate so far. There is just so much for it to reason through (balance sheet, lots, wash, etc.). All agents I have tried, even on Max mode, couldn't reason through all the cases correctly. They end up thrashing back and forth. Gemini, most of the time, will go into "depressed" mode on the code base.

One thing I notice is that the Max mode on Cursor is not worth it for my particular use case. The problem is either easy (frontend), which means any agent can solve it, or it's hard, and Max mode can't solve it. I tend to pick the fast model over the strong model.

bcrosby95 3 days ago | parent | prev | next [-]

My current guess is it's how the programmer solves problems in their head. This isn't something we talk about much.

People seem to find LLMs do well with well-spec'd features. But for me, creating a good spec doesn't take any less time than creating the code. The problem for me is the translation layer that turns the model in my head into something more concrete. As such, creating a spec for the LLM doesn't save me any time over writing the code myself.

So if it's a one-shot with a vague spec and that works, that's cool. But if it's well spec'd to the point the LLM won't fuck it up, then I may as well write it myself.

evantbyrne 3 days ago | parent | prev | next [-]

The problem with these discussions is that almost nobody outside of the agency/contracting world seems to track their time. Self-reported data is already sketchy enough without layering on the issue of relying on distant memory of fine details.

dingnuts 3 days ago | parent | prev | next [-]

You have small applications following extremely common patterns and using common libraries. Models are good at regurgitating patterns they've seen many times, with fuzzy find/replace translations applied.

Try to build something like Kubernetes from the ground up and let us know how it goes. Or try writing a custom firmware for a device you just designed. Something like that.

andrepd 3 days ago | parent | prev | next [-]

Self-reports are notoriously overexcited; real results are, let's say, not so stellar.

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

logicprog 3 days ago | parent [-]

Here's an in depth analysis and critique of that study by someone whose job is literally to study programmers psychologically and has experience in sociology studies: https://www.fightforthehuman.com/are-developers-slowed-down-...

Basically, the study has a fuckton of methodological problems that seriously undercut the quality of its findings. Even assuming its findings are correct, if you look closer at the data, it doesn't show what it claims to show regarding developer estimations. The story of whether it speeds up or slows down developers is actually much more nuanced: it precisely mirrors what the developers themselves say in the qualitative questionnaire, and relatively closely mirrors what the more nuanced people will say here, namely that it helps a lot more with things you're less familiar with, that have scope creep, etc., but is less useful or even negatively useful in the opposite scenarios, even in the worst-case setting.

Not to mention this is studying a highly specific and rare subset of developers, and they even admit it's a subset that isn't applicable to the whole.

carlhjerpe 3 days ago | parent | prev | next [-]

I'm currently unemployed in the DevOps field (resigned and got a long vacation). I've been using various models to write various Kubernetes plug-ins and simple automation scripts. It's been a godsend for implementing things that would otherwise require too much research; my ADHD context window is smaller than Claude's.

Models are VERY good at Kubernetes since the project has very anal (good) documentation requirements before merging.

I would say my productivity gain is unmeasurable since I can produce things I'd ADHD out of unless I've got a whip up my rear.

3 days ago | parent | prev | next [-]
[deleted]
epolanski 3 days ago | parent | prev | next [-]

> Since so many claim the opposite

The overwhelming majority of those claiming the opposite are a mixture of:

- users with wrong expectations, such as AI's ability to do the job on its own with minimal effort from the user. They have marketers to blame.

- users that have AI skill issues: they simply don't understand/know how to use the tools appropriately. I could provide countless examples from the importance of quality prompting, good guidelines, context management, and many others. They have only their laziness or lack of interest to blame.

- users that are very defensive about their job/skills. Many feel threatened by AI taking their jobs or diminishing it, so their default stance is negative. They have their ego to blame.

dmitrygr 3 days ago | parent | prev | next [-]

> For me it’s meant a huge increase in productivity, at least 3X.

Quite possibly you are doing very common things that show up in the training set a lot, while the parent post is doing something more novel that forces the model to extrapolate, which they suck at.

cambaceres 3 days ago | parent [-]

Sure, I won’t argue against that. The more complex (and fun) parts of the applications I tend to write myself. The productivity gains are still real though.

byryan 3 days ago | parent | prev | next [-]

That makes sense, especially if you're building web applications that are primarily "just" CRUD operations. If a lot of the API calls follow the same pattern and the application is just a series of API calls plus a React UI, then that seems like something an LLM would excel at. LLMs are also more proficient in TypeScript/JS/Python compared to other languages, so that helps as well.

squeaky-clean 3 days ago | parent | prev | next [-]

I just want to point out that they only said agentic models were a negative, not AI in general. I don't know if this is what they meant, but I personally prefer to use a web or IDE AI tool and don't really like the agentic stuff compared to those. For me agentic AI would be a net positive against no-AI, but it's a net negative compared to other AI interfaces

darkmarmot 3 days ago | parent | prev | next [-]

I work in distributed systems programming and have been horrified by the crap the AIs produce. I've found them to be quite helpful at summarizing papers and doing research, providing jumping off points. But none of the code I write can be scraped from a blog post.

qingcharles 3 days ago | parent | prev | next [-]

On the right projects, definitely an enormous upgrade for me. Have to be judicious with it and know when it is right and when it's wrong. I think people have to figure out what those times are. For now. In the future I think a lot of the problems people are having with it will diminish.

datadrivenangel 3 days ago | parent | prev [-]

How do you structure your applications for maintainability?

flowerthoughts 3 days ago | parent | prev | next [-]

What type of work do you do? And how do you measure value?

Last week I was using Claude Code for web development. This week, I used it to write ESP32 firmware and a Linux kernel driver. Sure, it made mistakes, but the net was still very positive in terms of efficiency.

verall 3 days ago | parent [-]

> This week, I used it to write ESP32 firmware and a Linux kernel driver.

I'm not meaning to be negative at all, but was this for a toy/hobby or for a commercial project?

I find that LLMs do very well on small greenfield toy/hobby projects but basically fall over when brought into commercial projects that often have bespoke requirements and standards (i.e. has to cross compile on qcc, comply with autosar, in-house build system, tons of legacy code lying around that may or may not be used).

So no shade - I'm just really curious what kind of project you were able get such good results writing ESP32 FW and kernel drivers for :)

lukebechtel 3 days ago | parent | next [-]

Maintaining project documentation is:

(1) Easier with AI

(2) Critical for letting AI work effectively in your codebase.

Try creating well structured rules for working in your codebase, put in .cursorrules or Claude equivalent... let AI help you... see if that helps.

theshrike79 3 days ago | parent | next [-]

The magic to using agentic LLMs efficiently is...

proper project management.

You need to have good documentation, split into logical bits. Tasks need to be clearly defined and not have extensive dependencies.

And you need to have a simple feedback loop where you can easily run the program and confirm the output matches what you want.

troupo 3 days ago | parent | prev [-]

And the chance of that working depends on the weather, the phase of the moon and the arrangement of bird bones in a druidic augury.

It's a non-deterministic system producing statistically relevant results with no failure modes.

I had Cursor one-shot issues in internal libraries with zero rules.

And then suggest I use StringBuilder (Java) in a 100% Elixir project with carefully curated cursor rules as suggested by the latest shamanic ritual trends.

GodelNumbering 3 days ago | parent | prev | next [-]

This is my experience too. Also, their propensity to jump into code without necessarily understanding the requirement is annoying, to say the least. As the project complexity grows, you find yourself writing longer and longer instructions just to guardrail them.

Another rather interesting thing is that they tend to gravitate towards sweep-the-errors-under-the-rug kind of coding, which is disastrous. e.g. "return X if we don't find the value so downstream doesn't crash". These are the kind of errors that no human, not even a beginner on their first day learning to code, would make, and they are extremely annoying to debug.

Tl;dr: LLMs have a tendency to treat every single thing you give them as a demo homework project.

verall 3 days ago | parent | next [-]

> Another rather interesting thing is that they tend to gravitate towards sweep the errors under the rug kind of coding which is disastrous. e.g. "return X if we don't find the value so downstream doesn't crash".

Yes, these are painful and basically the main reason I moved from Claude to Gemini - it felt insane to be begging the AI - "No, you actually have to fix the bug, in the code you wrote, you cannot just return some random value when it fails, it actually has to work".

GodelNumbering 3 days ago | parent [-]

Claude in particular abuses the word 'Comprehensive' a lot. If you express that you're unhappy with its approach, it will likely come back with "Comprehensive plan to ..." and then write like 3 bullet points under it, that is of course after profusely apologizing. On a side note, I wish LLMs never apologized and instead just said "I don't know how to do this."

tombot 3 days ago | parent | prev | next [-]

> their propensity to jump into code without necessarily understanding the requirement is annoying to say the least.

Then don't let it, collaborate on the spec, ask Claude to make a plan. You'll get far better results

https://www.anthropic.com/engineering/claude-code-best-pract...

LinXitoW 3 days ago | parent | prev | next [-]

In my experience in a Java code base, it didn't do any of this, and did a good job with exceptions.

And I have to disagree that these aren't errors that beginners or even intermediates make. Who hasn't swallowed an error because "that case totally, most definitely won't ever happen, and I need to get this done"?

jorvi 3 days ago | parent | prev [-]

Running LLM code with kernel privileges seems like courting disaster. I wouldn't dare do that unless I had a rock-solid grasp of the subsystem, and at that point, why not just write the code myself? LLM coding is, on average, 20% slower.

LinXitoW 3 days ago | parent | prev | next [-]

Ironically, AI mirrors human developers in that it's far more effective when working in a well written, well documented code base. It will infer a function's functionality from its name. If those names are shitty, short, or full of weird abbreviations, it'll have a hard time.

Maybe it's a skill issue, in the sense of having a decent code base.

flowerthoughts 3 days ago | parent | prev | next [-]

Totally agree.

This was a debugging tool for Zigbee/Thread.

The web project is Nuxt v4, which was just released, so Claude keeps wanting to use v3 semantics, and you have to keep repeating the known differences, even if you use CLAUDE.md. (They moved client files under an app/ subdirectory.)

All of these are greenfield prototypes. I haven't used it in large systems, and I can totally see how that would be context overload for it. This is why I was asking GP about the circumstances.

oceanplexian 3 days ago | parent | prev [-]

I work in FAANG, have been for over a decade. These tools are creating a huge amount of value, starting with Copilot but now with tools like Claude Code and Cursor. The people doing so don’t have a lot of time to comment about it on HN since we’re busy building things.

jpc0 3 days ago | parent | next [-]

> These tools are creating a huge amount of value...

> The people doing so don’t have a lot of time to comment about it on HN since we’re busy building…

“We’re so much more productive that we don’t have time to tell you how much more productive we are”

Do you see how that sounds?

wijwp 3 days ago | parent | next [-]

To be fair, AI isn't going to give us more time outside work. It'll just increase expectations from leadership.

drusepth 3 days ago | parent | prev [-]

I feel this, honestly. I get so much more work done (currently: building & shipping games, maintaining websites, managing APIs, releasing several mobile apps, and developing native desktop applications) managing 5x claude instances that the majority of my time is sucked up by just prompting whichever agent is done on their next task(s), and there's a real feeling of lost productivity if any agent is left idle for too long.

The only time to browse HN left is when all the agents are comfortably spinning away.

nme01 3 days ago | parent | prev | next [-]

I also work for a FAANG company and so far most employees agree that while LLMs are good for writing docs, presentations or emails, they still lack a lot when it comes to writing maintainable code (especially in Java; they supposedly do better in Go, don't know why, not my opinion). Even simple refactorings need to be carefully checked. I really like them for doing stuff that I know nothing about though (e.g. write a script using a certain tool, tell me how to rewrite my code to use a certain library, etc.) or for reviewing changes.

3 days ago | parent [-]
[deleted]
GodelNumbering 3 days ago | parent | prev | next [-]

I don't see how FAANG is relevant here. But the 'FAANG' I used to work at had an emergent problem of people throwing a lot of half baked 'AI-powered' code over the wall and let reviewers deal with it (due to incentives, not that they were malicious). In orgs like infra where everything needs to be reviewed carefully, this is purely a burden

verall 3 days ago | parent | prev | next [-]

I've worked at a FAANG equivalent for a decade, mostly in C++/embedded systems. I work on commercial products used by millions of people. I use the AI also.

When others are finding gold in rivers similar to mine, and I'm mostly finding dirt, I'm curious to ask and see how similar the rivers really are, or if the river they are panning in is actually somewhere I do find gold, but not a river I get to pan in often.

If the rivers really are similar, maybe I need to work on my panning game :)

ewoodrich 3 days ago | parent | prev | next [-]

I use agentic tools all the time but comments like this always make me feel like someone's trying to sell me their new cryptocoin or NFT.

3 days ago | parent | prev | next [-]
[deleted]
boppo1 3 days ago | parent | prev | next [-]

> creating a huge amount of value

Do you write software, or work in accounting/finance/marketing?

3 days ago | parent | prev | next [-]
[deleted]
nomel 3 days ago | parent | prev [-]

What are the AI usage policies like at your org? Where I am, we’re severely limited.

3 days ago | parent [-]
[deleted]
greenie_beans 3 days ago | parent | prev | next [-]

same. agents are good with easy stuff and debugging but extremely bad with complexity. has no clue about Chesterton's fence, and it's hard to parse the results especially when it creates massive diffs. creates a ton of abandoned/cargo-cult code. lots of misdirection with OOP.

chatting with claude and copy/pasting code between my IDE and claude is still the most effective for more complex stuff, at least for me.

mikepurvis 4 days ago | parent | prev | next [-]

For a bit more nuance, I think I would say my overall net is about break-even. But I don't take that as "it's not worth it at all, abandon ship" but rather that I need to hone my instinct for what is and is not a good task for AI involvement, and what that involvement should look like.

Throwing together a GHA workflow? Sure, make a ticket, assign it to copilot, check in later to give a little feedback and we're golden. Half a day of labour turned into fifteen minutes.

But there are a lot of tasks that are far too nuanced where trying to take that approach just results in frustration and wasted time. There it's better to rely on editor completion or maybe the chat interface, like "hey I want to do X and Y, what approach makes sense for this?" and treat it like a rubber duck session with a junior colleague.

mark_l_watson 4 days ago | parent | prev | next [-]

I am sort of with you. I am down to asking Gemini Pro a couple of questions a day, use ChatGPT just a few times a week, and about once a week use gemini-cli (either a short free session, or a longer session where I provide my API key.)

That said I spend (waste?) an absurdly large amount of time each week experimenting with local models (sometimes practical applications, sometimes ‘research’).

9cb14c1ec0 3 days ago | parent | prev | next [-]

The more I use Claude Code, the more aware I become of its limitations. On the whole, it's a useful tool, but the bigger the codebase the less useful. I've noticed a big difference on its performance on projects with 20k lines of code versus 100k. (Yes, I know. A 100k line project is still very small in the big picture)

Aeolun 3 days ago | parent [-]

I think one of the big issues with CC is that it'll read the first occurrence of something, and then think it's found it. Never mind that there are 17 instances spread throughout the codebase.

I have to be really vigilant and tell it to search the codebase for any duplication, then resolve it, if I want it to keep being good at what it does.

meowtimemania 3 days ago | parent | prev | next [-]

For me it depends on the task. For some tasks (maybe things that don't have good existing examples in my codebase?) I'll spend 3x the time repeatedly asking Claude to do something for me.

revskill 3 days ago | parent | prev | next [-]

Truth. To some extent, the agent doesn't know what it's doing at all; it lacks a real brain. Maybe we should just treat it as a hard worker.

sorhaindop 2 days ago | parent | prev | next [-]

This exact phrase has been said by 3 different users... weird.

sorhaindop 2 days ago | parent [-]

"Having spent a couple of weeks on Claude Code recently, I arrived to the conclusion that the net value for me from agentic AI is actually negative" - smells like BS to me.

jmartrican 3 days ago | parent | prev | next [-]

Maybe that is a skills issue.

rootusrootus 3 days ago | parent [-]

If you are suggesting that LLMs are proving quite good at taking over the low skilled work that probably 90% of devs spend the majority of their time doing, I totally agree. It is the simplest explanation for why many people think they are magic, while some people find very little value.

On the occasion that I find myself having to write web code for whatever reason, I'm very happy to have Claude. I don't enjoy coding for the web, like at all.

phist_mcgee 3 days ago | parent | next [-]

What a supremely arrogant comment.

rootusrootus 3 days ago | parent [-]

I often have such thoughts about things I read on HN but I usually follow the site guidelines and keep it to myself.

logicprog 3 days ago | parent | prev [-]

I think that's definitely true: these tools are only really taking care of the relatively low-skill stuff. They synthesize algorithms, architectures, and approaches that have been seen before; automate building out scaffolding and interpolating skeletons; run relatively typical bash commands for you after making code changes; implement fairly specific specifications of how to approach novel architectures, algorithms, or code logic; and automate exploring code bases and building an understanding of what things do, where they are, how they relate, and the control flow (which would otherwise take hours of laboriously grepping around and reading code), all in small bite-sized pieces with a human in the loop. They're even able to produce complete and fully working code for things that are a small variation or synthesis of things they've seen a lot before, in technologies they're familiar with.

But I think that can still be a pretty good boost (I'd say maybe 20 to 30%, plus MUCH less headache, when used right), even for people doing really interesting and novel things, because even if your work has a lot of novelty and domain knowledge to it, there's always mundane horseshit that eats up way too much of your time and brain cycles. So you can use these agents to take care of all the peripheral stuff for you and just focus on what's interesting to you. Imagine you want to write some really novel, unique, complex algorithm, but you do want it to have a GUI debugging interface. You can just use Imgui or TKinter (if you can make Python bindings or something) and then offload that whole thing onto the LLM, instead of carrying that extra cognitive load and having to page out the meat of what you're working on whenever you need to make a more-than-trivial modification to your GUI.

I also think this opens up the possibility for a lot more people to write ad hoc personal programs for various things they need, which is even more powerful when combined with something like Python, which has a ton of pre-made libraries that do all the difficult stuff for you, or something like emacs, which is highly malleable and rewards being able to write programs for it by letting them integrate very powerfully with your workflow and environment. Even for people who already know how to program and like programming, there's still an opportunity cost and an investment of time, effort, and cognitive load in making programs. So by significantly lowering that, you open up opportunities for us and also for people who don't know how to program at all; their productivity basically goes from zero to one, an improvement of 100% (or infinity lol).

wahnfrieden 4 days ago | parent | prev [-]

Did you try with using Opus exclusively?

freedomben 4 days ago | parent [-]

Do you know if there's a way to force Claude Code to do that exclusively? I've found a few env vars online but they don't seem to actually work.

atonse 4 days ago | parent | next [-]

You can type /config and then go to the setting to pick a model.

gdudeman 4 days ago | parent | prev | next [-]

Yes: type /model and then pick Opus 4.1.

wahnfrieden 4 days ago | parent | prev | next [-]

Peter Steinberger has been documenting his workflows and he relies exclusively on Opus at least until recently. (He also pays for a few Max 20x subscriptions at once to avoid rate limits.)

artursapek 4 days ago | parent | prev [-]

You can "force" it by just paying them $200 (which is nothing compared to the value)

parineum 3 days ago | parent | next [-]

Value is irrelevant. What's the return on investment you get from spending $200?

Collecting value doesn't really get you anywhere if nobody is compensating you for it. Unless someone is going to either pay for it for you or give you $200/mo post-tax dollars, it's costing you money.

wahnfrieden 3 days ago | parent [-]

The return for me is faster output of features, fixes, and polish for my products which increases revenue above the cost of the tool. Did you need to ask this?

parineum 3 days ago | parent [-]

Yes, I did. Not everybody has their own product that might benefit from a $200 subscription. Most of us work for someone else and, unless that person is paying for the subscription, the _value_ it adds is irrelevant unless it results in better compensation.

Furthermore, the advice was given to upgrade to a $200 subscription from the $20 subscription. The difference in value that might translate into income between the $20 option and the $200 option is very unclear.

wahnfrieden 3 days ago | parent | next [-]

If you are employed you should petition your employer for tools you want. Maybe you can use it to take the day off earlier or spend more time socializing. Or to get a promotion or performance bonus. Hopefully not just to meet rising productivity expectations without being handed the tools needed to achieve that. Having full-time access to these tools can also improve your own skills in using them, to profit from in a later career move or from contributing toward your own ends.

parineum 3 days ago | parent [-]

I'm not disputing that. I'm just pushing back against the casual suggestion (not by you) to just go spend $200.

No doubt that you should ask your employer for the tools you want/need to do your job, but plenty of us are using this kind of thing casually, and responding to "Any way I can force it to use [Opus] exclusively?" with "Spend $200, it's worth it" isn't really helpful, especially in a context where the poster was clearly looking to try it out to see if it was worth it.

Aeolun 3 days ago | parent | prev [-]

If you have the money, and like coding your own stuff, the $200 is worth it. If you just code for the enterprise? Not so much.

epiccoleman 3 days ago | parent | prev [-]

Is Opus that much better than Sonnet? My sub is $20 a month, so I guess I'd have to buy that I'm going to get a 10x boost, which seems dubious.

theshrike79 3 days ago | parent | next [-]

With the $20 plan you get Opus on the web and in the native app. Just not in Claude Code.

IMO it's pretty good for design, but with code it gets in its head a bit too much and overthinks and overcomplicates solutions.

artursapek 3 days ago | parent | prev [-]

Yes, Opus is much better at complicated architecture

noarchy 3 days ago | parent [-]

It does seem better in many regards, but the usage limits get hit quickly even with a paid account.

TZubiri 3 days ago | parent | prev | next [-]

"However. Price is king. Allowing me to flood the context window with my code base is great"

I don't vibe code, but in general, having to know all of the codebase to be able to do something is a smell; it's spaghetti; it's a lack of encapsulation.

When I program I cannot think about the whole codebase; I have a couple of files open, tops, and I think about the code in those files.

This issue of having to understand the whole codebase, complaining about abstractions, microservices, and OOP, and wanting everything to be in a "simple" monorepo or a monolith, is something that I see juniors do almost exclusively.

seanmmward 3 days ago | parent | prev | next [-]

The primary use case isn't just about shoving more code in context, although depending on the task, there is an irreducible minimum context needed for it to capture all the needed understanding. The 1M context model is a unique beast in terms of how you need to feed it, and its real power is being able to tackle long-horizon tasks which require iterative exploration, in-context learning, and resynthesis. I.e., some problems are breadth (go fix an API change in 100 files), others, however, require depth (go learn from trying 15 different ways to solve this problem). 1M Sonnet is unique in its capabilities for the latter in particular.

sdesol 4 days ago | parent | prev | next [-]

> I really desperately need LLMs to maintain extremely effective context

I actually built this. I'm still not ready to say "use the tool yet" but you can learn more about it at https://github.com/gitsense/chat.

The demo link is not up yet as I need to finalize an admin tool, but you should be able to follow the npm instructions to play around with it.

The basic idea is, you should be able to load your entire repo or repos and use the context builder to help you refine it. Or you can create custom analyzers that support 'AI Assisted' searches, like executing `!ask find all frontend code that does [this]`; because the analyzer knows how to extract the correct metadata to support that query, you'll be able to easily build the context using it.

hirako2000 4 days ago | parent | next [-]

Not clear how it gets around what is, ultimately, a context limit.

I've been fiddling with some process too; it would be good if you shared the how. The readme looks like yet another full-fledged app.

sdesol 3 days ago | parent [-]

Yes, there is a context window limit, but I've found that for most frontier models you can generate very effective code if the context window is under 75,000 tokens, provided the context is consistent. You have to think of everything from a probability point of view: the more logical the context, the greater the chances of better code.

For example, if the frontend doesn't need to know the backend code (other than the interface), leaving the backend code out when solving a frontend problem can reduce context size and improve the chances of the expected output. You just need to ensure you include the necessary interface documentation.
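
To make that concrete, here is a rough sketch of that kind of curation. The file names are hypothetical and the token estimate is a crude character count, not how the tool actually works:

    # Rough sketch: build a frontend-focused context that includes the backend
    # interface docs but excludes the backend implementation files.
    from pathlib import Path

    TOKEN_BUDGET = 75_000     # rough ceiling that tends to work well
    CHARS_PER_TOKEN = 4       # crude estimate; real tokenizers vary

    def estimate_tokens(text: str) -> int:
        return len(text) // CHARS_PER_TOKEN

    def build_context(paths: list[str]) -> str:
        parts, total = [], 0
        for path in paths:
            text = Path(path).read_text()
            tokens = estimate_tokens(text)
            if total + tokens > TOKEN_BUDGET:
                raise ValueError(f"context would exceed the budget at {path}")
            parts.append(f"--- {path} ---\n{text}")
            total += tokens
        return "\n\n".join(parts)

    # Frontend task: include the frontend files and the backend API docs,
    # but none of the backend implementation. (File names are made up.)
    context = build_context([
        "docs/api-interface.md",
        "frontend/src/SearchBox.jsx",
        "frontend/src/api/client.js",
    ])
    print(f"~{estimate_tokens(context)} tokens")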

As for the full-fledged app, I think you raised a good point and I should add a 'No lock-in' section explaining why to use it. The app has a message tool that lets you pick and choose which messages to copy. Once you've copied the context (including any conversation messages that can help the LLM), you can use the context wherever you want.

My strategy with the app is to be the first place you go to start a conversation, before you even generate code, so my focus is helping you construct contexts (the smaller the better) to feed into LLMs.

handfuloflight 3 days ago | parent | prev | next [-]

Doesn't Claude Code do all of this automatically?

sdesol 3 days ago | parent [-]

I haven't looked at Claude Code, so I don't know if it has analyzers that understand how to extract any type of data other than the specific coding data it is trained on. Based on the runtime for some tasks, I would not be surprised if it is going through all the files and asking "is this relevant?"

My tool is mainly targeted at massive code bases and enterprise as I still believe the most efficient way to build accurate context is by domain experts.

Right now, I would say 95% of my code is AI generated (98% human architected) and I am spending about $2 a day on LLM costs, and the code generation part usually runs for no more than 30 seconds for most tasks.

handfuloflight 3 days ago | parent [-]

Well, you should look at it, because it's not going through all files. I looked at your product and the workflow is essentially asking me to do manually what Claude Code does automatically. Granted, manually selecting the context will probably lead to lower costs in any case, because Claude Code invokes tool calls like grep to do its search, so I do see merit in your product in that respect.

sdesol 3 days ago | parent [-]

Looking at the code, it does have some sort of automatic discovery. I also don't know how scalable Claude Code is. I've spent over a decade thinking about code search, so I know what the limitations are for enterprise code.

One of the neat tricks that I've developed is: I load all my backend code for my search component and then ask the LLM to trace a query and create a context bundle for only the files that are affected. Once the LLM has finished, I just need to do a few clicks to refine an 80,000-token window down to about 20,000 tokens.

I would not be surprised if this is one of the tricks that it does, as it is highly effective. Also, yes, my tool is manual, but I treat conversations as durable assets, so in the future you should be able to say "last week I did this, load the same files" and the LLM will know what files to bring into context.

pacoWebConsult 3 days ago | parent | next [-]

FWIW Claude code conversations are also durable. You can resume any past conversation in your project. They're stored as jsonl files within your `$HOME/.claude` directory. This retains the actual context (including your prompts, assistant responses, tool usages, etc) from that conversation, not just the files you're affecting as context.

sdesol 3 days ago | parent [-]

Thanks for the info. I actually want to make it easy for people to review aider, plandex, claude code, etc. conversations so I will probably look at importing them.

My goal isn't to replace the other tools, but to make them work smarter and more efficiently. I also think that in a year or two we will start measuring performance based on how developers interact with LLMs (so management will want to see the conversations). Instead of looking at code generated, the question is going to be: if this person is let go, what is the impact, based on how they are contributing via their conversations?

handfuloflight 3 days ago | parent | prev | next [-]

Excellent, I look forward to trying it out, at minimum to wean off my dependency on Claude Code and its likely current state of overspending on context. I agree with looking at conversations as durable assets.

sdesol 3 days ago | parent [-]

> current state of overspending on context

The thing that is killing me when I hear about Claude Code and other agent tools is the amount of energy they must be using. People say they let a task run for an hour, and I can't help but think about how much energy is being used, and whether Claude Code is being upfront about how much things will actually cost in the future.

ec109685 3 days ago | parent | prev [-]

It greps around the code like an intern would. You have to have patience and be willing to document workflows and correct when it gets things wrong via CLAUDE.md files.

sdesol 3 days ago | parent [-]

Honestly, grepping isn't a bad strategy if there is enough context to generate focused keywords/patterns to search for. The "let Claude Code think for 10 minutes or more" makes a lot more sense now, as this brute-force method can take some time.

ec109685 3 days ago | parent [-]

Yeah and it’s creative with its grepping.

msikora 3 days ago | parent | prev | next [-]

Why not build this as an MCP so that people can plug it into their favorite platform?

sdesol 3 days ago | parent [-]

An MCP is definitely on the roadmap. My objective is to become the context engine for LLMs, so having an MCP is required. However, there will be things from a UX perspective that you'll lose out on if you just use the MCP.

kvirani 4 days ago | parent | prev [-]

Wait that's not how Cursor etc work? (I made assumptions)

sdesol 4 days ago | parent | next [-]

I don't use Cursor so I can't say, but based on what I've read, they optimize for smaller context to reduce cost and probably for performance. The issue is, I think this is severely flawed as LLMs are insanely context sensitive and forgetting to include a reference file can lead to undesirable code.

I am obviously biased, but I still think to get the best results, the context needs to be human curated to ensure everything the LLM needs will be present. LLMs are probabilistic, so the more relevant context, the greater the chances the final output is the most desired.

trenchpilgrim 4 days ago | parent | prev [-]

Dunno about Cursor but this is exactly how I use Zed to navigate groups of projects

hinkley 3 days ago | parent | prev | next [-]

Sounds to me like your problem has shifted from how much the AI tool costs per hour to how much it costs per token because resetting a model happens often enough that the price doesn't amortize out per hour. That giant spike every ?? months overshadows the average cost per day.

I wonder if this will become more universal, and if we won't see a 'tick-tock' pattern like Intel used, where they tweak the existing architecture one or more times between major design work. The 'tick' is about keeping you competitive and the 'tock' is about keeping you relevant.

scotty79 2 days ago | parent | prev | next [-]

Maybe use a cheaper model to compose a relevant context for the more expensive one?

Even better, use the expensive model to create a general set of guidelines for picking the right context for your project, which the cheaper model can then use in the future to pick the right context.
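
A minimal sketch of that two-stage idea (call_llm is a hypothetical wrapper around whichever provider API you use, and the model names are placeholders):

    # Two-stage context selection: a cheap model picks the relevant files,
    # then an expensive model does the actual work with only those files.

    def call_llm(model: str, prompt: str) -> str:
        """Hypothetical wrapper around your provider's chat/completion API."""
        raise NotImplementedError

    def pick_context(task: str, file_index: list[str]) -> list[str]:
        listing = "\n".join(file_index)
        answer = call_llm(
            "cheap-model",  # placeholder model name
            f"Task: {task}\n\nProject files:\n{listing}\n\n"
            "List only the file paths needed for this task, one per line.",
        )
        return [line.strip() for line in answer.splitlines() if line.strip()]

    def solve(task: str, file_index: list[str], read_file) -> str:
        chosen = pick_context(task, file_index)
        context = "\n\n".join(f"--- {p} ---\n{read_file(p)}" for p in chosen)
        return call_llm(
            "expensive-model",  # placeholder model name
            f"{context}\n\nTask: {task}",
        )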

khalic 3 days ago | parent | prev | next [-]

This is a major issue with LLMs altogether; it probably has to do with the transformer architecture. We need another breakthrough in the field for this to become reality.

jack_pp 3 days ago | parent | prev | next [-]

Maybe we need LLMs trained on ASTs, or a new symbolic way to represent software that's faster for LLMs to grok, with a translator so we can verify the code.

energy123 3 days ago | parent [-]

You could probably build a decent agentic harness that achieves something similar.

Show the LLM a tree and/or call-graph representation of your codebase (e.g. `cargo diagram` and `cargo-depgraph`), which is token-efficient.

And give the LLM a tool call to see the contents of the desired subtree. More precise than querying a RAG chunk or a whole file.

You could also have another optional tool call which routes the text content of the subtree through a smaller LLM that summarizes it into a maximum-density snippet, which the LLM can use for a token-efficient understanding of that subtree during the early planning phase.

But I'd agree that an LLM built natively around AST is a pretty cool idea.
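
A rough sketch of what that tool surface might look like (names and implementations are hypothetical; the tree/call-graph text would come from something like the cargo tools above, and these functions would be registered with whatever tool-calling API the harness uses):

    # Sketch of the tool surface for a tree / call-graph based harness.
    # The agent first sees a compact code map, then drills into subtrees on demand.
    from pathlib import Path

    def call_small_llm(prompt: str) -> str:
        """Hypothetical wrapper around a smaller, cheaper model."""
        raise NotImplementedError

    def get_code_map(root: str = ".", depth: int = 2) -> str:
        """Tool call: return a token-efficient tree of the codebase."""
        lines = []
        for p in sorted(Path(root).rglob("*")):
            rel = p.relative_to(root)
            if len(rel.parts) <= depth and not any(part.startswith(".") for part in rel.parts):
                lines.append("  " * (len(rel.parts) - 1) + rel.parts[-1])
        return "\n".join(lines)

    def read_subtree(path: str) -> str:
        """Tool call: return the contents of every file under a subtree."""
        files = [p for p in Path(path).rglob("*") if p.is_file()]
        return "\n\n".join(f"--- {p} ---\n{p.read_text(errors='ignore')}" for p in files)

    def summarize_subtree(path: str) -> str:
        """Optional tool call: route a subtree through a smaller model to get
        a maximum-density summary for the early planning phase."""
        return call_small_llm(f"Summarize this code as densely as possible:\n{read_subtree(path)}")

    # Registered as tools in whatever agent framework you use.
    TOOLS = {
        "get_code_map": get_code_map,
        "read_subtree": read_subtree,
        "summarize_subtree": summarize_subtree,
    }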

dberge 3 days ago | parent | prev | next [-]

> the price has substantially increased

I’m assuming the credits required per use won’t increase in Cursor.

Hopefully this puts pressure on them to lower credits required for gpt-5.

NuclearPM 3 days ago | parent | prev | next [-]

Problems

3 days ago | parent | prev | next [-]
[deleted]
fgbarben 3 days ago | parent | prev [-]

Allow me to flood the fertile plains of its consciousness with my seed... yes, yes, let it take root... this is important to me

fgbarben 3 days ago | parent [-]

Let me despoil the rich geography of your context window with my corrupted b2b SaaS workflows and code... absorb the pollution, rework it, struggling against the weight... yes, this pleases me, it is essential for the propagation of my germline