The author does not understand what LLMs and coding tools are capable of today.

> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over. This is exactly the opposite of what I am looking for. Software engineers test their work as they go. When tests fail, they can check in with their mental model to decide whether to fix the code or the tests, or just to gather more data before making a decision. When they get frustrated, they can reach for help by talking things through. And although sometimes they do delete it all and start over, they do so with a clearer understanding of the problem.

My experiences are based on using Cline with Anthropic Sonnet 3.7 doing TDD on Rails, and have a very different experience. I instruct the model to write tests before any code and it does. It works in small enough chunks that I can review each one. When tests fail, it tends to reason very well about why and fixes the appropriate place. It is very common for the LLM to consult more code as it goes to learn more.

It's certainly not perfect but it works about as well, if not better, than a human junior engineer. Sometimes it can't solve a bug, but human junior engineers get in the same situation too.

▲

alfalfasprout 6 days ago | parent | next [-]

It's funny always seeing comments like this. I call them "skill issue" comments.

The reality is the author very much understands what's available today. Zed, after all, is building out a lot of AI-focused features in its editor and that includes leveraging SOTA LLMs.

> It's certainly not perfect but it works about as well, if not better, than a human junior engineer. Sometimes it can't solve a bug, but human junior engineers get in the same situation too.

I wonder if comments like this are more of a reflection on how bad the hiring pool was even a few years ago than a reflection of how capable LLMs are. I would be distraught if I hired a junior eng with less wherewithal and capabilities than Sonnet 3.7.

	▲	materiallie 6 days ago \| parent \| next [-]
		This is a very friendly and cordial response. Given that the parent comment was implying that the creators of Zed don't actually know how to build software. Based on their credentials building Rails crud apps, I suppose.
	▲	ChromaticPanic 5 days ago \| parent \| prev [-]
		It's funny always seeing comments like this. I call them "humans are perfect" comments. We just assume that all human devs are good. I have met so many that reason like a wet paper bag. Arguably having a smaller context window than current LLM. I have seen and used so many buggy software written by humans, that I find it absurd that we expect LLMs to be perfect. If humans are "the standard" for intelligence, then there is no hope for these automated systems.

▲

YuukiRey 6 days ago | parent | prev | next [-]

I share examples of LLM fails on our company Slack and every week LLMs do the opposite of what I tell them.

I say capture logs without overriding console methods -> they override console methods.

YOU ARE NOT ALLOWED TO CHANGE THE TESTS -> test changed

Or they insert various sleep calls into a test to work around race conditions.

This is all from Claude Sonnet 4.

▲

carb 6 days ago | parent | next [-]

I've found better results when I treat LLMs like you would treat little kids. Don't tell them what NOT to do, tell them what TO do.

Say "keep your hands at your side, it's hot" and not "don't touch the stove, it's hot". If you say the latter, most kids touch the stove.

▲

alpaca128 6 days ago | parent | next [-]

If LLMs cannot reliably deal with this, how can they write reliable code? Following an instruction like "don't do X" is more basic than the logic of fizzbuzz.

This reminds me of the query "shirt without stripes" on any online image/product search.

▲

zahlman 5 days ago | parent [-]

Obligatory reminder that we used to live in a world where you could put "foo -bar" into a search engine, ctrl-F for foo on the top ten results and find it every time, and ctrl-F for bar on the top ten results and not find it.

	▲	alpaca128 3 days ago \| parent [-]
		Yeah, I've even had cases where DDG ignored my quoted string in the search. It's literally the whole point of the quotes but especially when it contains things like German umlauts it'll just accept any replacement letter for them. And yes, getting no results is acceptable, in fact it is the only correct outcome.

▲

amai 4 days ago | parent | prev | next [-]

Negation is a hard problem for AI and mainly unsolved:

- https://seantrott.substack.com/p/llms-and-the-not-problem

- https://github.com/elsamuko/Shirt-without-Stripes

▲

glitchcrab 6 days ago | parent | prev [-]

My eureka moment when I first started using Cursor a few weeks back was realising that I talking to it the same way I talk to my three year old and the results were fairly good (less so from my boy at times).

▲

IshKebab 6 days ago | parent [-]

Yeah it's also kind of funny people discovering all the LLM failure modes and saying "see! humans would never do that! it's not really intelligent!". None of those people have children...

▲

Chinjut 6 days ago | parent | next [-]

I don't want a computer that's as unreliable as a child. This is not what originally interested me about computers.

▲

IshKebab 5 days ago | parent | next [-]

Nobody said you did. I'm talking about the confidently incorrect assertions that humans would never display any of these unreliable behaviours.

	▲	tripzilch 3 days ago \| parent [-]
		They don't. At least not for the duration that LLMs keep it up. They really don't. If you want to pretend that being a 3 year old is not a transient state, and that controlling an AI is just like parenting an eternal 3 year old, there's probably a manga about that.

▲

jama211 4 days ago | parent | prev [-]

[flagged]

▲

tripzilch 3 days ago | parent | prev [-]

Maybe because none of those people are imagining children to be eternally stuck at that level of intelligence. At that age (regardless of being a parent or not) you can literally see them getting smarter over the course of weeks or months.

▲

sothatsit 6 days ago | parent | prev | next [-]

I have also had this happen, but only when my context is getting too long, at which point models stop reading my instructions. Or if there have been too many back and forths, this can happen as well.

Tthere is a steady decline in model's capabilities across the board as their contexts get longer. Wiping the slate clean regularly really helps to counteract this, but it can really become a pain to rebuild the context from scratch over and over. Unfortunately, I don't really know any other way to avoid the model's getting really dumb over time.

▲

maelito 6 days ago | parent | prev | next [-]

LLMs erasing your important comments is so irritating ! Happened to me often.

▲

toenail 6 days ago | parent | prev | next [-]

I simply had claude write me a linting tool that catches its repeated bad stuff..

▲

TheRealDunkirk 6 days ago | parent [-]

I was converting all the views in my Rails app from HAML to ERB. It was doing each one perfectly, so I told it to do the rest. It went through a few, then asked me if it could write a program, and run that. I thought, hey, cool, sure. I get it; it was trying to save tokens. Clever! However -- you know where this is going -- despite knowing all the rules, and demonstrating it could apply them, the program it wrote made a total dog's breakfast out of the rest of the files. Thankfully, I've learned to commit my working copy before big "AI" changes, and I just revert when it barfs. I forced Claude to do the rest "manually" at great token expense, but it did it correctly. I've asked it to write other scripts, which it has also mangled. So I haven't been impressed at Claude's "tool writing" capability yet, and I'm jealous of people who seem to have good luck.

	▲	polynomial 5 days ago \| parent [-]
		Imagine if you had to do this with an actual team member.

▲

paulcole 6 days ago | parent | prev | next [-]

> I share examples of LLM fails on our company Slack and every week LLMs do the opposite of what I tell them.

Must be fun.

▲

iamflimflam1 6 days ago | parent | prev | next [-]

Do you also share examples of when it works really well?

▲

pinoy420 6 days ago | parent | prev [-]

[dead]

▲

kubb 6 days ago | parent | prev | next [-]

I believe that they work particularly well for CRUD in known frameworks like Rails.

OTOH I tried building a native Windows Application using Direct2D in Rust and it was a disaster.

I wish people could be a bit more open about what they build.

▲

andrewmutz 6 days ago | parent | next [-]

I agree that it is probably easier for an LLM to write good code in any framework (like Rails) that has a lot of well-documented opinions about how things should be done. If there is a "right" place to put things, or a "right" way to model problems in a framework, its more likely that the model's opinions are going to line up with the human engineer's opinions.

	▲	alkonaut 6 days ago \| parent \| next [-]
		Also - that's easy for everyone. It's basically a framework so rigid/simple (Those are adjacent concepts for frameworks) that the business logic is almost boilerplate. That is, so long as you stay inside the guard rails. Ask it to make something in a rails app that's slightly beyond the CRUD scope and it will suffer - much like most humans would. So it's not that it's bad to let bots do boilerplate. But using very qualified humans for that to begin with was a waste to begin with. Hopefully in a few years none of us will need to do ANY part of CRUD work and we can do only the fun parts of software development.-
	▲	tripzilch 3 days ago \| parent \| prev [-]
		But isn't it crazy that while it's been impressively great at translating between human languages from the start, it's incapable of translating these well-documented best-ways-to-do-it things across domains or even programming languages.

▲

Aeolun 6 days ago | parent | prev | next [-]

I thought Claude got significantly smarter when I started using Rust. The big problem there is that I don’t understand the rust myself :P

	▲	klabb3 6 days ago \| parent [-]
		It’s the style. Responses are always eloquent and well structured. When you look at output for a domain you don’t know well, you give it benefit of the doubt because it sounds like a highly competent human, so you react similarly. When you use it with something you know very deeply, you naturally look more for substance rather than form, and thus spot the mistakes much easier. This breaks most illusions of amazing reasoning abilities etc. My ChatGPT is amazingly competent at gardening! Well, that’s how it feels anyway. Is it correct? I have no idea. It sounds right. Fortunately, it’s just a new hobby for me and the stakes are low. But generally I think it’s much better to be paranoid than gullible when it comes to confident sounding ramblings, whether it’s from an LLM or a marketing guru.

▲

sdesol 6 days ago | parent | prev | next [-]

> I wish people could be a bit more open about what they build.

I would say for the last 6 months, 95% of the code for my chat app (https://github.com/gitsense/chat) was AI generated (98% human architected). I believe what I created in the last 6 months was far from trivial. One of the features that AI helped a lot with, was the AI Search Assistant feature. You can learn more about it here https://github.com/gitsense/chat/blob/main/packages/chat/wid...

As a debugging partner, LLMs are invaluable. I could easily load all the backend search code into context and have it trace a query and create a context bundle with just the affected files. Once I had that, I would use my tool to filter the context to just those files and then chat with the LLM to figure out what went wrong or why the search was slow.

I very much agree with the author of the blog post about why LLMs can't really build software. AI is an industry game changer as it can truly 3x to 4x senior developers in my opinion. I should also note that I spend about $2 a day on LLM API calls (99% to Gemini 2.5 Flash) and I probably have to read 200+ LLM generated messages a day and reply back in great detail about 5 times a day (think of an email instead of chat message).

Note: The demo on that I have in the README hasn't been setup, as I am still in the process of finalizing things for release but the NPM install instructions should work.

▲

leptons 6 days ago | parent | next [-]

> probably have to read 200+ LLM generated messages a day and reply back in great detail about 5 times a day (think of an email instead of chat message).

I can think of nothing more tiresome than having to read 200 emails a day, or LLM chat messages. And then respond in detail 5 of those times. It wouldn't lead to "3x to 4x" performance gain after tallying up all the time reading messages and replying. I'm not sure people that use LLMs this way are really tracking their time enough to say with any confidence that "3x to 4x" is anywhere close to reality.

▲

sdesol 6 days ago | parent [-]

A lot of the messages are revisions so it is not as tedious as it may seem. As for the "3x to 4x", this is my own experience. It is possible that I am an outlier, but 80% of the generated AI code that I have are one-shot. I spend an hour or two (usually spread over days thinking about the problem) to accomplish something that would have taken a week or more for me to do.

I'm going to start producing metrics regarding how much code is AI generated along with some complexity metrics.

I am obviously bias, but this definitely feels like a paradigm shift and if people do not fully learn to adapt to it, it might be too late. I am not sure if you have ever watched Gattaca, but this sort of feels like it...the astronaut part, that is.

The profession that I have known for decades is starting to feel very different, in the same way that while watching Gattaca, my perception of astronauts changed. It was strange, but plausible and that is what I see for the software industry. Those that can articulate the problem I believe will become more valuable than the silent genius.

▲

leptons 6 days ago | parent | next [-]

The same noise was made about pair programming and it hasn't really caught on. Using LLMs to write code is one way of getting code written, but it isn't necessarily the best, and it seems kind of fad-ish honestly. Yes, I use "AI" in my coding workflow, but it's overall more annoying than it is helpful. If you're naturally 3x-4x times slower than I am, then congratulations, you're now getting up to speed. It's all pretty subjective I think.

▲

sdesol 6 days ago | parent [-]

> It's all pretty subjective I think.

This is very measurable, as you are not measuring against others, but yourself. The baseline is you, so it is very easy to determine if you become more productive or not. What you are saying is, you do not believe "you" can leverage AI to be more efficient than you currently are, which may well be true due to your domain and expertise.

▲

leptons 6 days ago | parent [-]

No matter what "AI" can or can't do for me, it's being forced on us all anyway, which kind of sucks. Every time I select something the AI wrote it's collecting a statistic and I'm sure someone is probably monitoring how much we use the "AI" and that could become a metric for job performance, even if it doesn't really raise quality or amplify my output very much.

▲

sdesol 6 days ago | parent [-]

> being forced on us all anyway, which kind of sucks

Business is business, and if you can demonstrate that you are needed they will keep you, for the most part, but business also has politics.

> probably monitoring how much we use the "AI" and that could become a metric for job performance

I will bet on this and take it one step further. They (employer) are going to want to start tracking LLM conversations. If everybody is using AI, they (employer) will need differentiators to justify pay raises, promotions and so forth.

	▲	leptons 6 days ago \| parent [-]
		>> how much we use the "AI" and that could become a metric for job performance > they (employer) will need differentiators to justify pay raises, promotions and so forth. That is exactly what I meant.

▲

normie3000 6 days ago | parent | prev [-]

> if people do not fully learn to adapt to it, it might be too late

Why would it ever be too late?

▲

sdesol 6 days ago | parent [-]

Age discrimination, saturated market, no longer a team fit (everybody is using AI and they have metrics to backup performance gains), etc.

▲

normie3000 6 days ago | parent [-]

Can't someone who doesn't use it just..start using it?

▲

sdesol 6 days ago | parent [-]

Sure it can become a hobby.

▲

normie3000 6 days ago | parent [-]

Are you implying that someone starting to use AI now has already been left so far behind by experienced users that they would never catch up? That seems ridiculous - it seems to be getting better understood with time, which should make catching up increasingly easier.

	▲	sdesol 6 days ago \| parent [-]
		No I mean trying to start in a few years. Basically if you feel ai is a fad and are trying to wait things out.

▲

Rexxar 6 days ago | parent | prev | next [-]

> I would say for the last 6 months, 95% of the code for my chat app was AI generated

Why did you squash 6 months of work in two commits ?

	▲	sdesol 5 days ago \| parent [-]
		It's actually more than 6 months. 6 months was when I developed enough to start chatting with AI to be really productive. Moving forward once the licence is in place and the files become unminifed you can track exactly what ai generated.

▲

QuadmasterXLII 6 days ago | parent | prev [-]

What happens when you tell the AI to set up the demo in the README?

	▲	sdesol 6 days ago \| parent [-]
		It summarized the instructions required to install and setup. It (Gemini and Sonnet) did fail to mention that I need to setup a server and create a DNS entry for the sub domain.

▲

wg0 6 days ago | parent | prev | next [-]

The author isn't wrong that LLMs don't work like an engineer and often fail miserably.

Here's what works however:

Mostly CRUD apps or REST API in Rails, Django or other Microframeworks such as FastAPI etc.

Or with React.

In that too, focus on small components and small steps or else you'll fail to get the results.

▲

quantumHazer 6 days ago | parent | prev | next [-]

yeah, tipically they are building a to do list and organizer app and have not found that github is flooded with college students' project of their revolutionary to-do apps

	▲	kubb 6 days ago \| parent [-]
		I don’t want to dismiss or disrespect anyone’s work. But I never see precise descriptions of categories of tasks that work well, it’s all based on vibes.

▲

stingraycharles 6 days ago | parent | prev [-]

I recently built a data streaming connector in Go with all kinds of bells and whistles attached (yaml based data parsers, circuit breakers, e2e stress testing frameworks, etc). Worked like a charm, I estimate it made two months of work about two weeks.

But you need to get your workflow right.

▲

littlestymaar 6 days ago | parent | prev | next [-]

> The author does not understand what LLMs and coding tools are capable of today.

Claiming that the people making an AI coding tool (Zed) don't know LLM coding tools is both preposterous and extremely arrogant.

	▲	geraneum 6 days ago \| parent [-]
		Oh well… you should see what some people comment under the posts from the likes of Yann LeCun. It’s very entertaining.

▲

lowsong 6 days ago | parent | prev | next [-]

> it works about as well, if not better, than a human junior engineer.

I see this line of reasoning a lot from AI-advocates and honestly it's depressing. Do you see less experienced engineers as nothing more than outputters of code? Is the entire point of being "junior" at something that you can learn and grow, which these LLM tools cannot.

▲

kordlessagain 6 days ago | parent | next [-]

That's not a line of reasoning. It's an opinion, and they matter. You don't get to make opinions go away just because you don't like them and want to conflate problem sets.

	▲	lowsong 6 days ago \| parent [-]
		I'm not disputing that they believe that these models are "as good as a junior engineer", by whatever metric you want to measure that on. My point is the very fact that someone uses that as an argument in support LLMs is... profoundly sad.

▲

zamadatix 6 days ago | parent | prev [-]

They're just comparing levels of work output but you're the one assuming that must mean a junior has no other value worth engaging.

▲

quantumHazer 6 days ago | parent | prev | next [-]

it's very well documented behavior that models try to pass failed test with hacks and tricks (hard coding solutions and so on)

▲

greymalik 6 days ago | parent [-]

It is also true that you can instruct them not to do that, with success.

▲

quantumHazer 6 days ago | parent [-]

It is also true that models doesn't give a ** about instructions sometimes and the do whatever text predictions is more likely (even with reasoning)

	▲	swat535 6 days ago \| parent [-]
		Another issue is that LLMS have no ability to learn anything. Even if you supply them with the file content, they are not able to recall it, or if they do, they will quickly forget. For example, if you tell them that the "Invoice" model has fields x, y, z and supply part of the schema. A few responses later, in the response it will give you an Invoice model that has a,b,c , because those are the most common ones. Adding to this, you have them writing tautology tests, removing requirements to fix the bugs and hallucinating new requirements and you end up with catastrophic consequences.

▲

bunderbunder 6 days ago | parent | prev | next [-]

From what I've experienced, this depends very much on the programming language, platform, and business domain.

I haven't tried it with Rails myself (haven't touched Ruby in years, to be honest), but it doesn't surprise me that it would work well there. Ruby on Rails programming culture is remarkably consistent about how to do things. I would guess that means that the LLM is able to derive a somewhat (for lack of a better word) saner model from its training data.

By contrast, what it does with Python can get pretty messy pretty quickly. One of the biggest problems I've had with it is that it tends to use a random hodgepodge of different Python coding idioms. That makes TDD particularly challenging because you'll get tests that are well designed for code that's engineered to follow one pattern of changes, written against a SUT that follows conventions that lead to a completely different pattern of changes. The result is horribly brittle tests that repeatedly break for spurious reasons.

And then iterating on it gets pretty wild, too. My favorite behavior is when the real defect is "oops I forgot to sort the results of the query" and the suggested solution is "rip out SqlAlchemy and replace it with Django."

R code is even worse; even getting it to produce code that follows a spec in the first place can be a challenge.

▲

jarjoura 6 days ago | parent | prev | next [-]

My experience so far is that, if you're limiting the "capacity" to junior engineer, yes, especially when it's seen a problem before. It's able to quickly realize a solution and confirm the solution works.

It does not works so well for any problems it has not seen before. At that point you need to explain the problem, and instruct the solution. So a that point, you're just acting as a mentor instead of using your capacity to just implement the solution yourself.

My whole team has really bought into the "claude-code" way of doing side tasks that have been on the backlog for years, think like simple refactors, or secondary analytic systems. Basically any well-trodden path that is mostly constrained by time that none of us are given, are perfect for these agents right now.

Personally I'm enjoying the ability to highlight a section of code and ask the LLM to explain this to me like I'm 5, or look for any potential race conditions. For those archiac, fragile monolithic blocks of code that stick around long after the original engineers have left, it's magical to use the LLM to wrap my head around that.

I haven't found it can write these things any better though, and that is the key here. It's not very good at creating new things that aren't commonly seen. It also has a code style that is quite different than what already exists. So when it does inject code, often times it has to be rewritten to fit the style around it. Already, I'm hearing whispers of people say things like "code written for the AI to read." That's where my eyes roll because the payoff for the extra mental bandwidth doesn't seem worth it right now.

▲

itsalotoffun 6 days ago | parent | prev | next [-]

> It works in small ... chunks

Yup.

> I ... review each one

Yup.

These two practices are core to your success. GenAI hangs reliably hangs itself given longer rope.

▲

llmsRstubborn 6 days ago | parent | prev | next [-]

Agreed. This passage in particular

> when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over

Is the EXACT OPPOSITE of what LLM's tend to do. They are very stubborn in their approach and will keep at it often until you rollback to a previous prompt. Them deleting code tends to happen on command, except specifically if I do TDD, which may as well be a preemptive command to do so.

▲

6 days ago | parent | prev | next [-]

[deleted]

▲

shivenigma 3 days ago | parent | prev | next [-]

> The author does not understand what LLMs and coding tools are capable of today.

Not really, I would say they used it well and understood the limitations of LLM exactly. No matter how much polished the output or how good it is, LLMs can't build mental models of a codebase like a human does because they are just statistical machines.

▲

serf 6 days ago | parent | prev | next [-]

in my experience TDD is a very powerful paradigm for use with LLMs.

it does a good enough job of wrangling behavior via implied context of the test-space that it seems to really reduce the amount of explanation needed and surprise garbage output.

▲

raincole 6 days ago | parent | prev | next [-]

> The author does not understand what LLMs and coding tools are capable of today.

Uh...

This author is developing "LLMs and coding tools of today." It's not like they're just making a typical CRUD Rails app.

▲

solarkraft 6 days ago | parent | prev | next [-]

I have had cline add ignore tags when instructed to fix type errors. That was frustrating but fixed with some stern words. I now know this failure mode and can handle it in the standard prompt (actually I expect that evetually the Cline people will).

▲

tiahura 6 days ago | parent | prev | next [-]

I spent about 4 hours yesterday vibe coding a rip-off of a commercial product in a language I’ve never used, and I’m a lawyer, not a programmer.

▲

xmorse 6 days ago | parent | prev | next [-]

it probably because the author uses the useless implementation of the Zed agent

	▲	madacol 5 days ago \| parent [-]
		"useless" is too unfair, but I do think the agent is not as good as claude code, or cursor

▲

anentropic 6 days ago | parent | prev [-]

it also tends to write bad tests, like a junior...