KellyCriterion 14 hours ago

> there really is no moat.

For ChatGPT and Gemini, yes.

But for Claude, they have a very deep & big one: it's the only model that gets production-ready output from the first detailed prompt. Yesterday I had used up my tokens by noon, so I tried some output from Gemini & Co. I presented a working piece of code which is already in production:

1. It silently changed things like "Touple.First.Date.Created" and "Touple.Second.Date.Created" to "Touple.FirstDate" and "Touple.SecondDate", which broke the code

2. There was a const list of 12 definitions for a given context. When told to rewrite the function, it simply cut 6 of these 12 definitions, making the code no longer compile. When I asked why they were cut: "Sorry, I was just too lazy typing" ?? LOL

3. There is a list holding some items, "_allGlobalItems" - in the function it simply changed the name to "_items", and the code didn't compile
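The third failure mode is easy to sketch. This is a hypothetical Python analogue (the original code was C#, and all names here are illustrative, not the actual production code):

```python
# Illustrative reconstruction of failure mode 3: a rewrite that silently
# renames a module-level list inside a function without touching its
# definition. The list exists under exactly one name:
_all_global_items = ["alpha", "beta", "gamma"]

def count_original():
    # References the name that actually exists -- this works.
    return len(_all_global_items)

def count_llm_rewrite():
    # The rewrite renamed the list to `_items` here only, so no such
    # name exists and the call fails.
    return len(_items)  # NameError at call time

print(count_original())  # 3
try:
    count_llm_rewrite()
except NameError:
    print("rewrite is broken")
```

In a compiled language like C# the same mistake is caught at build time instead of at call time, which is why the poster's code "didn't compile".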

As said, a working version of a similar function was given upfront.

With Claude, I never have such issues.

ptnpzwqd 13 hours ago | parent | next [-]

I have used Claude (incl. Opus 4.6) fairly extensively, and Claude still spits out quality that is far below what I would call production-ready - littered with smaller issues, plus the occasional larger blunder. Particularly when doing anything non-trivial, and even when guiding it in detail (although that admittedly reduces the number of larger structural issues).

Maybe it is tech-stack dependent (I have mostly used it with C#/.NET), but I have heard people say the same for C#. The only conclusion I have been able to draw from this is that people have very different definitions of production-ready, but I would really like to see some concrete evidence of Claude one-shotting a larger/complex C# feature or the like (with or without detailed guidance).

KellyCriterion 11 hours ago | parent | next [-]

> C#/.NET

same here :)

> one-shots a larger/complex C# feature

I can show you a timeseries data renderer which was created with one initial, very large prompt and then three follow-up "change this and that" prompts. The file is around 5000 lines and everything works fine & exactly as specified.

allajfjwbwkwja 5 hours ago | parent | next [-]

> The file is around 5000 lines

Yep, this is another case of different standards for "production ready."

KellyCriterion 2 hours ago | parent [-]

Caught, good one! :-))

++1

ptnpzwqd 11 hours ago | parent | prev [-]

Feel free to share it, would be very curious - ideally alongside the prompts.

KellyCriterion 5 hours ago | parent [-]

Do you have an email address?

ptnpzwqd 2 hours ago | parent [-]

You can use this: hnthrowaway.outboard407@passmail.net

skeledrew 8 hours ago | parent | prev | next [-]

I don't get it though. Why do you expect perfect responses? Humans continually make mistakes, and AI is trained on human data. Yet there seems to be this higher bar of expectation for the latter. Somehow people expect this thing that's been around for a few weeks/months, and that cannot learn anything beyond its training cutoff date, to always do a better job than a human who's been around for 20+ years and is able to keep learning until death.

ptnpzwqd 8 hours ago | parent [-]

I don't expect that - I am merely responding to the parent comment's claim that Claude consistently one-shots production-ready code (which does not at all match my observations).

peteforde 12 hours ago | parent | prev | next [-]

I see this over and over again. I don't dispute your experience. My experience with ESP32 development has been unreasonably positive. My codebase is sitting around 600k LoC and is the product of several hundred Opus 4.x Plan -> Agent -> Debug loops. I review everything that goes through, but I'm reviewing the business logic and domain gotchas, not dumb crap like what you and so many others describe.

What is so strange to me is that surely there is more C# out there than ESP-IDF code? I don't have a good explanation beyond saying that my codebase is extensively tested and used; I would know very quickly if it suddenly started shitting the bed in the way you describe.

whaleidk 7 hours ago | parent | next [-]

600k lines of code for anything on the ESP32 sounds like the absolute polar opposite of “good”

ivan_gammel 12 hours ago | parent | prev | next [-]

The more code is out there, the worse the average in the training dataset becomes. There will be legacy approaches and APIs, poor design choices, popular use cases irrelevant to your context, etc., all of which increase the chances of the output not matching your expectations. In the Java world this is exactly how it works. I need 3-5 iterations with Claude to get things done the way I expect, sometimes jumping straight to manual refactoring and then returning the result to Claude for review and learning. My CLAUDE.md files (multiple of them) are growing big with all the patterns and anti-patterns identified this way. To overcome this problem the model needs specialized training, which I don't think the industry knows how to approach (it has to beat the effort put into the education system for humans).
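A CLAUDE.md of this kind might contain entries like the following - an invented sketch for a Java project, not the commenter's actual file:

```markdown
## Patterns
- Use constructor injection; never field injection via @Autowired.
- New endpoints return DTO records, never JPA entities.
- All time handling uses java.time types.

## Anti-patterns seen in generated code
- Do not reintroduce java.util.Date or Calendar.
- Do not add Lombok; it was removed deliberately.
- Do not write custom retry loops; use the existing retry helper.
```

The idea is that corrections made once during review become standing instructions, so the model stops repeating the same legacy or off-style choices.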

mjdiloreto 9 hours ago | parent | next [-]

I also believe this must be true. Try asking Claude to program in Forth; I find the results to be unreasonably good. That's probably because most of the available Forth code to train on is high quality.

re-thc 10 hours ago | parent | prev [-]

> To overcome this problem the model needs specialized training, which I don't think the industry knows how to approach

We already have coding-tuned models, e.g. Codex. We should also have language/technology-specific models with a focus on recent/modern usage.

The problem with something like Java is that it's too old -- too many variants. Make a cutoff, e.g. only train on code from Java 8 or 17 and above.

xienze 9 hours ago | parent | prev [-]

> My experience with ESP32 development has been unreasonably positive. My codebase is sitting around 600k LoC and is the product of several hundred Opus 4.x Plan -> Agent -> Debug loops.

I feel like this is an example of people having different standards for what “good” code is, hence the differing opinions of how good these tools are. I'm not an embedded developer, but 600K LOC seems like a lot in that context, doesn't it? Again, I could be way off base here, but that sounds like there must be a lot of spaghetti and copy-paste all over the codebase for it to end up that large.

surajrmal 8 hours ago | parent [-]

I don't think it's that large. Keep in mind embedded projects take few, if any, dependencies. The standard library in most languages is far bigger than 600k LoC.

whaleidk 7 hours ago | parent | next [-]

I work with ESP32 devices and 600k lines of code is insane.

the__alchemist 4 hours ago | parent | prev [-]

I'm curious: What does this device do?

je42 13 hours ago | parent | prev | next [-]

Interesting - what kind of structural issues have you encountered?

Are these more related to the existing source code, or are they bad patterns that you would never use regardless of the existing code?

huflungdung 12 hours ago | parent | prev [-]

[dead]

AlecSchueler 13 hours ago | parent | prev | next [-]

> It's the only model that gets production-ready output from the first detailed prompt. Yesterday I had used up my tokens by noon, so I tried some output from Gemini & Co. I presented a working piece of code which is already in production:

One does often hear that LLMs shine at greenfield code generation but that they all start to struggle with pre-existing code. It could be that this wasn't a like-for-like comparison.

That said, I do personally find that Claude produces far better results than its competitors.

piva00 9 hours ago | parent | next [-]

> One does often hear that LLMs shine at greenfield code generation but that they all start to struggle with pre-existing code. It could be that this wasn't a like-for-like comparison.

In my experience working in a large codebase with a good set of standards, that's not the case: I can supply existing examples from the codebase for Claude to use as guidance, and it generates quite decent code.

I think that's because there's already a lot of decent code for it to slurp and derive from, plus good-quality tests at the functional level (so regressions are caught quickly).

I do understand, though, that on codebases with a hodgepodge of styles, varying quality of tests, etc., it probably doesn't work as well. Still, I'm quite impressed that I can do the thinking, add relevant sections of the code to the context (including protocols, APIs, etc.), describe what I need done, and get back a plan that most times is correct or very close to it, which I can then iterate on to fix gaps/mistakes before having it implemented.

Of course, there are still tasks it fails at, and I don't like doing multiple iterations to correct course; those I do manually, with the odd usage here and there to refactor bits and pieces.

Overall, I believe that if your codebase was already healthy, you can have LLMs work quite well with pre-existing code.

jacquesm 13 hours ago | parent | prev | next [-]

> One does often hear that where LLMs shine is with greenfield code generation but they all start to struggle working with pre-existing code.

Don't we all?

astrange 2 hours ago | parent | next [-]

I'm better at pre-existing code, if only because empty text files give me writer's block.

AlecSchueler 11 hours ago | parent | prev | next [-]

Whether we do or not is beside the point. The comparison was between Claude, which produced competent greenfield code, and Gemini, which struggled with brownfield. The comparison is stacked in Claude's favour.

seba_dos1 12 hours ago | parent | prev [-]

Nope.

ivan_gammel 12 hours ago | parent | prev [-]

Greenfield implementation is not flawless either.

ajshahH 8 hours ago | parent [-]

The only sources of these “it works flawlessly” claims I know of are:

- literal Claude ads I see online

- my underperforming coworkers whose code I’ve had to clean up - so I know first-hand that no, it wasn’t flawless

This kind of sentiment is gaslighting CTOs everywhere though. Very annoying.

ben_w 14 hours ago | parent | prev | next [-]

That's been my experience too. I'm using the recent free trial of OpenAI Plus to vibe code, and from this I would say that if Claude Code is a junior with 1-3 years of experience, OpenAI's Codex is like a student coder.

Oreb 13 hours ago | parent [-]

Does it depend on what type of programming you do? Doing Swift/SwiftUI work, I have exactly the opposite experience. I’ve been using both recently, and I want to use Claude alone (especially after last week’s events), but Codex is just so much faster and better.

boxedemp 3 hours ago | parent | next [-]

I find it matters very much. I find Gemini better for pretty frontends, Claude Opus for planning, and Gemini and Opus for code reviews. Codex is great when I want the LLM to follow instructions more strictly - good if you already have a detailed design.

Definitely depends on your use.

ben_w 11 hours ago | parent | prev [-]

Swift/SwiftUI are two of the three experimental projects I'm using Codex on, the other is a physics simulation in python.

It keeps trying to re-invent the wheel, and does a bad job of it.

The physics sim was supposed to be a thin wrapper around existing libraries, but instead it tried to write all the simulation code itself as a "fallback" (which was broken), and it never actually installed the real simulators that already did this stuff, despite being told to use them in the first place. The last few dozen(!) prompts from me have been pairs of roughly ["Find all cases where you've re-invented the wheel, add them to the planning document", "now do them"]. And as far as I can tell, it still hasn't finished removing the original nonsense.

One of the two Swift experiments is just a dice roller. It took about 10 rounds of non-compiling Metal shaders before I got it to work (I don't know Metal, which is why I didn't give up after 4 and write them by hand), and when it did work, the next four rounds immediately broke it again. It wrote its own chart instead of using Swift Charts, and did it badly. It tried to put all the hamburger-menu options into a UIAlertController. Something blocks the UI for several seconds when you change the dice font. I didn't count how many attempts it took to correctly label the D4.

The other Swift experiment was a musical instrument app, that got me to the prototype stage, eventually, but in a way that still felt like a student's project rather than a junior's project.

skeledrew 8 hours ago | parent | next [-]

> Find all cases where you've re-invented the wheel

Did you name the "wheels" you wanted it to use in the original prompt? It's a toss-up when you aren't very specific about what you want.

ben_w 7 hours ago | parent [-]

For the Swift apps, at least half of the errors are of a type where I wouldn't expect to need to tell someone not to do it that way - only a student could reasonably be expected not to know better.

For the Python physics sim, step 1 was to generate the plan. The prompt included "I want actual plasma physics, including high-density, high-field regimes, externally applied fields, etc., so consider which FOSS libraries would suit this." It then proceeded to choose some existing libraries itself, and I made sure those specific named FOSS libraries actually ended up in the plan.

My first clue that this wasn't going to work was that even from step 1 it was pushing to write all the simulation code itself rather than actually use e.g. WarpX, despite having suggested WarpX itself. In fact, even when WarpX was in the plan, the task was "integrate" rather than "just use this from the get-go".

I may well throw the whole thing out and try again with Claude when this trial expires. Most of the runs have been comically non-physical, to the extent that you don't even need a physics degree to notice - or even a physics GCSE.

ben_w 9 hours ago | parent | prev [-]

(Just outside edit window, I now realise I was ambiguous in this comment, it was more like "Find all cases where you've re-invented the wheel, add their removal to the planning document")

otabdeveloper4 12 hours ago | parent | prev | next [-]

> It's the only model that gets production-ready output from the first detailed prompt.

That's, just, like, your opinion, man.

KellyCriterion 11 hours ago | parent [-]

...and that of a lot of colleagues in and out of my sector :)

littlestymaar 14 hours ago | parent | prev | next [-]

> But for Claude, they have a very deep & big one: It's the only model that gets production-ready output from the first detailed prompt

That's not a moat, though. Claude itself wasn't at this level 6 months ago, and there's no reason to think Chinese open models won't be there within a year at most.

To keep its current position, Claude has to keep improving at the same pace as its competitors.

jccx70 8 hours ago | parent | prev [-]

[dead]