porise 13 hours ago

I wish the people who wrote this would let us know what kind of codebases they are working on. These tools seem mostly useless in a sufficiently large codebase, especially when it's messy and interactions aren't always obvious. I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.

CSMastermind 2 hours ago | parent | next [-]

Are you using Codex?

I'm not sure how big your repos are but I've been effective working with repos that have thousands of files and tens of thousands of lines of code.

If you're just prototyping it will hit a wall when things get unwieldy, but that's normally a sign that you need to refactor a bit.

Super strict compiler settings, static analysis, comprehensive tests, and documentation help a lot. As does basic technical design. After a big feature is shipped I do a refactor cycle with the LLM where we do a comprehensive code review and patch things up. This does require human oversight because the LLMs are still lacking judgement on what makes for good code design.
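
For a TypeScript codebase, "super strict" might look something like this — a sketch; these are all real tsconfig flags, but the exact set is a matter of taste:

    // tsconfig.json — "strict" plus the extra checks it doesn't include
    // (tsconfig allows comments; which flags to enable is a judgment call)
    {
      "compilerOptions": {
        "strict": true,
        "noUncheckedIndexedAccess": true,
        "noImplicitOverride": true,
        "exactOptionalPropertyTypes": true,
        "noFallthroughCasesInSwitch": true,
        "noUnusedLocals": true,
        "noUnusedParameters": true
      }
    }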

The places where I've seen them be useless is working across repositories or interfacing with things like infrastructure.

It's also very model-dependent. Opus is a good daily driver, but Codex is much better at writing tests for some reason. I'll often also switch to it for hard problems that Claude can't solve. Gemini is nice for 'I need a prototype in the next 10 minutes', especially for making quick-and-dirty bespoke front-ends where you don't care about the design, just the functionality.

madhadron an hour ago | parent [-]

> tens of thousands of lines of code

Perhaps this is part of it? Tens of thousands of lines of code seems like a very small repo to me.

CameronBanga 13 hours ago | parent | prev | next [-]

This is an antidotal example, but I released this last week after 3 months of work on it as a "nights and weekends" project: https://apps.apple.com/us/app/skyscraper-for-bluesky/id67541...

I've been working in the mobile space since 2009, though primarily as a designer and then product manager. I work in kind of a hybrid engineering/PM job now, and have never been a particularly strong programmer. I definitely wouldn't have thought I could make something with that polish, let alone in 3 months.

That code base is ~98% Claude code.

bee_rider 13 hours ago | parent [-]

I don’t know if “antidotal example” is a pun or a typo but I quite like it.

CameronBanga 12 hours ago | parent | next [-]

Lol typing on my phone during lunch and meant anecdotal. But let's leave it anyways. :)

oasisbob 13 hours ago | parent | prev [-]

That is fun.

Not sure if it's an American pronunciation thing, but I had to stare at that long and hard to see the problem and even after seeing it couldn't think of how you could possibly spell the correct word otherwise.

bsder 8 hours ago | parent [-]

> Not sure if it's an American pronunciation thing

It's a bad American pronunciation thing like "Febuwary" and "nuculer".

If you pronounce the syllables correctly, "an-ec-dote", "Feb-ru-ar-y", "nu-cle-ar" the spellings follow.

English has its fair share of spelling stupidities, but if people don't even pronounce the words correctly, there is no hope.

fy20 3 hours ago | parent | prev | next [-]

At my day job my team uses it on our main dashboard, which is a pretty large CRUD application. The frontend (Vue) is a horrible mess, as it was originally built by people who knew just enough to be dangerous. Over time people have introduced new standards without cleaning up the old code - for example, we have three or four different state management technologies.

For this the LLM struggles a bit, but so does a human. The main issues are that it messes up state it didn't realise was used elsewhere, and our test coverage is not great. We've seen humans make exactly the same kinds of mistakes. We use MCP for Figma, so most of the time it can get a UI 95% done, with just a few tweaks needed by the operator.

On the backend (TypeScript + Node, good test coverage) it can pretty much one-shot - from a plan - whatever feature you give it.

We use opus-4.5 mostly, and sometimes gpt-5.2-codex, through Cursor. You aren't going to get ChatGPT (the web interface) to do anything useful - switch to Cursor, Codex, or Claude Code. And right now it is worth paying for the subscription; you don't get the same quality from cheaper or free models (although they are starting to catch up - I've had promising results from GLM-4.7).

TaupeRanger 13 hours ago | parent | prev | next [-]

Claude Code and Codex are CLI tools you use to give the LLM context about the project on your local machine or dev environment. The fact that you're using the name "ChatGPT" instead of Codex leads me to believe you're talking about using the web-based ChatGPT interface to work on a large codebase, which is completely beside the point of the entire discussion. That's not the tool anyone here is talking about.

danielvaughn 13 hours ago | parent | prev | next [-]

It's important to understand that he's talking about a specific set of models released around November/December, and that we've hit a kind of inflection point in model capabilities - specifically Anthropic's Opus 4.5 model.

I never paid any attention to different models, because they all felt roughly equal to me. But Opus 4.5 is really and truly different. It's not a qualitative difference; it's more that it finally hit the quantitative edge that allows me to lean much more heavily on it for routine work.

I highly suggest trying it out, alongside a well-built coding agent like the one offered by Claude Code, Cursor, or OpenCode. I'm using it on a fairly complex monorepo and my impressions are much the same as Karpathy's.

yasoob 3 hours ago | parent | prev | next [-]

Another personal example. I spent around a month last year in January on this application: https://apps.apple.com/us/app/salam-prayer-qibla-quran/id674...

I had never used Swift before that and was able to use AI to whip up a fairly full-featured and complex application with a decent amount of code. I had to make some cross-cutting changes along the way that impacted quite a few files, and things mostly worked fine with me guiding the AI. Mind you, this was a year ago, so I can only imagine how much better I would fare now with even better AI models. That whole month was spent not only on coding but on learning enough Swift to fix problems when the AI started running in circles, and then learning the Xcode profiler to optimize the application for speed.

keerthiko 13 hours ago | parent | prev | next [-]

Almost always, notes like these are going to be about greenfield projects.

Trying to incorporate it in existing codebases (esp when the end user is a support interaction or more away) is still folly, except for closely reviewed and/or non-business-logic modifications.

That said, it is quite impressive to set up a simple architecture, or just list the filenames, and tell some agents to go crazy to implement what you want the application to do. But once it crosses a certain complexity, I find you need to prompt closer and closer to the weeds to see real results. I imagine a non-technical prompter cannot proceed past a certain prototype fidelity threshold, let alone make meaningful contributions to a mature codebase via LLM without a human engineer to guide and review.

reubenmorais 12 hours ago | parent | next [-]

I'm using it on a large set of existing codebases full of extremely ugly legacy code, weird build systems, and tons of business logic, shipping directly to prod at breakneck growth over the last two years, and it's delivering the same type of value that Karpathy writes about.

jjfoooo4 11 hours ago | parent | prev | next [-]

That was true for me, but is no longer.

It's been especially helpful in explaining and understanding arcane bits of legacy code behavior my users ask about. I trigger Claude to examine the code and figure out how the feature works, then tell it to update the documentation accordingly.

chrisjj 7 hours ago | parent [-]

> I trigger Claude to examine the code and figure out how the feature works, then tell it to update the documentation accordingly.

And how do you verify its output isn't total fabrication?

_dark_matter_ 3 hours ago | parent [-]

Your docs are a contract. You can verify that contract using integration tests.

1123581321 12 hours ago | parent | prev [-]

These models do well changing brownfield applications that have tests because the constraints on a successful implementation are tight. Their solutions can be automatically augmented by research and documentation.

mh2266 4 hours ago | parent [-]

I don't exactly disagree with this, but I have seen models simply delete the tests, or update them to pass and declare the failures "unrelated to my changes", which they then helpfully "fixed".

hnben 8 minutes ago | parent [-]

Yes. You have to treat the model like an eager yet incompetent worker, i.e. don't go full YOLO mode; review everything they do.

gwd 9 hours ago | parent | prev | next [-]

For me, in just the golang server instance and the core functional package, `cloc` reports over 40k lines of code, not counting other supporting packages. I spent the last week having Claude rip out the external auth system and replace it with a home-grown one (and having GPT-codex review its changes). If anything, Claude makes it easier on me as a solo founder with a large codebase. Rather than having to re-familiarize myself with code I wrote a year ago, I describe it at a high level, point Claude to a couple of key files, and then tell it to figure out what it needs to do. It can use grep, language server, and other tools to poke around and see what's going on. I then have it write an "epic" in markdown containing all the key files, so that future sessions already know the key files to read.
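
To give an idea, the "epic" file is nothing fancy - something like this (a sketch; the filenames and layout here are made up for illustration):

    # Epic: replace external auth with home-grown sessions

    ## Key files
    - auth/session.go     - issue and validate session tokens
    - auth/middleware.go  - request authentication middleware
    - server/routes.go    - login/logout endpoints

    ## Constraints
    - Existing sessions must keep working during rollout
    - All handlers go through the middleware; no direct cookie reads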

I really enjoyed the process. As TFA says, you have to keep a close eye on it. But the whole process was a lot less effort, and I ended up doing more than I would otherwise have done.

ph4te 13 hours ago | parent | prev | next [-]

I don't know how big "sufficiently large" is, but we have a 1M LOC Java application that is ~10 years old and runs POS systems, and Claude Code has no issues with it. We have done full analyses with output detailing each module, and also used it to pinpoint specific issues when described. Vibe coding is not used here, just analysis.

BeetleB 8 hours ago | parent | prev | next [-]

> These tools seem mostly useless in a sufficiently large codebase, especially when it's messy and interactions aren't always obvious.

What type of documents do you have explaining the codebase and its messy interactions, and have you provided that to the LLM?

Also, have you tried giving someone brand new to the team the exact same task and information you gave to the LLM, and how effective were they compared to the LLM?

> I don't know how much better Claude is than ChatGPT, but I can't get ChatGPT to do much useful with an existing large codebase.

As others have pointed out, from your comment it doesn't sound like you've used a tool dedicated to AI coding.

(But even if you had, it would still fail if you expect LLMs to do stuff without sufficient context).

jwr 4 hours ago | parent | prev | next [-]

I successfully use Claude Code in a large complex codebase. It's Clojure, perhaps that helps (Clojure is very concise, expressive and hence token-dense).

culi 4 hours ago | parent [-]

Perhaps it's harder to "do Clojure wrong" than it is to do JavaScript or Python or whatever other extremely flexible multi-paradigm high-level language

wcedmisten 3 hours ago | parent [-]

Having spent 3 years of my career working with Clojure, I think it actually gives you even more rope to shoot yourself with than Python/JS.

E.g. macros exist in Clojure but not Python/JS, and I've definitely been plenty stumped by seeing them in the codebase. They tend to be used in very "clever" patterns.

On the other hand, I'm a bit surprised Claude can tackle a complex Clojure codebase. It's been a while since I attempted using an LLM for Clojure, but at the time it failed completely (I think because there is relatively little training data compared to other mainstream languages). I'll have to check that out myself

smusamashah 9 hours ago | parent | prev | next [-]

The code base I work on at $dayjob$ is legacy; it has a few files with 20k lines each and a few more with around 10k lines each. It's hard to find things and connect the dots in a code base like that. I don't think LLMs are able to navigate and understand code bases of that size yet. But I have seen lots of seemingly large projects shown here lately that involve thousands of files and millions of lines of code.

jumploops 9 hours ago | parent [-]

I’ve found that LLMs seem to work better on LLM-generated codebases.

Commercial codebases, especially private internal ones, are often messy. It seems this is mostly due to the iterative nature of development in response to customer demands.

As a product gets larger, and addresses a wider audience, there’s an ever increasing chance of divergence from the initial assumptions and the new requirements.

We call this tech debt.

Combine this with a revolving door of developers, and you start to see Conway’s law in action, where the system resembles the organization of the developers rather than the “pure” product spec.

With this in mind, I’ve found success in using LLMs to refactor existing codebases to better match the current requirements (i.e. splitting out helpers, modularizing, renaming, etc.).

Once the legacy codebase is “LLMified”, the coding agents seem to perform more predictably.

YMMV here, as it’s hard to do large refactors without tests for correctness.

(Note: I’ve dabbled with a test first refactor approach, but haven’t gone to the lengths to suggest it works, but I believe it could)

mh2266 4 hours ago | parent | next [-]

are LLM codebases not messy?

Claude by default, unless I tell it not to, will write stuff like:

    // we need something to be true
    const somethingPasses = something()
    if (!somethingPasses) {
        return false
    }

    // we need somethingElse to be true
    const somethingElsePasses = somethingElse()
    if (!somethingElsePasses) {
        return false
    }

    return true
instead of the very simple boolean logic that could express this in one line, with the "this code does what it obviously does" comments added all over the place.
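
For reference, that one line:

    // the same checks, minus the ceremony
    return something() && somethingElse()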

generally unless you tell it not to, it does things in very verbose ways that most humans would never do, and since there's an infinite number of ways that it can invent absurd verbosity, it is hard to preemptively prompt against all of them.

to be clear, I am getting a huge amount of value out of it for executing a bunch of large refactors and "modernization" of a (really) big legacy codebase at scale and in parallel. but it's not outputting the sort of code that I see when someone prompts it "build a new feature ...", and a big part of my prompts is screaming at it not to do certain things or to refuse the task if it at any point becomes unsure.

jumploops 3 hours ago | parent [-]

Yeah, to be clear, it will have the same issues as a fly-by contributor if prompted that way.

Meaning if you ask it “handle this new condition” it will happily throw in a hacky conditional and get the job done.

I’ve found the most success in having it reason about the current architecture (explicitly), and then to propose a set of changes to accomplish the task (2-5 ways), review, and then implement the changes that best suit the scope of the larger system.

dexdal 3 hours ago | parent [-]

The failure mode is missing constraints, not “coding skill”. Treat the model as a generator that must operate inside an explicit workflow: define the invariant boundaries, require a plan/diff before edits, run tests and static checks, and stop when uncertainty appears. That turns “hacky conditional” behaviour into controlled change.
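
A minimal sketch of what that can look like in a project instructions file (the paths and commands here are hypothetical placeholders, not a prescribed format):

    # CLAUDE.md (sketch)
    - Before editing, post a plan: files to touch, expected diff, risks.
    - Do not modify payments/ or migrations/ without asking first.
    - After every change, run the test suite and static checks.
    - If a test fails for reasons unrelated to the task, stop and
      report it; never edit the test to make it pass.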

jumploops 2 hours ago | parent [-]

Yes, exactly.

The LLM is onboarding to your codebase with each context window, all it knows is what it’s seen already.

olig15 7 hours ago | parent | prev [-]

Surely it's because LLM-generated code is part of the model's training data, so the code/patterns it has to work with are closer to what it was trained on.

tunesmith 13 hours ago | parent | prev | next [-]

If you have a ChatGPT account, there's nothing stopping you from installing the Codex CLI and using your ChatGPT account with it. I haven't coded with ChatGPT for weeks. Maybe a month ago I got utility out of coding with Codex and then having ChatGPT look at my open IDE page to give comments, but since 5.2 came out, it's been 100% Codex.
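
If you want to try it, the setup is roughly (assuming the npm package name hasn't changed):

    npm install -g @openai/codex   # install the Codex CLI
    codex                          # sign in with your ChatGPT account on first run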

bluGill 10 hours ago | parent | prev | next [-]

I've been trying Claude on my large code base today. When I give it the requirements I'd give an engineer and say "do it", it just writes garbage that doesn't make sense and doesn't seem to even meet the requirements (if it does, I can't follow how - though I'll admit to giving up before I understood what it did, and I didn't try it on a real system). When I forced it to step back and take tiny steps - in TDD, write one test of the full feature - it did much better, but then I spent the next 5 hours adjusting the code it wrote to meet our coding standards. At least I understand the code, but I'm not sure it's any faster (though it's a lot easier to spot what's wrong than to come up with greenfield code).

Which is to say you have to learn to use the tools. I've only just started, and cannot claim to be an expert. I'll keep using them - in part because everyone is demanding I do - but to use them you clearly need to know how to do it yourself.

simonw 10 hours ago | parent | next [-]

Have you tried showing it a copy of your coding standards?

I also find pointing it to an existing folder full of code that conforms to certain standards can work really well.

bluGill 6 hours ago | parent | next [-]

It has seen at least some of the standards it violated.

bflesch 9 hours ago | parent | prev [-]

Yeah let's share all your IP for the vague promise that it will somehow work ;)

simonw 8 hours ago | parent | next [-]

You just gave me a revelation as to why some people report being unable to get decent results out of coding agents!

CamperBob2 7 hours ago | parent | prev [-]

(Shrug) If you're not willing to make that tradeoff, you'll be outcompeted by people who are. Your call.

rob 9 hours ago | parent | prev | next [-]

I've been playing around with the "Superpowers" [0] plugin in Claude Code on a new small project and really like it. Simple enough to understand quickly by reading the GitHub repo and seems to improve the output quality of my projects.

There's basically a "brainstorm" /slash command that you go back and forth with, and it places what you came up with in docs/plans/YYYY-MM-DD-<topic>-design.md.

Then you can run a "write-plan" /slash command on the docs/plans/YYYY-MM-DD-<topic>-design.md file, and it'll give you a docs/plans/YYYY-MM-DD-<topic>-implementation.md file that you can then feed to the "execute-plan" /slash command, where it breaks everything down into batches, tasks, etc, and actually implements everything (so three /slash commands total.)

There's also "GET SHIT DONE" (GSD) [1] that I want to look at, but at first glance it seems to be a bit more involved than Superpowers with more commands. Maybe it'd be better for larger projects.

[0] https://github.com/obra/superpowers

[1] https://github.com/glittercowboy/get-shit-done

gverrilla 4 hours ago | parent | prev [-]

it's all about the context. observe what files it opened, etc. good luck

epolanski 8 hours ago | parent | prev | next [-]

1. Write good documentation: architecture, how things work, code styling, etc.

2. Put your important dependencies' source code in the same directory. E.g. put a `_vendor` directory in the project, and in it put each dependency's codebase at the same tag you're using: postgres, redis, vue, whatever (see the sketch after this list).

3. Write good plans and requirements. Acceptance criteria, context, user stories, etc. Save them in markdown files. Review those multiple times with LLMs trying to find weaknesses. Then move to implementation files: make it write a detailed plan of what it's gonna change and why, and what it will produce.

4. Write very good prompts. LLMs follow instructions well if they are clear: "you should proactively do X" is a weak instruction if you mean "you must do X".

5. LLMs are far from perfect and full of limits. Karpathy sums up their cons very well in his long list. If you don't know their limits, you'll mismanage expectations, skip them where they'd be a huge boost, and waste time on things they don't cope well with. On top of that, all LLMs differ in their "personality": how well they adhere to instructions, how creative they are, etc.
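
For point 2, a rough sketch of the idea (the repos are real, the tags are illustrative):

    mkdir -p _vendor
    git clone --depth 1 --branch v3.4.21 https://github.com/vuejs/core _vendor/vue
    git clone --depth 1 --branch 7.2.4 https://github.com/redis/redis _vendor/redis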

Okkef 13 hours ago | parent | prev | next [-]

Try Claude Code. It's different.

After you've tried it, come back.

Imustaskforhelp 12 hours ago | parent [-]

I think it's not Claude Code per se but rather the model (Opus 4.5?) running in an agentic workflow.

I tried a website that offered the Opus model in their agentic workflow, and I felt something different too, I guess.

Currently trying out Kimi Code (using their recent Kimi 2.5) - the first AI product I've ever bought - because I got it for like $1.49 per month. It does feel a bit less powerful than Claude Code, but monetarily I feel it's worth it.

Y'know, you have to like bargain with an AI model to reduce its pricing, which I just felt really curious about. The psychology behind it is fascinating: even as a frugal person, I already felt invested enough in the model, and that became my sunk cost fallacy.

A shame for me personally, because they use it as a hook to get people using their tool and then charge $19 the next month (really, cheaper than Claude Code for the most part, but still steep compared to $1.49).

datsci_est_2015 6 hours ago | parent | prev | next [-]

Also, I never see anyone talking about code reviews, which are one of the primary ways software engineering departments manage liability. We fired someone recently because they couldn't explain any of the slop they were trying to get merged. Why tf would I accept the liability of managing code that someone else can't even explain?

I guess this is fine when you don’t have customers or stakeholders that give a shit lol.

redox99 8 hours ago | parent | prev | next [-]

What do you even mean by "ChatGPT"? Copy pasting code into chatgpt.com?

AI-assisted coding has never been like that; it would be atrocious. The typical workflow was using Cursor with some model of your choice (almost always an Anthropic model like Sonnet, before Opus 4.5 released). Nowadays (in addition to IDEs) it's often a CLI tool like Claude Code with Opus, or Codex CLI with GPT-5.2-Codex on high/xhigh.

languid-photic 12 hours ago | parent | prev | next [-]

They build Claude Code fully with Claude Code.

Macha 11 hours ago | parent | next [-]

Which is equal parts praise and damnation. Claude Code does do a lot of nice things that people writing TUIs usually don't bother with given the time cost/reward - things they've probably only done because they're using AI heavily. But equally it has a lot of underbaked edges (like accidentally shadowing the user's shell configuration when it tries to install terminal bindings for shift-enter, even though the terminal it's configuring already sends a distinct shift-enter) and bugs (have you ever noticed it just stop, unfinished?).

simianwords 10 hours ago | parent [-]

I haven't used Claude Code, but come on... it is a production-quality application used seriously by millions.

xyzsparetimexyz 8 hours ago | parent | next [-]

Look up the flickering issue. The program was created by dunces.

gsk22 8 hours ago | parent | prev [-]

If you haven't used it, how can you judge its quality level?

vindex10 8 hours ago | parent | prev [-]

Ah, now I understand why @autocomplete suddenly broke between versions and still isn't fixed )

maxdo 13 hours ago | parent | prev | next [-]

ChatGPT is not made to write code. Get out of the stone age :)

spaceman_2020 13 hours ago | parent | prev [-]

I'm afraid that we're entering a time when the performance difference between the really cutting-edge tools and even three-month-old tools is vast.

If you're using plain vanilla ChatGPT, you're woefully, woefully out of touch. Heck, even plain Claude Code is now outdated.

shj2105 11 hours ago | parent [-]

Why is plain Claude Code outdated? I thought that's what most AI-forward people are using right now. Are Ralph loops the new thing now?

spaceman_2020 9 hours ago | parent [-]

Plain Claude Code doesn’t have enough scaffolding to handle large projects

At a base level, people are “upgrading” their Claude Code with custom skills and subagents - all text files saved in .claude/agents|skills.

You can also use their new tasks primitive to basically run a Ralph-like loop

But at the edges, people are using multiple instances, each handling different aspects in parallel - stuff like Gas Town

Tbf you can still get a lot of mileage out of vanilla Claude Code. But I’ve found that even adding a simple frontend design skill improves the output substantially

duckmysick 7 hours ago | parent [-]

Is there anywhere we can learn more about creating your own agents/skills? Maybe some decent public repos that you could recommend.

spaceman_2020 2 hours ago | parent [-]

You can just ask Claude to create them. They’re just markdown files

Anthropic’s own repo is as good place as any

https://github.com/anthropics/skills
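
For a flavor of the format, a skill is basically a folder with a SKILL.md in it - something like this (a sketch: the frontmatter fields follow that repo's convention, while the body and file path are made up):

    ---
    name: frontend-design
    description: Use when building or restyling UI components
    ---

    Prefer the existing design tokens in src/styles/tokens.css.
    Keep components small, and summarize any visual change
    before applying it.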