rcarmo 2 days ago

I had a similar set of experiences with GPT 5.x over the holiday break, across somewhat more disparate domains: https://taoofmac.com/space/notes/2025/12/31/1830

I hacked together a Swift tool to replace a Python automation I had, merged an ARM JIT engine into a 68k emulator, and even got a very decent start on a synth project I’ve been meaning to do for years.

What has become immensely apparent to me is that even gpt-5-mini can create decent Go CLI apps provided you write down a coherent spec and review the code as if it was a peer’s pull request (the VS Code base prompts and tooling steer even dumb models through a pretty decent workflow).

GPT 5.2 and the codex variants are, to me, every bit as good as Opus but without the groveling and emojis - I can ask it to build an entire CI workflow and it does it in pretty much one shot if I give it the steps I want.

So for me at least this model generation is a huge force multiplier (but I’ve always been the type to plan before coding and reason out most of the details before I start, so it might be a matter of method).

heavyset_go 2 days ago | parent | next [-]

To add to the anecdata: today GPT 5.2-whatever hallucinated two CLI utilities into existence, and when corrected, hallucinated plausible but non-existent features/options of CLI utilities that do exist.

I had to dig through source code to confirm whether those features actually existed. They don't, so the CLI tools GPT recommended aren't actually applicable to my use case.

Yesterday, it hallucinated features of WebDAV clients, and then talked up an abandoned, incomplete GitHub project with a dozen stars as if it were the perfect fit for what I was trying to do, when it wasn't.

I only remember these because they're recent and CLI related, given the topic, but there are experiences like this daily across different subjects and domains.

simonw 2 days ago | parent | next [-]

Were you running it inside a coding agent like Codex?

If so then it should have realized its mistake when it tried to run those CLI commands and saw the error message. Then it can try something different instead.
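The core of that loop is simple. A minimal sketch in Python (not Codex's actual implementation, just the shape of it; "some-tool" below is a hypothetical command):

    import subprocess

    def try_command(cmd: list[str]) -> tuple[bool, str]:
        # Run a candidate command and capture all output, so a failure
        # (unknown flag, bad arguments) can be fed back to the model.
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            # A hallucinated tool fails right here; that error is feedback too.
            return False, f"{cmd[0]}: command not found"
        return result.returncode == 0, result.stdout + result.stderr

    ok, output = try_command(["some-tool", "--made-up-flag"])
    if not ok:
        print("failed; an agent would feed this back and try something else:", output)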

If you were using a regular chat interface and expecting it to know everything without an environment to try things out in, then yeah, you're going to be disappointed.

heavyset_go 2 days ago | parent | next [-]

No, Codex doesn't have permission to install random software on my machine and then execute it to see if it's real or a hallucination.

CLI utility here means software with a CLI, not classic Unix-y CLI tools.

The WebDAV hallucinations happened in the chat interface.

varenc 2 days ago | parent [-]

It's not an all-or-nothing permission. The way I use Claude Code, it has to ask me for permission for every CLI tool use. This seems like a reasonable way to balance security with utility, and it would allow the agent to correct itself when it hallucinates CLI tools. Or just run it in an isolated container where it can't break anything and give it full permissions.
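The gate itself is trivial. A hypothetical sketch of the per-command approval flow (not Claude Code's actual code):

    import subprocess

    def run_with_approval(cmd: list[str]) -> str:
        # Show the exact command and wait for an explicit yes before it
        # touches the machine; a refusal becomes feedback for the agent.
        print("Agent wants to run:", " ".join(cmd))
        if input("Allow? [y/N] ").strip().lower() != "y":
            return "denied by user"
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.stdout + result.stderr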

heavyset_go a day ago | parent [-]

I don't want any LLM tool prompting me to install and run software it makes up on the fly.

Typosquatting is a thing, for example, and I'm sure hallucination squatting will be, too.

I don't want to run anything in a "sandbox", either. Containers are not sandboxes, despite things like the Gemini CLI pretending they are.

hashhar 2 days ago | parent | prev [-]

Codex, for me, behaves like a very junior engineer. Claude is smarter and tries to think long-term.

A great example of their behaviours for a problem that isn't 100% specified in detail (because detail would need iterations) is available at https://gist.github.com/hashhar/b1215035c19a31bbe4b58f44dbb4....

I gave both Codex (GPT5-ExHi) and Claude (Opus 4.5 Thinking) the exact same prompts and the end results were very different.

The most interesting bit was asking both of them to justify why there were differences and then critique each other's code. Claude was so good at this: it took the best parts of GPT's code, fixed a bug there, and ended up with a pretty nice implementation.

The Claude-generated code was much better organised too (less script-like, more program-like).

tezza 2 days ago | parent | prev [-]

Yeah, it needs a steady hand on the tiller. But chain together improvements of 70%, -15%, 95%, 99%, and -7% across all the steps, and overall you're way ahead.
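Reading those figures as compounding multipliers, a quick back-of-the-envelope check in Python:

    # Per-step multipliers from the numbers above: +70%, -15%, +95%, +99%, -7%
    steps = [1.70, 0.85, 1.95, 1.99, 0.93]
    overall = 1.0
    for s in steps:
        overall *= s
    print(f"{overall:.2f}x")  # ~5.21x net, despite two regressions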

SimonW's approach of having a suite of dynamic tools (agents) grind out the hallucinations is a big improvement.

In this case, expressing the feedback/validation steps explicitly and investing in the setup may help smooth these sharp edges.

kgeist 2 days ago | parent | prev | next [-]

I tried generating code with ChatGPT 5.2, but the results weren't that great:

1) It often overcomplicates things for me. After I refactor its code, it's usually half the size and much more readable. It often adds unnecessary checks or mini-features 'just in case' that I don't need.

2) On the other hand, almost every function it produces has at least one bug or ignores at least one instruction. However, if I ask it to review its own code several times, it eventually finds the bugs.

I still find it very useful, just not as a standalone programming agent. My workflow is that ChatGPT gives me a rough blueprint and I iterate on it myself, I find this faster and less error-prone. It's usually most useful in areas where I'm not an expert, such as when I don't remember exact APIs. In areas where I can immediately picture the entire implementation in my head, it's usually faster and more reliable to write the code myself.

rcarmo 2 days ago | parent [-]

Well, like I pointed out somewhere else, VS Code gives it a set of prompts and tools that makes it very effective for me. I see that a lot of people are still copy/pasting stuff instead of having the “integrated” experience, and it makes a real difference.

(Cue the “you’re holding it wrong” meme :))

Kerrick 2 days ago | parent | prev | next [-]

Gemini 3 Pro (High) via Antigravity has been similarly great recently. So have tools that I imagine call out to these higher-power models: Amp and Junie. In a two-week blur I brought forth the bulk of a Ruby library that includes bindings to the Ratatui Rust crate for making TUIs in Ruby. During that time I also brought forth documentation, example applications, build and devops tooling, and significant architectural decisions & roadmaps for the future. It's pretty unbelievable, but it's all there in the git and CI history. https://sr.ht/~kerrick/ratatui_ruby/

I think the following things are true now:

- Vibe Coding is, more than ever, "autopilot" in the aviation sense, not the colloquial sense. You have to watch it, you are responsible, and the human has to run takeoff/landing (the hard parts), but it significantly eases and reduces risk on the bulk of the work.

- The gulf of developer experience between today's frontier tooling and six months ago is huge. I pushed hard to understand and use these tools throughout last year, and spent months discouraged--back to manual coding. Folks need to re-evaluate by trying premium tools, not free ones.

- Tooling makers have figured out a lot of neat hacks to work around the limitations of LLMs and make them seem even better than they are. Junie integrates with your IDE; Antigravity has multiple agents maintaining background intel on your project and priorities across chats. Antigravity also compresses contexts and starts new ones without you realizing it, calls out to sub-agents to avoid context pollution, and uses other tricks to auto-manage context.

- Unix tools (sed, grep, awk, etc.) and the git CLI (ls-tree, show, --stat, etc.) have been a huge force-multiplier, as they keep the context small compared to raw ingestion of an entire file, allowing the LLMs to get more work done in a smaller context window (see the sketch after this list).

- The people who hire programmers are still not capable of Vibe Coding production-quality web apps, even with all these improvements. In fact, I believe today this is less of a risk than I feared 10 months ago. These are advanced tools that need constant steering, and a good eye for architecture, design, developer experience, test quality, etc. is the difference between my vibe coded Ruby [0] (which I heavily stewarded) and my vibe coded Rust [1] (I don't even know what borrow means).

[0]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/lib

[1]: https://git.sr.ht/~kerrick/ratatui_ruby/tree/stable/item/ext...
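
On the Unix-tools point above, the win is that narrow queries return a few lines instead of whole files. A sketch of the kind of calls an agent makes (standard git subcommands, wrapped in Python):

    import subprocess

    def git(*args: str) -> str:
        # Narrow, cheap queries keep the agent's context window small.
        return subprocess.run(["git", *args], capture_output=True, text=True).stdout

    print(git("ls-tree", "-r", "--name-only", "HEAD"))  # file list, no contents
    print(git("show", "--stat", "HEAD"))                # change summary, no diffs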

chaostheory 2 days ago | parent [-]

Were they able to link Antigravity to your paid subscription? I have a Google AI Ultra sub and Antigravity ran out of credits within 30 minutes for me. Of course, that was a few weeks ago, and I’m hoping they've fixed this.

Kerrick 2 days ago | parent [-]

Yes. I was on a 30-day trial of Google AI Pro, and I got a few big wins each out of Gemini 3 Pro (High) and Claude 4.5 Opus (Thinking) before my quota ran out. Then I'd cycle through Gemini 3 Flash and Amp Free (or paid Junie credits if I got antsy) until my quota reset.

You can see this pattern in my AI attribution commit footers. It was such a noticeable difference to me that I signed up for Google AI Ultra. I got the email receipt January 3, 2026 at 11:21 AM Central, and I have not hit a single quota limit since.

indigodaddy 2 days ago | parent [-]

Limits reset every 5 hours on the Pro plan, right?

IgorPartola 2 days ago | parent | prev [-]

The thing is that CLI utility code is probably easier for an LLM to write than most other things. In my experience, an LLM does best with backend and terminal things. Anything that resembles boilerplate is great. It does well refactoring unit tests, wrapping known code in a CLI, and does decent work with backend RESTful APIs. Where it fails utterly is things like HTML/CSS layout, JavaScript frontend code for SPAs, and particularly real-world UI work that requires seeing and interacting with a web page/app, where things like network latency and errors, browser UI, etc. can trip it up. Basically, when the input and output are structured and known, an LLM will do well. When they are “look and feel”, it fails and fails until the code becomes unmaintainable.

This experience is current for me, but I do not normally use Opus, so perhaps I should give it a try and figure out whether it can reason around problems I myself do not foresee (for example, a browser JS API quirk I had never seen).

simonw 2 days ago | parent | next [-]

I've been having a surprising amount of success recently telling Claude Code to test the frontend it's building using Playwright, including interacting with the UI and having it take its own screenshots to feed into its vision ability to "see" what's going on.
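The pattern is roughly the following, using Playwright's Python API (the URL, selector, and filename are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("http://localhost:3000")        # the dev server under test
        page.click("text=Submit")                 # interact like a user would
        page.screenshot(path="after-submit.png")  # screenshot fed back to the model
        browser.close()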

throwup238 2 days ago | parent | next [-]

That works well with Qt and desktop apps as well. Asking Claude Code to write an MCP server integrated into a desktop app, implementing the same features as Playwright, is a half-hour exercise.
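A sketch of what that server might look like, assuming the official Python MCP SDK (FastMCP); the tool bodies are placeholders for whatever the desktop framework exposes:

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("desktop-driver")

    @mcp.tool()
    def screenshot(path: str) -> str:
        """Capture the app window so the agent can 'see' its own changes."""
        # Placeholder: a real Qt app would call QWidget.grab().save(path) here.
        return f"saved window capture to {path}"

    @mcp.tool()
    def click(x: int, y: int) -> str:
        """Synthesize a click at window coordinates, Playwright-style."""
        # Placeholder: real code would post a mouse event into the app's event loop.
        return f"clicked ({x}, {y})"

    if __name__ == "__main__":
        mcp.run()  # serves over stdio so the coding agent can connect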

johnfn 2 days ago | parent | prev [-]

It's kind of funny that we posted basically the exact comment at the same time, down to quoting "see"!

smoe 2 days ago | parent | prev | next [-]

In my experience with a combo of Claude Code and Gemini Pro (and having added Codex to the mix about a week ago as well), it matters less whether it’s CLI, backend, frontend, DB queries, etc. but more how cookiecutter the thing you’re building is. For building CRUD views or common web application flows, it crushes it, especially if you can point it to a folder and just tell it to do more of the same, adapted to a new use case.

But yes, the more specific you get and the more moving pieces you have, the more you need to break things down into baby steps: not just getting it to make A work, but to make A work together with B and C. This is especially true given how eager Claude is to find cheap workarounds and escape hatches, botching things together in whatever way it can, seemingly to please the prompter as fast as possible.

rcarmo 2 days ago | parent | prev | next [-]

Since one of my holiday projects was completely rebuilding the Node-RED dashboard in Preact, I have to challenge that a bit. How were you using the model?

johnfn 2 days ago | parent | prev [-]

I couldn't disagree more. I've had Claude absolutely demolish large HTML/CSS/JS/React projects. One key is to give it some way to "see" and interact with the page. I usually use Playwright for this. Allowing it to see its own changes and iterate on them was the key unlock for me.