jdw64 4 hours ago

I think GPT writes code the best. How well will it write in version 5.6? It gives me chills.

Recently, I went head-to-head with GPT on nearly 2,000 lines of code, and GPT's solution was superior and faster. I even referenced multiple codebases on GitHub while trying, but they were incomparable to GPT.

So using GPT brings both fear and excitement.

The fear comes from realizing that this level of code is now the average for most people. The excitement comes from knowing that I can now study and learn at this level too.

I'm really looking forward to seeing how much more advanced the code will be with the upgrade to 5.6.

Topfi 2 hours ago | parent | next [-]

Purely subjective, but I tend to prefer reading Opus 4.8 output over GPT 5.5 code, even when the latter can have a higher overall ceiling. The former is just a bit more convenient to review.

seviu 4 hours ago | parent | prev | next [-]

I am in the opposite camp. Open models are starting to perform better. GPT 5.5 keeps on messing things up.

By contrast, pi + GLM + DeepSeek… bliss.

Fable was a different kind of beast though. Rip.

square_usual 2 hours ago | parent | next [-]

Every time I use Opus these days I go "shut up... you are not Fable." Hard to imagine how just three days with it changed how I saw LLM use.

ftkftk an hour ago | parent [-]

Same.

baq 4 hours ago | parent | prev | next [-]

Yeah, Opus/GPT need multiple rounds of reviews from each other to get to clean auto review. Fable was like, it is done and indeed… crickets in bot comments. ‘No issues’ galore.

aaroninsf an hour ago | parent [-]

I wonder if this will hold as other models with different biases achieve parity.

arizen 4 hours ago | parent | prev | next [-]

Ditto on GLM 5.2 + DeepSeek V4 Flash combo.

For most important work (complex, cross-domain inquiries etc.), I still rely on Codex GPT 5.5 though.

whalesalad 3 hours ago | parent | prev | next [-]

GPT-5.5 has been really hard to beat imho. I've spent $$$ on Opus, Deepseek v4 Pro and recently started to dogfood GLM-5.2 (which is not bad) but I cannot really trust any of them (almost blind) like I can trust GPT-5.5. It gives me tremendous confidence. I cannot say the same for any of the others I mentioned.

baddash an hour ago | parent | prev | next [-]

how much does your setup cost you? just curious

enraged_camel 4 hours ago | parent | prev [-]

>> I am in the opposite camp. Open models are starting to perform better. GPT 5.5 keeps on messing things up.

I'm working in a 600k+ LoC codebase that has complex domain-specific logic and lots of moving parts. I find that Codex 5.5 is pretty good at surgical fixes, but does not go out of its way to explore and figure out what those surgical fixes might break. So I only use it to work on parts of the system that are pretty isolated from everything else so that risk of regression is small.

HarHarVeryFunny 4 hours ago | parent | prev | next [-]

I'm skeptical about how much of a coding advance it will be.

Seems odd that their announcement has zero coding benchmarks, with the closest related thing being terminal bench.

hereme888 3 hours ago | parent | next [-]

Tracking model performance on Artificial Analysis makes me think these models are constantly optimized/tuned in some way or another. GPT 5.5 was scoring in the mid-60s when it was first released; now it's almost 10 points higher.

jdw64 4 hours ago | parent | prev | next [-]

Maybe I'll know once I try it? Honestly, for small functions or methods, I don't think there's a huge difference between models. But the larger the code gets, the more noticeable the difference seems to be.

Personally, I think this kind of coding experience varies from person to person.

vanuatu 4 hours ago | parent | prev | next [-]

sadly with all the labs benchmaxxing I feel like you just have to try the model for a while to really evaluate how good it is, especially for each individual use case

MangoCoffee 2 hours ago | parent | prev | next [-]

>zero coding benchmarks

"What gets measured gets managed"

artursapek 4 hours ago | parent | prev [-]

They claim extreme performance on ExploitBench, which Mythos was touted as being incredible at. https://x.com/OpenAI/status/2070555278576439306

HarHarVeryFunny 2 hours ago | parent | next [-]

My guess is that it's the same base model as 5.5, but with additional post-training to improve and benchmaxx on a few things like that.

If they really thought it was competitive with Mythos/Fable across the board, then why wouldn't they release a broader set of benchmarks, and why price it day 1 at 1/2 the cost of Fable?

andriy_koval 3 hours ago | parent | prev [-]

On the graph, they are still slightly below Mythos. Maybe enough to not be prohibited by the US government?

8bitsout 3 hours ago | parent | prev | next [-]

Is it possible for you to provide examples? What were you trying to solve? What was your solution and why was GPT's solution superior and faster?

ignoramous 2 hours ago | parent [-]

> ... why was GPT's solution superior and faster?

Not saying that's the case with OP, but I've found folks sometimes just rationalize it so [0] as they're paying top dollar for it (especially when compared to maybe less capable but affordable models).

[0] https://en.wikipedia.org/wiki/Choice-supportive_bias

stagger87 4 hours ago | parent | prev | next [-]

> I even referenced multiple code bases on GitHub

Well, GPT referenced every GitHub code base, no wonder it won! :)

pawelduda 4 hours ago | parent | prev | next [-]

How do you judge what is a good or bad thing to learn from an LLM, so you don't have to unlearn the bad bits later?

jdw64 4 hours ago | parent | next [-]

When I searched for papers on using LLMs, I found that typically, you can have an LLM generate code and then ask it to find GitHub projects similar to that code. Then you can learn by looking at the pull requests and seeing how they structure things.

In the old days, if I wanted to understand why memory offsets, padding techniques, or data layout structures were written a certain way, I had to stare at a senior programmer's code all day or wait for them to reply. But LLMs, while they do flatter me, explain things at a level I can actually understand. And LLMs don't get annoyed.

jdw64 4 hours ago | parent | prev [-]

There's a lot of tacit knowledge in programming.

- Why do you cut API boundaries this way?
- Why do you change the order of struct fields?
- Why do you deliberately insert padding?

Most of it depends on the background and context. Sometimes you add it, sometimes you don't. To understand this tacit knowledge, you need access to senior developers. But their attitude often depends on how promising the student is and what background they come from. With an LLM, you don't have to rely on the respondent's mood, authority, or availability.
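The struct questions above (field order, deliberate padding) can be made concrete with a small sketch. It uses Python's `ctypes` to mirror a C struct layout. The struct and field names (`Padded`, `Reordered`, `flag`, `count`, `tag`) are hypothetical and not taken from any codebase mentioned in this thread, and the exact sizes depend on the platform ABI.

```python
import ctypes


class Padded(ctypes.Structure):
    # A 1-byte field, then a 4-byte field, then another 1-byte field.
    # The compiler-style layout usually inserts padding before `count`
    # and after `tag` so that `count` stays 4-byte aligned.
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("count", ctypes.c_uint32),
        ("tag", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # Same fields, largest first, so the two small fields share
    # what would otherwise be padding.
    _fields_ = [
        ("count", ctypes.c_uint32),
        ("flag", ctypes.c_uint8),
        ("tag", ctypes.c_uint8),
    ]


def layout(struct):
    """Return (name, offset, size) for each field of a ctypes struct."""
    return [
        (name, getattr(struct, name).offset, getattr(struct, name).size)
        for name, _ in struct._fields_
    ]


for struct in (Padded, Reordered):
    print(struct.__name__, ctypes.sizeof(struct), layout(struct))
```

On common platforms the reordered struct comes out smaller. That is one reason field order gets changed, though whether it matters depends on how many instances exist and on cache behavior.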

Programming is fundamentally a field that requires seniors. In my case, I had no such seniors at all. I learned to code by buying codebases from failed companies and studying them. My first job didn't hire me as an employee—they hired me as the CEO of a subcontracting company (because that was structurally more advantageous for the contract). So I wasn't given the patience to learn programming fundamentals gradually. I had to pay penalties if I failed. Most of the projects I worked on were the kind where failure meant bankruptcy for me. Naturally, there was no one to teach me.

Most of my knowledge comes from reverse-engineering the code I purchased.

People say LLM code contains falsehoods, but commercially sold code has always had falsehoods too. Honestly, if we're just talking ratios, LLM code has fewer falsehoods.

In that sense, I still think it's a matter of context. If LLM code is false, was human code ever really true? LLMs do lie. They generate plenty of incorrect code. But humans do the same thing. If a problem comes up, you just look it up then and there. For me, LLMs and humans aren't all that different.

hereme888 3 hours ago | parent [-]

What do you think of modern open-source codebases presently available to the public? Is closed-source/proprietary code that much better?

Razengan 2 hours ago | parent | prev | next [-]

Codex 5.4/5.5 has been great for me as well compared to Claude Opus.

I've been mostly using it for Godot/GDScript code reviews, rubber-ducking, and asking it for better ideas for naming stuff (one of the hardest problems in programming).

I still can't trust it for generating code for entire files/classes/projects, because it's still icky, creating unnecessary variables and functions, using multiple `if`s instead of `and` or `or`, but it's good enough for generating Mac/iOS apps for my personal use in SwiftUI because fuck trying to keep up with Apple's documentation, or even migrating ancient Visual Basic stuff I made as a kid up to SwiftUI :)
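A rough illustration of the "multiple `if`s instead of `and` or `or`" complaint, written in Python as a stand-in for GDScript. The function and parameter names are hypothetical; this is only a sketch of the pattern, not actual model output.

```python
from itertools import product


def can_attack_nested(has_weapon, in_range, is_stunned):
    # The style being complained about: one `if` per condition.
    if has_weapon:
        if in_range:
            if not is_stunned:
                return True
    return False


def can_attack_combined(has_weapon, in_range, is_stunned):
    # The same logic as a single boolean expression.
    return has_weapon and in_range and not is_stunned


# Both versions agree on every combination of inputs.
same = all(
    can_attack_nested(*args) == can_attack_combined(*args)
    for args in product([True, False], repeat=3)
)
print(same)
```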

> So using GPT brings both fear and excitement.

Only excitement for me. I've never been more productive, not because I ask AI to make something for me, but because it helps me make what I was already going to, only better and quicker.

AI, like any other tool, could help smart people be smarter and dumb people be dumber, kinda like Tolkien's Ring: You could be Sauron or you could be Bilbo or Frodo, or you could be Gollum :)

fatata123 3 hours ago | parent | prev [-]

No offense but have you considered the strong possibility that you’re just not good at what you do? I am occasionally pleased but mostly annoyed or disappointed… but never getting anything close to chills. That sounds downright weird.

adamtaylor_13 an hour ago | parent [-]

No offense but have you considered the strong possibility that you're just holding it wrong? You're entitled to your opinion, but OP is hardly the first person to say something like this and is surrounded by tons of folks saying the exact same thing. Just because it sounds weird to you, doesn't mean it's not true.

applfanboysbgon an hour ago | parent [-]

By definition, 50% of developers are below average, so there are indeed "tons of folks" who are not very good at what they do.

salutis an hour ago | parent [-]

That is not how averages work. By definition of mean, perhaps.

Xenoamorphous 37 minutes ago | parent | next [-]

Indeed. Most people have more arms than average, which must be 1.9 something.

cl3misch 24 minutes ago | parent | prev [-]

That is how a median is defined, not the mean.
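The arms example can be checked with a quick sketch. The population below (99 people with two arms, one with one arm) is made up purely for illustration.

```python
from statistics import mean, median

# Hypothetical population: 99 people with two arms, 1 person with one arm.
arms = [2] * 99 + [1]

avg = mean(arms)    # pulled below 2 by the single outlier
mid = median(arms)  # middle value of the sorted data

above_mean = sum(1 for a in arms if a > avg)
below_median = sum(1 for a in arms if a < mid)

print(f"mean={avg}, median={mid}")
print(f"{above_mean} of {len(arms)} are above the mean")
print(f"{below_median} of {len(arms)} are below the median")
```

With skewed data, nearly everyone can sit above the mean. The median only guarantees that at most half of the values fall strictly below it.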