pron 2 hours ago

> where are we seeing that it failed?

Anthropic said the experiment failed to produce a workable C compiler:

- I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

- The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler.

(source: https://www.anthropic.com/engineering/building-c-compiler)

Software that cannot be evolved is dead software. That in some PR communications they misrepresented their own engineer's report is beside the point.

> It compiled multiple projects successfully albeit less optimized.

150,000x slower (https://github.com/harshavmb/compare-claude-compiler) is not "less optimised". It's unworkable.

> Like I think people don't realize not even 7 months ago it wasn't writing this at all.

There's no doubt that producing a C compiler that still compiles some programs, even though it isn't workable and is effectively bricked because it cannot be evolved, is great progress, but it's still a long way from autonomously building production software. Can today's LLMs do amazing things and offer tremendous help in software development? Absolutely. Can they write production software without careful and close human supervision? Not yet. That's not disparagement, just an observation of where we are today.

ianbutler 2 hours ago | parent [-]

> Can they write production software without careful and close human supervision? Not yet. That's not disparagement, just an observation of where we are today.

I never claimed they could! I just view this as a successful experiment. I don't think Anthropic was making that claim with their experiment either.

It feels reflexive to the moment to argue against that claim, but I tend to operate with a bit more nuance than "all good" or "all bad".

pron an hour ago | parent | next [-]

The experiment failed to produce a workable C compiler despite (1) the job not being particularly hard and (2) the available specs and tests being of far higher quality than those for almost any other software, not to mention the availability of other implementations that the model was trained on.

You can call that a success (as it did something impressive even though it failed to produce a workable C compiler), but my point in bringing this up was to show that today's models are not yet able to produce production software without close supervision, even when uncharacteristically good specs and hand-written tests exist.
