NitpickLawyer 3 hours ago

This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:

> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis

> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.

> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.

And the very open points about limitations (and hacks, as cc loves hacks):

> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

> It does not have its own assembler and linker;

> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

Ending with a very down to earth take:

> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case. As the author says, "The resulting compiler has nearly reached the limits of Opus’s abilities". Yeah, that's fair, but still highly impressive IMO.

geraneum 3 hours ago | parent | next [-]

> This was a clean-room implementation

This is really pushing it, considering it’s trained on… internet, with all available c compilers. The work is already impressive enough, no need for such misleading statements.

raincole an hour ago | parent | next [-]

It's not a clean-room implementation, but not because it's trained on the internet.

It's not a clean-room implementation because of this:

> The fix was to use GCC as an online known-good compiler oracle to compare against

Calavar 12 minutes ago | parent | next [-]

By the classical definition of a clean room implementation, it's something that's made by looking at the output but not at the source.

I agree that having a reference compiler available is a huge caveat though. They're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people are going to want to tackle with agentic coding are not going to look like that.
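
For readers unfamiliar with the "known-good compiler oracle" setup being debated here: the general technique is usually called differential testing. Below is a minimal sketch of the idea, not the harness the project actually used. The names `differential_test`, `reference`, and `candidate` are hypothetical, and the two toy runners stand in for "compile with GCC and run" and "compile with the new compiler and run" so the example works without any compiler installed.

```python
from typing import Callable, Iterable, List, Tuple

# A "runner" takes a test program (as source text) and returns whatever
# observable behaviour we want to compare, e.g. (exit_code, stdout).
Runner = Callable[[str], Tuple[int, str]]


def differential_test(
    reference: Runner,
    candidate: Runner,
    programs: Iterable[str],
) -> List[str]:
    """Return the programs on which candidate disagrees with reference."""
    mismatches = []
    for source in programs:
        if reference(source) != candidate(source):
            mismatches.append(source)
    return mismatches


# Toy stand-ins (assumptions for illustration only): each "program" is
# just an integer expression, and "running" it means evaluating it.
def toy_reference(source: str) -> Tuple[int, str]:
    return (0, str(eval(source, {"__builtins__": {}})))


def toy_candidate(source: str) -> Tuple[int, str]:
    # Deliberately buggy: treats subtraction as addition.
    return (0, str(eval(source.replace("-", "+"), {"__builtins__": {}})))


if __name__ == "__main__":
    failing = differential_test(toy_reference, toy_candidate, ["1+2", "5-3", "2*4"])
    print(failing)
```

In a real setup each runner would shell out to a compiler, execute the resulting binary, and capture its exit code and output. The point Calavar makes still holds: this only works because a trusted reference already exists.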

35 minutes ago | parent | prev [-]
[deleted]
3 hours ago | parent | prev | next [-]
[deleted]
antirez 3 hours ago | parent | prev | next [-]

The LLM does not contain a verbatim copy of whatever it saw during the pre-training stage. It may remember certain over-represented parts; otherwise it has knowledge about a huge number of topics, similar to the way you remember things you know very well. And, indeed, if you give it access to the internet or the source code of GCC and other compilers, it will implement such a project N times faster.

halxc 3 hours ago | parent | next [-]

We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.

It is a research topic for heaven's sake:

https://arxiv.org/abs/2504.16046

RyanCavanaugh 3 hours ago | parent | next [-]

The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.

philipportner an hour ago | parent | next [-]

This seems related, it may not be a codebase but they are able to extract "near" verbatim books out of Claude Sonnet.

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

Aurornis 5 minutes ago | parent [-]

Their technique really stretched the definition of extracting text from the LLM.

They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.

You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.

seba_dos1 18 minutes ago | parent | prev | next [-]

> The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

The lesson here is that the Internet compresses pretty well.

mft_ 36 minutes ago | parent | prev [-]

(I'm not needlessly nitpicking, as I think it matters for this discussion)

A frontier model (e.g. latest Gemini, Gpt) is likely several-to-many times larger than 500GB. Even Deepseek v3 was around 700GB.

But your overall point still stands, regardless.

Aurornis 11 minutes ago | parent | prev | next [-]

Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.

Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.

ben_w 2 hours ago | parent | prev | next [-]

We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. An e.g. 1 trillion parameter model is not a lossless copy of a ten-petabyte slice of plain text from the internet.

The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"

tza54j 2 hours ago | parent | next [-]

We are here in a clean room implementation thread, and verbatim copies of entire works are irrelevant to that topic.

It is enough to have read even parts of a work for something to be considered a derivative.

I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

It does not help that certain people in this thread (not you) edit their comments to backpedal and make the followup comments look illogical, but that is in line with their sleazy post-LLM behavior.

ben_w an hour ago | parent [-]

> It is enough to have read even parts of a work for something to be considered a derivative.

For IP rights, I'll buy that. Not as important when the question is capabilities.

> I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

For similar reasons, I'm not going to argue against anyone saying that all machine learning today doesn't count as "intelligent":

It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.

ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.

antirez 2 hours ago | parent | prev | next [-]

Besides, the fact that an LLM may recall parts of certain documents, like I can recall the incipits of certain novels, does not mean that when you ask the LLM to do another kind of work, one that is not recalling stuff, it will mix in such things verbatim. The LLM knows what it is doing in a variety of contexts, and uses that knowledge to produce stuff. The fact that many people find it bitter that LLMs can do things that replace humans does not mean (and it is not true) that this happens mainly through memorization. What coding agents can do today cannot be explained by memorization of verbatim stuff. So it's not a matter of copyright. Certain folks are fighting the wrong battle.

shakna 38 minutes ago | parent [-]

During a "clean room" implementation, the implementor is generally selected for not being familiar with the workings of what they're implementing, and banned from researching using it.

Because it _has_ been enough that you can recall things for your implementation to end up not being "clean room", and trashed by the lawyers who get involved.

I mean... It's in the name.

> The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.

If it can recall... Then it is not a clean room implementation. Fin.

philipportner an hour ago | parent | prev | next [-]

Granted, these are some of the most widely spread texts, but just fyi:

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

ben_w an hour ago | parent [-]

Already aware of that work, that's why I phrased it the way I did :)

Edit: actually, no, I take that back, that's just very similar to some other research I was familiar with.

boroboro4 2 hours ago | parent | prev [-]

While I mostly agree with you, it's worth noting that modern LLMs are trained on 10-20-30T tokens, which is quite comparable to their size (especially given how compressible the data is).
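
The size comparison in this subthread is easy to run yourself. Here is a back-of-the-envelope sketch. The 2 bytes per parameter (16-bit weights) and roughly 4 bytes of text per token are assumptions, not figures from the thread, and the 1-trillion-parameter model is the hypothetical one ben_w mentions above.

```python
TB = 10**12  # bytes in a decimal terabyte


def weights_bytes(params: float, bytes_per_param: float = 2.0) -> float:
    """Size of the model weights; 2 bytes/param assumes 16-bit storage."""
    return params * bytes_per_param


def corpus_bytes(tokens: float, bytes_per_token: float = 4.0) -> float:
    """Rough size of the training text; ~4 bytes/token is an assumption."""
    return tokens * bytes_per_token


if __name__ == "__main__":
    model = weights_bytes(1e12)  # hypothetical 1T-parameter model
    for tokens in (10e12, 20e12, 30e12):
        text = corpus_bytes(tokens)
        print(
            f"{tokens / 1e12:.0f}T tokens ~ {text / TB:.0f} TB of text, "
            f"{text / model:.0f}x the {model / TB:.0f} TB of weights"
        )
```

Under these assumptions the training text comes out tens of times larger than the weights, not billions of times. Changing either assumption moves the ratio proportionally, so treat the result as an order-of-magnitude estimate only.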

soulofmischief 2 hours ago | parent | prev [-]

The point is that it's a probabilistic knowledge manifold, not a database.

PunchyHamster 2 hours ago | parent [-]

we all know that.

PunchyHamster 2 hours ago | parent | prev [-]

So it will copy most code while adding subtle bugs

3 hours ago | parent | prev | next [-]
[deleted]
inchargeoncall 2 hours ago | parent | prev [-]

[flagged]

teaearlgraycold an hour ago | parent [-]

With just a few thousand dollars of API credits you too can inefficiently download a lossy copy of a C compiler!

modeless 3 hours ago | parent | prev | next [-]

There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.

zamadatix an hour ago | parent | next [-]

The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more linear lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.

nozzlegear 2 hours ago | parent | prev | next [-]

Every S-curve looks like an exponential until you hit the bend.

NitpickLawyer an hour ago | parent | next [-]

We've been hearing this for 3 years now. And especially 25 was full of "they've hit a wall, no more data, running out of data, plateau this, saturated that". And yet, here we are. Models keep on getting better, at more broad tasks, and more useful by the month.

nozzlegear an hour ago | parent | next [-]

> We've been hearing this for 3 years now

Not from me you haven't!

> "they've hit a wall, no more data, running out of data, plateau this, saturated that"

Everyone thought Moore's Law was infallible too, right until they hit that bend. What hubris to think these AI models are different!

But you've probably been hearing that for 3 years too (though not from me).

> Models keep on getting better, at more broad tasks, and more useful by the month.

If you say so, I'll take your word for it.

Cyphase 44 minutes ago | parent | next [-]

25 is 2025.

nozzlegear 25 minutes ago | parent [-]

Oh my bad, the way it was worded made me read it as the name of somebody's model or something.

torginus 43 minutes ago | parent | prev [-]

Except that with Moore's law, everyone knew decades ahead what the limits of Dennard scaling were (shrinking geometry through smaller optical feature sizes), and roughly when we would get to the limit.

Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

nozzlegear 23 minutes ago | parent [-]

> Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

Idk, that sounds remarkably similar to these AI models to me.

fmbb 33 minutes ago | parent | prev [-]

> And yet, here we are.

I dunno. To me it doesn’t even look exponential any more. We are at most on the straight part of the incline.

bopbopbop7 27 minutes ago | parent [-]

People are confusing exponential improvement with the exponential pre-IPO marketing budget increase at Anthropic and OpenAI.

raincole an hour ago | parent | prev [-]

This quote would be more impactful if people hadn't been repeating it since the GPT-4 days.

kimixa 39 minutes ago | parent | next [-]

People have also been saying we'd be seeing the results of 100x quality improvements in software, with a corresponding decrease in cost, since the GPT-4 days.

So where is that?

nozzlegear an hour ago | parent | prev [-]

I agree, I have been informed that people have been repeating it for three years. Sadly I'm not involved in the AI hype bubble so I wasn't aware. What an embarrassing faux pas!

chasd00 an hour ago | parent | prev [-]

I have to admit, even if model and tooling progress stopped dead today, the world of software development has forever changed and will never go back.

gmueckl 3 hours ago | parent | prev | next [-]

The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.

Prove this statement wrong.

libraryofbabel 2 hours ago | parent | next [-]

Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.

NitpickLawyer 3 hours ago | parent | prev | next [-]

> Prove this statement wrong.

If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.

shakna 43 minutes ago | parent | next [-]

Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do, because most are trained on that quite small project: it's 20kb.

But reimplementing that isn't impressive, because it's not a clean room implementation if you trained on that data to make the model that regurgitates the effort.

gmueckl an hour ago | parent | prev | next [-]

This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.

hn_acc1 an hour ago | parent | prev | next [-]

Are you really asking for "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?

geraneum 3 hours ago | parent | prev [-]

Perhaps 4.5 could also do it? We don’t know really until we try. I don’t trust the marketing material as much. The fact that the previous version (smaller versions) couldn’t or could do it does not really disprove that claim.

Marha01 3 hours ago | parent | prev | next [-]

Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.

jesse__ 2 hours ago | parent | next [-]

This sounds very wrong to me.

Take the C4 training dataset for example. The uncompressed, uncleaned, size of the dataset is ~6TB, and contains an exhaustive English language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.

I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.

FeepingCreature an hour ago | parent [-]

This would imply that the English internet is not much bigger than 20x the English Wikipedia.

That seems implausible.

kgeist an hour ago | parent | prev | next [-]

A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

FeepingCreature an hour ago | parent [-]

I would be extremely surprised if it was that small.

gmueckl an hour ago | parent | prev [-]

This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.

0xCMP an hour ago | parent | prev | next [-]

I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.

If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.

Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.

hn_acc1 an hour ago | parent [-]

The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.

If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases, and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc?

brutalc 3 hours ago | parent | prev [-]

No one needs to prove you wrong. That’s just personal insecurity trying to justify one's own worth.

linuxtorvals 3 hours ago | parent [-]

[flagged]

panzi an hour ago | parent | prev | next [-]

> clean-room implementation

Except it's trained on all the source out there, so I assume on GCC and Clang. I wonder how similar the code is to either.

dyauspitr an hour ago | parent | prev [-]

> Claude did not have internet access at any point during its development

Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.

simonw 27 minutes ago | parent [-]

It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".