| ▲ | scosman 7 hours ago |
| Sounds like they didn’t build a proper clean-room setup: the agent writing the code could see the original code. Question: if they had built one using AI teams in both “rooms”, one writing a spec and the other implementing it, would that be fine? You’d need to verify that the spec doesn’t include source code, but that’s easy enough. It seems to mostly follow the IBM-era precedent. However, since the model probably had the original code in its training data, maybe not? Maybe valid for a closed-source project but not open source? Interesting question. |
|
| ▲ | swiftcoder 7 hours ago | parent | next [-] |
| > Sounds like they didn’t build a proper clean room setup: the agent writing the code could see the original code. It doesn't matter how they structure the agents. Since chardet is in the LLM's training set, you can't claim any AI implementation thereof is clean room. |
| |
▲ | scosman 7 hours ago | parent | next [-] | | Yeah, I mention that in the question. Might still be valid for closed-source projects (probably is); I think courts would need to weigh in on the open-source side. There’s legal precedent that you can use a derived work to generate a new, unique work (the spec derived from the copyrighted code is very much a derived work). There are rulings that LLMs are transformative works, not just copies of training data, and LLMs can’t reproduce their entire training set. But this thinking is also ripe for misuse: I could always train or fine-tune a model on the original work so that it can reproduce the original. We quickly get into statistical arguments here. It’s a really interesting question. | | |
▲ | jacquesm 5 hours ago | parent | next [-] | | I just wrote a long comment about that, but yes, you are on to something here. The key to me is that the LLM itself is a derived work and that by definition it cannot produce something original. Which in turn would make profiting off such a derived work, created by an automated process from copyrighted works, a case of wholesale copyright infringement. If you can get a judge to agree on that, I predict the price of RAM will come down again. | |
| ▲ | swiftcoder 4 hours ago | parent | prev [-] | | > There’s legal precedent is that you can use a derived work to generate a new unique work (the spec derived for the copyrighted code is very much a derived work) Indeed, but in the clean room scenario, the party who implements the spec has to be a separate entity that has never seen the code. Whether or not the LLM is copyright infringing is a separate question - it definitely has (at least some) familiarity with the code in question, which makes the "clean room" argument an uphill battle |
| |
| ▲ | bsza 6 hours ago | parent | prev [-] | | So by that logic, you're not legally allowed to implement your own character detector and license it as your own if you've ever looked at chardet's source code? I'm confused. I thought copyright laws protect intellectual property as-is, not the impression it leaves on someone. | | |
| ▲ | jacquesm 5 hours ago | parent | next [-] | | Well, you are not making things easier for yourself by looking at that source code if the author of chardet brings a case for copyright infringement against you. The question is: if you had not looked at chardet's source would you still be able to create your work? If the answer is 'yes' then you probably shouldn't have looked at the source, you just made your defense immeasurably harder. And if the answer is 'no' then you probably should have just used chardet and respected its license. | | |
| ▲ | bsza 4 hours ago | parent [-] | | Sorry, but that sounds like a witch hunt to me, not modern law. Isn't the burden of proof on the accuser? I.e. the accuser has to prove that "this piece of code right here is a direct refactoring of my code, and here are the trivial and mechanical steps to produce one from the other"? And if they present no such evidence, we can all go home? | | |
▲ | nz an hour ago | parent | next [-] | | Not all legal systems put the burden of proof on the accuser. In fact, many legal systems have indefinite detentions, in which the government effectively imprisons a suspect, sometimes for months at a time. To take it a step further, the plea-bargain system of the USA is really just a method to skip the entire legal process. After all, proving guilt is expensive, so why not just strong-arm a suspect into confessing? It also has the benefit of holding someone responsible for an injustice, even if the actual perpetrator cannot be found. By my personal standards, this is a corrupt system, but by the standards of the legal stratum of society, those KPIs look _solid_. By contrast, in Germany (IIRC), false confessions are _illegal_, meaning that objective evidence is required. Many legal systems follow the principle of "innocent until proven guilty", but also have many "escape hatches" that let them side-step the actual process that is supposed to guarantee that ideal principle. EDIT: And that is just modern society. Past societies have had trial by ordeal and trial by combat, neither of which has anything to do with proof and evidence. Many such archaic proof procedures survive in modern legal systems, in a modernized and bureaucratized way. In some sense, modern trials are a test of who has the more expensive attorney (as opposed to who has a more skilled champion or combatant). | |
▲ | jacquesm 4 hours ago | parent | prev | next [-] | | No, the burden of proof is on the defender: if you didn't create it, you are not the copyright holder. Copyright is automatic for a reason: the simple act of creation is technically enough to establish copyright. But that mechanism means that if your claimed creation has an uncanny resemblance to an earlier, published creation, or to an unpublished earlier creation that you had access to, you are going to be in trouble when the real copyright holder comes to call. In short: just don't. Write your own stuff if you plan on passing it off as your own. The accuser just needs to establish precedence. So if you by your lonesome have never listened to the radio and tomorrow morning wake up and 'Billie Jean' springs from your brain, you're going to get sued, even if the MJ estate won't be able to prove how you did it. | | |
▲ | bsza 3 hours ago | parent [-] | | That much I understand, but that question only comes up when the similarity is already an established fact, no? If we take the claim that this is a "complete rewrite" at face value, then there should be no reason for the code to have any uncanny similarities with chardet 6 beyond what is expected given that their functionality (which is not copyrightable) is the same, right? So my (perhaps naive) understanding is that if no such similarity can be found, then the author of chardet 1-6 simply doesn't have a case here, and we don't get to the point of asking "have you been exposed to the code?". | | |
▲ | jacquesm 3 hours ago | parent | next [-] | | No, they're on the record as this being a derived work. There is no argument here at all. Not finding proof in a copyright case when the author is on the record about the infringement is a complete non-issue. You'd have to make that claim absent any proof, and then there had better not be any gross similarities between the two bodies of code that cannot be explained away by coincidence. And then there is such a thing as discovery. I've been party to a case like this and won because of some silly little details (mostly: identical typos), and another that was just a couple of lines of identical JavaScript (with all of the variable names changed). Copyright cases against large entities are much harder to win because they have deeper pockets, but against smaller parties that are clearly infringing it is much easier. When you're talking about documented protocols or interface specifications it is a different thing: those have various exceptions, and those vary from one jurisdiction to another. What can help bolster the case for the defense is, for instance, accurate record keeping: who contributed what parts, sworn depositions by those individuals that they have come up with these parts by their lonesome, a delivery pace matching that which you would expect from that particular employee without any suspicious outliers in terms of amount of code dropped per interval, and so on. Code copied from online sources being properly annotated with a reference to the source also helps, because if you don't do that, it's going to look like you have no problem putting your own copyright on someone else's code. If it is real, then it is fairly easy to document that it is real. If it is not, then after discovery has run its course it is usually fairly easy to prove that it is not. | |
| ▲ | swiftcoder 3 hours ago | parent | prev [-] | | > when the similarity is already an established fact The similarity is an established fact - the authors claim that this is chardet, to the extent that they are even using the chardet name! Had they written a similar tool with a different name, and placed it in its own repo, we might be having a very different discussion. |
|
| |
▲ | pocksuppet an hour ago | parent | prev [-] | | This is a balance-of-probabilities standard of proof. Both sides have the same burden of proof; it's equally split. Whoever has the stronger proof wins. |
|
| |
| ▲ | swiftcoder 5 hours ago | parent | prev [-] | | > if you've ever looked at chardet's source code If you wish to be able to claim in court that it is a "clean room" implementation, yes. Clean room implementations are specifically where a company firewalls the implementing team off from any knowledge of the original implementation, in order to be able to swear in court that their implementation does not make any use of the original code (which they are in such a case likely not licensed to use). |
|
|
|
| ▲ | zozbot234 7 hours ago | parent | prev | next [-] |
This seems right to me. If you ask an LLM to derive a spec that has no expressive element of the original code (a clean-room human team can carefully verify this), and then ask another instance of the LLM (with fresh context) to write out code from the spec, how is that different from a "clean room" rewrite? The agent that writes the new code only ever sees the spec, and by assumption (the assumption that's made in all clean room rewrites) the spec is purely factual, with all copyrightable expression having been distilled out. But the step of deriving the spec (and verifying that it's as clean as possible) is crucial and cannot be skipped! |
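A minimal sketch of that two-room pipeline, purely illustrative: `ask_model` is a hypothetical stand-in for whatever LLM API is used, and the verbatim-overlap gate is only one crude approximation of the human verification step described above (it catches literal copying, not paraphrased expression).

```python
import re

def has_verbatim_overlap(original: str, derived: str, min_tokens: int = 8) -> bool:
    """Return True if any run of min_tokens consecutive tokens from the
    original reappears verbatim in the derived text."""
    tokens = re.findall(r"\S+", original)
    derived_norm = " ".join(re.findall(r"\S+", derived))
    for i in range(len(tokens) - min_tokens + 1):
        if " ".join(tokens[i:i + min_tokens]) in derived_norm:
            return True
    return False

def clean_room_rewrite(original_code: str, ask_model) -> str:
    # Room 1: derive a purely factual spec from the original code.
    spec = ask_model(
        "Describe the observable behaviour of this code as a factual spec. "
        "Do not quote any code.\n\n" + original_code)
    # Gate between the rooms. A human team with access to the original
    # would review here; this automated check only catches literal copying.
    if has_verbatim_overlap(original_code, spec):
        raise ValueError("spec contains verbatim fragments of the original")
    # Room 2: a fresh context that only ever sees the spec.
    return ask_model("Implement this spec:\n\n" + spec)

# Demo with a stub standing in for a real model API:
def _stub(prompt):
    if prompt.startswith("Describe"):
        return "A function that takes two integers and returns their sum."
    return "def add(a, b):\n    return a + b\n"

result = clean_room_rewrite("def plus(x, y):\n    return x + y\n", _stub)
```

The gate is the weak point, as the rest of the thread argues: it enforces "the spec quotes no code", not "the spec carries no information derived from the code."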
| |
| ▲ | sigseg1v 7 hours ago | parent | next [-] | | How would a team verify this for any current model? They would have to observe and control all training data. In practice, any currently available model that is good enough to perform this task likely fails the clean room criteria due to having a copy of the source code of the project it wants to rewrite. At that point it's basically an expensive lossy copy paste. | | |
▲ | zozbot234 7 hours ago | parent [-] | | You can always verify the output. Unless the problem being solved really is exceedingly specific and non-trivial, it's at least unlikely that the AI will rip off recognizable expression from the original work. The work may be part of the training data, but so are many millions of completely unrelated works, so any "family resemblance" would have to be there for very specific reasons about what's being implemented. |
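As a rough first pass, "verifying the output" could be mechanized with a line-level diff against the original; the sketch below (names, example strings, and thresholds are all illustrative) flags near-verbatim reuse, though it says nothing about subtler structural copying.

```python
import difflib

def similarity_ratio(original: str, candidate: str) -> float:
    """Ratio in [0, 1] of matching content between two sources, comparing
    whitespace-stripped, non-empty lines in order."""
    a = [ln.strip() for ln in original.splitlines() if ln.strip()]
    b = [ln.strip() for ln in candidate.splitlines() if ln.strip()]
    return difflib.SequenceMatcher(None, a, b).ratio()

orig = (
    "def detect(buf):\n"
    "    state = START\n"
    "    for byte in buf:\n"
    "        state = table[state][byte]\n"
    "    return state\n"
)
near_copy = orig.replace("detect", "detect2")  # renamed, otherwise identical
independent = (
    "def sniff(data):\n"
    "    s = INITIAL\n"
    "    for b in data:\n"
    "        s = transitions[s][b]\n"
    "    return s\n"
)
```

The renamed copy scores high while the independent rewrite scores near zero, which is exactly the distinction the "identical typos, variable names changed" anecdote elsewhere in the thread turns on.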
| |
▲ | oytis 7 hours ago | parent | prev | next [-] | | It requires the original project not to be in the model's training data for it to be a clean-room rewrite. | |
▲ | zozbot234 7 hours ago | parent [-] | | That only matters if expression of the original project really does end up in the rewrite, doesn't it? This can be checked for (by the team with access to the code), and it's also quite unlikely to begin with. It's not trivial at all to have an LLM replicate its training data verbatim: even when feasible (the Harry Potter case, a work that's going to be massively overweighted in training due to its popularity), it takes very specific prompting and hinting. | | |
▲ | oytis 7 hours ago | parent | next [-] | | > That only matters if expression of the original project really does end up in the rewrite, doesn't it? No, I don't think so. I hate comparing LLMs with humans, but for a human, being familiar with the original code might disqualify them from writing a differently-licensed version. Anyway, LLMs are not human, so, as many courts have confirmed, their output is not copyrightable at all, under any license. | | |
| ▲ | toyg 7 hours ago | parent [-] | | Uh, this is just a curiosity, but do you have a reference for that last argument? If true, it would mean most commercial code being developed today, since it's increasingly AI-generated, would actually be copyright-free. I don't think most Western courts would uphold that position. | | |
| |
| ▲ | vkou 6 hours ago | parent | prev [-] | | > That only matters if expression of the original project really does end up in the rewrite, doesn't it? If that were the case, nobody would bother with clean-room rewrites. |
|
| |
| ▲ | nneonneo 6 hours ago | parent | prev [-] | | Somewhat annoyingly, there's been research that suggests that models can pass information to each other via (effectively) steganographic techniques - specific but apparently harmless choices of tokens, wordings, and so on; see https://arxiv.org/abs/1712.02950 and https://alignment.anthropic.com/2025/subliminal-learning/ for some simple examples. While it feels unlikely that a simple "write this spec from this code" + "write this code from this spec" loop would actually trigger this kind of hiding behaviour, an LLM trained to accurately reproduce code from such a loop definitely would be capable of hiding code details within the spec - and you can't reasonably prove that the frontier LLMs have not been trained to do so. |
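A deliberately crude toy illustration of the channel those papers describe (the synonym pairs and sentence template are invented for this example): a spec writer with free stylistic choices can smuggle bits past a reviewer who sees only ordinary prose.

```python
# Each pair is two interchangeable wordings; picking the first encodes a 0
# bit, the second a 1. A human reviewer sees only ordinary spec prose.
CHANNELS = [
    ("returns", "yields"),
    ("input", "argument"),
    ("function", "routine"),
]

def encode(bits):
    """Render a spec-like sentence whose word choices carry `bits`."""
    w = [pair[bit] for pair, bit in zip(CHANNELS, bits)]
    return f"The {w[2]} takes one {w[1]} and {w[0]} a value."

def decode(sentence):
    """Recover the hidden bits from the stylistic choices."""
    return [1 if one in sentence else 0 for zero, one in CHANNELS]
```

Three bits is nothing, but a long spec has thousands of such free choices, which is why "verify the spec contains no source code" is a weaker guarantee than it sounds when the spec writer may be optimizing for the implementer's success.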
|
|
| ▲ | duskdozer 6 hours ago | parent | prev | next [-] |
| Not if the codebase was included in the implementer's training data. |
|
| ▲ | fergie 7 hours ago | parent | prev | next [-] |
| Answer: probably not, as API topography is also a part of copyright. Edit: this is wrong. |
| |
|
| ▲ | actionfromafar 7 hours ago | parent | prev [-] |
| Yeah, I think the Compaq/IBM precedent can only superficially apply. It would be like having two teams only meet in a room full of documentation, but both teams crammed the source code the day before. (That is, the source code you are "reverse engineering" is in the training data.) It doesn't make sense. Also, it's weird that it's apparently okay to use pirated materials to teach an LLM, but maybe not to disseminate what the LLM then tells you. |