Cool project, but they really could have skipped the mention of clean room. Something trained on every copyrighted thing known to mankind is the opposite of clean room.

▲

cheema33 2 hours ago | parent | next [-]

As others have pointed out, humans train on existing codebases as well. And then use that knowledge to build clean room implementations.

	▲	mxey 2 hours ago | parent | next [-]
		That’s the opposite of clean-room. The whole point of clean-room design is that you have your software written by people who have not looked into the competing, existing implementation, to prevent any claim of plagiarism. “Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.”
	▲	kelnos an hour ago | parent | prev | next [-]
		No they don't. One team meticulously documents and specs out what the original code does, and then a completely independent team, who has never seen the original source code, implements it. Otherwise it's not clean-room, it's plagiarism.
	▲	regularfry 2 hours ago | parent | prev | next [-]
		What they don't do is read the product they're clean-rooming. That's kinda disqualifying. Impossible to know if the GCC source is in 4.6's training set but it would be kinda weird if it wasn't.
	▲	pizlonator 2 hours ago | parent | prev | next [-]
		Not the same. I have read nowhere near as much code (or anything) as what Claude has to read to get to where it is. And I can write an optimizing compiler that isn't slower than GCC -O0.
	▲	cermicelli 2 hours ago | parent | prev [-]
		If that's what clean room means to you, I do know AI can definitely replace you. As even ChatGPT is better than that. (prompt: what does a clean room implementation mean?) From ChatGPT without login BTW!
		> A clean room implementation is a way of building something (usually software) without copying or being influenced by the original implementation, so you avoid copyright or IP issues.
		> The core idea is separation.
		> Here’s how it usually works:
		> The basic setup
		> Two teams (or two roles):
		> Specification team (the “dirty room”)
		> - Looks at the original product, code, or behavior
		> - Documents what it does, not how it does it
		> - Produces specs, interfaces, test cases, and behavior descriptions
		> Implementation team (the “clean room”)
		> - Never sees the original code
		> - Only reads the specs
		> - Writes a brand-new implementation from scratch
		> Because the clean team never touches the original code, their work is considered independently created, even if the behavior matches.
		> Why people do this
		> - Reverse-engineering legally
		> - Avoid copyright infringement
		> - Reimplement proprietary systems
		> - Create open-source replacements
		> - Build compatible software (file formats, APIs, protocols)
		I really am starting to think we have achieved AGI.
		> Average (G)Human Intelligence LMAO

▲

benjiro 2 hours ago | parent | prev [-]

Hot take:

If you try to reimplement something in a clean room, it's a step-by-step process, using your own accumulated knowledge as the basis. That knowledge you hold in your brain is all too often code that may have copyrights on it, from the companies you worked for.

Is it any different for an LLM?

The fact that the LLM is trained on more data does not change that when you work for a company, leave it, and take that accumulated knowledge to a different company, you are by definition taking that knowledge (that may be copyrighted) and implementing it somewhere else. It is only an issue if you copy the code directly, or do the implementation as a 1:1 copy. LLMs do not make 1:1 copies of the original.

At what point is an LLM trained on copyrighted data any different than a human trained on copyrighted data, when that knowledge gets reimplemented in a transformative way? The big difference is that the LLM can hold more data over more fields than a human, true... But if we look at specializations, this can come back to the same, no?

	▲	42 minutes ago | parent | next [-]
		[deleted]
	▲	cermicelli 2 hours ago | parent | prev [-]
		If you have worked on a related copyrighted work, you can't work on a clean room implementation. You will be sued. There are lots of people who have tried and found out. They weren't trillion-dollar AI companies that could bankroll the defense, sure. But bringing up clean room while using copyrighted stuff is not even an argument; it's just nonsense, trying to prove something when no one asked.