And this is just the stuff you notice.

1) Verbatin copy is first-order plagiarism.

2a) Second-order plagiarism of written text would be replacing words with synonyms. Or taking a book paragraph by paragraph and for each one of them, rephrasing it in your own words. Yes, it might fool automated checkers but the structure would still be a copy of the original book. And most importantly, it would not contain any new information. No new positive-sum work was done. It would have no additional value.

Before LLMs almost nobody did this because the chance that it would help in a lawsuit vs the amount of work was not a good tradeoff. Now it is. But LLMs can do "better":

2b) A different kind of second-order plagiarism is using multiple sources and plagiarizing each of them only in part. Find multiple books on the same topic, take 1 chapter from each and order them in a coherent manner. Make it more granular. Find paragraphs or phrases which fit into the structure of your new book but are verbatim from other books. See how granular you can make it.

The trick here is that doing this by hand is more work than just writing your own book. So nobody did it and copyright law does not really address this well. But with LLMs, it can be automated. You can literally instruct an LLM to do this and it will do it cheaper than any human could. However, how LLMs work internally is yet different:

n) Higher-order plagiarism is taking multiple source books, identifying patterns, and then reproducing them in your "new" book.

If the patterns are sufficiently complex, nobody will ever be able to prove what specifically you did. What previously took creative human work now became a mechanical transformation of input data.

The point is this ability to detect and reproduce patterns is an impressive innovation but it's built on top of the work of hundreds of millions[0] of humans whose work was used without consent. The work done by those employed by the LLM companies is minuscule compared to that. Yet all of the reward goes to them.

Not to mention LLMs completely defear the purpose of (A)GPL. If you can take AGPL code and pass it through a sufficiently complex mechanical transformation that the output does the same thing but copyright no longer applies, then free software is dead. No more freedom to inspect and modify.

[0]: Github alone has 100 million users ( https://expandedramblings.com/index.php/github-statistics/ ) and we have reason to believe all of their data was used in training.

▲

jacquesm 3 days ago | parent | next [-]

If a human did 2a or 2b we would think that a larger infraction than (1) because it shows intent to obfuscate the origins.

As for your free software is dead argument: I think it is worse than that: it takes away the one payment that free software authors get: recognition. If a commercial entity can take the code, obfuscate it and pass it off as their own copyrighted work to then embrace and extend it then that is the worst possible outcome.

	▲	martin-t 3 days ago \| parent [-]
		> shows intent to obfuscate the origins Good point. Reminds me of how if you poison one person, you go to prison, but when a company poisons thousands, it gets a fine... sometimes. > it takes away the one payment that free software authors get: recognition I keep flip-flopping on this. I did most of my open source work not caring about recognition but about the principles of GPL and later AGPL. However, I came to realize it was a mistake - people don't judge you by the work you actually do but by the work you appear to do. I have zero respect for people who do something just for the approval of others but I am aware of the necessity of making sure people know your value. One thing is certain: credit/recognition affect all open source code, user rights (e.g. to inspect and modify) affect only the subset under (A)GPL. Both are bad in their own right.

▲

fc417fc802 2 days ago | parent | prev [-]

You make several good points, and I appreciate that they appear well thought out.

> What previously took creative human work now became a mechanical transformation of input data.

At which point I find myself wondering if there's actually a problem. If it was previously permitted due to the presence of creative input, why should automating that process change the legal status? What justifies treating human output differently?

> then free software is dead. No more freedom to inspect and modify.

It seems to me that depends on the ideological framing. Consider a (still entirely hypothetical) world where anyone can receive approximately any software they wish with little more than a Q&A session with an expert AI agent. Rather than free software being dead, such a scenario would appear to obviate the vast majority of needs that free software sets out to serve in the first place.

It seems a bit like worrying that free access to a comprehensive public transportation service would kill off a ride sharing service. It probably would, and the end result would also probably be a net benefit to humanity.

▲

jacquesm 2 days ago | parent | next [-]

> At which point I find myself wondering if there's actually a problem. If it was previously permitted due to the presence of creative input, why should automating that process change the legal status? What justifies treating human output differently?

▲

fc417fc802 2 days ago | parent [-]

Yes that's what the law currently says. I'm asking if it ought to say that in this specific scenario.

Previously there was no way for a machine to do large swaths of things that have now recently become possible. Thus a law predicated on the assumption that a machine can't do certain things might need to be revisited.

	▲	martin-t 14 hours ago \| parent [-]
		This is the first technology in human history which only works if you use an exorbitant amount of other people's work (without their consent, often without even their knowledge) to automate their work. There have been previous tech revolutions but they were based on independent innovation. > Thus a law predicated on the assumption that a machine can't do certain things might need to be revisited. Perhaps using copyright law for software and other engineering work might have been a mistake but it worked to a degree me and most devs were OK with. Sidenote: There is no _one_ copyright law. IANAL but reportedly, for example datasets are treated differently in the US vs EU, with greater protection for the work that went into creating a database in the EU. And of course, China does what is best for China at a given moment. There's 2 approaches: 1) Either we follow the current law. Its spirit and (again IANAL) probably the letter says that mechanical transformation preserves copyright. Therefore the LLMs and their output must be licensed under the same license as the training data (if all the training data use compatible licenses) or are illegal (if they mixed incompatible licenses). The consequence is that very roughly a) most proprietary code cannot be used for training, b) using only permissive code gives you a permissively licensed model and output and c) permissive and copyleft code can be combined, as long as the resulting model and output is copyleft. It still completely ignores attribution though but this is a compromise I would at least consider being OK with. (But if I don't get even credit for 10 years of my free work being used to build this innovation, then there should be a limit on how much the people building the training algorithms get out of it as well.) 2) We design a new law. Legality and morality are, sadly, different and separate concepts. Now, call me a naive sucker, but I think legality should try to approximate morality as closely as possible, only deviating due to the real-world limitations of provability. (E.g. some people deserve to die but the state shouldn't have the right to kill them because the chance of error is unacceptably high.) In practice, the law is determined by what the people writing it can get away with before the people forced to follow it revolt. I don't want a revolution, but I think for example a bloody revolution is preferable to slavery. Either way, there are established processes for handling both violations of laws and for writing new laws. This should not be decided by private for-profit corporations seeing whether they can get away with it scot-free or trembling that they might have to pay a fine which is a near-zero fraction of their revenue, with almost universally no repercussions for their owners.

▲

martin-t 2 days ago | parent | prev [-]

> What justifies treating human output differently?

Human time is inherently valuable, computer time is not.

One angle:

The real issue is how this is made possible. Imagine an AI being created by a lone genius or a team of really good programmers and researchers by sitting down and just writing the code. From today's POV, it would be almost unimaginably impressive but that is how most people envisioned AI being created a few decades ago (and maybe as far as 5 years ago). These people would obviously deserve all the credit for their invaluable work and all the income from people using their work. (At least until another team does the same, then it's competition as normal.)

But that's not how AI is being created. What the programmers and researchers really do it create a highly advanced lossy compression algorithm which then takes nearly all publicly available human knowledge (disregarding licenses/consent) and creates a model of it which can reproduce both the first-order data (duh) and the higher-order patterns in it (cool). Do they still deserve all the credit and all the income? What if there's 1k researchers and programmers working on the compression algorithm (= training algorithm) and 1B people whose work ("content") is compressed by it (= used to train it). I will freely admit that the work done to build the algorithm is higher skilled than most of the work done by the 1B people. Maybe even 10x or 100x more expensive. But if you multiply those numbers (1k * 100 vs 1B), you have to come to the conclusion that the 1B people deserve the vast majority of the credit and the vast majority of the income generated by the combined work. (And notice when another team creates a competing model based on the same data, the share by the 1B stays the same and the 1k have to compete for their fraction.)

Another angle:

If you read a book, learn something from it and then apply the knowledge to make money, you currently don't pay a share to the author of the book. But you paid a fixed price for the book, hopefully. We could design a system where books are available for free, we determine how much the book helped you make that money, and you pay a proportional share to the author. This is not as entirely crazy as it might sound. When you cause an injury to someone, a court will determine how much each party involved is liable and there are complex rules (e.g. https://en.wikipedia.org/wiki/Joint_and_several_liability) determining the subsequent exchange of money. We could in theory do the same for material you learn from (though the fractions would probably be smaller than 1%). We don't because it would be prohibitively time consuming, very invasive, and often unprovable unless you (accidentally) praise a specific blog post or say you learned a technique from a book. Instead, we use this thing called market capitalism where the author sets a price and people either buy the book or not (depending on whether they think it's worth it for them), some of them make no money as a result, some make a lot, and we (choose to) believe that in aggregate, the author is fairly compensated.

Even if your blog is available for anyone to read freely, you get compensated in alternative ways by people crediting you and/or by building an audience you can influence to a degree.

With LLMs, there is no way to get the companies training the models to credit you or build you an audience. And even if they pay for the books they use for training, I don't believe they pay enough. The price was determined before the possibility of LLM training was known to the author and the value produced by a sufficiently sophisticated AI, perhaps AGI (which they openly claim to want to create) is effectively unlimited. The only way to compensate authors fairly is to periodically evaluate how much revenue the model attracted and pay a dividend to the authors as long as that model continues to be used.

Best of all, unlike with humans, the inner workings of a computer model, even a very complex one, can be analyzed in their entirety. So it should be possible to track (fractional) attribution throughout the whole process. There's just no incentive for the companies to invest into the tooling.

---

> approximately any software they wish with little more than a Q&A session with an expert AI agent

Making software is not just about writing code, it's about making decisions. Not just understanding problem and designing a solution but also picking tradeoffs and preferences.

I don't think most people are gonna do this just like most people today don't go to a program's settings and tweak every slider/checkbox/dropdown to their liking. They will at most say they want something exactly like another program with a few changes. And then it's clearly based on that original program and all the work performed to find out the users' preferences/likes/dislikes/workflows which remain unchanged.

But even if they genuinely recreate everything, then if it's done by an LLM, it's still based on work of others as per the argument above.

---

> the end result would also probably be a net benefit to humanity.

Possibly. But in the case of software fully written by sufficiently advanced LLMs, that net benefit would be created only by using the work of a hundred million or possibly a billion of people for free and without (quite often against) their consent.

Forced work without compensation is normally called slavery. (The only difference is that our work has already been done and we're "only" forced to not be able to prevent LLM companies from using it despite using licenses which by their intent and by the logic above absolutely should.)

The real question is how to achieve this benefit without exploiting people.

And don't forget such a model will not be offered for free to everyone as a public good. Not even to those people whose data was used to train it. It will be offered as a paid service. And most of the revenue won't even go to the researchers and programmers who worked on the model directly and who made it possible. It will go to the people who contributed the least (often zero) technical work.

---

This comment (and its GP), which contains arguments I have not seen anywhere else, was written over an hour long train ride. I could have instead worked remotely to make more than enough money to pay for the train ride. Instead, I write this training data which will be compressed and some patterns from it reproduced, allowing people I will never know and who will never know me to make an amount of money I have no chance quantifying and get nothing from. Now, I have to work some other hour to pay for the train ride. Make of that what you will.

	▲	fc417fc802 2 days ago \| parent \| next [-]
		Human time is certainly valuable to a particular human. However, if I choose to spend time doing something that a machine can do people will not generally choose to compensate me more for it just because it was me doing it instead of a machine. I think it's worth remembering that IP law is generally viewed (at least legally) as existing for the net benefit of society as opposed to for ethical reasons. Certainly many authors feel like they have (or ought to have) some moral right to control their work but I don't believe that was ever the foundation of IP law. Nor do I think it should be! If we are to restrict people's actions (ex copying) then it should be for a clear and articulable net societal benefit. The value proposition of IP law is that it prevents degenerate behavior that would otherwise stifle innovation. My question is thus, how do these AI developments fit into that? So I completely agree that (for example) laundering a full work more or less verbatim through an AI should not be permissible. But when it comes to the higher order transformations and remixes that resemble genuine human work I'm no longer certain. I definitely don't think that "human exceptionalism" makes for a good basis either legally or ethically. Regarding FOSS licenses, I'm again asking how AI relates back to the original motivations. Why does FOSS exist in the first place? What is it trying to accomplish? A couple ideological motivations that come to mind are preventing someone building on top and then profiting, or ensuring user freedom and ability to tinker. Yes, the current crop of AI tools seem to pose an ideological issue. However! That's only because the current iteration can't truly innovate and also (as you note) the process still requires lots of painstaking human input. That's a far cry from the hypothetical that I previously posed.
	▲	jacquesm 2 days ago \| parent \| prev [-]
		One of your remarks regarding attribution and compensation goes back to 'Xanadu' by the way, if you are not familiar with it that might be worth reading up on (Ted Nelson). He obviously did this well before the current AI age but a lot of the ideas apply. A meta-comment: I absolutely love your attention to detail in this discussion and avoiding taking 'the easy way out' from some of the more hairy concept embedded. This is exactly the kind of interaction that I love HN for, and it is interesting how this thread seems to bring out the best in you at the same time that it seems to bring out the worst in others. Most likely they are responding as strongly as they do because they've bought into this matter to a degree that they are passing off works that they did not create as their own novel output, they got paid for it and they - like a religious person - are now so invested in this that it became their crutch and a part of their identity. If you have another train ride to make I'd love for you to pick apart that argument and to refute it.