kator 6 hours ago

> Yet, this shift made me re-evaluate the open source code publishing. Prior to that, I have been positive about free and open software, and considered this to be the default mode for work such as kefir. I did not require any justifications from myself to publish something. Now, however, I feel more and more that the main beneficiaries of my unpaid work are companies scraping the internet to train large language models. Currently accepted status quo in this area goes against my own intentions in licensing this work under GNU GPLv3. Publication has ceased to be the "null hypothesis" for me, and requires explicit mental justification which I am not able to provide.

I feel this pain. One of my small, donation-driven sites has been destroyed by crawlers that just ignore robots.txt and burn the site into the ground.

Sort of jokingly I proposed an update to the "spam fax" law:

https://www.karlbunch.com/random/website-protection-act/

account42 4 hours ago | parent | next [-]

This is essentially the digital world transforming from a high trust society into a low trust one. Sad to see.

oooyay an hour ago | parent | next [-]

Not even just digital; much of the world is shifting from high trust to low trust as well: https://social.desa.un.org/sites/default/files/inline-files/...

Gormo 2 hours ago | parent | prev [-]

To whom would you attribute the greater part of that reduction in trust: the people using FOSS to train LLMs, or the people trying to block them?

Xirdus an hour ago | parent | next [-]

People who break the social contract are the ones responsible for breaking the social contract, not the ones who take steps in response to the social contract being broken.

Gormo an hour ago | parent | next [-]

So the questions here are (a) is any generally accepted social contract actually being broken, and (b) if so, who are the ones who are breaking it?

Xirdus 15 minutes ago | parent | next [-]

Are you asking how AI coding agents, the companies selling them and the individuals using them break the FOSS social contract (copyleft, attribution, upstreaming), or are you disputing that they do?

Gormo 9 minutes ago | parent [-]

Both would resolve to the same question, no?

There seems to be an implicit premise here that any work generated by an LLM whose training data includes a particular bit of code itself constitutes a redistribution of that code. I've yet to encounter any strong arguments substantiating this premise as a general principle, and my own suspicion is that it is not valid as a general principle, given the nature of how LLMs operate.

It's certainly possible that specific instances of LLMs lazily copy-pasting code from public repos may exist, and the extent to which this is happening is something that can be substantiated by empirical examples, so if you have any to point to, I'd be interested in looking at them. However, where this is happening, it ought to be regarded as a failure modality of LLMs, and not something that implicates the underlying nature of LLMs, given that their intended purpose is to function as stochastic generators that do not merely copy-paste input data.

My initial feeling here is that using open-source code to train LLMs is not per se a violation of the generally accepted FOSS social contract, but rather that attempting to restrict specific use cases of FOSS-licensed code on the basis of normative opinions unrelated to the license terms is a violation, or at least a rejection, of that social contract. I'm not fully committed to this position, though, and would welcome well-reasoned arguments to the contrary.

26 minutes ago | parent | prev [-]
[deleted]
dlev_pika an hour ago | parent | prev [-]

“No, no, what was she wearing?”

Xirdus 21 minutes ago | parent | next [-]

People who take steps in response to the social contract being broken are the ones responsible for the steps they've taken, not the ones who break the social contract.

24 minutes ago | parent | prev [-]
[deleted]
hilariously an hour ago | parent | prev [-]

It's definitely the ones DDOSing websites while giving no attribution in any way to the original creators.

Gormo 26 minutes ago | parent [-]

DDOSing websites seems to be an unrelated problem, and one that has traditionally been solved through response throttling and IP blocking.
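For anyone unfamiliar with what "response throttling and IP blocking" usually means in practice, here is a minimal sketch, assuming a per-IP token bucket plus a block list for repeat offenders. The class names (`TokenBucket`, `Throttle`), the thresholds, and the status strings are all hypothetical and chosen for illustration; a real deployment would more likely do this at the reverse proxy or firewall.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now):
        # Refill in proportion to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Throttle:
    """Per-IP throttling; blocks an IP after `ban_after` rejected requests."""

    def __init__(self, rate=1.0, capacity=5.0, ban_after=20):
        self.rate = rate
        self.capacity = capacity
        self.ban_after = ban_after
        self.buckets = {}
        self.rejections = defaultdict(int)
        self.blocked = set()

    def check(self, ip, now=None):
        """Return 'ok', 'throttled' (think HTTP 429) or 'blocked' (think HTTP 403)."""
        now = time.monotonic() if now is None else now
        if ip in self.blocked:
            return "blocked"
        bucket = self.buckets.get(ip)
        if bucket is None:
            bucket = self.buckets[ip] = TokenBucket(self.rate, self.capacity, now)
        if bucket.allow(now):
            return "ok"
        # Count rejected requests; persistent offenders get blocked outright.
        self.rejections[ip] += 1
        if self.rejections[ip] >= self.ban_after:
            self.blocked.add(ip)
            return "blocked"
        return "throttled"
```

Note that this keys on a single source address, so it only helps against clients that keep reusing the same IP.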

Attribution is often required even on MIT or BSD licenses where code is being redistributed, either in original or modified versions, but that would relate to this discussion only to the extent that one regards using LLMs whose training data included a certain bit of code as itself constituting redistribution of that specific code -- but that in turn is a very debatable premise which really ought to be argued for, and not merely argued upon as though it is already generally recognized as true.

hilariously 17 minutes ago | parent [-]

Why? You stole my stuff and now are pretending I need to argue for you to stop stealing it. It's a joke.

malwrar 5 hours ago | parent | prev | next [-]

Really hate to say it, but I’ve stopped publishing my work too for this reason. I spend most of my time now building my own little software ark, and I aspire to no longer think of programming in the next few years. I feel like the creative economy in general will be unrecognizable in the near future, maybe nonexistent. I wonder what modes of collaboration on ideas might form in the next few years.

irdc 4 hours ago | parent | next [-]

Here is what the purveyors of AI don't seem to realise. You can bend copyright law all you want in order to train your models on whatever you can grab, but in the absence of genuine protection of their creative work, authors are simply not going to be publishing at all.

buran77 2 hours ago | parent | next [-]

I think they see it all too well. They still think they can make bank today while it lasts; whatever comes after is some other shareholder's problem. And if we're talking about open source, killing it might be a positive side effect for them: they'll be ready to sell you a closed source alternative when you no longer have options.

irdc 4 minutes ago | parent | next [-]

I don't think we're going back to closed source. I think we're going back to guilds, a.k.a. closed knowledge.

lesostep an hour ago | parent | prev [-]

Furthermore, if people not only stop publishing but also take down already published works, it will create a moat around the language models that already exist.

And the more they DDOS small websites — instead of respectfully scraping once — the more realistic my conspiracy theory looks.

egypturnash 2 hours ago | parent | prev | next [-]

People who are making stuff because they want to share it are still going to be publishing. And fighting to be noticed in an unending torrent of slop.

irdc 2 hours ago | parent [-]

Without any material or immaterial benefits? And with one's work being ground up and turned into weights for the next version of the machine that's threatening one's employment?

dzhiurgis 3 hours ago | parent | prev [-]

Great. More work for AI then.

kator 2 hours ago | parent | prev [-]

The sad thing is that I feel trapped on all sides of the debate. I wrote a book about LLMs and human creativity (spoiler: humans win for a long time). I was going to do it as a blog series, but instead I published https://www.amazon.com/dp/B0GXCSY4W8 because I felt I might at least get a bit back for the literally hundreds of hours of my life I poured into the book, and for my editor and the friends who read it and provided reviews.

And I push a lot of open source code, including a ton for the SWGEmu project, but now I’m of a mixed mind about whether to stop pushing anything public. I can’t decide: am I talking out of both sides of my mouth? It’s a confusing time to navigate, for sure.

jagged-chisel 5 hours ago | parent | prev [-]

> The sender pays, not the receiver.

You have a hole here. Your web server is sending the response and the bot is receiving.

Fix that and … profit? :-)

kator 2 hours ago | parent | next [-]

Oh, good point, I got that backwards… OMG, my fax brain didn’t even think about it.

wizzwizz4 3 hours ago | parent | prev [-]

I'm trying to compose a better wording, but my attempts aren't working. The best I've got is:

> The initiator of the communication pays, not the server operator.