| ▲ | bdashdash 7 hours ago |
| Isn't the data that flows through GitHub so valuable that they (Microsoft) are happy to eat the cost? I don't have a clear idea how that value can be captured, since it's going to be 90% AI-generated code that anyone can scrape (public projects) or can't be used (private projects), so perhaps you're right. |
|
| ▲ | Athas 7 hours ago | parent | next [-] |
| > Isn't the data they capture so valuable that they (Microsoft) are happy to eat the cost? Even if that is true, unless the value of the data corresponds to near-term revenue, the cost may eventually become impossible to meet. Or, for that matter, the capital to manage the increasing load may simply not exist: it does not matter how much valuable data you have if the supply of hardware cannot keep up with your demand. I also suspect that most of the "data" obtained by the incessant hammering on GitHub is not very valuable. Most business code is routine, and getting Copilot to help generate enormous amounts of it may not contribute much in return. |
|
| ▲ | petcat 7 hours ago | parent | prev | next [-] |
| > 90% AI generated code And it isn't even clear yet whether AI-generated code is particularly valuable, since it's legally ambiguous whether any human authorship can be attributed to it. The US Copyright Office has declined to register copyright for generative-AI artwork; it's only a matter of time before the same question comes up about code. |
| |
| ▲ | graemep 7 hours ago | parent [-] | | Your claim is incorrect. Something purely AI generated may not be covered by copyright in the US. That would arguably make it more valuable to MS, since it could be reused freely. However, works with significant human input are covered by copyright, and most code does have such input: human review and correction are very common. There is a lot of AI-generated code out there, and there are no cases challenging the copyright on it. You also need to look beyond US law. Software is a global business, and most software businesses do not want to write software they can only sell in certain countries. | | |
| ▲ | sofixa 7 hours ago | parent [-] | | > However, works with significant human input are covered by copyright, and most code does have such input. Human review, and correction is very common. There is a lot of AI generated code out there, and there are no cases challenging the copyright on it. Legislation and court decisions are still pending. There are numerous lawsuits about the copyrightability of LLM output, and about LLMs' right to use copyrighted work, and both could have ramifications for code. I don't see how telling Claude Code to write you a function fetching an entry from a database is materially different from telling ChatGPT to generate a picture of a unicorn riding a bicycle. Both have the same level of input (a desired end goal), and both might go through review and updates ("no, a pink unicorn"; "no, cache the database connection"). Legal challenges over code copyright are relatively rare nowadays, so I wouldn't take the lack of high-profile lawsuits as proof of legality or copyrightability. And yes, this will also depend on jurisdiction. Court decisions or laws can change that. Litigation over copyright infringement via training and reproduction is ongoing in multiple jurisdictions, and it wouldn't shock me if at least some decide that it is indeed copyright infringement to pirate content to train LLMs that can reproduce it. | | |
| ▲ | xp84 6 hours ago | parent | next [-] | | If I write a program of 1,000 lines of code with AI features turned off, then turn the AI features on and use a completion to edit one function, can my program not be copyrighted?
(I expect/hope you’ll say: “Of course it’s still eligible for copyright.”) How about if I write 100 lines myself, turn the AI features on, vibe code 100 lines, and repeat this for five cycles? Half the functions are AI-coded and half the functions I wrote myself.
How about if I just tell Claude to write the program? And what if I tell Claude to write the program, then spend six months tweaking most of the lines of code? I struggle to see a specific and obvious point where a line should be drawn. It seems intuitive to me that if I spend at least a few days’ worth of effort on a code base (whether tweaking, correcting, or directing AI to do targeted refactors), that is meaningful human authorship, even if it has thousands of lines of generated code. I can, however, acknowledge the fairness that something which is simply one-shot output probably shouldn’t merit protection.
But really, in any of these cases, it’s going to be pretty hard to prove after the fact what the exact proportion of generated code to human authorship is, so I don’t know how a court will really tell whether a repo with 20,000 LOC is one-shot or actually had a person spend a few weeks tweaking it. | | |
| ▲ | elevation 4 hours ago | parent [-] | | > And what if I tell Claude to write the program Why should this be any different from telling (or paying) a human to write the program? You're free to enter an agreement assigning all rights to the employer or the worker, or licensing the work revocably or irrevocably, transferably or not. There is no need to wait for a court decision to understand what the results will be. |
| |
| ▲ | graemep 6 hours ago | parent | prev [-] | | If that function is all you ask it to write, as a one-off, maybe. However, if that function is part of a larger, human-designed system, it is very different. If you review and correct the code in the system, it is very different. Pages 27 and 28 of this are relevant: https://www.copyright.gov/ai/Copyright-and-Artificial-Intell... |
|
|
|
|
| ▲ | gpugreg 7 hours ago | parent | prev | next [-] |
| > I don't have a clear idea how that value can be captured, since it's going to be 90% AI generated code that anyone can scrape (public projects) or can't be used (private projects), so perhaps you're right. The value is probably in knowing which AI-generated code ends up being pushed or discarded, which can't be derived from public projects alone. This information can then be used to fine-tune the next big model so that it only generates the "good" code. |
|
| ▲ | graemep 7 hours ago | parent | prev | next [-] |
| It's easier for them to scrape than it is for anyone else. They also have a lot more metadata about the code, which may be useful. Do GitHub's terms entirely prevent them from making use of data in private projects? |
|
| ▲ | desdenova 7 hours ago | parent | prev [-] |
| > or can't be used (private projects) As if they cared about that |