ben_w 3 days ago:

Necessarily, LLM output that works isn't gibberish. The code that LLMs output has worked well enough to learn from since the initial launch of ChatGPT, even though back then you might have had to repeatedly say "continue" because it would stop in the middle of writing a function.

inferiorhuman 2 days ago:

> Necessarily, LLM output that works isn't gibberish.

Hardly. Poorly conjured-up code can still work.

ben_w 2 days ago:

"Gibberish" code is necessarily code which doesn't work, even in the broader sense of the term: https://en.wikipedia.org/wiki/Gibberish

Especially in this context: if a mystery box solves a problem for me, I can look at the solution and learn something from it, cf. how paper was inspired by watching wasps at work. Even the abject failures can be interesting, though I find them more helpful for forcing my writing to be easier to understand.

oblio 2 days ago:

It's not gibberish. More than that, LLMs frequently write comments (some are fluff, but some explain the reasoning quite well), variables are frequently named better than cdx, hgv, ti, and the like, plus watching the reasoning as it happens provides more clues.

Also, it's actually fun watching LLMs debug, since they investigate much like devs do, but with a data bank the size of the internet, so they can pull hints that sometimes surprise even experienced devs.

I think hard-earned knowledge coming from actual coding is still useful to stay sharp, but it might turn out the balance is something like 25% handmade / 75% LLM-made.

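To illustrate the naming point, a contrived sketch; both functions are invented for contrast, neither comes from a real LLM transcript:

    # The kind of terse naming the comment is contrasting against.
    def f(cdx, hgv, ti):
        return cdx * hgv + ti

    # The more descriptive style LLM output tends toward, comment included.
    def total_cost(unit_price: float, quantity: int, flat_tax: float) -> float:
        # Total is line cost plus a fixed tax amount (not a rate).
        return unit_price * quantity + flat_tax
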
inferiorhuman 2 days ago:

> they have a data bank the size of the internet so they can pull hints that sometimes surprise even experienced devs.

That's a polite way of phrasing "they've stolen a mountain of information and overwhelmed the resources that humans would otherwise use to find answers." I just discovered another victim: the Renesas forums. Cloudflare is blocking me from accessing the site completely, the only site this has ever happened to me on. But I'm glad you're able to have your fun.

> it might turn out the balance is something like 25% handmade / 75% LLM-made.

Doubtful. As the arms race continues, AI DDoS bots will have less and less recent "training" material. Not a day goes by that I don't discover another site employing anti-AI bot software.

ben_w 2 days ago:

> they've stolen a mountain of information

In law, training is not itself theft. Pirating books for any reason, including training, is still a copyright violation, but the judges ruled specifically that training on lawfully obtained data was not itself an offence. Cloudflare has to block so many more bots now precisely because crawling the public, free-to-everyone internet is legally not theft. (And it would struggle to be, given that all search engines have been doing just that for a long time.)

> As the arms race continues, AI DDoS bots will have less and less recent "training" material

My experience as a human is that humans keep re-inventing the wheel, and if they instead re-read the solutions from even just 5 years earlier (or 10, or 15, or 20…) we'd already have simpler code and tools that did all we wanted. For example, "making a UI" peaked sometime between the late 90s and mid 2010s with WYSIWYG tools like Visual Basic (and the Mac equivalent now known as Xojo) and Dreamweaver, plus a few good years at the end of that span when Interface Builder finally wasn't sucking in Xcode. And then everyone on the web went for React, and Apple made SwiftUI with a preview mode that kept crashing. If LLMs had come before reactive UI, we'd have non-reactive alternatives that would probably suck less than all the weird things I keep seeing from reactive UIs.

Anamon 2 days ago:

> Cloudflare has to block so many more bots now precisely because crawling the public, free-to-everyone internet is legally not theft.

That is simply not true. Freely available on the web doesn't mean it's in the Public Domain.

The "lawfully obtained" part of your argument is patently untrue. You can legally obtain something, but that doesn't mean any use of it is automatically legal as well. Otherwise, the recent Spotify dump by Anna's Archive would be legal as well. It all depends on the license the thing is released under, chosen by the person who made it freely accessible on the web. This license is still very emphatically a legally binding document that restricts what someone can do with it.

For instance, since the advent of LLM crawling, I've added the "No Derivatives" clause to the CC license of anything new I publish to the web. It's still freely accessible, can be shared on, etc., but it explicitly prohibits using it for training ML models. I even add an additional clause to that effect, should the legal interpretation of CC-ND ever change. In short, anyone training an LLM on my content is infringing my rights, period.

ben_w 2 days ago:

> Freely available on the web doesn't mean it's in the Public Domain.

Doesn't need to be.

> The "lawfully obtained" part of your argument is patently untrue. You can legally obtain something, but that doesn't mean any use of it is automatically legal as well.

I didn't say "any" use, I said this specific use. Here's the quote from the judge who decided this:

> 5. OVERALL ANALYSIS. After the four factors and any others deemed relevant are "explored, [ ] the results [are] weighed together, in light of the purposes of copyright." Campbell, 510 U.S. at 578. The copies used to train specific LLMs were justified as a fair use. Every factor but the nature of the copyrighted work favors this result. The technology at issue was among the most transformative many of us will see in our lifetimes.

https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

> Otherwise, the recent Spotify dump by Anna's Archive would be legal as well.

I specifically said copyright infringement was separate. Because, guess what, so did the judge, in the next paragraph but one after the quote I just gave you.

> For instance, since the advent of LLM crawling, I've added the "No Derivatives" clause to the CC license of anything new I publish to the web. It's still freely accessible, can be shared on, etc., but it explicitly prohibits using it for training ML models. I even add an additional clause to that effect, should the legal interpretation of CC-ND ever change. In short, anyone training an LLM on my content is infringing my rights, period.

It will be interesting to see if that holds up in future court cases. I wouldn't bank on it if I were you.

oblio 2 days ago:

> That's a polite way of phrasing "they've stolen a mountain of information and overwhelmed the resources that humans would otherwise use to find answers."

Yes, but I can't stop them; can you?

> But I'm glad you're able to have your fun.

Unfortunately, I have to be practical.

> Doubtful. As the arms race continues, AI DDoS bots will have less and less recent "training" material. Not a day goes by that I don't discover another site employing anti-AI bot software.

Almost all these BigCos are using their internal code bases as material for their own LLMs. They're also increasingly instructing their devs to code primarily using LLMs. The hope that they'll run out of relevant material is slim.

Oh, and at this point it's less about the core/kernel, the LLMs themselves, than about building ol' fashioned procedural tooling (a.k.a. code) around the LLM so that it can just REPL like a human. It turns out a lot of regular coding and debugging is what a machine would do: READ-EVAL-PRINT. I have no idea how far they're going to go, but the current iteration of Claude Code can generate average or better code, which is an improvement in many places.

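A minimal sketch of that kind of read-eval-print tooling around an LLM; llm_complete and the "DONE:" convention here are hypothetical stand-ins, not Claude Code's actual implementation:

    import subprocess

    def llm_complete(transcript: str) -> str:
        """Hypothetical stand-in for a real LLM API call."""
        raise NotImplementedError("wire up a real model client here")

    def agent_repl(task: str, max_steps: int = 10) -> str:
        transcript = f"Task: {task}\n"
        for _ in range(max_steps):
            # READ: ask the model for the next shell command, given all output so far.
            action = llm_complete(transcript)
            if action.startswith("DONE:"):
                return action.removeprefix("DONE:").strip()
            # EVAL: run the proposed command (sandbox this in real use).
            result = subprocess.run(
                action, shell=True, capture_output=True, text=True, timeout=60
            )
            # PRINT: feed stdout/stderr back so the model can debug its own step.
            transcript += f"\n$ {action}\n{result.stdout}{result.stderr}"
        return "step budget exhausted"

The loop is exactly the READ-EVAL-PRINT shape described above: the model proposes a step, the harness executes it, and the output becomes part of the next prompt.
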
inferiorhuman 2 days ago:

> The hope that they'll run out of relevant material is slim.

If big corps are training their LLMs on their LLM-written code…

oblio 2 days ago:

You're almost there:

> If big corps are training their LLMs on their LLM-written code <<and human-reviewed code>>…

The last part is important.