| ▲ | drob518 2 hours ago |
| > It remains unclear whether continuing to throw vast quantities of silicon and ever-bigger corpuses at the current generation of models will lead to human-equivalent capabilities. Massive increases in training costs and parameter count seem to be yielding diminishing returns. Or maybe this effect is illusory. Mysteries! I’m not even sure whether this is possible. The current corpus used for training includes virtually all known material. If we make it illegal for these companies to use copyrighted content without remuneration, either the task gets very expensive, indeed, or the corpus shrinks. We can certainly make the models larger, with more and more parameters, subject only to silicon’s ability to give us more transistors for RAM density and GPU parallelism. But it honestly feels like, without another “Attention is All You Need” level breakthrough, we’re starting to see the end of the runway. |
|
| ▲ | munificent an hour ago | parent | next [-] |
| There is a whole giant essay I probably need to write at some point, but I can't help but see parallels between today and the Industrial Revolution. Prior to the Industrial Revolution, the natural world was nearly infinitely abundant. We simply weren't efficient enough to fully exploit it. That meant that it was fine for things like property and the commons to be poorly defined. If all of us can go hunting in the woods and yet there is still game to be found, then there's no compelling reason to define and litigate who "owns" those woods. But with the help of machines, a small number of people were able to completely deplete parts of the earth. We had to invent giant legal systems in order to determine who has the right to do that and who doesn't. We are truly in the Information Age now, and I suspect a similar thing will play out for the digital realm. We have copyright and intellectual property law already, of course, but those were designed presuming a human might try to profit from the intellectual labor of others. With AI, we're in the industrial era of the digital world. Now a single corporation can train an AI using someone's copyrighted work and in return profit off the knowledge over and over again at industrial scale. This completely upends the tenuous balance between creators and consumers. Why would a writer put an article online if ChatGPT will slurp it up and regurgitate it back to users without anyone ever even finding the original article? Who will contribute to the digital commons when rapacious AI companies are constantly harvesting it? Why would anyone plant seeds on someone else's farm? It really feels like we're in the soot-covered child-coal-miner Dickensian London era of the Information Revolution and shit is gonna get real rocky before our social and legal institutions catch up. |
| |
| ▲ | arjie 37 minutes ago | parent | next [-] | | If I'm being honest, I've never related to that notion of remuneration and credit being the primary reason to write something. I don't claim to be some great writer or anything, but I do have a blog I write quite often on (though I'm traveling in my wife's home country of Taiwan and haven't updated it in a while). But for me, I write because it feels good to do so. Sometimes there's a group utility in things, like I edit a Google Maps listing to be correct even though "a faceless corporation is going to hoover up my work and profit off it without paying me for my work" and I might pick up a Lime bike someone's dropped onto the sidewalk even though "a faceless corporation is externalizing the work of organizing the proper storage of their property on public land without paying the workers" or so on. I just think it's nice to contribute to the human commons and it's fine if some subset of my fellow organisms uses it in whatever way. Realistically, the fact that Brewster Kahle is paid whatever few hundred thousand he's paid for managing a non-profit that only exists because it aggregates other people's work isn't a problem for me. Or that Larry Page and Sergey Brin became ultra-rich around providing a search interface into other people's work. Or that Sam Altman and Dario Amodei did the same through a different interface. This particular notion doesn't seem to be a post-AI trend. It seems to have happened prior to the big GPTs coming out, when people started doing a lot of this accounting-for-contribution stuff. One day it'll be interesting to read why it started happening because I don't recall it from the past. Perhaps I just wasn't super plugged in to the communities that were complaining about Red Hat, Inc. It's not that I wouldn't understand if I sold my Subaru to a guy who immediately managed to sell it to another guy for a million times the money. I get that. I'd feel cheated. 
But if I contributed a little to it, like I did so Google would have a site to list for certain keywords so that they could show ads next to it in their search results, I just find it so hard to be like "That's my money you're using. Pay me!". | | |
| ▲ | wat10000 29 minutes ago | parent [-] | | You do it as a hobby, that's fine. Some people do it for a living. And while they aren't owed a living doing that specific thing, it is going to be a big problem for them if they can't make money at it anymore. I'm sure plenty of people feel the same way about software. They make software as a hobby and don't care about remuneration or credit. Meanwhile I write software for my day job and losing the ability to make money from it would be devastating. | | |
| ▲ | arjie 12 minutes ago | parent [-] | | Ah, I see. It’s just straightforward protectionism like dockworkers opposing automation and so on. That I do comprehend, in fact. I write software too and I may no longer be able to just do it in the old way. Pretty scary world but also exciting. I can’t imagine trying to restrict LLM software writers on that basis but I can comprehend it as simply self-interest. Fair enough. |
|
| |
| ▲ | cjcole 18 minutes ago | parent | prev | next [-] | | "but I can't help but see parallels between today and the Industrial Revolution" You're not the only one. The current Pope Leo XIV explicitly named himself after the previous Leo, Pope Leo XIII, who was pope during the Industrial Revolution (1878-1903) and issued the influential Encyclical Rerum novarum (Rights and Duties of Capital and Labor) in response to the upheaval. “Pope Leo XIII, with the historic Encyclical Rerum novarum, addressed the social question in the context of the first great industrial revolution,” Pope Leo recalled. “Today, the Church offers to all her treasure of social teaching in response to another industrial revolution and the developments of artificial intelligence.” A name, then, not only rooted in tradition, but one that looks firmly ahead to the challenges of a rapidly changing world and the perennial call to protect those most vulnerable within it. https://www.vatican.va/content/leo-xiii/en/encyclicals/docum... https://www.vaticannews.va/en/pope/news/2025-05/pope-leo-xiv... | |
| ▲ | steveklabnik an hour ago | parent | prev | next [-] | | As you know, I deeply respect you. Not trying to argue here, just provide my own perspective: > Why would a writer put an article online if ChatGPT will slurp it up and regurgitate it back to users without anyone ever even finding the original article? I write things for two main reasons: I feel like I have to. I need to create things. On some level, I would write stuff down even if nobody reads it (and I do do that already, with private things.) But secondly, to get my ideas out there and try to change the world. To improve our collective understanding of things. A lot of people read things, it changes their life, and their life is better. They may not even remember where they read these things. They don't produce citations all of the time. That's totally fine, and normal. I don't see LLMs as being any different. If I write an article about making code better, and ChatGPT trains on it, and someone, somewhere, needs help, and ChatGPT helps them? Win, as far as I'm concerned. Even if I never know that it's happened. I already do not hear from every single person who reads my writing. I don't mean to say that everyone has to share my perspective. It's just my own. | | |
| ▲ | lelanthran an hour ago | parent | next [-] | | > I don't mean to say that everyone has to share my perspective. It's just my own. I think you are walking all around the word "consent" and trying very hard to avoid it altogether. Your perspective, because it refuses to include any sort of consent, is invalid. No perspective that refuses consent can be valid. | | |
| ▲ | steveklabnik 18 minutes ago | parent [-] | | Consent is absolutely important, but that does not mean that every single thing in the entire world requires explicit consent. You did not ask me for consent to use my words in your comment. That does not mean you're a bad person. Fair use is an important part of intellectual property law. If it did not exist, the powerful could, for example, stifle public criticism by declaring that they do not consent to you using their words or likeness. The ability to do that is important for society. It is also just generally important for creating works inspired by others, which is virtually every work. There have to be lines between cases where attribution is required and cases where it is not. | | |
| ▲ | lelanthran 3 minutes ago | parent [-] | | > You did not ask me for consent to use my words in your comment. I am not representing your words as mine. I am not using your words to profit from. I am not making a gain by attributing your words to you. > There have to be lines between cases where attribution is required and cases where it is not. You are blurring the lines between "using a quote or likeness" and "giving credit to". I am skeptical that you don't know the difference between the two. Regardless, any "perspective" that disregards the need to acquire consent is invalid. Even if you are going to ignore it, you have to acknowledge that you don't feel you need any consent from the people you are taking from. This whole "silence is consent" attitude is baffling. |
|
| |
| ▲ | munificent an hour ago | parent | prev [-] | | Agreed, totally! I still write and put stuff online. But it definitely feels different now. It used to feel like I was tending a public garden filled with other people who might enjoy it. It still kind of feels like that, but there are a handful of giant combine machines grinding their way around the garden harvesting stuff and making billionaires richer at the same time. It's not enough to dissuade me from contributing to the public sphere, but the vibe is definitely different. Honestly, it reminds me a lot of the early days of Amazon. It's hard to remember how optimistic the world felt back then, but I remember a time when writing reviews felt like a public good because you were helping other people find good products. It was like we all wanted honest product information and Amazon provided a neutral venue for us to build it. Like Wikipedia for stuff. But as Amazon got bigger and bigger and the externalities more apparent, it felt less like we were helping each other and more like we were helping Bezos buy yet another yacht or media empire. And as the reviews got more and more gamed by shady companies, they became less of a useful public good. The whole commons collapsed. I worry that the larger web and digital knowledge environment is going that way. I still intend to create and share my stuff with the world because that's who I want to be. But I'll always miss the early days of the web where it felt like a healthier environment to be that kind of person in. | | |
| ▲ | steveklabnik an hour ago | parent [-] | | I can totally see that, for sure. I was much more likely to write a review long ago, now I don't even bother. (For buying stuff online, at least.) Maybe I lost my innocence about this stuff a long time ago, and so it's not so much LLMs that broke it for me, but maybe... I dunno, the downfall of Web 2.0 and the death of RSS? I do think that the old internet, for some definition of "old," felt different. For sure. I'll have to chew on this. I certainly felt some shock on the IP questions when all of this came up. I'm from the "information wants to be free" sort of persuasion, and now that largely makes me feel kinda old. Also I'm not a fan of billionaires, obviously, but I think that given I've worked on open source and tools for so long, I kinda had to accept that stuff I make was going to be used towards ends I didn't approve of. Something about that is in here too, I think. (Also, I didn't say this in the first comment, but I'm gonna be thinking about the industrial revolution thing a lot, I think you're on to something there. Scale meaningfully changes things.) |
|
| |
| ▲ | konschubert 9 minutes ago | parent | prev | next [-] | | > Prior to the industrial revolution, the natural world was nearly infinitely abundant. The opposite is true. Central Europe was almost devoid of trees. Food was scarce as arable land bore little fruit without fertiliser. Society was Malthusian until the Industrial Revolution. | |
| ▲ | pocksuppet 9 minutes ago | parent | prev | next [-] | | Stuff gets put online when the reader isn't the customer. Someone is paying for a reader to be told certain things. So it's free at the point of reading. | |
| ▲ | drob518 an hour ago | parent | prev | next [-] | | A couple thoughts… Mostly, AIs don’t recite back various works. Yes, there are a couple of high-profile cases where people were able to get an AI to regurgitate pieces of New York Times articles and Harry Potter books, but mostly not. Mostly, it is as if the AI is your friend who read a book and gives you a paraphrase, possibly using a couple of sentences verbatim. In other words, it probably falls under a fair use rule. Secondly, given the modern world, content that doesn’t appear online isn’t consumed much, so creators who are doing it for the money will certainly continue putting content online. Much of that content will be generated by AIs, however. | | |
| ▲ | triceratops an hour ago | parent [-] | | You're missing the point. This is the crux of munificent's argument IMO (and I've made variations of it as well) > We have copyright and intellectual property law already, of course, but those were designed presuming a human might try to profit from the intellectual labor of others. You getting a summary of a copyrighted work from a friend is necessarily limited by the number of friends you have, the amount of time they have to read stuff and talk to you, and so on. Machines (and AIs) don't have those limitations. | | |
| ▲ | drob518 43 minutes ago | parent [-] | | Yes, true. But does that really shift the argument much? An AI is like the most well-read book nerd you’ve ever met. The AI has read everything. They still won’t recite Harry Potter for you at full length and reading what the original author wrote is part of the pleasure. | | |
| ▲ | nrabulinski 25 minutes ago | parent [-] | | Does a literal book nerd generate profit for megacorporations when they bring up books to you, while burning through a household's worth of energy in the process?
Also, I’d like to talk with such a book nerd because they’d have opinions on books. If I brought up something I have read, we could exchange thoughts about it, and they could make recommendations for me based on their complex experiences instead of statistics from Reddit comments. An LLM can do none of those things, while still profiting megacorporations and burning that energy. It’s a lose-lose. Also, a book nerd doesn’t need roughly all human-created text to train on to produce meaningful results. It’s just such a misplaced analogy, and people have been making it ever since OpenAI announced ChatGPT for the first time. Why do people think “an LLM is just a human who read a lot”? |
|
|
| |
| ▲ | bluefirebrand an hour ago | parent | prev [-] | | > It really feels like we're in the soot-covered child-coal-miner Dickensian London era of the Information Revolution and shit is gonna get real rocky before our social and legal institutions catch up The really discouraging part of this is that it feels like our social and legal institutions don't even care if they catch up or not. Technology is speeding up and the lag time before anything is discussed from a legal standpoint is way, way too long |
|
|
| ▲ | xmprt 2 hours ago | parent | prev | next [-] |
| I see a lot of researchers working on newer ideas so I wouldn't be surprised if we get a breakthrough in 5-10 years. After all, the gap between AlexNet and Attention is All You Need was only 6 years. And then Scaling Laws was about 3-4 years after that. It might seem like not much progress is being made but I think that's in part because AI labs are extremely secretive now when ideas are worth billions (and in the right hands, potentially more). Of course 5-10 years is a long time to bang our heads against the wall with untenable costs but I don't know if we can solve our way out of that problem. |
| |
|
| ▲ | an hour ago | parent | prev | next [-] |
| [deleted] |
|
| ▲ | embedding-shape 2 hours ago | parent | prev | next [-] |
| > I’m not even sure whether this is possible. Based on what's happened so far, maybe. At least that's exactly how we got to the current iteration back in 2022/2023: quite literally "let's see what happens when we throw an enormous amount of data at them while training" worked up to a point, and then post-training seems to have taken over as the place where the labs currently differ. |
| |
| ▲ | drob518 an hour ago | parent [-] | | Right, but we played the scaling card and it worked, but it is now reaching its limits. What is the next card? You can surely argue that we can find a new one at any time. That’s the definition of a breakthrough. I just don’t see one at the moment. | | |
| ▲ | embedding-shape an hour ago | parent | next [-] | | > I just don’t see one at the moment. Did you see the one before the current one was even found? Things tend to look easy in hindsight, and borderline impossible trying to look forward. Otherwise it sounds like you're in the same spot as before :) | | |
| ▲ | drob518 32 minutes ago | parent [-] | | That’s what I said. Breakthroughs happen. No doubt about it, and they are unpredictable. Hence a breakthrough. But right now we’re using up runway with nothing yet identified to take us to the next level. And while sometimes breakthroughs happen, sometimes they don’t. |
| |
| ▲ | functional_dev 32 minutes ago | parent | prev [-] | | better tooling and integration |
|
|
|
| ▲ | htrp 2 hours ago | parent | prev | next [-] |
| We pay people to create more high quality tokens (mercor, turing) which are then fed into data generating processes (synthetic data) to create even more tokens to train on |
| |
| ▲ | drob518 an hour ago | parent [-] | | But does that really help, or do you get distortion? The frequency distribution of human-generated content moves slowly over time as new subjects are discussed. What frequency distribution do those “data generating processes” use? And at root, aren’t those “data generating processes” basically just another LLM (i.e., generating tokens according to a probability distribution)? Thus, aren’t we just sort of feeding AI slop into the next training run and humoring ourselves by renaming the slop as “synthetic data?” Not trying to be argumentative. I’m far from being an AI expert, so maybe I’m missing it. Feel free to explain why I’m wrong. |
|
|
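The distortion worry above can be made concrete with a toy simulation. This is a deliberately simplified sketch, not a claim about real training pipelines: a "model" here is just a token frequency table, "training" is counting, and "generation" is sampling from the learned distribution. The point it illustrates is that tokens lost to sampling noise can never reappear, so diversity only ratchets downward as each generation trains on the previous generation's output.

```python
import collections
import random

def train_on(corpus):
    # "Training" in this toy: estimate a token distribution from the corpus.
    counts = collections.Counter(corpus)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def generate(dist, n, rng):
    # "Generation": sample n tokens from the learned distribution.
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n)

rng = random.Random(0)
# Stand-in for a human-written corpus: 50 tokens drawn from 8 types.
corpus = [rng.choice("abcdefgh") for _ in range(50)]

for generation in range(100):
    model = train_on(corpus)            # train on the current corpus
    corpus = generate(model, 50, rng)   # next corpus is pure model output

# A token type that drops out of one generation's sample is gone forever;
# the surviving distribution can only shrink, never recover.
print(sorted(train_on(corpus)))
```

Whether this toy says anything about full-scale pipelines is debatable (real synthetic-data setups filter, reweight, and mix in fresh human data precisely to fight this), but it is the simplest version of the "slop feeding the next run" loop the comment describes.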
| ▲ | krainboltgreene an hour ago | parent | prev [-] |
| > The current corpus used for training includes virtually all known material. This is just totally incorrect. It's one of those things everyone just assumes, but there's an immense amount of known material that isn't even digitized, much less in the hands of tech companies. |
| |
| ▲ | drob518 an hour ago | parent [-] | | What large caches of undigitized content exist? Surely, not everything has been digitized, but I can’t imagine it’s much in percentage terms. | | |
| ▲ | cgh an hour ago | parent [-] | | The Vatican Library contains roughly 1.1 million printed books and around 75,000 codices, only a small percentage of which have been digitised. | | |
| ▲ | drob518 an hour ago | parent [-] | | Which is what percent of the world’s content? 0.000000001% or something similar. It’s nothing in the scheme of things. To put it another way, if we were to digitize that content and train on it, our AIs would not get noticeably better in any way. It doesn’t move the needle. |
|
|
|