hintymad 6 hours ago

Wouldn't this be worrisome? People used StackOverflow and generated new knowledge along the way. Without such a medium for discussion, how can we feed models with up-to-date, quality knowledge?

crazygringo 6 hours ago | parent | next [-]

Plenty of documentation, and plenty of code that the AI can read itself.

E.g. if a library has a bug that has a common workaround, it can learn that from open source code using the library that uses the workaround.

hintymad 5 hours ago | parent | next [-]

This and the other thread that talks about RL and synthetic data seem to suggest that AI can figure out all the technical issues without humans looking into them. I'm not sure if that's true at all.

nitwit005 3 hours ago | parent | prev | next [-]

That assumes there is documentation or examples. A big reason Stack Overflow took off was people struggling with things like the Android API documentation.

Some of those discussions made people go figure out how to do it, and then post it as an answer. The knowledge didn't exist anywhere until they did.

crazygringo 15 minutes ago | parent | next [-]

When I talk about code it can learn from, I'm talking about GitHub etc.

Even if stuff isn't in the official documentation, eventually there are projects that use it.

And if the library in question is open-source, then the LLMs can just ingest and read that directly.

ToValueFunfetti 3 hours ago | parent | prev [-]

It might make sense for AI companies to throw agents at new technologies to trial-and-error their way to internal documentation which they then provide to their models. On the other hand, the people making tomorrow's APIs have LLMs too and that makes documentation ~free. Hallucinations could still bring you back to the first hand, though.

kajman 3 hours ago | parent | prev [-]

The only way I could see this being surfaced the same is if the code essentially had a SO answer written into the doc comment.
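A minimal sketch of what "an SO answer written into the doc comment" could look like. The helper name `parse_utc_timestamp` is made up for illustration; the workaround it documents (older Python versions, before 3.11, rejecting a trailing "Z" in `datetime.fromisoformat`) is the kind of thing that usually lives in a Stack Overflow answer rather than in the code itself.

```python
from datetime import datetime, timezone


def parse_utc_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp that may end in "Z".

    Workaround: before Python 3.11, datetime.fromisoformat() rejects a
    trailing "Z", so it is rewritten to the equivalent "+00:00" offset
    first. On newer versions the rewrite is harmless.
    """
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


if __name__ == "__main__":
    print(parse_utc_timestamp("2024-01-01T12:00:00Z").isoformat())
```

A model reading this code would see both the workaround and the reason for it, which is the part that plain usage examples tend to leave out.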

mcswell an hour ago | parent [-]

What documentation?

vanuatu 6 hours ago | parent | prev | next [-]

I don't think it's much of an issue

- RL envs + synthetic data + human-annotated data

- Usage data from Codex/Claude Code/Cursor

Most of the model abilities in coding come from post-training, not pretraining

torben-friis 6 hours ago | parent [-]

A better question is what's left for those who don't have access to that. We went from publicly available to vacuumed from private users

vanuatu 6 hours ago | parent [-]

Open source models

Unfortunately, all the incentives right now are for repos to be private.

hungryhobbit an hour ago | parent [-]

Open source models are for rich people: only they can afford the hardware needed to run them.

Jyaif 6 hours ago | parent | prev | next [-]

We unironically need a StackOverflow for LLMs.

LLMs would post solutions to the issues that they've discovered after doing a lot of research.

Unfortunately the LLMs are concentrated into a few providers (OpenAI, Anthropic, Google), so there's a chance they each end up doing their own private (and closed) StackOverflows. By leveraging their private StackOverflows, their LLMs will be able to short-circuit complex reasoning, saving tokens, time, and money.
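A rough sketch of the short-circuit idea, assuming solutions are keyed by a normalized error message. Everything here (`SolutionStore`, `solve`, the normalization rules) is hypothetical and only meant to show the shape of "look it up before reasoning from scratch"; it is not how any provider actually does it.

```python
import re


class SolutionStore:
    """Hypothetical shared store of fixes, keyed by a normalized error signature."""

    def __init__(self):
        self._solutions = {}

    @staticmethod
    def signature(error_message):
        # Strip volatile details (addresses, numbers) so similar errors share a key.
        text = error_message.lower()
        text = re.sub(r"0x[0-9a-f]+", "<addr>", text)
        text = re.sub(r"\d+", "<n>", text)
        return " ".join(text.split())

    def post(self, error_message, solution):
        self._solutions[self.signature(error_message)] = solution

    def lookup(self, error_message):
        return self._solutions.get(self.signature(error_message))


def solve(error_message, store, research):
    """Return a stored solution if one exists, else run the expensive research step."""
    cached = store.lookup(error_message)
    if cached is not None:
        return cached  # short-circuit: no new reasoning needed
    solution = research(error_message)
    store.post(error_message, solution)  # share the result for the next caller
    return solution
```

The second time a similar error shows up, `research` is never called, which is where the token, time, and money savings would come from.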

nikole9696 3 hours ago | parent | next [-]

This actually reminds me of the MCP concept. Similar?

JadeNB 3 hours ago | parent | prev [-]

> LLMs would post solutions to the issues that they've discovered after doing a lot of research.

How do you envision the correctness of these solutions being judged? If by other LLMs, then we run into a problem of infinite descent. If by humans, then you'd need some way to motivate expert or semi-expert humans (so that their ratings are themselves correct) to participate in a massive project of evaluating the correctness of a constant stream of content from content-generators that never sleep.

Jyaif an hour ago | parent [-]

> How do you envision the correctness of these solutions being judged?

By LLMs. I think it's possible for agents to infer whether the user was satisfied or not, at least with my usage pattern. For example, if I end the discussion, it's a good sign. If I ask follow-up questions that look like workarounds, it's a bad sign :-)

You could also simply prompt the users whether they were satisfied with the answer they received, possibly incentivizing them with StackOverflow-style gamification.
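A toy sketch of those two signals, assuming a keyword check stands in for whatever the agent would really use to spot workaround-hunting. The names (`Conversation`, `infer_satisfaction`, `WORKAROUND_HINTS`) and the hint phrases are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical phrases suggesting the user is still hunting for a workaround.
WORKAROUND_HINTS = ("still", "doesn't work", "does not work", "workaround", "another way")


@dataclass
class Conversation:
    user_followups: List[str]               # messages the user sent after the answer
    explicit_rating: Optional[bool] = None  # reply to a "were you satisfied?" prompt, if any


def infer_satisfaction(conv: Conversation) -> bool:
    """Guess whether the user was satisfied with the answer."""
    # An explicit rating from the user beats any inference.
    if conv.explicit_rating is not None:
        return conv.explicit_rating
    # Ending the discussion right after the answer is treated as a good sign.
    if not conv.user_followups:
        return True
    # Follow-ups that look like workaround-hunting are treated as a bad sign.
    return not any(
        hint in message.lower()
        for message in conv.user_followups
        for hint in WORKAROUND_HINTS
    )
```

For example, `infer_satisfaction(Conversation(["It still fails, is there a workaround?"]))` comes out as unsatisfied, while an empty follow-up list comes out as satisfied.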

stackghost 2 hours ago | parent | prev | next [-]

I'm sure the AI companies will continue to pirate textbooks and papers, like always.

jmyeet 3 hours ago | parent | prev | next [-]

Yeah, this is something I've been thinking about too. LLMs have basically profited from "stealing" (arguably) user-generated content from a time when there were no LLMs. In the LLM era there won't be a new Stack Overflow to train LLMs on going forward.

We're getting closer to Dead Internet Theory too where a lot of accounts, particularly on Twitter, are just LLMs. I imagine it's a huge problem on Reddit too. Just people farming karma or otherwise involved in influence campaigns or simply grifting to ad revenue.

So we're going to get to a point where the corpus we train LLMs on will itself just be filled with LLM slop. Self-reinforcing slop. Is that the future?

aucisson_masque 2 hours ago | parent | next [-]

It's been studied: LLMs that feed on their own data regress, and they become very bad after a few generations.
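A toy analogy of that feedback loop, not an LLM experiment: a Gaussian is refit, generation after generation, to a small sample drawn from the previous fit. The function name, sample size, and generation count are arbitrary choices for illustration. With small samples the fitted spread tends to shrink toward zero, i.e. the "model" forgets the variety in the original data.

```python
import random
import statistics


def simulate_self_training(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to samples drawn from the previous generation's fit."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: the "real" data distribution
    history = [std]
    for _ in range(generations):
        # Each generation only ever sees data produced by the previous model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_self_training()
    print(f"std at generation 0: {history[0]:.3f}")
    print(f"std at generation {len(history) - 1}: {history[-1]:.3g}")
```

Real training pipelines are far more complicated, so this only shows why resampling your own output loses information, not how fast an actual model would degrade.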

mattmanser 3 hours ago | parent | prev [-]

It's happening here too. I saw dang hint that they're not even responding to a lot of questions about it anymore because of the volume of the problem.

If you browse with showdead on you'll be seeing a lot more of what look like reasonable comments greyed out.

add-sub-mul-div 6 hours ago | parent | prev | next [-]

Careful, you can't point out that the AI emperor has no clothes or you'll get called a Luddite.

piker 6 hours ago | parent | prev | next [-]

Yes. Very.

nsxwolf 6 hours ago | parent | prev | next [-]

How do you convince people to not want an instant answer? Even if SO didn’t result in so many “What have you tried?” responses and immediate closures, most people would still prefer instant feedback.

akkad33 6 hours ago | parent | prev [-]

Pointing them to docs? Which is what Stack Overflow answers did anyway?

mlinhares 6 hours ago | parent | next [-]

I wrote multiple answers to questions that weren't just "point to docs". And even when it is pointing to docs you are providing the reasoning as to why it works one way or another.

izacus 6 hours ago | parent | prev [-]

What docs? Who writes docs now that AIs answer everything?

Fabricio20 6 hours ago | parent | next [-]

Ever since the AI stuff started rolling around in coding, I've seen MORE documentation. There's a big incentive to properly document your API endpoints so LLMs can figure them out from specs, and even when they're not documented, the LLMs can also just read the code and figure it out directly (for libraries and similar). And at least in my experience, they tend to document or write it down for future sessions too!

ethagnawl 6 hours ago | parent | prev | next [-]

I know you're being facetious but there may well be docs. It's just that the same AI most likely wrote _them_, too.

Did anyone (person or competing LLM) bother to verify that they're correct, though? Who knows! Let the next generation of models worry about that.

Morromist 6 hours ago | parent | prev | next [-]

I've heard this is most of some CS jobs now. Just writing documentation for AI.

vanuatu 6 hours ago | parent | prev [-]

On the contrary, there's more of an incentive for APIs to have docs for agent discovery. The docs / interfaces themselves can be auto-generated (Stainless / Mintlify).