reconnecting 3 hours ago

LLM pre-training risks becoming impossible to update with data from after 2025, as much of that data is contaminated with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what to include.

Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.

Pikamander2 40 minutes ago | parent | next [-]

But ChatGPT has been popular since early 2023, and even before it there was no shortage of low-quality content on the web.

If anything, this model being trained up to 2025 is a positive sign that the "circular LLM training" problem hasn't (yet) become unmanageable.

The year-long delay is probably just due to how long it takes to test/refine a cutting-edge model. It's surely possible to train one faster, but Google wouldn't want to release a new model unless it's going to top the usual benchmarks.

djeastm 12 minutes ago | parent [-]

Using token usage at places like OpenRouter as a proxy for overall production, we're looking at exponential growth in AI-created content. Weekly token usage there has tripled just in the past 3 months.

neksn 2 hours ago | parent | prev [-]

Considering all models can use search engines, is this really relevant?

reconnecting an hour ago | parent [-]

Until they prefer not to search. Let me explain using the example of the open-source security framework (1) our team is working on.

If you ask Gemini what you should use to integrate fraud prevention or account takeover protection into your product, there will be no mention of our open-source project. Five years in development, 1.3k stars, over 140 pull requests — all this isn't enough to make it into the training data. From this perspective, any technology that emerges after 2024 is simply invisible to LLMs.

The answer is: without being in the training data, LLMs basically don't understand what they're searching for.

1. https://github.com/tirrenotechnologies/tirreno

ordersofmag 15 minutes ago | parent [-]

I just put the terribly generic query "what tools would you recommend to integrate fraud prevention or account takeover protection into my product" into both Claude (Sonnet) and Gemini (3.1 Pro) via the standard web interface and both took the first step of searching the web. That's consistent with my past experience -- the usual harnesses typically will search the web in cases where I might expect/want them to. Now whether your product has good web visibility or not in those searches, and how the LLMs weigh the relative merits of open-source tools versus commercial offerings in deciding what to highlight in their responses, is a different issue. As is the change in what constitutes effective SEO in an era where bots, rather than human eyes, are the proximal important target. But I don't think the core issue with folks finding your products is the move away from user-driven search toward using models with out-of-date training cutoffs.

FWIW, while neither model included your product in its initial response, when I followed up with "what about open-source" both did another search and Claude's response included your tool....