Yeah, the internet seems like a big poison pill. Training on the whole internet feels like citing the National Enquirer (or the Daily Mail?) for a school essay.

Having an archive of "curated" training data seems like it is going to be important. Otherwise you need "AS" (artificial skepticism) introduced into future models. ("But I read it on the internet!", ha ha.)

Or perhaps there are ways to bucket training data such that the model is aware of which data leans factual (quantifiable) and which data leans opinion (fuzzy, qualifiable?).
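A minimal sketch of what that bucketing might look like, assuming some upstream curation step has already labeled each document. The `Document` type, the bucket names, and the `<source_type=...>` prefix are all hypothetical, made up for illustration; this is not how any particular lab tags its data.

```python
from dataclasses import dataclass

# Hypothetical bucket labels; a real pipeline would need its own taxonomy.
FACTUAL = "factual"
OPINION = "opinion"


@dataclass
class Document:
    text: str
    bucket: str  # FACTUAL or OPINION, assumed to come from a curation step


def tag_for_training(doc: Document) -> str:
    """Prefix the text with its bucket so the label travels with the data."""
    return f"<source_type={doc.bucket}>\n{doc.text}"


docs = [
    Document("Water boils at 100 degrees Celsius at sea-level pressure.", FACTUAL),
    Document("This is the best phone ever made.", OPINION),
]

# Each training example now carries an explicit factual/opinion marker.
tagged = [tag_for_training(d) for d in docs]
```

The hard part, of course, is the labeling itself, which this sketch simply assumes away.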

(I recently asked Claude about the existence of ball lightning and spontaneous human combustion, and the replies ultimately did not leave me satisfied. It's probably just as well that I read this article, though: I now have an even stronger degree of skepticism about such replies, specifically, I suppose, on topics that are likely to be biased.)

(I'm not quite convinced by the article, though, that Google is "fighting back". In fact, this feels like another moment where a "player" could try to establish their LLM as more factual. Is that the row Grok is trying to hoe? Or is Grok just trying to be anti-woke?)

dijksterhuis 6 hours ago | parent | next [-]

> Having an archive of "curated" training data seems like it is going to be important

the justification for not doing that is probably "prohibitively expensive given the amount of data involved". they'd need a bunch of human reviewers combing through massive troves of data. it's probably cheaper to "sort of fix" it after the fact.

> perhaps there are ways to bucket training data such that the model is aware of which data leans factual (quantifiable) and which data leans opinion (fuzzy, qualifiable)

as a lecturer once said to me about my idea for a masters dissertation project that would classify news sites based on right/left tendencies -- "that sounds dangerously political". especially given the current let's all shout at each other political climate.

aside: someone built this and it was a fully fledged company, which has always annoyed me.

JKCalhoun 5 hours ago | parent [-]

"…they'd need a bunch of human reviewers combing through massive troves of data…"

Yeah, I concede that. It doesn't need to be done overnight. But you could keep a static repo of data that you work through over time (years), removing some data and adding pre-curated data. After enough years you could have a pretty good "reference dataset".

	gowld 5 hours ago | parent [-]
		I think some of the thousands of people working on training LLMs have tried some of the low-hanging-fruit ideas we can brainstorm off the top of our heads 5 years later.

ajross 6 hours ago | parent | prev [-]

> Training on the whole internet feels like citing the National Enquirer

It's not, though, because the refutations are in the training data too. This isn't actually the problem being described.

The weights in the LLM are fine. It's that the task the LLM is being asked to do is to search and summarize new content that isn't in its training data. And it does it too much like a naive reader and not enough like a cynical HN commenter.

But that's a problem with prompt writing, not training. It's also of a piece with most of the other complaints about current AI solutions, really: AI still lacks the "context" that an experienced human is going to apply, so it doesn't know when it's supposed to reason and when it's supposed to repeat.

If you were to ask it "Is this site correct or is it just spin?" it would probably get it right. But it doesn't know to ask itself that question if it's not in the prompt somewhere.
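To make that concrete, here is a rough sketch of putting the question into the prompt rather than hoping the model asks it on its own. The function name, the `source_url` parameter, and the prompt layout are hypothetical; no particular vendor's API or template is implied.

```python
# The question from the comment above, injected explicitly into the prompt.
SKEPTIC_QUESTION = "Is this site correct or is it just spin?"


def build_summary_prompt(page_text: str, source_url: str) -> str:
    """Wrap fetched content so the model evaluates it before summarizing it."""
    return (
        f"You retrieved the following content from {source_url}.\n"
        f"Before summarizing, answer this question: {SKEPTIC_QUESTION}\n"
        "Then summarize only the claims you judge to be well supported, "
        "and flag the rest as unverified.\n\n"
        f"--- BEGIN CONTENT ---\n{page_text}\n--- END CONTENT ---"
    )


prompt = build_summary_prompt(
    page_text="Our product is the top-ranked tool in its category.",
    source_url="https://example.com/blog",
)
```

Whether a wrapper like this is enough is an open question, but it shows where the missing "context" would have to come from.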

JKCalhoun 5 hours ago | parent [-]

"…the LLM is being asked to do is to search and summarize new content that isn't in its training data…"

If it fails at that, then it is a pretty significant problem. If, as you said earlier, "the refutations are in the training data too", then the LLM should in fact be able to use "both sides" and land with a little more confidence when presented with new data.

(Hopefully your point regarding prompting issues is resolved then.)

	ajross 4 hours ago | parent [-]
		Well, yeah, "should be" and "does" are different and this is new technology and has bugs and misfeatures and different limitations than what came before, and the market will have a learning curve as we all adapt. I was just refuting your contention that this is somehow inherent in the idea of "training", and it's not.