Freak_NL 2 hours ago

How do you think those models get trained? You can only get so far with Wikipedia, Reddit, and non-fiction works like books and academic papers.

tossandthrow 2 hours ago | parent | next [-]

Have a look at this article: https://www.washingtonpost.com/technology/interactive/2023/a...

The NY Times is 0.06% of Common Crawl.

These news media outlets provide a drop in the ocean's worth of information, both qualitatively and quantitatively.

The news / media industry is really just trying to hold on to their lifeboat before inevitably becoming entirely irrelevant.

(I do find this sad, but it is the reality. I can already get considerably better journalism using LLMs than from actual journalists, for both the clickbait stuff and the high-quality stuff.)

Gigachad a minute ago | parent | next [-]

90% of Common Crawl is complete junk, while the tiny fraction that is news articles powers almost all the AI answers in Google search.

pimlottc 2 hours ago | parent | prev [-]

That seems like a reductive way to consider it. What percent of music was created by Led Zeppelin? What percent of art was painted by Monet? What percent of films by Alfred Hitchcock? It may be a small percentage objectively but they are hugely influential.

tossandthrow an hour ago | parent [-]

I don't think backpropagation cares whose text it is backpropagating.

NiloCK 10 minutes ago | parent [-]

The data sets aren't naively fed into the training runs.

Instead, training attempts to sample more heavily from higher quality sources, with, I'm sure, a mix of manual and heuristic labeling.
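A minimal sketch of what source-weighted sampling could look like, assuming a per-source quality weight assigned by manual or heuristic labeling (the source names and numbers below are purely illustrative, not any lab's actual mixture):

```python
import random

# Hypothetical per-source sampling weights: quality labels upweight
# curated sources relative to raw web crawl. Illustrative values only.
source_weights = {
    "common_crawl": 0.3,
    "curated_news": 2.0,
    "books": 1.5,
    "wikipedia": 1.2,
}

def sample_source(weights, rng=random):
    """Pick a data source with probability proportional to its weight."""
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

# Over many draws, "curated_news" is sampled far more often than its
# raw share of the corpus would suggest.
counts = {s: 0 for s in source_weights}
for _ in range(10_000):
    counts[sample_source(source_weights)] += 1
```

The point is that a source's share of the training mixture is a tunable knob, so being 0.06% of the raw crawl says little about a source's actual influence on the trained model.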

RugnirViking 2 hours ago | parent | prev [-]

How does the entire textual corpus of, say, the New York Times compare to all novels? Each article is a page of text, maybe two at most. There are certainly an awful lot of articles, but it's hard to imagine it amounts to much more than a couple hundred novels. There must be thousands of novels released each year.

Freak_NL 2 hours ago | parent [-]

Like apples to oranges.

LLMs are (apparently) massively used to get information about topics in the real world. Novels aren't going to be much help there. Journalism, particularly in written form, provides a fount of facts presented from different angles, as well as opinions, and it was all there free for the taking…

Wikipedia provides the scantest summary of that, fora and social media give you banter, fake news, summaries of news, and a whole lot of shaky opinions, at best. Novels give you the foundations of language, but in terms of knowledge nothing much beyond what the novel is about.

olalonde 2 hours ago | parent [-]

LLMs can get up-to-date information from primary sources - no journalists required.

ajam1507 36 minutes ago | parent | next [-]

The primary source for most news is journalism.

NiloCK 5 minutes ago | parent [-]

In context, primary source means the subject of the article (the thing the journalist is writing about).

Journalism is by definition a secondary source. (Notwithstanding edge cases like articles reporting directly on the news industry itself.)

PopAlongKid 2 hours ago | parent | prev | next [-]

I don't understand how LLMs can ask questions at a press conference.

olalonde 31 minutes ago | parent [-]

Startup idea right there.

none2585 2 hours ago | parent | prev | next [-]

I don't think an LLM can have secret human sources that provide them with confidential information anonymously. Not all news shows up on Twitter.

freedomben 2 hours ago | parent | prev [-]

Primary sources can be, and often are, very biased. Journalists are (supposed to be) doing fact checks and gathering multiple sources from all sides. Modern journalism is in a terrible state, but it is still important.

Imagine if all info about Facebook came from Facebook...