TeMPOraL 4 hours ago

There isn't. There never was one, because the vast majority of websites are actually selfish with respect to data, even when that's entirely pointless. You can see this even here, with how some people complain LLMs made them stop writing their blogs: turns out plenty of people say they write for others to read, but they care more about tracking and controlling the audience.

Anyway, all that means there was never a critical mass of sites large enough for a default bulk data dump discovery mechanism to become established. This means even the most well-intentioned scrapers cannot reliably determine whether such a mechanism exists, and have to scrape per page anyway.

VonGallifrey 3 hours ago

> turns out plenty of people say they write for others to read

LLMs are not people. People don't write blogs so that a company can profit from their writing by training LLMs on it. They write for others to read their ideas.

TeMPOraL 3 hours ago

LLMs aren't making their owners money by just idling on datacenters' worth of GPUs. They're making money by being useful to users who pay for access. The knowledge and insights from writings that go into training data all end up being read by people directly, and they also inform even more useful output and work benefiting even more people.

philipwhiuk an hour ago

And rarely cite their sources, thus affording the author not so much as a crumb of benefit in kind.