tensegrist 4 hours ago

the more time passes, the more i'm convinced that the solution is to somehow force everyone to go through something like common crawl

i don't want people's servers to be pegged at 100% because a stupid dfs scraper is exhaustively traversing their search facets, but i also want the web to remain scrapable by ordinary people, or rather go back to how readily scrapable it used to be before the invention of cloudflare

as a middle ground, perhaps we could agree on a new /.well-known/ path meant to contain links to timestamped data dumps?
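
something like this, maybe: a client checks the manifest first and only falls back to crawling if it's missing (the path and field names here are made up, just to show the shape):

    # Sketch of a client that prefers a (hypothetical) /.well-known/data-dumps.json
    # manifest over crawling. The path and the manifest shape are invented here.
    import json
    import urllib.request

    def find_dumps(origin):
        url = origin.rstrip("/") + "/.well-known/data-dumps.json"
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                manifest = json.load(resp)
        except OSError:
            return None  # no manifest published; caller decides whether to crawl politely
        # assumed shape: {"dumps": [{"url": ..., "timestamp": ..., "format": ...}, ...]}
        return sorted(manifest.get("dumps", []), key=lambda d: d["timestamp"], reverse=True)

    dumps = find_dumps("https://example.com")
    if dumps:
        print("latest dump:", dumps[0]["url"])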

nostrademons 4 hours ago | parent | next [-]

That's sorta what MetaBrainz did - they offer their whole DB as a single tarball dump, much like Wikipedia does. Downloading it took on the order of an hour; if I need a MusicBrainz lookup, I just do a local query.

For this strategy to work, people need to actually use the DB dumps instead of just defaulting to scraping. Unfortunately scraping is trivially easy, particularly now that AI code assistants can write a working scraper in ~5-10 minutes.
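
For what it's worth, once the dump is restored into Postgres the local lookup really is just a query. A rough sketch, assuming the standard musicbrainz schema; connection details will differ per setup:

    # Rough sketch of a local lookup against a restored MusicBrainz dump.
    # Assumes the dump has been loaded into Postgres under the usual
    # "musicbrainz" schema; connection details will vary per setup.
    import psycopg2

    conn = psycopg2.connect(dbname="musicbrainz_db", user="musicbrainz", host="localhost")

    def lookup_artist(name):
        with conn.cursor() as cur:
            cur.execute(
                "SELECT gid, name, sort_name FROM musicbrainz.artist "
                "WHERE name ILIKE %s LIMIT 10",
                (name,),
            )
            return cur.fetchall()

    for gid, name, sort_name in lookup_artist("Radiohead"):
        print(gid, name, sort_name)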

8note 10 minutes ago | parent | next [-]

the obvious thing would be to take down their website and only have the DB dump.

if that's the useful thing, it doesn't need the wrapper

tonyhart7 2 hours ago | parent | prev [-]

I mean, this AI data scraper would need to scan and fetch billions of websites.

why would they even care about one single website??? You expect an institution to care about one site out of the billions of websites they must scrape daily?

what an hour ago | parent [-]

This is probably the reason. It’s more effort to special case every site that offers dumps than to just unleash your generic scraper on it.

tpmoney 4 hours ago | parent | prev | next [-]

I'll propose my pie-in-the-sky plan here again. We should overhaul the copyright system completely in light of AI and make it mostly win-win for everyone. This is predicated on the idea that the MNIST digit set is sort of the "hello world" dataset for people wanting to learn machine vision, and that having a common data set like that is really handy. Numbers made up off the top of my head / subject to tuning, but the basic idea is this:

1) Cut copyright to 15-20 years by default. You can have 1 extension of an additional 10-15 years if you submit your work to the "National Data Set" within say 2-3 years of the initial publication.

2) Content in the National set is well categorized and cleaned up. It's the cleanest data set anyone could want. The data set is used both to train some public models and also licensed out to people wanting to train their own models. Both the public models and the data sets are licensed for nominal fees.

3) People who use the public models or data sets as part of their AI system are granted immunity from copyright violation claims for content generated by these models, modulo some exceptions for knowing and intentional violations (e.g. generating the contents of a book into an epub). People who choose to scrape their own data are subject to the current state of the law with regards to both scraping and use (so you probably better be buying a lot of books).

4) The license fees generated from licensing the data and the models would be split into royalty payments to people whose works are in the dataset, and are still under copyright protection, proportional to the amount of data submitted and inversely proportional to the age of that data. There would be some absolute caps in place to prevent slamming the national data sets with junk data just to pump the numbers.

Everyone gets something out of this. AI folks get clean data that they didn't have to burn a lot of resources scraping. Copyright holders get paid for their works used by AI and retain most of the protections they have today (just for a shorter time). The public gets usable AI tooling without everyone spending their own resources on building their own data sets, and site owners and the like get reduced bot/scraping traffic. It's not perfect, and I'm sure the devil is in the details, but that's the nature of this sort of thing.
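
To make the royalty split in point 4 concrete, it could be something as simple as weighting each holder by submitted volume over age, with a per-holder cap. A toy sketch with made-up numbers:

    # Illustrative only: one possible royalty split for point 4, with made-up
    # numbers. Each holder's weight is volume / age, then shares are capped so
    # nobody can dominate the pool by dumping junk data.
    pool = 1_000_000.00   # licensing fees collected this period (invented)
    cap_share = 0.05      # no single holder takes more than 5% (invented)

    holders = {           # name: (megabytes submitted, years since publication)
        "press_archive": (50_000, 3),
        "indie_label":   (2_000, 1),
        "novelist":      (5, 2),
    }

    raw = {name: size / age for name, (size, age) in holders.items()}
    total = sum(raw.values())
    shares = {name: min(weight / total, cap_share) for name, weight in raw.items()}

    for name, share in shares.items():
        print(f"{name}: {share:.2%} of the pool -> ${pool * share:,.2f}")
    # whatever the caps claw back would need its own rule (e.g. roll into the next period)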

mschuster91 4 hours ago | parent [-]

> Cut copyright to 15-20 years by default.

This alone will kill off all chances of that ever passing.

Like, I fully agree with your proposal... but I don't think it's feasible. There are a lot of media IPs/franchises that are very, very old but still generate insane amounts of money to this day and are still in active development. Star Wars and Star Trek obviously, but also stuff like the MCU or Avatar, which are well on their way to two decades of runtime (Iron Man 1 was released in 2008), or Harry Potter, which is almost 30 years old. That's tens of billions of dollars in cumulative income, and most of it is owned by Disney.

Look at what it took to finally get even the earliest Disney movies into the public domain; that was stuff from before World War 2, and it was still that bitterly fought over.

In order to reform copyright... we first have to use anti-trust to break up the large media conglomerates. And it's not just Disney either. Warner, Sony, Comcast and Paramount also hold ridiculous amounts of IP, Amazon entered the fray as well by acquiring MGM (mostly famous for James Bond), and Lionsgate holds the rights to a bunch of smaller but still well-known IPs (Twilight, Hunger Games).

And that's just the movie stuff. Music is just as bad, although there, at least, thanks to radio stations being a thing, there are licensing agreements and established traditions for remixes, covers, tribute bands and other forms of IP re-use by third parties.

Imustaskforhelp 4 hours ago | parent | prev | next [-]

If someone wants to scrape, not the whole internet the way Google does, but at a niche level (like a forum you wish to scrape):

I like to create tampermonkey scripts for this. They are, imo, a more lightweight/easier way to build what would otherwise be extensions.

Now, I don't like AI, but I don't know anything about scraping, so I used AI to generate the scraping code, pasted it into tampermonkey and let it run.

I recently used this to scrape a website that had a list of VPS servers and their prices, and built myself a list of those to analyze, as an example.

I also have to say that I usually try to look for databases first; on a similar website I contacted them about their db but got no response, and their db of server prices was private and only showed the lowest price.

So I picked the other website and did this. I also scraped every headline lowendtalk has ever had, with their links, partly for archival purposes and partly to parse the headlines with an LLM to find a list of VPS providers.
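
For comparison, the same kind of scrape outside the browser is only a few lines of Python with requests and BeautifulSoup; the URL and CSS selectors below are placeholders, since every site's markup differs (and its robots.txt/terms are worth checking):

    # Standalone sketch of the same kind of scrape, outside the browser.
    # The URL and CSS selectors are placeholders for whatever the real site uses.
    import csv
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/vps-offers", timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    rows = [("provider", "price")]
    for offer in soup.select(".offer-row"):                    # placeholder selector
        name = offer.select_one(".provider").get_text(strip=True)
        price = offer.select_one(".price").get_text(strip=True)
        rows.append((name, price))

    with open("vps_prices.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)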

crazygringo 4 hours ago | parent | prev | next [-]

Seriously, I can't help but think this has to be part of the answer.

Just something like /llms.txt which contains a list of .txt or .txt.gz files or something?

Because the problem is that every site is going to have its own data dump format, often in complex XML or SQL or something.

LLMs don't need any of that metadata, and many sites might not want to provide it because e.g. Yelp doesn't want competitors scraping its list of restaurants.

But if it's intentionally limited to only paragraph-style text, and stripped entirely of URLs, IDs, addresses, phone numbers, etc. -- so e.g. a Yelp page would literally just be the cuisine category and reviews of each restaurant, no name, no city, no identifier or anything -- then it gives LLMs what they need much faster, the site doesn't need to be hammered, and it's not in a format for competitors to easily copy your content.

At most, maybe add markup for <item></item> to represent pages, products, restaurants, whatever the "main noun" is, and recursive <subitem></subitem> to represent e.g. reviews on a restaurant, comments on a review, comments one level deeper on a comment, etc. Maybe a couple more like <title> and <author>, but otherwise just pure text. As simple as possible.
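
The generator side could be equally tiny. A sketch of one possible shape, using the <item>/<subitem>/<title>/<author> tags above; the input records are invented stand-ins for whatever a site already stores:

    # Sketch of emitting the stripped-down text dump described above. Tag names
    # follow the <item>/<subitem>/<title>/<author> idea; the records here are
    # invented. A real generator would also strip URLs, IDs, phone numbers, etc.
    from html import escape

    def emit(record, depth=0):
        tag = "item" if depth == 0 else "subitem"
        pad = "  " * depth
        out = [f"{pad}<{tag}>"]
        if record.get("title"):
            out.append(f"{pad}  <title>{escape(record['title'])}</title>")
        if record.get("author"):
            out.append(f"{pad}  <author>{escape(record['author'])}</author>")
        out.append(f"{pad}  {escape(record['text'])}")
        for child in record.get("children", []):   # reviews, comments, deeper replies
            out.extend(emit(child, depth + 1))
        out.append(f"{pad}</{tag}>")
        return out

    page = {"title": "Thai", "text": "Great noodles, slow service.",
            "children": [{"author": "anon", "text": "Agreed, the noodles were great."}]}
    print("\n".join(emit(page)))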

The biggest problem is that a lot of sites will create a "dummy" llms.txt without most of the content because they don't care, so the scrapers will scrape anyways...

themafia 4 hours ago | parent | prev | next [-]

It's not a technical problem you are facing.

It's a monetary one: specifically, large pools of sequestered wealth making extremely bad long-term investments, all in a single dubious technical area.

Any new phenomenon driven by this process will have the same deleterious results on the rest of computing. There is a market value in ruining your website that's too high for the fruit grabbers to ignore.

In time adaptations will arise. The apparently desired technical future is not inevitable.

fartfeatures 4 hours ago | parent | prev | next [-]

Good idea, and perhaps a standard that means we only have to grab deltas, or some sort of etag-based "give me all the database dumps after the one I have" (or tell me if anything has changed).
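
Plain HTTP already covers the "has anything changed" half: publish the dump index at a stable URL and let clients do a conditional GET. A sketch; the manifest URL is hypothetical, but ETag/If-None-Match are standard:

    # Sketch of "only fetch if it changed" with standard HTTP caching headers:
    # send the ETag saved from last time and skip the download on 304 Not Modified.
    import requests

    MANIFEST = "https://example.com/dumps/manifest.json"   # hypothetical URL

    def fetch_if_changed(previous_etag=None):
        headers = {"If-None-Match": previous_etag} if previous_etag else {}
        resp = requests.get(MANIFEST, headers=headers, timeout=30)
        if resp.status_code == 304:
            return None, previous_etag      # nothing new since last time
        resp.raise_for_status()
        return resp.json(), resp.headers.get("ETag")

    body, etag = fetch_if_changed()         # first run: full fetch
    body, etag = fetch_if_changed(etag)     # later runs: 304 unless something changed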

nikanj 4 hours ago | parent | prev [-]

And then YC funds a startup that plans to leapfrog the competition by doing their own scrape instead of using the standard data everyone else has.

4 hours ago | parent [-]
[deleted]