Imnimo 4 hours ago

It's interesting to me that OpenAI considers scraping to be a form of abuse.

nikitaga 3 hours ago | parent | next [-]

Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.

It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.

PunchyHamster an hour ago | parent | next [-]

> Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

I bet the people getting fucking DDoSed by AI bots disagree.

Also, the fucking ignorance of assuming it's "static content" and not something that needs code running.

Den_VR 12 minutes ago | parent [-]

I miss the www where the .html was written in vim or notepad.

mmcwilliams a few seconds ago | parent | prev | next [-]

If you assume that the only cost is CPU, maybe, but bandwidth is a completely different issue.

not2b 2 hours ago | parent | prev | next [-]

I understand why OpenAI is trying to reduce its costs, but AI crawlers really are creating very significant load, especially those that ignore robots.txt and hide their identities. This is direct financial damage, and it's particularly hard on nonprofit sites that have been around a long time.
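For what it's worth, honoring robots.txt costs a crawler almost nothing: fetch it once per site and check each URL against it. A minimal sketch with Python's stdlib (the rules below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block "GPTBot" from the whole site, allow everyone else.
# A well-behaved crawler would fetch this from https://example.org/robots.txt.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.org/blog/post"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.org/blog/post"))  # True
```

The crawlers people complain about either skip this check entirely or rotate user-agent strings so the Disallow rules never match.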

stingraycharles 18 minutes ago | parent [-]

Are these the ChatGPT and Claude Desktop crawlers we're talking about, or what is it exactly? Are they really creating significant load while not honoring robots.txt?

Genuinely interested.

cruffle_duffle 4 minutes ago | parent [-]

I bet dollars to doughnuts that 95% of the traffic is from Claude and ChatGPT desktop / mobile and not literal content scraping for training.

heyethan 18 minutes ago | parent | prev | next [-]

I think this also explains why the checks are moving up the stack.

If the real cost is in actually running the app or the model, then just verifying a browser isn’t enough anymore. You need to verify that the expensive part actually happened.

Otherwise you’re basically protecting the cheapest layer while the expensive one is still exposed.

sandeepkd 23 minutes ago | parent | prev | next [-]

Let's not try to qualify the wrongs by picking a metric and evaluating just one side of it. A static website owner could be running on a very small budget, and scraping from bots can bring down their business too. The chances of a static website owner burning through their own life savings are probably higher.

alsetmusic 26 minutes ago | parent | prev | next [-]

Have you not seen the multiple posts that have reached the front page of HN with people taking self-hosted Git repos offline or having their personal blogs hammered to hell? Cause if you haven't, they definitely exist and get voted up by the community.

nozzlegear 24 minutes ago | parent | prev | next [-]

Are they, actually?

bakugo 3 hours ago | parent | prev | next [-]

The cost is so marginal that many, many websites have been forced to add Cloudflare captchas or PoW checks before letting anyone access them, because otherwise the server would slow to a crawl from 1000 scrapers hitting it at once.
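For reference, those PoW checks work roughly like this: the server issues a challenge, and the client must brute-force a nonce whose hash clears a difficulty threshold before it gets the page. Verification is one hash for the server, so the cost lands on the requester. A minimal sketch (challenge format and difficulty are made up, not any specific product's scheme):

```python
import hashlib
import secrets

def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce so sha256(challenge:nonce) starts with
    `difficulty_bits` zero bits. Cheap for one visitor, expensive at scraper scale."""
    target = 1 << (256 - difficulty_bits)  # hash value must fall below this
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash to check, regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

challenge = secrets.token_hex(8)  # issued per-visitor by the server
nonce = solve(challenge, 16)      # ~65k hashes on average for the client
assert verify(challenge, nonce, 16)
```

The asymmetry is the point: a human's browser pays the cost once, while a bot farm hammering every page pays it on every request.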

razingeden 2 hours ago | parent | prev | next [-]

It is direct financial damage if my server's not on an unmetered connection: after years of bills coming in around $3/mo, I got a surprise >$800 bill for a site nobody on earth appears to care about besides AI scrapers.

It hasn't even been updated in years, so hell if I know why it needs to be fetched constantly and aggressively. But fuck every single one of these companies now whining about bots scraping and victimizing them; here's my violin.
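Back-of-the-envelope on what a bill like that implies, assuming a typical metered egress rate of ~$0.09/GB (the exact rate is my guess, not from the bill):

```python
# Hypothetical egress rate; actual pricing varies by provider and tier.
price_per_gb = 0.09   # USD per GB of outbound traffic
bill = 800.0          # USD

gb_served = bill / price_per_gb
print(f"{gb_served:.0f} GB")  # ~8889 GB, i.e. roughly 9 TB of scraper traffic
```

Call it on the order of terabytes pulled from a site nobody human was reading.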

swagmoney1606 an hour ago | parent | prev | next [-]

And yet I have to pay, in my time and cash, to handle the constant DDoSes from the constant LLM scraping.

AtlasBarfed 2 hours ago | parent | prev | next [-]

Because you say it is?

I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.

karlshea 2 hours ago | parent | prev | next [-]

I don’t know what world you live in but it’s not this one.

nslsm 3 hours ago | parent | prev [-]

The issue is that there are so many awful webmasters whose websites take hundreds of milliseconds to generate a page and get brought down by a couple of requests per second.

bakugo 3 hours ago | parent [-]

OpenAI must be the most awful webmasters of all, then, to need such sophisticated protections.

heyethan an hour ago | parent | prev | next [-]

I think the distinction is less about scraping itself, and more about marginal cost.

Scraping static pages is cheap for both sides. Scraping an LLM-backed service effectively externalizes compute costs onto the provider.

Same behavior, very different economics.

ProofHouse 3 hours ago | parent | prev | next [-]

The irony is thick

sabedevops 4 hours ago | parent | prev | next [-]

Seriously. The hypocrisy is staggering!

Aurornis 3 hours ago | parent | prev | next [-]

I interpreted "scraping" in the context of this:

> we want to keep free and logged-out access available for more users

I have no doubt that many people see the free ChatGPT access as a convenient target for browser automation to get their own free ChatGPT pseudo-API.

zer00eyz 4 hours ago | parent | prev [-]

" Integrity at OpenAI .. protect ... abuse like bots, scraping, fraud "

Did you mean to use the word hypocrisy? If not, I'm happy to have said it.

I just want to note that it is well covered how good the support is for actual malware...