dvduval 3 hours ago

The broader problem of original sources not being given credit in a way that rewards them remains. Website owners are paying to host their content so that spiders can crawl it and index it into the AI, and then if they’re lucky, they might get a citation, but otherwise there’s very little reward for being a provider of content. And of course, this is something that’s getting worse and worse. Why look at a website when it’s all in AI? And then the counter to that is maybe we need to start closing websites to crawlers and put everything behind a login.

Ensorceled 3 hours ago | parent | next [-]

Worse, the constant AI scraping is actually costing content providers additional money for no return. At least Google/Bing/Yahoo scraping would then be used to provide links back to your content.

devsda 3 minutes ago | parent | next [-]

How do you distinguish Google/MS scraping for Gemini/Copilot vs Google Search/Bing? In the case of Google, the UA is the same and you are entirely at their mercy to honor the Google-Extended instructions in robots.txt

Google has further complicated it with its new search announcement, blurring the lines between regular search and AI search. And AI likes to not honor any licenses or instructions when it is hungry for training material.

It is once again an example of Google using its dominant position to abuse and promote cross functional products.
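For anyone who hasn't set this up: a rough sketch of what the Google-Extended opt-out looks like, and how a well-behaved crawler would read it, using Python's standard urllib.robotparser. The robots.txt content, the GPTBot entry, and the is_allowed helper are my own illustration, not anyone's production config. As noted above, this only expresses a preference; nothing here forces a crawler to comply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ordinary crawlers are allowed, while the
# Google-Extended and GPTBot tokens are asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Google-Extended", "GPTBot"):
        print(agent, is_allowed(agent, "https://example.com/article"))
```

The check runs on the crawler's side, which is exactly the weakness: the site owner publishes the file and has to trust that the other end evaluates it.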

bolangi 2 hours ago | parent | prev | next [-]

Not only costing money. Constant AI scraping constitutes a denial-of-service attack that has brought down websites.

fiedzia 2 hours ago | parent | prev [-]

> At least Google/Bing/Yahoo scraping would then be used to provide links back

That doesn't work anymore. Google provides AI generated summary, nobody looks at the original site.

motbus3 3 hours ago | parent | prev | next [-]

About a year ago OpenAI crawled the company I work for at DDoS levels, despite the robots.txt not allowing it and despite some reCAPTCHA we managed to put together in time.

We found our data in the outputs of their models but who can do anything about it...

kibwen 2 hours ago | parent | next [-]

> We found our data in the outputs of their models but who can do anything about it...

If the crawlers refuse to voluntarily respect your robots.txt, then you are well within your rights to poison their data.

hajile 2 hours ago | parent [-]

robots.txt seems like it should be a legally-binding terms of service which would make them outright copyright infringing.

Sue for $180,000 per infringement which should be calculated for each illegal API call.

throw1234567891 an hour ago | parent [-]

Was your robots.txt written by a lawyer? Does it hold up in court?

shimman an hour ago | parent | prev | next [-]

Why hasn't your company sued OpenAI and tried to argue they're violating the Computer Fraud and Abuse Act? Would it really be impossible to argue this?

Unauthorized access, system damage, and maybe even extortion all apply here.

rastrojero2000 44 minutes ago | parent | prev | next [-]

Lawyers can. As long as that data is actually yours I mean, in a strictly legal sense.

telotortium 2 hours ago | parent | prev [-]

I mean, did you check the IPs and make sure they’re from OpenAI? Obviously a fly-by-night AI company is going to set their User Agent to be from a big player.
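If you want to do that check yourself, a minimal sketch with Python's standard ipaddress module is below. The CIDR ranges are placeholders from the documentation-reserved blocks (RFC 5737), not any company's real ranges; you would substitute whatever list the crawler operator actually publishes. The function name is made up for the example.

```python
import ipaddress

# Placeholder ranges from the documentation-reserved blocks (RFC 5737).
# Replace with the ranges the crawler operator actually publishes.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/25"]


def ip_in_published_ranges(ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """Return True if ip falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    # Addresses as they might appear in an access log next to a
    # User-Agent string claiming to be a well-known crawler.
    for ip in ("192.0.2.17", "203.0.113.5"):
        verdict = "matches" if ip_in_published_ranges(ip) else "does not match"
        print(ip, verdict, "the published ranges")
```

A request whose User-Agent claims a big-name crawler but whose source address falls outside the published ranges is most likely someone else borrowing the name.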

b00ty4breakfast 39 minutes ago | parent | prev | next [-]

>Why look at a website when it's all in AI?

well, at least in the case of google, I'm pretty sure that's the point. Or at least, they are doing things that would seem to be moving towards being an oracle with all the answers and not the signpost that points you in the right direction. The destination rather than the gateway.

philipov 38 minutes ago | parent [-]

remember AMP?

spacechild1 2 hours ago | parent | prev | next [-]

It's actually costing them money/time! A friend of mine is a sysadmin at a university and he constantly has to deal with AI crawlers DDoS-ing his servers. He said Anthropic is actually one of the worst offenders.

These AI companies are really just a gross example of the motto "Socialize the costs, privatise the profits". It's disgusting!

aaarrm 2 hours ago | parent | prev | next [-]

Is it possible to host your website in a way so that it couldn't be found via search engines (and thus wouldn't be crawlable, I hope)?

I know this has repercussions on findability, but if that wasn't a concern, I'm curious how one might circumvent getting crawled.

matt_heimer 2 hours ago | parent | next [-]

Sure, depends on how accessible to people you want it to be.

Most legit search engines are going to honor robots.txt and you can disallow access.

Next level would be using something like rate limiting controls and/or Cloudflare's bot fight mode to start blocking the bad bots. You start to annoy some people here.

Next would be putting the content behind some form of auth.
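To make the rate limiting step concrete, here is a small token-bucket sketch in Python. It is a generic illustration, not how Cloudflare or any particular product does it; the class, the allow_request helper, and the rate and capacity numbers are all made up for the example.

```python
import time
from collections import defaultdict


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = None     # time of the previous call, if any

    def allow(self, now=None) -> bool:
        """Consume one token if available; `now` can be injected for testing."""
        now = time.monotonic() if now is None else now
        if self.updated is not None:
            # Refill in proportion to the time elapsed, capped at capacity.
            elapsed = now - self.updated
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client address (illustrative numbers: 1 req/s, burst of 5).
buckets = defaultdict(lambda: TokenBucket(rate=1.0, capacity=5))


def allow_request(client_ip: str, now=None) -> bool:
    return buckets[client_ip].allow(now)
```

Keying on the client address is the simplest choice and also why this annoys some real users: people behind a shared NAT or VPN end up sharing one bucket.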

elorant 2 hours ago | parent | prev | next [-]

Possible, yes; probable, not likely. The moment you're issued a certificate, your domain will show up in the Certificate Transparency logs, which are constantly monitored by anyone who wants to find new sites.

trinari 2 hours ago | parent | prev | next [-]

robots.txt is a way of leaving the door unlocked but kindly asking bots to stay outside.

account42 2 hours ago | parent | next [-]

Which in a law-abiding society should be enough. It's also how we do things in the real world in many cases - i.e. here you can just write on your mailbox "no ads" and companies have to respect that.

Even when we do actually put physical locks on things they are mostly there to show that someone breaking in did so intentionally and not at all designed to prevent motivated attackers.

dpark an hour ago | parent [-]

> here you can just write on your mailbox "no ads" and companies have to respect that

Where do you live? In the US it’s actually illegal for anyone except the USPS to deliver to a mailbox.

dpark an hour ago | parent | prev [-]

You might be interested to know that entering an unlocked door into a space you do not have permission to be in is still illegal.

throw1234567891 an hour ago | parent [-]

You might be interested to know that the “illegality” depends on the intent. If I rest on your unlocked door handle, it opens, I enter, it’s an accident.

dpark 20 minutes ago | parent [-]

Sorry, what? In this scenario are you claiming that you accidentally fell inside the restricted area because you were leaning on the door? Or are you claiming that you accidentally opened the door and then walked through intentionally? In the former case, you are guilty of breaking and entering in most US jurisdictions if you don’t promptly get out. Any sane court would likely agree an accidental trespass is probably not a criminal act, but it’s not an accident if you stay. In the latter case, you’re clearly trespassing illegally.

Also this has gotten pretty far away from the web scraping scenario. There’s no door accidentally opening here.

dminik 7 minutes ago | parent [-]

Oops, I just accidentally fell into every website. Don't know how that happened ...

Imustaskforhelp 19 minutes ago | parent | prev | next [-]

If you really wanted and are interested in doing so and perhaps are even happy with just text and normal styling limitations, I recommend you to test out other protocols like creating a gemini website or gopher website. I don't think that scraping happens on even remotely the same scale there as compared to conventional websites

That being said you would require your user to download a compatible browser for gemini/gopher.

MontgomeryPy 2 hours ago | parent | prev [-]

You could just put your website content behind its own chat interface. The crawler would just see a form input for a prompt.

wolttam 2 hours ago | parent | prev | next [-]

I’ve been thinking of a proof-of-work scheme for accessing content where you effectively need to mine some crypto for the author, but, this idea might not fly today

dpark an hour ago | parent | next [-]

This is already a thing.

https://en.wikipedia.org/wiki/Anubis_(software)

wolttam 22 minutes ago | parent [-]

Yes, but:

> Although Anubis could be altered to mine cryptocurrency to serve as proof of work, Iaso has rejected this idea: "I don't want to touch cryptocurrency with a 20 foot pole."

Which in my mind is a shame. Crypto is an absolute mess, yes, but this seems like an elegant way to get something back for putting things out there.

vitally3643 7 minutes ago | parent | next [-]

Mining crypto doesn't materialize money. You have to exchange it for real money, which means taking a private individual's money in exchange for scam tokens.

This is the problem crypto fans refuse to acknowledge. The money doesn't magically appear, you're taking it from someone else and letting them hold the bag when whatever cryptocurrency you choose inevitably blows up, fails, or rug-pulls. It's unethical to engage with at all because you're still participating in scamming real money out of private individuals

dpark 16 minutes ago | parent | prev [-]

The problem is that much of the cost is borne by humans accessing the sites. People generally get real mad when they find out you’re using their computers to mine crypto.

microtonal 2 hours ago | parent | prev | next [-]

But that will be a hassle for human visitors as well. A web that requires proof-of-work to browse will be a disaster for phones with their limited batteries, etc.

odo1242 2 hours ago | parent [-]

To be specific, it would be more of a hassle for human visitors than for the AI companies with infinite money and specialized browsers.

wolttam 21 minutes ago | parent [-]

The idea would be that AI companies would still be forced to do this proof of work. Anubis proved the idea

chii 2 hours ago | parent | prev [-]

or you know, just charge for your content if you believe it to be valuable enough for the fee being charged.

wolttam 19 minutes ago | parent [-]

Yes, but that tends to limit the reach of your content. Hence why a lot of people reach for ads.

Between seeing ads and doing a little bit of proof-of-work for the author, I'd choose the latter.

gabbagool an hour ago | parent | prev | next [-]

I agree with this wholeheartedly. What's the point of even having copyright law at this point?

What's even crazier to think about is that to use the latest versions of these models for which you supplied training data, you have to pay hundreds of dollars a month. I would love to get a settlement check proportional to my model weights. Even if it's $0.10, at least everyone out there will get what they're owed.

rickydroll 43 minutes ago | parent | next [-]

From my perspective, everybody trains on the knowledge and experience of those who came before. AI just does the same thing at scale.

I do not value copyright. All it does is give you standing to sue if somebody reproduces your work. It does not differentiate or account for parallel creation. I cannot count how many times I have "created" something, only to find it in a research paper later.

Part of the reason I think copyright has no value is that, in general, individual copyright owners don't have the deep pockets necessary to sue someone who violates their copyright. If anyone is violating the spirit of copyright, it's corporations that insist you assign your work over to them as a work for hire, or outright ignore your copyright. (looking at you, Disney's Atlantis).

A significant benefit of AI that doesn't get talked about enough is that AI has a much greater reach over all the information it was trained on and can draw connections that would be invisible to someone operating at the human scale.

ofjcihen 38 minutes ago | parent [-]

The fact that these companies are making money off of it negates your argument.

throw1234567891 an hour ago | parent | prev [-]

No, you don’t have to. There are open weight models you can download and use for free. Many people choose the subscription model but it’s not necessary. And latest doesn’t mean greatest, it’s just most up-to-date.

WarmWash 2 hours ago | parent | prev | next [-]

[flagged]

omnimus 2 hours ago | parent | next [-]

Total sleight of hand.

Ad blocking has always been a problem for creators but it's aimed at big corps - non-creators. The creators asked people to support them other ways or turn off the blocking. And it's not like the little independent creators wanted this version of commercialized internet in the first place.

The AI marketing teams are spinning everything they can, but no: AI companies are the conscript, the vultures. No question about it.

WarmWash 2 hours ago | parent [-]

The conversion from viewer to donor is around 1%. This is true from Wikipedia, to Twitch, to podcasts.

The number of people who will not ever load your ads is around 30%.

I can tell you that creators talk about this a lot in private, but will not publicly because the internet has a mass delusion about how creation and compensation work. It's like trying to convince Christians that Jesus obviously didn't come back from the dead days later, despite there being no logical system available that would explain it.

If we were to try and map out a functional internet where everyone wins, users and creators, there is no example where ad blocking is anything other than net harmful. You either get a volunteer net where 0.01% share hobby posts on their own dime for the other 99.99%, or you get IRC, where 99% of the population doesn't really benefit (à la 1993).

u_fucking_dork 2 hours ago | parent | prev | next [-]

People usually point at the scale when this discussion comes up, in my experience. These companies are doing something at a huge scale spending tons of money to do it so the potential harm is greater.

People can easily justify their own piracy because it’s small scale. Even when they organize, create a whole software and tooling ecosystem around pirating media to stick into jellyfin or plex. AI still did it bigger and worse and is bad, what I’m doing is not so bad because I wasn’t going to buy the movie anyway, etc.

WarmWash 2 hours ago | parent | next [-]

On the whole, about 35% of internet users are ad-blocking. In the tech space it's upwards of 70%.

It's in no way, shape, or form "small scale", and has fundamentally changed the very nature of the internet for the worse (opinions/views of ad blocking people don't matter).

52-6F-62 2 hours ago | parent | prev [-]

Don't forget that the money being spent to do said scraping has, in great sums, come from subsidies paid by taxes from public coffers.

zetanor 2 hours ago | parent | prev | next [-]

I am in favor of severely limiting both copyright and advertising, but for the benefit of everyone, not just for the benefit of a few "AI" companies.

omnimus 2 hours ago | parent | next [-]

And you will not get it. As the AI companies pump money into lawyers and politicians, they will be the ones profiting from copyright. Total regulatory capture as US AI companies make it illegal to train AI on their output.

WarmWash an hour ago | parent | prev [-]

The answer is to simply pay for stuff.

There is no viable model where "have stuff but not pay for it" works out.

onedognight 2 hours ago | parent | prev | next [-]

Choosing not to look at something is not denying anyone anything.

WarmWash 2 hours ago | parent [-]

Choosing not to look at an ad, and blocking it are different things. One is totally ok, the other incurs a monetary loss on the creator. Those services aren't free to run, and the content doesn't take zero time to create. It also incentivizes creating content focused on those who cannot figure out ad blocking.

theamk 2 hours ago | parent | prev | next [-]

There is more to life than money.

Many of the websites I read do not collect any appreciable amount of money from ads, or have no ads at all (one example: news.ycombinator.com :) ). They want recognition, or to share knowledge, or community, or they are building their brand... And AI is destroying all of this: the first result for "zx80" is an AI overview with a link to Wikipedia and some YouTube videos. If a person stops there, they will never get to the computinghistory.org.uk link, and won't see any related information about the variants and models.

WarmWash an hour ago | parent [-]

This website is an ad for Y Combinator. It's in no way, shape, or form a charity place for devs to hang out. It's a feeding ground to lure tech people into a mega VC's pastures.

When you click "news.ycombinator.com" you are clicking on the ad.

:)

mixmastamyk 2 hours ago | parent | prev | next [-]

Interesting. I suppose the main difference is that we’re ants compared to an 800 pound gorilla.

qotgalaxy 2 hours ago | parent | prev [-]

[dead]

internet2000 2 hours ago | parent | prev [-]

Perhaps we should go back to when the internet was about sharing information you liked, not about credit or making money on "content".

throw1234567891 an hour ago | parent [-]

You are there today, but some are unhappy that others don’t share the same sentiment.