Currently all AI companies argue that the content they use falls under fair use, and disregard all licenses. This means any future ones that respect these licenses need to be whitelisted.

▲

diggan 3 days ago | parent [-]

How do you know that that bot is part of those AI companies? Maybe it's my personal bot you're blocking, should I also not have (indirectly) access to the content?

▲

simianparrot 3 days ago | parent | next [-]

No. Access to my content is a privilege I grant you. I decide how you get to access it, and via a bot that my setup confuses for an AI crawler belonging to an anti-human AI corporation is not a valid way to access it. Get off my virtual lawn.

▲

diggan 3 days ago | parent [-]

> No. Access to my content is a privilege I grant you.

Right, I thought the conversation was about public websites on the public internet, but I think you're talking about this in the context of a private website now? I understand keeping tighter controls if you're dealing with private content you want accessible via the internet for others but not the public.

▲

privatelypublic 3 days ago | parent | next [-]

All websites are private (excepting maybe government sites). In most places the internet infrastructure itself is private.

You're conflating this with a legal concept that applies to areas that are shared, government-owned, paid for by taxes, and that the government feels people should be able to access.

The web is closer to a shopping mall. You're on one person's property to access other people's stuff who pay to be there. They set their own rules. If you don't follow those rules you get kicked out, charged with trespassing, and possibly banned from the mall entirely.

AI bots have been asked to leave. But, since they own the mall too, the store owners are more than a little screwed.

▲

diggan 2 days ago | parent [-]

> You're on one person's property to access other people's stuff who pay to be there.

I see it more like I'm knocking on people's doors (issuing GET requests with my web browser) and people open their door for me (the server responds with something) or not. If you don't wanna open the door, fine you do you, but if you do open the door, I'm gonna assume it was on purpose as I'm not trying to be malicious, I'm just a user with a browser.

> AI bots have been asked to leave. But, since they own the mall too, the store owners are more than a little screwed.

I don't understand what you mean by this. What is the mall here? Are you saying that people have websites hosted at OpenAI et al? I'm not sure how the "mall owner" and the people running the AI bots are the same owners.

▲

privatelypublic a day ago | parent [-]

First, the mall is the internet as a whole: you're going to have to pay to be there (entrance is free, getting there is not), then you use their property to get to private businesses that have leased space at the mall.

And finally: https://www.techspot.com/news/105769-meta-reportedly-plannin...

The internet runs on backhaul. A LOT of backhaul is now owned by FAANG. Along with that, most of those companies can financially ruin any business simply by banning it from their platform. So the companies use their backhaul fiber and peering agreements to crawl everybody else. And nobody says anything because of "The Implication" that if you sue under the Computer Fraud and Abuse Act (among others) they'll just wholesale ban you.

A "door to door" analogy doesn't work because sidewalks are generally considered "Public." The best I can tweak that analogy is a gated neighborhood where everybody has "no soliciting" signs. (NB: at least in my area, soliciting when there's a no-soliciting sign is an actual crime, on top of being trespassing)

▲

kiitos 17 hours ago | parent [-]

making an HTTP GET request to an IP and port over the public internet, and getting a response back, is an interaction defined in a technical context, which has its own definitions for concepts like public/private. stuff like licenses.txt or robots.txt exist in totally separate context, which has a totally separate set of definitions for concepts like public/private. can't really conflate context-specific concepts like public/private, over multiple and incompatible contexts like technical/legal

the claim that "a lot of backhaul is now owned by FAANG" is obviously untrue at a basic technical level. the broader argument is cynical, unfalsifiable, and uninteresting.

▲

simianparrot 3 days ago | parent | prev | next [-]

You’re literally visiting a service paid for by me. It’s open to the public, but it’s my domain and my server and I get to say “no thank you” to your visit if you don’t behave. You have no innate right to access the content I share.

Blocking misbehaving IP addresses isn’t new, and is another version of the same principle.

▲

diggan 2 days ago | parent | next [-]

> but it’s my domain and my server and I get to say “no thank you” to your visit if you don’t behave [...] Blocking misbehaving IP addresses isn’t new

Absolutely, I agree that of course people are free to block whatever they want, misbehaving or not. Guess I'm just trying to figure out what sort of "collateral damage" people are OK with when putting up content on the public internet but want it to be selectively available.

> You have no innate right to access the content I share.

No, I guess that's true, I don't have any "rights" to do so. But I am gonna assume that if whatever you host is available without any authentication, protection or similar, you're fine with me viewing that. I'm not saying you should be fine with 1000s of requests per second, but since you made it public in the first place by sharing it, you kind of implicitly agreed for others to view it.

▲

kiitos 15 hours ago | parent | prev [-]

doing an HTTP GET to your server is my request to access some content your server serves. that's my right as a client. and it is your server's responsibility to determine whether or not to respond to my request. that's your server's right. said another way, "access" is the responsibility of the server, not the client.
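
A minimal sketch of that split, assuming a server that decides per request based on the User-Agent header. The blocklist entry `ExampleAICrawler`, the port, and the `should_serve` helper are all hypothetical and only illustrate the idea; real setups usually combine several signals.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical blocklist of User-Agent substrings (illustrative only).
BLOCKED_AGENT_SUBSTRINGS = ("ExampleAICrawler",)


def should_serve(user_agent: str) -> bool:
    """Server-side decision: answer this client or not."""
    ua = user_agent.lower()
    return not any(s.lower() in ua for s in BLOCKED_AGENT_SUBSTRINGS)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The client only asks; the server chooses how to respond.
        if should_serve(self.headers.get("User-Agent", "")):
            body = b"hello\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(403, "Forbidden")


def main() -> None:
    # Assumed local address and port, for experimentation only.
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()
```

Calling `main()` starts the server locally; a request whose User-Agent contains the blocked substring gets a 403, everything else gets the content.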

▲

simianparrot 4 hours ago | parent [-]

Technical pedantry aside, that's what I mean. And I choose to not respond to your request with my content if I don't think your client is acting in good faith -- i.e., is a bot or crawler that disrespects robots.txt, for example.
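
For reference, a sketch of what "respecting robots.txt" looks like from the client side, using Python's standard `urllib.robotparser`. The robots.txt contents, the bot names, and the `allowed` helper are made up for illustration; a real crawler would fetch the site's actual file first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is banned outright,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAICrawler
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A well-behaved bot calls something like `allowed(...)` before every fetch and skips the URL when it returns `False`. Nothing enforces this on the client, which is why the server side ends up blocking.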

▲

bayindirh 3 days ago | parent | prev [-]

This interpretation won't take you that far.

Crawling prevention is not new. Many news outlets and biggish websites have been preventing access by non-human agents in various ways for a very long time.

Now, non-human agents have improved and started to leech everything they can find, so the methods are evolving, too.

News outlets are also public sites on the public internet.

Source-available code repositories are also on the public internet, but said agents crawl and use that code, too, backed by fair-use claims.

▲

bayindirh 3 days ago | parent | prev [-]

You can use an honest user-agent string denoting that it's your bot. Some AI companies label their bots transparently; they show up in the logs I keep.

While I understand that you may need a personal bot to crawl or mirror a site, I can't guarantee that I'll grant you access.

I don't like to be that heavy-handed in the first place, but capitalism is making it harder to trust entities which you can't see and talk to face to face.
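
As a rough illustration of the honest user-agent string mentioned above, here is a sketch using Python's standard `urllib.request`. The bot name, version, and contact URL are placeholders, not a recognized convention any site is guaranteed to honor.

```python
from urllib.request import Request

# Placeholder identity: a bot name, a version, and a URL where the
# site owner can find out who runs the bot and how to reach them.
BOT_USER_AGENT = "my-personal-bot/0.1 (+https://example.com/bot-info)"


def build_request(url: str) -> Request:
    """Build a GET request that openly identifies the bot."""
    return Request(url, headers={"User-Agent": BOT_USER_AGENT})
```

Passing the result to `urllib.request.urlopen` sends the request with that header, so the bot shows up under its own name in the server's logs instead of posing as a browser.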