reactordev 3 days ago

I’m in love with the theme switcher. This is how a personal blog should be. Great content. Fun site to be on.

My issue is that crawlers aren’t respecting robots.txt. They are capable of operating captchas and human-verification checkboxes, and they can extract all your content and information as a tree in a matter of minutes.

Throttling doesn’t help when you have to load a bunch of assets with your page. IP range blocking doesn’t work because they’re essentially lambdas. Their user-agent info looks like someone on Chrome browsing your site.

We can’t even render everything to a canvas to stop it.

The only remaining tactic is verification through authorization. Sad.

heikkilevanto 2 days ago | parent | next [-]

I have been speculating about adding a tar pit to my personal web site: a script that produces a page of random nonsense and random-looking links back to the same script. The thing would not be linked from anywhere, but would be explicitly forbidden in robots.txt. If the crawlers start on it, let them get lost. A bit of rate limiting should keep my server safe and slow the crawlers down. Maybe I should add some confusing prompts to the page as well... I'll probably never get around to it, but the idea sounds tempting.
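
A minimal sketch of that tar pit in Python (assuming Flask; the route name and word source are illustrative, not from the comment):

    # Tar pit sketch: every page is random nonsense plus links back into
    # the same script. robots.txt would carry "Disallow: /trap/" so only
    # rule-ignoring crawlers wander in.
    import random
    from flask import Flask

    app = Flask(__name__)
    WORDS = open("/usr/share/dict/words").read().split()  # any word list works

    @app.route("/trap/<token>")
    def trap(token):
        nonsense = " ".join(random.choices(WORDS, k=200))
        links = " ".join(
            '<a href="/trap/%032x">read more</a>' % random.getrandbits(128)
            for _ in range(10)
        )
        return "<p>%s</p>%s" % (nonsense, links)

Rate limiting can then be layered on top (e.g. at the reverse proxy) so the crawler is slowed down without the pages costing anything to generate.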

shakna 2 days ago | parent | next [-]

I have a single <a> element in my website's head, pointing to a route banned in robots.txt; the page is also marked noindex via meta tags and HTTP headers.

When something grabs it, which AI crawlers regularly do, it feeds them the text of 1984, about a sentence per minute. Most crawlers stay on the line for about four hours.
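
The slow-drip part can be done with a streaming response; a minimal sketch (assuming Flask; shakna's actual implementation isn't shown here):

    # Slow-drip sketch: stream a long text roughly one sentence per
    # minute so an impatient crawler ties up a connection for hours.
    import time
    from flask import Flask, Response

    app = Flask(__name__)
    SENTENCES = open("novel.txt").read().split(". ")  # e.g. the text of 1984

    @app.route("/honeypot")
    def honeypot():
        def drip():
            for sentence in SENTENCES:
                yield sentence + ". "
                time.sleep(60)  # about one sentence per minute
        return Response(drip(), mimetype="text/plain")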

dbalatero 2 days ago | parent [-]

That's hilarious, can I steal the source for my own site?

anileated 2 days ago | parent | next [-]

That’s what an LLM would say…

_moof 2 days ago | parent | prev [-]

Only if you aren't a crawler.

E39M5S62 2 days ago | parent [-]

This is a long shot, but are you the same moof that ran the bot 'regurg' on EFnet in the late 90's / early 2000's for the BeOS community?

reactordev 2 days ago | parent | prev | next [-]

I did something similar. In a normal browser it just displays the Matrix rain effect. For a bot, it's a page of links upon links to pages that link to each other, using a clever PHP script and .htaccess fun. The fun part is watching the logs to see how long they get stuck, since each link is unique and can build a tree structure several GB deep on my server.

I did this once before with an ssh honey pot on my Mesos cluster in 2017.

phyzome 2 days ago | parent | prev | next [-]

Should be possible to do this with a static site, even.

Here's what I've been doing so far: https://www.brainonfire.net/blog/2024/09/19/poisoning-ai-scr... (serving scrambled versions of my posts to LLM scrapers)
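
One naive way to scramble text (the linked post describes its own approach; this word-shuffle is just an illustration):

    # Scrambling sketch: shuffle the words of each sentence so scraped
    # copies are near-worthless as training data but cheap to generate.
    import random

    def scramble(text: str) -> str:
        scrambled = []
        for sentence in text.split(". "):
            words = sentence.split()
            random.shuffle(words)
            scrambled.append(" ".join(words))
        return ". ".join(scrambled)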

gleenn 2 days ago | parent | prev | next [-]

Check out doing a compression bomb too: you can host a very small file that uncompresses into a massive file for crawlers, and hopefully runs them out of RAM so they die. Someone posted about it recently on HN, but I can't immediately find the link.
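
The trick relies on the crawler honoring Content-Encoding: gzip and inflating the response in memory; a sketch of building such a file:

    # Compression bomb sketch: ~10 GiB of zeros compresses to roughly
    # 10 MiB with gzip. Serve bomb.gz with "Content-Encoding: gzip" so
    # the crawler inflates it client-side.
    import gzip

    chunk = b"\0" * (1024 * 1024)  # 1 MiB of zeros
    with gzip.open("bomb.gz", "wb") as f:
        for _ in range(10 * 1024):  # 10 GiB uncompressed in total
            f.write(chunk)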

extraduder_ire 2 days ago | parent [-]

It's either this one https://news.ycombinator.com/item?id=44670319 or the comments from this one https://news.ycombinator.com/item?id=44651536

I also recall reading it. I think wasting their time is more effective than making them crash and give up in this instance though.

J_McQuade 2 days ago | parent | prev | next [-]

I loved reading about something similar that popped up on HN a wee while back: https://zadzmo.org/code/nepenthes/

fbunnies 2 days ago | parent [-]

I loved reading about something dissimilar that did not pop up on HN yet: https://apnews.com/article/rabbits-with-horns-virus-colorado...

xyzal 2 days ago | parent | prev [-]

Or, serve "Emergent Misalignment" dataset.

https://github.com/emergent-misalignment/emergent-misalignme...

Karawebnetwork 2 days ago | parent | prev | next [-]

Reminds me of CSS Zen Garden and its 221 themes: https://csszengarden.com/

e.g. https://csszengarden.com/221/ https://csszengarden.com/214/ https://csszengarden.com/123/

See all: https://csszengarden.com/pages/alldesigns/

cxr 2 days ago | parent [-]

Only somewhat related and unfortunately misses the point.

CSS Zen Garden was powered by style sheets as they were designed to be used. Want to offer a different look? Write an alternative style sheet. This site doesn't do that. It compiles everything to a big CSS blob and then uses JS (which for some reason is also compiled to a blob, despite consisting of a grand total of 325 SLOC before being fed into the bundler) to insert/remove stuff from the page and fiddle with a "data-theme" attribute on the html element.

Kind of a bummer since clicking through to the author's Mastodon profile shows a bunch of love for stuff like a talk about "Un-Sass'ing my CSS" and people advocating others "remove JS by pointing them to a modern CSS solution". (For comparison: Firefox's page style switcher and the DOM APIs it depends on[1] are older than Firefox itself. The spec[1] was made a recommendation in November 2000.)

1. <https://www.w3.org/TR/DOM-Level-2-HTML/html.html#ID-87355129>

reactordev 2 days ago | parent | next [-]

I fault her static site builder and not the author for that. It’s just how her bundler bundles.

cxr 2 days ago | parent [-]

This makes as much sense as choosing not to fault the person carrying a dagger who buried it into your shoulder—because you just fault their dagger.

reactordev 2 days ago | parent [-]

No, it makes as much sense as someone who wants to travel, doesn't like the carbon footprint, but has to fly because there's no other way to Paris.

cxr a day ago | parent [-]

Even ignoring your hyperbolic (i.e. wrong) use of "has to", the options are not "use tooling that produces crummy bundles, or else don't go to Paris"; the Paris that we're talking about definitely has more than one way to get there.

extraduder_ire 2 days ago | parent | prev [-]

I'm disappointed no browsers other than Firefox support it anymore.[0] Chrome dropped support in version 47.

It's very rare to see it used in the wild too, probably because it's not "sticky" across page loads.

0: https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/...

1718627440 2 days ago | parent [-]

Me too. I do use it. It is very useful when you redesign a site and want to compare versions: just switch between themes and you instantly see if everything is pixel-perfect in the same spot.

I think it should be "sticky" the same way non-submitted form content stays persistent across page-reloads.

This kind of feature should be what browsers are judged and compared on.

martin-t 3 days ago | parent | prev | next [-]

This shouldn't be enforced through technology but the law.

LLM and other "genAI" (really "generative machine statistics") algorithms just take other people's work, mix it so that any individual training input is unrecognizable and resell it back to them. If there is any benefit to society from LLM and other A"I" algorithms, then most of the work _by orders of magnitude_ was done by the people whose data is being stolen and trained on.

If you train on copyrighted data, the model and its output should be copyrighted under the same license. It's plagiarism and it should be copyright infringement.

stahorn 2 days ago | parent | next [-]

It's like the world turned upside down in the last 20 years. I used to pirate everything as a teenager, and I found it silly that copyright would follow along no matter how anything was encoded. If I XORed copyrighted material A with open source material B, I would get a strange file C that, together with B, I could use to get material A again. Why would it be illegal for me to send anybody B and C, when the strange file C might just as well be thought of as containing the open source material B?!

Now that I've grown up, started paying for what I want, and see the need for some way for content creators to get paid for their work, these AI companies pop up. They encode content in a completely new way, and then somehow we should just accept that it's fine this time.

This page was posted here on Hacker News a few months ago, and it really shows that this is just what's going on:

https://theaiunderwriter.substack.com/p/an-image-of-an-arche...

Maybe another 10 years and we'll be in the spot when these things are considered illegal again?

martin-t 2 days ago | parent | next [-]

I went through exactly this process.

Then I discovered (A)GPL and realized that the system makes sense to protect user rights.

And as I started making my own money, I started paying instead of pirating, though I sometimes wonder how much of my money goes to the actual artists and creators and how much goes to zero-sum occupations like marketing and management.

---

It comes down to understanding power differentials - we need laws so large numbers of individuals each with little power can defend themselves against a small number of individuals with large amounts of power.

(Well, we can defend ourselves anyway but it would be illegal and many would see it as an overreaction - as long as they steal only a little from each of us, we're each supposed to only be a little angry.)

---

> Maybe another 10 years and we'll be in the spot when these things are considered illegal again?

That's my hope too. But it requires many people to understand they're being stolen from, and my fear is that way too few produce "content"[0], and that the majority will feel like they benefit from being able to imitate us with little effort. There's also the angle that the US needs to beat China (even though two nuclear superpowers both lose in an open conflict), and that because China has been stealing everything for decades, we (the West) need to start stealing to keep up too.

[0]: https://eev.ee/blog/2025/07/03/the-rise-of-whatever/#:~:text...

lawlessone 2 days ago | parent | prev [-]

Just pirate again. It's the only way to ensure a game or movie can't be recalled by publishers the next time they want everyone to buy the sequel.

reactordev 2 days ago | parent [-]

Or traded to a different streaming service you aren’t subscribed to - ugh!

thewebguyd 2 days ago | parent | prev | next [-]

> and resell it back to them.

This is the part of this tech I take issue with the most. Outside of open-weight models (and even then, it's not fully open source; the training data is not available, and we cannot reproduce the model ourselves), all the LLM companies are doing is stealing and selling our (humans', collectively) knowledge back to us. It's yet another large-scale, massive transfer of wealth.

These aren't being made for the good of humanity, to be given freely; they are being made for profit, treating human knowledge as raw material to be mined and resold at massive scale.

martin-t 2 days ago | parent [-]

And that's just one part of it.

Part 2 is all the copyleft code powering the world. Now it can be effortlessly laundered. The freedom to inspect and modify? Gone.

Part 3 is what happens if actual AI is created. Rich people (who usually perform zero- or negative-sum work, if any) need the masses (who perform positive-sum work) for a technological civilization to actually function. So we have a lot of bargaining power.

Then an ultra rich narcissistic billionaire comes along and wants to replace everyone with robots. We're still far off from that even if actual AI is achieved but the result is not that everyone can live a happy post-scarcity life with equality, blackjack and hookers. The result is that we all become beggars dependent on what those benevolent owners of AI and robots hand out to us because we will no longer have anything valuable to provide (besides our bodies I guess).

cowboylowrez a day ago | parent [-]

Makes me happy to read that at least folks are thinking about this stuff. To me, this LLM-replacing-humans stuff is ridiculous: we really do have a pretty good supply of humans, but do we really have as good a supply of the resources that go into all of these human-replacing AIs?

jasonvorhe 2 days ago | parent | prev | next [-]

Which law? Which jurisdiction? From the same class of people who have been writing laws in their favor for a few centuries already? Pass. Let them consume it all. I'd rather choose the gwern approach and write stuff that's unlikely to get filtered out of upcoming models during training. Anubis treats me like a machine, just like Cloudflare, but open source and erroneously in good spirit.

riazrizvi 2 days ago | parent | prev | next [-]

Laws have to be enforceable. When a technology comes along that breaks enforceability, the law/society changes. See also: Prohibition vs. the expansion of homebrewing in the '20s/'30s, censorship vs. the expansion of media production in the '60s/'70s, encryption bans vs. the open source movement in the '90s, music sampling markets vs. music electronics in the '80s/'90s…

throw10920 2 days ago | parent | next [-]

> Laws have to be enforceable.

This is a good point. In this case, it does seem pretty easy to enforce, though - just require anyone hosting an LLM for others to use to have full provenance of all of the data that they trained that LLM on. Wouldn't that solve the problem fairly easily? It's not like LLM training can be done in your garage (at which point this requirement would kill off hundreds/thousands of small LLM-training businesses that would hypothetically otherwise exist).

martin-t 2 days ago | parent | prev [-]

In most of those cases, it was because too many people broke the laws, regardless of what companies did. It was too distributed.

But to train a model, you need a huge amount of compute, centralized and owned by a large corporation. Cut the problem at the root.

visarga 2 days ago | parent | prev [-]

> algorithms just take other people's work, mix it so that any individual training input is unrecognizable and resell it back to them

LLMs are huge and need special hardware to run. Cloud providers underprice even local hosting. Many providers offer free access.

But why are you not talking about what the LLM user brings? They bring a unique task or problem to solve. They guide the model and channel it towards the goal. In the end they take the risk of using anything from the LLM. They bring the context, and they are the consequence sink.

martin-t 2 days ago | parent | next [-]

Quantity matters.

Imagine it took 10^12 hours to produce the training data, 10^6 hours to produce the training algorithm and 10^0 hours to write a bunch of prompts to get the model to generate a useful output.

How should the reward be distributed among the people who performed the work?

lawlessone 2 days ago | parent | prev [-]

>But why are you not talking about what the LLM user brings? They bring a unique task or problem to solve. They guide the model and channel it towards the goal. In the end they take the risk of using anything from the LLM.

I must remember, next time I'm shopping, to demand the staff thank me when I ask them where the eggs are.

martin-t 2 days ago | parent [-]

I was gonna make an analogy of stealing someone's screwdriver set when I need to solve a unique problem but this is so much better.

lawlessone 2 days ago | parent [-]

that's good too.

jasonvorhe 2 days ago | parent | prev | next [-]

These themes are really nice. They even work well on quirky displays. Stuff like this is what makes me enjoy the internet, despite its slide toward the gutter.

Scrounger 2 days ago | parent | prev | next [-]

> My issue is that crawlers aren’t respecting robots.txt

Cloudflare has a toggle switch to automatically block LLMs, scrapers, etc.:

https://blog.cloudflare.com/declaring-your-aindependence-blo...

oooyay 2 days ago | parent | prev | next [-]

https://localghost.dev/about/

The theme also changes the background of her profile picture. The attention to detail is commendable.

jacobyoder 2 days ago | parent | next [-]

Hovering over the netscape link renders it slowly, line by line, like images used to come down...

oooyay 2 days ago | parent [-]

hah, that's amazing

clbn 2 days ago | parent | prev [-]

Not just the background, the Netscape one is a different photo!

pas 3 days ago | parent | prev | next [-]

PoW might not work for long, but Anubis is very nice: https://anubis.techaro.lol/
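
The gist of such PoW gates, as a simplified sketch (the general idea only, not Anubis's actual protocol): the client must find a nonce whose hash meets a difficulty target, which is cheap for one visitor but expensive at crawler scale.

    # Proof-of-work sketch: find a nonce so sha256(challenge + nonce)
    # starts with `difficulty` zero hex digits; verifying takes one hash.
    import hashlib

    def verify(challenge: str, nonce: int, difficulty: int) -> bool:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        return digest.startswith("0" * difficulty)

    def solve(challenge: str, difficulty: int) -> int:
        nonce = 0
        while not verify(challenge, nonce, difficulty):
            nonce += 1
        return nonce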

That said ... putting part of your soul into machine format so you can put it on the big shared machine using your personal machine and expecting that only other really truly quintessentially proper personal machines receive it and those soulless other machines don't ... is strange.

...

If people want a walled garden (and yeah, sure, I sometimes want one too) then let's do that! Since it must allow authors to set certain conditions, and require users to pay into the maintenance costs (to understand that they are not the product) it should be called OpenFreeBook just to match the current post-truth vibe.

workethics 2 days ago | parent | next [-]

> That said ... putting part of your soul into machine format so you can put it on the big shared machine using your personal machine and expecting that only other really truly quintessentially proper personal machines receive it and those soulless other machines don't ... is strange.

That's a mischaracterization of what most people want. When I put out a bowl of candy for Halloween, I'm fine with EVERYONE taking some candy. But these companies are the equivalent of the asshole that dumps the whole bowl into their bag.

horsawlarway 2 days ago | parent | next [-]

I really don't think this holds.

It's vanishingly rare to end up in a spot where your site is getting enough LLM driven traffic for you to really notice (and I'm not talking out my ass - I host several sites from personal hardware running in my basement).

Bots are a thing. Bots have been a thing and will continue to be a thing.

They mostly aren't worth worrying about, and at least for now you can throw PoW in front of your site if you are suddenly getting enough traffic from them to care.

In the mean time...

Your bowl of candy is still there. Still full of your candy for real people to read.

That's the fun of digital goods... They aren't "exhaustible" like your candy bowl. No LLM is dumping your whole bowl (they can't). At most - they're just making the line to access it longer.

shiomiru 2 days ago | parent | next [-]

> They mostly aren't worth worrying about

Well, a common pattern I've lately been seeing is:

* Website goes down/barely accessible

* Webmaster posts "sorry we're down, LLM scrapers are DoSing us"

* Website accessible again, but now you need JS enabled and whatever the god of the underworld is testing this week to access it. (Alternatively, the operator decides it's not worth the trouble and the website shuts down.)

So I don't think your experience about LLM scrapers "not mattering" generalizes well.

horsawlarway 2 days ago | parent [-]

Nah - it generalizes fine.

They're doing exactly what I said - adding PoW (anubis - as you point out - being one solution) to gate access.

That's hardly different from things like captchas, which were a big thing even before LLMs and also required JavaScript. As an aside, I'd much rather have people put Anubis in front of the site than Cloudflare.

If the site really was static before, and no JS was needed - LLM scraping taking it down means it was incredibly misconfigured (an rpi can do thousands of reqs/s for static content, and caching is your friend).

---

Another great solution? Just ask users to login (no js needed). I'll stand pretty firmly behind "If you aren't willing to make an account - you don't actually care about the site".

My take is that search engines and sites generating revenue through ads are the most impacted. I just don't have all that much sympathy for either.

Functionally - I think trying to draw a distinction between accessing a site directly and using a tool like an LLM to access a site is a mistake. Like - this was literally the mission statement of the semantic web: "unleash the computer on your behalf to interact with other computers". It just turns out we got there by letting computers deal with unstructured data, instead of making all the data structured.

krupan 2 days ago | parent | next [-]

"this was literally the mission statement of the semantic web" which most everyone either ignored or outright rejected, but thanks for forcing it on us anyway?

horsawlarway 2 days ago | parent [-]

I guess if my options for getting a ramen recipe are

- Search for it and randomly click on SEO spam articles all over the place, riddled with ads, scrolling 10,000 lines down to see a generally pretty uninspired recipe

or

- Use an LLM and get a pretty uninspired recipe

I don't really see much difference.

And we were already well past the days where I got anything other than the first option using the web.

There was a brief window where intentionally searching specific sites like Reddit/HN worked, but even that's been gone for a couple of years now.

The best recipe is going to be the one you get from your friends/family/neighbors anyways.

And at least on the LLM side - I can run it locally and peg it to a version without ads.

w00ds 2 days ago | parent [-]

It's crazy how appealing the irl version you mentioned is, compared to the online version. Looking through a book, meeting people and sharing recipes, etc. The world you're interacting with actually cares about you. Feels like the net can't ever have that now.

shiomiru 2 days ago | parent | prev | next [-]

> If the site really was static before, and no JS was needed

One does not imply the other. This forum is one example. (Or rather, hn.js is entirely optional.)

> Another great solution? Just ask users to login (no js needed). I'll stand pretty firmly behind "If you aren't willing to make an account - you don't actually care about the site".

Accounts don't make sense for all websites. Self-hosted git repositories are one common case where I now have to wait seconds for my phone to burn through enough sha256 to see a readme - but surely you don't want to gate that behind a login either...

> My take is that search engines and sites generating revenue through ads are the most impacted. I just don't have all that much sympathy for either.

...and hobbyist services. If we're sticking with Anubis as an example, consider the author's motivation for developing it:

> A majority of the AI scrapers are not well-behaved, and they will ignore your robots.txt, ignore your User-Agent blocks, and ignore your X-Robots-Tag headers. They will scrape your site until it falls over, and then they will scrape it some more. They will click every link on every link on every link viewing the same pages over and over and over and over. Some of them will even click on the same link multiple times in the same second. It's madness and unsustainable.

https://xeiaso.net/blog/2025/anubis/

> Functionally - I think trying to draw a distinction between accessing a site directly and using a tool like an LLM to access a site is a mistake.

This isn't "a tool" though, it's cloud hosted scrapers of vc-funded startups taking down small websites in their quest to develop their "tool".

It is possible to develop a scraper that doesn't do this, but these companies consciously chose to ignore the pre-existing standards for that. Which is why I think the candy analogy fits perfectly, in fact.

account42 2 days ago | parent | prev [-]

> They're doing exactly what I said - adding PoW (anubis - as you point out - being one solution) to gate access.

Which is a shit solution where everyone suffers.

> Another great solution? Just ask users to login (no js needed). I'll stand pretty firmly behind "If you aren't willing to make an account - you don't actually care about the site".

No, I won't create an account just to check if a search result has what I'm looking for. Nor will I sign up to a forum before I know what the culture is like. We already had this shit with communities moving to Discord; we don't need to fuck up the remaining web as well.

igloopan 2 days ago | parent | prev | next [-]

I think you're missing the context that is the article. The candy in this case is the people who may or may not go read your ramen recipe, say. The real problem, as I see it, is that over time, as LLMs absorb the information covered by that recipe, fewer people will actually look at the search results, since the AI summary tells them how to make a good-enough bowl of ramen. The number of ramen enjoyers is zero-sum. Your recipe will, of course, stay up and accessible to real people, but LLMs take away impressions that could have been yours. In this metaphor, they take your candy and put it in their own bowl.

horsawlarway 2 days ago | parent | next [-]

So what is the goal behind gathering those impressions?

Why do you take this as a problem?

And I'm not being glib here - those are genuine questions. If the goal is to share a good ramen recipe... are you not still achieving that?

SamBam 2 days ago | parent [-]

The internet would not exist if it consisted of people just putting stuff out there, happy that it's released into the wilds of the overall consciousness, and nothing more. People are willing to put the time and effort into posting stuff for other reasons. Building community, gaining recognition, making money. Even on a website like HN we post under consistent usernames with the vague sense that these words are ours. If posts had no usernames, no one would comment on this site.

It's completely disingenuous to say that everyone who creates content -- blog authors, recipe creators, book writers, artists, etc -- should just be happy feeding the global consciousness because then everyone will get a tiny diluted iota of their unattributed wisdom.

horsawlarway 2 days ago | parent [-]

How old are you?

I'm old enough I remember a vivid internet of exactly that.

Back when you couldn't make money from ads, and there was no online commerce.

Frankly - I think the world might be a much better place if we moved back in that direction a bit.

If you're only doing it for money or credit, maybe do something else instead?

> If posts had no usernames, no one would comment on this site.

I'd still comment. I don't actually give much of a shit about the username attached. I'm here to have a casual conversation and think about things. Not for some bullshit internet street cred.

SamBam 2 days ago | parent | next [-]

I'm more than old enough to remember the birth of the internet.

Back when I had a GeoCities website about aliens (seriously) it was still mine. I had a comments section and I hoped people would comment on it (no one did). I had a counter. I commented on other people's sites in the Area 51 subsection I was listed under.

The aim wasn't just to put out my same-ol' unoriginal thoughts into the distributed global consciousness, it was to actually talk to other people. The fact that I wrote it under a dumb handle (a variant of the one I still use everywhere) didn't make me feel less like it was my own individual communication.

It's the same for everything else, even the stuff that was completely unattributed. If you put a hilarious animation on YTMND, you know that other people will be referencing that specific one, and linking to it, and saying "did you see that funny thing on YTMND?" It wouldn't have been enough for the audience to just get some diluted, average version of that animation spread out into some global meme-generating AI.

So no, "Google Zero" where no one sees the original content and is just "happy that their thoughts are getting out there, somehow" is not something that anyone should wish for.

reactordev 2 days ago | parent | prev [-]

You can’t bring back CompuServe.

You're both right; however, it's the medium that determines one's point of view on the matter. If I just want to spread my knowledge to the world, I post on social media. If I want to curate a special viewership and own my own corner of the web, I post on a blog. If I want to set a flag, set up a shop, and say I'm open for business, I write an app.

The internet is all of these things. We just keep being fed the latter.

jasonvorhe 2 days ago | parent | prev | next [-]

That's also trained behavior, due to SEO-infested recipe sites filled with advertorials, referral links to expensive kitchen equipment, and long-form text about the recipe, with the actual recipe hidden somewhere below it.

Same goes for other stuff that can be easily propped up with lengthy text stuffed with just the right terms to spam search indexes with.

LLMs are just readability on speed, with the downsides of drugs.

2 days ago | parent | prev [-]
[deleted]
lelanthran 2 days ago | parent | prev [-]

> I really don't think this holds.

Only if you consider DoS as the only downside.

As with this analogy:

1. I put out a bowl of (infinite and cost-free) candy, with my name written on each piece so people know where they got the candy.

2. Some other resident, who doesn't have an infinite and cost-free source of candy like I do, comes along and grabs all the candy at periodic intervals.

3. They then scrub my name from all the candy wrappers and replace it with their name.

4. They put out all the candy, pretending it is their candy.

This analogy is much more accurate than either mischaracterisation in this thread:

1. I have no objection to the other resident using me as an unlimited source of candy.

2. I object only to them obfuscating their source of candy, instead misrepresenting the candy as their own!

Because, you see, no one cared when search engines directed candy-hunters to your door. No one cared when search engines presented the candy with your name still on it.

The whole issue, which is unaddressed by your post, is scrubbing the attribution, and then re-attributing the candy.

lblume 2 days ago | parent | prev | next [-]

> these companies are the equivalent of the asshole that dumps the whole bowl into their bag

In most cases, they aren't? You can still access a website that is being crawled for the purpose of training LLMs. Sure, DoS exists, but it seems not to be enough of a problem to cause widespread outages of websites.

rangerelf 2 days ago | parent [-]

A better analogy is that LLM crawlers are candy store workers going through the houses grabbing free candy and then selling it in their own shop.

Scalpers. Knowledge scalpers.

horsawlarway 2 days ago | parent [-]

Except nothing is actually taken.

It's copied.

If your goal in publishing the site is to drive eyeballs to it for ad revenue... then you probably care.

If your goal in publishing the site is just to let people know a thing you found or learned... that goal is still getting accomplished.

For me... I'm not in it for the fame or money, I'm fine with it.

allturtles 2 days ago | parent | next [-]

I think you're missing a middle ground, of people who want to let people know a thing they found or learned, and want to get credit for it.

Among other things, this motivation has been the basis for pretty much the entire scientific enterprise since it started:

> But that which will excite the greatest astonishment by far, and which indeed especially moved me to call the attention of all astronomers and philosophers, is this, namely, that I have discovered four planets, neither known nor observed by any one of the astronomers before my time, which have their orbits round a certain bright star, one of those previously known, like Venus and Mercury round the Sun, and are sometimes in front of it, sometimes behind it, though they never depart from it beyond certain limits. [0]

[0]: https://www.gutenberg.org/cache/epub/46036/pg46036-images.ht...

bbarnett 2 days ago | parent | prev | next [-]

It's a very simple metric. They had nothing of value, no product, no marketable thing.

Then they scanned your site. They had to, along with others. And in scanning your site, they scanned the results of your work, effort, and cost.

Now they have a product.

I need to be clear here, if that site has no value, why do they want it?

Understand, these aren't private citizens. A private citizen might print out a recipe, who cares? They might even share that with friends. OK.

But if they take it, then package it, then make money? That is different.

In my country, copyright doesn't really punish a person. No one gets hit for copying movies even. It does punish someone, for example, copying and then reselling that work though.

This sort of thing should depend on who's doing it. Their motive.

When search engines were operating an index, nothing was lost. In fact, it was a mutually symbiotic relationship.

I guess what we should really ask, is why on Earth should anyone produce anything, if the end result is that no one sees it?

And instead, they just read a summary from an AI?

No more website, no new data, means no new AI knowledge too.

horsawlarway 2 days ago | parent | next [-]

I guess I don't derive my personal value from the esteem of others.

And I don't mean that as an insult, because I get that different people do things for different reasons, and we all get our dopamine hits in different ways.

I just think that if the only reason you choose to do something is because you think it's going to get attention on the internet... Then you probably shouldn't be doing that thing in the first place.

I produce things because I enjoy producing them. I share them with my friends and family (both in person and online). That's plenty. Historically... that's the norm.

> I guess what we should really ask, is why on Earth should anyone produce anything, if the end result is that no one sees it?

This is a really rather disturbing view of the world. Do things for you. I make things because I see it. My family sees it. My friends see it.

I grow roses for me and my neighbors - not for some random internet credit.

I plant trees so my kids can sit under them - not for some random internet credit.

bbarnett 2 days ago | parent | next [-]

Context. Note that we're having a discussion about people putting up websites, and being upset about AI snarfing that content.

> I guess what we should really ask, is why on Earth should anyone produce anything, if the end result is that no one sees it?

>

> And instead, they just read a summary from an AI?

The above is referring to that context: to people wanting others to see things, which, after all, is what this whole website and this person's concerns are about.

So now that this is reiterated, in the context of someone wanting to show things to the world, why would they produce -- if their goal is lost?

This doesn't mean they don't do things privately for their friends and family. This isn't a binary, 0/1 solution. Just because you have a website for "all those other people" to see, doesn't mean you don't share things between your friends and family.

So what you seem to dislike is that anyone does it at all. Because again, people writing for eyeballs at large doesn't mean they aren't separately writing for their friends or family.

It seems to me that you're also creating a schism between "family / friends" and "all those other people". Naturally you care for those close to you, but "those other people" are people too.

And some people just see people as... people. People to share things with.

Yet you seem to be making that a nasty, dirty thing.

horsawlarway 2 days ago | parent [-]

And the content is still there for those people.

The only folks who miss it are the ones who choose to use an llm instead of looking for something different.

I guess my opinion is that you can't "make the horse drink". So instead focus on the groups that care enough to go find your content.

Those people still exist.

If the only joy you got was "the number of people who look at me!"... Then yes, that number is probably going to go down. But I also really do think that's a generally bad reason to be doing an activity.

Again, personalities vary, and I won't deny people (pretty much all of us) crave that type of attention in some form or another. I just think, socially speaking, we're better off with less of that right now.

Anamon a day ago | parent | prev [-]

You conflate doing something with sharing it online. A lot of people do things for themselves, then they post about it and share it because they like the idea of someone else enjoying and getting something out of it. The thing LLMs might get them to stop doing is not the doing of the thing, but the sharing, to the detriment of everyone who actually would have liked to see it.

And no, people sticking to the LLM summary won't get the ideas I shared. They get a crappy, broken, incoherent, messed-up, bland, averaged version of it. Purified of all the personality, insight and thought it might have had in it. That's why people getting an LLM summary partially derived from their data will never seem like a suitable replacement to someone who does it not for the views or credits, but because they actually want to share something of themselves.

I do agree that the solution would best come from the demand side: people realising the inherent blandness and horseshitness of LLM replies, especially when compared to something written by an actual human with thought and intent, ditching the low-quality LLM turds and demanding real content again. The problem I see right now is that pretty much everyone would prefer the human version to the slop, but the megacorps force-feed the slop and spend billions trying to make it as inconvenient as possible to interact with other humans.

shkkmo 2 days ago | parent | prev [-]

> But if they take it, then package it, then make money? That is different

But still, also legal.

You can't copyright a recipe itself, just the fluff around it. It is totally legal for someone to visit a bunch of recipe blogs, copy the recipes, rewrite the descriptions and detailed instructions, and then publish that in a book.

This is essentially the same as what LLMs do. So prohibiting this would be a dramatic expansion of the power of copyright.

Personally, I don't use LLMs. I hope there will always be people like me that want to see the original source and verify any knowledge.

I'm actually hopeful that the LLM-driven reduction in search traffic will hurt the profitability of the SEO clickbait referral-link garbage sites that now dominate the results of many searches. We'll be left with enthusiasts producing content for the joy of nerding out again. Those sites will still have a following of actually interested people, and the rest can consume the soulless summaries from the eventually ad-infested LLMs.

bbarnett 2 days ago | parent [-]

It may be legal in your jurisdiction, but I think this is a more generic conversation than the specific class of work being copied. And further, my point is also that other parts of copyright law, at least where I live, view "for profit copying" and "some dude wanting to print out a webpage" entirely differently.

I feel it makes sense.

Amusingly, I feel that an ironic twist would be a judgement that all currently trained LLMs would be unusable for commercial use.

shkkmo 2 days ago | parent [-]

> other parts of copyright law, at least where I live, view "for profit copying" and "some dude wanting to print out a webpage" entirely differently.

I don't know what your jurisdiction is; however, through treaties, much of how USA copyright law works has been exported to many other countries, so it is a reasonable place to base the discussion.

In the USA, commercial vs. non-commercial is not sufficient to determine if copying violates copyright law. It is one of several factors used to determine "fair use", and while it definitely helps, non-commercial use can easily infringe (torrents) and commercial use can be fine (telephone book white pages).

> a judgement that all currently trained LLMs, would be unusable for commercial use

I sure hope not. I don't like or use LLMs but I also don't like copyright law and I hate to see it receive such an expansion of power.

bbarnett 2 days ago | parent [-]

> much of how USA copyright law works has been exported to many other countries

I'm not blaming you for bringing it up, however I did make it clear that I was speaking of a different jurisdiction. And yes, of course you're right, it's always a "big deal" when trade negotiations come up.

Canada has multiple different things in play to protect the individual. The non-profiting dude. Fair use is one, far expanded. Notice-and-notice is another, which currently means you have to pay to send an 'infringed' notice to people, as a copyright owner. Damages are also capped, at an amount that makes legal action untenable for most. And the bar of proof is significantly higher.

And that's for torrents.

For years we've had things like "you pay a tiny tax on hard drives", but then "that means you've already paid for anything you'll ever copy", and the tax goes into a fund to pay Canadian artists. While this may seem strange, it's one solution we've had to help keep art alive while not punishing the average citizen with crazy lawsuits and insane attacks from massive law firms.

Essentially, we don't let the US bully us into agreements which are massively harmful to our citizens.

But back to the LLM side. I see the current situation as a weakening of copyright law, a massive one. And not for the average joe, but instead for the most commercial of entities.

I want copyright law, in some circumstances, to be weakened for people. Not companies. They get to pay artists. Creators. Developers.

And of course, there'd be no GPL without copyright law. So while I agree for individuals, especially in the US, copyright law is very annoying and a problem? Let's again focus on what I'm saying.

It currently isn't, and doesn't have to be, an absolute.

We can have, and as we've both discussed already do have, different outcomes for copyright, e.g. for fair use and for breach, for corporations/for-profits versus just some person. So let's stop talking about copyright being stronger/weaker as a generic, and talk specifics.

I support weaker outcomes of breach, and enhanced fair use for people.

I support stronger outcomes of breach, and so forth for companies.

Further, I support sliding scales too. A one person youtuber isn't the same as a 10B company. A person playing parts of one song in their video for a few seconds, as a one person corp, isn't the same as an entity scanning all of humankind's knowledge and laughing in our faces.

Huge differences of scale and scope.

Look at it this way. Some of these companies have downloaded torrents. If a person did what they did, they'd receive billions in fines!!

Yet they're getting a lesser outcome, as in freaking nothing.

It's the wrong place for copyright weakening.

shkkmo 2 days ago | parent [-]

> I see the current situation as a weakening of copyright law, a massive one. And not for the average joe, but instead for the most commercial of entities.

You're gonna have to explain this in more detail, because it isn't clear to me how you justify this claim. What exactly is being weakened? In what way?

> Some of these companies have downloaded torrents. If a person did what they did, they'd receive billions in fines!!

The one I am assuming you are referring to is Meta, and they are getting sued. They arguably should be facing criminal charges too under current law.

> Yet they're getting a lesser outcome, as in freaking nothing.

That court case hasn't finished and that doesn't have anything directly to do with LLMs but with our legal system and power/wealth imbalances.

> And of course, there'd be no GPL without copyright law.

I personally strongly prefer MIT to GPL. GPL sort of makes sense as a reaction to copyright law but I don't think GPL justifies the existence or state of copyright law.

> Further, I support sliding scales too.

What does that mean? Just the fines/judgements? Because along with having to pay, the activity itself must be stopped.

If copyright only prohibited larger entities from copying, it would be less onerous and would make copyright more tolerable, but I don't think that would solve the AI training issue in any way and seems like a tangent.

> an entity scanning all of humankind's knowledge and laughing in our faces.

Knowledge is not copyrightable. If you want to stop this, expanding the power of copyright to make learning/knowing something an infringing activity is one of the worst possible ways to go about it.

rangerelf a day ago | parent [-]

> The one I am assuming you are referring to is Meta, and they are getting sued. They arguably should be facing criminal charges too under current law.

I think your assumption falls short: it's not just Meta; it's OpenAI, Anthropic, Google, Microsoft, and others.

Like you said, the court case hasn't finished, but there's meddling from the White House already; I really doubt there's going to be any fair play in this case.

lelanthran 2 days ago | parent | prev | next [-]

> If your goal in publishing the site is just to let people know a thing you found or learned... that goal is still getting accomplished.

I like how you posted so many times in this thread, with the assertion that that is the goal of people giving away stuff for free.

Your responses in this thread are almost textbook example of Strawman Argument; you could not do a better Strawman Argument even if you tried!

CJefferson 2 days ago | parent | prev [-]

It's absolutely fine for you to be fine with it. What is nonsense is how copyright laws have been so strict, and suddenly AI companies can just ignore everyone's wishes.

horsawlarway 2 days ago | parent [-]

Hey - no argument here.

I don't think the concept of copyright itself is fundamentally immoral... but it's pretty clearly a moral hazard, and the current implementation is both terrible at supporting independent artists, and a beat stick for already wealthy corporations and publishers to use to continue shitting on independent creators.

So sure - I agree that watching the complete disregard for copyright is galling in its hypocrisy, but the problem is modern copyright, IMO.

...and maybe also capitalism in general and wealth inequality at large - but that's a broader, complicated, discussion.

reactordev 2 days ago | parent | prev [-]

More like when the project kids show up in the millionaire neighborhood because they know they’ll get full size candy bars.

It’s not that there’s none for the others. It’s that there was this unspoken agreement, reinforced by the last 20 years, that website content is protected speech, protected intellectual property, and is copyrightable to its owner/author. Now, that trust and good faith is broken.

account42 2 days ago | parent [-]

Ah yes, of course: the poor, poor AI companies, getting scraps from the greedy independent website operators.

pyrale 3 days ago | parent | prev | next [-]

I’m not sure that the issue is just a technical distinction between humans and bots.

Rather it’s about promoting a web serving human-human interactions, rather than one that exists only to be harvested, and where humans mostly speak to bots.

It is also about not wanting a future where the bot owners get extreme influence and power. Especially the ones with mid-century middle-europe political opinions.

reactordev 2 days ago | parent | prev [-]

Security through obscurity is no security at all…

ryao 2 days ago | parent | prev | next [-]

If you want a good example of a site with a theme switcher:

https://www.csszengarden.com/pages/alldesigns/

Halian 2 days ago | parent | prev | next [-]

Use Anubis or, like Xkeeper of The Cutting Room Floor has done, block the major Chinese cloud providers.

aledalgrande 2 days ago | parent | prev | next [-]

The Netscape theme is my favorite. Love the pixel-y cursor animation

2 days ago | parent | prev | next [-]
[deleted]
mclau157 2 days ago | parent | prev | next [-]

Homestar Runner had a theme switcher

lrivers 2 days ago | parent | prev | next [-]

Points off for lack of blink tag. Do better

amelius 3 days ago | parent | prev [-]

The theme switcher uses local storage as a kind of cookie (19 bytes for something that could fit in 1 byte). Kind of surprised they don't show the cookie banner.

Just a remark, nothing more.

PS, I'm also curious why the downvotes for something that appears to be quite a conversation starter ...

athenot 3 days ago | parent | next [-]

You don't need the cookie banner for cookies that are just preferences and don't track users.

dotancohen 3 days ago | parent | next [-]

Which is why calling it the cookie banner is a diversion tactic by those who are against the privacy assurances of the GDPR. There is absolutely no problem with cookies. The problem is with the tracking.

root_axis 3 days ago | parent | next [-]

It's called a cookie banner because only people using cookies to track users need them. If you're using localstorage to track users, informed consent is still required, but nobody does that because cookies are superior for tracking purposes.

madeofpalk 2 days ago | parent [-]

> If you're using localstorage to track users [...] but nobody does

I promise you every adtech/surveillance js junk absolutely is dropping values into local storage you remember you.

root_axis 2 days ago | parent [-]

They are, but without cookies nearly all of the value disappears because there is no way to correlate sessions across domains. If commercesite.com and socialmediasite.com both host a tracking script from analytics.com that sets data in localstorage, there is no way to correlate a user visiting both sites with just the localstorage data alone - they need cookies to establish the connection between what appears to be two distinct users.

reactordev 3 days ago | parent | prev | next [-]

Our problem is with tracking. Their problem is that other companies are tracking. So let's stop the other companies from tracking, since we can track directly from our browser. GDPR requires a cookie banner to scare people into blocking cookies…

There, now only our browser can track you and only our ads know your history…

We’ll get the other two to also play along, throw money at them if they refuse, I know our partner Fruit also has a solution in place that we could back-office deal to share data.

bigstrat2003 3 days ago | parent | prev [-]

You're assuming bad intent where there are multiple other explanations. I call it the cookie banner and I don't run a web site at all (so, I'm not trying to track users as you claim).

dotancohen 3 days ago | parent | next [-]

You call it the cookie banner because you've been hearing it regularly referred to as the cookie banner. It was the constant repetition of calling it the cookie banner that confuses people into thinking the issue is about cookies, and not about tracking.

bigstrat2003 3 days ago | parent [-]

So, by your own admission, calling it the cookie banner is not only "a diversion tactic by those who are against the privacy assurances of the GDPR". My only point is that you were painting with an overly broad brush and saying someone is a bad actor if they call it the cookie banner, which is demonstrably not the case.

dotancohen 2 days ago | parent [-]

I admit nothing, because I am not partaking in a contentious argument.

However I could have better phrased my original comment with the word "was" instead of "is".

3 days ago | parent | prev | next [-]
[deleted]
3 days ago | parent | prev [-]
[deleted]
mhitza 3 days ago | parent | prev [-]

Or for cookies that are required for the site to function.

On a company/product website you should still inform users about them for the sake of compliance, but it doesn't have to be an intrusive panel/popup.

sensanaty 2 days ago | parent [-]

> On a company/product website you should still inform users about them for the sake of compliance

No? GitHub, for example, doesn't have a cookie banner. If you wanna be informative you can disclose which cookies you're setting, but if they're not used for tracking purposes you don't have to disclose anything.

Also, again, it's not a "cookie" banner, it's a consent banner. The law says nothing about the storage mechanism, as it's irrelevant; it mentions cookies twice, as examples of storage mechanisms (and lists a few others, like localStorage).

ProZsolt 3 days ago | parent | prev | next [-]

You don't have to show the cookie banner if you don't use third party cookies.

The problem with third-party cookies is that they can track you across multiple websites.

account42 2 days ago | parent [-]

Wrong, you also need to ask permission before using first-party tracking cookies.

reactordev 3 days ago | parent | prev | next [-]

Because she’s using local storage…?

If you don’t use cookies, you don’t need a banner. 5D chess move.

root_axis 3 days ago | parent | next [-]

There's no distinction between localstorage and cookies with respect to the law, what matters is how it is used. For something like user preferences (like the case with this blog) localstorage and cookies are both fine. If something in localstorage were used to track a user, then it would require consent.

roywashere 3 days ago | parent | prev | next [-]

That is not how it works. The ‘cookie law’ is not about the cookies, it is about tracking. You can store data in cookies or in local storage just fine, for instance for a language switcher or a theme setting like here without the need for a cookie banner. But if you do it for ads and tracking, then this does require consent and thus a ‘cookie banner’. The storage medium is not a factor.

amelius 3 days ago | parent | prev [-]

Sounds to me like a loophole in the law then. Which would be surprising, too, since it's not easy to overlook.

dkersten 3 days ago | parent | next [-]

The law is very clear, if you actually read it. It doesn't care what technology you use: cookies, localstorage, machine fingerprints, something else. It doesn't care. It cares about collecting, storing, tracking, and sharing user data.

You can use cookies, or local storage, or anything you like when its not being used to track the user (eg for settings), without asking for consent.

alternatex 3 days ago | parent | prev | next [-]

LocalStorage is per host though. You can't track people using LocalStorage, right?

reactordev 3 days ago | parent [-]

LocalStorage is per client, per host. You generally can't track people using LocalStorage without some server or database on the other side to synchronize the different client hosts.

GDPR rules are around tracking of personal preferences, not site settings (though it's grey whether a theme preference is a personal one or a site one).

root_axis 2 days ago | parent [-]

> though it's grey whether a theme preference is a personal one or a site one

In this case it's not grey since the information stored can't possibly be used to identify particular users or sessions.

reactordev 3 days ago | parent | prev [-]

It’s not a loophole. localStorage is just that, local. Nothing is shared. No thing is “tracked” beyond your site preferences for reading on that machine.

I say it’s a perfect application of how to keep session data without keeping session data on the server, which is where GDPR fails. It assumes cookies. It assumes a server. It assumes that you give a crap about the contents of said cookie data.

In this case, no. Blast it away, the site still works fine (albeit with the default theme). This. Is. Perfect.

dkersten 2 days ago | parent | next [-]

> which is where GDPR fails. It assumes cookies.

It does not assume anything. GDPR is technology agnostic. GDPR only talks about consent for data being processed, where 'processing' is defined as:

    ‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;
(From Article 4.2)

The only place cookies are mentioned is as one example, in recital 30:

    Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags. This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.
reactordev 2 days ago | parent [-]

>GDPR only talks about consent for personal data being processed

Emphasis, mine. You are correct. For personal data. This is not personal data. It's a site preference that isn't personal, other than whether you like dark mode or not.

dkersten 2 days ago | parent [-]

I was responding to this bit:

> It assumes cookies. It assumes a server.

sensanaty 2 days ago | parent | prev | next [-]

> It assumes cookies.

How can people still be this misinformed about GDPR and the ePrivacy law? It's been years, and on this very website I see this exact interaction where someone is misinterpreting GDPR and gets corrected constantly.

0x073 3 days ago | parent | prev [-]

GDPR doesn't assume cookies; if you misuse local storage you also need confirmation.

reactordev 3 days ago | parent [-]

Only if you are storing personal information: email, name, unique ID.

Something as simple as "blue" doesn't qualify.

dkersten 2 days ago | parent [-]

Correct. But you can also use cookies for that, without violating GDPR or the ePrivacy directive.

reactordev 2 days ago | parent [-]

Then you have the problem of some users blocking cookies at the browser level. LocalStorage is a perfect application for this use case.

account42 2 days ago | parent [-]

Or maybe you could respect those users' preference not to have shit stored for your website.

the_duke 3 days ago | parent | prev | next [-]

You only need cookie banners for third parties, not for your own functionality.

root_axis 3 days ago | parent [-]

GDPR requires informed consent for tracking of any kind, whether that's 3rd party or restricted to your own site.

input_sh 2 days ago | parent [-]

Incorrect. GDPR requires informed consent to collect personally identifiable information, but you can absolutely run your own analytics that only saves the first three octets of an IP address without needing to ask for consent.

Enough to know the general region of the user, not enough to tie any action to an individual within that region. Therefore, not personally identifiable.
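
A sketch of that anonymization step (illustrative; not tied to any particular analytics tool):

    # Keep only the first three octets of an IPv4 address before logging.
    def anonymize(ip: str) -> str:
        return ".".join(ip.split(".")[:3]) + ".0"

    assert anonymize("203.0.113.42") == "203.0.113.0"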

Of course, you also cannot have user authentication of any kind without storing PII (like email addresses).

root_axis 2 days ago | parent [-]

You've stretched the definition of tracking for your hypothetical. If you can't identify the user/device then you're not tracking them.

input_sh 2 days ago | parent [-]

I literally worked with digital rights lawyers to build a tool to exercise your GDPR rights, but sure, call it a hypothetical.

root_axis 2 days ago | parent [-]

It's literally a hypothetical situation you introduced for the sake of discussion. "Hypothetical" doesn't mean it doesn't happen in real life, the whole purpose of a hypothetical is to model reality for the sake of analysis.

lucideer 3 days ago | parent | prev | next [-]

You don't need a banner just because you use cookies. You only need a banner if you store data about a user's activity on your server. This is usually done using cookies, but the banners are neither specific to cookies nor inherently required for all cookies.

---

Also: in general, the banners are not required at all at the EU level (though some individual countries have implemented narrower local rules related to banners). The EU regs only state that you need to facilitate informed consent in some form; how you do that in your UI is not specified. Most have chosen to do it via annoying banners, mostly due to misinformation about how narrow the regs are.

rafram 3 days ago | parent | prev | next [-]

19 whole bytes!

hju22_-3 3 days ago | parent | prev [-]

I'd guess it's because, by technicality, it's not a cookie, and so isn't required.

account42 2 days ago | parent [-]

Instead of guessing you could inform yourself.