| ▲ | End of an era for me: no more self-hosted git(kraxel.org) |
| 163 points by dzulp0d 16 hours ago | 104 comments |
| |
|
| ▲ | kstrauser 2 hours ago | parent | next [-] |
| I cut traffic to my Forgejo server from about 600K requests per day to about 1000: https://honeypot.net/2025/12/22/i-read-yann-espositos-blog.h... 1. Anubis is a miracle. 2. Because most scrapers suck, I require all requests to include a shibboleth cookie, and if they don’t, I set it and use JavaScript to tell them to reload the page. Real browsers don’t bat an eye at this. Most scrapers can’t manage it. (This wasn’t my idea; I link to the inspiration for it. I just included my Caddy-specific instructions for implementing it.) |
| |
| ▲ | QuiDortDine 43 minutes ago | parent [-] | | I remember back when Anubis came out, some naysayers on here were saying it wouldn't work for long because the scrapers would adapt. Turns out careless, unethical vibecoders aren't very competent. | | |
| ▲ | wolfi1 9 minutes ago | parent [-] | | "Turns out careless, unethical vibecoders aren't very competent." well, they rely on AI, don't they? and AI is trained with already existing bad code, so why should the outcome be different? |
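The shibboleth-cookie check kstrauser describes at the top of this thread can be sketched roughly as follows. This is not the Caddy configuration from the linked post: it is a framework-free Python illustration, and the cookie name, cookie value, and the `gate` function are all hypothetical.

```python
# Hypothetical cookie name and value; any fixed pair would do for this sketch.
COOKIE_NAME = "shibboleth"
COOKIE_VALUE = "1"

# Tiny page that asks a JavaScript-capable client to retry the same URL.
RELOAD_PAGE = "<!doctype html><script>location.reload()</script>"


def gate(cookies):
    """Decide what to do with a request, given its cookies as a dict.

    Returns (pass_through, extra_headers, body). When pass_through is True
    the request should be forwarded to the real application unchanged.
    """
    if cookies.get(COOKIE_NAME) == COOKIE_VALUE:
        return True, {}, None
    # No cookie yet: set it and serve the reload page instead of the content.
    headers = {"Set-Cookie": f"{COOKIE_NAME}={COOKIE_VALUE}; Path=/; SameSite=Lax"}
    return False, headers, RELOAD_PAGE
```

A browser gets the reload page once, stores the cookie, and reloads transparently. A scraper that neither keeps cookies nor runs JavaScript only ever sees the reload page.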
|
|
|
| ▲ | moebrowne 3 hours ago | parent | prev | next [-] |
| This kind of thing can be mitigated by not publishing a page/download for every single branch, commit and diff in a repo. Make only the HEAD of each branch available. Anyone who wants more detail has to clone it and view it with their favourite git client. For example https://mitxela.com/projects/web-git-sum (https://git.mitxela.com/) |
| |
| ▲ | Imustaskforhelp 3 hours ago | parent [-] | | I got another interesting idea from this and another comment but what if we combine this with ssh git clients/websites with the normal ability. maybe something like https://ssheasy.com/ or similar could also be used? or maybe even a gotty/xterm instance which could automatically ssh/get a tui like interface. I feel as if this would for all scrapers be enough? |
|
|
| ▲ | kristjank 3 hours ago | parent | prev | next [-] |
| Is there a way to block it by shibboleth? Curious, since the recent Google hack where you add -(n-word) to the end of your query so the AI automatically shuts down works like a charm. |
| |
|
| ▲ | data-ottawa 14 hours ago | parent | prev | next [-] |
| Does anyone know what's the deal with these scrapers, or why they're attributed to AI? I would assume any halfway competent LLM driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors. Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I have not worked on bot detection the last few years, but it was very common for residential proxy based scrapers to hammer sites for years, so I'm wondering what's different. |
| |
| ▲ | embedding-shape 3 hours ago | parent | next [-] | | I just threw up a public Forgejo instance for some lightweight collaboration. About 2 minutes after the certificate was created, they started going through every commit and so on from the two repositories I had added; I'm guessing they picked up the instance from the certificate transparency logs. Watched it for a while, thinking eventually it'd end. It didn't; it seemed like Claudebot and GPTBot (which were the only two I saw, but could have been forged) went over the same URLs over and over again. They tried a bunch of search queries too at the same time. The day after I got tired of seeing it so added a robots.txt forbidding any indexing. Waited a few hours, saw that they were still doing the same thing, so threw up basic authentication with `wiki:wiki` as the username:password basically, wrote the credentials on the page where I linked it and as expected they stopped trying after that. They don't seem to try to bypass anything, whatever you put in front will basically defeat them except blocking them by user-agent, then they just switch to a browser-like user-agent instead, which is why I went the "trivial basic authentication" path instead. Wasn't really an issue, just annoying when they try to masquerade as normal users. Had the same issue with a wiki instance, added rate limits and eventually they seemingly backed off more than my limits were set to, so I guess they eventually got it. Just checked the logs and seems they've stopped trying completely. It seems like people who are paying for their hosting by usage (which never made sense to me) are the ones hit hard by this. I'm hosting my stuff on a VPS, and don't understand what the big issue is, worst case scenario I'd add more aggressive caching and it wouldn't be an issue anymore. | | |
| ▲ | rozab 3 hours ago | parent | next [-] | | I had the same issue when I first put up my gitea instance. The bots found the domain through cert registration in minutes, before there were any backlinks. GPTbot, ClaudeBot, PerplexityBot, and others. I added a robots.txt with explicit UAs for known scrapers (they seem to ignore wildcards), and after a few days the traffic died down completely and I've had no problem since. Git frontends are basically a tarpit so are uniquely vulnerable to this, but I wonder if these folks actually tried a good robots.txt? I know it's wrong that they ignore wildcards, but it does seem to solve the issue | | |
| ▲ | stefanka 6 minutes ago | parent | next [-] | | Where does one find a good robots.txt? Are there any well maintained out there? | |
| ▲ | trillic 2 hours ago | parent | prev | next [-] | | I will second a good robots.txt. Just checked my metrics and < 100 requests total to my git instance in the last 48 hours. Completely public, most repos are behind a login but there are a couple that are public and linked. | |
| ▲ | bob1029 3 hours ago | parent | prev [-] | | > I wonder if these folks actually tried a good robots.txt? I suspect that some of these folks are not interested in a proper solution. Being able to vaguely claim that the AI boogeyman is oppressing us has turned into quite the pastime. |
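For anyone wanting to try rozab's suggestion, here is a minimal sketch of a robots.txt with one explicit group per scraper instead of a wildcard, checked with Python's standard `urllib.robotparser`. The agent names are the ones mentioned in this thread. Whether a given bot actually honours its group is an assumption that only your own logs can confirm, and the helper names are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Product tokens taken from user-agent strings quoted in this thread.
SCRAPER_AGENTS = [
    "GPTBot",
    "ClaudeBot",
    "PerplexityBot",
    "Amazonbot",
    "PetalBot",
    "meta-externalagent",
]


def build_robots_txt(agents):
    """Emit one 'User-agent / Disallow: /' group per agent, no wildcard group."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, product_token, path):
    """Ask the stdlib parser whether a crawler token may fetch a path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(product_token, path)


ROBOTS_TXT = build_robots_txt(SCRAPER_AGENTS)
```

With this file, `is_allowed(ROBOTS_TXT, "GPTBot", "/repo/commit/abc")` is false, while a token that is not listed remains allowed because there is no `*` group. The check only models a well-behaved parser; it says nothing about bots that ignore robots.txt entirely.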
| |
| ▲ | Fabricio20 2 hours ago | parent | prev | next [-] | | Since you had the logs for this, can you confirm the IP ranges they were operating from? You mention "Claudebot and GPTBot" but I'm guessing this is based off of the user-agent presented by the scrapers and could easily be faked to shift blame. I genuinely doubt Anthropic and such would be running scrapers that are this badly written/implemented, it doesnt make economic sense. I'd love to see some of the web logs from this if you'd be willing to share! I feel like this is just some of the old scraper bots now advertising themselves as AI bots to shift blame into the AI companies. | | |
| ▲ | Tharre 2 hours ago | parent [-] | | There are a bit too many IPs to list but from my logs they're always of the form 74.7.2XX.* for GPTBot, matching OpenAI's published IP ranges[0]. So yes, they are definitely running scrapers that are this badly written. Also old scraper bots trying to disguise themselves as GPTBot seems wholly unproductive, they try to imitate users, not bots. [0] https://openai.com/gptbot.json |
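The check Tharre describes, comparing a client address against an operator's published ranges, can be sketched with the standard `ipaddress` module. The prefixes below are RFC 5737 documentation ranges used purely as placeholders, not OpenAI's real ranges, and the function name is hypothetical. In practice you would load the CIDR list from the published file yourself.

```python
import ipaddress

# Placeholder prefixes (RFC 5737 documentation ranges), NOT real crawler ranges.
PLACEHOLDER_PREFIXES = ["192.0.2.0/24", "198.51.100.0/24"]


def in_published_ranges(ip, prefixes):
    """Return True if ip falls inside any of the given CIDR prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in prefixes)
```

A request whose user-agent claims to be a named bot but whose address is outside that bot's published ranges is likely forged; one inside the ranges supports the attribution.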
| |
| ▲ | Imustaskforhelp 3 hours ago | parent | prev [-] | | Huh, I had a gitea instance in the public web on one of my netcup vps's. I didn't set any logs and was using cloudflare tunnels (with a custom bash script which makes cf tunnels expose PORT SUBDOMAIN). Maybe its time for me to go ahead and start it again with logs to see if there are any logs. I will maybe test it in all three 1) With CF tunnels + AI Block, 2) Only CF tunnels, 3) On a static IP directly. Maybe you can try the experiment too and we can compare our findings (also saying because I am lazy and I had misconfigured that cf tunnel so when it quit, I was too lazy to restart the vps given I just use it as a playground and just wanted to play around self hosting but maybe I will do it again now) |
| |
| ▲ | jillesvangurp 3 hours ago | parent | prev | next [-] | | Using an LLM to ponder responses for requests is way too costly and slow. Much easier to just use the shotgun approach and fire off a lot of requests and deal with whatever bothers to respond. This btw is nothing new. Way back when I still used wordpress, it was quite common to see your server logs filling up with bots trying to access endpoints for commonly compromised php thingies. Probably still a thing but I don't spend a lot of time looking at logs. If you run a public server, dealing with maliciously intended but relatively harmless requests like that is just what you have to do. Stuff like that is as old as running stuff on public ports is. And the offending parties writing sloppy code that barely works is also nothing new. AI opportunism certainly has added a bit of opportunistic bot and scraper traffic but it doesn't actually change the basic threat model in any fundamental way. Previously version control servers were relatively low value things to scrape. But code just became interesting for LLMs to train on. Anyway, having any kind of thing responding on any port just invites opportunistic attempts to poke around. Anything that can be abused for DOS purposes might get abused for exactly that. If you don't like that, don't run stuff on public servers or protect them properly. Yes this is annoying and not necessarily easy. Cloud based services exist that take some of that pain away. Logs filling up with 404, 401, or 400 responses should not kill your server. You might want to implement some logic that tells repeat offenders 429 (too many requests). A bit heavy handed but why not. But if you are going to run something that could be used to DOS your server, don't be surprised if somebody does that. | |
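A rough sketch of the "tell repeat offenders 429" logic jillesvangurp mentions: count 4xx responses per client in a sliding window and start throttling once a threshold is crossed. The class name, thresholds, and in-memory storage are assumptions for illustration. As others in this thread point out, per-IP counting helps little against scrapers rotating through large residential proxy pools.

```python
from collections import defaultdict, deque


class ErrorRateLimiter:
    """Track recent 4xx responses per IP and flag repeat offenders."""

    def __init__(self, max_errors=20, window_seconds=60.0):
        self.max_errors = max_errors
        self.window = window_seconds
        self._errors = defaultdict(deque)  # ip -> timestamps of 4xx responses

    def record(self, ip, status, now):
        """Remember a client error; 429s themselves are not counted."""
        if 400 <= status < 500 and status != 429:
            self._errors[ip].append(now)

    def should_throttle(self, ip, now):
        """True if ip produced at least max_errors errors within the window."""
        timestamps = self._errors.get(ip)
        if not timestamps:
            return False
        # Drop entries that have aged out of the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        return len(timestamps) >= self.max_errors
```

The caller passes its own clock value as `now` (for example `time.monotonic()`), and answers with 429 whenever `should_throttle` returns true before doing any expensive work.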
| ▲ | simonw 14 hours ago | parent | prev | next [-] | | I would love to understand this. Just a few years ago badly behaved scrapers were rare enough not to be worth worrying about. Today they are such a menace that hooking any dynamic site up to a pay-to-scale hosting platform like Vercel or Cloud Run can trigger terrifying bills on very short notice. "It's for AI" feels like lazy reasoning for me... but what IS it for? One guess: maybe there's enough of a market now for buying freshly updated scrapes of the web that it's worth a bunch of chancers running a scrape. But who are the customers? | | |
| ▲ | SCHiM 3 hours ago | parent | next [-] | | The bar to ingest unstructured data into something usable was lowered, causing more people to start doing it. Used to be you needed to implement some papers to do sentiment analysis. Reasonably high bar to entry. Now anyone can do it, the result: more people doing scraping (in less competent scrapers too). | |
| ▲ | devsda 14 hours ago | parent | prev [-] | | For whatever reason, legislation is lax right now if you claim the purpose of scraping is for AI training even for copyrighted material. May be everyone is trying to take advantage of the situation before law eventually catches up. | | |
| ▲ | Imustaskforhelp 2 hours ago | parent [-] | | > For whatever reason, legislation is lax right now if you claim the purpose of scraping is for AI training even for copyrighted material I think the reason is that America & China for the most part are also in an AI arms race combined with an AI bubble, and neither side would wish to lose literally any perceived advantage no matter the cost to others. Also there is an immense lobbying effort against senators who propose stricter AI regulation. https://www.youtube.com/watch?v=DUfSl2fZ_E8 [What OpenAI doesn't want you to know] It's actually a great watch. Highly recommended because a lot of talk about regulation feels to me like smoke and mirrors. |
|
| |
| ▲ | Tharre 3 hours ago | parent | prev | next [-] | | > Does anyone know what's the deal with these scrapers, or why they're attributed to AI? You don't really need to guess, it's obvious from the access logs. I realize not everyone runs their own server, so here are a couple excerpts from mine to illustrate: - "meta-externalagent/1.1 +https://developers.facebook.com/docs/sharing/webmasters/craw...)" - "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)" - "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36" - "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)" - [...] (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)" And to give a sense of scale, my cgit instance received 37 212 377 requests over the last 60 days, >99% of which are bots. The access.log from nginx grew to 12 GiB in those 60 days. They scrape everything they can find, indiscriminately, including endpoints that have to do quite a bit of work, leading to a baseline 30-50% CPU utilization on that server right now. Oh, and of course, almost nothing of what they are scraping actually changed in the last 60 days, it's literally just a pointless waste of compute and bandwidth. I'm actually surprised that the hosting companies haven't blocked all of them yet, this has to increase their energy bills substantially. Some bots also seem better behaved than others, OpenAI alone accounts for 26 million of those 37 million requests. | |
| ▲ | everforward 2 hours ago | parent | prev | next [-] | | I think it’s a) volume of scrapers, and b) desire for _all_ content instead of particular content, and c) the scrapers are new and don’t have the decades of patches Googlebot et al do. 5 years ago there were few people with an active interest in scraping ForgeJo instances and personal blogs. Now there are a bajillion companies and individuals getting data to train a model or throw in RAG or whatever. Having a better scraper means more data, which means a better model (handwavily) so it’s a competitive advantage. And writing a good, well-behaved distributed scraper is non-trivial. | |
| ▲ | arnarbi 14 hours ago | parent | prev | next [-] | | > why they're attributed to AI? I don’t think they mean scrapers necessarily driven by LLMs, but scrapers collecting data to train LLMs. | |
| ▲ | M95D 11 hours ago | parent | prev | next [-] | | I stopped trying to understand. Encountering a 404 on my site leads directly to a 1 year ban. | | |
| ▲ | embedding-shape 3 hours ago | parent | next [-] | | Damn, as someone who sometimes navigate by guessing URLs and rewriting them manually in the address bar, I hope more don't start doing this, I probably see at least one self-inflicted 404 per day at least. | | | |
| ▲ | tasuki 3 hours ago | parent | prev | next [-] | | Sounds like you're keeping all your URLs alive forever? Commendable! | | | |
| ▲ | octoberfranklin 2 hours ago | parent | prev [-] | | They're rotating through huge pools of residential IP addresses. | | |
| |
| ▲ | themafia 14 hours ago | parent | prev | next [-] | | There's value to be had in ripping the copyright off your stuff so someone else can pass it off as their stuff. LLMs have no technical improvements so all they can do is throw more and more stolen data into it and hope it, somehow, crosses a nebulous "threshold" where it suddenly becomes actually profitable to use and sell. It's a race to the bottom. What's different is we're much closer to the bottom now. | |
| ▲ | octoberfranklin 2 hours ago | parent | prev | next [-] | | I don't think it has anything to do with LLMs. I think the big cloud companies (AWS) figured out that they could scrape compute-intensive pages in order to drive up their customers' spend. Getting hammered? Upgrade to more-expensive instances. Not using cloud yet? We'll force you to. The other possibility is cloudflare punishing anybody who isn't using it. Probably a combination of these two things. Whoever's behind this has ungodly supplies of cheap bandwidth -- more than any AI company does. It's a cloud company. | |
| ▲ | hsuduebc2 14 hours ago | parent | prev | next [-] | | I’m guessing, but I think a big portion of AI requests now come from agents pulling data specifically to answer a user’s question. I don’t think that data is collected mainly for training now but are mostly retrieved and fed into LLMs so they can generate the response. Thus so many repeated requests. | |
| ▲ | danaris 10 hours ago | parent | prev [-] | | > If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors. Right, this is exactly what they are. They're written by people who a) think they have a right to every piece of data out there, b) don't have time (or shouldn't have to bother spending time) to learn any kind of specifics of any given site and c) don't care what damage they do to anyone else as they get the data they crave. (a) means that if you have a robots.txt, they will deliberately ignore it, even if it's structured to allow their bots to scrape all the data more efficiently. Even if you have an API, following it would require them to pay attention to your site specifically, so by (b), they will ignore that too—but they also ignore it because they are essentially treating the entire process as an adversarial one, where the people who hold the data are actively trying to hide it from them. Now, of course, this is all purely based on my observations of their behavior. It is possible that they are, in fact, just dumb as a box of rocks...and also don't care what damage they do. (c) is clearly true regardless of other specific motives. |
|
|
| ▲ | snorremd 3 hours ago | parent | prev | next [-] |
| I've recently been setting up web servers like Forgejo and Mattermost to service my own and friends' needs. I ended up setting up Crowdsec to parse and analyse access logs from Traefik to block bad actors that way. So when someone produces a bunch of 4XX codes in a short timeframe I assume that IP is malicious and can be banned for a couple of hours. Seems to deter a lot of random scraping. Doesn't stop well behaved crawlers though which should only produce 200-codes. I'm actually not sure how I would go about stopping AI crawlers that are reasonably well behaved considering they apparently don't identify themselves correctly and will ignore robots.txt. |
| |
| ▲ | lowdude an hour ago | parent | next [-] | | There was a comment in a different thread that suggested they may respect the robots.txt for the most part, but may ignore wildcards: https://news.ycombinator.com/item?id=46975726 Maybe this is worth trying out first, if you are currently having issues. | |
| ▲ | V__ 3 hours ago | parent | prev [-] | | If possible, I would block by country first. Even on public websites I block Russia/China by default and that reduced port scans etc. On "private" services where I or my friends are the only users, I block everything except my country. |
|
|
| ▲ | krick 13 hours ago | parent | prev | next [-] |
| So, what's up with these bots, why am I hearing about that so often lately? I mean, DDoS attacks aren't a new thing, and, honestly, this is pretty much the reason why Cloudflare even exists, but I'd expect OpenAI bots (or whatever this is now) to be a little bit easier to deal with, no? Like, simply having a reasonably aggressive fail2ban policy? Or do they really behave like a botnet, where each request comes from a different IP on a different network? How? Why? What is this thing? |
| |
| ▲ | wseqyrku 5 hours ago | parent | next [-] | | > this is pretty much the reason why Cloudflare even exists, You said it yourself. If you're selling a cure, you might as well start a plague. | |
| ▲ | recursivecaveat 13 hours ago | parent | prev | next [-] | | I doubt it's OpenAI. Maaaybe somebody who sells to OpenAI, but probably not. I think they're big enough to do this mostly in-house and properly. Before AI only big players would want a scrape of the entire internet, they could write quality bots, cooperate, behave themselves, etc. Now every 3rd tier lab wants that data and a billion startups want to sell it, so it's a wild west of bad behavior and bad implementations. They do use residential IP sets as well. | |
| ▲ | esseph 11 hours ago | parent | prev [-] | | The dirty secret is a lot of them come through "residential proxies", aka backdoored home routers, IoT devices with shitty security, etc. Basically the scrapers, who are often also third parties, go to these "companies" and buy access to these "residential proxies". Some are more... considerate than others. Why? Data. Every bit of it might be valuable. And not to sound tin foil hatty, but we are getting closer to a post-quantum time (if we aren't already). | | |
| ▲ | the_biot 2 hours ago | parent | next [-] | | Has this actually been investigated and proven to be true? I see allegations, but no facts really. It seems to me to be just as likely that people are installing LLM chatbot apps that do the occasional bit of scraping work on the sly, covered by some agreed EULA. | | |
| ▲ | Symbiote 24 minutes ago | parent | next [-] | | Another likely source is "free" VPN tools, or tools for streaming TV (especially football or other pay-to-view stuff). The tool can make a little money proxying requests at the same time. I can't provide evidence as it's close to impossible to separate the AI bots using residential proxies from actual users, and their IPs are considered personal data. But as the other reply shows, it's easy enough to find people selling this service. | |
| ▲ | esseph an hour ago | parent | prev [-] | | Seriously, go to Google. Search for: "residential proxy" ai data scraping. Start reading through thousands of articles. |
| |
| ▲ | tigerlily 8 hours ago | parent | prev [-] | | How can I detect if my router is backdoored, or being used as a residential proxy? | | |
| ▲ | mzajc 3 hours ago | parent | next [-] | | I'm dealing with such attack, so if you'd like, you can send me IPv4 addresses, and I'll grep my logs for them. Email address is on the website linked on my profile. As for what you can do on your own, it really depends on your network. OpenWRT routers can run tcpdump, so you can check for suspicious connections or DNS requests, but it gets really hard to tell if you have lots of cloud-tethered devices at home. IoT, browser extensions, and smartphone applications are the usual suspects. | |
| ▲ | kimos 7 hours ago | parent | prev [-] | | If it’s legit you can ask your ISP if they sell use of your hardware. Or just don’t use the provided hardware and instead BYO router or modem or media converter or whatever. But I think what OP is implying is insecure hardware being infected by malware and access to that hardware sold as a service to disreputable actors. For that buy a good quality router and keep it up to date. | | |
|
|
|
|
| ▲ | devsda 14 hours ago | parent | prev | next [-] |
| At this point, I think we should look at implementing filters that send different response when AI bots are detected or when the clients are abusive. Not just simple response code but one that poisons their training data. Preferably text that elaborates on the anti consumer practices of tech companies. If there is a common text pool used across sites, may be that will get the attention of bot developers and automatically force them to backdown when they see such responses. |
| |
|
| ▲ | t312227 3 hours ago | parent | prev | next [-] |
| hello, as always: imho. (!) idk ... i just put a http basic-auth in front of my gitweb instance years ago. if i really ever want to put git-repositories into the open web again i either push them to some portal - github, gitlab, ... - or start thinking about how to solve this ;)) just my 0.02€ |
| |
| ▲ | madduci 3 hours ago | parent [-] | | I've put everything behind a WireGuard server, so if I need something, I can access it through the VPN and AI can't do anything. |
|
|
| ▲ | bigbuppo 12 minutes ago | parent | prev | next [-] |
| Just another example of AI and its DoSaaS ruining things for everyone. The AI bros just won't accept "NO" for an answer. |
|
| ▲ | vachina 13 hours ago | parent | prev | next [-] |
| Scrapers are relentless but not DDoS levels in my experience. Make sure your caches are warm and responses take no more than 5ms to construct. |
| |
| ▲ | mzajc 3 hours ago | parent | next [-] | | I'm also dealing with a scraper flood on a cgit instance. These conclusions come from just under 4M lines of logs collected in a 24h period. - Caching helps, but is nowhere near a complete solution. Of the 4M requests I've observed 1.5M unique paths, which still overloads my server. - Limiting request time might work, but is more likely to just cause issues for legitimate visitors. 5ms is not a lot for cgit, but with a higher limit you are unlikely to keep up with the flood of requests. - IP ratelimiting is useless. I've observed 2M unique IPs, and the top one from the botnet only made 400 well-spaced-out requests. - GeoIP blocking does wonders - just 5 countries (VN, US, BR, BD, IN) are responsible for 50% of all requests. Unfortunately, this also causes problems for legitimate users. - User-Agent blocking can catch some odd requests, but I haven't been able to make much use of it besides adding a few static rules. Maybe it could do more with TLS request fingerprinting, but that doesn't seem trivial to set up on nginx. | | |
| ▲ | Imustaskforhelp 3 hours ago | parent [-] | | Quick question: the numbers you mention are from a 24h period, but how long will this "attack" continue? This seems to be happening continuously, and I have seen many HN posts like this one (Anubis, IIRC, was created by its author out of the same frustration). Git servers are being scraped to the point that it's effectively a DDoS. | | |
| ▲ | mzajc 2 hours ago | parent [-] | | Yes, the attack is continuous. The rate fluctuates a lot, even within a day. It's definitely an anomaly, because eg. from 2025-08-15 to 2025-10-05 I saw zero days with more than 10k requests. Here's a histogram of the past 2 weeks plus today. 2026-01-28 21'460
2026-01-29 27'770
2026-01-30 53'886
2026-01-31 100'114 #
2026-02-01 132'460 #
2026-02-02 73'933
2026-02-03 540'176 #####
2026-02-04 999'464 #########
2026-02-05 134'144 #
2026-02-06 1'432'538 ##############
2026-02-07 3'864'825 ######################################
2026-02-08 3'732'272 #####################################
2026-02-09 2'088'240 ####################
2026-02-10 573'111 #####
2026-02-11 1'804'222 ##################
| | |
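A per-day histogram like mzajc's can be produced from an access log with a few lines of Python. This is a sketch under assumptions: it expects the timestamp layout of nginx's default "combined" log format (`[07/Feb/2026:12:00:01 +0000]`), and it draws one `#` per 100,000 requests, which appears consistent with the rows above but is inferred rather than stated. The function names are made up.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the date part of a combined-format timestamp such as [07/Feb/2026:12:00:01 +0000].
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def requests_per_day(log_lines):
    """Count log lines per calendar day, keyed by ISO date string."""
    counts = Counter()
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if match:
            day = datetime.strptime(match.group(1), "%d/%b/%Y").date()
            counts[day.isoformat()] += 1
    return counts


def render_histogram(counts, per_mark=100_000):
    """Render 'date count bar' rows, one '#' per per_mark requests."""
    rows = []
    for day in sorted(counts):
        n = counts[day]
        pretty = f"{n:,}".replace(",", "'")  # 1'432'538 style separators
        rows.append(f"{day} {pretty} {'#' * (n // per_mark)}".rstrip())
    return "\n".join(rows)
```

Feeding it the log file object directly (`requests_per_day(open("access.log"))`) streams the file line by line, which matters when the log has grown to several gigabytes.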
|
| |
| ▲ | watermelon0 12 hours ago | parent | prev [-] | | Great, now we need caching for something that's seldom (relatively speaking) used by people. Let's not forget that scrapers can be quite stupid. For example, if you have phpBB installed, which by defaults puts session ID as query parameter if cookies are disabled, many scrapers will scrape every URL numerous times, with a different session ID. Cache also doesn't help you here, since URLs are unique per visitor. | | |
| ▲ | kimos 7 hours ago | parent [-] | | You’re describing changing the base assumption for software reachable on the internet. “Assume all possible unauthenticated urls will be hit basically constantly”. Bots used to exist but they were rare traffic spikes that would usually behave well and could mostly be ignored. No longer. |
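One partial mitigation for the session-ID problem watermelon0 describes is to strip session parameters before computing the cache key, so that URLs differing only in session ID share one cached response. A minimal sketch with the standard library follows. The parameter name `sid` is an assumption about the forum software in use, and `cache_key` is a hypothetical helper rather than part of any cache's API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to whatever your software emits.
SESSION_PARAMS = {"sid"}


def cache_key(url):
    """Return path plus sorted query string with session parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    kept.sort()  # make the key independent of parameter order
    return urlunsplit(("", "", parts.path, urlencode(kept), ""))
```

This only restores cache hits for anonymous traffic. Pages that genuinely vary by session must bypass the cache, and it does nothing about the number of distinct paths being requested.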
|
|
|
| ▲ | Lerc 14 hours ago | parent | prev | next [-] |
| I presume people have logs that indicate the source for them to place blame on AI scrapers. Is anybody making these available for analysis so we can see exactly who is doing this? |
| |
| ▲ | JohnTHaller 14 hours ago | parent | next [-] | | The big nasty AI bots use 10s of thousands of IPs distributed all over China | | |
| ▲ | notachatbot123 3 hours ago | parent | next [-] | | Millions and all over the world | |
| ▲ | krick 13 hours ago | parent | prev [-] | | So... just blacklist all China IPs? I assume China isn't the primary market for most of complaining site-owners. |
| |
| ▲ | esseph 11 hours ago | parent | prev [-] | | A lot of compromised home devices and cheap servers proxying traffic, from all over the world. | | |
| ▲ | Lerc 11 hours ago | parent [-] | | If that is the case how can you determine the reason for the activity? | | |
| ▲ | esseph 11 hours ago | parent [-] | | Some fake user agent, some tell you who they are. Or.. do they? Here-in is the problem. And if you block them, you risk blocking actual customers. | | |
| ▲ | Lerc 7 hours ago | parent [-] | | If they are using appropriated hardware, what possible reason could there be for them saying who they are? | | |
| ▲ | esseph 2 hours ago | parent [-] | | Three different "companies" normally: 1. The residential proxies 2. Scrapers, on behalf of or as an agent of the data buyer 3. Data buyer (ai training) Scrapers are buying from residential proxies, giving the data buyer a bit of a shield/deniability. The scrapers don't want to get outright blocked if they can avoid it, otherwise they have nothing to sell. |
|
|
|
|
|
|
| ▲ | anarticle 2 hours ago | parent | prev | next [-] |
| I use a private gitlab that was setup by claude, have my own runners and everything. It's fine. I have my own little home cluster, net storage compute around 2.5k. Go NUCs, cluster, don't look back. |
|
| ▲ | JohnTHaller 14 hours ago | parent | prev | next [-] |
| The Chinese AI scrapers/bots are killing quite a bit of the regular web now. YisouSpider absolutely pummeled my open source project's hosting for weeks. Like all Chinese AI scrapers, it ignores robots.txt. So forget about it respecting a Crawl-delay. If you block the user agent, it would calm down for a bit, then it would just come back again using a generic browser user agent from the same IP addresses. It does this across 10s of thousands of IPs. |
| |
|
| ▲ | ptman 9 hours ago | parent | prev | next [-] |
| Maybe put the git repos on radicle? |
|
| ▲ | Joel_Mckay 14 hours ago | parent | prev | next [-] |
| Some run git over ssh, and a domain login for https:// permission manager etc. Also, spider traps and 42TB zip of death pages work well on poorly written scrapers that ignored robots.txt =3 |
|
| ▲ | hattmall 14 hours ago | parent | prev | next [-] |
| Can we not charge for access? If I have a link, that says "By clicking this link you agree to pay $10 for each access" then sending the bill? |
| |
|
| ▲ | october8140 14 hours ago | parent | prev | next [-] |
| You could put it behind Cloudflare and block all AI. |
|
| ▲ | CuriouslyC 15 hours ago | parent | prev | next [-] |
| Does this author have a big pre-established audience or something? Struggling to understand why this is front-page worthy. |
| |
| ▲ | jaunt7632 15 hours ago | parent | next [-] | | A healthy front page shouldn’t be a “famous people only” section. If only big names can show up there, it’s not discovery anymore, it’s just a popularity scoreboard. | |
| ▲ | fouc 15 hours ago | parent | prev | next [-] | | because he's unable to self-host git anymore because AI bots are hammering it to submit PRs. self-hosting was originally a "right" we had upon gaining access to the internet in the 90s, it was the main point of the hyper text transfer protocol. | | |
| ▲ | geerlingguy 14 hours ago | parent | next [-] | | Also converting the blog from something dynamic to a static site generator. I made the same switch partly for ease of maintenance, but a side benefit is it's more resilient to this horrible modern era of scrapers far outnumbering legitimate traffic. It's painful to have your site offline because a scraper has channeled itself 17,000 layers deep through tag links (which are set to nofollow, and ignored in robots.txt, but the scraper doesn't care). And it's especially annoying when that happens on a daily basis. Not everyone wants to put their site behind Cloudflare. | |
| ▲ | tanduv 14 hours ago | parent | prev [-] | | sorry if i missed it, but the original post doesn't say anything about PRs... the bots only seem to be scraping the content | | |
| ▲ | fouc 11 hours ago | parent [-] | | oh you're right, I read "pointless requests" as "PRs", oops! |
|
| |
| ▲ | ares623 15 hours ago | parent | prev | next [-] | | Well the fact that this supposed nobody is overwhelmed by AI scrapers should speak a lot about the issue no? | |
| ▲ | bibimsz 15 hours ago | parent | prev [-] | | the era of mourning has begun |
|
|
| ▲ | Jaxkr 15 hours ago | parent | prev [-] |
| The author of this post could solve their problem with Cloudflare or any of its numerous competitors. Cloudflare will even do it for free. |
| |
| ▲ | denkmoon 15 hours ago | parent | next [-] | | Cool, I can take all my self hosted stuff and stick it behind centralised enterprise tech to solve a problem caused by enterprise tech. Why even bother? | | | |
| ▲ | Shorel 9 hours ago | parent | prev | next [-] | | Cloudflare seems to be taking over all of the last mile web traffic, and this extreme centralization sounds really bad to me. We should be able to achieve close to the same results with some configuration changes. AWS / Azure / Cloudflare total centralization means no one will be able to self host anything, which is exactly the point of this post. | |
| ▲ | the_fall 15 hours ago | parent | prev | next [-] | | They don't. I'm using Cloudflare and 90%+ of the traffic I'm getting are still broken scrapers, a lot of them coming through residential proxies. I don't know what they block, but they're not very good at that. Or, to be more fair: I think the scrapers have gotten really good at what they do because there's real money to be made. | | | |
| ▲ | rubiquity 15 hours ago | parent | prev | next [-] | | The scrapers should use some discretion. There are some rather obvious optimizations. Content that is not changing is less likely to change in the future. | | |
| ▲ | JohnTHaller 14 hours ago | parent [-] | | They don't care. It's the reason they ignore robots.txt and change up their useragents when you specifically block them. |
| |
| ▲ | simonw 14 hours ago | parent | prev | next [-] | | Cloudflare won't save you from this - see my comment here: https://news.ycombinator.com/item?id=46969751#46970522 | |
| ▲ | Semaphor 14 hours ago | parent | prev | next [-] | | For logging, statistics etc. we have the Cloudflare bot protection on the standard paid level, ignore all IPs not from Europe (rough geolocation), and still have over twice the amount of bots that we had ~2 years ago. | |
| ▲ | overgard 14 hours ago | parent | prev | next [-] | | I'm pretty sure scrapers aren't supposed to act as low key DOS attacks | |
| ▲ | isodev 15 hours ago | parent | prev | next [-] | | I think the point of the post was how something useless (AI) and its poorly implemented scrapers are wreaking havoc in a way that’s turning the internet into a digital desert. That Cloudflare is trying to monetise “protection from AI” is just another grift in the sense that they can’t help themselves as a corp. | |
| ▲ | fouc 15 hours ago | parent | prev [-] | | you don't understand what self-hosting means. self-hosting means the site is still up when AWS and Cloudflare go down. |
|