snorremd 4 hours ago

I've recently been setting up web servers like Forgejo and Mattermost to serve my own and friends' needs. I ended up setting up CrowdSec to parse and analyse the access logs from Traefik and block bad actors that way: when an IP produces a bunch of 4XX codes in a short timeframe, I assume it is malicious and ban it for a couple of hours. That seems to deter a lot of random scraping. It doesn't stop well-behaved crawlers, though, which should only produce 200 codes.
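A minimal sketch of such a CrowdSec scenario, assuming the stock HTTP parsers (the event field names and the scenario name example/http-4xx-burst are my assumptions; check what your Traefik parser actually emits):

    # Leaky-bucket scenario: bucket overflow -> alert -> ban via a bouncer
    type: leaky
    name: example/http-4xx-burst
    description: "Ban IPs that produce a burst of 4XX responses"
    filter: "evt.Meta.service == 'http' && evt.Meta.http_status in ['400', '401', '403', '404']"
    groupby: evt.Meta.source_ip
    capacity: 10       # bucket overflows after ~10 matching events...
    leakspeed: "10s"   # ...arriving faster than one per 10 seconds
    blackhole: 2m      # don't re-alert on the same IP for 2 minutes
    labels:
      service: http
      type: scan
      remediation: true  # lets the bouncer turn the alert into a ban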

I'm actually not sure how I would go about stopping AI crawlers that are otherwise reasonably well behaved, considering they apparently don't identify themselves correctly and will ignore robots.txt.

lowdude 2 hours ago | parent | next

There was a comment in a different thread suggesting that they may respect robots.txt for the most part but ignore wildcards: https://news.ycombinator.com/item?id=46975726

Maybe this is worth trying out first, if you are currently having issues.
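If you want to test that theory, one option is a robots.txt that names crawlers explicitly instead of relying only on the * wildcard. A sketch (GPTBot and CCBot are real AI crawler user-agent tokens; extend the list to whichever bots show up in your logs):

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # keep the wildcard rule for bots that do honour it
    User-agent: *
    Disallow: /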

V__ 4 hours ago | parent | prev

If possible, I would block by country first. Even on public websites I block Russia/China by default, and that reduced port scans etc.
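Country blocking usually lives below the reverse proxy, e.g. at the firewall with a GeoIP-derived address set. A minimal nftables sketch, assuming you fetch a per-country CIDR list separately (the set name and the CIDRs are placeholders):

    table inet filter {
        set blocked_countries {
            type ipv4_addr
            flags interval
            # populate from a GeoIP country CIDR dump; these are placeholders
            elements = { 198.51.100.0/24, 203.0.113.0/24 }
        }
        chain input {
            type filter hook input priority 0; policy accept;
            ip saddr @blocked_countries drop
        }
    }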

On "private" services where I or my friends are the only users, I block everything except my country.