snorremd 4 hours ago:
I've recently been setting up web servers like Forgejo and Mattermost to serve my own and my friends' needs. I ended up setting up CrowdSec to parse and analyse the access logs from Traefik and block bad actors that way: when an IP produces a bunch of 4XX codes in a short timeframe, I assume it is malicious and ban it for a couple of hours. That seems to deter a lot of random scraping. It doesn't stop well-behaved crawlers, though, which should only produce 200s. I'm actually not sure how I would go about stopping AI crawlers that are reasonably well behaved, considering they apparently don't identify themselves correctly and will ignore robots.txt.
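For reference, CrowdSec expresses this kind of rule as a leaky-bucket scenario. Below is a minimal sketch modelled on the hub's crowdsecurity/http-probing scenario; the scenario name, thresholds, and ban window are illustrative, not the actual config described above:

```yaml
# Leaky-bucket scenario: overflow (and ban) when one IP produces
# too many 4XX responses in a short window. Values are illustrative.
type: leaky
name: local/http-4xx-burst          # hypothetical scenario name
description: "Ban IPs that generate a burst of 4XX responses"
filter: "evt.Meta.service == 'http' && evt.Meta.http_status startsWith '4'"
groupby: "evt.Meta.source_ip"       # one bucket per source IP
capacity: 10                        # bucket holds 10 events...
leakspeed: "10s"                    # ...leaking one out every 10 seconds
blackhole: 5m                       # ignore repeat overflows for 5 minutes
labels:
  service: http
  type: scan
  remediation: true                 # bouncers (e.g. the Traefik plugin) enforce bans
```

The bucket overflows once an IP accumulates 4XX events faster than they leak out, at which point CrowdSec emits a ban decision for whatever bouncer is wired up to enforce it.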
lowdude 2 hours ago:
There was a comment in a different thread suggesting that they may respect robots.txt for the most part but ignore wildcards: https://news.ycombinator.com/item?id=46975726 Maybe that is worth trying first if you are currently having issues.
V__ 4 hours ago:
If possible, I would block by country first. Even on public websites I block Russia and China by default, and that alone reduced port scans and the like. On "private" services where I or my friends are the only users, I block everything except my own country.
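One way to implement that is at the firewall. A rough sketch, assuming nftables is in use and using ipdeny.com's aggregated per-country CIDR lists (any other source of country zone files works the same way; the table and set names are assumptions):

```sh
#!/bin/sh
# Sketch: drop traffic from selected countries via an nftables set.
nft add table inet geoblock
nft add set inet geoblock blocked '{ type ipv4_addr; flags interval; }'
nft add chain inet geoblock input '{ type filter hook input priority -1; }'
nft add rule inet geoblock input ip saddr @blocked drop

# Load per-country CIDR lists into the set.
for cc in ru cn; do
  curl -s "https://www.ipdeny.com/ipblocks/data/aggregated/${cc}-aggregated.zone" |
    while read -r net; do
      nft add element inet geoblock blocked "{ $net }"
    done
done
```

Adding elements one at a time is slow for large lists; in practice you would generate a single ruleset file and load it with `nft -f`. The allowlist variant for private services is the same idea inverted: default-drop the hook chain and only accept your own country's ranges.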