jsheard 3 days ago

The other problem with logs is that it's very difficult to filter out bots masquerading with browser user agents, of which there are a lot nowadays. I've watched the logs on newly registered domains that aren't published anywhere besides the certificate transparency logs and seen the majority of traffic coming from "Chrome". Yeah I'm sure you are.

diggan 3 days ago | parent [-]

> The other problem with logs is that it's very difficult to filter out all of the bot traffic.

It's not very difficult, but it isn't effortless either. Start with something like https://github.com/allinurl/goaccess/blob/master/config/brow... which captures 99% of the crawlers out there. Then, when you notice one particular user-agent/IP/IP-range doing a bunch of requests, add it to the list and re-run. Filtering based on the ASNs that you see being used for crawling lets you filter out most of the AI agents too.

We've been dealing with this problem for over two decades now, and there are solutions out there that remove almost all of it from your logs.
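
For anyone wondering what that loop looks like in practice, here is a rough sketch in Python, not what goaccess does internally. It assumes logs in Combined Log Format. The substring list and the blocked range are made-up placeholders (the range is a documentation-only network), and `is_bot` / `filter_humans` are hypothetical names. ASN-based filtering would need an external IP-to-ASN database, so plain IP ranges stand in for it here.

```python
import ipaddress
import re

# Hypothetical, deliberately tiny lists. A real setup would load a
# maintained crawler list and grow these as new offenders show up.
BOT_UA_SUBSTRINGS = ["bot", "crawler", "spider", "curl", "python-requests"]
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # placeholder range

# Combined Log Format:
# ip ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def is_bot(line):
    """Return True if the log line should be dropped from the stats."""
    m = LINE_RE.match(line)
    if m is None:
        return True  # assumption: unparseable lines are treated as noise
    ua = m.group("ua").lower()
    if any(s in ua for s in BOT_UA_SUBSTRINGS):
        return True
    try:
        ip = ipaddress.ip_address(m.group("ip"))
    except ValueError:
        return True  # client field was not an IP address
    return any(ip in net for net in BLOCKED_NETWORKS)


def filter_humans(lines):
    """Keep only the lines that do not match any bot rule."""
    return [line for line in lines if not is_bot(line)]
```

Whenever a new user agent or range turns up hammering the site, append it to one of the two lists and re-run over the same log file. The second list is what catches the "Chrome" impostors, since their user agent alone gives nothing away.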

lazide 2 days ago | parent [-]

Sounds about as much fun as manual spam filtering, but with worse tools. :(

diggan 2 days ago | parent [-]

Literally takes 5 minutes to set up at most, and most analytics tools ship with an "ignore webcrawlers" option somewhere, like goaccess does for example, taking 0 minutes to use :)