c0balt 3 days ago

Huh, I have always wondered whether log-based analytics would suffice for these use cases. No need to intrude on the client when the request itself already includes lots of data.

You get the visited page, the time of visit, and a minimal user identifier (preferred locale, user agent, source IP). That should provide enough data to answer those questions. If you want to make it slightly more privacy-friendly and cheaper, store only aggregate data (visited-page counts, etc.).
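As a minimal sketch of the idea, here is how the fields mentioned above could be pulled out of a standard combined-format access log and reduced to aggregate page counts. The regex and sample lines are assumptions about the log format; adjust them to your server's actual configuration.

```python
# Sketch: aggregate page views from an access log in Combined Log Format.
# The regex assumes the common nginx/Apache "combined" layout.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def page_counts(lines):
    """Return per-path hit counts; no per-visitor data is kept."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("method") == "GET":
            counts[m.group("path")] += 1
    return counts

sample = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [10/Oct/2024:13:56:01 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(page_counts(sample))  # Counter({'/blog/post': 2})
```

The IP and user agent are captured by the regex but deliberately never stored, which is the "aggregate and throw away" version of this approach.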

The problem here is probably that logs are not always accessible (e.g. on managed services like GitHub/GitLab Pages).

jsheard 3 days ago

The other problem with logs is that it's very difficult to filter out bots masquerading with browser user agents, of which there are a lot nowadays. I've watched the logs on newly registered domains that aren't published anywhere besides the certificate transparency logs and seen the majority of traffic coming from "Chrome". Yeah I'm sure you are.

diggan 3 days ago

> The other problem with logs is that it's very difficult to filter out all of the bot traffic.

It's not very difficult, but it isn't effortless either. Start with something like https://github.com/allinurl/goaccess/blob/master/config/brow... which catches 99% of the crawlers out there. Then, when you notice one particular user-agent/IP/IP-range making a bunch of requests, add it to the list and re-run. Filtering based on ASNs that you see being used for crawling lets you catch most of the AI agents too.
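The list-based filtering described above can be sketched in a few lines. The patterns here are a small illustrative subset in the spirit of goaccess's browsers list, not the actual file:

```python
# Sketch of user-agent-based crawler filtering. The pattern list is a
# hand-picked illustrative subset; in practice you'd maintain a much
# longer list and extend it as new crawlers show up in your logs.
CRAWLER_PATTERNS = [
    "bot", "crawler", "spider", "curl", "wget",
    "headlesschrome", "python-requests",
]

def is_crawler(user_agent: str) -> bool:
    """Case-insensitive substring match against known crawler markers."""
    ua = user_agent.lower()
    return any(p in ua for p in CRAWLER_PATTERNS)

print(is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1)"))        # True
print(is_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))  # False
```

This only catches honest bots, of course; the masquerading traffic the parent comment describes needs the IP/ASN-based filtering on top.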

We've been dealing with this problem for over two decades now, and there are solutions out there that remove almost all of it from your logs.

lazide 2 days ago

Sounds about as much fun as manual spam filtering, but with worse tools. :(

diggan 2 days ago

It literally takes 5 minutes at most to set up, and most analytics tools ship with an "ignore webcrawlers" option somewhere, like goaccess does for example, taking 0 minutes to use :)

closewith 3 days ago

At least in the EU, you still need consent to go down this path, so you end up rebuilding the existing analytics tools to manage that.

Fire-Dragon-DoL 2 days ago

You don't: the IP and User-Agent header are necessary for you to serve the request and to protect against malicious usage. Both are legitimate grounds for collecting that data under current legislation. You can still convert those datasets into analytics.

closewith 2 days ago

No, you're wrong. You need consent to use IP addresses for analytics, even if they were lawfully collected for another purpose under a different legal basis.

Lots of precedent on this and it's clear cut.

Fire-Dragon-DoL 2 days ago

But you don't need to use the IP for analytics. Aggregate the data and throw it away.

closewith 2 days ago

You still need consent for that, since you processed the IP for that purpose, and that is explicitly never allowed under other legal bases.

Again, this is a completely settled matter in the EU, so you can easily look it up if you don't believe me.

Fire-Dragon-DoL 2 days ago

Cool, then ignore the IP and just count the number of requests to a given endpoint. That should be sufficient.
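Setting the legal question aside, the IP-free counting described here is trivial to sketch: increment a per-endpoint counter at request time and never persist any identifier at all. The function and storage names are illustrative, not from any particular framework.

```python
# Sketch: per-endpoint request counting with no identifier retained.
# Nothing about the visitor (IP, user agent, cookies) ever touches storage.
from collections import Counter

page_hits: Counter = Counter()

def record_hit(path: str) -> None:
    """Store only the endpoint that was requested."""
    page_hits[path] += 1

for path in ["/", "/pricing", "/", "/docs", "/"]:
    record_hit(path)

print(page_hits.most_common(2))  # [('/', 3), ('/pricing', 1)]
```

Whether even this counts as "processing" the IP under the GDPR is exactly the point disputed in the comments above, since the IP is still read to serve the request before being discarded.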