wonger_ 6 days ago

Do you log timestamps and page titles? About how many URLs do you log in an average day or week? Curious if your consumption is similar to mine or not.

1vuio0pswjnm7 6 days ago

The proxy log contains the timestamps but not the titles

I could extract the titles from pcaps; I also have a running tcpdump capture that logs to a (daemontools) multilog directory

The URL consumption might be different, and difficult to compare, for a number of reasons, e.g.,

I do not use a browser that sends automatic HTTP requests for resources like images, CSS files, JavaScript files, etc.

I do not use a browser that runs JavaScript, so there are no XHR or other JavaScript-triggered requests

I do not use remote DNS, I use "curated" DNS data, so the URLs are only for resources at domains I specifically request

I use HTTP/1.1 pipelining, so I have large numbers of URLs for resources from a single domain, for example DoH queries (I do not include these in the URL database)

Generally the proxy log is rather clean and excludes garbage requests that are being sent automatically; IME, use of a "modern" browser will fill a log with such garbage

The proxy's self-signed certificate blocks many potential requests from hardware with pre-installed software from so-called "tech" companies, e.g., Google, Apple, Microsoft, because the TLS connections fail

These attempted connections to the mothership are incessant; they would fill a proxy log with garbage URLs if they were accepted

All this makes it easier for me to keep a URL database; storing all those garbage URLs would make the database less useful
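
For readers who want to try something similar, here is a minimal sketch of loading a proxy log into a URL database while skipping excluded hosts such as a DoH endpoint. It is not the setup described above: the "timestamp URL" line format, the `EXCLUDED_HOSTS` set, the `doh.example` host and the `urls` table are all assumptions made for illustration.

```python
import sqlite3
from urllib.parse import urlsplit

# Hypothetical: hosts whose URLs are kept out of the database (e.g. a DoH endpoint).
EXCLUDED_HOSTS = {"doh.example"}


def parse_line(line):
    """Split an assumed 'timestamp URL' log line; return None if it is malformed."""
    parts = line.split(None, 1)
    if len(parts) != 2:
        return None
    ts, url = parts[0], parts[1].strip()
    host = urlsplit(url).hostname
    if host is None:  # not an absolute URL
        return None
    return ts, url, host


def store_urls(lines, conn):
    """Insert each new URL once, remembering the timestamp of its first appearance."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, first_seen TEXT)"
    )
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, url, host = parsed
        if host in EXCLUDED_HOSTS:
            continue
        # OR IGNORE keeps the earliest row when the same URL shows up again.
        conn.execute(
            "INSERT OR IGNORE INTO urls (url, first_seen) VALUES (?, ?)", (url, ts)
        )
    conn.commit()
```

Usage would be along the lines of `store_urls(open("proxy.log"), sqlite3.connect("urls.db"))`, where both file names are placeholders. Keying the table on the URL means repeat visits do not add rows, which fits the goal of keeping the database small and useful.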