1vuio0pswjnm7 6 days ago

The proxy log contains the timestamps but not the titles

I could extract the titles from pcaps; I also have a running tcpdump capture that logs to a (daemontools) multilog directory
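
As a rough sketch of the title step only: assuming the HTML of a response has already been reassembled from the capture and decoded to text (that part is not shown here), the title can be pulled out with the Python standard library. The names TitleParser and extract_title are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the page title with whitespace collapsed, or None if absent."""
    parser = TitleParser()
    parser.feed(html_text)
    title = " ".join("".join(parser.parts).split())
    return title or None
```

Each title would then still need to be matched to a proxy log entry, e.g., by timestamp and URL.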

The URL consumption might be different, and difficult to compare, for a number of reasons, e.g.,

I do not use a browser that sends automatic HTTP requests for resources like images, CSS files, JavaScript files, etc.

I do not use a browser that runs JavaScript, so there are no XHR or other JavaScript-triggered requests

I do not use remote DNS; I use "curated" DNS data, so the URLs are only for resources at domains I specifically request

I use HTTP/1.1 pipelining, so I have large numbers of URLs for resources from a single domain, for example DoH (I do not include these in the URL database)

Generally the proxy log is rather clean and excludes garbage requests that are being sent automatically; IME, use of a "modern" browser will fill a log with such garbage

The proxy's self-signed certificate blocks many potential requests from hardware with pre-installed software from so-called "tech" companies, e.g., Google, Apple, Microsoft, because the TLS connections fail

These attempted connections to the mothership are incessant; they would fill a proxy log with garbage URLs if they were accepted

All this makes it easier for me to keep a URL database; storing all those garbage URLs would make the database less useful
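
To illustrate the kind of URL database described above, here is a minimal sketch in Python with SQLite. Everything specific in it is an assumption: the log format (one "timestamp URL" pair per line), the table layout, and the excluded DoH hostname doh.example.net are hypothetical, not the actual setup.

```python
import sqlite3
from urllib.parse import urlsplit

# Hypothetical: hosts whose URLs are kept out of the database,
# e.g. a DoH endpoint that receives many pipelined requests.
EXCLUDED_HOSTS = {"doh.example.net"}


def parse_line(line):
    """Parse an assumed '<unix timestamp> <url>' log line; None if unusable."""
    parts = line.split(maxsplit=1)
    if len(parts) != 2:
        return None
    try:
        ts = float(parts[0])
    except ValueError:
        return None
    url = parts[1].strip()
    host = urlsplit(url).hostname
    if host is None or host in EXCLUDED_HOSTS:
        return None
    return ts, host, url


def store(lines, db):
    """Insert the usable log lines into the urls table; return how many were kept."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS urls "
        "(ts REAL, host TEXT, url TEXT, UNIQUE(ts, url))"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    db.executemany("INSERT OR IGNORE INTO urls VALUES (?, ?, ?)", rows)
    db.commit()
    return len(rows)
```

With a clean proxy log the exclusion set stays short; with a "modern" browser it would presumably have to grow to cover every automatic request.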