marginalia_nu 2 hours ago

Oh yeah, there's definitely room for improvement in that general direction. Indexing anchor texts is much better than PageRank, but in isolation, it's not sufficient.

I've also seen some benefit from fingerprinting the network traffic that websites make, using a headless browser, to identify which ad networks they load. Very few spam sites have no ads, since there wouldn't be any economy in that.

e.g. https://marginalia-search.com/site/www.salon.com?view=traffi...
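
For illustration only, here is a rough sketch of the matching step in Python. It assumes the headless browser has already produced a list of requested URLs; the `AD_NETWORK_DOMAINS` table and the `ad_networks_in_traffic` helper are hypothetical names made up for this example, not how Marginalia actually does it.

```python
from urllib.parse import urlsplit

# Illustrative, incomplete table of ad-serving domains (an assumption for
# this sketch; a real list would be much longer and curated separately).
AD_NETWORK_DOMAINS = {
    "doubleclick.net": "Google Ads",
    "googlesyndication.com": "Google Ads",
    "taboola.com": "Taboola",
    "outbrain.com": "Outbrain",
}

def ad_networks_in_traffic(request_urls):
    """Return the set of ad networks contacted, given recorded request URLs."""
    found = set()
    for url in request_urls:
        host = (urlsplit(url).hostname or "").lower()
        for domain, network in AD_NETWORK_DOMAINS.items():
            # Match the domain itself or any subdomain, but not lookalikes
            # such as "notdoubleclick.net".
            if host == domain or host.endswith("." + domain):
                found.add(network)
    return found
```

A site whose recorded traffic yields an empty set would be one of the rare ad-free cases mentioned above.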

The full data set of DOM samples + recorded network traffic is in an enormous SQLite file (400GB+), and I haven't yet worked out any way of distributing the data. Though it's in the back of my mind as something I'd like to solve.

dredmorbius 13 minutes ago

Oh, that is clever!

I'd also suspect that there are networks / links which are more likely signs of low-value content than others. Off the top of my head, crypto, MLM, known scam/fraud sites, and perhaps share links to certain social networks might be negative indicators.
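
As a rough sketch of what I mean (Python, with made-up numbers): assume each outbound link or loaded network has already been tagged with a category somehow. The category names, the penalty weights, and the `low_value_score` helper are all hypothetical and would need tuning against real data.

```python
# Hypothetical penalty per category; the values are placeholders, not
# measured weights.
CATEGORY_PENALTY = {
    "crypto": 2,
    "mlm": 3,
    "known_scam": 5,
    "social_share": 1,
}

def low_value_score(link_categories):
    """Sum the penalties of the distinct categories seen on a page.

    Each category counts once, however many links fall into it;
    unknown categories contribute nothing.
    """
    return sum(CATEGORY_PENALTY.get(c, 0) for c in set(link_categories))
```

A higher score would only be one negative signal among others, not a verdict on its own.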