PaulHoule 3 hours ago
My YOShInOn RSS reader uses an SBERT model for classification (will I upvote this or not?) and for large-scale clustering (20 k-means clusters, then show me the top N in each cluster so I get a diversity of articles).

For duplicate detection I am using DBSCAN https://scikit-learn.org/stable/modules/generated/sklearn.cl... and found some parameters where I get almost no false positives but a lot of duplicates get missed. When I relaxed the threshold to form more clusters, I started getting false positives fast.

I don't find duplicates are a big problem in my system with the 110 feeds I have and the subjects I am interested in, but insofar as they are a problem there tend to be structured relationships between articles: that is, site A syndicates articles from site B, but for some reason articles from site A usually get selected and site B articles don't. An article from site A often links to one or more other articles, often from sites I don't have a feed for, and it would be nice if the system looked at the whole constellation. Stuff like that.

Effective clustering is the really interesting technology Google News has had for a long time.
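For readers who want to try the DBSCAN idea, here is a minimal sketch of near-duplicate detection over embedding vectors. It is not the YOShInOn code, which isn't shown in this thread. The random vectors stand in for SBERT embeddings, and the helper names (`make_demo_embeddings`, `duplicate_labels`) and the `eps` values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def make_demo_embeddings(seed=0, dim=64):
    """Hypothetical stand-in for SBERT vectors.

    Rows 0-1 and rows 2-3 are near-duplicate pairs (same story plus a
    little noise); rows 4 and 5 are unrelated articles.
    """
    rng = np.random.default_rng(seed)
    stories = rng.normal(size=(4, dim))
    jitter = 0.05 * rng.normal(size=(2, dim))
    return np.vstack([
        stories[0], stories[0] + jitter[0],
        stories[1], stories[1] + jitter[1],
        stories[2],
        stories[3],
    ])


def duplicate_labels(embeddings, eps):
    """Cluster by cosine distance; label -1 means "no duplicate found".

    min_samples=2 counts the point itself, so a single close neighbour
    is enough to form a duplicate group.
    """
    model = DBSCAN(eps=eps, min_samples=2, metric="cosine")
    return model.fit_predict(embeddings)


if __name__ == "__main__":
    vectors = make_demo_embeddings()
    # A tight eps keeps false positives down but can miss duplicates;
    # a loose eps starts merging unrelated articles.
    print("strict:", duplicate_labels(vectors, eps=0.1))
    print("loose: ", duplicate_labels(vectors, eps=1.5))
```

On this toy data the strict setting should group only the two near-duplicate pairs and leave the unrelated rows as noise, while the loose setting pulls unrelated rows into clusters, which is the false-positive behaviour described above. Real article embeddings are far less cleanly separated, so the usable `eps` range has to be found empirically.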
benwills 2 hours ago
I have been attempting this exact sort of clustering solution for a few years now (on and off, as a side project). Do you have source code available, or more detailed explanations/resources on how to approach this?

Edit: I just looked around for your YOShInOn RSS reader code and couldn't find it. I did find what look like a number of references you've made to it on various forums, etc. over the years.