Show HN: Infini-News – 1.36B news articles from Common Crawl, queryable in ms (cs2.uni-graz.at)
5 points by ruggsea a day ago | 1 comment
Infini-News is ten years of CC-NEWS (the news subset of Common Crawl), cleaned, enriched and turned into a full-text index, so you can count any keyword or phrase across 1.36B articles in sub-second time (ok, right now maybe a few seconds, depending on circumstances), without downloading anything. It's free and open on Hugging Face.

I built it because I was sick of manually scraping news websites and the like for research purposes, and because tackling a project of this scale felt personally interesting.

On top of data cleaning, we ran language, country (via TLDs and some other heuristics) and topic tagging over all the articles, and I indexed all of them using a recent n-gram indexing technology that I consider akin to magic.

I'd encourage you to read the blog post and play with the interactive viz I made for it. Also, of course, happy to answer questions.

Blog: https://cs2.uni-graz.at/blog/infini-news/
Dataset: https://huggingface.co/datasets/ruggsea/infini-news-corpus
Index: https://huggingface.co/datasets/ruggsea/infini-news-index
Preprint: https://arxiv.org/abs/2605.18337
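
For readers wondering what "country via TLDs" can look like in practice, here is a minimal sketch of that kind of heuristic. It is not the project's actual pipeline: the `CCTLD_TO_COUNTRY` table and the `guess_country` function are hypothetical names made up for illustration, the mapping is deliberately tiny, and the "other heuristics" mentioned in the post are not covered.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny mapping from country-code TLDs to ISO codes.
# A real tagger would need a complete table plus fallbacks for generic TLDs.
CCTLD_TO_COUNTRY = {
    "at": "AT",
    "de": "DE",
    "fr": "FR",
    "it": "IT",
    "uk": "GB",
}


def guess_country(url):
    """Return an ISO country code guessed from the URL's TLD, or None."""
    host = urlparse(url).hostname  # lowercased by urlparse; None without a scheme
    if not host:
        return None
    # Take the last dot-separated label of the hostname as the TLD.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return CCTLD_TO_COUNTRY.get(tld)


if __name__ == "__main__":
    for url in [
        "https://cs2.uni-graz.at/blog/infini-news/",
        "https://www.bbc.co.uk/news",
        "https://example.com/story",
    ]:
        print(url, guess_country(url))
```

Generic TLDs such as `.com` come back as `None` here, which is presumably where additional heuristics would have to take over.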
wonnie a day ago
Very cool! Happy to see some cool stuff made in Graz too. Keep up the good work!