▲ | coffeecoders 10 hours ago
On a slightly related note: I've been thinking about building a home-local "mini-Google" that indexes maybe 1,000 websites. In practice, I rarely need more than a handful of sites for my searches, so it seems like overkill to rely on full-scale search engines for my use case.

My rough idea for the architecture:

- Crawler: a lightweight scraper that visits each site periodically.
- Indexer: convert pages into text and create an inverted index for fast keyword search. Could use something like Whoosh.
- Storage: store raw HTML and text locally, maybe compress older snapshots.
- Search layer: a simple query parser to score results by relevance, maybe using TF-IDF or embeddings.

I would do periodic updates and build a small web UI to browse. Anyone tried it, or are there similar projects?
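To make the indexer/search layer a bit more concrete, here's roughly what I had in mind with Whoosh. It builds the inverted index and ranks with BM25 by default, which is close enough to TF-IDF for me. An untested sketch; the directory and field names are just placeholders:

    import os
    from whoosh import index
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    # One document per crawled page: URL as unique key, title and body text searchable.
    schema = Schema(
        url=ID(stored=True, unique=True),
        title=TEXT(stored=True),
        content=TEXT,
    )

    os.makedirs("indexdir", exist_ok=True)
    ix = index.create_in("indexdir", schema)

    # Indexing pass: one add_document call per page the crawler produced.
    writer = ix.writer()
    writer.add_document(
        url="https://example.com/",
        title="Example page",
        content="extracted page text goes here",
    )
    writer.commit()

    # Query pass: parse a keyword query and rank matches (BM25 by default).
    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("inverted index")
        for hit in searcher.search(query, limit=10):
            print(hit["url"], hit["title"])

On re-crawls, writer.update_document with the same url would replace the stale entry instead of duplicating it.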
▲ | mrkeen 20 minutes ago
Yep. Built a crawler, an indexer/query processor, and an engine responsible for merging and compacting indexes. Crawling was tricky: something like Stack Overflow will stop returning pages when it detects that you're crawling, much sooner than you'd expect.
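If you go down this road, per-domain throttling and a robots.txt check buy you some goodwill, though aggressive sites can still cut you off. A simplified sketch, not my actual code; the user agent, delay, and https robots.txt URL are assumptions to tune:

    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    import requests  # third-party; any HTTP client works

    USER_AGENT = "mini-crawler/0.1 (personal use)"  # placeholder UA string
    DELAY_SECONDS = 10.0                            # conservative per-domain delay
    last_fetch = {}                                 # domain -> time of last request
    robots = {}                                     # domain -> cached robots.txt parser

    def allowed(url):
        # Fetch and cache robots.txt once per domain; assumes https.
        domain = urlparse(url).netloc
        if domain not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{domain}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass  # unreachable robots.txt -> can_fetch stays conservative
            robots[domain] = rp
        return robots[domain].can_fetch(USER_AGENT, url)

    def fetch(url):
        if not allowed(url):
            return None  # respect robots.txt disallow rules
        domain = urlparse(url).netloc
        wait = DELAY_SECONDS - (time.time() - last_fetch.get(domain, 0.0))
        if wait > 0:
            time.sleep(wait)  # throttle repeat requests to the same domain
        last_fetch[domain] = time.time()
        return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)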
▲ | bryanhogan an hour ago
Reminds me of building an Obsidian vault with all the content in markdown form. There are also plugins that show vault results when doing a Google search, making notes from your vault show up before external websites.
▲ | harias 10 hours ago
YaCy (https://yacy.net) can do all of this, I think. Cloudflare might block your IP pretty quickly though if you try to crawl.
▲ | msephton 3 hours ago
Perhaps not quite solving your problem, but I have a handful of domain-specific Google CSEs (Custom Search Engines) that limit the results to predefined websites. I summon them from Alfred with short keywords when I'm doing interest-specific searches. https://blog.gingerbeardman.com/2021/04/20/interest-specific...
▲ | andai 7 hours ago
Have you ever looked at Common Crawl dumps? I did a bit of data mining and holy cow is 99.99% of the web crap. Spam, porn, ads, flame wars, random blogs by angsty teens... I understand it has historical and cultural value — and maybe literary value, in a Douglas Coupland kind of way — but for my purposes, there was very little there that I considered of interest.

Which was very encouraging to me, because it implies that indexing the Actually Important Web Pages might even be possible for a single person on their laptop. Wikipedia, for comparison, is only ~20GB compressed. (And even most of that is not relevant to my interests; the Wikipedia articles related to stuff I'd ever ask about are probably ~200MB tops.)
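For anyone who wants to poke at a dump themselves: the WET files contain pre-extracted plain text, and the warcio library can walk through them. A rough sketch (the filename and the keyword filter are placeholders):

    from warcio.archiveiterator import ArchiveIterator

    # WET files hold plain-text conversions of crawled pages; path is a placeholder.
    with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # text records in WET files
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            if "search engine" in text.lower():  # crude keyword filter
                print(url)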
▲ | fabiensanglard 10 hours ago
Have you ever tried https://marginalia-search.com ? I love it.
▲ | computerex 4 hours ago
Kind of. I made ainews247.org, which crawls certain sites and filters the content so it's AI-specific and valuable. I think it's a really good idea.
▲ | matsz 10 hours ago
You could take a look at the Yandex source code that leaked a few years ago. I'd imagine their architecture is decent enough.
| ||||||||
▲ | toephu2 9 hours ago
With LLMs, why do you even need a mini-Google?
|