coffeecoders 10 hours ago

On a slightly related note:

I've been thinking about building a home-local "mini-Google" that indexes maybe 1,000 websites. In practice, I rarely need more than a handful of sites for my searches, so it seems like overkill to rely on full-scale search engines for my use case.

My rough idea for architecture:

- Crawler: A lightweight scraper that visits each site periodically.

- Indexer: Convert pages into text and create an inverted index for fast keyword search. Could use something like Whoosh.

- Storage: Store raw HTML and text locally, maybe compress older snapshots.

- Search Layer: Simple query parser to score results by relevance, maybe using TF-IDF or embeddings.

I would do periodic updates and build a small web UI to browse.
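
To make this concrete, the indexer/search half could be roughly the following (untested sketch: Whoosh as mentioned above, with requests/BeautifulSoup and the directory/field names just stand-ins I'm assuming):

    # Sketch: fetch a page, add it to a Whoosh index, query it back.
    # Assumes `pip install whoosh requests beautifulsoup4`.
    import os
    import requests
    from bs4 import BeautifulSoup
    from whoosh import index
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    schema = Schema(url=ID(stored=True, unique=True), content=TEXT(stored=True))
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
        ix = index.create_in("indexdir", schema)
    else:
        ix = index.open_dir("indexdir")

    def index_page(url):
        html = requests.get(url, timeout=10).text
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
        writer = ix.writer()
        writer.update_document(url=url, content=text)  # re-crawling replaces the old copy
        writer.commit()

    def search(q, limit=10):
        with ix.searcher() as searcher:  # Whoosh scores with BM25F by default
            query = QueryParser("content", ix.schema).parse(q)
            return [(hit["url"], hit.score) for hit in searcher.search(query, limit=limit)]

The crawler would then just be a loop that calls index_page() on each URL on a schedule, and the web UI a thin wrapper around search().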

Has anyone tried this, or are there similar projects?

mrkeen 20 minutes ago | parent | next [-]

Yep. Built a crawler, an indexer/query processor, and an engine responsible for merging/compacting indexes.

Crawling was tricky. Something like Stack Overflow will stop returning pages when it detects that you're crawling, much sooner than you'd expect.
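
Honoring robots.txt and spacing out requests per host helps a bit, though some sites will block you regardless. Roughly this (stdlib only; the 10-second delay is an arbitrary number):

    # Minimal politeness layer: check robots.txt and wait between requests
    # to the same host before fetching anything.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    ROBOTS = {}    # host -> RobotFileParser (or None if unreachable)
    LAST_HIT = {}  # host -> time of last request
    DELAY = 10.0

    def allowed(url, agent="my-mini-crawler"):
        parts = urlparse(url)
        base = parts.scheme + "://" + parts.netloc
        if base not in ROBOTS:
            rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
            try:
                rp.read()
            except OSError:
                rp = None  # robots.txt unreachable: skip the check for this host
            ROBOTS[base] = rp
        rp = ROBOTS[base]
        return True if rp is None else rp.can_fetch(agent, url)

    def wait_turn(url):
        host = urlparse(url).netloc
        elapsed = time.time() - LAST_HIT.get(host, 0)
        if elapsed < DELAY:
            time.sleep(DELAY - elapsed)
        LAST_HIT[host] = time.time()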

bryanhogan an hour ago | parent | prev | next [-]

Reminds me of building an Obsidian vault with all the content in markdown form. There are also plugins that show vault results when doing a Google search, making notes within your vault show up before external websites.

harias 10 hours ago | parent | prev | next [-]

YaCy (https://yacy.net) can do all of this, I think. Cloudflare might block your IP pretty soon though if you try to crawl.

msephton 3 hours ago | parent | prev | next [-]

Perhaps not quite solving your problem, but I have a handful of domain-specific Google CSEs (Custom Search Engines) that limit the results to predefined websites. I summon them from Alfred with short keywords when I'm doing interest-specific searches. https://blog.gingerbeardman.com/2021/04/20/interest-specific...

andai 7 hours ago | parent | prev | next [-]

Have you ever looked at Common Crawl dumps? I did a bit of data mining and holy cow is 99.99% of the web crap. Spam, porn, ads, flame wars, random blogs by angsty teens... I understand it has historical and cultural value — and maybe literary value, in a Douglas Coupland kind of way — but for my purposes, there was very little here that I considered of interest.

Which was very encouraging to me, because it implies that indexing the Actually Important Web Pages might even be possible for a single person on their laptop.

Wikipedia, for comparison, is only ~20GB compressed. (And even most of that is not relevant to my interests, e.g. the Wikipedia articles related to stuff I'd ever ask about are probably ~200MB tops.)
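
If you want to poke at it yourself, the WET files (pre-extracted plain text) are the easiest way in. Something like this with warcio; the file name and the keyword filter are just placeholders:

    # Stream one Common Crawl WET file and keep pages mentioning a term.
    # Needs `pip install warcio`.
    from warcio.archiveiterator import ArchiveIterator

    kept = 0
    with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            if "python" in text.lower():         # stand-in for a real relevance filter
                kept += 1
                print(url)
    print("kept", kept, "pages")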

fabiensanglard 10 hours ago | parent | prev | next [-]

Have you ever tried https://marginalia-search.com ? I love it.

computerex 4 hours ago | parent | prev | next [-]

Kind of. I made ainews247.org, which crawls certain sites and filters the content so it's AI-specific and valuable. I think it's a really good idea.

matsz 10 hours ago | parent | prev | next [-]

You could take a look at the leaked Yandex source code from a few years ago. I'd expect their architecture to be decent enough.

efilife 4 hours ago | parent [-]

Where?

toephu2 9 hours ago | parent | prev [-]

With LLMs, why do you even need a mini-Google?

andai 7 hours ago | parent [-]

For my LLM to use! I want sources, excerpts, cross-referencing...
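
Concretely, all it needs is a local search call that hands back URLs plus quotable excerpts, e.g. against a Whoosh index like the one sketched upthread (same made-up "indexdir"/"url"/"content" names):

    # Hypothetical tool for the LLM: sources plus excerpts it can quote.
    from whoosh import index
    from whoosh.qparser import QueryParser

    def search_tool(q, limit=5):
        ix = index.open_dir("indexdir")
        with ix.searcher() as searcher:
            hits = searcher.search(QueryParser("content", ix.schema).parse(q), limit=limit)
            return [{"url": h["url"],
                     "excerpt": h.highlights("content"),  # Whoosh's built-in snippets
                     "score": h.score} for h in hits]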