bovermyer 6 hours ago

I'm curious about what it would take to build my own "toy" search engine with its own index. Anyone ever tried this?

marginalia_nu 6 hours ago | parent | next [-]

Yeah that's where I started out in 2021. Been at it for almost 5 years now, last three of which full time. I'm indexing about 1.1 billion documents now off a single server.

Hard part is doing it at any sort of scale and producing useful results. It's easy to build something that indexes a few million documents. Pushing into billions is a bigger challenge, as you start needing a lot of increasingly intricate bespoke solutions.
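For a sense of what the "easy" small-scale version looks like, here is a minimal sketch of an in-memory inverted index with AND queries. The names `tokenize`, `build_index`, and `search` are made up for illustration, and this is not how Marginalia is built; it only shows the basic data structure, which stops being enough once the index no longer fits in memory.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # docs maps doc_id -> text; the result maps term -> sorted list of doc ids.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    # Return the ids of documents that contain every term in the query.
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = {
    1: "toy search engine",
    2: "search index at scale",
    3: "toy index",
}
index = build_index(docs)
print(search(index, "toy index"))
```

There is no ranking, no crawling, and no persistence here, which is where most of the real work goes.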

Devlog here:

https://www.marginalia.nu/tags/search-engine/

And search engine itself:

https://marginalia-search.com/

(... though it operates a bit sub-optimally now as I'm using a ton of CPU cores to migrate the index to use postings list compression, will take about 4-5 days I think).
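For anyone unfamiliar with postings list compression: one common textbook approach is to store the gaps between sorted document ids and encode each gap as a variable-length integer. The sketch below assumes that scheme purely for illustration; the comment above doesn't say which encoding is actually being used, and `encode_postings` / `decode_postings` are hypothetical names.

```python
def encode_postings(doc_ids):
    # doc_ids must be sorted, non-negative integers.
    # Each gap is written 7 bits at a time, low bits first;
    # the high bit of a byte means "more bytes follow".
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    # Inverse of encode_postings: rebuild gaps, then running-sum them.
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids


ids = [3, 130, 131, 70000]
packed = encode_postings(ids)
print(len(packed), decode_postings(packed))
```

Small gaps fit in a single byte, so dense postings lists for common terms shrink the most.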

rickette 6 hours ago | parent [-]

Curious what (and how much) hardware you're running this on.

marginalia_nu 6 hours ago | parent [-]

Currently running off

AMD EPYC 7543 x2 for 64 cores/128 threads

512 GB RAM

~ 90 TB of PM9A3 SSDs across 12 physical devices

Storage is not very full though. I'm probably using about a third of it at this point.

Gigachad 6 hours ago | parent | prev | next [-]

Might find YaCy interesting. It’s meant to be a decentralised search engine where users scrape the internet and can search other users' indexes in a kind of torrent-like way.

I found it didn’t really work as a real search engine but it was interesting.

reddalo 6 hours ago | parent | prev [-]

Good luck scraping websites without being blocked, if you're not Google.

marginalia_nu 6 hours ago | parent [-]

Well you'll get blocked some places but it's not too big of a deal. If you're running an above board operation, you can surprisingly often successfully just email the admin explaining what you're doing, and ask to be unblocked.
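One part of running above board that can be shown in code is honouring robots.txt. This is a minimal sketch using Python's standard `urllib.robotparser`; the user agent string and the robots.txt contents are made-up examples, and it says nothing about how any particular crawler actually handles this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name; a real one would usually point at a contact page.
USER_AGENT = "toy-search-bot"


def make_checker(robots_txt):
    # Parse robots.txt text that has already been downloaded.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Example robots.txt content, invented for this sketch.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = make_checker(robots_txt)
allowed = parser.can_fetch(USER_AGENT, "https://example.com/blog/post")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/x")
delay = parser.crawl_delay(USER_AGENT)
print(allowed, blocked, delay)
```

Respecting the rules and the crawl delay won't prevent every block, but it makes the "please unblock me" email an easier sell.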

BolsunBacset 3 hours ago | parent [-]

Sounds very time consuming. Glad you're able to sustain yourself to be able to do it full time.