marginalia_nu 8 hours ago

The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].

The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

A DBMS-backed approach breaks down surprisingly fast. It's probably perfectly fine if you're indexing your own website, but it will likely choke on something the size of English Wikipedia.

[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/
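
To illustrate how simple the core idea is, here is a minimal sketch of an in-memory inverted index with AND queries. The whitespace tokenizer, the function names (`tokenize`, `build_index`, `search`) and the sample documents are all assumptions made for illustration; this is not how any particular engine does it, and it ignores ranking and scale entirely.

```python
from collections import defaultdict

def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    # Map each term to the set of document IDs containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: a document must contain every query term.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample corpus.
docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_index(docs)
print(search(index, "quick dog"))
```

Everything hard about search starts where this sketch stops: the index no longer fits in memory, and a query like "dog" matches far more documents than anyone wants to read.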

zipy124 3 hours ago | parent | next [-]

I think in today's world the harder problem is evading SEO spam. A search engine is in a constant war with adversarial players, who need you to see their content for revenue rather than the actual answer.

This necessitates a constant game of cat and mouse, where you adjust your quality metric so SEO shops can't figure it out and capitalise on it.

HEmanZ an hour ago | parent | next [-]

There are more kinds of search engines than just internet search engines. At this point I’m almost certain that the non-internet search engines of the world are much larger than internet search engines.

zppln 2 hours ago | parent | prev | next [-]

I feel at this point you'd almost be better off hand-curating a set of domains and only crawling those.

jayd16 2 hours ago | parent | prev [-]

I wonder how hard it is when mice are not paying the cat to serve ads.

djoldman 5 hours ago | parent | prev | next [-]

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

Large amounts of data seem obviously difficult.

For your second difficulty, "handling underspecified queries": it seems to me that's a subset of the problem of, "given a query, what are the most relevant results?" That problem seems very tricky, partially because there is no exact true answer.

Marginalia Search is great as a contrast to engines like Google, in part because Google chooses to display advertisements as the most relevant results.

Have you found any of the TREC papers helpful?

https://trec.nist.gov/

mapt 5 hours ago | parent | prev | next [-]

What is the order of magnitude of the largest document store that you can practically work from SQLite on a single thousand-dollar server run by some text-heavy business process? For text search, roughly how big of a corpus can we practically search if we're occupying... let's say five seconds per query, twelve queries per minute?

marginalia_nu 4 hours ago | parent [-]

If you held a gun to my head and forced me to make a guess, I'd say you could push that approach to the order of 100K, maybe 1M documents.

If sqlite had a generic "strictly ascending sequence of integers" type[1] and would optimize around that, you could probably push it farther in terms of implementing efficient inverted indexes.

[1] primary key tables aren't really useful here.
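
As a rough sketch of why a strictly ascending integer sequence matters for inverted indexes: if each term's posting list is a sorted list of document IDs, it can be stored compactly as gaps and intersected with a linear merge. The function names and the example posting lists below are hypothetical, and nothing here reflects how SQLite stores data internally.

```python
def delta_encode(doc_ids):
    # Store the gap to the previous ID; gaps are small positive integers
    # when IDs are strictly ascending, so they compress well.
    prev = 0
    gaps = []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    # Rebuild the original ascending IDs with a running sum.
    total = 0
    doc_ids = []
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def intersect(a, b):
    # Linear merge of two ascending lists: O(len(a) + len(b)),
    # with no hashing or random access required.
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical posting lists for two terms.
postings_cat = [3, 7, 8, 15, 42]
postings_dog = [7, 9, 15, 40, 42, 99]

print(delta_encode(postings_cat))             # [3, 4, 1, 7, 27]
print(intersect(postings_cat, postings_dog))  # [7, 15, 42]
```

A general-purpose table gives you neither property for free, which is roughly the limitation described above.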

HelloUsername 6 hours ago | parent | prev | next [-]

I love your https://marginalia-search.com :)

marginalia_nu 6 hours ago | parent [-]

"Building A Complex Search Engine That Works Sometimes"

moffkalast 5 hours ago | parent [-]

15% of the time it works every time.

gcanyon 5 hours ago | parent | prev | next [-]

> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.

I would expect the difficulty to be deciding which item to return when there are multiple that contain the search term. Is Wikipedia's article on Gilligan's Island better than some guy's blog post? Or is that guy a fanatic who has spent his entire life pondering whether Wrongway Feldman was malicious or how Irving met Bingo Bango and Bongo?

Add in rank hacking, keyword stuffing, etc. and it seems like a very hard problem, while scaling... is scaling? ¯\_(ツ)_/¯

marginalia_nu 5 hours ago | parent | next [-]

That would be the "handling underspecified queries" thing I mentioned.

dumbfounder 5 hours ago | parent | prev [-]

Elastic and many others fail to solve this problem too. There are many different strategies and many of them require ingenuity and development.

jonstewart 5 hours ago | parent [-]

It’s not like ElasticSearch lacks ranking algorithms and control thereof. But it can require tuning and adjustment for various domains. Relevancy is, after all, subjective.

submeta 8 hours ago | parent | prev | next [-]

Thank you very much for the recommendation. I am in the process of building knowledge base bots, and am confronted with the task of creating various crawlers for the different sources the company has. And this book comes in very handy.
