RA_Fisher 6 days ago

I’m sure search experts would disagree, because they’d be admitting their technology is inferior to another. BM25 is the workhorse, no doubt, but it’s also not the best anymore. Vectors are a step toward learning models, but only a small mid-range step vs. an explicit model.

Search is a useful approach for computing learning models, but there’s a difference between the computational means and the model. For example, MIPS is a very useful search algo for computing learning models (but first the learning model has to be formulated).
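For readers unfamiliar with the term, MIPS (maximum inner product search) just means "return the stored vectors whose inner product with the query vector is largest". A minimal brute-force sketch follows; the function name `mips`, the item names, and the vectors are made up for illustration, and real systems typically use approximate indexes rather than scanning every item.

```python
def mips(query, items, top_k=1):
    """Brute-force maximum inner product search over a dict of name -> vector."""
    scored = []
    for name, vec in items.items():
        # Inner product of the query with this item's vector.
        inner = sum(q * x for q, x in zip(query, vec))
        scored.append((inner, name))
    # Largest inner product first; ties broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]

# Hypothetical item vectors (in practice these come from a learned model).
items = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 2.0]}
top = mips([1.0, 3.0], items, top_k=2)
print(top)  # ['c', 'b']
```

Note that this only covers the computational step; where the vectors come from (the model) is a separate question, which is the distinction drawn above.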

dtaivpp 6 days ago | parent | next [-]

I have been summoned. Hey, it's David from the podcast. As someone who builds search for users every day and shaped the user experience for vector search at OpenSearch, I assure you no one is afraid of their technology becoming inferior.

There are two components of search that are really important for understanding why BM25 will (likely) not go away for a long time. The first is precision and the second is recall. Precision measures what fraction of the returned results are relevant. A completely precise search would return only the relevant results and no irrelevant results.

Recall, on the other hand, measures what fraction of all the relevant results were returned. For example, if our search returns only 5 results (all of them relevant) but we know that there were 10 relevant search results that should have been returned, we would say the recall is 50%.
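The two definitions can be sketched in a few lines. This is only an illustration: the helper name `precision_recall` and the document IDs are made up, and it assumes the 5 returned results in the example are all relevant.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query."""
    returned_set = set(returned)
    relevant_set = set(relevant)
    hits = returned_set & relevant_set  # relevant results that were returned
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# 10 relevant documents exist; the search returns 5, all of them relevant.
relevant = [f"doc{i}" for i in range(10)]
returned = relevant[:5]
precision, recall = precision_recall(returned, relevant)
print(precision, recall)  # 1.0 0.5
```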

These two components are always at odds with each other. Vector search excels at increasing recall. It is able to find documents that are semantically similar. The problem with this is semantically similar documents might not actually be what the user is looking for. This is because vectors are only a representation of user intent.

Here's an example: a user looks up "AWS Config". Vector search would read this and may rate it as similar to ["amazon web services configuration", "cloud configuration", "infrastructure as a service setup"]. In this case the user was looking for a file called "AWS.config". Vector search is inherently imprecise. It is getting better, but it's not replacing BM25 as a scoring mechanism any time soon.
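To make the lexical side of that example concrete, here is a rough BM25 sketch over the same four strings. It is a toy, not how any particular engine implements it: the tokenizer, the function names, the Lucene-style IDF variant, and the defaults `k1=1.2` and `b=0.75` are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit,
    # so "AWS.config" and "AWS Config" both become ["aws", "config"].
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in a non-empty list against the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no exact term match, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "AWS.config",
    "amazon web services configuration",
    "cloud configuration",
    "infrastructure as a service setup",
]
scores = bm25_scores("AWS Config", docs)
best = docs[scores.index(max(scores))]
print(best)  # AWS.config
```

With this tokenizer only the file name shares exact terms with the query, so the semantically similar phrases score zero. That is the precision behavior being described; the flip side is that it would also miss a relevant document that never uses the literal terms.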

You don't have to believe me though. Weaviate, Vespa, and Qdrant all support BM25 search for a reason. Here is an in-depth blog post that dives more into hybrid search: https://opensearch.org/blog/hybrid-search/

As an aside, vector search is also much more expensive than BM25. It's very hard to scale and get precise results.

RA_Fisher 6 days ago | parent [-]

Hi David. Nice to meet you. Yes, precision and recall are always in tension. However, both can be made simultaneously better with a more informed model. Using your example, this would be a model that encodes the concept of files in the context of a user demand surrounding AWS.

iosjunkie 6 days ago | parent [-]

"more informed model"

Can you be specific on what you recommend instead of BM25?

RA_Fisher 6 days ago | parent [-]

Sure, this chat describes what I have in mind, https://chatgpt.com/share/673e9290-a044-8005-995b-166efe653e...

softwaredoug 6 days ago | parent | prev | next [-]

I don't know a lot of search practitioners who don't want to use the "new sexy" thing. Most of us do a fair amount of "resume driven development" so we can claim to be "AI Engineers" :)

RA_Fisher 6 days ago | parent [-]

I don’t think it’s realistic to think that software engineers can pick up advanced statistical modeling on the job, unless they’re pairing with statisticians. There’s just too much background involved.

softwaredoug 6 days ago | parent | next [-]

The "search practitioners" I'm referring to are pretty uniformly ML Engineers. They also work on feeds, recommendations, and adjacent Information Retrieval spaces, both to generate L0 retrieval candidates and to do higher layers of reranking with learning to rank and other systems, toward whatever the system's goal is...

You can decide if you agree that most people are sufficiently statistically literate in that group of people. But some humility around statistics is probably far up there in what I personally interview for.

RA_Fisher 6 days ago | parent [-]

For sure. There are ML folks with statistical learning backgrounds, but they tend to be relatively rare. Physics and CS are more common. They tend to view things the way you describe, more procedurally (e.g., learning to rank, minimizing distances), with less statistical modeling. Humility around statistics is good, but statistical knowledge is still what's required to really level up these systems (I've built them as well).

binarymax 6 days ago | parent | prev [-]

Your overall condescending attitude in this thread is really disgusting.

RA_Fisher 6 days ago | parent [-]

Statisticians are famously disliked, especially by engineers (there are open-minded folks, of course! maybe they’d taken some econometrics or statistics, are exceptionally humble, etc). There are some interesting motives and incentives around that. Sometimes I think in part it’s because many people would prefer their existing beliefs be upheld as opposed to challenged, even if they’re not well-supported (and likely to lead to bad decisions and outcomes). Sticking with outdated technology is one example.

simplecto 6 days ago | parent | prev [-]

It seems that the current mode (e.g. fashion) is a hybrid approach, with vector results on one side, BM25 on the other, and then a re-rank algo to smooth things out.

I'm out of my depth here but genuinely interested and curious to see over the horizon.
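One common way to combine the two result lists in a hybrid setup is reciprocal rank fusion (RRF). This is a sketch of that one technique, not necessarily what any given engine does (others normalize and blend the raw scores instead). The document IDs are invented, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical top-3 results from each retriever for the same query.
bm25_ranking = ["aws_config_file", "cloud_setup_guide", "iam_policy_doc"]
vector_ranking = ["cloud_setup_guide", "aws_services_overview", "aws_config_file"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
# ['cloud_setup_guide', 'aws_config_file', 'aws_services_overview', 'iam_policy_doc']
```

Documents that both retrievers rank well float to the top, while a document found by only one side still makes the list. RRF uses only rank positions, so the BM25 and vector scores never need to be on the same scale.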

RA_Fisher 6 days ago | parent | next [-]

Best is to use one statistical model and encode the underlying aspects of the context that relate to goal outcomes.

authorfly 6 days ago | parent | prev [-]

Out of interest how come you use the word "mode" here?

simplecto 6 days ago | parent [-]

because the space moves fast, and from my learning this is the current thing. Like fashion -- it changes from season to season

authorfly 4 days ago | parent [-]

Oh right, I just wondered if it was a loan word from German. I am hearing it more and more in English.