marginalia_nu 5 days ago

Optimizing the Marginalia Search index code. The new code is at least twice as fast in benchmarks, but I can't run it in production: there it turns out to be four times slower than what came before on exactly the queries that are the simplest and fastest, to the point where they exceed their timeout values by a wide margin.

I'm 97% certain this is because the faster code leads to more page thrashing in the mmap-based index readers. I'm gonna have to implement my own buffer pool and manage my reads directly like that vexatious paper[1] said all along.

[1] https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf
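
Roughly the shape of thing I'd have to build, as a minimal sketch: explicit positional reads through a FileChannel with LRU eviction, instead of letting the kernel page cache decide. Hypothetical names, nothing like the actual code yet:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Toy page cache: explicit positional reads instead of mmap,
    // with LRU eviction capped at a fixed number of resident pages.
    class BufferPool {
        private static final int PAGE_SIZE = 8192;
        private final FileChannel channel;
        private final LinkedHashMap<Long, ByteBuffer> cache;

        BufferPool(FileChannel channel, int maxResidentPages) {
            this.channel = channel;
            // accessOrder=true turns LinkedHashMap into an LRU list
            this.cache = new LinkedHashMap<Long, ByteBuffer>(maxResidentPages, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, ByteBuffer> e) {
                    return size() > maxResidentPages;
                }
            };
        }

        /** Returns the page containing the given file offset, reading it on demand. */
        synchronized ByteBuffer get(long fileOffset) throws IOException {
            long pageNo = fileOffset / PAGE_SIZE;
            ByteBuffer page = cache.get(pageNo);
            if (page == null) {
                page = ByteBuffer.allocateDirect(PAGE_SIZE);
                channel.read(page, pageNo * PAGE_SIZE); // positional read, no mmap
                page.flip();
                cache.put(pageNo, page);
            }
            return page.duplicate(); // caller gets independent position/limit
        }
    }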

apavlo 5 days ago | parent | next [-]

> I'm gonna have to implement my own buffer pool and manage my reads directly like that vexatious paper[1] said all along.

You make it sound like I was trying to troll everyone when we wrote that paper. We were warning you.

marginalia_nu 5 days ago | parent | next [-]

It's annoying because it's right and also describes the exact type of paradoxical performance reversal I'm seeing. (It's also great because it describes the exact type of paradoxical performance reversal I'm seeing, which likely saves me a lot of head scratching ;-))

n_u 4 days ago | parent | prev | next [-]

Which part of the index are you putting in the buffer pool here? The postings list, the doc store or the terms dict?

Is it being cached for future queries or are you just talking about putting it in memory to perform the computation for a query?

marginalia_nu 4 days ago | parent [-]

I'm primarily looking at document lists and possibly the keyword-documents mapping.

Caching will likely be tuned toward the operation itself: since this isn't a general-purpose DBMS, I can fairly accurately predict which pages will be useful to cache, and when read-ahead is likely to be fruitful, based on the operation being performed.

For keyword-document mappings, some LRU cache scheme is likely a good fit. When reading a list of documents, read-ahead is good (and I can inform the pool of how far to read ahead). When intersecting document lists, I can also generally predict when pages are likely to be re-read or needed again based on the position in the tree.

Will definitely need a fair bit of tuning, but overall the problem is greatly simplified by revolving around very specific types of access patterns. Roughly the shape of interface I have in mind is sketched below.
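
(Hypothetical sketch; the names are made up and this isn't the actual code:)

    import java.io.IOException;
    import java.nio.ByteBuffer;

    // Hypothetical sketch: each index operation tells the pool what its
    // access pattern will be, instead of the pool guessing like the kernel does.
    interface IndexBufferPool {
        /** Fetch the page containing fileOffset, reading it in if absent. */
        ByteBuffer page(long fileOffset) throws IOException;

        /** Document-list scan: prefetch the next numPages pages. */
        void hintReadAhead(long fromOffset, int numPages);

        /** Intersection: pin pages near the root of the tree, which get
            revisited on every descent, so they survive eviction mid-query. */
        void pin(long fileOffset);
        void unpin(long fileOffset);
    }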

n_u 4 days ago | parent [-]

Ah interesting. Is your keyword-document map (aka term dict) too big to keep in memory permanently? My understanding is that at Google they just keep it in memory on every replica.

Edit: I should specify they shard the corpus by document so there isn't a replica with the entire term dict on it.

marginalia_nu 4 days ago | parent [-]

Could plausibly fit in RAM; it's only like ~100 GB in total. We'll see; I'll probably keep it mmapped at first to see what happens. It isn't the target of very many queries (relatively speaking) at any rate, so either way is probably fine.

n_u 4 days ago | parent [-]

> It isn't the target of very many queries (relatively speaking)

Wow, why is that? Do you use a vector index primarily?

marginalia_nu 4 days ago | parent [-]

No, I mean that for every search query there is a step that maps keywords to trees of documents, and then there are dozens if not hundreds of queries into the latter in order to intersect document lists.
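
Illustratively, the hot operation looks something like this (sketch only, not the actual code):

    // Sketch of the basic operation: intersecting two sorted document-id
    // lists, repeated dozens/hundreds of times per search query.
    class DocListIntersection {
        static int intersect(long[] a, long[] b, long[] out) {
            int i = 0, j = 0, n = 0;
            while (i < a.length && j < b.length) {
                if (a[i] < b[j]) i++;            // advance the smaller cursor
                else if (a[i] > b[j]) j++;
                else { out[n++] = a[i++]; j++; } // doc present in both lists
            }
            return n; // number of documents present in both lists
        }
    }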

n_u 4 days ago | parent [-]

Ah I see. I thought by query you meant "user search query". I'm guessing by query here you mean "read".

winrid 5 days ago | parent | prev | next [-]

This is basically what everyone who tries to use mmap in their database eventually realizes.

james_chu 5 days ago | parent | prev [-]

What kind of structure does Marginalia Search help people to obtain?