Remix.run Logo
n_u 4 days ago

Which part of the index are you putting in the buffer pool here? The postings list, the doc store or the terms dict?

Is it being cached for future queries or are you just talking about putting it in memory to perform the computation for a query?

marginalia_nu 4 days ago | parent [-]

I'm primarily looking at document lists and possibly the keyword-documents mapping.

Caching will likely be fairly tuned toward the operation itself, since it's not a general-purpose DBMS and I can fairly accurately predict which pages will likely be useful to cache or when read-ahead is likely to be fruitful based on the operation being performed.

For keyword-document mappings some LRU cache scheme is likely a good fit, when reading a list of documents readahead is good (and I can inform the pool of how far to read ahead), when intersecting document lists I can also generally predict when pages are likely to be re-read or needed in the future based on the position in the tree.

Will definitely need a fair bit of tuning but overall the problem is greatly simplified by revolving around very specific types of access patterns.

n_u 4 days ago | parent [-]

Ah interesting. Is your keyword-document map (aka term dict) too big to keep in memory permanently? My understanding is that at Google they just keep it in memory on every replica.

Edit: I should specify they shard the corpus by document so there isn't a replica with the entire term dict on it.

marginalia_nu 4 days ago | parent [-]

Could plausibly fit in RAM, is only like ~100 GB in total. We'll see, will probably keep it mmap:ed at first to see what happens. It isn't the target of very many queries (relatively speaking) at any rate so either way is probably fine.

n_u 4 days ago | parent [-]

>It isn't the target of very many queries (relatively speaking)

Wow why is that? Do you use a vector index primarily?

marginalia_nu 4 days ago | parent [-]

No I mean for every query there is mapping up keywords to trees of documents, there is dozens if not hundreds of queries in the latter in order to intersect document lists.

n_u 4 days ago | parent [-]

Ah I see. I thought by query you meant "user search query". I'm guessing by query here you mean "read".