marginalia_nu 4 days ago

I'm primarily looking at document lists and possibly the keyword-documents mapping.

Caching will likely be tuned toward the operation itself: since it's not a general-purpose DBMS, I can fairly accurately predict which pages will be useful to cache, and when read-ahead is likely to be fruitful, based on the operation being performed.

For keyword-document mappings, some LRU cache scheme is likely a good fit; when reading a list of documents, read-ahead is good (and I can inform the pool of how far to read ahead); and when intersecting document lists, I can generally predict, based on the position in the tree, which pages are likely to be re-read or needed in the future.
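
To make that concrete, here is a minimal sketch of what such an operation-aware page pool could look like: an LRU map plus a caller-supplied read-ahead hint. All the names here (PagePool, hintReadAhead, etc.) are illustrative assumptions, not the actual implementation.

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch of an operation-aware page pool: a plain LRU
    // map plus a read-ahead hint the caller sets when it knows it will
    // scan a document list sequentially.
    class PagePool {
        private static final int PAGE_SIZE = 4096;
        private final int capacity;
        private int readAheadPages = 0; // how far to prefetch on a miss

        // accessOrder=true makes LinkedHashMap behave as an LRU cache
        private final Map<Long, byte[]> cache =
            new LinkedHashMap<>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, byte[]> e) {
                    return size() > capacity;
                }
            };

        PagePool(int capacity) { this.capacity = capacity; }

        // The caller declares how far sequential read-ahead should reach,
        // e.g. the known length of the document list about to be scanned.
        void hintReadAhead(int pages) { this.readAheadPages = pages; }

        byte[] readPage(long pageNo) {
            byte[] page = cache.get(pageNo);
            if (page == null) {
                page = loadFromDisk(pageNo);
                cache.put(pageNo, page);
                // Prefetch the pages the caller said it will need next.
                for (int i = 1; i <= readAheadPages; i++) {
                    cache.computeIfAbsent(pageNo + i, this::loadFromDisk);
                }
            }
            return page;
        }

        private byte[] loadFromDisk(long pageNo) {
            // Stand-in for a positional read of PAGE_SIZE bytes at
            // pageNo * PAGE_SIZE; real code would use a FileChannel.
            return new byte[PAGE_SIZE];
        }
    }

The point of the hint is that the caller already knows the access pattern (e.g. the length of the list it's about to scan), which a general-purpose buffer pool would have to guess.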

Will definitely need a fair bit of tuning, but overall the problem is greatly simplified by revolving around a few very specific access patterns.

n_u 4 days ago

Ah interesting. Is your keyword-document map (aka term dict) too big to keep in memory permanently? My understanding is that at Google they just keep it in memory on every replica.

Edit: I should specify they shard the corpus by document so there isn't a replica with the entire term dict on it.

marginalia_nu 4 days ago

Could plausibly fit in RAM; it's only ~100 GB in total. We'll see, will probably keep it mmap:ed at first and see what happens. It isn't the target of very many queries (relatively speaking) at any rate, so either way is probably fine.
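
For reference, a Java sketch of what "keep it mmap:ed and see what happens" might look like. A MappedByteBuffer tops out at 2 GB, so this uses the newer MemorySegment mapping, which can cover a ~100 GB file; the file name and record layout are invented for illustration.

    import java.io.IOException;
    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Map the whole term dictionary read-only and let the kernel's page
    // cache decide what stays resident. Layout is hypothetical: one
    // 8-byte document-list offset per term slot.
    class TermDict {
        private final MemorySegment map;

        TermDict(Path file, Arena arena) throws IOException {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                // Pages fault in on first access and get evicted under
                // memory pressure, so ~100 GB need not all fit in RAM.
                map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            }
        }

        long docListOffset(long slot) {
            return map.get(ValueLayout.JAVA_LONG, slot * Long.BYTES);
        }
    }

Usage would be something like `try (Arena arena = Arena.ofShared()) { var dict = new TermDict(Path.of("terms.dat"), arena); ... }`. If residency turns out to matter, the mapping can later be swapped for an explicit in-memory copy without changing callers.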

n_u 4 days ago

>It isn't the target of very many queries (relatively speaking)

Wow, why is that? Do you use a vector index primarily?

marginalia_nu 4 days ago

No, I mean that for every search query there's a single step of mapping keywords to trees of documents, but then dozens if not hundreds of queries against those trees in order to intersect the document lists.
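
As a toy illustration of why the intersection step dominates, here is the classic two-pointer intersection over sorted document ids; in the on-disk version each pointer advance can translate into a page read.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: intersect two sorted document-id lists with a
    // two-pointer merge. In the real index these ids live in on-disk
    // trees, so each advance may trigger a page read, which is why
    // intersection generates far more reads than the term-dict lookup.
    class Intersect {
        static List<Long> intersect(long[] a, long[] b) {
            List<Long> out = new ArrayList<>();
            int i = 0, j = 0;
            while (i < a.length && j < b.length) {
                if (a[i] < b[j]) i++;
                else if (a[i] > b[j]) j++;
                else { out.add(a[i]); i++; j++; }
            }
            return out;
        }
    }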

n_u 4 days ago

Ah I see. I thought by query you meant "user search query". I'm guessing by query here you mean "read".