Remix.run Logo
the_precipitate 4 days ago

To really appreciate inverted indexes, it’s worthwhile to study ISR (Inverted Stream Readers), a concept introduced by the great Mike Burrows. It’s also worth exploring encoding techniques like PForDelta. These elegant ideas demonstrate how true systems design masters can distill complex concepts into simple, powerful abstractions.

Edit: I stand corrected: it's called index stream readers (thanks atombender for pointing this out). For those who knows Mike Burrows only for the Burrows-Wheeler transformation (BZip), you might also want to know that he was also one of the main developers of AltaVista, the first real search engine for the internet. He also designed the early versions of Bing search engine. Eventually he worked for Google and designed their lock service called Chubby.

marginalia_nu 4 days ago | parent | next [-]

Another very nice algorithm in the space is this one[1] for intersecting postings lists in sublinear time generally with very good cache characteristics to boot. Works with tree-based indexes as well as skip lists (though a more modern design might also use simple bloom filters to go with the skip pointers).

[1] https://nlp.stanford.edu/IR-book/html/htmledition/faster-pos...

atombender 4 days ago | parent | prev [-]

I think you're thinking of index stream readers?

mrkeen 4 days ago | parent [-]

I have heard of neither. But the mention of Burrows leads me to Burrows-Wheeler, which is a compression algorithm (bzip).

I'm not 100% but I don't think you can directly query a BWT in the same way you'd query an inverted index (without the later discovery of wavelet trees and FM-indexes / succinct data structures, and all that jazz.) And that's mostly for genomics? Not sure if it applies to plain old document searches. Would love to be corrected though.

lazamar 4 days ago | parent [-]

At Meta they are using FM indexes to power text search through the entire commit history of their monorepo.