Remix.run Logo
adamzwasserman 2 days ago

The "no sharing between filters" insight clicked for me on a different problem.

I needed to filter items by tags. Bloom filter per item seemed clever - quick membership checks. But with thousands of items sharing dozens of tags, each filter re-encodes the same vocabulary. Pure waste.

Switched to an inverted index (tag → item list) with bloom filters per chunk of the index. Now the tag vocabulary is shared, and bloom filters just speed up chunk-skipping when the index grows large.

TFA's mistake is using bloom filters -instead- of an inverted index rather than on top of one. The amortization patterns stack, they don't compete.

hinkley 2 days ago | parent [-]

Why do these “inverted indexes” just look like indexes to me? Too much time with databases perhaps?

farsa 2 days ago | parent | next [-]

The distinction is more clear when indexing actual text and applying tokenization. A "typical" index on a database column goes like "column(value => rows)". When people mention inverted indexes its usually in the context of full text search, where "column value" usually goes through tokenization and you build an index for all N tokens of a column "column:(token 1 => rows)", "column:(token 2 => rows)",... "column:(token N => rows)".

adamzwasserman 2 days ago | parent | prev [-]

A non-unique index, yes.

hinkley 2 days ago | parent [-]

Which is most indexes.

adamzwasserman a day ago | parent [-]

again, I agree