pauldix 2 days ago

I believe you could do this effectively with COBS (Compact Bit-Sliced Signature Index): https://panthema.net/2019/1008-COBS-A-Compact-Bit-Sliced-Sig...

It's a pretty neat algorithm from a 2019 paper, designed "to index k-mers of DNA samples or q-grams from text documents". You take a collection of Bloom filters, one built per document, and combine them into a single index that, given a query term, tells you which documents probably contain it (Bloom filters can give false positives but never false negatives). Like an inverted index meets a Bloom filter.
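To make the idea concrete, here is a rough Python sketch of the bit-slicing trick as I understand it. This is not the COBS code or its on-disk layout; the class name `BitSlicedIndex`, the parameters, and the choice of BLAKE2b for hashing are all made up for illustration. Each row holds one bit position across every document's Bloom filter, so a query ANDs a few rows together and reads off the matching document ids.

```python
import hashlib


class BitSlicedIndex:
    """Toy bit-sliced signature index (hypothetical, for illustration only)."""

    def __init__(self, num_docs, num_bits=1024, num_hashes=3):
        self.num_docs = num_docs
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # rows[i] is an integer bitmask: bit d is set when document d's
        # Bloom filter has bit position i set.
        self.rows = [0] * num_bits

    def _positions(self, term):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                term.encode("utf-8"), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, doc_id, term):
        # Set this document's bit in every row the term hashes to.
        for pos in self._positions(term):
            self.rows[pos] |= 1 << doc_id

    def query(self, term):
        # AND the selected rows: a document survives only if all of the
        # term's bit positions are set in its filter.
        mask = (1 << self.num_docs) - 1
        for pos in self._positions(term):
            mask &= self.rows[pos]
        return [d for d in range(self.num_docs) if (mask >> d) & 1]


# Index the 4-grams of three short example strings.
docs = ["ACGTACGT", "TTGACCA", "ACGTTTGA"]
index = BitSlicedIndex(num_docs=len(docs))
for doc_id, text in enumerate(docs):
    for i in range(len(text) - 3):
        index.add(doc_id, text[i:i + 4])

# Candidate documents for a 4-gram; may include false positives.
candidates = index.query("ACGT")
```

The result is a candidate list: every document that really contains the term is in it, but a document can show up by accident, so you would verify against the actual data if you need exact answers.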

I'm using it in a totally different domain for an upcoming release of InfluxDB (a time series database).

There's also code online here: https://github.com/bingmann/cobs