Remix.run Logo
kwillets 5 days ago

Some folks I work with have a terabyte of short string records that they regularly scan with regexes to develop classification criteria; it's a prime application for a substring index that can accelerate their queries, but the scale is daunting.

I came up with a suffix-sorting index for this domain that's interestingly simple. Most algos for this use a generalized suffix tree that's built by concatenating all the strings into one giant string and feeding it into a conventional suffix sort, but that has some big constants on the indexing throughput, due I think to the overhead of handling one giant string instead of a bunch of small independent records.

In the latter case, by making the structure slightly simpler and search slightly harder, I can get indexing throughput in the GBps, at least for the sorting part.

The output of that in its simplest form is a 4n or 8n-sized set of int's, but it can be fed further into a compressed rank/select data structure for various space/indexing time/retrieval time tradeoffs, and I don't think those are slow (eg Roaring Bitmaps)

I'll post this on show HN if anybody's interested; I'm still writing up the details, as I've barely gotten the POC code working.