bob1029 4 days ago

Combining n-grams and bitmap indexes can give you most of the magic in the space.

I've been working on an architecture for code search that relies on a repo-level trigram index plus a per-repo FM-index over the actual code. The trigram index is used to look up, for each search term, the bitmap of repos that contain it. These bitmaps are then ANDed together to produce the final list of repos to search via FM-index.
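
A rough sketch of that two-stage idea, not the actual implementation: every name here (`RepoTrigramIndex`, `add_repo`, `candidates`, `search`) is made up for illustration, Python integers stand in for real compressed bitmaps, and a plain substring check stands in for the per-repo FM-index.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class RepoTrigramIndex:
    """Hypothetical repo-level index: trigram -> bitmap of repo ids."""

    def __init__(self):
        self.repo_names = []             # repo id -> repo name
        self.bitmaps = defaultdict(int)  # trigram -> int used as a bitmap
        self.contents = {}               # stand-in for the per-repo FM-index

    def add_repo(self, name, text):
        repo_id = len(self.repo_names)
        self.repo_names.append(name)
        self.contents[name] = text
        for gram in trigrams(text):
            # Set bit repo_id in the bitmap for this trigram.
            self.bitmaps[gram] |= 1 << repo_id

    def candidates(self, needle):
        """AND the bitmaps of every trigram in the needle."""
        # Start with all repos; a needle shorter than 3 characters has no
        # trigrams, so nothing can be filtered out in that case.
        mask = (1 << len(self.repo_names)) - 1
        for gram in trigrams(needle):
            mask &= self.bitmaps.get(gram, 0)
        return [name for i, name in enumerate(self.repo_names)
                if (mask >> i) & 1]

    def search(self, needle):
        """Verify candidates with an exact search (FM-index stand-in)."""
        return [name for name in self.candidates(needle)
                if needle in self.contents[name]]
```

The bitmap stage can return false positives (a repo containing "abc bcd" has every trigram of "abcd" without containing "abcd"), which is why the exact per-repo search still runs on the surviving candidates.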

lazamar 4 days ago | parent

Why not search the FM-indexes directly? It is faster than the n-gram search and you can use the exact full text of the needle.

bob1029 4 days ago | parent

If you have millions of them, searching every one of them on every query could become a problem.