Remix.run Logo
radiator 7 months ago

Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers, e.g. for synonyms, snowball, but written in C?

rogerbinns 7 months ago | parent [-]

SQLite itself is in C so you can use the API directly https://www.sqlite.org/fts5.html#custom_tokenizers

The text is in UTF8 bytes so any C code would have to deal with that and mapping to Unicode codepoints, plus lots of other text processing so some kind of library would also be needed. I don't know of any.