jankovicsandras 7 days ago

Shameless plug:

https://github.com/jankovicsandras/plpgsql_bm25

https://github.com/jankovicsandras/bm25opt

softwaredoug 6 days ago

If we're shamelessly plugging passion projects, SearchArray is a pandas extension for full-text (BM25) search, meant for dorking around with things in Google Colab:

https://github.com/softwaredoug/searcharray

I'll also plug Xing Han Lu's BM25S which is very popular with similar goals:

https://github.com/xhluca/bm25s

mark_l_watson 6 days ago

Thanks. Yesterday I was thinking of adding BM25 to a little side project, so this is a well-timed plug!

Do you know of any pure Python wrapper projects for managing large numbers of text and PDF documents? I thought of using Solr or Elasticsearch, but that seems too heavyweight for what I am doing. I am considering SQLite with pysqlite3 and PyPDF2, since SQLite's FTS5 extension supports BM25 ranking. Sorry to be off topic, but I imagine many people are looking at tools for building hybrid BM25 / vector store / LLM applications.
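
For anyone following along, here is a minimal sketch of the SQLite route using only the standard-library sqlite3 module. It assumes the underlying SQLite build has FTS5 compiled in (common, but not guaranteed). The table name docs, the column names, and the sample rows are made up for illustration, and PDF text extraction is assumed to have happened beforehand.

```python
import sqlite3

# Hypothetical sample data: (path, already-extracted text) pairs.
SAMPLE = [
    ("notes/a.txt", "BM25 ranking in SQLite full text search"),
    ("notes/b.txt", "Vector stores and embeddings for LLM applications"),
    ("notes/c.txt", "Hybrid search combines BM25 with vector similarity"),
]


def build_index(docs):
    """Create an in-memory FTS5 table from (path, body) pairs."""
    con = sqlite3.connect(":memory:")
    # path is stored but not indexed; only body is searchable.
    con.execute("CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, body)")
    con.executemany("INSERT INTO docs(path, body) VALUES (?, ?)", docs)
    return con


def search(con, query, limit=5):
    """Return (path, score) rows for an FTS5 query, best match first."""
    # FTS5's bm25() gives better matches numerically lower values,
    # so ascending order puts the best match first.
    sql = (
        "SELECT path, bm25(docs) AS score FROM docs "
        "WHERE docs MATCH ? ORDER BY score LIMIT ?"
    )
    return con.execute(sql, (query, limit)).fetchall()
```

Calling search(build_index(SAMPLE), "vector") should return the two documents that mention "vector". Swapping the in-memory database for a file path keeps the index between runs.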

rogerbinns 6 days ago

My project APSW may have exactly what you need. It wraps SQLite, providing a Python API, and that includes the FTS5 full text search functionality: https://rogerbinns.github.io/apsw/textsearch.html

You can store your text and PDFs in SQLite (or their filenames) and use the FTS5 infrastructure to do tokenization, query execution, and ranking. You can write your own tokenizer in Python, as well as ranking functions. A pure Python tokenizer for HTML is included, as well as a pure Python implementation of BM25.
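
To make the ranking part concrete, here is a rough sketch of BM25 scoring in plain Python. This is not APSW's implementation; it is a generic Okapi BM25 with the "log(1 + ...)" IDF variant that keeps scores non-negative, and the function name and the defaults k1=1.2, b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document (a list of terms) against query_terms.

    Higher is better here. Assumes docs is non-empty and not all documents
    are empty, otherwise the average length would be zero.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Note the sign convention differs from SQLite's built-in bm25() function, where better matches get lower values.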

You can chain tokenizers so it is just a few lines of code to call pypdf's extract_text method, and then have the bundled UnicodeWords tokenizer properly extract tokens/words, and Simplify to do case folding and accent stripping if desired.
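
As a rough illustration of what such a chain does, here is a standard-library sketch: split text into words, then case fold and strip accents from each one. It does not use APSW's API; the function names are hypothetical, and the \w+ regex is only a crude stand-in for proper Unicode word segmentation.

```python
import re
import unicodedata


def simplify(token):
    """Case fold a token and strip its accents (combining marks)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def words(text):
    """Very rough word splitter; expects precomposed (NFC) input."""
    return re.findall(r"\w+", text)


def tokenize(text):
    """Chain the two stages: word extraction first, then simplification."""
    return [simplify(w) for w in words(text)]
```

In a real pipeline the input to tokenize would be whatever the PDF text extraction step returned.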

There is a lot more useful functionality, all done from Python. You can see code in action in the example/tour at https://rogerbinns.github.io/apsw/example-fts.html

mark_l_watson 3 days ago

Thank you, your project meets my requirements. I want to build a long-memory RAG system for my personal data. I like commercial offerings such as Google Gemini integrated with Workspace data, but I think I would be happier with my own system.

radiator 6 days ago

Thank you for publishing your work. Do you know of any similar projects with examples of custom tokenizers (e.g. for synonyms or Snowball stemming), but written in C?

rogerbinns 5 days ago

SQLite itself is written in C, so you can use its API directly: https://www.sqlite.org/fts5.html#custom_tokenizers

The text arrives as UTF-8 bytes, so any C code would have to handle decoding those bytes into Unicode codepoints, plus lots of other text processing, so some kind of library would also be needed. I don't know of any.
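
To show the bookkeeping involved, here is a small Python sketch (not C, and not tied to any particular library) that finds words in decoded text and maps each one back to its byte range in the original UTF-8 buffer, which is the form in which an FTS5 tokenizer reports token positions. The function name is hypothetical and the \w+ regex is a simplification.

```python
import re


def tokens_with_byte_offsets(utf8):
    """Return (start_byte, end_byte, token) for each word in a UTF-8 buffer."""
    text = utf8.decode("utf-8")
    out = []
    for m in re.finditer(r"\w+", text):
        # Codepoint offsets differ from byte offsets once non-ASCII text
        # appears, so re-encode the prefix to count its bytes. This is
        # quadratic in the worst case, which is fine for a sketch.
        start = len(text[: m.start()].encode("utf-8"))
        end = start + len(m.group().encode("utf-8"))
        out.append((start, end, m.group()))
    return out
```

Slicing the original bytes with a reported range and decoding the slice gives back the token, which is a handy sanity check when writing this by hand in C.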