edinetdb 6 hours ago

Interesting timing: we've been tackling similar full-text search challenges over Japanese financial filings (roughly 9M sections from annual reports). The tokenization problem hits differently with CJK languages: BM25 assumes word-boundary tokens, but Japanese has no spaces, so you need a morphological analyzer (MeCab or SudachiPy) upstream before any scoring makes sense. We ended up on BigQuery's built-in full-text search rather than a Postgres extension, mostly for scale reasons, but the BM25 relevance behavior you're describing is exactly what we'd want. One question: have you tested this with CJK content, or is the tokenizer assumed to handle pre-tokenized input only?
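To make the "pre-tokenized input" point concrete, here is a minimal BM25 sketch over documents that are already token lists. It assumes tokenization has happened upstream (e.g. via MeCab or SudachiPy for Japanese); the function name, parameters, and sample tokens are illustrative, not from any particular library.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query_tokens` with BM25.

    `docs` is a list of token lists. For Japanese, the tokens would come
    from a morphological analyzer (MeCab, SudachiPy) run upstream --
    BM25 itself only sees discrete tokens, never raw text.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: in how many docs each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF, as in the standard Okapi BM25 formulation.
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with length normalization.
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

# Hypothetical pre-tokenized Japanese snippets (analyzer output):
docs = [["売上", "高", "増加"], ["売上", "減少"], ["利益", "増加"]]
scores = bm25_scores(["売上", "増加"], docs)
```

The first document matches both query terms, so it scores highest; feeding raw unsegmented Japanese into the same function would treat each whole string as one token and match nothing, which is the failure mode the comment is pointing at.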
Interesting timing — we've been tackling similar full-text search challenges over Japanese financial filings (roughly 9M+ sections from annual reports). The tokenization problem hits differently with CJK languages: BM25 assumes word-boundary tokens, but Japanese has no spaces, so you need a morphological analyzer (MeCab or SudachiPy) upstream before any scoring makes sense. We ended up on BigQuery's built-in full-text search rather than a Postgres extension mostly for scale reasons, but the BM25 relevance behavior you're describing is exactly what we'd want. One question: have you tested this with CJK content, or is the tokenizer assumed to handle pre-tokenized input only? | ||