augusteo 3 hours ago

On the API vs local model question:

We went with API embeddings for a similar use case. The cold-start latency of local models across multiple workers ate more money in compute than just paying per-token. Plus you avoid the operational overhead of model updates.

The hybrid approach in this article is smart. Fuzzy matching catches 80% of cases instantly, embeddings handle the rest. No need to run expensive vector search on every query.
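A minimal sketch of that hybrid flow, using stdlib `difflib` for the fuzzy pass (the article's actual matcher and the names `match_title`/`embed_lookup` here are illustrative, and the 0.9 cutoff is an assumed threshold, not from the article):

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.9  # assumed cutoff; tune per dataset


def fuzzy_best(query: str, titles: list[str]) -> tuple[str, float]:
    """Return the closest title and its similarity ratio (0.0-1.0)."""
    def ratio(t: str) -> float:
        return SequenceMatcher(None, query.lower(), t.lower()).ratio()
    best = max(titles, key=ratio)
    return best, ratio(best)


def match_title(query, titles, embed_lookup):
    best, score = fuzzy_best(query, titles)
    if score >= FUZZY_THRESHOLD:
        return best             # fast path: no embedding API call
    return embed_lookup(query)  # slow path: vector search over embeddings


# usage: only misses below the threshold pay for an embedding call
titles = ["The Matrix", "The Matrix Reloaded", "Inception"]
match_title("the matrix", titles, embed_lookup=lambda q: None)
```

The point of the threshold is the cost split the article describes: exact-ish queries resolve locally for free, and only the ambiguous tail hits vector search.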

TurdF3rguson 3 hours ago | parent [-]

Those text embeddings are dirt cheap. Last time I used Cloudflare's embedding model, I could embed around 1M titles without exceeding the daily free tier.

augusteo 3 hours ago | parent [-]

yeah exactly. even OpenAI/Gemini embeddings are really cheap too