Pick an embedding model that supports binary quantization, then compare the vectors with a SIMD-optimized Hamming distance function. I'm doing this for Scour and getting about 1.6 billion comparisons per second.
https://scour.ing
https://emschwartz.me/binary-vector-embeddings-are-so-cool/
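
For anyone who wants to see the shape of it, here is a minimal sketch in Rust. It is not Scour's actual code: the names `quantize` and `hamming_distance` are made up for illustration, and it assumes the simple sign-based scheme (one bit per dimension, set when the value is positive) with bits packed into `u64` words. It relies on `count_ones` rather than hand-written SIMD intrinsics; with the right target CPU flags the compiler can usually lower that to a hardware popcount, but a properly tuned SIMD version would likely look different.

```rust
/// Binary-quantize a float embedding: one bit per dimension,
/// set when the value is positive. Bits are packed into u64 words.
/// (Assumed scheme for illustration; check what your model expects.)
fn quantize(embedding: &[f32]) -> Vec<u64> {
    let mut bits = vec![0u64; (embedding.len() + 63) / 64];
    for (i, &x) in embedding.iter().enumerate() {
        if x > 0.0 {
            bits[i / 64] |= 1u64 << (i % 64);
        }
    }
    bits
}

/// Hamming distance between two packed bit vectors:
/// XOR each pair of words and count the differing bits.
fn hamming_distance(a: &[u64], b: &[u64]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

A 1024-dimension embedding becomes 16 `u64` words this way, so each comparison is 16 XORs and 16 popcounts, which is why the throughput can get so high.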