I'm guessing you're not storing the CLIP for every single frame, instead of every second or so? Also, are you using the cosine similarity? How are you finding the nearest vector?