Remix.run Logo
warangal 6 hours ago

Currently (Semantic) ML model is the weakest (minorly fine-tuned) ViT B/32 variant, and more like acting as a placeholder i.e very easy to swap with a desired model. (DINO models have been pretty great, being trained on much cleaner and larger Dataset, CLIP was one of first of Image-text type models !).

For point about "girl drinking water", "girl" is the person/tagged name , "drinking water" is just re-ranking all of "girl"s photos ! (Rather than finding all photos of a (generic) girl drinking water) .

I have been more focussed on making indexing pipeline more peformant by reducing copies, speeding up bottleneck portions by writing in Nim. Fusion of semantic features with meta-data is more interesting and challenging part, in comparison to choosing an embedding model !