Reminds me of a project [0] that we spun up in a few days at a hackathon. We fine-tuned CLIP to work with satellite imagery. Inference was extremely fast.
[0]: https://github.com/arampacha/CLIP-rsicd