Why not use a multimodal embedding model?
The article says this misses important details, e.g. data that might be in the image.