IndieCoder 7 days ago

Plus one, I'm using this exact setup to make it scale. If Azure Doc Intelligence gets too expensive, VLMs also work great.

vinothgopi 7 days ago

What is a VLM?

saharhash 7 days ago

Vision Language Model, like Qwen2-VL https://github.com/QwenLM/Qwen2-VL or ColPali https://huggingface.co/blog/manu/colpali

sidmo 5 days ago

VLMs are cool: retrieval models like ColPali generate embeddings of the page images themselves (as a collection of patches), and you can see query matching displayed as a heatmap over the document. This picks up text that OCR misses. Here's an open-source API demo I built if you want to try it out: https://github.com/DataFog/vlm-api
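
For anyone wondering how the patch matching and the heatmap fit together, here is a rough sketch of ColPali-style late-interaction scoring in plain Python. It is not the code from the repo linked above; the function names (`maxsim_score`, `patch_heatmap`) and the toy 2-dimensional vectors are made up for illustration, and a real model would produce the embeddings.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, patch_vecs):
    """Late-interaction score: for each query token, take its best-matching
    patch similarity, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)


def patch_heatmap(query_vecs, patch_vecs, rows, cols):
    """Per-patch relevance: the best similarity of each patch to any query
    token, reshaped onto the rows x cols patch grid of the page image."""
    flat = [max(dot(q, p) for q in query_vecs) for p in patch_vecs]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


# Toy data (hypothetical): 2 query tokens and a 2x2 grid of image patches.
query = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.5], [0.5, 0.5]]

score = maxsim_score(query, patches)
heatmap = patch_heatmap(query, patches, rows=2, cols=2)
print(score)    # 1.5
print(heatmap)  # [[1.0, 0.0], [0.5, 0.5]]
```

The page with the highest score wins retrieval, and the grid of per-patch values is what gets drawn over the document as the heatmap.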