spwa4 12 hours ago
It's so weird how that works with transformers. Fine-tuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction-tuned LLM, usually a small one since it's mostly students doing this work) with OCR tokens beats just about every dedicated OCR network out there. And it's not just OCR: describing images, bounding boxes, and audio (both ASR and TTS) all work better that way. Many research papers now are really only about how to encode image/audio/video so it can be fed into a Llama or Qwen model.
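(A minimal sketch of the pattern described above, for anyone unfamiliar: a small projection layer maps vision-encoder features into the LLM's token-embedding space, and the combined sequence is trained with the usual next-token loss on, say, an OCR transcript. The module names, dimensions, and PyTorch layout here are illustrative assumptions, not any specific model's code.)

    # Sketch of an image-encoder -> LLM adapter (hypothetical names/dims).
    import torch
    import torch.nn as nn

    class VisionToLLMAdapter(nn.Module):
        def __init__(self, vision_dim=1024, llm_dim=4096):
            super().__init__()
            # A simple MLP projector; real systems vary (single linear layer,
            # MLP, or a resampler module).
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, image_features):
            # image_features: (batch, num_patches, vision_dim) from a vision encoder.
            # Returns "visual tokens" shaped like LLM input embeddings:
            # (batch, num_patches, llm_dim), ready to be concatenated with the
            # embeddings of the text prompt before the LLM backbone.
            return self.proj(image_features)

Training then typically updates the projector (and optionally the LLM via something like LoRA) while the vision encoder and most of the backbone stay frozen.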
zmmmmm 12 hours ago
It is fascinating. Vision-language models are unreasonably good compared to dedicated OCR models, and to some extent even on pure language tasks. My take is that this fits the general pattern that generalist models have a significant advantage because far more latent structure maps across domains than we expect. People still talk about fine-tuning dedicated models being effective, but in my personal experience it's still always better to use a larger generalist model than a smaller fine-tuned one.