Not an attack on your experience at all! I would definitely counter, though, that multimodal models are still error-prone, and that much better output is achieved by running a tool like Textract first and then using an LLM on the extracted data.
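
For what it's worth, here is a rough sketch of that two-step setup in Python, assuming AWS Textract via boto3 for the OCR step. The helper names (`extract_lines`, `build_prompt`, `ocr_image`) and the prompt layout are my own placeholders, not anything from a particular project, and the LLM call itself is left out since it depends on which model you use.

```python
def extract_lines(textract_response):
    """Collect the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def build_prompt(lines, question):
    """Wrap the OCR text and a question into one prompt string.

    The layout is a hypothetical example; adapt it to whichever LLM you call.
    """
    document = "\n".join(lines)
    return f"Document text (from OCR):\n{document}\n\nQuestion: {question}"


def ocr_image(image_bytes):
    """Run Textract's synchronous text detection on raw image bytes.

    Needs boto3 and AWS credentials; imported lazily so the helpers
    above can be used and tested without either.
    """
    import boto3

    client = boto3.client("textract")
    return client.detect_document_text(Document={"Bytes": image_bytes})
```

In practice you would pass `extract_lines(ocr_image(data))` to `build_prompt` and send the result to the LLM, so the model only ever reasons over text that the OCR step has already pinned down.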