nawazgafar 4 days ago

Author here, that sucks. I'd love to recreate this locally. Would you be willing to share the PDF?

threeducks 3 days ago

As far as I am aware, the "hanging" issue remains unsolved to this day. The underlying problem is that LLMs sometimes get stuck in a loop where they repeat the same text again and again until they reach the token limit. You can break the loop by setting a repeat penalty, but when the image itself contains repeated text, such as in tables, the penalty pushes the LLM away from the correct repeated tokens and it outputs incorrect results.
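
To illustrate the trade-off, here is a minimal sketch of a repeat penalty. It assumes the common scheme in which the score of every already-emitted token is divided by the penalty if positive and multiplied by it if negative. The token scores, the history, and the function names are made up for illustration and are not taken from any particular model or library.

```python
def apply_repeat_penalty(logits, previous_tokens, penalty=1.3):
    """Return a copy of `logits` where tokens seen before are made less likely."""
    adjusted = dict(logits)
    for token in set(previous_tokens):
        if token in adjusted:
            score = adjusted[token]
            # Dividing a positive score (or multiplying a negative one)
            # by penalty > 1 always lowers it.
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


def greedy_pick(logits):
    """Pick the highest-scoring token."""
    return max(logits, key=logits.get)


# Hypothetical next-token scores for a table cell that really contains "0.00" again.
logits = {"0.00": 2.0, "0.08": 1.8, "\n": 0.5}
history = ["0.00", "|", "0.00", "|"]

without_penalty = greedy_pick(logits)
with_penalty = greedy_pick(apply_repeat_penalty(logits, history))
```

In this toy case the unpenalized pick is the correct, repeated cell value, while the penalized pick flips to a different number. That is the failure mode on tables: the same mechanism that breaks a loop also suppresses legitimate repeats.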

Here is the corresponding GitHub issue for your default model (Qwen2.5-VL):

https://github.com/QwenLM/Qwen2.5-VL/issues/241

You can mitigate the fallout of this repetition issue to some degree by chopping up each page into smaller pieces (paragraphs, tables, images, etc.) with a page layout model. Then at least only part of the text is broken instead of the entire page.
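
A rough sketch of that containment idea, assuming a layout model has already produced bounding boxes. `Region`, `ocr_region`, and the `max_chars` cutoff are hypothetical names and values chosen for illustration, and `fake_ocr` stands in for a real vision-language model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Region:
    kind: str                       # e.g. "paragraph", "table", "image"
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def ocr_page_by_region(
    regions: List[Region],
    ocr_region: Callable[[Region], str],  # hypothetical: runs the model on one crop
    max_chars: int = 2000,                # assumed cap; longer output is treated as a runaway loop
) -> List[Tuple[Region, Optional[str]]]:
    results = []
    # Rough reading order: top to bottom, then left to right.
    for region in sorted(regions, key=lambda r: (r.box[1], r.box[0])):
        text: Optional[str] = ocr_region(region)
        if len(text) > max_chars:
            text = None  # mark only this region as failed
        results.append((region, text))
    return results


# Stand-in for the real model: the table "hangs", the paragraph is fine.
def fake_ocr(region: Region) -> str:
    return "| 0 " * 5000 if region.kind == "table" else "Some paragraph text."


page = [Region("table", (0, 400, 800, 900)), Region("paragraph", (0, 0, 800, 380))]
results = ocr_page_by_region(page, fake_ocr)
```

Here the paragraph text survives and only the table region comes back as `None`, so it can be retried or flagged on its own instead of discarding the whole page.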

A better solution might be to train a model to estimate a heat map of character density for a page of text, then condition the vision-language model on it by feeding the density map to the vision encoder. The model could also output character coordinates, which can be used with the heat map to adjust token probabilities.