mattnewton a day ago

Because PDFs are a nightmare of a format, and the only thing that is reasonably guaranteed about them is that they will render to an image people can read. Parsing that image is much less token-efficient than parsing the equivalent text.

wpasc a day ago | parent | next [-]

I agree with you, but every non-engineer I know using these tools 100% will drag and drop a PDF into a chatbot. Anthropic and OpenAI as companies who are selling their products to all sorts of businesses should have a much better means of handling this nightmare of a format because it is so pervasive and so obviously what so many of their customers are going to drop into the product.

spott 21 hours ago | parent | next [-]

Why would they spend a ton of effort ensuring that their customers spend less money on them?

Token economics are also weird. If you design a fancy new frontend that, for example, uses a cheap model to parse a PDF into text that is then fed into an expensive model, you will probably spend more money because you are on API pricing rather than "max plan" pricing.

JeremyNT 10 hours ago | parent | prev | next [-]

> I agree with you, but every non-engineer I know using these tools 100% will drag and drop a PDF into a chatbot

I'm an engineer and use my coding agent to deal with PDFs all the time. It can reach for unix tools if it needs them.

I don't think I understand why this is a problem - it uses tokens, but it removes drudgery. This is the entire promise of the technology.

mattnewton 21 hours ago | parent | prev | next [-]

I’m saying there is basically no way to both make VLMs able to understand the long tail of PDFs where the layout conveys information (like charts and tables) and make that as token-efficient as text formats. Current approaches have mostly chosen to work more often than not, at the cost of token efficiency.

tiahura a day ago | parent | prev | next [-]

I think they’ve just decided that vision gives the best results and the token issue will take care of itself.

watwut 17 hours ago | parent | prev [-]

Once you pay full price for tokens plus margin, it is better for the company to burn as many tokens as possible.

For the same reason as why the oil companies want everyone to use large cars.

tyre a day ago | parent | prev | next [-]

For anyone needing to do this, the answer is to convert it to an image first. Far smaller, LLMs work well with them (even in some pretty insane use cases I've seen), and, along with human review, it can be a huge productivity gain that results in structured data.

Snoddas an hour ago | parent | next [-]

Since I'm almost never interested in the formatting, I run all PDF files through pdftotext from the Poppler library before LLM use.
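
A minimal sketch of that workflow in Python, assuming Poppler's pdftotext is installed and on the PATH. The helper names (build_pdftotext_cmd, pdf_to_text) are made up for illustration; only the pdftotext flags themselves (-enc, -layout, and "-" for stdout) come from the tool.

```python
import shutil
import subprocess
from typing import List


def build_pdftotext_cmd(pdf_path: str, keep_layout: bool = False) -> List[str]:
    """Build the pdftotext command line (hypothetical helper, not part of Poppler)."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout, which can help with tables.
        cmd.append("-layout")
    # A trailing "-" tells pdftotext to write the text to stdout instead of a file.
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path: str, keep_layout: bool = False) -> str:
    """Extract plain text from a PDF so it can be pasted into an LLM prompt."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Poppler first")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, keep_layout),
        check=True,           # raise if pdftotext exits with an error
        capture_output=True,  # collect stdout/stderr instead of printing them
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be something like text = pdf_to_text("report.pdf"). This only helps when the PDF has a real text layer; scanned documents will come back empty or near-empty and still need OCR or the image route.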

spindump8930 a day ago | parent | prev [-]

I agree with your recommendation, but converting a PDF to an image by no means makes it smaller. PDFs are much closer to SVGs than to JPEGs.

butlike 10 hours ago | parent [-]

Why can't I just take a screenshot of the PDF and feed that into the llm?

schmuhblaster 21 hours ago | parent | prev [-]

Been building various LLM+PDF pipelines at work. As soon as you need to parse tables and the like, it becomes a lot of hard work!