| ▲ | wpasc a day ago |
| One thing I find fascinating as a software engineer who talks to non-software engineers who use AI tools is how "reading PDFs" is not more of a solved problem. What I mean is that uploading a PDF into a chatbot tool seems to be an extraordinarily obvious thing that non technical (and technical) users would want to do. IMO claude, chatgpt/codex, etc should be able to optimize the PDF use case to be extremely token efficient as it's a very obvious use case. But when I start to explain to my wife/friends why it burns through so much quota, I find myself thinking "why should they have to understand this aspect of it". To me, the fact that the details of PDF parsing and extraction are relevant to users (instead of being solved such that you don't have to pay attention to them) shows how these tools are not nearly as "ready" as they are made out to be. I may be preaching to the choir on this one, but just my 2c |
|
| ▲ | mattnewton a day ago | parent | next [-] |
| Because PDFs are a nightmare of a format and the only thing that’s reasonably guaranteed about them is that they will render to an image that people can read, the parsing of which will be much less token efficient than the equivalent text |
| |
| ▲ | wpasc a day ago | parent | next [-] | | I agree with you, but every non-engineer I know using these tools 100% will drag and drop a PDF into a chatbot. Anthropic and OpenAI as companies who are selling their products to all sorts of businesses should have a much better means of handling this nightmare of a format because it is so pervasive and so obviously what so many of their customers are going to drop into the product. | | |
| ▲ | spott 21 hours ago | parent | next [-] | | Why would they spend a ton of effort ensuring that their customers spend less money on them? Token economics also are weird. If you design a fancy new frontend that for example uses a cheap model to parse a PDF into text that is fed into an expensive model, you will probably spend more money because you are on API payscale rather than the "max plan" payscale. | |
| ▲ | JeremyNT 10 hours ago | parent | prev | next [-] | | > I agree with you, but every non-engineer I know using these tools 100% will drag and drop a PDF into a chatbot I'm an engineer and use my coding agent to deal with PDFs all the time. It can reach for unix tools if it needs them. I don't think I understand why this is a problem - it uses tokens, but it removes drudgery. This is the entire promise of the technology. | |
| ▲ | mattnewton 21 hours ago | parent | prev | next [-] | | I’m saying there is basically no way to both make vlms able to understand the long tail of PDFs where the layout conveys information (like charts and tables) and to make it as token efficient as text formats. Current approaches have mostly chosen to work more often than not at the cost of token efficiency. | |
| ▲ | tiahura a day ago | parent | prev | next [-] | | I think they’ve just decided that vision gives the best results and the token issue will take care of itself. | |
| ▲ | watwut 17 hours ago | parent | prev [-] | | Once you pay full price for tokens plus margin, it is better for the company to burn as many tokens as possible. For the same reason as why the oil companies want everyone to use large cars. |
| |
| ▲ | tyre a day ago | parent | prev | next [-] | | For anyone needing to do this, the answer is to convert it to an image first. Far smaller, LLMs work well with them (even in some pretty insane use cases I've seen), and, along with human review, it can be a huge productivity gain that results in structured data. | | |
| ▲ | Snoddas an hour ago | parent | next [-] | | Since I'm almost never interested in the formatting I run all pdf files through pdftotext from the Poppler library before llm use. | |
| ▲ | spindump8930 a day ago | parent | prev [-] | | I agree with your recommendation, but converting a PDF to an image by no means makes it smaller. PDFs are much closer to SVGs than to JPEGs. | | |
| ▲ | butlike 10 hours ago | parent [-] | | Why can't I just take a screenshot of the PDF and feed that into the llm? |
|
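The `pdftotext` approach mentioned above can be sketched roughly as follows. This is a minimal sketch, assuming Poppler's `pdftotext` binary is installed and on the `PATH`; the function names `build_pdftotext_command` and `pdf_to_text` are hypothetical helpers made up for illustration, not part of Poppler.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Return the argv list for extracting a PDF's text to stdout (hypothetical helper)."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout,
        # which can help keep table columns aligned.
        command.append("-layout")
    # A trailing "-" as the output file tells pdftotext to write to stdout.
    command += [str(Path(pdf_path)), "-"]
    return command


def pdf_to_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Poppler's command-line utilities")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `pdf_to_text("report.pdf")` would then give plain text to paste into a prompt. Image-only (scanned) PDFs have no text layer, so this would likely return little or nothing for them, and charts or complex tables lose their structure.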
| |
| ▲ | schmuhblaster 21 hours ago | parent | prev [-] | | Been building various LLM+PDF pipelines at work. As soon as you need to e.g. parse tables etc. it becomes a lot of hard work! |
|
|
| ▲ | 0cf8612b2e1e a day ago | parent | prev | next [-] |
| I hope someday we can get out of this local maximum of PDF documents. The format is terrible, but it was in the right place at the right time and might be impossible to dislodge. |
| |
| ▲ | Loughla a day ago | parent [-] | | The problem is that for 99% of people in 99% of cases they work fine. It's hard for people to understand that they're trash. Source: my last job, working with accessibility and that nightmare. |
|
|
| ▲ | evdubs a day ago | parent | prev | next [-] |
| You don't need to use an online service to do this; you get to avoid spending money on tokens doing it offline. Gemma 4 works perfectly well offline on limited hardware (I have an 8GB video card) and can handle extracting text from image-based PDFs just fine. Take a PDF -> run it through MarkItDown [1], using the OCR plugin if you need (point it to Gemma 4) -> now you can ask Gemma 4 questions about the (markdown) document. I am sure Gemma 4 could even create a GUI to make this process very simple for a non technical user. [1] https://github.com/microsoft/markitdown |
|
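The first half of the pipeline described above (PDF to Markdown, then a question over the Markdown) might look something like this. It is a rough sketch, assuming the `markitdown` package is installed with its PDF extras; the OCR plugin and the local Gemma call are left out because their setup is not described in the comment. `pdf_to_markdown`, `build_prompt`, and the `max_chars` cutoff are hypothetical names and values chosen for illustration.

```python
def pdf_to_markdown(pdf_path):
    """Convert a document to Markdown text with MarkItDown (hypothetical helper)."""
    # Imported lazily so build_prompt below works even without the package.
    from markitdown import MarkItDown

    converter = MarkItDown()
    return converter.convert(str(pdf_path)).text_content


def build_prompt(markdown_text, question, max_chars=20000):
    """Combine document text and a question into one prompt string.

    max_chars is an arbitrary illustrative limit, not a model requirement.
    """
    document = markdown_text[:max_chars]
    if len(markdown_text) > max_chars:
        document += "\n[document truncated]"
    return f"Document:\n{document}\n\nQuestion: {question}"
```

The resulting string from `build_prompt(pdf_to_markdown("report.pdf"), "What are the key dates?")` could then be sent to whichever local model you run; how that call looks depends on the runtime you use.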
| ▲ | kjellsbells a day ago | parent | prev | next [-] |
| Amen. Normal office work is wildly different from what we read about on HN. If you were a CEO, determined to lay off all your people, you would want to really zero in on having your AI solve these very unsexy problems: extract data from Office and PDF. Grab data from some part of the screen of a webapp and parse it. Drive a line-of-business app via keyboard or mouse simulation. I know there are companies out there that try, e.g. Appian and (here in YC) Skyvern, but it's a hard problem and yet I feel this is where the true money is. |
| |
| ▲ | red-iron-pine 11 hours ago | parent [-] | | bingo. 90% of our AI use cases across the company are things like this. security and ops (NOC/SOC folks) use this almost as much as they do for technical stuff. hell we have restrictive rules for security stuff so in many cases our network engineers are still doing by hand configs for critical systems. but in terms of token use it's gotta be "take this pdf and parse these 3 columns into 2" or similar |
|
|
| ▲ | ShinyLeftPad 21 hours ago | parent | prev | next [-] |
| > how "reading PDFs" is not more of a solved problem This and replies to this are surreal. It's like everyone simultaneously decided to forget that you don't need claude or whatever to read a PDF. The document is literally made for you to read... |
| |
| ▲ | stoorafa 16 hours ago | parent [-] | | > The document is literally made for you to read... It’s disingenuous to assume every PDF is actually crafted to communicate to its recipients, even more so to pretend LLM users are in a position to understand all the PDFs they receive. There’s a lot of gray area where help understanding a document is fully reasonable. | | |
| ▲ | ShinyLeftPad 4 hours ago | parent [-] | | If you can't understand it how can you be sure that the thing that helps you "understand" it does it right? |
|
|
|
| ▲ | nojito a day ago | parent | prev | next [-] |
| The best way to parse pdfs is to convert them to images and feed them into the llm. This workflow is highly optimized. |
| |
| ▲ | wpasc a day ago | parent | next [-] | | For sure there are very optimized ways to do it. My point is that a non technical user will drag and drop a PDF into a chatbot, and from a UX/product perspective, they shouldn't have to think about it more than that IMO. But seemingly, that's very much an expensive, inefficient way of doing it (burning through a whole context window trying to read it, reloading it multiple times per conversation, etc.). | |
| ▲ | seemaze a day ago | parent | prev [-] | | Absolutely this. Never try to parse a native PDF document with any expectation of coherence or consistency. |
|
|
| ▲ | csomar a day ago | parent | prev [-] |
| You are missing that the product is the hype cycle around AI and that's worth Trillions of $ (Trillions with a T). Why build a PDF parser that generates text when you can BS in a podcast and get paid? This discussion was about measures, goals and incentives. Follow the incentives. |