cratermoon 3 hours ago
There are PDF files and there are PDF files. Many (most?) PDFs I run into are generated from Microsoft Word or some other MS product with no structure at all. The majority of people who use MS products don't understand or care about structure. The WYSIWYG imperative means lots of markup to describe font size, color, and decoration, to make every section heading look the same without ever designating the text as a section heading. The same happens with paragraphs, page breaks, and column flow.

The resulting document looks correct enough to the creator. Other people, who have a different version of Word, different fonts, and a thousand other little differences, won't see it correctly. That leads our author to generate a PDF, probably with embedded fonts, to ensure uniform appearance across these thousand little exceptions. The result is a document in which the content is so incomprehensibly mixed up with appearance controls that it is both unreadable and without any residue of the underlying intended structure: the document's sections, headers, figures, paragraphs, captions, footnotes, or anything else.

And then there are PDF files that are nothing more than a series of images of pages of text. If you're lucky and the scans are clean, a good OCR tool might be able to recover most of the content.

What I'm saying is that the tool doesn't matter if authors don't encode structure and formatting in semantically meaningful ways.
tpm 3 hours ago
So what you are actually saying is that there is a market for a tool that will recreate the PDF with a structure based on how the original PDF looks?