Aachen a day ago

Please don't turn nice formats into a format that's similar to screenshots of text. Pandoc has an option to pack all images and styles needed to render the page into one html file:

    pandoc --self-contained input.html -o output.html
crazygringo a day ago | parent | next [-]

Or, please do?

I use PDFs so I can send them to my iPad to read offline, highlight them, annotate them, and then send them back to my filesystem with highlights and annotations intact.

I sure can't do that with any "nice formats" like HTML or TXT or EPUB or MOBI.

nine_k a day ago | parent | next [-]

PDF is literally digital paper. HTML has logical structure, it can adapt to different displays, etc.

Sometimes you want one, sometimes, the other.

ratelimitsteve a day ago | parent | next [-]

>Sometimes you want one, sometimes, the other.

This is the part that the top commenter missed. Instead they decided that one format is "nice" and the other, by implication, isn't. I find PDFs a lot easier to keep organized en masse, I like that I can use them on any of my devices and it's easy for me to use them when I'm doing in-depth reading such as an ebook. Doubly so because my ereader also does text to speech and syncs across devices so I can read on my tablet while I'm on the exercise bike and then switch to listening to the same book on my phone with minimal seams and without losing my place. It is, in a word, nice.

Aachen 12 hours ago | parent [-]

None of that sounds related to the format?

- A text to speech engine should work better with the original html structure where it sees bold tags, headings, and full sentences ra-

ther than broken-off ones

- Keeping PDFs organised: how would that differ from keeping any other filetype organised? I don't understand what difference you, "by implication", attribute to a file ending in .html or .pdf for being able to handle them en masse. If anything, searching across HTML files will be vastly easier for software (self-written or third-party) and more reliable because it's all plain text

- Syncing between text and audio rendering: I have no experience with it, but it doesn't sound like something that ought to fundamentally work for a display format and not for the source text format. Of course, the software has to support the format (and if it doesn't, it's trivial to PDF-ify an HTML file, whereas the reverse is nearly impossible)

ratelimitsteve 3 hours ago | parent | next [-]

>and full sentences ra-
>
>ther than broken-off ones

This and trying to read the header/footer are the most annoying parts of pdf to audio apps. At least some apps will let you set a margin outside of which text is ignored, so every page doesn't start with the book title, author's name and chapter title and end with the page number.

user3939382 11 hours ago | parent | prev | next [-]

HTML could do everything PDF does in theory but it doesn’t in practice because the tooling doesn’t exist.

crazygringo 10 hours ago | parent [-]

My exercise bike can play Doom on its display in theory.

Theory doesn't matter here, tooling and standards do. And PDF doesn't just have the tooling for highlighting and annotations, it has the standards for them so that tools support them in an interoperable way. A highlight made with one tool can be removed with another, without altering the underlying content.

ratelimitsteve 9 hours ago | parent [-]

I would love an exercise bike that runs doom. Maybe link my movement speed in-game to how fast I pedal, with a joystick on one of the handlebars to move and a couple buttons on the other one to shoot and reload. So far every exercise bike game is just bike race, which I'll admit is a close fit for the existing hardware and probably the first idea I'd have too but it gets boring after a while.

Towaway69 8 hours ago | parent [-]

Had the same idea - not with Doom - but with a Quest 3D and watching videos of me riding in the summer. First make a film using a 3D camera (something like an Insta360) and then view that on the Quest in winter while riding the exercise bike.

Video speed would sync to the exercise bike speed, giving a feeling of reality.

The core problem is that sweating inside a Quest isn't a good idea ...

ratelimitsteve 3 hours ago | parent [-]

I did Thrill of the Fight a lot back during the pandemic. Sweating in a Quest isn't ideal, but it's not unmanageable for brief periods of time (15-30 minutes). The only real problem is that the lenses can get sweaty or foggy and all of a sudden I go from punching faces and dodging fists to punching at blurs and failing to dodge other blurs.

Maybe instead of a Quest you just display video to a screen? When I was using a hotel fitness center they had a Peloton and that seems to be something you can do with those. It was a couple years back and I recall the video being loosely if at all tied to the speed you pedal at, but it was more fun than just looking at a wall while I pretend to go somewhere.

ratelimitsteve 9 hours ago | parent | prev [-]

Maybe HTML can do all of these and it will only cost me the time it takes to build the app, but right now PDF does all of those things for me, here today in my pocket, for $15. Which is nice.

I'd love to see a text to speech engine that pronounces formatting but I think it might be more complicated than learning to pronounce something boldly. Am I yelling? Am I keeping my voice low but adding intensity? Can you automate answering that question in a way that's mostly correct most of the time? If something is in italics am I whispering, stage whispering, emphasizing or merely saying the title of an existing work out loud? It's a fundamental abuse of a text formatting engine to try to use it for speech formatting: you either have to use the existing tags for things they were never intended for or you have to start adding tags like <slywhisper> and <scream emotion="angry"> vs <scream emotion="excited">. That being said, an HTML-independent form of emotional text annotation might actually be a good idea as the inevitability of synthesized human voices being a part of our daily lives takes hold.

I find PDFs easier to organize than HTML because HTML is any number of files referencing each other across a directory structure that can have any size or shape, and a PDF is a single file. If I'm searching my library for Bob Wilson, I want his books to show up and I want them to have his picture in them if that's how the book was published but I don't want Bob_Wilson.jpeg to show up as a result. I could automate print to PDF from html or use the tool someone else posted in order to condense my saved HTMLs to single files but that's more processing time and effort in order to get what I already have from a PDF

Syncing position across HTML files may be doable, but syncing position across PDFs is done. You're absolutely right that that has nothing to do with the format but the (implied) question I was answering when I brought it up was why I would sometimes want one and other times want the other. That's why.

Finally, and probably the only one that really matters inasmuch as all the other reasons can be coded around but this one can't: the places I get documents distribute them in PDF, MOBI and EPUB but almost never in HTML.

Aachen 14 hours ago | parent | prev [-]

When do you want the digital paper when you can have the more flexible one?

jerjerjer 9 hours ago | parent | next [-]

When I want it to be displayed in the exact same way everywhere.

crazygringo 11 hours ago | parent | prev [-]

Did you not read my reply to your root comment? I already answered this for you.

Each one has things the other can't do. Neither is universally more flexible.

mr_mitm a day ago | parent | prev [-]

You could, though. What you are describing are features of an editor, not a file format. I can imagine a browser addon performing the same tasks.

circuit10 a day ago | parent | next [-]

But in this case the flexibility of HTML is a negative because any layout shift would mess up the positions of the annotations, so fixing the layout (and making sure it’s non-interactive) is helpful here

whenc a day ago | parent | prev | next [-]

PDF annotations sit within the file.

mr_mitm a day ago | parent [-]

I know, even though that depends on the editor. Okular, for example, places them in an extra file, last I checked. That's not unique to PDFs. HTML files are modifiable. There is nothing preventing an editor from putting annotations in them as well.

crazygringo a day ago | parent [-]

PDF is designed for annotations in the file format. You annotate in one editor, you can change the annotations in another. You can always distinguish between original content and annotations. I see no indication that Okular stores highlights or annotations in a separate file, that would be bizarre.

There is no mechanism for annotations in HTML or the other formats I listed. An editor would just be editing the original content in its own non-standardized, non-portable way, which is not desirable for a number of reasons.

So when you say:

> What you are describing are features of an editor, not a file format.

That is incorrect. It is an intentionally designed and standardized feature of the file format.

mr_mitm a day ago | parent | next [-]

It definitely used to be bizarre then:

https://superuser.com/questions/333378/where-does-okular-sto...

ratelimitsteve a day ago | parent [-]

Turns out the default for Okular is to save to an external file, but there's a setting that can be changed to use the format correctly and store annotations within the file, which is universally compatible with other PDF readers. You can't really blame the format for someone using it wrong on purpose, and if you can then I'll just abuse HTML and the fact that I use it wrong will be evidence that it is, in itself, wrong

cxr a day ago | parent | prev [-]

The W3C standardized HTML annotations years ago. There's a difference between a standard not existing versus people pretending it doesn't exist because it's not implemented by Chrome.

crazygringo 20 hours ago | parent [-]

That's different. Those are a data structure defining annotations that are meant to be stored externally. They're not part of an HTML file like PDF annotations are. They're meant more for live collaborative commenting within a shared online space, not for making private portable annotations like PDF does.

And it's not a Chrome thing. I don't think any browsers support it, do they? It's not really clear there's a need for it, when collaborative editors already handle document annotations in their own ways.

cxr 9 hours ago | parent [-]

So is there a need for it or isn't there?

> That's different. Those are a data structure defining annotations that are meant to be stored externally.

The protocol is a separate standard.

The format is JSON-LD. Putting JSON-LD into HTML isn't a question mark. (There's info at W3C.org about how to do that, too. Not that it's necessary. You can guess what it says.)
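
As a rough illustration of what "putting JSON-LD into HTML" could look like, here is a minimal Python sketch. It assumes the annotation is embedded in a `<script type="application/ld+json">` element and uses field names from the W3C Web Annotation Data Model (`TextQuoteSelector`, `motivation`). The helper names and the `urn:example:` identifiers are hypothetical, and no browser is claimed to render such annotations.

```python
import json
import re

# Context IRI used by the W3C Web Annotation Data Model.
ANNO_CONTEXT = "http://www.w3.org/ns/anno.jsonld"

_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def make_highlight(annotation_id, source, exact, prefix="", suffix=""):
    """Build a highlight annotation that targets a quoted span of text."""
    selector = {"type": "TextQuoteSelector", "exact": exact}
    if prefix:
        selector["prefix"] = prefix
    if suffix:
        selector["suffix"] = suffix
    return {
        "@context": ANNO_CONTEXT,
        "id": annotation_id,  # hypothetical identifier, any IRI would do
        "type": "Annotation",
        "motivation": "highlighting",
        "target": {"source": source, "selector": selector},
    }


def embed_annotation(html, annotation):
    """Insert the annotation as a JSON-LD script just before the last </body>."""
    # Escape "</" so the payload can never terminate the script element early.
    payload = json.dumps(annotation, indent=2).replace("</", "<\\/")
    script = f'<script type="application/ld+json">\n{payload}\n</script>\n'
    matches = list(_BODY_CLOSE.finditer(html))
    if not matches:
        # No closing body tag found: append the script at the end instead.
        return html + script
    index = matches[-1].start()
    return html[:index] + script + html[index:]
```

The original markup stays untouched and the annotation remains distinguishable from the content, but whether any reader application would pick it up is a separate question.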

jasode 13 hours ago | parent | prev | next [-]

Fyi... the preferred new syntax since 2022 is:

    --embed-resources --standalone

https://github.com/rstudio/rmarkdown/issues/2382

https://pandoc.org/MANUAL.html#:~:text=Deprecated%20synonym%...

Aachen 12 hours ago | parent [-]

I noticed when trying it out for this comment, but then looked around when it was introduced and it seems recent (as in, an LTS distribution won't have it). Someone on stackoverflow said they get "unknown option --embed-resources". The old option will work for everyone and is also simpler, one instead of two parameters. People whose client supports the new option will see the upgrade suggestion when they run this. In the end I saw mainly downsides to mentioning the new rather than the old way
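
One way to sidestep the old-versus-new flag question in a script is to pick the flags from the installed version. The Python sketch below assumes `--embed-resources` first shipped in pandoc 2.19 and that the first line of `pandoc --version` contains the version number. The function names are made up for illustration.

```python
import re
import subprocess

# Assumption: --embed-resources first shipped in pandoc 2.19 (2022).
EMBED_RESOURCES_MIN_VERSION = (2, 19)


def parse_pandoc_version(version_output: str) -> tuple[int, ...]:
    """Extract the version tuple from the first line of `pandoc --version`."""
    first_line = version_output.splitlines()[0]
    match = re.search(r"(\d+(?:\.\d+)+)", first_line)
    if match is None:
        raise ValueError(f"no version number in {first_line!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def single_file_flags(version: tuple[int, ...]) -> list[str]:
    """Return the flags that produce one self-contained HTML file."""
    if version >= EMBED_RESOURCES_MIN_VERSION:
        return ["--embed-resources", "--standalone"]
    # Older releases only know the (now deprecated) single flag.
    return ["--self-contained"]


def build_command(version_output: str, src: str, dst: str) -> list[str]:
    """Assemble the full pandoc command line for the given version banner."""
    flags = single_file_flags(parse_pandoc_version(version_output))
    return ["pandoc", *flags, src, "-o", dst]


def installed_pandoc_version_output() -> str:
    """Ask the local pandoc binary for its version banner."""
    result = subprocess.run(
        ["pandoc", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Usage would be something like `subprocess.run(build_command(installed_pandoc_version_output(), "input.html", "output.html"), check=True)`.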

agedclock a day ago | parent | prev | next [-]

Pandoc would be my preferred tool. It is excellent at converting between other formats as well.

kelnos a day ago | parent | prev | next [-]

> Please don't turn nice formats into a format that's similar to screenshots of text

Converting HTML to PDF shouldn't result in an image wrapped in a PDF. Text will be preserved as text in the final PDF. (Unless the converter is garbage, of course.)

Aachen a day ago | parent [-]

If you've ever copied text out of a PDF, you'll know it's not the original text anymore. Besides ligatures, you get broken sentences with extra hyphens inserted in the wrong places (where the PDF-rendered version had word or line breaks), if it lets you properly select more than a few words at all. PDF works like "put these couple of words at position x,y" rather than HTML's semantic "here comes a heading" tag, which helps people read your text accessibly. Even if you have no impairment and no mobile device with a narrower screen than this particular render was designed for, the semantic structure also lets you work with the document more easily. It's like removing all the HTML and keeping only the CSS: all definitions of what's a section, sentence, emphasis, or caption are gone

I didn't mean literally an image, hence saying image-like. You get similar limitations to when using OCR, which seems very image-like to me
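
To make the hyphenation point concrete, here is a minimal Python sketch of the kind of repair heuristic that text extracted from a PDF tends to need. The regex rule is an assumption chosen for illustration, not how any particular PDF tool works. It also shows why the repair is lossy: a genuine hyphen that happens to sit at a line end gets removed too.

```python
import re

# A lowercase letter, a hyphen, a line break (possibly with blank lines),
# then another lowercase letter: treated as a word split by the layout.
_SOFT_BREAK = re.compile(r"(?<=[a-z])-\s*\n\s*(?=[a-z])")


def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated across line breaks (heuristic)."""
    return _SOFT_BREAK.sub("", text)


if __name__ == "__main__":
    extracted = "full sentences ra-\n\nther than broken-off ones"
    print(dehyphenate(extracted))
    # The heuristic cannot tell a layout hyphen from a real one:
    print(dehyphenate("a well-\nknown format"))
```

The source HTML never needs this step, because the word boundaries were never thrown away.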

layer8 a day ago | parent | prev | next [-]

HTML+CSS+media files isn’t a nice format, and much less portable through time and space than PDF.

Aachen a day ago | parent [-]

Not sure if I'm misreading your comment, but the result isn't multiple files with all those formats stored separately

That's what the "self contained" option does: turn it into one nice file. Makes no difference if you copy example.pdf or example.html when both contain all images and styles (except one of them also contains the original semantic text)
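
For a sense of what "one nice file" means mechanically, here is a minimal Python sketch that inlines local images as base64 `data:` URIs. It is only an approximation of the idea. It is not pandoc's implementation, it handles only double-quoted `src` attributes on `<img>` tags, and the function names are hypothetical.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches the src attribute of an <img> tag (double-quoted values only).
_IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local image references with embedded data: URIs."""

    def replace(match: re.Match) -> str:
        src = match.group(2)
        # Leave already-embedded and remote images alone.
        if src.startswith(("data:", "http:", "https:")):
            return match.group(0)
        path = base_dir / src
        if not path.is_file():
            return match.group(0)
        mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        return match.group(1) + to_data_uri(path.read_bytes(), mime_type) + match.group(3)

    return _IMG_SRC.sub(replace, html)
```

After this step the picture travels inside the .html file itself, so there is no separate image file left over next to the document.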

TylerE a day ago | parent | prev | next [-]

Being (not so easily) edited is often a feature, not a bug.

craftkiller a day ago | parent | next [-]

If that is your goal, you should be cryptographically signing your documents with your PGP key. That way you actually have assurance the document has not been modified rather than just hoping someone hasn't modified the document. Additionally, PGP can sign anything so you are open to use whatever format you want.

Aachen a day ago | parent | prev | next [-]

May I recommend .html in that case? You can embed scripts that control who can run it, having it fetch a decryption token from a server or require a decryption password with a safe password hashing algorithm of your choice

It's much more versatile than PDF and, if the algorithm decides the user is allowed to read the document, then the user gets to make use of all of the document's options like a better search function (PDF can't find words that are bro-

ken across lines because that information of what's a word is gone, transformed into coordinates of what characters need to go where). It's also much more readable on different screen sizes, as the user can resize the window to whatever is comfortable on a 27" screen, or fits on their pocket e-reader. You can even draw it on a canvas if you want to prevent people from extracting the decrypted strings (though it's evil, you have that option). There's only benefits!

PDF is the lazy way to half-ass a read-only document while screwing, ahem, making anyone using a mobile phone zoom, pan, and squint. Thankfully, phones are falling out of fash— wait, scratch that, I just heard text reflow is more relevant than ever as phone use continues to soar

ryandrake a day ago | parent | prev | next [-]

Is this really that much of a motivation in 2025? Maybe in 2000 you could publish a PDF with the assurance that only the people who paid for Acrobat would be able to edit it, but today there are a lot of accessible ways to edit PDFs. I don't think I'd choose PDF if I, for whatever reason, wanted to limit others from editing.

guywithahat a day ago | parent | prev [-]

I was thinking this too: PDFs exist so people don't mess with the document. Still, it's a clever feature, and pandoc can also convert HTML into a PDF with a conversion engine. That said, I suspect it'll fail on anything sufficiently complex

    pandoc input.html -o output.pdf --pdf-engine=<your engine>

moralestapia a day ago | parent | prev [-]

Please don't police what other people do.

Aachen a day ago | parent [-]

If I were police, I could still not enforce that this is what they run until it's law. They're free to choose this option if they like the merits