I don't think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying/rendering.

the actual paper content format should be separated from its rendering.

i.e. it should contain abstract, sections, equations, figures, citations etc. but it shouldn't have font sizes, layout etc.

the viewer platforms then should be able to style the content differently.

▲ cluckindan 4 hours ago | parent | next [-]

HTML alone is in fact not a format for displaying/rendering. Done properly, it is a structural representation of the content. (This is often called ”semantic HTML”.)

They are converting to HTML to make the content more accessible. Accessibility in this context means a11y; in effect, ”more accessible” equates to ”more compatible with screen readers”.

While PDF documents can be made accessible, it is way easier to do it in HTML, where browsers build an actual AOM (accessibility object model) tree and expose it to screen readers.

>it should contain abstract, sections, equations, figures, citations etc.

So <article>, <section>, <math>, <figure>, <cite>, etc.

	▲	benatkin 4 hours ago | parent | next [-]
		Much of it is a structural representation of how to display the content.
	▲	Theodores an hour ago | parent | prev [-]
		I like arXiv and what they are doing; however, do the auto-generated HTML files contain nothing more than a sea of divs dressed with a billion classes? I would be delighted if they could do better than that, with figcaptions as well as figures, and sections 'scoped' with just one <h2-6> heading per section. They could specify how it really should be done, the HTML way, with a well defined way of doing the abstract and getting the cited sources to be in semantic markup yet not in some massive footer at the back.

		There should also be a print stylesheet so that the paper prints out elegantly on A4 paper. Yes, I know you can 'print to PDF' but you can get all the typesetting needed in modern CSS stylesheets.

		Furthermore, they need to write a whole new HTML editor that discards WYSIWYG in favour of semantic markup. WYSIWYG has held us back by decades as it is useless for creating a semantic document. We haven't moved on from typewriters and the conventions needed to get those antiques to work, with word processors just emulating what people were used to at the time. What we really need is a means to evolve the written word, so that our thinking is 'semantic' when we come to put together documents, with a 'document structure first' approach.

		LaTeX is great, however, last time I used it was many decades ago, when the tools were 'vi' (so not even vim) and GhostScript, running on a Sun workstation with mono screen. Since then I have done a few different jobs and never have I had the need to do anything in LaTeX or even open a LaTeX file. In the wild, LaTeX is rarer than hen's teeth. Yet we all read scientific papers from time to time, and arXiv was founded on the availability of TeX files.

		The lack of widespread adoption of semantic markup has been a huge bonus to Google and other gatekeepers that have the money to develop their own heuristics to make sense of 'seas of divs'. As it happens, Google have also been somewhat helpful with Chrome and advancing the web, even if it is for their gatekeeping purposes.

		The whole world of gatekeeping is also atrocious in academia. Knowledge wants to be free, but it is also big business to the likes of Springer, who are already losing badly to open publishing. As you say, in this instance, accessibility means screen readers, however, I hope that we can do better than that, to get back to the OG Tim Berners-Lee vision of what the web should be like, as far as structuring information is concerned.

▲ m-schuetz 3 hours ago | parent | prev | next [-]

That's a purist stance that's never going to work out in practice. Authors will always want to adjust the presentation of content, and HTML might be even better suited for that than LaTeX, which is bad at both.

▲ dimal 4 hours ago | parent | prev | next [-]

Perfect is the enemy of good. HTML is good enough. Let’s get this done.

And as another commenter has pointed out, HTML does exactly what you ask for. If it’s done correctly, it doesn’t contain font sizes or layout. Users can style HTML differently with custom CSS.

▲ billconan 4 hours ago | parent [-]

mixing rendering definitions with content (PDF) is something from the printer era, that is unsuitable for the digital era.

HTML was a digital format, but it wanted to be a generic format for all document types, not just papers, so it contains a lot of extras that a paper format doesn't need.

for research papers, since they share the same structure, we can further separate content from rendering.

for example, if you want to later connect a paper with an AI, do you want to send <div class="abstract"> ... ?

or do some nasty heuristic to extract the abstract? like document.getElementsByClassName("abstract")[0] ?
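For illustration, the same class-name lookup can be sketched outside the browser. This is a minimal Python sketch using only the standard library's `html.parser`; the `AbstractExtractor` class, the `extract_abstract` helper, and the assumption that the abstract sits in the first element whose class list includes `abstract` are all hypothetical, not something any paper HTML is known to guarantee.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class AbstractExtractor(HTMLParser):
    """Collects the text of the first element whose class list contains 'abstract'."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the abstract element (0 = outside)
        self.done = False   # True once the first abstract element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID:
            return
        if self.depth:
            self.depth += 1
        elif "abstract" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_abstract(html):
    """Return the abstract text with whitespace collapsed, or '' if none is found."""
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


page = '<article><h1>Title</h1><div class="abstract"><p>abstract text ...</p></div></article>'
print(extract_abstract(page))
```

The lookup only works as long as every publisher agrees on the class name, which is the fragility being pointed at here.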

	▲	simonw 4 hours ago | parent [-]
		All of the interesting LLMs can handle a full paper these days without any trouble at all. I don't think it's worth spending much time optimizing for that use-case any more - that was much more important two years ago when most models topped out at 4,000 or 8,000 tokens.

▲ bob1029 4 hours ago | parent | prev | next [-]

> HTML is better than PDF

I disagree. PDF is the most desirable format for printed media and its analogues. Any time I plan to seriously entertain a paper from Arxiv, I print it out first. I prefer to have the author's original intent in hand. Arbitrary page breaks and layout shifts that are a result of my specific hardware/software configuration are not desirable to me in this context of use.

▲ ACCount37 4 hours ago | parent | next [-]

I agree that PDF is best for things that are meant to be printed, no questions. But I wonder how common actually printing those papers is?

In research and in embedded hardware both, I've met some people who had entire stacks of papers printed out - research papers or datasheets or application notes - but also people who had 3 monitors and 64GB of RAM and all the papers open as browser tabs.

I'm far closer to the latter myself. Is this a "generational split" thing?

	▲	pfortuny 4 hours ago | parent [-]
		Possibly, but then again, when I need to study a paper, I print it; when I just need to skim it and use a result from it, I am more likely to read it on a screen (tablet/monitor). That is the difference for me.

▲ s0rce 4 hours ago | parent | prev [-]

I used to print papers, probably stopped about 10 years ago. I now read everything in Zotero where I can highlight and save my annotations and sync my library between devices. You can also seamlessly archive html and pdfs. I don't see people printing papers in my workplace that often unless you need to read them in a wet lab where the computer is not convenient.

▲ afavour 5 hours ago | parent | prev [-]

Wouldn’t that be CSS?

▲ billconan 5 hours ago | parent [-]

    <div class="abstract">
      abstract text ...
    </div>
    <div class="authors">
      <ol>
        <li>author one</li>
        <li>author two</li>
      </ol>
    </div>

should be just:

[abstract]

abstract text

[authors]

author one | email | affiliation

author two | email | affiliation
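As a rough sketch of how little machinery such a format would need, here is a hypothetical Python parser for the bracketed layout above. The `parse_paper` name, the handling of blank lines, and the `name | email | affiliation` field order (taken from the example lines) are assumptions; no such format is specified anywhere, and the sample addresses and affiliations are placeholders.

```python
def parse_paper(text):
    """Parse '[section]' headers followed by content lines into a dict.

    Every section becomes a string; the 'authors' section becomes a list of
    dicts with 'name', 'email' and 'affiliation' keys (assumed field order).
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate blocks
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)

    paper = {name: "\n".join(lines) for name, lines in sections.items()}
    if "authors" in sections:
        fields = ("name", "email", "affiliation")
        paper["authors"] = [
            dict(zip(fields, (part.strip() for part in line.split("|"))))
            for line in sections["authors"]
        ]
    return paper


sample = """
[abstract]
abstract text

[authors]
author one | one@example.org | Example University
author two | two@example.org | Example Institute
"""

paper = parse_paper(sample)
print(paper["abstract"])
print([author["name"] for author in paper["authors"]])
```

A viewer could then render `paper` however it likes, since nothing in the parsed structure says anything about fonts or layout.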

▲ afavour 4 hours ago | parent | next [-]

Sounds like XML and XSL would be a great fit here. Shame it’s being deprecated.

But you could still use HTML. Elements with a dash in are reserved for custom elements (that is, a new standardised element will never take that name) so you could do:

    <paper-author-list>
      <paper-author></paper-author>
    </paper-author-list>

And it would be valid HTML. Then you’d style it with CSS, with

    paper-author {
      display: list-item;
    }

And so on.
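A hypothetical Python sketch of generating that markup from data. The `render_author_list` helper is an illustration only; it reuses the `paper-author-list` and `paper-author` element names from the example above and assumes authors are given as plain name strings.

```python
from html import escape


def render_author_list(authors):
    """Serialize author names into the custom-element markup sketched above."""
    items = "\n".join(
        f"  <paper-author>{escape(name)}</paper-author>" for name in authors
    )
    return f"<paper-author-list>\n{items}\n</paper-author-list>"


print(render_author_list(["author one", "author two"]))
```

Explicit closing tags are used because the HTML parser does not treat `<paper-author />` as self-closing for non-void elements.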

▲ bawolff 4 hours ago | parent | next [-]

Nothing is stopping you from using server side XSL. I personally don't think it's a great fit, but people need to stop acting like XSL has been wiped from the face of the earth.

▲ afavour 4 hours ago | parent [-]

Yes but we’re specifically talking about a display format here. Something requiring a server side transform before being viewable by a user is a clear step backwards.

▲ bawolff 2 hours ago | parent [-]

How so? I can't think of any advantage to having client side xsl over outputting two files, in this context.

	▲	afavour 2 hours ago | parent [-]
		The discussion is about the form in which you share papers. With HTML you just share the HTML file, it opens instantly on basically any device. If you distribute the paper as XML with an XSLT transform you need to run something that’ll perform that transform before you can read the paper. No matter whether that transform happens on the server or on the client it’s still an extra complication in the flow of sharing information.

▲ xworld21 2 hours ago | parent | prev [-]

Indeed, LaTeXML (the software used by arXiv) converts LaTeX to a semantic XML document which is turned to HTML using primarily XSLT!

▲ panzi 4 hours ago | parent | prev [-]

There is <article> <section> <figure> <legend>, but yes, <abstract> and <authors> are missing as such. But there are meta tags for such things. Then there is RDF and Thing. Not quite the same, I know, but it's not completely useless.

	▲	kevindamm 4 hours ago | parent [-]
		and you could shim these gaps with custom components, hypothetically