HTML and PDF are used for different purposes. PDF is file format presenting fixe...

aasasd · on Oct 12, 2018

They certainly should be used for different purposes, but currently I don't see why PDF is necessary for all the papers. Why do they need fixed layout? Plenty of them are already published in both PDF and HTML, what's different about the rest? It's especially baffling in the case of computer science and programming papers when the contents are the same as in blogs.

I have dozens of PDFs in my reading queue, for which I'll probably have to buy a tablet. Why can't I read the same columns of text and pictures on my e-ink reader when I can do that with HTML? Who the hell knows.

danso · on Oct 12, 2018

PDF is for printing and for (well, with the exception of a few edge cases) guaranteeing a layout and display given a set of paper dimensions. HTML has the advantage of responsiveness, but the inherent problem of variable output.

When I was a professor and advising my students on creating portfolios, I told them to build websites of course. But I told them to also have a link to a one-page PDF because many organizations (not just academia) forward resumes within an organization to someone senior who eventually prints it out. And you don't want that person's first impression to be whatever your website's print.css churns out.

aasasd · on Oct 12, 2018

Variable output is not a problem, it's exactly what's needed. The days of standard paper formats are over, old man: in a decade people will have documents delivered straight to their retinas, or read into their ears—but everyone will still have to scroll PDFs back and forth with no possibility of reformatting, because lots of papers are published in it.

If you want to have your document printed nicely, just prepare it for printing along with other methods of output. The best way to do it is to not use some crazy layout: have a single column with images between paragraphs, and your documents will look fine on any device. All problems of reformatting documents stem from the rigid two-dimensional layout mentality, while the flexible approach requires stepping back to the one-dimensional semantic flow.

(Actually, standard paper formats were never around, because—surprise—my country doesn't use US paper formats.)

danso · on Oct 13, 2018

No, variable output is not "exactly what's needed". Layout of information is an actual skill -- whether it's a resume, a newspaper front page, or a photo gallery -- and we can expect layout to be an important design factor as long as humans have eyes that can process information in formats other than a byte stream.

HTML has been an excellent format for delivering data and information across innumerable devices and visual dimensions. That adaptability comes with tradeoffs. As others have pointed out, anyone who's browsed the Internet Archive knows how HTML, beautiful and organized in its own time, can look like slop today. Paper/PDF's tradeoff, of course, is its rigidity.

bosie · on Oct 12, 2018

While I agree that PDFs are at times cumbersome to use, I can't think of a valid solution to replace them.

- Fixed layout seems much easier to handle than dynamic layouts. I.e. I can't recall any website that resizes the content correctly (correctly meaning I see the image within X% of scrolling of the referenced location; that doesn't just make the lines super-long). And without handling this properly, most of the arguments against PDF usage seem to go out the window.

- I don't know of any way of highlighting, annotating, drawing on an HTML page reliably over multiple devices. Sure, something can be built on but it requires special software, still.

- How do I send someone an HTML copy of a PDF as a single file? (embedded fonts, images etc)

aasasd · on Oct 12, 2018

> I can't recall any website that resizes the content correctly (correctly meaning I see the image within X% of scrolling of the referenced location; that doesn't just make the lines super-long)

I rarely have anything like that happen, so not even sure if I know the exact problem that you have in mind. As far as I can tell, it's specific to when authors put images somewhere distant to the text that mentions them in the one dimension of text flow, e.g. on the next page, or floating in a separate column from the text. The solution is, don't put images far from the text. HTML obviously requires a different approach from PDF: you don't think in terms of two-dimensional physical layout, you think in the one dimension of semantic layout. Most popular content sites are laid out that way now, and I mostly have no problem reading on the desktop or the phone.

> I don't know of any way of highlighting, annotating, drawing on an HTML page reliably over multiple devices. Sure, something can be built on but it requires special software, still.

It requires software just as PDF requires it. Such software isn't ubiquitous precisely because people don't see HTML annotation as a market. It's a typical chicken-and-egg market problem.

To annotate HTML, you abandon the two-dimensional graphical approach just as you do it when producing the document. Instead, you highlight text in paragraphs and attach annotations and drawings to it, independently of the current rendering of the document. Any word processor allows you to highlight text in lines and paragraphs, you do the same thing here. Evernote's web clipper highlights HTML just fine.

> How do I send someone an HTML copy of a PDF as a single file? (embedded fonts, images etc)

You use a format that packs HTML with images, styling and fonts—e.g. MAFF. Come on, it's not rocket science to store what the server sends to the browser. Again, Evernote stores pages fine and could be used for sharing (if the program didn't go to crap overall). It's the same chicken-and-egg problem.
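To illustrate the "pack what the server sends" idea, here is a minimal Python sketch that bundles an HTML page and its assets into one ZIP file and reads it back. It is not the actual MAFF layout, only the same general approach; the function names `pack_page` and `unpack_page` and the sample file names are made up for this example.

```python
import io
import zipfile


def pack_page(files: dict[str, bytes]) -> bytes:
    """Bundle a page and its assets into one ZIP archive held in memory."""
    buffer = io.BytesIO()
    # ZIP_DEFLATED compresses text assets (HTML, CSS, SVG) well; images that
    # are already compressed (JPEG, PNG) gain little.
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def unpack_page(bundle: bytes) -> dict[str, bytes]:
    """Read every file back out of a bundle produced by pack_page."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Hypothetical paper: one HTML file, one stylesheet, one figure.
paper = {
    "index.html": b"<html><head><link rel='stylesheet' href='style.css'></head>"
                  b"<body><p>Text.</p><img src='fig1.svg'><p>More text.</p></body></html>",
    "style.css": b"body { max-width: 40em; margin: auto; }",
    "fig1.svg": b"<svg xmlns='http://www.w3.org/2000/svg' width='10' height='10'/>",
}
bundle = pack_page(paper)
assert unpack_page(bundle) == paper  # the round trip loses nothing
```

The resulting bytes can be written to a single file and mailed around; what is still missing is a reader on the other end that opens such a bundle, which is the chicken-and-egg part.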

bosie · on Oct 12, 2018

> I rarely have anything like that happen, so not even sure if I know the exact problem that you have in mind.

Not even trying to be funny but do you mind sharing some websites that dynamically resize content correctly? I just checked a couple of the usual suspects (reuters, nytimes, guardian, github) and none do it. They are all using (semi-)fixed layouts.

aasasd · on Oct 12, 2018

I have the opposite problem of finding a page that could be problematic. Flipped through several articles on those sites, and they all use the linear article layout.

Remember that we're talking about publication of static papers, so you look at the main content column on a page, since that's what should be there in a paper. In the main column, those sites use the simple linear flow: ‘text, image, text’—with images occupying entire paragraphs instead of floating to the sides. With this layout, you can reformat articles every which way, string them into horizontal pages, render them in columns or read them with text-to-speech, etc. It's essentially HTML 2.0 layout but with better formatting. Markdown readmes on Github are the perfect example of this approach.

I've regularly used Evernote for capturing web pages, and Pocket to read them on the phone, and they have no problem with storing main content from such articles, stripped of extraneous navigation (outside of Pocket's bugs with dropping some content, presumably from overzealous anti-ad measures).

You don't look at images outside of the main content column for this discussion, because those aren't what should be there in static paper-like publications—unless the images are related to the content. And if the images are related to the content, the question is why the author is trying to use a fancy layout for such a publication.

(NYTimes do sometimes use more complex layouts in feature articles, with dynamic effects—but they, presumably, don't target those for long-term archival, and instead they customize the pages for mobile and desktop access separately. Anyway, they also should tone that down if they want readership via something like Pocket.)

I most often have problems with images on Wikipedia, because they make images float to the right side since they have many non-essential but illustrative images. Those, indeed, tend to detach from the relevant text.

bosie · on Oct 12, 2018

Sorry, I misunderstood you. I thought you wanted to move away from PDFs because they don't resize. But none of the examples I gave resize either (neither does Pocket or Instapaper).

aasasd · on Oct 12, 2018

“Resize” is an ambiguous term, so you may or may not have understood me correctly, I'm still not sure. My (primary) problem with PDF is that it doesn't adapt to displays of different sizes (mobile, e-ink and tablet devices in addition to desktop machines) and can't be reformatted on a device (e.g. to adjust the author's typesetting choices).

Did you mean “zooming” the page in/out on the same device? That's not a big issue, in my experience: I zoom in on almost every page due to myopia, and rarely have problems. I adjust text properties on mobile devices too, namely in Pocket and e-book readers (which use HTML under the hood these days). Technically, HTML can be rendered with a rigid layout and just be zoomed in/out like a static image; it's a question of the client having this function, or, I think, it can be done via a simple CSS property.

If that's still not what you had in mind, I'd like to know what you mean by “resize,” out of professional curiosity.

bosie · on Oct 12, 2018

Fair point about font-resizing.

Since you're asking out of professional curiosity: what I meant by resize is the utilization of the device's screen. If my screen allows for a 1200px wide browser window, the main content shouldn't use 800px of it. On my 5000px wide screen, nytimes.com articles seem to utilize a whopping 10-15% (I am guessing). Might as well just send me a fixed-layout PDF.

That being said, I doubt it is computationally easy to compute a good layout. Considering how slowly LaTeX compiles a PDF, finding the optimal layout for a non-rigid document seems difficult under the time constraint at hand.

aasasd · on Oct 13, 2018

Oh, I happen to know a bit about this issue. It's very much not recommended to have long lines of text, as you may already know, because that way the eye has trouble finding the next line when returning from the end of the previous one, and the entire reading endeavor becomes a rather janky experience. That's one of the primary reasons that we have book pages in portrait orientation and that newspaper articles are stretched in vertical columns. With this limitation, it would be quite pointless to try “utilizing” the screen area with other elements, since they can't just be arbitrarily hanging around the text.

If you're doing a lot of reading, you would do better by having your screen in portrait orientation. Wide screens are better suited for other tasks.

I'm tempted to note, however, that HTML with a simple layout, again, can technically be hammered into displaying in several columns on a wide screen. You'd probably want/need site-specific solutions if you want to keep the site's navigation. But if you need only the main content, you could use an extension akin to the “reading mode” of Firefox/Safari/Pocket, and override the CSS to break content into columns. (There might also be such extensions around that already have columns built in.)
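As a rough sketch of the column override, assuming you already have the main content saved as plain HTML: the Python function below (the name `add_columns` and its defaults are invented for this example) injects a stylesheet that uses the standard CSS `column-width` and `column-gap` properties. A browser extension would apply the same CSS directly; this only post-processes a saved page.

```python
def add_columns(html: str, column_width_em: int = 30, gap_em: int = 3) -> str:
    """Return html with a stylesheet that flows <body> into multiple columns."""
    style = (
        "<style>"
        f"body {{ column-width: {column_width_em}em; column-gap: {gap_em}em; }}"
        # Ask the browser not to split figures across two columns.
        "img, figure { break-inside: avoid; max-width: 100%; }"
        "</style>"
    )
    # Insert just before </head> when the page has one; otherwise prepend.
    if "</head>" in html:
        return html.replace("</head>", style + "</head>", 1)
    return style + html


page = "<html><head></head><body><p>Text.</p><img src='fig1.svg'><p>More.</p></body></html>"
print(add_columns(page))
```

With `column-width` the browser decides how many columns fit, so the same page gets one column on a phone and several on a very wide screen, while each line stays at a readable length.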

nabla9 · on Oct 12, 2018

PDF is for long term archival.

There is no standard and widely recognized long term archival format for HTML pages (with all the extras). Web ARChive (WARC) provides a method for bundling all the stuff in one file, but that's not enough. Plus the files will be quite large.

You just don't know how your HTML and JavaScript will render 10 - 15 years from now. If you look at old Web Archive files, you start to see how they become crap over time.

aasasd · on Oct 12, 2018

HTML is the format. You pack it with images, CSS and whatever else, and you have the distribution format.

> Web ARChive (WARC) provides a method for bundling all the stuff in one file, but that's not enough. Plus the files will be quite large.

Not enough how? What is there that you need besides what the server hands to you, if that's what was rendered in the first place? What magical compression methods do you have in PDF that are better than ZIP compression used in MAFF, for example?

> You just don't know how your HTML and JavaScript will render 10 - 15 years from now. If you look at old Web Archive files, you start to see how they become crap over time.

Have a static HTML version that's rendered the same in the future. You know, the same way that you have a static PDF standard.

How do you render Javascript in PDFs in a standard way? You don't use Javascript, that's how. Javascript is not for publication of static semantic text, so you don't use Javascript for papers, it's a no-brainer.

nabla9 · on Oct 12, 2018

> HTML is the format. You pack it with images, CSS and whatever else, and you have the distribution format.

HTML is not a good format and standard for that purpose. It's loose, best-effort markup with no good consensus on semantics. HTML with images is not a good option for papers which have many equations.

EPUB3 is an emerging standard for what you want, but it's not really a good, complete solution that can replace PDF/A or TeX/LaTeX.

> Have a static HTML version that's rendered the same in the future

We don't have that.

aasasd · on Oct 12, 2018

> It's loose, best-effort markup with no good consensus on semantics.

And PDF has good semantics? Are we still on the topic of how HTML is better than PDF, or…? We're in the comments for a page that says that PDF tables are characters just floating in space, and people are saying most PDFs out there don't have semantic markup. Meanwhile HTML has had semantics efforts for decades now, just choose your flavor.

Blind people read HTML, you know. Do they read PDFs?

> HTML with images is not a good option for papers which have many equations.

There's MathML for that, and IIRC other formats too. You could even have embedded TeX like Anki has. Use SVG for fallback.

>> Have a static HTML version that's rendered the same in the future

> We don't have that.

Ooh, chicken-and-egg again? Freeze any of the versions from the past decade with the rendering standards, and you'll have it.

But actually, it doesn't even matter, just like HTML 2.0 can be rendered fine on modern devices (aside from the different text size). Treat your paper as a paper instead of a webzine, don't use crazy layouts, just do “text, image, text” which you'll want anyway for the different displays—and your document will render fine in the future when it will be delivered straight to the retina, instead of making me scroll the PDF back and forth because no reflow.