ISO PDF spec is getting Brotli – ~20 % smaller documents with no quality loss

▲ ISO PDF spec is getting Brotli – ~20 % smaller documents with no quality loss (pdfa.org)

94 points by whizzx 7 hours ago | 43 comments

▲ ericpauley 4 hours ago | parent | next [-]

Some real cognitive dissonance in this article…

“The PDF Association operates under a strict principle—any new feature must work seamlessly with existing readers” followed by introducing compression as a breaking change in the same paragraph.

All this for brotli… on a read-many format like PDF, zstd’s decompression speed is a much better fit.

▲ xxs 3 hours ago | parent [-]

yup, zstd is better. Overall, use zstd for pretty much anything that can benefit from general-purpose compression. It's a beyond excellent library, tool, and (set of) algorithm(s).

Brotli w/o a custom dictionary is a weird choice to begin with.

▲ adzm 3 hours ago | parent | next [-]

Brotli makes a bit of sense considering this is a static asset; it compresses somewhat more than zstd. This is why brotli is pretty ubiquitous for precompressed static assets on the Web.

That said, I personally prefer zstd as well, it's been a great general use lib.

▲ dist-epoch 2 hours ago | parent [-]

You need to crank up zstd compression level.

zstd is Pareto better than brotli - compresses better and faster

▲ atiedebee 2 hours ago | parent | next [-]

I thought the same, so I ran brotli and zstd on some PDFs I had laying around.

  brotli 1.0.7 args: -q 11 -w 24
  zstd v1.5.0  args: --ultra -22 --long=31 
                 | Original | zstd    | brotli
  RandomBook.pdf | 15M      | 4.6M    | 4.5M
  Invoice.pdf    | 19.3K    | 16.3K   | 16.1K

I made a table because I wanted to test more files, but almost all PDFs I downloaded/had stored locally were already compressed and I couldn't quickly find a way to decompress them.

Brotli seemed to have a very slight edge over zstd, even on the larger pdf, which I did not expect.

▲ mort96 17 minutes ago | parent | next [-]

I did my own testing where Brotli also ended up better than ZSTD: https://news.ycombinator.com/item?id=46722044

Results by compression type across 55 PDFs:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------+------+-----+------+--------+
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+

▲ order-matters an hour ago | parent | prev [-]

What's the assumption we can point to as the reason for the counter-intuitive result?

That data in PDF files is noisy and zstd should perform better on noisy files?

▲ jeffbee 42 minutes ago | parent [-]

What's counter-intuitive about this outcome?

▲ order-matters 6 minutes ago | parent [-]

Maybe that was too strongly worded, but there was an expectation for zstd to outperform, so the fact that it didn't means the result was unexpected. I generally find it helpful to understand why something performs better than expected.

▲ DetroitThrow 2 hours ago | parent | prev | next [-]

I love zstd but this isn't necessarily true.

▲ dchest 2 hours ago | parent | prev | next [-]

Not with small files.

▲ jeffbee 2 hours ago | parent | prev [-]

Are you sure? Admittedly I only have 1 PDF in my homedir, but no combination of flags to zstd gets it to match the size of brotli's output on that particular file. Even zstd --long --ultra -22.

▲ greenavocado 3 hours ago | parent | prev [-]

This bizarre move has all the hallmarks of embrace-extend-extinguish rather than technical excellence.

▲ bhouston 4 hours ago | parent | prev | next [-]

Are they using a custom dictionary with Brotli designed for PDFs? I am not sure if it would help or not, but it seems like one of those cases it may help?

Something like this:

https://developer.chrome.com/blog/shared-dictionary-compress...

In my applications, in the area of 3D, I've been moving away from Brotli because it is just so slow for large files. I prefer zstd, because it is like 10x faster for both compression and decompression.
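On the custom-dictionary question, here is a minimal sketch of the general idea. It uses zlib's preset-dictionary support from the Python standard library as a stand-in, since Brotli bindings are a third-party dependency. The dictionary contents and the sample object are made up for illustration; this is not the mechanism the PDF Association is evaluating, only the same principle: strings both sides already know do not need to be stored in the stream.

```python
import zlib
from typing import Optional

# Hypothetical shared dictionary: boilerplate that recurs in PDF page objects.
# A real dictionary would be trained on a corpus of documents.
PDF_DICT = (
    b"<< /Type /Page /Parent 1 0 R /MediaBox [0 0 612 792] "
    b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>\nendobj\n"
)

def deflate(data: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Compress with Deflate, optionally primed with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()

def inflate(blob: bytes, zdict: Optional[bytes] = None) -> bytes:
    """Decompress; the reader must hold the same dictionary as the writer."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()

# A small object that resembles the dictionary, as small PDF objects often would.
sample = (
    b"<< /Type /Page /Parent 1 0 R /MediaBox [0 0 612 792] "
    b"/Resources << /Font << /F1 5 0 R >> >> /Contents 9 0 R >>"
)

plain = deflate(sample)
primed = deflate(sample, PDF_DICT)
print(len(sample), len(plain), len(primed))
```

The gain is largest for small streams, where there is little history to learn from. The cost is that every reader has to ship the exact same dictionary, which is another compatibility surface.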

▲ whizzx 3 hours ago | parent [-]

The PDF Association is still running experiments on whether to support custom dictionaries, based on the gains seen in real-life workloads. So it might land in the spec once it has proven it offers enough value.

▲ bobpaw 4 hours ago | parent | prev | next [-]

How can iText claim that adding Brotli is not a backward incompatible change (in the "Why keep encoding separate" table)? In the first section the author states that any new feature must work seamlessly with existing readers. New documents created that include this compression would be unintelligible to any reader that only supports Deflate.

Am I missing something? Adoption will take a long time if you can't be confident the receiver of a document or viewers of a publication will be able to open the file.
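The compatibility concern can be sketched as a filter lookup. PDF streams name their decoder in a /Filter entry, and a reader can only decode the filters it was built with. The reader below is hypothetical and models only FlateDecode; the name "BrotliDecode" is an assumption about what the new filter would be called, not something confirmed in this thread.

```python
import zlib

# Filters this hypothetical legacy reader understands. Real readers support
# more (LZWDecode, DCTDecode, ...); only FlateDecode is modelled here.
KNOWN_FILTERS = {"FlateDecode": zlib.decompress}

class UnsupportedFilter(Exception):
    """Raised when a stream names a filter the reader was not built with."""

def decode_stream(filter_name: str, raw: bytes) -> bytes:
    try:
        decoder = KNOWN_FILTERS[filter_name]
    except KeyError:
        raise UnsupportedFilter(filter_name) from None
    return decoder(raw)

# An existing document: decodes fine.
payload = zlib.compress(b"BT (Hello) Tj ET")
ok = decode_stream("FlateDecode", payload)

# A new document using the assumed filter name: the old reader has no decoder.
try:
    decode_stream("BrotliDecode", b"\x00")
    legacy_result = "decoded"
except UnsupportedFilter as exc:
    legacy_result = f"cannot decode /{exc}"

print(ok, legacy_result)
```

Old files keep working in new readers, but new files fail in old readers, which is the direction the comment above is worried about.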

▲ whizzx 3 hours ago | parent [-]

It's prototype-ish work to support it before it lands in the official specification. But it will indeed take some adoption time.

I'm doing the work to patch in support across different viewers to help adoption grow. Once the big open-source ones (pdfjs, poppler, pdfium) ship it, adoption can rise quickly.

▲ croes 2 hours ago | parent [-]

There are old devices where the viewer can’t be patched. That’s killing one of the main features of PDF.

▲ ndriscoll 2 hours ago | parent | prev | next [-]

What is the point of using a generic compression algorithm in a file format? Does this actually get you much over turning on filesystem and transport compression, which can transparently swap the generic algorithm (e.g. my files are already all zstd compressed, and HTTP can already negotiate brotli or zstd)? If it's not tuned to the application, it seems better to leave it uncompressed and let the user decide what they want (e.g. people noting tradeoffs with brotli vs zstd; let the person who has to live with the tradeoff decide it, not the original file author).

▲ wongarsu 8 minutes ago | parent | next [-]

Few people enable file system compression, and even if they do it's usually with fast algorithms like lz4 or zstd -1. When authoring a document you have very different tradeoffs and can afford the cost of high compression levels of zstd or brotli.

▲ eru an hour ago | parent | prev | next [-]

Well, if sanity had prevailed, we would have likely stuck to .ps.gz (or your favourite compression format), instead of ending up with PDF. Though we might still want to restrict the subset of PostScript that we allow. The full language might be a bit too general to take from untrusted third parties.

▲ Someone 38 minutes ago | parent | prev [-]

- Inside the file, the compressor can be varied according to the file content. For example, images can use JPEG, but that isn’t useful for compressing text.
- When jumping from page to page, you won’t have to decompress the entire file.
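The page-jumping point can be sketched with the standard library. The "pages" and the offset index below are invented for illustration; a real PDF locates objects through its cross-reference table rather than a Python list, and zlib stands in for whichever codec the file uses.

```python
import zlib

# Five made-up "pages" of repetitive text.
pages = [f"Page {n}: ".encode() + b"lorem ipsum " * 50 for n in range(1, 6)]

# Per-stream layout: each page is compressed on its own, and an index
# records where its compressed bytes live inside the container.
blob = bytearray()
index = []  # (offset, length) per page
for page in pages:
    packed = zlib.compress(page, 9)
    index.append((len(blob), len(packed)))
    blob += packed

def read_page(n: int) -> bytes:
    """Inflate only the bytes belonging to page n (0-based)."""
    offset, length = index[n]
    return zlib.decompress(bytes(blob[offset:offset + length]))

# Whole-file layout: one compressed unit. Reaching any page means
# inflating everything that precedes it.
whole = zlib.compress(b"".join(pages), 9)

print(len(blob), len(whole), read_page(3)[:8])
```

The per-stream layout gives random access at the cost of some compression ratio, because each stream starts with no history. Filesystem or transport compression sits closer to the whole-file layout.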

▲ ksec 3 hours ago | parent | prev | next [-]

Why not zstd?

▲ HackerThemAll 2 hours ago | parent | next [-]

I think this was the main reason (from the linked article) LOL:

"Brotli is a compression algorithm developed by Google."

They have no idea about Zstandard nor ANS/FSE comparing it with LZ77.

Sheer incompetence.

▲ mort96 34 minutes ago | parent [-]

I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.

I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:

    +------+------+-----+------+--------+
    | none | zstd | xz  | gzip | brotli |
    +------+------+-----+------+--------+
    | 47M  | 45M  | 39M | 38M  | 37M    |
    +------+------+-----+------+--------+
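For readers who prefer percentages, the totals above can be turned into relative savings. The inputs are rounded to whole megabytes, so the results are approximate.

```python
# Totals from the table above, in megabytes (rounded in the original).
totals_mb = {"none": 47, "zstd": 45, "xz": 39, "gzip": 38, "brotli": 37}

baseline = totals_mb["none"]
savings = {
    name: round(100 * (baseline - size) / baseline, 1)
    for name, size in totals_mb.items()
    if name != "none"
}

for name, pct in savings.items():
    print(f"{name:>6}: {pct:.1f} % smaller")
```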

Here's a table with all the files:

    +------+------+------+------+--------+
    | raw  | zstd | xz   | gzip | brotli |
    +------+------+------+------+--------+
    | 12K  | 12K  | 12K  | 12K  | 12K    |
    | 20K  | 20K  | 20K  | 20K  | 20K    | x5
    | 24K  | 20K  | 20K  | 20K  | 20K    | x5
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 28K  | 24K  | 24K  | 24K  | 24K    |
    | 32K  | 20K  | 20K  | 20K  | 20K    | x3
    | 32K  | 24K  | 24K  | 24K  | 24K    |
    | 40K  | 32K  | 32K  | 32K  | 32K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 44K  | 40K  | 40K  | 40K  | 40K    |
    | 48K  | 36K  | 36K  | 36K  | 36K    |
    | 48K  | 48K  | 48K  | 48K  | 48K    |
    | 76K  | 128K | 72K  | 72K  | 72K    |
    | 84K  | 140K | 84K  | 80K  | 80K    | x7
    | 88K  | 136K | 76K  | 76K  | 76K    |
    | 124K | 152K | 88K  | 92K  | 92K    |
    | 124K | 152K | 92K  | 96K  | 92K    |
    | 140K | 160K | 100K | 100K | 100K   |
    | 152K | 188K | 128K | 128K | 132K   |
    | 188K | 192K | 184K | 184K | 184K   |
    | 264K | 256K | 240K | 244K | 240K   |
    | 320K | 256K | 228K | 232K | 228K   |
    | 440K | 448K | 408K | 408K | 408K   |
    | 448K | 448K | 432K | 432K | 432K   |
    | 516K | 384K | 376K | 384K | 376K   |
    | 992K | 320K | 260K | 296K | 280K   |
    | 1.0M | 2.0M | 1.0M | 1.0M | 1.0M   |
    | 1.1M | 192K | 192K | 228K | 200K   |
    | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.2M | 1.1M | 1.0M | 1.0M | 1.0M   |
    | 1.3M | 2.0M | 1.1M | 1.1M | 1.1M   |
    | 1.7M | 2.0M | 1.7M | 1.7M | 1.7M   |
    | 1.9M | 960K | 896K | 952K | 916K   |
    | 2.9M | 2.0M | 1.3M | 1.4M | 1.4M   |
    | 3.2M | 4.0M | 3.1M | 3.1M | 3.0M   |
    | 3.7M | 4.0M | 3.5M | 3.5M | 3.5M   |
    | 6.4M | 4.0M | 4.1M | 3.7M | 3.5M   |
    | 6.4M | 6.0M | 6.1M | 5.8M | 5.7M   |
    | 9.7M | 10M  | 10M  | 9.5M | 9.4M   |
    +------+------+------+------+--------+

Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.
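A rough sketch of why already-compressed content limits what any outer compressor can do. It uses zlib and xz from the Python standard library (brotli and zstd need third-party packages), and random bytes as a stand-in for compressed image data, which looks statistically similar. It does not reproduce the figures above and does not explain the zstd outliers in particular.

```python
import lzma
import os
import zlib

def ratio(data: bytes, compress) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(compress(data)) / len(data)

# Stand-in for an uncompressed text content stream: highly repetitive.
text_like = b"BT /F1 12 Tf 72 720 Td (Total due: 100.00) Tj ET\n" * 2000

# Stand-in for an embedded, already-compressed image: high-entropy bytes.
image_like = os.urandom(200_000)

codecs = (
    ("zlib", lambda d: zlib.compress(d, 9)),
    ("xz", lzma.compress),
)

results = {
    (label, codec): ratio(data, fn)
    for label, data in (("text", text_like), ("image", image_like))
    for codec, fn in codecs
}

for key, value in results.items():
    print(key, f"{value:.3f}")
```

If most of a PDF's bytes are image streams like the second input, the whole-file ratio is dominated by data that no general-purpose codec can shrink, and differences between codecs come mostly from the remaining text and structure.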

Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.

I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have laying around in my downloads folder :p

▲ PunchyHamster 3 hours ago | parent | prev [-]

incompetence

▲ whizzx 3 hours ago | parent [-]

You can read about it here https://pdfa.org/brotli-compression-coming-to-pdf/

▲ jeffbee 3 hours ago | parent [-]

That mentions zstd in a weird incomplete sentence, but never compares it.

▲ F3nd0 3 hours ago | parent | next [-]

They don’t seem to provide a detailed comparison showing how each compression scheme fared at every task, but they do list (some of) their criteria and say they found Brotli the best of the bunch. I can’t tell if that’s a sensible conclusion or not, though. Maybe Brotli did better on code size or memory use?

▲ eviks 3 hours ago | parent | prev [-]

Hey, they did all the work and more, trust them!!!

> Experts in the PDF Association’s PDF TWG undertook theoretical and experimental analysis of these schemes, reviewing decompression speed, compression speed, compression ratio achieved, memory usage, code size, standardisation, IP, interoperability, prototyping, sample file creation, and other due diligence tasks.

▲ LoganDark 8 minutes ago | parent [-]

I love when I perform all the due diligence tasks. You just can't counter that. Yes but, they did all the due diligence tasks. They considered all the factors. Every one. Think you have one they didn't consider? Nope.

▲ superkuh 23 minutes ago | parent | prev | next [-]

This is nice, but PDF jumped the shark already. It's no longer a document format that always looks the same everywhere. The inclusion of "Dynamic XFA (XML Form Architecture) PDF" in the spec made PDF an unreliable format. The aforementioned is a PDF without content that pulls down all its content from the web. It even still, ostensibly, supports Flash (swf) animations. In practice these "PDF"s are just empty white pages with an error message like:

>"Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document. You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download. For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader. Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries."

▲ nialse 3 hours ago | parent | prev | next [-]

Who is responsible for the terrible decision? In the pro vs con analysis, saving 20% size occasionally vs updating ALL pdf libraries/apps/viewers ever built SHOULD be a no-brainer.

▲ h4x0rr 3 hours ago | parent | prev | next [-]

Wouldn't lzma2 be better here since a pdf is more read heavy?

▲ F3nd0 3 hours ago | parent [-]

Going by one of Brotli’s authors’ comment [1] on another post, it probably wouldn’t.

[1] https://news.ycombinator.com/item?id=46035817

▲ avalys 2 hours ago | parent | prev | next [-]

This article is AI slop.

▲ jeffbee an hour ago | parent [-]

Yep.

▲ cess11 3 hours ago | parent | prev | next [-]

'Your PDFs will open slower because we decided that the CDN providers are more important than you'.

If size was important to users then it wouldn't be so common that systems providers crap out huge PDF files consisting mainly of layout junk 'sophistication' with rounded borders and whatnot.

The PDF/A stuff I've built stays under 1 MB for hundreds of pages of information, because it's text placed in a typographically sensible manner.

▲ delfinom 4 hours ago | parent | prev [-]

tl;dr Commercial entity is paying to have the ISO altered to "legalize" the SDK they are pushing, which is incompatible with standard PDF readers.

ISO is pay to play so :shrug:

▲ whizzx 3 hours ago | parent | next [-]

No, this feature is coming straight from the PDF Association itself, and we just added experimental support before it's officially in the spec to help testing between different SDK processors.

So your comment is a falsehood

▲ lmz 4 hours ago | parent | prev | next [-]

It's not even clear that they were the ones suggesting inclusion. They're just saying their library now supports the new thing.

https://pdfa.org/brotli-compression-coming-to-pdf/

> As of March 2025, the current development version of MuPDF now supports reading PDF files with Brotli compression. The source is available from github.com/ArtifexSoftware/mupdf, and will be included as an experimental feature in the upcoming 1.26.0 release.

> Similarly, the latest development version of Ghostscript can now read PDF files with Brotli compression. File creation functionality is underway. The next official Ghostscript release is scheduled for August this year, but the source is available now from github.com/ArtifexSoftware/Ghostpdl.

▲ adrian_b 2 hours ago | parent [-]

Yes, I do not see any source of financial gain that could motivate them for this, because both MuPDF and Ghostscript are free.

MuPDF is an excellent PDF reader, the fastest that I have ever tested. There are plenty of big PDF files where most other readers are annoyingly slow. It is my default PDF and EPUB reader, except in the very rare cases where I encounter PDF files that MuPDF cannot understand; for those I use other PDF readers (e.g. Okular).

▲ bhouston 4 hours ago | parent | prev [-]

I'm no fan of Adobe, but it is not that hard to add brotli support given that it is open. Probably can be added by AI without much difficulty - it is a simple feature. I think compared to the ton of other complex features PDF has, this is an easy one.