bhouston 3 hours ago

BTW, I've looked into custom dictionaries before for similar use cases, and I suspect one would only offer something like a 1% improvement for PDFs. Still good, but not a massive difference-maker. The issue is that PDFs, like web pages, are incredibly repetitive in their tags and structure. As a result, a custom dictionary only helps if the doc is really small; otherwise, because of that repetition, the dictionary the compressor infers on its own will resemble the custom one after just a few blocks of PDF content.

The sole exception is if they restart the brotli stream for each page and don't share a dictionary, custom or inferred, across the whole doc. Then the dictionary has to be re-inferred on every page, and a shared custom dictionary would make more sense.
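
A rough way to sanity-check both points, as a sketch rather than a real benchmark. Assumptions: Python's stdlib `zlib` (deflate with a preset dictionary via `zdict`) stands in for brotli, since the stdlib has no brotli; `make_page` and `SHARED_DICT` are made-up names that fake PDF-like page boilerplate instead of reading real PDFs. Deflate's window is much smaller than brotli's, so the absolute numbers won't transfer, only the general shape.

```python
import zlib
from typing import Optional


def make_page(n: int) -> bytes:
    """Hypothetical stand-in for repetitive PDF page markup (not real PDF output)."""
    return (
        f"{n} 0 obj\n<< /Type /Page /Parent 1 0 R /MediaBox [0 0 612 792] "
        f"/Resources << /Font << /F1 2 0 R >> >> >>\nendobj\n"
        f"BT /F1 12 Tf 72 720 Td (Text of page {n}) Tj ET\n"
    ).encode("ascii")


# The "custom dictionary": boilerplate assumed to be known ahead of time.
SHARED_DICT = make_page(0)


def deflate_size(data: bytes, zdict: Optional[bytes] = None) -> int:
    """Compressed size of one independent deflate stream, optionally primed with zdict."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(data) + comp.flush())


def sizes(num_pages: int) -> dict:
    """Compare one stream for the whole doc against a fresh stream per page."""
    pages = [make_page(n) for n in range(1, num_pages + 1)]
    doc = b"".join(pages)
    return {
        "one_stream": deflate_size(doc),
        "one_stream_dict": deflate_size(doc, SHARED_DICT),
        "per_page": sum(deflate_size(p) for p in pages),
        "per_page_dict": sum(deflate_size(p, SHARED_DICT) for p in pages),
    }


def dict_saving(num_pages: int) -> float:
    """Fraction of bytes the custom dictionary saves when the doc is a single stream."""
    s = sizes(num_pages)
    return 1 - s["one_stream_dict"] / s["one_stream"]


if __name__ == "__main__":
    for n in (1, 200):
        print(n, "pages:", sizes(n), f"single-stream saving: {dict_saving(n):.1%}")
```

With a single stream, the dictionary's saving should shrink as the doc grows, because the compressor has already seen the boilerplate after the first page. With a fresh stream per page, every page pays for the boilerplate again unless the shared dictionary is supplied.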