▲ tomekf a day ago

How is it done, from a technical point of view?

▲mmh0000 20 hours ago | parent | next [-]

Layers.

PDF is an absurdly complex file format. It's part of the reason there is no single "good" PDF reader, just a lot of mediocre PDF readers that are all terrible in their own way. Which is a topic for another day.

There are several ways to remove data in a PDF:

- Remove the data. This is much harder than it sounds. Many PDF tools won't let you change the content of a PDF, not because it isn't possible, but because you'll likely massively screw up the formatting, and the tools don't want to deal with that.

- Replace the data. This is what all the "blackout" tools do, find "A" and replace with "🮋". This is effective and doesn't break formatting since it's a 1-to-1 replacement. The problem with "replacing" is that not every PDF tool works the same way, and some, instead, just change the foreground and background color to black; it looks nearly the same, but the power of copy-and-paste still functions.

- Then you have the computer illiterate, who think changing the foreground and background color to black is good enough anyway.

▲zauguin 10 hours ago | parent | next [-]

This seems highly misleading.

> - Remove the data. This is much harder than it sounds. Many PDF tools won't let you change the content of a PDF, not because it isn't possible, but because you'll likely massively screw up the formatting, and the tools don't want to deal with that.

Compared to other formats, this is actually relatively easy in a PDF: because of the way the text-drawing operators work, they don't influence the state for arbitrary other content. A lot of positioning in a PDF is absolute (or relative to an explicitly defined matrix which has hardcoded values). Usually this makes editing a PDF harder (since when you change text, the related text does not adapt automatically), but when removing data it makes things much easier, since you can mostly just delete it without affecting anything else. (There are exceptions for text immediately after the removed data, but that's limited and relatively easy to control.)
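A minimal sketch of that point, assuming an uncompressed toy content stream in which every text object sets its own absolute position with `Tm`. The stream, the font name `/F1`, and the helper `drop_text_objects` are all made up for illustration; real files usually compress their streams and often split or encode strings differently, so this is not a redaction tool.

```python
import re

# Toy page content: three text objects, each positioned absolutely via Tm.
LINE_1 = b"BT /F1 12 Tf 1 0 0 1 72 720 Tm (Invoice 2024-17) Tj ET\n"
LINE_2 = b"BT /F1 12 Tf 1 0 0 1 72 700 Tm (Card number 4111 1111) Tj ET\n"
LINE_3 = b"BT /F1 12 Tf 1 0 0 1 72 680 Tm (Total due: 40 EUR) Tj ET\n"
page = LINE_1 + LINE_2 + LINE_3


def drop_text_objects(stream: bytes, needle: bytes) -> bytes:
    """Remove every BT ... ET text object whose bytes contain `needle`."""

    def keep_or_drop(match: re.Match) -> bytes:
        return b"" if needle in match.group(0) else match.group(0)

    # Lazy match from each BT to the nearest ET, including a trailing newline.
    return re.sub(rb"BT\b.*?\bET\b\n?", keep_or_drop, stream, flags=re.DOTALL)


cleaned = drop_text_objects(page, b"Card number")
print(cleaned.decode("latin-1"))
```

Because the remaining text objects carry their own coordinates, their bytes are untouched after the deletion and they would be drawn in the same place as before.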

> - Replace the data. This is what all the "blackout" tools do, find "A" and replace with "🮋". This is effective and doesn't break formatting since it's a 1-to-1 replacement.

That's actually rather tricky in PDFs, since they usually contain embedded subset fonts, and these usually do not have "🮋" as part of the subset. Doing this would also break the layout, since "🮋" has a different width than most letters in a typical font, so it would not lead to fewer formatting issues than the previous option. That is, unless the "🮋" is stretched for each letter to have the same dimensions, but then the stretched characters allow the text to be recovered.

> The problem with "replacing" is that not every PDF tool works the same way, and some, instead, just change the foreground and background color to black; it looks nearly the same, but the power of copy-and-paste still functions.

PDF does not have a concept of a background color. If it looks like a background color in PDF, you have a rectangle drawn in one color and something in the foreground color in front of it. What you usually see in badly redacted PDF files is exactly this, but in opposite color: someone just draws a black box on top of the characters. You could argue that this is smarter since it would still work even if someone were to change colors, but of course, PDF is a vector format. If you just add a rectangle, someone else can remove it again. (And also, copy & paste doesn't care about your rectangle.)
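To make the black-box failure concrete, here is a rough sketch on a made-up, uncompressed content stream: text is drawn first, then a black rectangle is filled on top. The helpers `extract_text` and `remove_rectangles` are hypothetical and only understand this toy layout; they stand in for what a text extractor or a curious reader with an editor would do.

```python
import re

# Toy content stream: a line of text, then a black filled rectangle over it.
content = b"""BT
/F1 12 Tf
72 700 Td
(Account holder: Jane Doe) Tj
ET
0 0 0 rg
160 696 60 14 re
f
"""


def extract_text(stream: bytes) -> list[str]:
    """Collect the string operands of Tj; painting operators are ignored."""
    found = re.findall(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj", stream)
    return [s.decode("latin-1") for s in found]


def remove_rectangles(stream: bytes) -> bytes:
    """Drop each 'x y w h re' line that is directly followed by a fill ('f')."""
    lines = stream.splitlines()
    kept = []
    i = 0
    while i < len(lines):
        is_rect = lines[i].strip().endswith(b" re")
        if is_rect and i + 1 < len(lines) and lines[i + 1].strip() == b"f":
            i += 2  # skip the rectangle and its fill
            continue
        kept.append(lines[i])
        i += 1
    return b"\n".join(kept) + b"\n"


# Copy and paste: the rectangle has no effect on the text operands.
print(extract_text(content))
# Removing the box: the text object is still there, now visible again.
print(remove_rectangles(content).decode("latin-1"))
```

In both cases the string is recovered, because the rectangle was only ever painted over the text and never replaced it.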

▲gruez 9 hours ago | parent | prev | next [-]

>- Remove the data. This is much harder than it sounds. Many PDF tools won't let you change the content of a PDF, not because it isn't possible, but because you'll likely massively screw up the formatting, and the tools don't want to deal with that.

>- Replace the data. This is what all the "blackout" tools do, find "A" and replace with "🮋". This is effective and doesn't break formatting since it's a 1-to-1 replacement. The problem with "replacing" is that not every PDF tool works the same way, and some, instead, just change the foreground and background color to black; it looks nearly the same, but the power of copy-and-paste still functions.

You're making it sound way harder than it is, when both Adobe Acrobat and the built-in Preview app on Mac can competently redact documents. I'm not aware of instances of either (or any other purpose-made redaction tools) failing. I wouldn't homebrew a Python script to do my redaction either, but that doesn't mean doing redactions properly is some insurmountable task for some intern.

▲array_key_first 5 hours ago | parent [-]

I would not trust either tool to adequately redact documents, though I'm sure it works under normal levels of scrutiny.

The most reliable way is to just screenshot the document or print and scan it, effectively burning it down and recreating it in a new format that has no concept of the past. This works across basically all formats, too, and against all tools.

▲JumpCrisscross 9 hours ago | parent | prev | next [-]

> Then you have the computer illiterate, who think changing the foreground and background color to black is good enough anyway

To be fair, this works if you print out those copies and then re-scan them.

▲hallole 15 hours ago | parent | prev | next [-]

Thanks for this. Really quells the urge I get every so often to just code my own PDF editor, because they all suck and certainly it couldn't be THAT hard. Such hubris!

▲brailsafe 12 hours ago | parent | next [-]

Heh, have at it, here's the full spec: https://developer.adobe.com/document-services/docs/assets/5b...

Should take... a weekend tops? ;) PDF is crazy and scary

▲marcosdumay 9 hours ago | parent | next [-]

> PDF includes eight basic types of objects: Boolean values, Integer and Real numbers, Strings, Names, Arrays, Dictionaries, Streams, and the null object

Wait, this is more complete than SOAP. It may be a good idea to redo the IPC protocol with a different serialization format!

▲jaggederest 7 hours ago | parent [-]

Well, it's a descendant of PostScript (much like JSON is a descendant of JavaScript, loosely).

Society would probably never recover if we started implementing RPC-in-PostScript though.

▲embedding-shape 11 hours ago | parent | prev | next [-]

7.5.6 "Incremental updates" from the specification is an interesting section too, speaking about accessing data people didn't think to remove from PDF files properly.
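A rough sketch of why incremental updates matter here, assuming only what that section describes: an update is appended to the end of the file and the earlier bytes are left in place, with each revision ending in its own `%%EOF` marker. The byte strings below are a toy stand-in, not a valid PDF, and the `revisions` helper is hypothetical. A real tool would follow the cross-reference chain instead of scanning for markers, since `%%EOF` can also occur inside stream data.

```python
def revisions(data: bytes) -> list[bytes]:
    """Return every prefix of the file that ends at an %%EOF marker."""
    marker = b"%%EOF"
    prefixes = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        prefixes.append(data[:end])
        pos = data.find(marker, end)
    return prefixes


# Toy "file" saved twice: the second save appends a new version of object 1.
original = b"%PDF-1.7\n1 0 obj (secret name) endobj\nxref...\ntrailer...\n%%EOF"
update = b"\n1 0 obj (REDACTED) endobj\nxref...\ntrailer...\n%%EOF"
saved = original + update

revs = revisions(saved)
print(len(revs))                  # number of revisions found
print(b"secret name" in saved)    # the old object is still in the file
print(revs[0] == original)        # truncating at the first marker restores it
```

The point of the sketch is only that the "edited" file still contains the earlier revision byte for byte, so anything replaced through an incremental save can be read back out.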

▲CamperBob2 11 hours ago | parent | prev [-]

We will be able to say that AGI has arrived when we can hand that spec off to a model and tell it to build an Acrobat clone.

▲exasperaited 3 hours ago | parent [-]

We will be able to say that AGI has arrived when the AI hands it back and says "no".

▲gregsadetsky 13 hours ago | parent | prev | next [-]

Don't stop yourself before getting started. I believe in you - maybe you could write the one editor that would actually work!

Not kidding - it's a ~~~billion dollar market haha

Make an MVP/Show HN :-)

▲kayodelycaon 10 hours ago | parent | prev | next [-]

I did a bunch of work creating PDFs using a low-level API, "object goes here" stuff.

As far as I understand it, at its core, PDF is just a stream of instructions that is continually modifying the document. You can insert a thousand objects before you start the next word in a paragraph. And this is just the most basic stuff. Anything on a page can be anywhere in the stream. I don't know if you can go back and edit previous pages, but you might have a shot at least trying to understand one page at a time.

Did you know you can have embedded XML in PDFs? You can have a paper form with all the data filled in and include an XML version of that for any computer systems that would like an easier way to read it.

▲TRiG_Ireland 10 hours ago | parent | prev | next [-]

The blog post about adding colour gradients to Typst dives into some of the weirdness of the format. https://typst.app/blog/2023/color-gradients

▲NamTaf 11 hours ago | parent | prev [-]

Bravo to you for recognising the load-bearing 'just' before you threw it around :)

▲sigwinch 6 hours ago | parent | prev [-]

qpdf has a redaction option. It’s routinely used to anonymize medical records for studies.

▲3eb7988a1663 14 hours ago | parent | prev | next [-]

I remember reading the recommendation for journalists to redact documents is to black them out in the digital version, print it out, and re-scan it. Anything else has too many potential ways by which it might be possible to smuggle data.

▲dmurray 11 hours ago | parent | next [-]

Even that might be vulnerable to length attacks: one reasonable plaintext would lead to black bars of 1135 px, another to 1138 px, and with enough redactions you can converge on what the plaintext might be.
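A toy sketch of such a length attack, assuming the attacker knows (or can guess) the font and has a list of candidate plaintexts. The per-character pixel widths in `WIDTHS`, the fallback `DEFAULT_WIDTH`, and the names are all invented for illustration; a real attack would take the advance widths from the actual font used in the document.

```python
# Hypothetical advance widths in pixels for a proportional font.
WIDTHS = {"i": 4, "l": 4, "m": 14, "w": 13, " ": 5}
DEFAULT_WIDTH = 9  # assumed width of every character not listed above


def text_width(text: str) -> int:
    """Rendered width of `text` in pixels under the toy font metrics."""
    return sum(WIDTHS.get(ch, DEFAULT_WIDTH) for ch in text.lower())


def candidates_matching(bar_px: int, names: list[str], tolerance: int = 1) -> list[str]:
    """Keep the candidates whose rendered width fits the measured black bar."""
    return [n for n in names if abs(text_width(n) - bar_px) <= tolerance]


names = ["Jim Hill", "Tom Wood", "Ann Marlowe"]
for name in names:
    print(name, text_width(name))

# A bar measured at 77 px rules out two of the three candidates.
print(candidates_matching(77, names))
```

Each additional redaction of the same name, or of overlapping phrases, adds another width constraint, which is how the candidate set shrinks.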

The only safe way for journalists is to paraphrase what the document said and to say "an unnamed source claims that ..." and to guarantee with your reputation, and the reputation of your publisher, that you are being faithful to what the original source said. For even better results, combine multiple sources.

Unfortunately paraphrasing things and taking editorial responsibility have both been deprecated in favour of rereleasing press releases in the house style, so it's difficult to get the actual journalism these days.

▲eviks 8 hours ago | parent [-]

You can use constant/variable-length replacement to avoid length leaks?
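A minimal sketch of the constant-length idea, assuming the redaction happens on plain text before the document is typeset again (it does not help once the layout is fixed, since the surrounding text would still reveal the gap). The `redact` helper and the `[REDACTED]` marker are hypothetical.

```python
def redact(text: str, secrets: list[str], marker: str = "[REDACTED]") -> str:
    """Replace each secret with the same fixed marker, whatever its length."""
    # Longest first, so a short secret cannot split a longer one that contains it.
    for secret in sorted(secrets, key=len, reverse=True):
        text = text.replace(secret, marker)
    return text


before = "Paid by Jo to Maximilian"
after = redact(before, ["Jo", "Maximilian"])
print(after)
```

Both names end up as the same marker, so the width of the replacement no longer depends on the length of what was removed.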

▲ 10 hours ago | parent | prev [-]
[deleted]
▲general1465 a day ago | parent | prev | next [-]

Confusing a redaction tool (which replaces the data with a black square) with a black highlighter (which adds a black square as another layer). If the people doing the redactions are computer-illiterate, they won't see the difference.

▲oliwarner 17 hours ago | parent | prev | next [-]

They drew black boxes over the text. The text is still underneath. On OCR'd scanned documents, the text you'd copy is actually stored in metadata and just linked by position to the image.

Anyway, if you click on a "redaction", you're clicking on the box and can't select the text underneath, but if you just highlight the text around it, you can copy all the original text.

It's a bizarre oversight.

▲Gigachad 9 hours ago | parent | prev | next [-]

PDF is less like an image, and more like a web page where elements can be stacked on top of each other. You can visually obscure things by sticking a black rectangle over the top, but anyone who inspects inside the pdf can remove it or see the text in the source.

There would also be a mix of text documents, and image scans. The way to censor each is different.

Perfectly censoring documents, particularly digital ones, is actually surprisingly difficult.

▲stronglikedan 9 hours ago | parent [-]

> Perfectly censoring documents, particularly digital ones, is actually surprisingly difficult.

But the difficult part is easily repeatable once it's figured out, which is why it surprises me that it's not built into Acrobat as a tool already.

▲etskinner 7 hours ago | parent [-]

In fact it is already built into Acrobat: https://helpx.adobe.com/acrobat/desktop/protect-documents/re...

▲ 13 hours ago | parent | prev [-]
[deleted]