Remix.run Logo
mlissner 6 hours ago

Cool to see this here. It’s funny because we do so many huge, complex, multiyear projects at Free Law Project, but this is the most viral any of our work has ever gone!

Anyway, I made X-ray to analyze the millions of documents we have in CourtListener so that we can try to educate people about the issue.

The analysis was fun. We used S3 batch jobs to analyze millions of documents in a matter of minutes, but we haven’t done the hard part of looking at the results and reporting them out. One day.

thangalin 5 hours ago | parent [-]

https://www.argeliuslabs.com/deep-research-on-pdf-redaction-...

> Information Leaking from Redaction Marks: Even when content is properly removed, the redaction marks themselves can leak some information if not done carefully. For example, if you have a black box exactly covering a word, the length of that black box gives a clue to the word’s length (and potentially its identity).

Does X-ray employ glyph spacing attacks and try to exploit font metric leaks?

mlissner 4 hours ago | parent [-]

No, we worked with researchers that developed that kind of system, but didn't broadcast our work b/c the research was too sensitive. Seems the cat is out the bag now though.

I think the combination of AI and font-metrics is going to be wild though. You ought to be able to make a system that can figure out likely words based on the unredacted ones and the redaction's size. I haven't seen any redaction system yet that protects against this.

thangalin 4 hours ago | parent | next [-]

> I haven't seen any redaction system yet that protects against this.

The linked article suggests widening redacted areas more than needed with some randomization applied to the width. Strikes me that that wouldn't do much except add a few more possible solutions.

vlovich123 4 hours ago | parent [-]

Yeah, the more robust protection is to widen to a constant. But in the general case that could require reflowing the pdf. But honestly single word redactions are really probably useless with cheap AI that can highly accurately fill in the gaps

rgmerk 4 hours ago | parent [-]

Depends what you're trying to hide.

If the redaction is a person's name, and there's nothing else to give the person's identity away, single word redaction probably works reasonably well, AI or no AI.

godelski 2 hours ago | parent [-]

  > If the redaction is a person's name
I'm not sure if you're aware, but peoples names are variable in length. We are talking about a system that can identify single character differences. So that does reduce the search space, especially since names are not all possible letter permutations. Combine that with the fact that it isn't uncommon to see partial first letters show up. You can even see some instances in the Epstein files.

Of course, you can also take this further. Even if you can't recover names you can get meta information about how many parties are involved by recognizing different length redactions correspond to different entities. While same length redaction doesn't guarantee same entity it is a hint.

mycall an hour ago | parent [-]

It is also common for authors to misspell names (proper nouns) in an attempt to determine who leaks docs (and to force non-matches for FOIA requests).

vlovich123 4 hours ago | parent | prev [-]

I thought glyph spacing attacks are an old idea; like I recall reading about such ideas 10-20 years ago unless I’m misremembering. Can you clarify why it was considered “too sensitive” if the whole point of this effort is to showcase these attacks?