| ▲ | A case study in PDF forensics: The Epstein PDFs(pdfa.org) |
| 145 points by DuffJohnson 3 hours ago | 55 comments |
| |
|
| ▲ | ted_bunny an hour ago | parent | next [-] |
| Has anyone analysed JE's writing style and looked for matches in archived 4chan posts or content from similar platforms? Same with Ghislaine, there should be enough data to identify them atp right? I don't buy the MaxwellHill claims for various reasons but it doesn't mean there's nothing to find. |
| |
| ▲ | qoez 23 minutes ago | parent | next [-] | | People always claimed this as a data leak vector but I've always been sceptical. Writing style and vocabulary alone are probably shared among too many people to narrow it down much. (How many people that you know could have written this reply?) The counter argument is that he had a very specific style in his mail so maybe this is a special case. | |
| ▲ | Der_Einzige an hour ago | parent | prev | next [-] | | Stylometry is extremely sophisticated even with simple n-gram analysis. There's a demo of this that can easily pick out who you are on HN just based on a few paragraphs of your own writing, based on N-gram analysis. https://news.ycombinator.com/item?id=33755016 You can also unironically spot most types of AI writing this way. The approaches based on training another transformer to spot "AI generated" content are wrong. | | | |
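For anyone curious what "simple n-gram analysis" can look like in practice, here is a minimal sketch, not the method used by the linked demo. It assumes character trigrams and cosine similarity as the comparison; the function names (`char_ngrams`, `cosine_similarity`, `rank_authors`) are made up for illustration, and real stylometry would need far more text per author plus proper validation.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after normalising case and whitespace."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile cannot be compared
    return dot / (norm_a * norm_b)


def rank_authors(unknown: str, known: dict[str, str], n: int = 3) -> list[tuple[str, float]]:
    """Rank candidate authors (hypothetical name -> writing sample) by similarity."""
    profile = char_ngrams(unknown, n)
    scores = [
        (author, cosine_similarity(profile, char_ngrams(sample, n)))
        for author, sample in known.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_authors(unknown_text, {"alice": "...", "bob": "..."})` returns the candidates sorted from most to least similar. A high score only means the n-gram profiles overlap, which is weak evidence on its own.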
| ▲ | kmeisthax an hour ago | parent | prev [-] | | I'm pretty sure Epstein tried to meet with moot at least once: https://www.jmail.world/search?q=chris+poole | | |
| ▲ | nubg an hour ago | parent | next [-] | | He met with moot ("he is sensitive, be gentile", search on jmail), and within a few days the /pol/ board got created, starting a culture war in the US, leading to Trump getting elected president. Absolutely nuts. | | |
| ▲ | mort96 5 minutes ago | parent | next [-] | | Just to substantiate this a bit: I remember a gleeful consensus in certain circles being that /pol/ and /r/the_donald had "memed Trump into the White House". It's much more complicated than that, but there's certainly an element of truth there. | |
| ▲ | acessoproibido an hour ago | parent | prev | next [-] | | I always wondered how much of a cultural etc influence 4Chan actually had (has?) - so much of the mindset and vernacular that was popular there 10+ years ago is now completely mainstream. | | |
| ▲ | jazzyjackson 25 minutes ago | parent [-] | | Ah, a rare opportunity to share a blog post that had a big effect on my political outlook back in 2016: "Meme Magic Is Real, You Guys". Who can say what effect it had on the world, but a presidential candidate reposting himself personified as Pepe the frog was still weird back then, and at least a nod to the trolls doing so much work on his behalf. https://medium.com/tryangle-magazine/meme-magic-is-real-you-... (dismissable login wall) |
| |
| ▲ | GaryBluto an hour ago | parent | prev | next [-] | | /pol/ in no way started the American culture war. It was brewing for a while. | | | |
| ▲ | kipchak 40 minutes ago | parent | prev | next [-] | | Which meeting are you seeing? That search doesn't seem to work for me, I'm only seeing the one Jan 2012. | |
| ▲ | dopa42365 20 minutes ago | parent | prev [-] | | Given the "nature" of 4chan (only a few hundred posts and a few thousand comments at a time, the vast majority of it shitposts and spam), it just can't do that. The imageboard format and limits basically prevent any scaling and mainstream success. If you follow any of the general threads in pol or sp for a while, you'll spot the same few people all the time, it's a tiny community of active users. | | |
| ▲ | thatguy0900 5 minutes ago | parent [-] | | I think the logic is that /pol/ didn't need to reach the masses; the masses only consume content, they don't create it. You only need to radicalize the few people who then go on to be the 1% of people commenting and posting. |
|
| |
| ▲ | acessoproibido an hour ago | parent | prev [-] | | That is a crazy amount of emails from/about moot... |
|
|
|
| ▲ | waynenilsen 2 hours ago | parent | prev | next [-] |
> Information leakage may also be occurring via PDF comments or orphaned objects inside compressed object streams, as I discovered above. Hopefully someone is independently archiving all documents; my understanding is that some are being removed. |
| |
| ▲ | agilob 32 minutes ago | parent | next [-] | | Reddit is also removing and shadowbanning such posts, but there's a community on https://lemmy.world/post/42440468 | |
| ▲ | some_random 2 hours ago | parent | prev | next [-] | | Are they being removed or replaced with more heavily redacted documents? There were definitely some victim names that slipped through the cracks that have since been redacted. | |
| ▲ | embedding-shape 2 hours ago | parent | prev | next [-] | | Initially under "Epstein Files Transparency Act (H.R.4405)" on https://www.justice.gov/epstein/doj-disclosures, all datasets had .zip links. I first saw that page when all but dataset 11 (or 10) had a .zip link. At one point this morning, all the .zip links were removed, now it seems like most are back again. | |
| ▲ | littlecorner 2 hours ago | parent | prev [-] | | I think some of the released documents included images of victims, which were redacted. So it's not necessarily malicious removals | | |
| ▲ | dylan604 an hour ago | parent [-] | | That's my understanding too, so archiving the unredacted images could mean holding CSAM. |
|
|
|
| ▲ | embedding-shape 2 hours ago | parent | prev | next [-] |
| Re the OCR, I'm currently running allenai/olmocr-2-7b against all the PDFs with text in them, comparing with the OCR DOJ provided, and a lot of it doesn't match, and surprisingly olmocr-2-7b is quite good at this. However, after extracting the pages from the PDFs, I'm currently sitting on ~500K images to OCR, so this is currently taking quite a while to run through. |
| |
| ▲ | originalvichy 2 hours ago | parent [-] | | Did you take any steps to decrease the dimension size of images, if this increases the performance? I have not tried this as I have not performed an OCR task like this with an LLM. I would be interested to know at what size the VLM cannot make out the details in text reliably. | | |
| ▲ | embedding-shape 2 hours ago | parent [-] | | The performance is OK, takes a couple of seconds at most on my GPU, just the amount of documents to get through that takes time, even with parallelism. The dimension seems fine as it is, as far as I can tell. |
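One way to compare two OCR outputs for the same page, as a rough sketch only: this is not the commenter's actual pipeline. It assumes both texts are already available as strings, uses the standard-library `difflib.SequenceMatcher` as the similarity measure, and the 0.9 threshold and function names are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def ocr_agreement(text_a: str, text_b: str) -> float:
    """Similarity ratio in [0, 1] between two OCR outputs for the same page."""
    matcher = SequenceMatcher(None, normalise(text_a), normalise(text_b), autojunk=False)
    return matcher.ratio()


def pages_to_review(pages: dict[int, tuple[str, str]], threshold: float = 0.9) -> list[int]:
    """Return page numbers whose two OCR texts agree less than `threshold`.

    `pages` maps a page number to a pair of OCR texts (hypothetical layout).
    """
    return sorted(
        page for page, (text_a, text_b) in pages.items()
        if ocr_agreement(text_a, text_b) < threshold
    )
```

`SequenceMatcher` is quadratic in the worst case, so for ~500K pages it would probably make sense to run this only on a sample or to parallelise it.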
|
|
|
| ▲ | yonatan8070 23 minutes ago | parent | prev | next [-] |
| A bit off-topic, but I find it kinda funny that the "Decline" button on the cookie popup on this page is labeled "Continue without consent". |
|
| ▲ | originalvichy 2 hours ago | parent | prev | next [-] |
| Any guesses why some of the newest files seem to have random ”=” characters in the text? My first thought was OCR, but it seemed to not be linked to characters like ”E” that could be mistakenly interpreted by an OCR tool. My second guess is just making it more difficult to produce reliable text searches, but probably 90% of HN readers could find a way to make a search tool that does not fall apart in case a ”=” character is found (although making this work for long search queries would make the search slower). |
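A minimal sketch of the kind of search tool described above, under the assumption that the stray characters are literal "=" signs, possibly followed by a line break, sitting between otherwise intact characters. The helper names are hypothetical. It builds one regex per query, so longer queries produce longer patterns, which is the slowdown mentioned in the comment.

```python
import re


def tolerant_pattern(query: str) -> re.Pattern:
    """Compile a regex matching `query` even when stray "=" characters
    (each optionally followed by a line break) sit between its characters."""
    gap = r"(?:=\r?\n?)*"
    return re.compile(gap.join(re.escape(ch) for ch in query), re.IGNORECASE)


def find_all(query: str, text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every tolerant match of `query` in `text`."""
    return [match.span() for match in tolerant_pattern(query).finditer(text)]
```

For example, `find_all("meeting", "the mee=ting was")` finds the word despite the inserted "=". A cheaper alternative for large corpora would be to strip the stray characters once at index time and search the cleaned text.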
| |
|
| ▲ | _def 2 hours ago | parent | prev | next [-] |
| I can't even download the archive, the transmission always terminates just before it's finished. Spooky. |
|
| ▲ | NoToP 25 minutes ago | parent | prev | next [-] |
| This is so incredibly useful to me right now for incidental reasons I am commenting to make sure I can get back to it. |
|
| ▲ | bugeats 2 hours ago | parent | prev | next [-] |
| Somebody ought to train an LLM exclusively on this text, just for funsies. |
| |
|
| ▲ | corygarms 2 hours ago | parent | prev | next [-] |
| These folks must really have their hands full with the 3M+ pages that were recently released. Hoping for an update once they expand this work to those new files. |
|
| ▲ | nkozyra 2 hours ago | parent | prev | next [-] |
> DoJ explicitly avoids JPEG images in the PDFs probably because they appreciate that JPEGs often contain identifiable information, such as EXIF, IPTC, or XMP metadata. Maybe I'm underestimating the issue, but isn't this a very lightweight problem to solve? Is converting the images to lower DPI formats/versions really any easier than just stripping the metadata? Surely the DOJ and similar justice agencies have been aware of and doing this for decades at this point, right? |
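To give a sense of how lightweight the metadata-stripping part can be, here is a rough standard-library sketch, not a description of what DoJ does. It assumes a well-formed baseline JPEG and drops only APP1 (EXIF/XMP), APP13 (IPTC/Photoshop) and COM segments; which segments to drop is an assumption, and the function name is made up. It does nothing about information carried in the pixel data itself.

```python
# Markers dropped in this sketch (an assumption, not an exhaustive list):
# 0xE1 = APP1 (EXIF, XMP), 0xED = APP13 (IPTC/Photoshop), 0xFE = COM (comment).
METADATA_MARKERS = {0xE1, 0xED, 0xFE}


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of a JPEG byte stream without the metadata segments above."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: everything from here on is image data, copy verbatim.
            out += data[pos:]
            return bytes(out)
        # Segment length is big-endian and includes its own two bytes.
        length = int.from_bytes(data[pos + 2:pos + 4], "big")
        if marker not in METADATA_MARKERS:
            out += data[pos:pos + 2 + length]
        pos += 2 + length
    out += data[pos:]  # trailing bytes, if any
    return bytes(out)
```

Usage would be along the lines of `clean = strip_jpeg_metadata(open("in.jpg", "rb").read())`. Re-rendering to a bitmap, as the article describes, removes the container entirely, which may be why it is preferred over segment-level filtering.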
| |
| ▲ | DharmaPolice 41 minutes ago | parent | next [-] | | This is speculation but generally rules like this follow some sort of incident. e.g. Someone responds to a FOI request and accidentally discloses more information than desired due to metadata. So a blanket rule is instituted not to use a particular format. | |
| ▲ | originalvichy 2 hours ago | parent | prev [-] | | Maybe they know more than we do. It may be possible to tamper with files at a deeper level. I wonder if it is also possible to use some sort of tampered compression algorithm that could mark images much like printers do with paper. Another guess is that perhaps the step is a part of a multi-step sanitization process, and the last step(s) perform the bitmap operation. | | |
| ▲ | normalaccess 2 hours ago | parent [-] | | I'm not sure about computer image generation but you can (relatively) easily fingerprint images generated by digital cameras due to sensor defects. I'll bet there is a similar problem with PC image generation where even without the EXIF data there is probably still too much side channel data leakage. |
|
|
|
| ▲ | mmooss 44 minutes ago | parent | prev | next [-] |
| What is the legal basis for releasing someone's private files and communications? If they can do it to Epstein, they can do it to you, to the Washington Post journalist, to former President Clinton, etc. Is the scope at least limited somehow? Generally I favor transparency, but of course probably the most important parts are withheld. |
| |
| ▲ | toast0 26 minutes ago | parent | next [-] | | > What is the legal basis for releasing someone's private files and communications? An act of congress, for one. Also, AFAIK, federal privacy generally ends at death, as does criminal liability; so releasing government files from a federal investigation after death of the subject is generally within the realm of acceptable conduct. | |
| ▲ | pyvpx 32 minutes ago | parent | prev | next [-] | | I believe a literal Act of Congress… | |
| ▲ | dwater 33 minutes ago | parent | prev | next [-] | | It was passed into law by congress and signed by the president: https://en.wikipedia.org/wiki/Epstein_Files_Transparency_Act | |
| ▲ | pstuart 35 minutes ago | parent | prev [-] | | I'd assume it was the nature of the case, and that discovery was done with him being dead. |
|
|
| ▲ | tibbon 3 hours ago | parent | prev | next [-] |
| That's a lot of PeDoFiles! (But seriously, great work here!) |
| |
|
| ▲ | meidan_y 3 hours ago | parent | prev [-] |
| (2025) just follow hn guideline, impressive voter ring though |
| |
| ▲ | alain94040 3 hours ago | parent [-] | | We're in early February 2025 [edit:2026] and the article was written on Dec 23, 2025, which makes it less than two months old. I think it's ok not to include a year in the submission title in that case. I personally understand a year in the submission as a warning that the article may not be up to date. | | |
| ▲ | petepete 3 hours ago | parent | next [-] | | We're in Feb 2026. I'm not used to typing it yet, either. | |
| ▲ | embedding-shape 2 hours ago | parent | prev | next [-] | | Less about the age, and more about confusing what they are analyzing, for the files that were just released like a week ago. | |
| ▲ | michaelmcdonald 3 hours ago | parent | prev | next [-] | | "We're in early February ~2025~ *2026*" | |
| ▲ | GlitchRider47 3 hours ago | parent | prev [-] | | Generally, I'd agree with you. However, the recent Epstein file dump was in 2026, not 2025, so I would say it is relevant in this case.. |
|
|