Macha 9 hours ago
> The key insight is that bank statement PDFs are almost always columnar. Of course, this relies on the PDF having a proper text layer; if your bank sends you scanned images, you’re out of luck (though I’ve yet to encounter one that does). When you convert them to text while preserving the layout, you get something that looks like this:

So I decided to try this out with my bank, whose export options are XLSX (in one of the slightly silly multi-line formats mentioned) or PDF only. It appears they've applied some "encryption" to the PDF to prevent this: really a simple substitution cipher plus an embedded font with the characters jumbled up so that it still renders correctly. All the marketing text and headers come through fine in the pdftotext output, but the actual data is all accented and non-printable characters (the same happens if you copy/paste it out). The substitution cipher does seem stable across a few statements, but it still seems like less work to work from the XLSX.
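For illustration, a stable substitution like this could in principle be undone with a lookup table. The sketch below assumes you can pair one garbled string with its true text (for example, by reading the rendered page); the function names and the sample glyphs are hypothetical and not taken from any real statement.

```python
def build_mapping(garbled: str, known: str) -> dict:
    """Learn a garbled-to-real character table from one aligned sample pair."""
    if len(garbled) != len(known):
        raise ValueError("samples must be aligned character by character")
    mapping = {}
    for g, k in zip(garbled, known):
        # A substitution cipher maps each garbled character to exactly one
        # real character, so a conflicting pair means the samples are misaligned.
        if mapping.setdefault(ord(g), k) != k:
            raise ValueError(f"inconsistent mapping for {g!r}")
    return mapping


def decode(text: str, mapping: dict) -> str:
    """Apply the table; characters missing from the table pass through unchanged."""
    return text.translate(mapping)


# Hypothetical sample: the garbled glyphs stand in for the digits 1, 2, 3.
table = build_mapping("\u00e9\u00e0\u00fc\u00e9", "1231")
print(decode("\u00fc\u00e0\u00e9", table))
```

This only helps while the substitution stays the same between statements; if the bank rotates it, the table has to be rebuilt each time.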
netsharc 8 hours ago
I remember seeing an online shop that used the same font-substitution trick to prevent web scraping of their prices. I think they even changed the substitution between elements, so one couldn't do a single-pass replacement and get the original data back. I guess nowadays it's very cheap to run a headless browser, screenshot the output, and run it through OCR. Hah, to prevent that they'd have to design their webpage as one full-screen CAPTCHA.
abdullahkhalids 7 hours ago
My bank outputs different data in the description field for CSV and PDF. The PDF statement descriptions are longer and contain more information.
lalitmaganti 8 hours ago
Interesting! You might want to try Tabula in that case. It does well on the "obfuscated" PDFs of that type I've come across; it's just a lot slower to run than pdftotext.
inetknght 8 hours ago
That's a ridiculously dumb idea on the bank's part. Print the PDF to an image, run OCR on it, and import that output instead.
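As a rough sketch, the render-then-OCR route could look like the following. It assumes Poppler's `pdftoppm` and the `tesseract` CLI are installed and on the PATH; the function names, the 300 DPI setting, and the `--psm 6` page segmentation mode are illustrative choices, not something from the thread.

```python
import subprocess
from pathlib import Path


def render_cmd(pdf: str, prefix: str, dpi: int = 300) -> list:
    """Build the pdftoppm command that rasterises every page to a PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, prefix]


def ocr_cmd(image: str) -> list:
    """Build the tesseract command; the 'stdout' output base prints the text."""
    # --psm 6 treats the page as a single uniform block of text (an assumption
    # that may or may not suit a given statement layout).
    return ["tesseract", image, "stdout", "--psm", "6"]


def pdf_to_text(pdf: str, workdir: str) -> str:
    """Render each page to an image, OCR it, and join the page texts."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_cmd(pdf, str(out / "page")), check=True)
    texts = []
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    for image in sorted(out.glob("page-*.png")):
        result = subprocess.run(
            ocr_cmd(str(image)), check=True, capture_output=True, text=True
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

OCR output will likely need more cleanup than a real text layer (misread digits, broken column alignment), so it is worth spot-checking totals against the statement.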