| ▲ | lulzx 12 hours ago |
| fixed. |
|
| ▲ | forgotpwd16 12 hours ago | parent | next [-] |
| Yeah, sorry for the confusion. When I said Unicode, I meant foreign text rather than (just) the unescaped symbols, e.g. Greek. For one random Greek textbook[0], zpdf output is (extract | head -15): 01F9020101FC020401F9020301FB02070205020800030209020701FF01F90203020901F9012D020A0201020101FF01FB01FE0208
0200012E0219021802160218013202120222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C
020301FF02000205020101FC020901F90003020001F9020701F9020E020802000205020A
01FC028C0213021B022002230221021800030200012E021902180216021201320221021A012E00030209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C
0200020D02030208020901F90203020901FF0203020502080003012B020001F9012B020001F901FA0205020A01FD01FE0208
020201300132012E012F021A012F0210021B013202200221012E0222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C
This is the case for the entire book. Mutool extracts the text just fine. [0]: https://repository.kallipos.gr/handle/11419/15087 |
| |
| ▲ | lulzx 4 hours ago | parent | next [-] | | works now! ΑΛΕΞΑΝΔΡΟΣ ΤΡΙΑΝΤΑΦΥΛΛΙΔΗΣ
Καθηγητής Τμήματος Βιολογίας, ΑΠΘ ΝΙΚΟΛΕΤΑ ΚΑΡΑΪΣΚΟΥ
Επίκουρη Καθηγήτρια Τμήματος Βιολογίας, ΑΠΘ
ΚΩΝΣΤΑΝΤΙΝΟΣ ΓΚΑΓΚΑΒΟΥΖΗΣ
Μεταδιδάκτορας Τμήματος Βιολογίας, ΑΠΘ
Γονιδιώματα
Δομή, Λειτουργία και Εφαρμογές
| | |
| ▲ | forgotpwd16 3 hours ago | parent [-] | | Nice! Speed wasn't even compromised: still 5x when benchmarking. I also just noticed there's a page with the tool compiled to wasm. Cool. | | |
| |
| ▲ | lulzx 10 hours ago | parent | prev [-] | | Sorry, I haven't yet figured out non-Latin text with ToUnicode references. |
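For readers following along, here is a rough Python sketch of what resolving such codes through a ToUnicode CMap could look like. It is not zpdf's implementation. The `bfchar` pairs are an assumption: they were inferred by lining up the first hex line quoted in the thread with the first line of the later fixed output, not read from the PDF. The fixed 2-byte code width is also assumed, and `beginbfrange` sections are ignored for brevity. The names `parse_bfchar` and `decode_codes` are made up for this illustration.

```python
import re

# ASSUMPTION: these code -> Unicode pairs were inferred by aligning the hex dump
# quoted in the thread with the corrected output; the real CMap was not inspected.
TOUNICODE_FRAGMENT = """
15 beginbfchar
<0003> <0020>
<012D> <03A6>
<01F9> <0391>
<01FB> <0394>
<01FC> <0395>
<01FE> <0397>
<01FF> <0399>
<0201> <039B>
<0203> <039D>
<0204> <039E>
<0205> <039F>
<0207> <03A1>
<0208> <03A3>
<0209> <03A4>
<020A> <03A5>
endbfchar
"""

# First line of the raw output quoted in the thread.
GLYPH_DUMP = (
    "01F9020101FC020401F9020301FB02070205020800030209020701FF"
    "01F90203020901F9012D020A0201020101FF01FB01FE0208"
)


def parse_bfchar(cmap_text):
    """Collect <code> <unicode> pairs from beginbfchar/endbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for code, uni in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination string is UTF-16BE and may span several code units.
            mapping[int(code, 16)] = bytes.fromhex(uni).decode("utf-16-be")
    return mapping


def decode_codes(hex_string, mapping, code_bytes=2):
    """Split a hex dump into fixed-width codes and map each one to text."""
    step = code_bytes * 2  # two hex digits per byte
    codes = [int(hex_string[i:i + step], 16) for i in range(0, len(hex_string), step)]
    # Unmapped codes become U+FFFD instead of raising.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    print(decode_codes(GLYPH_DUMP, parse_bfchar(TOUNICODE_FRAGMENT)))
    # prints: ΑΛΕΞΑΝΔΡΟΣ ΤΡΙΑΝΤΑΦΥΛΛΙΔΗΣ
```

With this assumed table the first dumped line decodes to the first line of the fixed output, which suggests the raw hex was a run of 2-byte character codes that had not been passed through the font's ToUnicode map.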
|
|
| ▲ | TZubiri 12 hours ago | parent | prev [-] |
| Lol, but there are 100 competitors in the PDF text extraction space, and some are multi-million-dollar businesses: AWS Textract, ABBYY FineReader, PDFBox. I think you may be underestimating the challenge here. |