forgotpwd16 12 hours ago:

74910,74912c187768,187779
< [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
< corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954 \251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
< std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
---
>
> [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
> corresponding to a wide string, but you don’t want to alter the locale for cout, you can write something like:
>
> § D.27.2
> 1954
>
> © ISO/IEC
> N4950
>
> wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
> std::string mbstring = myconv.to_bytes(L"Hello\n");
It is indeed faster, but the output is messier. And it doesn't handle Unicode, in contrast to mutool, which does. (That probably also explains the big speed boost.)
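
For reference, here is the snippet both extractors are rendering, completed so it compiles. A minimal sketch only: note that std::wstring_convert and std::codecvt_utf8 were deprecated in C++17, so expect deprecation warnings.

    // The wstring_convert example quoted above, with the includes it
    // needs. Converts a wide string to UTF-8 bytes without touching
    // cout's locale.
    #include <codecvt>   // std::codecvt_utf8 (deprecated in C++17)
    #include <iostream>
    #include <locale>    // std::wstring_convert (deprecated in C++17)
    #include <string>

    int main() {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
        std::string mbstring = myconv.to_bytes(L"Hello\n");
        std::cout << mbstring;  // writes the UTF-8 bytes as-is
    }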
|
| ▲ | TZubiri 12 hours ago | parent | next [-] |
| In my experience with parsing PDFs, speed has never been an issue, it has always been a matter of quality. |

DetroitThrow 11 hours ago:
I tried a small PDF and got a memory error. It's definitely much faster than MuPDF on that file.

littlestymaar 3 hours ago:
“The fastest PDF extractor is the one that crashes at the beginning of the file” or something.

lulzx 12 hours ago:
fixed.

forgotpwd16 12 hours ago:
Yeah, sorry for the confusion. When I said Unicode, I meant foreign text rather than (just) the unescaped symbols, e.g. Greek. On one random Greek textbook[0], zpdf's output is (extract | head -15):

    01F9020101FC020401F9020301FB02070205020800030209020701FF01F90203020901F9012D020A0201020101FF01FB01FE0208
    0200012E0219021802160218013202120222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C
    020301FF02000205020101FC020901F90003020001F9020701F9020E020802000205020A
    01FC028C0213021B022002230221021800030200012E021902180216021201320221021A012E00030209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C
    0200020D02030208020901F90203020901FF0203020502080003012B020001F9012B020001F901FA0205020A01FD01FE0208
    020201300132012E012F021A012F0210021B013202200221012E0222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C

It's like this for the entire book. Mutool extracts the text just fine.

[0]: https://repository.kallipos.gr/handle/11419/15087
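
The values in that dump are raw two-byte glyph codes taken straight from the PDF's text operators; mapping them to readable text requires the font's ToUnicode CMap. A minimal sketch of the lookup step, with a hypothetical hand-filled table standing in for a parsed CMap (decode and the sample entries are illustrative, not zpdf's actual code or this book's actual mappings):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // Map raw glyph codes to Unicode code points via a pre-parsed
    // ToUnicode table; codes with no entry become U+FFFD.
    std::u32string decode(const std::vector<std::uint16_t>& codes,
                          const std::map<std::uint16_t, char32_t>& to_unicode) {
        std::u32string out;
        for (std::uint16_t code : codes) {
            auto it = to_unicode.find(code);
            out.push_back(it != to_unicode.end() ? it->second : U'\uFFFD');
        }
        return out;
    }

    int main() {
        // Hypothetical bfchar-style entries; a real CMap lists them as
        // e.g. "<01F9> <0391>" inside beginbfchar ... endbfchar.
        std::map<std::uint16_t, char32_t> to_unicode = {
            {0x01F9, U'\u0391'},  // illustrative mapping to Α
            {0x0201, U'\u039B'},  // illustrative mapping to Λ
        };
        decode({0x01F9, 0x0201}, to_unicode);  // would yield U"ΑΛ"
    }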

lulzx 4 hours ago:
works now!

    ΑΛΕΞΑΝΔΡΟΣ ΤΡΙΑΝΤΑΦΥΛΛΙΔΗΣ
    Καθηγητής Τμήματος Βιολογίας, ΑΠΘ
    ΝΙΚΟΛΕΤΑ ΚΑΡΑΪΣΚΟΥ
    Επίκουρη Καθηγήτρια Τμήματος Βιολογίας, ΑΠΘ
    ΚΩΝΣΤΑΝΤΙΝΟΣ ΓΚΑΓΚΑΒΟΥΖΗΣ
    Μεταδιδάκτορας Τμήματος Βιολογίας, ΑΠΘ
    Γονιδιώματα
    Δομή, Λειτουργία και Εφαρμογές

forgotpwd16 3 hours ago:
Nice! Speed wasn't even compromised; still 5x when benchmarking. Also just saw there's a page with the tool compiled to WASM. Cool.

lulzx 10 hours ago:
sorry, I haven't yet figured out non-Latin text with ToUnicode references.
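
For what it's worth, the usually tricky part of ToUnicode handling is the bfrange form, where one entry maps a whole run of codes to consecutive code points. A sketch of expanding such an entry into the lookup table from the example above; expand_bfrange and its numbers are hypothetical, not zpdf's code:

    #include <cstdint>
    #include <map>

    // Expand one bfrange entry of the hex-string form, e.g.
    // "<0201> <0219> <0391>": source codes lo..hi map to consecutive
    // code points starting at `start`.
    void expand_bfrange(std::map<std::uint16_t, char32_t>& table,
                        std::uint16_t lo, std::uint16_t hi, char32_t start) {
        for (std::uint32_t code = lo; code <= hi; ++code)
            table[static_cast<std::uint16_t>(code)] =
                start + static_cast<char32_t>(code - lo);
    }

    int main() {
        std::map<std::uint16_t, char32_t> table;
        expand_bfrange(table, 0x0201, 0x0219, U'\u0391');  // hypothetical entry
    }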

TZubiri 12 hours ago:
Lol, but there are 100 competitors in the PDF text extraction space, and some are multi-million-dollar businesses: AWS Textract, ABBYY FineReader, PDFBox. I think you may be underestimating the challenge here.