Helmut10001 4 days ago

The biggest migration challenge isn't finding one-to-one replacements for software, but rebuilding tested workflows and processes.

For years, I've had a seamless document management process on Windows for all my receipts and bills:

    1. My ScanSnap scans, auto-crops, and OCRs documents into a designated folder.
    2. A small open-source tool, DropIt [1], monitors that folder.
    3. Based on about 100 custom rules that parse the OCR'd text (for tax IDs, phone numbers, etc.), DropIt automatically renames and moves the PDFs into the correct subfolders.
    4. Nextcloud then syncs the organized files, and I can discard the paper originals.

This "fire-and-forget" system has been incredibly reliable.
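
To make step 3 concrete, here is a minimal Python sketch of what rule-based sorting on OCR'd text could look like. It is not how DropIt works internally. The `Rule` class, the example patterns (a made-up tax ID and phone number), the folder names, and the date-prefixed file name are all hypothetical, and extracting the text from the PDF is assumed to happen elsewhere.

```python
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One sorting rule: a pattern to find in the OCR text and where to file a match."""
    pattern: re.Pattern
    subfolder: str
    prefix: str


# Hypothetical rules; a real setup would have one entry per sender or document type.
RULES = [
    Rule(re.compile(r"DE123456789"), "utilities/electricity", "electricity"),
    Rule(re.compile(r"\+49\s?30\s?5550100"), "insurance/health", "health-insurance"),
]


def classify(text: str, rules=RULES) -> Optional[Rule]:
    """Return the first rule whose pattern occurs in the text, or None if nothing matches."""
    for rule in rules:
        if rule.pattern.search(text):
            return rule
    return None


def target_path(root: Path, rule: Rule, date: str) -> Path:
    """Build the destination path for a matched document (naming scheme is an assumption)."""
    return root / rule.subfolder / f"{date}_{rule.prefix}.pdf"
```

Documents for which `classify` returns `None` would stay in the inbox for manual review rather than being moved on a guess.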

When I explored replicating this on Linux, I found that the building blocks exist. For instance, ocrmypdf [2] seems to be a powerful OCR tool, and SANE drivers combined with gscan2pdf can handle the scanning. I also found several tools for automated file renaming and organization. [3] However, the Fujitsu ScanSnap Home software provides an all-in-one experience for the initial capture. [4] More importantly, I'd have to manually translate all my pattern-matching rules from DropIt to a new system, likely a collection of shell scripts. That still feels too fragile to me. I would need to handle every exception myself: file renaming failures, special characters, overly long document names, OCR errors, and alerting when anything goes wrong. The system needs to be fail-safe, because once I throw the original away, there is no going back.
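
As a rough sketch of the exception handling listed above (special characters, name length, renaming failures), the following Python uses only the standard library. The function names, the 100-character limit, and the idea of a `failed` quarantine folder are assumptions for illustration, not features of any existing tool. It also assumes a single process is filing documents, since the uniqueness check is not safe against concurrent writers.

```python
import re
import shutil
import unicodedata
from pathlib import Path

MAX_STEM = 100  # assumed limit, well below common filesystem maximums


def safe_stem(raw: str, max_len: int = MAX_STEM) -> str:
    """Reduce arbitrary OCR-derived text to a conservative ASCII file name stem."""
    text = unicodedata.normalize("NFKD", raw)
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents and non-ASCII
    text = re.sub(r"[^A-Za-z0-9._-]+", "_", text).strip("._-")
    return text[:max_len] or "unnamed"


def unique_path(folder: Path, stem: str, suffix: str = ".pdf") -> Path:
    """Pick a path in folder that does not exist yet, so nothing is overwritten."""
    candidate = folder / f"{stem}{suffix}"
    counter = 1
    while candidate.exists():
        candidate = folder / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def file_document(src: Path, folder: Path, raw_name: str, failed: Path) -> Path:
    """Move src into folder under a safe, unique name; on error, quarantine it in failed."""
    try:
        folder.mkdir(parents=True, exist_ok=True)
        dest = unique_path(folder, safe_stem(raw_name))
        shutil.move(str(src), str(dest))
        return dest
    except OSError:
        # If quarantining fails too, the error propagates so it can be noticed.
        failed.mkdir(parents=True, exist_ok=True)
        return Path(shutil.move(str(src), str(failed / src.name)))
```

Alerting is left out here; a non-empty `failed` folder would be the signal to check before discarding any paper original.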

Another challenge is finding the time to replace this reliable system with as little "downtime" as possible. I rely on it daily, so I have already decided that I need a migration phase in which both systems run in parallel. Perhaps this better explains why I have been slow to migrate to Linux.

The fact that there isn't a well-known, integrated tool for this on Linux seems suspicious. It makes me wonder if I'm approaching the problem from the wrong direction. Is there a more "Linux-native" philosophy for this kind of workflow automation that I'm missing?

And yes, I'm aware of Paperless-ngx. It's a fantastic project, but I'm committed to my current folder structure and prefer to avoid a solution that centralizes my documents in a database, away from my Nextcloud setup and my filesystem-first philosophy for document management. I don't trust that Paperless-ngx will still be available 40+ years from now, but I need my document management to last that long.

[1]: http://www.dropitproject.com/

[2]: https://github.com/ocrmypdf/OCRmyPDF

[3]: https://github.com/ptmrio/autorename-pdf

[4]: https://forum.manjaro.org/t/fujitsu-scansnap-home-software-f...