brettnbutter 2 hours ago
Nice collection you have there. I just asked the Sleuth for some examples of that, and here's one to add to your Unional National one:

https://www.finhist.com/bank-runs/episodes/19827.html
https://snewpapers.com/components/0b22f0ca-60d2-4d63-be99-74...

Yes, I agree the layouts are the trickiest part. I tried a few approaches and ended up using some of the PaddlePaddle models for document layout analysis, orientation, and such. They give bounding boxes and a predicted reading order, but the reading order isn't great on complex layouts even with the most recent SOTA models, or even on simple layouts when there are mastheads, images, or other artifacts to work around. It's still valuable information, though: combined with heuristics, it works as the starting point of a pipeline that stitches together a more accurate reading order.
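To make the "boxes plus heuristics" idea concrete, here is a minimal sketch of one such heuristic. It is not the PaddlePaddle API and not the actual pipeline discussed above: the `Box` type, the `reading_order` function, and the `wide_frac` threshold are all hypothetical names chosen for illustration. It assumes a layout model has already produced bounding boxes, treats anything wider than a fraction of the page (a masthead or banner, say) as a separator between horizontal bands, and reads each band column by column.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Hypothetical container for one detected layout region (pixel coordinates).
    x0: float
    y0: float
    x1: float
    y1: float
    label: str = "text"


def _order_band(band):
    """Group boxes into columns by horizontal overlap, then read
    columns left to right and each column top to bottom."""
    columns = []
    for b in sorted(band, key=lambda b: b.x0):
        for col in columns:
            cx0 = min(c.x0 for c in col)
            cx1 = max(c.x1 for c in col)
            overlap = min(b.x1, cx1) - max(b.x0, cx0)
            # Same column if the overlap covers more than half of the narrower extent.
            if overlap > 0.5 * min(b.x1 - b.x0, cx1 - cx0):
                col.append(b)
                break
        else:
            columns.append([b])
    columns.sort(key=lambda col: min(c.x0 for c in col))
    return [b for col in columns for b in sorted(col, key=lambda c: c.y0)]


def reading_order(boxes, page_width, wide_frac=0.6):
    """Wide boxes (width >= wide_frac * page_width) split the page into bands;
    everything above a wide box is read before it, band by band."""
    def is_wide(b):
        return (b.x1 - b.x0) >= wide_frac * page_width

    wide = sorted((b for b in boxes if is_wide(b)), key=lambda b: b.y0)
    remaining = [b for b in boxes if not is_wide(b)]
    ordered = []
    for w in wide:
        above = [b for b in remaining if b.y0 < w.y0]
        remaining = [b for b in remaining if b.y0 >= w.y0]
        ordered.extend(_order_band(above))
        ordered.append(w)
    ordered.extend(_order_band(remaining))
    return ordered


# Toy page: a masthead over two columns, given in scrambled order.
page = [
    Box(520, 510, 1000, 800, "d"),
    Box(0, 110, 480, 400, "a"),
    Box(0, 0, 1000, 100, "masthead"),
    Box(520, 110, 1000, 500, "c"),
    Box(0, 410, 480, 800, "b"),
]
print([b.label for b in reading_order(page, page_width=1000)])
```

On real newspaper scans a rule like this would only be a first pass. A model's predicted reading order could serve as a tie-breaker inside each column, and images or ads spanning two of several columns would need their own handling.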
zzleeper an hour ago
Great! I was thinking about PP, but because I ran an order of magnitude fewer articles (under 1 million pages, piggybacking on Dell's OCR), I relied on Arcanum (https://www.arcanum.com/en/newspaper-segmentation/about/), which was cheap enough for me (though I think not cheap enough at your scale). Cheers!