zzleeper 2 hours ago
Looks cool, congrats! I've also worked with this data, but only for research purposes: https://www.finhist.com/bank-runs/episodes/13895.html https://www.finhist.com/bank-runs/index.html

Surprisingly, layout turned out to be the trickiest part, as newspaper articles often have multiple layers of headers, span multiple columns, etc. Do you have a preferred solution for that?
brettnbutter 2 hours ago
Nice collection you have there. I just asked the Sleuth for some examples of that, and here's one to add to your Union National one: https://www.finhist.com/bank-runs/episodes/19827.html https://snewpapers.com/components/0b22f0ca-60d2-4d63-be99-74...

Yes, I agree the layouts are the trickiest part. I tried a few options and ended up using some of the PaddlePaddle models for document layout analysis, orientation, and so on. They give bounding boxes and a predicted reading order, but the reading order isn't great even with the most recent SOTA models on complex layouts, or even on simple layouts when there are mastheads, images, or other artifacts to work around. It's still valuable information, though: combined with heuristics, it can be stitched into a more accurate reading order as the starting point of a pipeline.
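To make the "bounding boxes plus heuristics" idea concrete, here is a minimal sketch of one such heuristic. It is not the actual pipeline described above and does not call any PaddlePaddle API: the `Box` type, the `reading_order` and `order_band` helpers, and the `span_frac` and `gap` thresholds are all hypothetical names chosen for illustration. It assumes the layout model has already produced axis-aligned boxes, treats very wide boxes (mastheads, spanning headlines) as separators that cut the page into horizontal bands, and reads each band column by column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical layout block: a label plus pixel coordinates."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def order_band(boxes, gap):
    """Order one band: columns left to right, each column top to bottom."""
    columns = []
    for box in sorted(boxes, key=lambda b: b.x0):
        # Start a new column when the left edge jumps by more than `gap`.
        if columns and box.x0 - columns[-1][0].x0 <= gap:
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


def reading_order(boxes, page_width, span_frac=0.6, gap=50.0):
    """Heuristic reading order using wide boxes as band separators."""
    def is_wide(b):
        return (b.x1 - b.x0) >= span_frac * page_width

    spanning = sorted((b for b in boxes if is_wide(b)), key=lambda b: b.y0)
    remaining = [b for b in boxes if not is_wide(b)]
    ordered = []
    for sep in spanning:
        # Everything above this separator is read before it.
        above = [b for b in remaining if b.y0 < sep.y0]
        remaining = [b for b in remaining if b.y0 >= sep.y0]
        ordered.extend(order_band(above, gap))
        ordered.append(sep)
    ordered.extend(order_band(remaining, gap))
    return ordered


# Toy page: masthead, two columns, a spanning headline, two more columns.
page = [
    Box("e", 350, 800, 650, 1000),
    Box("c", 350, 120, 650, 700),
    Box("headline", 0, 720, 1000, 780),
    Box("b", 0, 420, 300, 700),
    Box("masthead", 0, 0, 1000, 100),
    Box("d", 0, 800, 300, 1000),
    Box("a", 0, 120, 300, 400),
]
order = [b.label for b in reading_order(page, page_width=1000)]
```

On real newspaper scans the thresholds would need tuning per page, and a model's predicted reading order could plausibly serve as a tie-breaker inside a column rather than being discarded.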