tmpfs 4 months ago
Interesting, as I was researching this recently and was certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being Javascript didn't suit my project. In the end I found that the Python trifatura library extracted the best-quality content, with accurate metadata. You might want to compare your implementation to trifatura to see if there is room for improvement.
acrophobic 4 months ago
> ...it being Javascript didn't suit my project.

If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.
fabmilo 4 months ago
Reference to the library: https://trafilatura.readthedocs.io/en/latest/

For the curious: Trafilatura means "extrusion" in Italian. This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce. Search "maccheroni trafilati vs maccheroni lisci" :)

(btw I think you meant trafilatura not trifatura)
winddude 4 months ago
It's a bit old, but I benchmarked a number of the web extraction tools years ago (https://github.com/Nootka-io/wee-benchmarking-tool); resiliparse-plain was my clear winner at the time.