tmpfs 4 months ago

Interesting, as I was researching this recently and was certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being Javascript didn't suit my project.

In the end I found that the Python trifatura library extracted the best-quality content with accurate metadata.

You might want to compare your implementation to trifatura to see if there is room for improvement.

acrophobic 4 months ago | parent | next [-]

> ...it being Javascript didn't suit my project.

If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.

[0]: https://github.com/go-shiori/go-readability

[1]: https://github.com/markusmobius/go-trafilatura

derekperkins 4 months ago | parent | next [-]

We've been active users of go-trafilatura and love it

breadchris 4 months ago | parent | prev [-]

this is what i came here to see, thanks!

fabmilo 4 months ago | parent | prev | next [-]

reference to the library: https://trafilatura.readthedocs.io/en/latest/

for the curious: Trafilatura means "extrusion" in Italian.

> This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce.

Search for maccheroni trafilati vs maccheroni lisci :)

(btw I think you meant trafilatura not trifatura)

thm 4 months ago | parent [-]

Been using it since day one but development has stalled quite a bit since 2.0.0.

winddude 4 months ago | parent | prev [-]

It's a bit old, but I benchmarked a number of web extraction tools years ago (https://github.com/Nootka-io/wee-benchmarking-tool); resiliparse-plain was my clear winner at the time.