▲ | jeroenhd 2 days ago | |
XSLT is a terrible tool for that job. RDF combined with something like SPARQL is much closer to that, and makes for one of the greatest knowledge processing tools nobody ever uses. XSLT is designed to work on XML, while HTML documents are almost always SGML-based. The semantics don't work the same, and applying XML engines to HTML often breaks things in weird and unexpected ways. Basic HTML parsing rules like "a <head> tag doesn't need to be closed and can simply be auto-closed by a <body>" will seriously confuse XML engines. To effectively use XSLT to extract information from the web, you'd first need to turn HTML into XML. | ||
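To make the auto-closing point concrete, here is a minimal sketch using only Python's standard library. The sample document and the helper names (`parses_as_xml`, `html_start_tags`) are made up for illustration. Note that `html.parser` is only a lenient tokenizer, not a full HTML tree builder, so this shows the difference in strictness and nothing more.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: acceptable HTML, with </head>, </p>, </body> and </html> omitted.
HTML_DOC = "<html><head><title>Demo</title><body><p>Hello<p>World"


def parses_as_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(text: str) -> list:
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(parses_as_xml(HTML_DOC))    # the XML parser rejects the unclosed tags
    print(html_start_tags(HTML_DOC))  # the HTML tokenizer reads it without complaint
```

The same markup with every tag explicitly closed is accepted by the XML parser, which is the "turn HTML into XML" step in its simplest form.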
▲ | oefrha 2 days ago | parent | next [-] | |
Hey, it works great on the dozens of XHTML websites lying around. Dozens! | ||
▲ | int_19h 2 days ago | parent | prev | next [-] | |
XSLT is designed to work on the XML Infoset, which is basically just an abstract tree of elements with attributes. That is why XSLT has, e.g., an HTML output method, even though you use XML snippets to generate it. If you already have logic to parse HTML into a tree, it's trivial to run XSLT on it. Indeed, the most recent version of XSLT (3.0) uses the same trick to process JSON. | ||
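A rough sketch of the "parse HTML into a tree first" idea, using only Python's standard library. The class name `TreeFromHTML`, the sample input, and the two hard-coded auto-close rules are assumptions for illustration; a real HTML parser implements far more rules. The standard library has no XSLT engine, so ElementTree's limited XPath support stands in for the transformation step here. With a third-party engine such as lxml's `etree.XSLT`, a stylesheet would be applied to a tree built in the same spirit.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Simplified set of elements that never take an end tag (assumption: not exhaustive).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TreeFromHTML(HTMLParser):
    """Build an ElementTree from tag-soup HTML with two toy auto-close rules."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        # Toy rule 1: a new <p> closes a <p> that is still open.
        if tag == "p" and self.open_tags and self.open_tags[-1] == "p":
            self._close("p")
        # Toy rule 2: <body> closes an unclosed <head>.
        if tag == "body" and "head" in self.open_tags:
            self._close("head")
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self._close(tag)

    def handle_data(self, data):
        if self.open_tags:  # ignore text outside the root element
            self.builder.data(data)

    def _close(self, tag):
        # Close open elements until the named one has been closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def finish(self):
        self.close()  # flush any buffered text
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_tree(text: str) -> ET.Element:
    parser = TreeFromHTML()
    parser.feed(text)
    return parser.finish()


if __name__ == "__main__":
    root = html_to_tree("<html><head><title>Demo</title><body><p>Hello<p>World")
    print(root.findtext("head/title"))
    print([p.text for p in root.findall(".//p")])
```

Once the tree exists, XML tooling no longer needs to know that the input was HTML with omitted end tags.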
▲ | aragilar 2 days ago | parent | prev [-] | |
I think it's the other way round: it's XML -> HTML, not HTML -> XML. | ||