tannhaeuser 3 days ago
Worth keeping in mind that the native MSO formats used "structured storage", a horrible binary chunked serialization and metadata format from an era when binary embedding of document streams in other applications' documents via "Object Linking and Embedding" (OLE; see also Apple's OpenDoc) was deemed desirable, with zero consideration given to third-party apps and segment formats tied to C++ data structures. Compared to that, OOXML is still huge progress, and while it's complex I wouldn't say it's maliciously so. The Shakespeare example is a good one: the sentence is split into multiple spans to apply style rules, yet the bare text content can still be extracted by just removing all XML tags. The ODF variant, by contrast, is actually less recommendable, as it relies on an unnecessarily complex formatting and text addressing language on top of XML. The article says

> Even at a glance [ODF's markup] is more intelligible. Strip the text: namespaces and it’s nearly valid HTML. The only thing that needs explaining is that ODF doesn’t wrap To be with a dedicated “bold” tag. Instead, it applies an auto-style named T1 to a <text:span>, an act of separating content and presentation that mirrors established web practices.

but this definitely makes things more complex for data exchange compared to OOXML.
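To make the "just remove all XML tags" point concrete, here is a minimal sketch in Python. The fragment in `SAMPLE` is a hand-written WordprocessingML paragraph assumed for illustration, not the article's actual markup, and `strip_tags` / `text_runs` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Assumed sample: one paragraph split into two runs, the first one bold.
SAMPLE = (
    f'<w:p xmlns:w="{W_NS}">'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>To be</w:t></w:r>'
    '<w:r><w:t xml:space="preserve">, or not to be</w:t></w:r>'
    '</w:p>'
)

def strip_tags(xml: str) -> str:
    """Naive extraction: delete everything that looks like a tag.

    Does not decode entities such as &amp; and ignores comments/CDATA.
    """
    return re.sub(r"<[^>]+>", "", xml)

def text_runs(xml: str) -> str:
    """Parser-based extraction: concatenate the text of every w:t element."""
    root = ET.fromstring(xml)
    return "".join(t.text or "" for t in root.iter(f"{{{W_NS}}}t"))

print(strip_tags(SAMPLE))
print(text_runs(SAMPLE))
```

Both functions recover the same sentence here because the text sits inline in the runs. The regex version is only safe on fragments like this one; a real document would want the parser-based variant.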
quotemstr 3 days ago
Can you explain what's wrong with the concept of a container format that allows embedding subdocuments of different types?

> zero consideration given to third-party apps and segment formats

The reality is the opposite. COM serialization was specifically built to allow for composing components (and serializations thereof) that didn't know about each other into a single document. That's why it leans so heavily on GUIDs for names: they avoid collisions without needing coordination. That's a laudable goal, not pointless bloat. And the COM people implemented it pretty efficiently too!

> C++ data structures

What gives you that idea? Yes, the OLE stream thing was a binary format, but so is DER for ASN.1. Every webpage you load goes over a binary tagged object format not too different from OLE/COM's. But due to a persistence of myths from the 90s, people still think of the Office binary format as "horrible" when it's actually quite elegant, especially considering the problems the authors had to solve and their constraints in doing so. In many ways, we've regressed.

> Markup

The author of the article nails it when he says ODF is meant to be a markup language and OOXML is the serialization of an object graph. So what? Do people write ODF by hand? There are countless JSON formats just as inscrutable as MSO's legacy streams.

Anyway, the idea that the MSO binary format was crap because it was binary, lazy, and represented a "memory dump" is an old myth that just won't die. It wasn't a memory dump, it wasn't lazy, and it wasn't crap. Yes, there are real problems with some of the things people put inside the OLE container, but it's facile and wrong to blame the container or the OLE stream composition model for the problem.
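The GUID argument can be illustrated with a toy tagged container. This is a sketch under stated assumptions, not the actual OLE compound file layout: the record layout (16-byte GUID, 4-byte little-endian length, payload), the two GUID values, and the `pack` / `unpack` helpers are all invented for illustration.

```python
import struct
import uuid

# Hypothetical component identifiers. In practice each component author
# would generate one independently, with no central registry.
CHART = uuid.UUID("11111111-2222-3333-4444-555555555555")
AUDIO = uuid.UUID("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee")

def pack(records):
    """Serialize (guid, payload) pairs as GUID + length + payload records."""
    out = bytearray()
    for guid, payload in records:
        out += guid.bytes + struct.pack("<I", len(payload)) + payload
    return bytes(out)

def unpack(blob):
    """Yield (guid, payload) pairs; the length prefix lets a reader skip
    payloads it does not understand."""
    pos = 0
    while pos < len(blob):
        guid = uuid.UUID(bytes=blob[pos:pos + 16])
        (length,) = struct.unpack_from("<I", blob, pos + 16)
        start = pos + 20
        yield guid, blob[start:start + length]
        pos = start + length

blob = pack([(CHART, b"cells"), (AUDIO, b"\x00\x01")])

# A reader that only knows about charts ignores the audio record entirely.
known = {CHART}
understood = [(g, p) for g, p in unpack(blob) if g in known]
print(understood)
```

The design choice being illustrated is that a reader needs no knowledge of a component's payload to step over it, and two components cannot clash on a name as long as their GUIDs were generated independently.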
mschuster91 3 days ago
IIRC Adobe's PSD file format is similar, which made it very complex to reverse engineer on the one hand, and vulnerable to exploits on the other.