bayindirh 2 days ago

XML is not only a file format. It's a complete ecosystem built around that format: protocols, validators, and file formats built on top of XML.

You can take XML and convert it to anything. I use it to model 3D objects, for example, and the model allows for some neat programming tricks while being efficient and, more importantly, human readable.

Apart from being small, JSON is the worst of both worlds. A hacky K/V store, at best.

ongy 2 days ago | parent | next [-]

Calling XML human readable is a stretch. It can be with some tooling, but JSON is easier to read both with tooling and without. There's some level at which the schema affects how human readable the serialization is, but I know significantly fewer people who can parse an XML file by sight than JSON.

Efficient is also... questionable. It requires full Turing-machine power just to validate, IIRC (it surely does to fully parse). By which metric is XML efficient?

bayindirh 2 days ago | parent | next [-]

By efficiency, I mean it's text and compresses well. If we mean speed, there are extremely fast XML parsers around; see this page [0] for the state of the art.

From hands-on experience: I used rapidxml to parse said 3D object files. A 116 KB XML file is parsed instantly (rapidxml's stated aim is speed parity with strlen() on the same file, and it delivers).

Converting the same XML into my own memory model took less than 1 ms, including creating the classes and interlinking them.

This was on 2010s-era hardware (a 3rd-generation Core i7-3770K, to be precise).

Verifying the same file against an XSD schema would add some milliseconds, not more. Considering that the core of the problem might take hours on end, torturing memory and CPU, a single 20 ms overhead is basically free.

I believe JSON's and XML's readability is directly correlated with how the file is designed and written (including terminology and formatting), but to be frank, I have seen both good and bad examples of both.

If you can mentally parse HTML, you can mentally parse XML. I tend to learn to parse any markup or programming language mentally so I can simulate it in my mind, but I might be an outlier.

If you're designing a file format based on either for computers only, approaching Perl-level regular-expression unreadability is not hard.

Oops, forgot the link:

[0]: https://pugixml.org/benchmark.html

StopDisinfo910 2 days ago | parent | prev | next [-]

> Calling XML human readable is a stretch.

That’s always been the main flaw of XML.

There are very few use cases where you wouldn't be better served by an equivalent, more efficient binary format.

You will need a tool to debug XML anyway as soon as it gets a bit complex.

bayindirh 2 days ago | parent | next [-]

A simple text editor of today (Vim, Kate) can sanity-check an XML file in real time. Why debug?

StopDisinfo910 2 days ago | parent [-]

Because issues with XML are pretty much never sanity-check failures. After all, XML is pretty much never written by hand but by tools, which will most likely produce valid XML.

Most of the time you will actually be debugging what's inside the file to understand why it caused an issue and to find out whether that comes from the writing or the receiving side.

It's pretty much like a binary format, honestly. XML basically has all the downsides of one with none of the upsides.

bayindirh 2 days ago | parent [-]

I mean, I found it pretty trivial to write parsers for my XML files, which are not simple ones, TBH. The simplest one contains a bit more than 1,700 lines.

It's also pretty easy to emit, "I didn't find what I'm looking for under $ELEMENT" while parsing the file, or "I expected a string but I got $SOMETHING at element $ELEMENT".
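A minimal sketch of that style of error reporting, here in Python's stdlib parser (the code being discussed is C++ with rapidxml; the element and attribute names below are invented for illustration):

```python
# Defensive parsing with explicit, human-oriented error messages.
# All element/attribute names below are hypothetical examples.
import xml.etree.ElementTree as ET

def parse_model(text: str) -> dict:
    root = ET.fromstring(text)
    mesh = root.find("Mesh")
    if mesh is None:
        raise ValueError("I didn't find what I'm looking for under <Model>: <Mesh>")
    raw = mesh.get("vertexCount")
    if raw is None:
        raise ValueError("Missing attribute 'vertexCount' at element <Mesh>")
    try:
        return {"vertexCount": int(raw)}
    except ValueError:
        raise ValueError(f"I expected an integer but got {raw!r} at element <Mesh>")

print(parse_model('<Model><Mesh vertexCount="8"/></Model>'))  # {'vertexCount': 8}
```

Each failure names the element it happened at, which is what makes these messages cheap to act on.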

Maybe I'm biased because I've worked with XML files for more than a decade, but I've never spent more than 30 seconds debugging an XML parsing process.

Also, this was one of the first parts I "sealed" in said codebase and never touched again, because it worked even when the incoming file was badly formed (by erroring out correctly and cleanly).

StopDisinfo910 2 days ago | parent [-]

> It's also pretty easy to emit, "I didn't find what I'm looking for under $ELEMENT" while parsing the file, or "I expected a string but I got $SOMETHING at element $ELEMENT".

I think we are actually in agreement. You could do exactly the same with a binary format without having to deal with the cumbersomeness of XML, which is my point.

You are already treating XML like one, writing error handling in your own parsers and "sealing" them.

What’s the added value of XML, then?

bayindirh 2 days ago | parent [-]

> cumbersomeness of XML...

Telling the parser to navigate to the first element named $ELEMENT, checking a couple of conditions, and assigning values in a defensive manner is not cumbersome, in my opinion.

I would not call parsing binary formats cumbersome (I'm a demoscene fan, so I aspire to match their elegance and performance in my codebases), but it's not the pragmatic approach for the particular problem at hand.

So, we arrive at your next question:

> What’s the added value of XML, then?

There are several things. Let me try to explain.

First of all, it's a self-documenting text format. I don't need extensive documentation for it. I have a spec, but someone opening the file in a text editor can see what it is and understand how it works. When half (or most) of the users of your code are non-CS researchers, that's a huge plus.

Speaking of non-CS researchers, these folks will be the ones generating these files from different inputs. Writing XML in any programming language, including Fortran and MATLAB (not kidding), is a thousand times easier than writing a binary blob.

Extending the file format I developed on top of XML is extremely easy. You change a version number, maybe add a couple of paths to your parser, and you're done. If you feel fancy, allow for backwards compatibility, or just throw an error if you don't like the version (this is for non-CS folks mostly; I'm not that cheap). I don't need to work with nasty offsets or slight behavior differences causing me to pull my hair out.
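A sketch of how cheap that version bump can be, assuming a hypothetical root element carrying a version attribute (Python stdlib; all names invented):

```python
# Version-gated parsing: bump the version attribute, add a path to the
# parser, optionally keep reading older files. Names are hypothetical.
import xml.etree.ElementTree as ET

SUPPORTED = {"1.0", "1.1"}

def load(text: str) -> dict:
    root = ET.fromstring(text)
    version = root.get("version", "1.0")
    if version not in SUPPORTED:
        raise ValueError(f"Unsupported format version: {version}")
    data = {"name": root.findtext("Name", default="")}
    if version == "1.1":
        # Path added in v1.1; v1.0 files simply don't have it.
        data["comment"] = root.findtext("Comment", default="")
    return data

print(load('<Model version="1.1"><Name>cube</Name><Comment>ok</Comment></Model>'))
```

The backwards-compatibility branch is one `if`; rejecting an unknown version is one `raise`.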

Preservation is much easier, too. Scientific software rots much quicker than conventional software, so keeping the file format readable is better for preservation.

"Sealing" in that project's parlance means "verify and don't touch it again". When you're comparing your results against a ground truth to 32 significant digits, you don't poke here and there leisurely. If it works, you add a disclaimer that the file is "verified on YYYYMMDD", and it's closed for modifications unless necessary. The same principle also applies for performance reasons.

So, building a complex file format on top of XML makes sense. It makes the format accessible, cross-platform, easier to preserve, and more.

scotty79 a day ago | parent | prev [-]

With this you get an efficient binary format and the generality of XML:

https://en.m.wikipedia.org/wiki/Efficient_XML_Interchange

But somehow Google forgot to implement this.

int_19h 2 days ago | parent | prev [-]

It's kinda funny to see "not human readable" as an argument in favor of JSON over XML, when the former doesn't even have comments.
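The comments point is easy to demonstrate with Python's stdlib json module (a small sketch; any strict JSON parser behaves the same way):

```python
# JSON's grammar has no comment syntax, so a conforming parser rejects it.
import json

try:
    json.loads('{\n  // a comment\n  "a": 1\n}')
    print("parsed")
except json.JSONDecodeError as e:
    print("rejected:", e.msg)  # rejected: Expecting property name enclosed in double quotes
```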

queenkjuul a day ago | parent [-]

And yet, it's still easier for me to parse with my eyes

mortarion 2 days ago | parent | prev | next [-]

I mean, at least JSON has a native syntax to indicate an array, unlike XML, which requires that you tack on a schema.

    <MyRoot>
      <AnElement>
        <Item></Item>
      </AnElement>
    </MyRoot>

Serialize that to a JavaScript object, then tell me, is "AnElement" a list or not?

That's one of the reasons why XML is completely useless on the web. The web is full of XML that doesn't have a schema because writing one is a miserable experience.
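The ambiguity can be shown in a few lines of Python stdlib code (a sketch; a naive XML-to-object converter has to guess, while JSON carries the answer in its syntax):

```python
# Without a schema, the markup alone can't say whether <AnElement> is a
# one-item list or a single record; JSON's syntax disambiguates.
import json
import xml.etree.ElementTree as ET

root = ET.fromstring("<MyRoot><AnElement><Item/></AnElement></MyRoot>")
children = list(root.find("AnElement"))
print(len(children))  # 1 -- but list-of-one or scalar? The XML can't tell you.

doc = json.loads('{"AnElement": ["item"]}')
print(isinstance(doc["AnElement"], list))  # True -- unambiguously a list
```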

bayindirh 2 days ago | parent [-]

This is why you can have attributes on a tag. You can make an XML file self-explanatory.

Consider the following example:

    <MyRoot>
      <AnElement type="list" items="1">
        <Item>Hello, World!</Item>
      </AnElement>
    </MyRoot>
Most parsers offer type-aware parsing, so if somebody tucks a string into a place where you expect an integer, you can get an error, nil, or "0", depending on your choice.

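A sketch of that attribute-driven approach with Python's stdlib parser (the error/nil/"0" policy is modeled as a parameter; the element names follow the example above):

```python
# Attribute-driven parsing: 'type' and 'items' make the element
# self-describing; a bad count can error out, become None, or become 0.
import xml.etree.ElementTree as ET

def read_list(text: str, on_bad_int: str = "error"):
    root = ET.fromstring(text)
    elem = root.find("AnElement")
    if elem is None or elem.get("type") != "list":
        raise ValueError('expected <AnElement type="list"> under the root')
    try:
        count = int(elem.get("items", "0"))
    except ValueError:
        if on_bad_int == "error":
            raise
        count = None if on_bad_int == "nil" else 0
    values = [item.text for item in elem.findall("Item")]
    return count, values

xml = ('<MyRoot><AnElement type="list" items="1">'
       '<Item>Hello, World!</Item></AnElement></MyRoot>')
print(read_list(xml))  # (1, ['Hello, World!'])
```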
dminik 2 days ago | parent [-]

I had the displeasure of parsing XML documents (into Rust) recently. I don't ever want to do this again.

JSON, for all its flaws, is beautifully simple in comparison. A number is either a number or the document is invalid. Arrays are just arrays and objects are just objects.

XML, on the other hand, is the Wild West. This particular XML beast had some difficulty sticking to one thing.

Take for instance lists. The same document had two different ways to do them:

  <Thing>
    <Name>...</Name>
    <Image>...</Image>
    <Image>...</Image>
  </Thing>

  <Thing>
    <Name>...</Name>
    <Images>
      <Image>...</Image>
      <Image>...</Image>
    </Images>
  </Thing>
Various values were scattered between attributes and child elements with no rhyme or reason.

To prevent code reuse, some element names were namespaced, so you might have <ThingName /> and <FooName />.

To round off my already awful day, some numbers were formatted with thousands separators. Of course, these can change depending on your geographical location.

Now, one could say that this is just the fault of the specific XML files I was parsing. And while I would partially agree, the fact that the format makes this possible is a sign of its quality.

Since there's no clear distinction between objects and arrays you have to pick one. Or multiple.

Since objects can be represented with both attributes and children you have to pick one. Or both.

Since there are no numbers in XML, you can just write them out any way you want. Multiple ways is of course preferable.
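To illustrate the last point: a value like "1,234" is only parseable once the consumer knows the producer's grouping separator. A minimal sketch of the workaround, with that separator as an explicit assumption:

```python
# Locale-formatted numbers force the consumer to know the producer's
# grouping separator out of band: "1,234" means 1234 in en_US,
# but the same digits grouped the de_DE way use "." instead.
def parse_grouped_int(raw: str, thousands_sep: str = ",") -> int:
    # Only safe when the separator is known; guessing silently corrupts data.
    return int(raw.replace(thousands_sep, ""))

print(parse_grouped_int("1,234,567"))   # 1234567
print(parse_grouped_int("1.234", "."))  # 1234
```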

jll29 a day ago | parent | next [-]

The file you got sounds neither valid nor well-formed. It might not even be XML.

I know you describe a real-life situation, but if XML gets abused, it's not XML's fault, just as it's not JSON's fault when JSON gets abused.

dminik a day ago | parent [-]

Could you elaborate why you think so?

As far as I can tell, the file was fully valid XML. The issue is that that doesn't really tell you (or guarantee) much.

There's just no one specific way to do a thing.

bayindirh a day ago | parent | prev [-]

There's a trade-off and tension between simplicity and flexibility. In recent days, the post titled "I prefer RST over Markdown" surfaced again [0][1], showing the same phenomenon clearly.

Simple formats are abuse-proof because of their limitations, which makes perfect sense in some cases (I'm a Markdown fan, for example, but prefer LaTeX for serious documents). Flexible formats are more prone to abuse and misuse. XML is extremely flexible and puts the burden of designing and sanity-checking the file on the producers and consumers of the format in question. This is why it has a couple of verification standards built on top of it.

I personally find it very unproductive to yell at a particular file format because it doesn't satisfy some users' expectations out of the box. The important distinction is whether it provides the capability to address those expectations or not. XML has all the bells and whistles, and then some, to craft sane, verifiable, and easily parseable files.

I also strongly resist the notion of making everything footgun-proof. Not only does it stifle creativity and progress, it makes no sense; we should ban all kinds of blades, then. One should read the documentation of the thing they intend to handle before starting. The machine has no brain; we shall use ours instead.

I'm guilty of this myself. Some of my v1 code holds some libraries very wrong, but at least I reread the docs and correct the parts iteration by iteration (and no, I don't use AI in any form or shape for learning or code refactoring).

So if somebody misused a format and made it miserable to parse, I'd rather put the onus on the programmer who implemented the file format on top of that language, not on the language itself (XML is a markup language).

The only file format I actively avoid is YAML [2]. The problem is its "always valid" property. This puts YAML into the "Risk of electric shock. Read the manual, and read it again, before operating this" category. I'm sure I could make it work if I needed to, but YAML's time hasn't come for me yet. I'd rather use INI or TOML (INI++) for configuring things.

[0]: https://news.ycombinator.com/item?id=41120254

[1]: https://news.ycombinator.com/item?id=44934386

[2]: https://noyaml.com/

agos 2 days ago | parent | prev [-]

it's a lot of things, none of them in the browser anymore

bayindirh 2 days ago | parent [-]

RSS says hi!

agos 2 days ago | parent [-]

as much as it pains me to say it, that ship has also sailed

bayindirh 2 days ago | parent [-]

I still follow feeds, my blog's RSS feed gets ~1.5K fetches every day.

How is it a sailed ship?

agos 2 days ago | parent [-]

how many of those 1.5K you think are using a web browser to read that feed?

bayindirh 2 days ago | parent [-]

The platform I use doesn't give statistics on that (I don't host my blog myself), but I assume the number is >0, since there are a lot of good, free, browser-based RSS readers.