| ▲ | Someone1234 7 hours ago |
| I keep seeing people make the same mistake XML made, over and over, without learning from it. I will clarify the problem thusly: > The more capabilities you add to an interchange format, the harder that format is to parse. There is a reason why JSON is so popular: it supports so little that it is legitimately easy to import. Whereas XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink. There was a thread here the other day about using Sqlite as an interchange format to REDUCE complexity. Look, I love Sqlite, as an application-specific data-store. But much like XML it has a ton of capabilities, which is good for a data-store, but awful for an interchange format with multiple producers/consumers with their own ideas. CSV may be under-specified, but it remains popular largely due to its simplicity to produce/consume. Unfortunately, we're seeing people slowly ruin JSON by adding e.g. comments to the format, with others then using those "comments" to hold data (e.g. type information), which must be parsed. Which is a bad version of an XML attribute. |
|
| ▲ | GuB-42 6 hours ago | parent | next [-] |
| I think JSON has the opposite problem: it is too simple, and the lack of comments in particular is quite bad for many common usages of the format today. I know some implementations of JSON support comments and other things, but it is not true JSON, in the same way that most simple XML implementations are not true XML. That's why I say "opposite problem": XML is too complex, and most practical uses of XML use incomplete implementations, while many practical uses of JSON use extended implementations. By the way, this is not a problem for what JSON was designed for: a text interchange format, with JS being the language of choice, but it has gone beyond its design: configuration files, data stores, etc... |
| |
| ▲ | da_chicken a few seconds ago | parent | next [-] | | I've said it before, but I maintain that XML has only two real problems: 1. Attributes should not exist. They make the document suddenly have two dimensions instead of one, which significantly increases complexity. Anything that could be an attribute should actually be a child element. 2. There should be one close tag: `</>`, which closes the last open element; named close tags burn a significant amount of space on useless syntax. Other than that and the self-closing `<tag />` (which itself is less useful without attributes) there isn't much that you need. Maybe a document close tag like `<///>`. You'll notice that, yes, JSON solves both of those things. That's a part of why it's so popular. The other is just that a lot more effort was put into maximizing the performance of JavaScript than into shredding XML, and XSLT, the intended solution to this problem, is infamous at this point. The problem of comments is kind of a non-issue in practice, IMO. You can just add a `"_COMMENT"` element or similar. Sure, yes, it will get parsed. But you shouldn't have so many comments that they cause a genuine performance issue. However, JSON still has two problems: 1. Schema support. You can't validate a file before de-serializing it in your application. JSON Schema does exist, but its support is still thin, IMX. 2. Many serializers are pretty bad with tabular data, and nearly all of them are bad with tabular data by default. So sometimes it's a data serialization format that's bad at serializing bulk data. Yeah, XML is worse at this. Yeah, you can use the `"colNames": ["id", ...], "rows": [ [1,...],[2,...] ]` method or go columnar with `"id": [1,2,...], "name": [...], "createDate": [...]`, but you had better be sure both ends can support that format. In both cases, it seems like there are attempts to resolve those issues. OpenAPI 3.1 has JSON Schema included in it. 
The most popular JSON parsers seem to be adding tabular data support. I guess we'll see. | |
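The row-packed layout mentioned above is easy to sketch. A minimal round-trip in Python, where the `colNames`/`rows` key names follow the comment's example rather than any standard:

```python
import json

def to_compact(records):
    """Pack a list of uniform dicts into the colNames/rows layout."""
    col_names = list(records[0].keys())
    rows = [[rec[c] for c in col_names] for rec in records]
    return {"colNames": col_names, "rows": rows}

def from_compact(table):
    """Expand the compact layout back into a list of dicts."""
    return [dict(zip(table["colNames"], row)) for row in table["rows"]]

records = [
    {"id": 1, "name": "a", "createDate": "2026-01-01"},
    {"id": 2, "name": "b", "createDate": "2026-01-02"},
    {"id": 3, "name": "c", "createDate": "2026-01-03"},
]
compact = to_compact(records)
# The compact form spells each key out once instead of once per row.
assert len(json.dumps(compact)) < len(json.dumps(records))
assert from_compact(compact) == records
```

The catch, as the comment says, is that `from_compact` has to exist on the consumer's side: a generic deserializer will just hand you the envelope, not the rows.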
| ▲ | conartist6 6 hours ago | parent | prev [-] | | A lot of people dislike that decision not to include comments in JSON, but I think, while shocking, it was and is totally correct. In a programming language it's usually free to have comments because the comment is erased before the program runs; we usually render comments in grey text because they can't change the meaning of the program. In a data language you have no such luxury. In a data language there's no comment erasure happening between the producer and the consumer, so comments are just dangerous, as they would without doubt evolve into a system of annotations -- an additional layer of communication which would not be standardized at all and which would then grow into a wild west of nonstandard features and compatibility workarounds. | | |
| ▲ | phlakaton 4 hours ago | parent | next [-] | | I don't dislike the decision at all, FWIW! For data interchange it's totally reasonable. But it does make JSON ill-suited for a bunch of applications that JSON has been forcefully and unfortunately applied to. | |
| ▲ | zahlman 2 hours ago | parent | prev | next [-] | | > In a programming language it's usually free to have comments because the comment is erased before the program runs That's inherent to the language specification, but it isn't inherent to the document. You have to have a system with rules that require that erasure. Nothing prevents one from mandating a system that strips those comments out of JSON. You could even "compile" JSON to, I don't know, BSON or msgpack or something. Just as nothing prevents one from creating tooling to, say, extract type annotations from comments in a dynamically typed language. | |
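A minimal sketch of that "compile step" in Python: strip comments outside string literals, then hand the result to a strict parser. (This is roughly what JSONC tooling does; a real implementation handles more edge cases, and an unterminated `/* ... ` here would simply raise.)

```python
import json

def strip_comments(text):
    """Erase // line comments and /* */ block comments that sit outside
    of string literals, so the result can go to a strict JSON parser."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:   # keep escaped chars, incl. \"
                out.append(text[i + 1])
                i += 1
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text[i:i + 2] == "//":
            while i < n and text[i] != "\n":
                i += 1
            continue
        elif text[i:i + 2] == "/*":
            i = text.index("*/", i) + 2    # raises on unterminated comment
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)

doc = '{\n  // not data, just a note\n  "url": "http://example.com" /* kept out */\n}'
assert json.loads(strip_comments(doc)) == {"url": "http://example.com"}
```

Note the `//` inside the URL string survives because comment detection is suspended inside strings; that's the part naive regex-based strippers get wrong.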
| ▲ | heresie-dabord 5 hours ago | parent | prev | next [-] | | > while shocking it was and is totally correct Agreed — consider how comments have been abused in HTML, XML, and RSS. Any solution or technology that can be abused will be abused if there are no constraints. | |
| ▲ | blackcatsec 5 hours ago | parent | prev | next [-] | | Could you imagine hitting a REST API and like 25% of the bytes are comments? lol | | |
| ▲ | dunham 4 hours ago | parent | next [-] | | Worse than that - people will start tagging "this value is a Date" via comments, and you'll need to parse ad-hoc tags in the comments to decode the data. People already do tagging in-band, but at least it's in-band and you don't have to write a custom parser. | |
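That in-band tagging is usually done with a reserved key. A sketch in Python, where the `$date` key is an assumed convention (loosely modeled on MongoDB's Extended JSON, not part of JSON itself) — the point being that it needs only a standard decode hook, not a custom parser:

```python
import json
from datetime import date

def decode_tagged(obj):
    """object_hook: revive values tagged in-band, e.g. {"$date": "2026-02-14"}."""
    if set(obj) == {"$date"}:
        return date.fromisoformat(obj["$date"])
    return obj

doc = '{"user": "ada", "joined": {"$date": "2026-02-14"}}'
decoded = json.loads(doc, object_hook=decode_tagged)
assert decoded == {"user": "ada", "joined": date(2026, 2, 14)}
```

`object_hook` runs on every object, innermost first, so the tagged wrapper is replaced by a real `date` before the enclosing object is built.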
| ▲ | bmacho 5 hours ago | parent | prev [-] | | HTML and JS both have comments, I don't see the problem | | |
| ▲ | Someone1234 4 hours ago | parent [-] | | And both are poor interchange formats. When things stay in their lane, there is no "problem." When you try to make an interchange format using a language with too many features, or comments that people abuse to add parsable information (e.g. "type information") then there is a BIG problem. |
|
| |
| ▲ | jancsika 3 hours ago | parent | prev | next [-] | | > so comments are just dangerous as they would without doubt evolve into a system of annotations -- an additional layer of communication which would then not be standardized at all and which then would grow into a wild west of nonstandard features and compatibility workarounds IIRC Douglas Crockford explicitly stated that he saw people initially using comments for a purpose like ad hoc preprocessor directives. | |
| ▲ | quotemstr 4 hours ago | parent | prev [-] | | No, it was obviously and flagrantly incorrect, as evidenced by the success of interchange formats that do allow for comments, including many real world systems that pragmatically allow comments even when JSON says they shouldn't. This is Stockholm Syndrome. But what can we expect from a spec that somehow deems comments bad but can't define what a number is? |
|
|
|
| ▲ | python-b5 4 hours ago | parent | prev | next [-] |
| I've been working on an XML parser of my own recently and, to be honest, as long as you're fine with a non-validating parser (which is still compliant), it's really not that bad. You have to parse DTDs, but you don't need to actually _do_ anything with them. Namespaces are annoying, but they're not in the main spec. CDATA sections aren't all that useful, but they're easy to parse. As far as I'm aware, parsers don't actually need to handle xml:lang/xml:space/etc. themselves - they're for use by applications using the parser. Really the only thing that's been particularly frustrating for me is entity expansion. If you want to support the wider XML ecosystem, with all the complex auxiliary standards, then yes, it's a lot of work, but the language itself isn't that awful to parse. It's a little messy, but I appreciate it at least being well-specified, which JSON is absolutely not. |
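For reference, the floor for entity expansion is small: the five predefined entities plus numeric character references. A sketch (DTD-defined entities — the genuinely frustrating part — are deliberately omitted, and a real parser must not expand entities inside CDATA):

```python
import re

# The five entities every XML parser must know, per the XML 1.0 spec.
PREDEFINED = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

def expand(text):
    """Expand predefined entities and numeric character references."""
    def sub(m):
        name = m.group(1)
        if name.startswith("#x"):
            return chr(int(name[2:], 16))   # hex char ref, e.g. &#x42;
        if name.startswith("#"):
            return chr(int(name[1:]))       # decimal char ref, e.g. &#65;
        return PREDEFINED[name]             # KeyError = undeclared entity
    return re.sub(r"&(#?x?[0-9A-Za-z]+);", sub, text)

assert expand("a &lt; b &amp;&amp; c &gt; d") == "a < b && c > d"
assert expand("&#65;&#x42;") == "AB"
```

Everything beyond this — internal DTD subsets, external entities, recursion limits — is where the frustration (and the security surface) lives.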
|
| ▲ | 0xbadcafebee 4 hours ago | parent | prev | next [-] |
| The problem is that engineers of data formats have ignored the concept of layers. With network protocols, you make one layer (Ethernet), you add another layer (IP), then another (TCP), then another (HTTP). Each one fits inside the last, but is independent, and you can deal with them separately or together. Each one has a specialty and is used for certain things. The benefits are 1) you don't need "a kitchen sink", 2) you can replace layers as needed for your use-case, 3) you can ship them together or individually. I don't think anyone designs formats this way, and I doubt any popular formats are designed for this. I'm not that familiar with enterprise/big-data formats so maybe one of them is? For example: CSV is great, but obviously limited, and not specified all that well. A replacement table data format could be binary (it's 2026, let's stop "escaping quotes", and make room for binary data). Each row can have header metadata to define which columns are contained, so you can skip empty columns. Each cell can be any data format you want (specifically so you can layer!). The header at the beginning of the data format could (optionally) include an index of all the rows, or it could come at the end of the file. And this whole table data format could be wrapped by another format. Due to this design, you can embed it in other formats, you can choose how to define cells (pick a cell-data-format of your choosing to fit your data/type/etc, replace it later without replacing the whole table), you can view it out-of-order, you can stream it, and you can use an index. |
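The per-row framing described could be sketched like this; the field widths, the column bitmap, and length-prefixed cells are my own assumptions, not a spec. The key property is that the table layer only frames bytes — what each cell's bytes mean is left to an inner layer of your choosing:

```python
import struct

def write_row(cols_present, cells):
    """cols_present: bitmask of which columns this row carries.
    cells: raw bytes per present column (any inner format, no escaping)."""
    parts = [struct.pack("<HH", cols_present, len(cells))]  # tiny row header
    for cell in cells:
        parts.append(struct.pack("<I", len(cell)))  # length prefix, so
        parts.append(cell)                          # binary data needs no quoting
    return b"".join(parts)

def read_row(buf, offset=0):
    """Decode one row; returns (bitmask, cells, offset past the row)."""
    cols_present, n = struct.unpack_from("<HH", buf, offset)
    offset += 4
    cells = []
    for _ in range(n):
        (size,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        cells.append(buf[offset:offset + size])
        offset += size
    return cols_present, cells, offset

row = write_row(0b101, [b"42", b'{"any": "inner format"}'])
mask, cells, _ = read_row(row)
assert mask == 0b101 and cells == [b"42", b'{"any": "inner format"}']
```

Because `read_row` returns the next offset, rows can be streamed one at a time, and an (optional) index of offsets could live in a header or trailer exactly as the comment suggests.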
| |
| ▲ | inejge 18 minutes ago | parent | next [-] | | > With network protocols, you make one layer (Ethernet), you add another layer (IP), then another (TCP), then another (HTTP). Each one fits inside the last, but is independent, and you can deal with them separately or together. It looks neat when you illustrate it with stacked boxes or concentric circles, but real-world problems quickly show the ugly seams. For example, how do you handle encryption? There are arguments (and solutions!) for every layer, each with its own tradeoffs. But it can't be neatly slotted into the layered structure once and for all. Then you have things like session persistence, network mobility, you name it. Data formats have other sets of tradeoffs pulling them in different directions, but I don't think that layered design would come near to solving any of them. | |
| ▲ | mristin an hour ago | parent | prev | next [-] | | Have a look at Asset Administration Shells (AAS) -- it is a data exchange format built on top of JSON and XML (and RDF, and OPC UA and Protobuf, etc.). https://industrialdigitaltwin.org/ (Disclaimer: I work on AAS SDKs https://github.com/aas-core-works.) | |
| ▲ | gmueckl 3 hours ago | parent | prev [-] | | Some early binary formats followed similar concepts. Look up Interchange File Format, AIFF, RIFF, and their applications and all the file formats using this structure to this day. |
|
|
| ▲ | conartist6 7 hours ago | parent | prev | next [-] |
| Just gonna drop this here : ) https://docs.bablr.org/guides/cstml CSTML is my attempt to fix all these issues with XML and revive the idea of HTML as a specific subset of a general data language. As you mention, one of the major learnings from the success of JSON was to keep the syntax stupid-simple -- easy to parse, easy to handle. Namespaces were probably the feature to get the most rework. In theory it could also revive the ability we had with XHTML/XSLT to describe a document in a minimal, fully-semantic DSL, only generating the HTML tag structure as needed for presentation. |
| |
| ▲ | phlakaton 4 hours ago | parent | next [-] | | I unfortunately disagree that your syntax is "stupid-simple." But it highlights an impedance mismatch between XML users and JSON users. JSON treats text as one of several equally-supported datatypes, and quotes all strings. Great if your data is heavily structured, and text is short and mixed with other types of data. Awful if your data is text. XML and other SGML apps put the text first and foremost. Anything that's not text needs to be tagged, maybe with an attribute to indicate the intended type. It's annoying to express lots of structured, short-valued data. But it's simple and easy for text markup where the text predominates. CSTML at first glance seems to fall into the JSON camp. Quoting every string literal makes plenty of sense in JSON, but not in the HTML/text-markup world you seem to want to play in. | | |
| ▲ | conartist6 3 hours ago | parent | next [-] | | Yeah "impedance mismatch" is a good way of putting it. I wouldn't say we fall into the JSON camp at all though, but quite squarely into the XML-ish camp! We just wrap the inner text in quotes to make sure there's no confusion between the formatting of the text stored IN the document and the formatting of the document itself. HTML is hiding a lot of complexity here: https://blog.dwac.dev/posts/html-whitespace/. We're actually doing exactly what the author of that detailed investigation recommends. You can see how it plays out when CSTML is used to store an HTML document https://github.com/bablr-lang/bablr-docs/blob/1af99211b2e31f.... Having the string wrappers makes it possible to precisely control spaces and newlines shown to the user while also having normal pretty-formatting. Compare this to a competing product SrcML which uses XML containers for parse trees and no wrapper strings. Take a look at the example document here: https://www.srcml.org/about.html. A simple example is three screens wide because they can't put in line breaks and indentation without changing the inner text! | |
| ▲ | conartist6 3 hours ago | parent | prev [-] | | As to the simplicity of the syntax, I think you would understand what I mean if you were writing a parser. It's particularly gratifying that you can easily interpret CSTML with a stream parser. XML cannot work this way because this particular case is ambiguous: `<Name`
What does Name mean in this fragment of syntax? Is it the name of a namespace? Or the name of a node? We won't know until we look ahead and see whether the next character is `:`. That's why we write `<Namespace:Name />` as `:Namespace: <Name />` - it means there's no point in the left-to-right parse at which the meaning is ambiguous. And finally, CSTML has no entity lookups, so there's no need to download a DTD to parse it correctly. |
| |
| ▲ | Chaosvex 4 hours ago | parent | prev [-] | | I realised the other day that some of my test code has 'jumped' rather than 'jumps' for the intended pangram. Glad to see I'm not alone. :^) | | |
| ▲ | conartist6 4 hours ago | parent [-] | | Haha yeah someone pointed that out to me and I decided to leave it. I just needed a sentence, I'm not actually trying to show off every glyph in a font. | | |
|
|
|
| ▲ | necovek 4 hours ago | parent | prev | next [-] |
| Funnily enough, XML was an attempt to simplify SGML so it is easier to parse (as SGML only ever had one compliant parser, nsgmls). |
| |
| ▲ | tannhaeuser 3 hours ago | parent [-] | | SGML has at least SP/OpenSP, sgmljs, and nsgmls as full-featured, stand-alone parsers. There are also parsers integrated into older versions of products such as MarkLogic, ArborText, and other pre-XML authoring suites, renderers, and CMSs. Then there are language runtime libs such as SWI Prolog's, with a fairly complete basic SGML parser. ISO 8879 (SGML) doesn't define an API or a set of required language features; it just describes SGML from an authoring perspective and leaves the rest to an application linked to a parser. It even uses that term for the original form of stylesheets ("link types", reusing other SGML concepts such as attributes to define rendering properties). SGML doesn't even require a parser implementation to be able to parse an SGML declaration, which is a complex formal document describing features, character sets, etc. used by an SGML document, the idea being that the declaration could be read by a human operator to check and arrange for integration into a foreign document pipeline. Even SCRIPT/VS (part of IBM's DCF and the origin of GML) could thus technically be considered SGML. There are also a number of historical/academic parsers, and SGML-based HTML parsers used in old web browsers. |
|
|
| ▲ | PunchyHamster 6 hours ago | parent | prev | next [-] |
| Constant erosion of data formats into the shittiest DSLs in existence is annoying. "Oh, hey, instead of writing Python, how about you write in:
* YAML, with magical keywords that turn data into conditions/commands
* a template language for the YAML, in places where that isn't enough
* ...Python, because you eventually need to write stuff that ingests the above either way
.... Ansible is great, isn't it?" ... and for some reason others decide "YES THIS IS AWESOME" and we now have a bunch of declarative YAML+template garbage. > There was a thread here the other day about using Sqlite as an interchange format to REDUCE complexity. Look, I love Sqlite, as an application specific data-store. But much like XML it has a ton of capabilities, which is good for a data-store, but awful for an interchange format with multiple producers/consumers with their own ideas. It's just a bunch of records put in tables with pretty simple data types. And it's trivial to convert into other formats while being compact and queryable on its own. So as far as formats go, you could do a whole lot worse. |
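That "trivial to convert" claim is easy to demonstrate with the stdlib alone; a sketch with made-up table and column names:

```python
import json
import sqlite3

# Round-trip a table out of a SQLite file: the interchange surface really
# is just rows with simple types, whatever else the engine can do.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])

conn.row_factory = sqlite3.Row  # rows become mapping-like objects
rows = [dict(r) for r in conn.execute("SELECT * FROM users")]
assert json.loads(json.dumps(rows)) == [
    {"id": 1, "name": "ada"},
    {"id": 2, "name": "bob"},
]
```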
| |
| ▲ | gaigalas 4 hours ago | parent | next [-] | | Basic dicts, arrays and templates might be the killer feature set for declarative data languages. If everyone coalesces to those eventually, it means there's something to it. | |
| ▲ | 01HNNWZ0MV43FF 3 hours ago | parent | prev [-] | | One issue with SQLite is that it's _not_ rewritten every time like JSON and XML, so if you forget to vacuum it or roundtrip it through SQL, you can easily leak deleted data in the binary file. |
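A minimal illustration of the hygiene step that implies before handing a SQLite file to someone else (`PRAGMA secure_delete = ON` is the other knob; the path and token here are made up):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "share_me.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes VALUES ('TOP-SECRET-TOKEN')")
conn.commit()
conn.execute("DELETE FROM notes")
conn.commit()
# Deleted rows can linger in free pages of the binary file, so rewrite
# the whole file before sharing it.
conn.execute("VACUUM")
conn.close()
assert b"TOP-SECRET-TOKEN" not in open(path, "rb").read()
```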
|
|
| ▲ | xienze 7 hours ago | parent | prev | next [-] |
| > Whereas XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink. But you don't have to use all those things. Configure your parser without namespace support, DTD support, etc. I'd much rather have a tool with tons of capabilities that can be selectively disabled rather than a "simple" one that requires _me_ to bolt on said extra capabilities. |
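Python's stdlib SAX parser shows exactly this kind of opt-in configuration; a sketch using the feature flags from `xml.sax.handler`:

```python
from io import StringIO
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_external_ges, feature_namespaces

class Collector(ContentHandler):
    """Just records element names as they stream past."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def startElement(self, name, attrs):
        self.tags.append(name)

parser = make_parser()
parser.setFeature(feature_namespaces, False)    # treat ns:tag as a plain name
parser.setFeature(feature_external_ges, False)  # never fetch external entities/DTDs
handler = Collector()
parser.setContentHandler(handler)
parser.parse(StringIO('<doc xmlns:x="urn:x"><x:item/></doc>'))
assert handler.tags == ["doc", "x:item"]
```

With namespace processing off, `x:item` is just a funny-looking name — the parser never interprets the `xmlns:x` declaration at all.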
| |
| ▲ | catlifeonmars 6 hours ago | parent | next [-] | | It has the same problem as YAML: there are many, many ways to misconfigure your parser, and therein lie interesting security vulnerabilities. Complex DSLs are difficult to implement parsers for. A simple DSL can be implemented in many programming languages very cheaply and can easily be verified against a specification. S-expressions are probably the most trivial language to write parsers for. JSON is also pretty simple, but the spec being underspecified leads to ambiguous parsing (another security issue). In particular: duplicate key handling and object key order are not specified, and different parsers may treat them differently. | |
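The duplicate-key ambiguity is easy to demonstrate, and a hardened loader is only a few lines; a sketch using Python's `object_pairs_hook`:

```python
import json

def reject_duplicates(pairs):
    """object_pairs_hook: fail loudly instead of silently keeping one value."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key: {key!r}")
        obj[key] = value
    return obj

# Plain json.loads silently keeps the last value -- a classic source of
# parser differentials when another consumer keeps the first.
assert json.loads('{"role": "user", "role": "admin"}') == {"role": "admin"}

try:
    json.loads('{"role": "user", "role": "admin"}', object_pairs_hook=reject_duplicates)
except ValueError as e:
    caught = str(e)
assert "duplicate key" in caught
```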
| ▲ | necovek 4 hours ago | parent | prev | next [-] | | If you do not go with DTD or XSD, you are only doing XML lookalike language, as these are XML mechanisms to really define the XML schema: a compliant parser won't be able to validate it, or maybe even to parse it. Thus people go with custom parsers (how hard can it be, right?), and then have to keep fixing issues as someone or other submits an XML with CDATA in or similar. | | |
| ▲ | zahlman 2 hours ago | parent [-] | | What if we just formalize some reasonable minimal subset, and call it something else? |
| |
| ▲ | cbm-vic-20 7 hours ago | parent | prev [-] | | As a data interchange format, you can only depend on the lowest commonly implemented features, which for XML is the base XML spec. For example, Namespaces is a "recommendation", and a conformant XML parser doesn't need to support it. | | |
| ▲ | smashed 6 hours ago | parent [-] | | The problem comes when malicious actors start crafting documents with extra features that should not be parsed, but a lot of software will wrongly parse them anyway because it uses the default, full-featured parser. Or various combinations of this. It's a pretty well-understood problem, and best practices exist; not everyone implements them. |
|
|
|
| ▲ | moron4hire 6 hours ago | parent | prev | next [-] |
| I consider CSV to be a signal of an unserious organization. The kind of place that uses thousand-line Excel files with VBA macros instead of just buying a real CRM already. The kind of place that thinks junior developers are cheaper than senior developers. The kind of place where the managers browbeat you into working overtime by arguing from a single personal perspective that "this is just how business is done, son." People will blithely parrot, "it's a poor workman who blames his tools." But I think the saying, as I've always heard it used to suggest that someone who is complaining is just bad at their job, is a backwards sentiment. Experts in their respective fields do not complain about their tools not because they are internalizing failure as their own fault. They don't complain because they insist on only using the best tools and thus have nothing to complain about. |
| |
| ▲ | thibaut_barrere 4 hours ago | parent | next [-] | | Most people's salary transfers & healthcare offers literally run on a mix of CSV and XML! CSV is probably the most low-tech, stack-insensitive way to pass data, even these days. (I run & maintain long-term systems which do exactly that). | |
| ▲ | PunchyHamster 6 hours ago | parent | prev | next [-] | | Ah, such youthful ignorance. You just classified probably every single bank in existence as an "unserious organization" | |
| ▲ | Someone1234 3 hours ago | parent | next [-] | | Yep, healthcare, grocery, logistics, data science. Heck it would be easier to list industries that DON'T have any CSV. There aren't many. In terms of interchange formats these are quite popular/common: EDI (serialized as text or binary), CSV, XML, ASN.1, and JSON are extremely popular. I 100% assure everyone reading that their personal information was transmitted as CSV at least once in the last week; but once is a very low estimate. | |
| ▲ | clhodapp 2 hours ago | parent | prev [-] | | They kind of actually are, though. Not because they use CSV's but because, as an industry, they have not figured out how to reliably create, exchange, and parse well-formed CSV's. |
| |
| ▲ | brabel 4 hours ago | parent | prev | next [-] | | > The kind of place that thinks junior developers are cheaper than senior developers… Unless the junior developers start accepting lower salaries once they become senior developers, that is a fact. Do you mean that they think junior developers are cheaper even when considering the cost per output, maybe? | | |
| ▲ | clhodapp 2 hours ago | parent [-] | | I believe they're referring to the fact that if almost all of your code is written by junior developers without mentorship, you will end up wasting a lot of your development budget because your codebase is a mess. |
| |
| ▲ | phlakaton 4 hours ago | parent | prev | next [-] | | LOL, I chose a Google Sheet and CSV for my current project, and I'm very serious about it. It's a short-term solution, and it fits my needs perfectly. | |
| ▲ | groundzeros2015 6 hours ago | parent | prev [-] | | Boy. Wait until you see how much of the world runs on Unix tabular columns |
|
|
| ▲ | quotemstr 5 hours ago | parent | prev [-] |
| > XML supports attributes, namespaces, CDATA, DTDs, QNames, xml:base, xml:lang, XInclude, etc etc. They gave it everything, including the kitchen sink. Ah, the old "throw a bag of nouns at the reader and hope he's intimidated" rhetorical flourish. These things are either non-issues (like QName), things a parser does for you, or optional standards adjacent to XML but not essential to it, e.g. XInclude. |
| |
| ▲ | thayne an hour ago | parent | next [-] | | > things a parser does for you IME there are two kinds of XML implementations: ones that handle DTDs and entity definitions for you and are insecure by default (XXE and SSRF vulnerabilities), and ones that don't and reject valid XML documents. | |
| ▲ | maccard 4 hours ago | parent | prev | next [-] | | > Ah, the old "throw a bag of nouns at the reader and hope he's intimidated" rhetorical flourish. The accusation here is a deflection. OP's point isn't a gish gallop; it's that XML is absolutely littered with edge cases and complexities that all need to be understood. > optional standards adjacent to XML but not essential This is exactly OP's point. The standard is everything and the kitchen sink, except for all the bits it doesn't include, which are almost indistinguishable from the actual standard because of how widely used they are. | |
| ▲ | quotemstr 4 hours ago | parent [-] | | XInclude isn't part of the standard, and IME, a minority of systems support it anyway. The OP's comment is an obvious gish-gallop. You can assemble a similarly scary noun list for practically any technology. Probably the same kind of person who tries to praise JSON's lack of comments as a feature or something. |
| |
| ▲ | 4 hours ago | parent | prev [-] | | [deleted] |
|