Funny thing, pseudo-XML is going through a big resurgence right now, because models love it, while they seriously struggle with JSON.

▲ roflcopter69 a day ago | parent | next [-]

I'd be really interested in what you mean. Are there any studies that quantify this difference in model performance when using JSON or XML? What could be a good intuition for why there might be a big difference? If XML is better than JSON for LLMs, why isn't everyone and their grandma recommending that I use XML instead of JSON? Why does the Google Gemini API offer structured output only with JSON schema instead of XML schema?

▲ simonw a day ago | parent | next [-]

I don't know if the XML is better than JSON thing still holds with this year's frontier models, but it was definitely a thing last year. Here's Anthropic's documentation about that: https://docs.claude.com/en/docs/build-with-claude/prompt-eng...

Note that they don't actually suggest that the XML needs to be VALID!

My guess was that JSON requires more characters to be escaped than XML-ish syntax does, plus matching opening and closing tags makes it a little easier for the LLM not to lose track of which string corresponds to which key.

▲ bird0861 a day ago | parent [-]

the Qwen team is still all in on XML and they make a good case for it

▲ roflcopter69 a day ago | parent [-]

Can you please provide a source? I'd love to know their exact reasoning and/or evidence that XML is the way to go.

▲ samuelknight a day ago | parent | prev [-]

(1) JSON requires lots of escape characters that mangle the strings + hex escapes and (2) it's much easier for model attention to track when a semantic block begins and ends when it's wrapped by the name of that section

<instructions>

...

</instructions>

can be much easier than

{

"instructions": "..\n...\n"

}

especially when there are newlines, quotes and unicode
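A minimal sketch of the difference, using only Python's standard `json` module. The sample text and variable names are made up for illustration and do not come from any model vendor's documentation.

```python
import json

# Hypothetical prompt text with quotes, a newline and a non-ASCII character.
text = 'Say "hi" to José.\nThen stop.'

# JSON encoding: quotes and newlines are backslash-escaped, and json.dumps
# (with its default ensure_ascii=True) turns non-ASCII characters into
# \uXXXX escapes.
as_json = json.dumps({"instructions": text})

# Tag wrapping (pseudo-XML): the text sits verbatim between the two tags.
as_tags = f"<instructions>\n{text}\n</instructions>"

print(as_json)
print(as_tags)
```

Passing `ensure_ascii=False` to `json.dumps` keeps the accented character readable, but the quote and newline escapes remain.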

▲ roflcopter69 a day ago | parent [-]

Thanks for the reply, that part about the model's attention is pretty interesting! I would suspect that a single attention layer won't be able to figure out which token the token for an opening bracket should attend to most. Think of {"x": {"y": 1}}: with only one layer of attention, can the token for the first opening bracket successfully attend to exactly the matching closing bracket?

I wonder if RNNs work better with JSON or XML. Or maybe they are just fine with both, because an RNN can have some stack-like internal state that can match brackets? It would probably be a really cool research direction to measure how well Transformer-Mamba hybrid models like Jamba perform on structured input/output formats like JSON and XML and compare them. For the LLM era, I could only find papers that do this evaluation with transformer-based LLMs.

Damn, I'd love to work at a place that does this kind of research, but guess I'm stuck with my current boring job now :D Born to do cutting-edge research, forced to write CRUD apps with some "AI sprinkled in". Anyone hiring here?
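For reference, the kind of stack-like state the comment has in mind can be sketched in a few lines of Python. This is only an illustration of explicit bracket matching, not a claim about how any RNN or transformer does it internally. The function name is hypothetical, and the sketch deliberately ignores braces that appear inside string values.

```python
def match_brackets(s):
    """Map the index of each opening brace to the index of its closing brace."""
    stack = []   # indices of opening braces that are still unmatched
    pairs = {}
    for i, ch in enumerate(s):
        if ch == "{":
            stack.append(i)
        elif ch == "}":
            if not stack:
                raise ValueError(f"unmatched closing brace at index {i}")
            # The most recently opened brace is the one this brace closes.
            pairs[stack.pop()] = i
    if stack:
        raise ValueError(f"unmatched opening brace at index {stack[-1]}")
    return pairs

# The nested object from the comment, written as valid JSON.
pairs = match_brackets('{"x": {"y": 1}}')
print(pairs)
```

The outer brace at index 0 can only be matched after everything nested inside it has been closed, which is why a single lookup without some notion of depth has a hard time here.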

▲ koolala a day ago | parent | prev | next [-]

HTML? Is the main advantage of XML for understandability the labeled closing tags? Lisp has the same struggle too?

▲ imiric a day ago | parent [-]

Tangent: the fact XHTML didn't gain traction is a mistake we've been paying off for decades. Browser engines could've been simpler; web development tools could've been more robust and powerful much earlier; we would be able to rely on XSLT and invent other ways of processing and consuming web content; we would have proper XHTML modules, instead of the half-baked Web Components we have today. Etc. Instead, we got standards built on poorly specified conventions, and we still have to rely on 3rd-party frameworks to build anything beyond a toy web site. Stricter web documents wouldn't have fixed all our problems, but they would have certainly made a big impact for the better.

▲ tasuki a day ago | parent | prev [-]

What is pseudo-XML?

▲ simonw a day ago | parent [-]

Looks like XML but isn't actually valid XML. This for example:

  <title>This & that</title>
  <author>Simon</author>
  <body>Article content goes here</body>

If you ask an LLM for the title, author and body it will give you the right answer, even though that is not a valid XML document.
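A small sketch that checks both halves of that claim with Python's standard library: a strict XML parser rejects the snippet (the bare `&` is not well-formed, and there is no single root element), while a lenient regex still recovers the fields. The `extract_tags` helper is hypothetical and only handles flat, attribute-free tags.

```python
import re
import xml.etree.ElementTree as ET

doc = """<title>This & that</title>
<author>Simon</author>
<body>Article content goes here</body>"""

# A strict parser refuses the snippet.
try:
    ET.fromstring(doc)
    is_valid_xml = True
except ET.ParseError:
    is_valid_xml = False

def extract_tags(text):
    """Leniently pull out <name>...</name> pairs without validating anything."""
    pattern = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
    return {m.group(1): m.group(2) for m in pattern.finditer(text)}

fields = extract_tags(doc)
print(is_valid_xml)
print(fields)
```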

▲ tasuki 10 hours ago | parent | next [-]

Quite obvious, how did I not think of it? Thanks!

▲ joquarky a day ago | parent | prev [-]

If XML had been like this from the start, it might have won. Just look at HTML vs XHTML.