vitaelabitur 7 hours ago

I tokenized these and they seem to use around 20% fewer tokens than the original JSON, which makes me think a schema like this could reduce latency and cost in constrained LLM decoding.
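
(If you want to reproduce the measurement, here's a minimal sketch using the tiktoken library; the encoding name and toy payload are my own stand-ins, and you'd encode the alternative serialization the same way and compare counts.)

    import json
    import tiktoken

    # Encoding choice is an assumption; use your target model's tokenizer.
    enc = tiktoken.get_encoding("o200k_base")

    # Toy payload standing in for the real data.
    payload = {"user": {"name": "Ada", "roles": ["admin", "dev"], "active": True}}

    pretty = json.dumps(payload, indent=2)
    print(len(enc.encode(pretty)), "tokens for pretty-printed JSON")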

I know that LLMs are very familiar with JSON, and choosing uncommon schemas just to reduce tokens hurts semantic performance. But a schema that is sufficiently JSON-like probably won't disrupt the model's learned patterns much, and so should avoid that unintended bias.

nurumaik 7 hours ago

Minified JSON would use even fewer tokens.
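
(The minified form is one json.dumps call away; the payload and tokenizer below are placeholders, same as in the sketch above.)

    import json
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    payload = {"user": {"name": "Ada", "roles": ["admin", "dev"], "active": True}}

    pretty = json.dumps(payload, indent=2)
    minified = json.dumps(payload, separators=(",", ":"))  # strips all whitespace
    print(len(enc.encode(pretty)), "->", len(enc.encode(minified)))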

vitaelabitur 6 hours ago

Yeah, but I tried switching to minified JSON on a semantic labelling task and saw a ~5% accuracy drop.

I suspect this happened because most of the JSON in the pre-training corpus was pretty-printed, so the model was forced off its most likely token path and also lost the "visual cues" of nesting depth that indentation provides.

This might happen here too, but maybe to a lesser extent. Anyways, I'll stop building castles in the air now and try it sometime.

memoriuaysj 5 hours ago

If you really care about structured output, switch to XML. Much better results, which is why AI providers tend to use pseudo-XML in their system prompts and tool definitions.
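
(A hypothetical sketch of that pseudo-XML style; the tag names are invented for illustration, not any provider's actual format.)

    # Hypothetical pseudo-XML prompt; tags are illustrative, not a real spec.
    prompt = """<instructions>
    Label each document with exactly one category.
    </instructions>
    <document id="1">
    The quarterly report shows 12% revenue growth.
    </document>
    <output_format>
    <label id="1">CATEGORY</label>
    </output_format>"""
    print(prompt)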