Show HN: A new benchmark for testing LLMs for deterministic outputs (interfaze.ai)
37 points by khurdula 4 hours ago | 14 comments
When building workflows that rely on LLMs, we commonly use structured output for programmatic use cases: converting an invoice into rows, meeting transcripts into tickets, or even complex PDFs into database entries. The model may return the schema you want, but with hallucinated values, like an `invoice_date` that is off by two months or a transcript array in the wrong order. The JSON is valid, but the values are not.

Structured output is now a big part of using LLMs, especially when building deterministic workflows. Current structured output benchmarks (e.g., JSONSchemaBench) only validate the pass rate for JSON schema and types, not the actual values within the produced JSON. So we designed the Structured Output Benchmark (SOB) to fix this by measuring the JSON schema pass rate, types, and value accuracy across three modalities: text, image, and audio. In our test set, every record is paired with a JSON Schema and a ground-truth answer that was verified against the source context manually by a human and with an LLM cross-check, so a missing or hallucinated value counts as wrong.

Open source is doing pretty well, with GLM-4.7 coming in at number 2 right after GPT-5.4. We noticed the rankings shift across modalities: GLM-4.7 leads text, Gemma-4-31B leads images, and Gemini-2.5-Flash leads audio. For example, GPT-5.4 ranks 3rd on text but 9th on images. Model size is not a predictor either: Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on value accuracy, and Phi-4 (14B) beats GPT-5 and GPT-5-mini on text.

Structured hallucinations are the hardest bug. The values are type-correct, schema-valid, and plausible, so they slip through most guardrails. For example, in one audio record the ground truth is "target_market_age": "15 to 35 years", and a model returns "25 to 35". This is invisible without field-level checks.

Our goal is to be the best general model for deterministic tasks, and a key aspect of determinism is a controllable and consistent output structure. The first step to making structured output better is to measure it and hold ourselves against the best.
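To make the distinction concrete, here is a minimal sketch (not the SOB harness itself; the schema and values are made up for illustration) of how an output can pass JSON Schema validation while still failing a field-level comparison against the ground truth:

    # Minimal sketch: schema-validity vs. field-level value accuracy.
    # The schema, ground truth, and model output below are illustrative only.
    import jsonschema  # pip install jsonschema

    schema = {
        "type": "object",
        "properties": {
            "invoice_date": {"type": "string"},
            "total": {"type": "number"},
        },
        "required": ["invoice_date", "total"],
    }

    ground_truth = {"invoice_date": "2025-01-14", "total": 1200.0}
    model_output = {"invoice_date": "2025-03-14", "total": 1200.0}  # plausible, but off by two months

    # Schema check: passes, because the output is well-formed and type-correct.
    jsonschema.validate(model_output, schema)

    # Value accuracy: field-by-field comparison against the verified ground truth.
    correct = sum(model_output.get(k) == v for k, v in ground_truth.items())
    print(f"schema valid, value accuracy = {correct / len(ground_truth):.0%}")  # 50%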
stared 3 hours ago
Thank you for sharing the benchmark. However, the results are selective. Why no Opus 4.7? Why is Gemini 3.1 Pro missing? If there is some other criterion (e.g., models within a certain time or budget), great - just make it explicit. When I see "Top 5 at a glance" and it misses key frontier models, I am (at best) confused.
| |||||||||||||||||||||||
zihotki 3 hours ago
I wonder if this benchmark brings any value. Models are already quite capable and reach high scores in it. | |||||||||||||||||||||||
| |||||||||||||||||||||||
maxdo an hour ago
GPT-5.5 seems to be the recent leader overall; it makes sense to include it, just to see what you trade off for speed or open-source nature vs. the cutting-edge leader.
| |||||||||||||||||||||||
dalberto 3 hours ago
A benchmark without Opus 4.6/4.7 feels incomplete. | |||||||||||||||||||||||
| |||||||||||||||||||||||
broyojo 2 hours ago
hmm why can't structured decoding be used? | |||||||||||||||||||||||
| |||||||||||||||||||||||
iLoveOncall 2 hours ago
This is just a hallucinations benchmark on a subset of outputs; I'm not sure it adds value over general hallucination benchmarks.

> Our goal is to be the best general model for deterministic tasks

I'm sorry, but this simply doesn't make sense. If you want deterministic output, don't use an LLM.
| |||||||||||||||||||||||