iLoveOncall 4 hours ago

This is just a hallucination benchmark on a subset of outputs; I'm not sure there's any value over general hallucination benchmarks.

> Our goal is to be the best general model for deterministic tasks

I'm sorry, but this simply doesn't make sense. If you want deterministic output, don't use an LLM.

nemo1618 an hour ago | parent | next [-]

LLMs are not inherently non-deterministic. This is a common misconception. You used to be able to set temp=0 and a fixed seed and get the same output every time. This broke when labs started implementing batching, and no one bothered fixing it because the benefits of batching vastly outweighed the demand for deterministic output.
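A quick illustration of why batching broke this: floating-point addition is not associative, so a batched kernel that sums the same values in a different reduction order can produce bitwise-different logits, and greedy decoding can then diverge. A minimal Python sketch with toy numbers (not an actual inference kernel):

```python
# Floating-point addition is not associative: the same three values
# summed in two different orders give bitwise-different results.
# Batched kernels change reduction order with batch size, which is
# why even temp=0 decoding stopped being reproducible.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # one reduction order
right = a + (b + c)   # another reduction order

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

A "batch-invariant" kernel avoids this by fixing the reduction order regardless of how requests are batched, trading some throughput for reproducibility.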

I am hopeful deterministic output will return, though; DeepSeek v4 claims to have implemented "bitwise batch-invariant and deterministic kernels," though I haven't tested it myself.

sroussey 3 minutes ago | parent [-]

Thinking Machines Lab uses batch invariant kernels, btw.

khurdula 3 hours ago | parent | prev [-]

General hallucination benchmarks tend to be knowledge-specific, like GPQA or MMLU, but none specifically measure structured output end-to-end, which is one of the biggest use cases for LLMs.

Many developer workflows use LLMs to produce structured artifacts because of their flexibility in consuming unstructured inputs.
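To make "measuring structured output end-to-end" concrete, here is a minimal sketch of how such a scorer could work: parse the raw model output as JSON and check it against a required schema. The schema and field names are made up for illustration; this is not the benchmark's actual methodology.

```python
import json

# Hypothetical scorer for a structured-output benchmark: the model
# must emit JSON matching a required schema. Field names and types
# here are assumptions for illustration only.
REQUIRED_FIELDS = {"name": str, "total": float}

def score_output(raw: str) -> bool:
    """Return True only if raw parses as JSON and matches the schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False  # malformed structure fails outright
    if not isinstance(obj, dict):
        return False
    return all(
        field in obj and isinstance(obj[field], t)
        for field, t in REQUIRED_FIELDS.items()
    )

print(score_output('{"name": "invoice-1", "total": 12.5}'))  # True
print(score_output('{"name": "invoice-1"}'))                 # False (missing field)
print(score_output('not json'))                              # False
```

Scoring the parsed structure rather than raw text is what separates this kind of benchmark from free-form hallucination evals.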

> "don't use an LLM"

Partially agree; that's what we're building toward at interfaze.ai: a hybrid between transformers (LLMs) and traditional CNN/DNN architectures to solve this problem of "deterministic" output. This gives devs the flexibility of custom schema definitions and unstructured input while still getting high-quality structured output, like you would get from CNN models such as EasyOCR.

The industry is moving toward using LLMs for more and more deterministic tasks, so this benchmark lets us actually measure that.