Chinese AI models are ~8 months behind and falling further behind (twitter.com)
2 points by enraged_camel 9 hours ago | 6 comments
ilia-a 8 hours ago
That doesn't seem right; it seems to miss GLM 5.1 and Kimi 2.6. Not to mention the whole cost/value argument for Chinese OSS models versus GPT/Claude.
giardini 9 hours ago
No problem: they're always at most one theft away from you! 8-)
tokkkie 8 hours ago
Chinese models feel strong in Japan, especially on kanji. But outside of language tasks? Maybe Sonnet 4.5 level at most. Do benchmarks in English-speaking regions reflect that gap?
allears 8 hours ago
Not everybody needs cutting-edge performance. Cost per token is turning out to be more important. | ||
jqpabc123 8 hours ago
Chinese models are cheaper and likely to remain so due to lower energy costs. | ||
ollin 7 hours ago
The source here is "CAISI Evaluation of DeepSeek V4 Pro" [1]; the US NIST ran their own benchmarks (including several internal ones) and reported the results in a comparison table.
Notably, two of the benchmarks with the biggest capability gap are CAISI-internal/private ones (CTF-Archive-Diamond, PortBench). I read this as "DeepSeek is well-tuned for public benchmarks, and less generally intelligent than GPT5.5 on held-out tasks", but a less charitable reading would be "the US government reports that US models do best on benchmarks that only the US government can run".

Agent benchmarking is fraught with peril [2], and even a nominally impartial benchmarker (one who disproportionately overlooks bugs/issues when evaluating certain models) can absolutely tilt the scales; the toy sketch below shows how. So I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.

[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...

[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard
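To make the scale-tilting concrete, here's a toy simulation with entirely made-up rates (nothing below is from the CAISI report): two models with identical true ability, where failures caused by harness bugs are rerun for one model but silently counted against the other.

    import random

    random.seed(0)

    N = 1000           # number of benchmark tasks (made up)
    TRUE_RATE = 0.60   # both models equally capable (hypothetical)
    BUG_RATE = 0.10    # fraction of runs hitting a harness bug, not a model error

    def run_eval(evaluator_fixes_harness_bugs: bool) -> float:
        """Score one model over N tasks. A diligent evaluator reruns tasks
        that failed due to harness bugs; a careless one silently counts
        those spurious failures against the model."""
        passed = 0
        for _ in range(N):
            hit_bug = random.random() < BUG_RATE
            would_solve = random.random() < TRUE_RATE
            if hit_bug and not evaluator_fixes_harness_bugs:
                continue  # spurious failure charged to the model
            if would_solve:
                passed += 1
        return passed / N

    # Identical underlying capability, different evaluator diligence:
    print(f"harness bugs fixed for this model:   {run_eval(True):.1%}")
    print(f"harness bugs ignored for this model: {run_eval(False):.1%}")

With these rates the carelessly evaluated model loses about six points of pass rate, purely from evaluator behavior rather than any capability difference.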