Remix clone Hacker News

	▲	ollin 9 hours ago
The source here is "CAISI Evaluation of DeepSeek V4 Pro" [1]; the US NIST ran their own benchmarks (including several internal ones) and reported the following table. Columns are model (reasoning level):

| Domain | Benchmark | OpenAI GPT-5.5 (xhigh) | OpenAI GPT-5.4 mini (xhigh) | Anthropic Opus 4.6 (max) | DeepSeek V4 Pro (max) |
|:--|:--|--:|--:|--:|--:|
| Cyber | CTF-Archive-Diamond | 71% | 32% | 46% | 32% |
| Software Engineering | SWE-Bench Verified* | 81% | 73% | 79% | 74% |
| | PortBench | 78% | 41% | 60% | 44% |
| Natural Sciences | FrontierScience | 79% | 74% | 72% | 74% |
| | GPQA-Diamond | 96% | 87% | 91% | 90% |
| Abstract Reasoning | ARC-AGI-2 semi-private | 79% | – | 63% | 46% |
| Mathematics | OTIS-AIME-2025 | 100% | 90% | 92% | 97% |
| | PUMaC 2024 | 96% | 93% | 95% | 96% |
| | SMT 2025 | 99% | 92% | 94% | 96% |
| IRT-Estimated Elo | IRT-Estimated Elo | 1260 ± 28 | 749 ± 46 | 999 ± 27 | 800 ± 28 |

Notably, two of the benchmarks with the biggest capability gap are CAISI-internal/private ones (CTF-Archive-Diamond, PortBench). I read this as "DeepSeek is well-tuned for public benchmarks, and less generally intelligent than GPT-5.5 on held-out tasks", but a less charitable reading would be "US government reports US models do best on benchmarks that only the US government can run".

Agent benchmarking is fraught with peril [2], and a partial benchmarker (one who disproportionately overlooks bugs/issues in their evaluation of certain models) can absolutely tilt the scales, so I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.

[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...
[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard
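
If anyone wants to sanity-check the "biggest gap" claim, here is a quick sketch that ranks the benchmarks by GPT-5.5's lead over DeepSeek V4 Pro. The scores are copied by hand from the table above; the names `scores` and `gaps_sorted` are just mine for illustration, and this says nothing about why the gaps exist.

```python
# Scores in percent, copied from the table: (GPT-5.5 xhigh, DeepSeek V4 Pro max).
# The Elo row is left out because it is not a percentage score.
scores = {
    "CTF-Archive-Diamond": (71, 32),
    "SWE-Bench Verified": (81, 74),
    "PortBench": (78, 44),
    "FrontierScience": (79, 74),
    "GPQA-Diamond": (96, 90),
    "ARC-AGI-2 semi-private": (79, 46),
    "OTIS-AIME-2025": (100, 97),
    "PUMaC 2024": (96, 96),
    "SMT 2025": (99, 96),
}


def gaps_sorted(table):
    """Return (benchmark, gap) pairs, largest GPT-5.5 lead first."""
    gaps = {name: gpt - ds for name, (gpt, ds) in table.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


ranked = gaps_sorted(scores)
for name, gap in ranked:
    print(f"{name:24s} {gap:+d} pts")
```

The top two come out as CTF-Archive-Diamond (+39) and PortBench (+34), with ARC-AGI-2 semi-private close behind (+33); every other gap is 7 points or less.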