ollin 9 hours ago
The source here is "CAISI Evaluation of DeepSeek V4 Pro" [1]; the US NIST ran their own benchmarks (including several internal ones) and reported the following table:
Notably, two of the benchmarks with the biggest capability gap are CAISI-internal/private ones (CTF-Archive-Diamond, PortBench). I read this as "DeepSeek is well-tuned for public benchmarks, and less generally intelligent than GPT5.5 on held-out tasks", but a less charitable reading would be "US government reports US models do best on benchmarks that only the US government can run". Agent benchmarking is fraught with peril [2], and a partial benchmarker (one who disproportionately overlooks bugs/issues in their evaluation of certain models) can absolutely tilt the scales, so I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.

[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...

[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard