Chinese AI models are ~8 months behind and falling further behind (twitter.com)
2 points by enraged_camel 9 hours ago | 6 comments
ilia-a 8 hours ago
That doesn't seem right; it seems to miss GLM 5.1 and Kimi 2.6. Not to mention the whole cost/value argument for Chinese OSS models versus GPT/Claude.
giardini 9 hours ago
No problem: they're always at most one theft away from you! 8-)
tokkkie 8 hours ago
Chinese models feel strong in Japan, especially on kanji. But outside of language tasks? Maybe Sonnet 4.5 level at most. Do benchmarks in English-speaking regions reflect that gap?
allears 8 hours ago
Not everybody needs cutting-edge performance. Cost per token is turning out to be more important. | ||
jqpabc123 8 hours ago
Chinese models are cheaper and likely to remain so due to lower energy costs. | ||
ollin 7 hours ago
The source here is "CAISI Evaluation of DeepSeek V4 Pro" [1]; the US NIST ran their own benchmarks (including several internal ones) and reported the results in a comparison table.
Notably, two of the benchmarks with the biggest capability gap are CAISI-internal/private ones (CTF-Archive-Diamond, PortBench). I read this as "DeepSeek is well-tuned for public benchmarks, and less generally intelligent than GPT5.5 on held-out tasks", but a less charitable reading would be "the US government reports that US models do best on benchmarks that only the US government can run".

Agent benchmarking is fraught with peril [2], and even a nominally impartial benchmarker (one who disproportionately overlooks bugs/issues when evaluating certain models) can absolutely tilt the scales; the toy sketch below shows how. So I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.

[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...

[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard
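To make the scale-tilting concrete, here's a toy simulation with entirely made-up rates (nothing below is from the CAISI report): two models with identical true ability, where failures caused by harness bugs are rerun for one model but silently counted against the other.

    import random

    random.seed(0)

    N = 1000           # number of benchmark tasks (made up)
    TRUE_RATE = 0.60   # both models equally capable (hypothetical)
    BUG_RATE = 0.10    # fraction of runs hitting a harness bug, not a model error

    def run_eval(evaluator_fixes_harness_bugs: bool) -> float:
        """Score one model over N tasks. A diligent evaluator reruns tasks
        that failed due to harness bugs; a careless one silently counts
        those spurious failures against the model."""
        passed = 0
        for _ in range(N):
            hit_bug = random.random() < BUG_RATE
            would_solve = random.random() < TRUE_RATE
            if hit_bug and not evaluator_fixes_harness_bugs:
                continue  # spurious failure charged to the model
            if would_solve:
                passed += 1
        return passed / N

    # Identical underlying capability, different evaluator diligence:
    print(f"harness bugs fixed for this model:   {run_eval(True):.1%}")
    print(f"harness bugs ignored for this model: {run_eval(False):.1%}")

With these rates the carelessly evaluated model loses about six points of pass rate, purely from evaluator behavior rather than any capability difference.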