Benchmarking open-weight models for security research (dualuse.dev)
1 point by lebovic 12 hours ago | 1 comment
lebovic 12 hours ago
GLM 5.1 is surprisingly capable. Anecdotally, I couldn't notice a difference until ~120K tokens. Qwen 3.6 35B A3B also exceeded my expectations. It performs well, even though the previous generation couldn't even use the testing harness. (TBD on Kimi K2.6; the eval is still running.)