zone411 a day ago
On the extended version of NYT Connections (https://github.com/lechmazur/nyt-connections/):
Claude Opus 4 Thinking 16K: 52.7.
Claude Opus 4 No Reasoning: 34.8.
Claude Sonnet 4 Thinking 64K: 39.6.
Claude Sonnet 4 Thinking 16K: 41.4 (Sonnet 3.7 Thinking 16K was 33.6).
Claude Sonnet 4 No Reasoning: 25.7 (Sonnet 3.7 No Reasoning was 19.2).
Claude Sonnet 4 Thinking 64K refused to answer one puzzle, citing "Output blocked by content filtering policy." The other models did not refuse.
zone411 a day ago
On my Thematic Generalization Benchmark (https://github.com/lechmazur/generalization, 810 questions), the Claude 4 models are the new champions.