| ▲ | HarHarVeryFunny 7 hours ago | |
That remains to be seen. It's notable that Anthropic are still using SWEBench as a coding benchmark rather than the newer more difficult DeepSWE which shows them well behind GPT 5.5 Bear in mind that all the marketing efforts such as solving Erdos problem are the result of concerted RL training to impart those narrow capabilities, and how much of any benchmark results, or "early access" paid shill vibe reports, reflect improved performance for more general real-world use cases remains to be seen. | ||