| ▲ | firasd 4 hours ago | |
Some interesting stats here about the current landscape: https://arena.ai/leaderboard/agent Agent Arena (dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability). Top 10, highest rank to lowest: Claude Fable 5 (High), Claude Opus 4.8 (Thinking), GPT 5.5 (xHigh), Claude Opus 4.7 (Thinking), GPT 5.5 (High), Claude Opus 4.7, Claude Opus 4.6, GPT 5.5, GPT 5.4 (High), GLM 5.2 (Max). Text Arena (overall rankings across various AI models in text-to-text tasks across math, coding, creative writing, and other open-ended domains). Top 10, highest rank to lowest: claude-fable-5, claude-opus-4-6-thinking, claude-opus-4-7-thinking, claude-opus-4-6, claude-opus-4-7, muse-spark, gemini-3.1-pro-preview, gemini-3-pro, claude-opus-4-8-thinking, gpt-5.5-high. | ||
| ▲ | dakolli 6 minutes ago | parent | next [-] | |
The only real-world task benchmark I know of is Scale Labs RLI: https://labs.scale.com/leaderboard/rli It's clear to me these models are useless on any real-world task: a 4% pass rate on $20-30/hr Upwork tasks. This whole trend of agentic engineering is a giant money grab. | ||
| ▲ | mydreamof 2 hours ago | parent | prev [-] | |
There is no GPT 5.6 in it, so what's the point? | ||