nijave 2 hours ago

>Almost every open weight model launch this year has come with claims that it matches or exceeds Sonnet. I've been trying a lot of them and I have yet to see it in practice, even when the benchmarks show a clear lead.

This has been my experience as well. I've been testing an agent built with Strands Agents that receives a load balancer latency alert and is expected to query logs with AWS Athena (Trino), then drill down with Datadog spans/traces to find the root cause. Admittedly, "devops" domain knowledge is important here.

My notes so far:

"us.anthropic.claude-sonnet-4-6" # working, good results

"us.anthropic.claude-sonnet-4-20250514-v1:0" # has problems following the prompt instructions

"us.anthropic.claude-sonnet-4-5-20250929-v1:0" # working, good results

"us.anthropic.claude-opus-4-5-20251101-v1:0"

"us.anthropic.claude-opus-4-6-v1" # best results, slower, more expensive

"amazon.nova-pro-v1:0" # completely fails

"openai.gpt-oss-120b-1:0" # tool calling broken

"zai.glm-5" # seems to work pretty well, a little slow, more expensive than Sonnet

"minimax.minimax-m2.5" # didn't diagnose correctly

"zai.glm-4.7" # good results but high tool call count, more expensive than Sonnet

"mistral.mistral-large-3-675b-instruct" # misdiagnosed--somehow claimed a Prometheus scrape issue was involved

"moonshotai.kimi-k2.5" # identified the right endpoints but interpreted trace data/root cause incorrectly

"moonshot.kimi-k2-thinking" # identified endpoint, 1 correct root cause, 1 missing index hallucination

All models were run on AWS Bedrock. I let Claude Code with Opus 4.7 iterate on the agent prompt but didn't try to optimize per model. Really, the only thing that came close to Sonnet 4.5 was GLM-5. The real kicker: Sonnet is also the cheapest, since it supports prompt caching.
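Back-of-the-envelope arithmetic shows why caching dominates here: an agent loop resends the full (long) prompt on every tool-calling turn. The prices and the cached-read discount below are made-up placeholders, not real Bedrock rates; only the shape of the calculation matters.

```python
# Sketch of why prompt caching dominates agent cost. Prices are hypothetical
# ($/1M input tokens), and the 10% cached-read rate is likewise an assumption.
INPUT_PRICE = 3.00            # $ per 1M input tokens (placeholder)
CACHED_READ_MULTIPLIER = 0.1  # cached input tokens billed at 10% (assumed)

def run_cost(prompt_tokens: int, turns: int, cached: bool) -> float:
    """Input-token cost of an agent loop that resends the full prompt each turn."""
    per_token = INPUT_PRICE / 1_000_000
    if not cached:
        return prompt_tokens * turns * per_token
    # First turn pays full price to populate the cache; later turns read it cheaply.
    return prompt_tokens * per_token * (1 + (turns - 1) * CACHED_READ_MULTIPLIER)

# A 20k-token prompt (system prompt + accumulated tool results) over 15 turns:
uncached = run_cost(20_000, 15, cached=False)
cached = run_cost(20_000, 15, cached=True)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")
# → uncached $0.90 vs cached $0.14
```

With these assumed numbers, caching cuts input cost by roughly 6x, which is enough to make a nominally pricier model the cheapest option for multi-turn agents when competitors lack caching support.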

The Kimi models were close to working but didn't quite hit the mark.