| ▲ | XCSme 6 hours ago |
| I just tested it on my benchmarks[0], it's GLM-5.2 level, at 2x cost, but also 2x faster. Weak spots (categories it fails):
- Trivia: 0/3, basically not much built-in knowledge
- Combined tool-calling tasks: score 45/100, sometimes makes invalid tool calls
- Puzzle Solving: score 77, flubs carwash-like tests
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med... |
|
| ▲ | nsoonhui 5 hours ago | parent | next [-] |
| Your benchmark has Gemini 3.5 Flash as the best model, which doesn't compute for me |
| |
| ▲ | XCSme 5 hours ago | parent | next [-] | | It is on top for many benchmarks, just not the coding/agentic ones. Still one of the most intelligent models overall, and the most likely to answer any question you ask correctly (without tools). | |
| ▲ | BoorishBears 5 hours ago | parent | prev [-] | | This guy has a terrible, broken benchmark that gets hawked every release, and I wish HN would ban accounts that essentially exist to hawk a personally owned site, especially such a bad one. | | |
| ▲ | pbgcp2026 39 minutes ago | parent | next [-] | | I get similar results in my own tests. And Gemini 3.1 Pro is consistently on top of my ratings. Not everyone is a coding monkey; I prefer staying a programmer. | | |
| ▲ | BoorishBears 15 minutes ago | parent [-] | | They're referencing Gemini 3.5 Flash being the top model; you must not be great with detail. And no (strong) programmer would jump to assuming other people are coding monkeys just because they disagree about what a strong LLM is: that's the kind of thinking reserved for the glorified coding monkeys who wasted their lives getting better at writing CRUD apps and are now upset that someone's tooling is dropping the already very low bar there. |
| |
| ▲ | UqWBcuFx6NV4r 5 hours ago | parent | prev [-] | | If you were right, the karma system would largely take care of this. It really sounds like this is more of your personal view | | |
| ▲ | BoorishBears 4 hours ago | parent [-] | | Karma systems are never perfect, and most people will not assume this is a pattern (i.e., they won't feel the need to downvote them just for having yet another crappy AI benchmark). I only recognize it because I build a product that leaves me looking for information on every major release... and every major release a new crop of folks reply confused about the anomalies on top of anomalies that they're seeing, and they slowly learn this person is just way more unserious than the dogged distribution would imply. |
|
|
|
|
| ▲ | XCSme 6 hours ago | parent | prev | next [-] |
| As always, note: faster than GLM-5.2 doesn't mean too much, as GLM-5.2 is served by different providers, so the inference speed can vary drastically between providers or over time. |
| |
| ▲ | 2muchtime 4 hours ago | parent | next [-] | | Opencode Go/Zen claim to use infrastructure based in the EU, USA and Singapore that has a zero-retention policy. | |
| ▲ | yieldcrv 6 hours ago | parent | prev [-] | | What’s everyone’s favorite GLM provider? z.ai doesn’t always have the most reliable AI, but I don’t mind the party seeing my trade secrets and thoughts, compared to an American corporation + the party seeing my trade secrets and thoughts. So that’s not a functional difference to me, and the Chinese one won’t reply to subpoenas, so that’s a value add tbh. So I’ll consider all; fastest tokens/sec wins. | | |
| ▲ | reissbaker 2 hours ago | parent | next [-] | | I'm biased because I run an inference company, https://synthetic.new. That being said, I think we're pretty good at serving GLM-5.2 (and other models, like Kimi K2.7!), and our privacy policy is quite good: zero data retention for prompts and completions on API requests. Our average streaming TPS for GLM-5.2 (i.e., tokens per second after factoring out time-to-first-token, which varies based on geography) is 97 tps over the last 24 hrs, although it's lower at peak traffic in the mornings PST, when it's 50-70 tps. We're also subscription-based, which is nicer for coding than e.g. Fireworks, which uses per-token billing. | | |
| ▲ | yieldcrv 2 hours ago | parent [-] | | got a 500 error page on the site's chat, but I'll try the API | | |
| ▲ | reissbaker 40 minutes ago | parent [-] | | Interesting: I don't see anything in our error logs but we could be missing something (and personally the chat works for me + my unsubscribed test account). If you email us at hi@synthetic.new though we should be able to fix anything you're running into! |
|
| |
| ▲ | eli 6 hours ago | parent | prev | next [-] | | Fireworks.ai is solid. And if you care more about speed than cost they have a "fast" variant that I think just throws more hardware at the model for about 2x the cost. | | |
| ▲ | david-gpu 4 hours ago | parent [-] | | The privacy policy indicates that they track you and share your data with ad networks like Meta. Yikes. | | |
| ▲ | pranaybhatia 44 minutes ago | parent [-] | | Hi, PM at Fireworks here. We have zero data retention, so we do not log any of your API requests. I realize you're talking about website activity, which is different; we will check and update on that too. |
|
| |
| ▲ | pbgcp2026 37 minutes ago | parent | prev | next [-] | | Run it on Amazon Bedrock or GCP Vertex. No problems at all. | |
| ▲ | Onavo 3 hours ago | parent | prev [-] | | > the Chinese one won’t reply to subpoenas so thats a value add tbh
That's not something that's definite. They are not quite like the Russians. A lot of the governments in Asia are overly pragmatic and will happily strong-arm their companies to throw users under the bus for the sake of a trade deal. There's a reason why Snowden ran to the Russians and not China. Also, if they have any subsidiaries in the US, they may not have a choice in the matter. |
|
|
|
| ▲ | WorldPeas 5 hours ago | parent | prev [-] |
| My (imperfect) comparison, having used both for planning and execution, is that GLM-5.2 is too jumpy and eager to do things, often to a fault (e.g. deploying/using git when it shouldn't), while Sonnet 5 was much lazier than any Claude model I have used: it didn't add an addendum to a plan that I asked for, then lied that it had when asked. Looking at the analysis[0], I don't think it's worth it for me. Maybe for others. Fable was certainly much better.
[0]: https://artificialanalysis.ai/models/claude-sonnet-5 |