| ▲ | observationist 6 hours ago |
Grok 4 Heavy wasn't considered in comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating MMLU and scoring highest on many of the coding-specific benchmarks. Overall better than Claude 4.5, in my experience, not just with the benchmarks. Benchmarks aren't everything, but if you're going to contrast performance against a selection of top models, then pick the top models? I've seen a handful of companies do this, including big labs, where they conveniently leave out significant competitors, and it comes across as insecure and petty.

Claude has better tooling and UX. xAI isn't nearly as focused on the app and the ecosystem of tools around it, so a lot of things end up more or less an afterthought, with nearly all the focus going toward the AI development. $300/month is a lot, and it's not as fast as other models, so it should be easy to sell GLM as almost as good as the very expensive, slow Grok Heavy, and so on. GLM has 128k context, Grok 4 Heavy 256k, etc.

Nitpicking aside, the fact that they've got an open model that is just a smidge less capable than the multibillion-dollar state-of-the-art models is fantastic. Should hopefully see GLM 4.7 showing up on the private hosting platforms before long.

We're still a year or two from consumer gear starting to get enough memory and power to handle the big models. Prosumer Mac rigs can get up there, quantized, but quantized performance is rickety at best, and at that point you look at the costs of self-hosting vs private hosts vs $200/$300 a month (+ continual upgrades).

Frontier labs only have a few years left where they can continue to charge a pile for the flagship heavyweight models. I don't think most people will be willing to pay $300 for a 5 or 10% boost over what they can run locally.
| ▲ | nl 3 hours ago |
It seems like someone at xAI likes maxing benchmarks, but real-world usage shows it significantly behind frontier models. I do appreciate their desire to be the most popular coding model on OpenRouter and offer Grok4-Fast for free. That's a notable step down from frontier models but fine for lots of bug fixing. I've put hundreds of millions of tokens through it.
| ▲ | lame-robot-hoax 6 hours ago |
Grok, in my experience, is extremely prone to hallucinations when not used for coding. It will readily claim to have access to internal Slack channels at companies, it will hallucinate scientific papers that do not exist, etc. to back its claims. I don’t know if the hallucinations extend to code, but it makes me unwilling to consider using it.
| ▲ | Alifatisk 5 hours ago |
In my experience, Grok 4 expert performs way worse than what the benchmarks say. I’ve tried it with coding, writing and instruction following. The only thing it currently excels at is searching for things across the web + Twitter. Otherwise, I would never use it for anything else. At coding, it always includes an error, and when it patches it, it introduces another one. When writing creative text where it had to follow instructions, it hallucinates a lot. Based on my experience, I suspect xAI of bench-maxing on Artificial Analysis, because there is no way Grok 4 expert performs close to GPT-5.2, Claude Sonnet 4.5 and Gemini 3 Pro.
| ▲ | Madmallard 6 minutes ago |
"Grok 4 Heavy wasn't considered in comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating mmlu, scoring highest on many of the coding specific benchmarks. Overall better than Claude 4.5, in my experience, not just with the benchmarks."

I think these types of comments should just be forbidden from Hacker News. It's all feelycraft and impossible to distinguish from motivated speech.
| ▲ | kristianp 6 hours ago |
Perhaps people are steering clear of Grok due to its extremist political training.
| ▲ | guluarte an hour ago |
Opus > Codex > Gemini in my opinion; Grok is not even close.
| ▲ | claudiug 5 hours ago |
Every time I use Grok I get bad results. Basically everything is 1000% perfect from its point of view, but review the code and it's "bollocks": methods that don't exist, or just one line of code, or a method created with a nice comment: //#TODO implement