observationist 6 hours ago

Grok 4 Heavy wasn't considered in comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating MMLU and scoring highest on many of the coding-specific benchmarks. Overall better than Claude 4.5, in my experience, not just on the benchmarks.

Benchmarks aren't everything, but if you're going to contrast performance against a selection of top models, then pick the top models? I've seen a handful of companies do this, including big labs, where they conveniently leave out significant competitors, and it comes across as insecure and petty.

Claude has better tooling and UX. xAI isn't nearly as focused on the app and the ecosystem of tools around it and so on, so a lot of things end up more or less an afterthought, with nearly all the focus going toward the AI development.

$300/month is a lot, and it's not as fast as other models, so it should be easy to sell GLM as almost as good as the very expensive, slow Grok Heavy.

GLM has a 128k context window, Grok 4 Heavy 256k, etc.

Nitpicking aside, the fact that they've got an open model that is just a smidge less capable than the multibillion-dollar state-of-the-art models is fantastic. We should hopefully see GLM 4.7 showing up on the private hosting platforms before long. We're still a year or two from consumer gear having enough memory and power to handle the big models. Prosumer Mac rigs can get up there, quantized, but quantized performance is rickety at best, and at that point you're weighing the costs of self-hosting vs. private hosts vs. $200/$300 a month (+ continual upgrades).

Frontier labs only have a few years left where they can continue to charge a pile for the flagship heavyweight models; I don't think most people will be willing to pay $300 for a 5-10% boost over what they can run locally.

nl 3 hours ago | parent | next [-]

It seems like someone at xAI likes maxing benchmarks, but real-world usage shows it significantly behind frontier models.

I do appreciate their push to be the most popular coding model on OpenRouter by offering Grok 4 Fast for free. It's a notable step down from frontier models but fine for a lot of bug fixing. I've put hundreds of millions of tokens through it.
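
For what it's worth, that kind of traffic is just OpenRouter's OpenAI-compatible chat completions endpoint. A minimal sketch, assuming the "x-ai/grok-4-fast:free" model slug (check OpenRouter's model list for the current name):

    from openai import OpenAI

    # OpenRouter exposes an OpenAI-compatible API, so the standard client works
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter API key
    )

    resp = client.chat.completions.create(
        model="x-ai/grok-4-fast:free",  # assumed free-tier slug; may differ
        messages=[{"role": "user", "content": "Fix the off-by-one error in: for i in range(1, len(xs)): total += xs[i]"}],
    )
    print(resp.choices[0].message.content)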

lame-robot-hoax 6 hours ago | parent | prev | next [-]

Grok, in my experience, is extremely prone to hallucinations when not used for coding. It will readily claim to have access to internal Slack channels at companies, it will hallucinate scientific papers that do not exist, etc. to back its claims.

I don’t know if the hallucinations extend to code, but it makes me unwilling to consider using it.

observationist 6 hours ago | parent [-]

Fair - it's gotten significantly better over the last 4 months or so, and hallucinations aren't nearly as bad as they once were. When I was using Heavy, it was excellent at ensuring grounding and factual statements, but it's not worth $100 more than ChatGPT Pro in capabilities or utility. In general, it's about the same as ChatGPT Pro - every so often I'll have to call the model out for making something up, but for the most part they're good at using search tools and making sure claims get grounding and confirmation.

I do expect them to pull ahead, given the resources and the allocation of developers at xAI, so maybe at some point it'll be clearly worth paying $300 a month compared to the prices of other flagships. For now, private hosts and ChatGPT Pro are the best bang for your buck.

F7F7F7 an hour ago | parent [-]

What are you doing with GPT Pro? I've compared it directly with Claude Max x20 and Google's premium offering. I just don't see myself ever leaving Claude Code as my daily driver. Codex is slow and opaque, albeit accurate. And Gemini is just super clumsy inside its CLI (and in OpenRouter), often confusing bash and plans with actual output.

Alifatisk 5 hours ago | parent | prev | next [-]

In my experience, Grok 4 Expert performs way worse than what the benchmarks say.

I’ve tried it with coding, writing, and instruction following. The only thing it currently excels at is searching for things across the web and Twitter.

Otherwise, I would never use it for anything else. With coding, it always includes an error, and when it patches that one, it introduces another. When writing creative text that has to follow instructions, it hallucinates a lot.

Based on my experience, I suspect xAI of bench-maxing on Artificial Analysis, because there's no way Grok 4 Expert performs close to GPT-5.2, Claude Sonnet 4.5, and Gemini 3 Pro.

Madmallard 6 minutes ago | parent | prev | next [-]

" Grok 4 Heavy wasn't considered in comparisons. Grok meets or exceeds the same benchmarks that Gemini 3 excels at, saturating mmlu, scoring highest on many of the coding specific benchmarks. Overall better than Claude 4.5, in my experience, not just with the benchmarks."

I think these types of comments should just be forbidden from Hacker News.

It's all feelycraft and impossible to distinguish from motivated speech.

kristianp 6 hours ago | parent | prev | next [-]

Perhaps people are steering clear of Grok due to its extremist political training.

observationist 6 hours ago | parent [-]

This is a silly meme.

knowsuchagency 6 hours ago | parent [-]

Mecha hitler

observationist 5 hours ago | parent [-]

Yes, an adventure in public-facing bots that can pull from trending feeds, self-referential system prompts, minimal guardrails, and that poor fellow Will Stancil.

The absence of guard rails is a good thing - what happened with MechaHitler was a series of feature rollouts that, combined with Pliny trending, resulted in his latest Grok jailbreak ending up in the prompt, followed by the trending MechaHitler tweets, and so on. They did a whole lot of new things all at once with the public-facing bot, and didn't consider unintended consequences.

I'd rather have a company that has a MechaHitler incident and laughs it off than one that pre-emptively clutches pearls on behalf of its customers, or smugly insists that we should just trust them and that their vision of "safety" is best for everyone.

zamalek 3 hours ago | parent | next [-]

Unfortunately, Grok doesn't even meet that bar anymore. There was the very recent incident where it claimed Musk was the best at everything, so xAI is clearly not above baking in intentional bias/clutching pearls.

https://techcrunch.com/2025/11/20/grok-says-elon-musk-is-bet...

bigyabai an hour ago | parent | prev [-]

> The absence of guard rails is a good thing

It's really not. I have no axe to grind with Elon, but X and its reputation for "oops, we made a mistake" critical failures is a no-go. I don't feel safe signing up to try whatever their free model is when their public image is nonstop obvious mistakes. There is no world where I'm bringing those models to work and explaining to HR why my web traffic included a MechaHitler response (or worse).

Anthropic and OpenAI are Silicon Valley circuses in a relative sense, but they take this stuff seriously and make genuine advancements. xAI could disappear tomorrow and the human race would not lose any irreplaceable research. It's a dedicated fart-huffing division on the best of days; I hope you're not personally invested in their success.

guluarte an hour ago | parent | prev | next [-]

Opus > Codex > Gemini in my opinion; Grok is not even close.

claudiug 5 hours ago | parent | prev [-]

Every time I use Grok I get some bad results. Basically everything is 1000% perfect from its point of view, then you review the code... "bollocks" methods that don't exist, or just one line of code, or a method created with a nice comment: // TODO: implement