TheAceOfHearts | 16 hours ago
They scored 31.1% on ARC-AGI-2, which puts them in first place. Also notable is which models they include for comparison: Gemini 2.5 Pro, Claude Sonnet 4.5, and GPT-5.1. That seems like a minor snub of Grok 4 / Grok 4.1.
buildfocus | 15 hours ago
My impression is that Grok is very rarely used in practice outside of a niche of die-hard users, partly because it's tuned very differently from other models and partly because of the public reputation that goes with that. https://firstpagesage.com/reports/top-generative-ai-chatbots... suggests 0.6% of chat use cases, well below the other big names, and I suspect those chat stats are higher than for other scenarios like business usage. Given all that, I can see why Gemini might not be focused on competing with them.
kranke155 | 15 hours ago
Grok seems extremely prone to hallucination in my experience. It also constantly asserts certainty on fuzzy topics. | ||||||||||||||
jmmcd | 15 hours ago
About ARC 2: I would want to hear more detail about prompts, frameworks, thinking time, etc., but those don't matter too much. The main caveat is that this is probably on the public test set, so it could be in the pretraining data, and there could even be some ARC-focussed post-training; I don't think we know yet, and we might never know. But for any reasonable setup, assuming no egregious cheating, that is an amazing score on ARC 2.