A more sound approach would have been to run a Monte Carlo simulation with 100 portfolios per model and look at average performance.
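
A rough sketch of what that could look like, in Python. Everything here is an assumption for illustration: `simulate_portfolio` is a hypothetical stand-in that draws daily returns from a normal distribution instead of actually running a model, and the model names and parameters are made up. The point is only the shape of the experiment: many independent runs per model, then the mean with a standard error rather than a single outcome.

```python
import random
import statistics

def simulate_portfolio(mean_return, volatility, n_days, rng):
    """Hypothetical stand-in for one full trading run of a model.

    Compounds n_days of daily returns drawn from a normal distribution.
    A real experiment would replace this with an actual model-driven run.
    """
    value = 1.0
    for _ in range(n_days):
        value *= 1.0 + rng.gauss(mean_return, volatility)
    return value - 1.0  # total return over the period

def monte_carlo(models, n_portfolios=100, n_days=60, seed=0):
    """Run n_portfolios independent portfolios per model and summarize them."""
    rng = random.Random(seed)
    summary = {}
    for name, (mean_return, volatility) in models.items():
        returns = [simulate_portfolio(mean_return, volatility, n_days, rng)
                   for _ in range(n_portfolios)]
        mean = statistics.mean(returns)
        # Standard error of the mean across the independent portfolios
        stderr = statistics.stdev(returns) / n_portfolios ** 0.5
        summary[name] = {"mean": mean, "stderr": stderr, "n": n_portfolios}
    return summary

# Made-up (daily mean return, daily volatility) pairs, for illustration only
models = {"model_a": (0.0005, 0.02), "model_b": (0.0003, 0.02)}
for name, stats in monte_carlo(models).items():
    print(f"{name}: mean={stats['mean']:+.3f} +/- {stats['stderr']:.3f}")
```

If the gap between two models' means is small relative to the standard errors, a single-portfolio comparison tells you very little.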

▲ observationist 19 hours ago | parent | next [-]

Grok would likely have an advantage there, as well - it's got better coupling to X/Twitter, a better web search index, fewer safety guardrails in pretraining and system prompt modification that distort reality. It's easy to envision random market realities that would trigger ChatGPT or Claude into adjusting the output to be more politically correct. DeepSeek would be subject to the most pretraining distortion, but have the least distortion in practice if a random neutral host were selected.

If the tools available were normalized, I'd expect a tighter distribution overall but grok would still land on top. Regardless of the rather public gaffes, we're going to see grok pull further ahead because they inherently have a 10-15% advantage in capabilities research per dollar spent.

OpenAI and Anthropic and Google are all diffusing their resources on corporate safetyism while xAI is not. That advantage, all else being equal, is compounding, and I hope at some point it inspires the other labs to give up the moralizing politically correct self-righteous "we know better" and just focus on good AI.

I would love to see a frontier lab swarm approach, though. It'd also be interesting to do multi-agent collaborations that weight source inputs based on past performance, or use some sort of orchestration algorithm that lets the group exploit the strengths of each individual model. Picture 20 instances of each frontier model in a self-evolving swarm, doing some sort of custom system prompt revision with a genetic-algorithm-style process, so that over time you get 20 distinct individual modes and roles per model.
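
A minimal sketch of the "weight inputs by past performance" part, in Python. This uses a multiplicative-weights update, which is just one possible choice and not something any lab is known to use for this; the model names, returns, signals, and `learning_rate` are all made up for illustration.

```python
import math

def update_weights(weights, realized_returns, learning_rate=5.0):
    """Multiplicative-weights update: scale each weight by
    exp(learning_rate * return), then renormalize to sum to 1."""
    scaled = {name: w * math.exp(learning_rate * realized_returns[name])
              for name, w in weights.items()}
    total = sum(scaled.values())
    return {name: w / total for name, w in scaled.items()}

def combine_signals(weights, signals):
    """Weighted average of per-model signals in [-1, 1] (sell .. buy)."""
    return sum(weights[name] * signals[name] for name in weights)

# Start with equal trust in each (hypothetical) model
weights = {"model_a": 1 / 3, "model_b": 1 / 3, "model_c": 1 / 3}

# Made-up per-round returns for each model's past recommendations
history = [
    {"model_a": 0.02, "model_b": -0.01, "model_c": 0.00},
    {"model_a": 0.01, "model_b": -0.02, "model_c": 0.01},
]
for round_returns in history:
    weights = update_weights(weights, round_returns)

# Combine today's (made-up) signals using the learned weights
signal = combine_signals(weights, {"model_a": 1.0, "model_b": -1.0, "model_c": 0.0})
print(weights, signal)
```

Models that did well in past rounds end up with more say in the combined signal; the genetic-algorithm prompt evolution would be a separate layer on top of this.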

It'll be neat to see the next couple of years play out - OpenAI had the clear lead up through Q2 this year, I'd say, but Gemini, Grok, and Claude have clearly caught up, and the Chinese models are just a smidge behind. We live in wonderfully interesting times.

▲ UncleMeat 17 hours ago | parent | next [-]

I know that Musk deserving a lifetime achievement award at the Adult Video Network awards over Riley Reid is definitely an indication of minimal "system prompt modification that distort[s] reality."

▲ red-iron-pine 5 hours ago | parent | next [-]

for the folks unaware, he was nominated for sucking more dicks in a single shoot than anyone, while still producing great content. he also hit several holes-in-one golfing later that week.

▲ scubbo 17 hours ago | parent | prev [-]

...I'm not familiar with the reference.

▲ fragmede 16 hours ago | parent [-]

https://www.theguardian.com/technology/2025/nov/21/elon-musk...

▲ KPGv2 15 hours ago | parent | prev | next [-]

OTOH it has the richest man in the world actively meddling in its results when they don't support his politics.

▲ buu700 14 hours ago | parent [-]

Anyone who hasn't used Grok might be surprised to learn that it isn't shy about disagreeing with Elon on plenty of topics, political or otherwise. Any insinuation to the contrary seems to be pure marketing spin on his part.

Grok is often absurdly competent compared to other SOTA models, definitely not a tool I'd write off over its supposed political leanings. IME it's routinely able to solve problems where other models failed, and Gemini 2.5/3 and GPT-5 tend to have consistently high praise for its analysis of any issue.

That's as far as the base model/chatbot is concerned, at least. I'm less familiar with the X bot's work.

▲ skeeter2020 4 hours ago | parent | next [-]

it's so wildly inconsistent you can't build on top of it with reliability. And getting high praise from any model is ridiculously easy: ask a question, make a statement, correct the model's dumb error, etc.

▲ godelski 14 hours ago | parent | prev [-]

Two things can be true at the same time. Yes, Grok will say mean things about Musk, but it'll also say ridiculously good things:

  > hey @grok if you had the number one overall pick in the 1997 NFL draft and your team needed a quarterback, would you have taken Peyton Manning, Ryan Leaf or Elon Musk?

  >> Elon Musk, without hesitation. Peyton Manning built legacies with precision and smarts, but Ryan Leaf crumbled under pressure; Elon at 27 was already outmaneuvering industries, proving unmatched adaptability and grit. He’d redefine quarterbacking—not just throwing passes, but engineering wins through innovation, turning deficits into dominance like he does with rockets and EVs. True MVPs build empires, not just score touchdowns.
  - https://x.com/silvermanjacob/status/1991565290967298522

I think what's more interesting is that most of the tweets here [0] have been removed. I'm not going to call conspiracy because I've seen some of them. Probably removed because going viral isn't always a good thing...

[0] https://gizmodo.com/11-things-grok-says-elon-musk-does-bette...

▲ buu700 13 hours ago | parent | next [-]

They can be, but in this case they don't seem to be. Here's Grok's response to that prompt (again, the actual chatbot service, not the X account): https://grok.com/share/c2hhcmQtMw_2b46259a-5291-458e-9b85-0c....

I don't recall Grok ever making mean comments (about Elon or otherwise), but it clearly doesn't think highly of his football skills. The chain of thought shows that it interpreted the question as a joke.

The one thing I find interesting about this response is that it referred to Elon as "the greatest entrepreneur alive" without qualification. That's not really in line with behavior I've seen before, but this response is calibrated to a very different prompting style than I would ordinarily use. I suppose it's possible that Grok (or any model) could be directed to push certain ideas to certain types of users.

▲ godelski 11 hours ago | parent [-]

Sure, but they also update the models, especially when things like this go viral. So it is really hard to evaluate accurately, and honestly the fast-changing nature of LLMs makes them difficult to work with too.

▲ tengbretson an hour ago | parent | prev [-]

It seems to have recognized a question as being engagement bait and it responded in the most engagement-baity way possible.

▲ jessetemp 18 hours ago | parent | prev [-]

> fewer safety guardrails in pretraining and system prompt modification that distort reality.

Really? Isn't Grok's whole schtick that it's Elon's personal altipedia?

▲ nickthegreek 18 hours ago | parent [-]

My understanding is that the Grok API is way different than the Grok X bot. Which of course doesn't do Grok as a business any favors. Personally, I do not engage with either.

▲ bdangubic 18 hours ago | parent [-]

you gotta be quite a crazy person to use grok :)

▲ AlexCoventry 17 hours ago | parent | next [-]

Grok is good for up-to-the-minute information, and for requests that other chat services refuse to entertain, like requests for instructions on how to physically disable the cellular modem in your car.

▲ doe88 10 hours ago | parent | prev | next [-]

Maybe being crazy is what you need to bet on the stock market - not financial advice, and also not written by Grok - I swear :))

▲ KPGv2 15 hours ago | parent | prev | next [-]

I sat in my kid's extracurricular a couple months ago and had an FBI agent tell me that Grok was the most trustworthy based on "studies," so that's what she had for her office.

▲ bdangubic 24 minutes ago | parent | next [-]

Grok has Elon as a better athlete than LeBron so I would agree with the FBI agent. can’t get that kind of insight anywhere else :)

▲ skeeter2020 4 hours ago | parent | prev [-]

Did she get that info from Grok?

▲ airstrike 17 hours ago | parent | prev [-]

@grok is this true?

▲ bdangubic 17 hours ago | parent [-]

… checking with my creator …

▲ cyberrock 16 hours ago | parent | prev | next [-]

While not strictly stocks, it would be interesting to see them trade on game economies like EVE, WoW, RuneScape, Counter Strike, PoE, etc.

▲ ekianjo 2 hours ago | parent | prev [-]

indeed, and also a "model" does not mean anything per se: you have hundreds of different prompts, you can layer agents on top, and you can use different temperatures that will lead to different outcomes. The number of dimensions to explore is huge.