| ▲ | boole1854 5 days ago |
| It's interesting that the benchmark they are choosing to emphasize (in the one chart they show and even in the "fast" name of the model) is token output speed. I would have thought it an uncontroversial view among software engineers that token quality is much more important than token output speed. |
|
| ▲ | eterm 5 days ago | parent | next [-] |
| It depends how fast. If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts could possibly be more valuable than a slower, higher-quality output. Ad absurdum: if it could ingest and work on an entire project in milliseconds, then it has much greater value to me than a process which might take a day to do the same, even if the likelihood of success also takes a strong hit. It simply enables a different method of interactive working. Or it could supply 3 different suggestions in-line while working on something, rather than a process which needs to be explicitly prompted and waited on. Latency can have a critical impact not just on user experience but on the very way tools are used. Now, will I try Grok? Absolutely not, but that's a personal decision due to not wanting anything to do with X, rather than a purely rational decision. |
| |
| ▲ | 34679 5 days ago | parent | next [-] | | >If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts could possibly be more valuable than a slower, higher-quality output. Before MoE was a thing, I built what I called the Dictator: one strong model working with many weaker ones to achieve a similar result to MoE. But all the Dictator ever got was Garbage In, so guess what came out? | | |
| ▲ | LinXitoW 4 days ago | parent | next [-] | | Sounds more like a Mixture of Idiots. | |
| ▲ | charcircuit 5 days ago | parent | prev | next [-] | | That doesn't seem similar to MoE at all. | | |
| ▲ | 34679 4 days ago | parent [-] | | Well, I really didn't provide sufficient detail to make that determination either way. |
| |
| ▲ | _kb 5 days ago | parent | prev [-] | | You just need to scale out more. As you approach infinite monkeys, sorry - models, you'll surely get the result you need. | | |
| ▲ | dingnuts 5 days ago | parent [-] | | why's this guy getting downvoted? SamA says we need a Dyson Sphere made of GPUs surrounding the solar system and people take it seriously but this guy takes a little piss out of that attitude and he's downvoted? this site is the fucking worst | | |
| ▲ | kelnos 5 days ago | parent [-] | | Maybe because this site is full of people with differing opinions and stances on things, who react differently to what people say and do? Not sure who was taking SamA seriously about that; personally I think he's a ridiculous blowhard, and statements like that just reinforce that view for me. Please don't make generalizations about HN's visitors'/commenters' attitudes on things. They're never generally correct. |
|
|
| |
| ▲ | postalcoder 5 days ago | parent | prev | next [-] | | Besides being a faster slot machine, a fast agentic LLM, to the extent that it's any good, would be very nice to have for codebase analysis. | |
| ▲ | fmbb 5 days ago | parent [-] | | For 10% less time you can get 10% worse analysis? I don’t understand the tradeoff. | | |
| ▲ | kelnos 5 days ago | parent [-] | | I mean, if that's literally what the numbers are, sure, maybe that's not great. But what if it's 10% less time and 3% worse analysis? Maybe that's valuable. |
|
| |
| ▲ | giancarlostoro 5 days ago | parent | prev [-] | | > If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts could possibly be more valuable than a slower, higher-quality output. Asking any model to do things in steps is usually better too, as opposed to feeding it three essays. | |
| ▲ | ffsm8 5 days ago | parent [-] | | I thought the current vibe was doing the former to produce the latter and then using the output as the task plan? | |
| ▲ | giancarlostoro 5 days ago | parent [-] | | I don't know what other people are doing; I mostly use LLMs: * Scaffolding * Ask it what's wrong with the code * Ask it for improvements I could make * Ask it what the code does (amazing for old code you've never seen) * Ask it to provide architect-level insights into best practices One area where they all seem to fail is lesser-known packages: they tend to reference old functionality that is not there anymore, or never was; they hallucinate. Which is part of why I don't ask it for too much. Junie did impress me, but it was very slow, so I would love to see a version of Junie using this version of Grok; it might be worthwhile. | | |
| ▲ | ffsm8 5 days ago | parent | next [-] | | > Ask it what's wrong with the code That's phase 1; ask it to "think deeply" (a Claude keyword, only works with the Anthropic models) while doing that. Then ask it to make a detailed plan for solving the issue, write that into current-fix.md, and add clearly testable criteria for when the issue is solved. Now you manually check whether the criteria sound plausible; if not, its analysis failed and its output was worthless. But if it sounds good, you can then start a new session and ask it to read the markdown file and implement the change. Now you can plausibility-check the diff and are likely done. But as the sister comment pointed out, agentic coding really breaks apart with large files like you usually have in brownfield projects. | |
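For illustration, a minimal sketch of the same plan-then-implement split, written against the plain Anthropic Python SDK rather than Claude Code sessions; the model id, file names, and prompt wording below are placeholders, not the actual setup described above.

    # Rough sketch only: model id, file names, and prompts are placeholders.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Phase 1: produce a reviewable plan with testable criteria, no code changes yet.
    plan = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model id
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Analyse this bug report and write a detailed fix plan with "
                       "clearly testable acceptance criteria. Do not write code yet.\n\n"
                       + open("bug_report.md").read(),  # placeholder input file
        }],
    )
    with open("current-fix.md", "w") as f:
        f.write(plan.content[0].text)

    # ...a human reviews current-fix.md here; only then start a fresh session...

    # Phase 2: implement strictly from the approved plan.
    impl = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": "Implement exactly the change described in this plan:\n\n"
                       + open("current-fix.md").read(),
        }],
    )
    print(impl.content[0].text)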
| ▲ | dingnuts 5 days ago | parent | prev | next [-] | | > amazing for old code you've never seen Not if you have too much! A few hundred thousand lines of code and you can't ask shit! Plus, you just handed over your company's entire IP to whoever hosts your model. | |
| ▲ | giancarlostoro 5 days ago | parent | next [-] | | If Apple keeps improving things, you can run the model locally. I'm able to run models on my MacBook with an M4 that I can't even run on my 3080 GPU (mostly due to VRAM constraints), and they run reasonably fast. Would the 3080 be faster? Sure, but the M4 is also plenty fast, to the point where I'm not sitting there waiting longer than I wait for a cloud model to "reason" and look things up. I think the biggest thing for offline LLMs will have to be consistent web search through an API like Google's or some other search engine's; maybe Kagi could provide an API for people who self-host LLMs (not necessarily for free, but it would still be useful). |
| ▲ | miohtama 5 days ago | parent | prev [-] | | It's a fair trade-off for smaller companies where the IP or the software is a necessary evil, not the main unique value added. It's hard to see what evil anyone would do with crappy legacy code. The IP risks taken may be well worth the productivity boost. |
| |
| ▲ | miohtama 5 days ago | parent | prev [-] | | I hope that in the future tooling and MCP will be better, so agents can directly check what functionality exists in the installed package version instead of hallucinating. |
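A toy sketch of the kind of check meant here, in plain Python (not an existing MCP server): report what the installed version of a package actually exposes, so an agent can verify a symbol exists before generating code that uses it. The `requests` example is arbitrary.

    # Toy example, not a real MCP server: report a package's installed version
    # and its public top-level names so an agent can check them before use.
    import importlib
    import importlib.metadata

    def describe_package(name: str) -> dict:
        module = importlib.import_module(name)
        return {
            "version": importlib.metadata.version(name),
            "public_names": sorted(n for n in dir(module) if not n.startswith("_")),
        }

    # e.g. run this before letting the agent write code against `requests`
    print(describe_package("requests"))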
|
|
|
|
|
| ▲ | jsheard 5 days ago | parent | prev | next [-] |
| That's far from the worst metric that xAI has come up with... https://xcancel.com/elonmusk/status/1958854561579638960 |
| |
| ▲ | Rover222 5 days ago | parent [-] | | what's wrong with rapid updates to an app? | | |
| ▲ | LeafItAlone 5 days ago | parent | next [-] | | I have a coworker who outshines everybody else in number of commits and pushes in any given time period. It’s pretty amazing the number they can accomplish! Of course, 95% of them are fixing things they broke in earlier commits and their overall quality is the worst on the team. But, holy cow, they can output crap faster than anyone I’ve seen. | |
| ▲ | kelnos 5 days ago | parent | prev | next [-] | | That metric doesn't really tell you anything. Maybe I'm making rapid updates to my app because I'm a terrible coder and I keep having to push out fixes to critical bugs. Maybe I'm bored and keep making little tweaks to the UI, and for some reason think that's worth people's time to upgrade. (And that's another thing: frequent upgrades can be annoying!) But sure, ok, maybe it could mean making much faster progress than competitors. But then again, it could also mean that competitors have a much more mature platform, and you're only releasing new things so often because you're playing catch-up. (And note that I'm not specifically talking about LLMs here. This metric is useless for pretty much any kind of app or service.) | |
| ▲ | ori_b 5 days ago | parent | prev | next [-] | | It's like measuring how fast your car can go by counting how often you clean the upholstery. There's nothing wrong with doing it, but it's entirely unrelated to performance. | | |
| ▲ | Rover222 5 days ago | parent [-] | | I don't think he was saying their release cadence is a direct metric of their model performance. Just that the team iterates on and improves the app user experience much more quickly than other teams do. | |
| ▲ | jdiff 5 days ago | parent | next [-] | | He seems to be stating that app release cadence correlates with internal upgrades that correlate with model performance. There is no reason for this to be true. He does not seem to be talking about user experience. | |
| ▲ | kelnos 5 days ago | parent | prev | next [-] | | Oh c'mon, I know it's usually best to try to interpret things in the most charitable way possible, but clearly Musk was implying the actual meat of things, the model itself, is what's being constantly improved. But even if your interpretation is correct, frequency of releases still is not a good metric. That could just mean that you have a lot to fix, and/or you keep breaking and fixing things along the way. | |
| ▲ | ori_b 5 days ago | parent | prev [-] | | It's a fucking chat. How many times a day do you need to ship an update? |
|
| |
| ▲ | cosmicgadget 5 days ago | parent | prev | next [-] | | They aren't a metric for showing you are better than the competition. | | |
| ▲ | Rover222 5 days ago | parent [-] | | It's a metric for showing you can move more quickly on product improvements. Anyone who has worked on a product team at a large tech company knows how much things get slowed down by process bloat. |
| |
| ▲ | tzs 5 days ago | parent | prev [-] | | See the reply, currently at #2 on that Twitter thread, from Jamie Voynow. |
|
|
|
| ▲ | ojosilva 5 days ago | parent | prev | next [-] |
| After trying Cerebras' free API (not affiliated), which delivers Qwen Coder 480B and gpt-oss-120b at a mind-boggling ~3000 tps, output speed is the first thing I check when considering a model. I just wish Cerebras had a better overall offering on their cloud: usage is capped at 70M tokens/day, and people are reporting that the cap is easily hit and highly crippling for daily coding. |
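For anyone curious what that looks like in practice: Cerebras exposes an OpenAI-compatible endpoint, so a minimal call is just the standard OpenAI client pointed at their base URL. In this sketch the base URL and model id are assumptions to verify against the current Cerebras docs, and the API key is a placeholder.

    # Minimal sketch: OpenAI-compatible client pointed at Cerebras.
    # Base URL and model id are assumptions; check the current Cerebras docs.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.cerebras.ai/v1",   # assumed endpoint
        api_key="YOUR_CEREBRAS_API_KEY",         # placeholder
    )

    resp = client.chat.completions.create(
        model="gpt-oss-120b",  # assumed model id
        messages=[{"role": "user",
                   "content": "Write a Python function that reverses a linked list."}],
    )
    print(resp.choices[0].message.content)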
| |
|
| ▲ | peab 5 days ago | parent | prev | next [-] |
| Depends for what. For autocompleting simple functions (string manipulation, function definitions, etc.), the quality bar is pretty easy to hit, and speed is important. If you're just vibe coding, then yeah, you want quality. But if you know what you're doing, I find having a dumber, fast model is often nicer than a slow, smart model that you still need to correct a bit, because it's easier to stay in flow state. With the slow reasoning models, the workflow is more like working with another engineer, where you have to review their code in a PR. |
|
| ▲ | M4v3R 5 days ago | parent | prev | next [-] |
| Speed absolutely matters. Of course, if the quality is trash then it doesn't matter, but a model that's on par with Claude Sonnet 4 AND very speedy would be an absolute game changer in agentic coding. Right now you craft a prompt, hit send, and then wait, and wait, and then wait some more, and after some time (anywhere from 30 seconds to minutes later) the agent finishes its job. It's not long enough for you to context-switch to something else, but long enough to be annoying, and these wait times add up over the whole day. It also discourages experimentation if you know that every prompt will potentially take multiple minutes to finish. If it instead finished in seconds, you could iterate faster. This would be especially valuable in the frontend world, where you often tweak your UI code many times until you're satisfied with it. |
|
| ▲ | CuriouslyC 5 days ago | parent | prev | next [-] |
| For agentic workflows, speed and good tool use are the most important things. Agents should use tools by design, and that can include reasoning tools and oracles. The agent doesn't need to be smart; it just needs a line to someone who is and who can give it a hyper-detailed plan to follow. |
|
| ▲ | 6r17 5 days ago | parent | prev | next [-] |
| Tbh I kind of disagree; there are certain use cases where speed would legitimately be much more interesting, such as generating a massive amount of HTML. Though I agree this makes it look like even more of a joke for anything serious. They reduce the costs, though! |
|
| ▲ | scottyeager 4 days ago | parent | prev | next [-] |
| Fast inference can change the entire dynamic of working with these tools. At typical speeds, I usually try to do something else while the model works. When the model works really fast, I can easily wait for it to finish. So the total difference includes the cost of context switching, which is big. Speed potentially matters less in a scenario focused on more autonomous agents running in the background. However, I think most usage is still highly interactive these days. |
|
| ▲ | defen 5 days ago | parent | prev | next [-] |
| > I would have thought it an uncontroversial view among software engineers that token quality is much more important than token output speed. We already know that in most software domains, fast (as in, getting it done faster) is better than 100% correct. |
|
| ▲ | jml78 5 days ago | parent | prev | next [-] |
| To a point. If gpt5 takes 3 minutes to produce output and qwen3 does it in 10 seconds, and the agent can iterate 5 times and finish before gpt5, why do I care if gpt5 one-shot it and qwen took 5 iterations? |
| |
| ▲ | wahnfrieden 5 days ago | parent | next [-] | | It doesn’t though. Fast but dumb models don’t progressively get better with more iterations. | | |
| ▲ | Jcampuzano2 5 days ago | parent | next [-] | | There are many ways to skin a cat. Often all it takes is resetting to a checkpoint, or undoing and adjusting the prompt a bit with additional context, and even dumber models can get things right. I've used grok code fast plenty this week, alongside gpt 5 when I need to pull out the big guns, and it's refreshing to use a fast model for smaller changes or for tasks that are tedious but repetitive, like refactoring. | |
| ▲ | wahnfrieden 5 days ago | parent [-] | | Yes fast/dumb models are useful! But that's not what OP said - they said they can be as useful as the large models by iterating them. Do you use them successfully in cases where you just had to re-run them 5 times to get a good answer, and was that a better experience than going straight to GPT 5? |
| |
| ▲ | dmix 5 days ago | parent | prev [-] | | That very much depends on the use case. Different models for different things. Not everyone is solving complicated things every time they hit cmd-k in Cursor or use autocomplete, and they can easily switch to a different model when working harder stuff out via longer-form chat. |
| |
| ▲ | ant6n 4 days ago | parent | prev [-] | | ChatGPT 5 takes 5 times as long to finish, and still produces garbage. |
|
|
| ▲ | giancarlostoro 5 days ago | parent | prev | next [-] |
| I'm more curious whether it's based on Grok 3 or something else; I used to get reasonable answers from Grok 3. If that's the case, the trick that works for Grok and basically any model out there is to ask for things in order and piecemeal, not all at once. Some models will be decent at the 'all at once' approach, but when I and others have asked in steps, it gave us much better output. I'm not yet sure how I feel about Grok 4; I have not really been impressed by it. |
|
| ▲ | esafak 5 days ago | parent | prev | next [-] |
| I agree. Coding faster than humans can review it is pointless. Between fast, good, and cheap, I'd prioritize good and cheap. Fast is good for tool use and synthesizing the results. |
|
| ▲ | furyofantares 5 days ago | parent | prev | next [-] |
| Fast can buy you a little quality by getting more inference on the same task. I use Opus 4.1 exclusively in Claude Code but then I also use zen-mcp server to get both gpt5 and gemini-2.5-pro to review the code and then Opus 4.1 responds. I will usually have eyeballed the code somewhere in the middle here but I'm not fully reviewing until this whole dance is done. I mean, I obviously agree with you in that I've chosen the slowest models available at every turn here, but my point is I would be very excited if they also got faster because I am using a lot of extra inference to buy more quality before I'm touching the code myself. |
| |
| ▲ | dotancohen 5 days ago | parent [-] | | > I use Opus 4.1 exclusively in Claude Code but then I also use zen-mcp server to get both gpt5 and gemini-2.5-pro to review the code and then Opus 4.1 responds.
I'd love to hear how you have this set up. | | |
| ▲ | mchusma 5 days ago | parent [-] | | This is a nice setup. I wonder how much it helps in practice? I suspect most of the problems opus has for me are more context related, and I’m not sure more models would help. Speculation on my part. |
|
|
|
| ▲ | londons_explore 5 days ago | parent | prev [-] |
| [flagged] |