kcb (2 days ago):

Nemotron 3 Super was released recently. That's a direct competitor to gpt-oss-120b. https://developer.nvidia.com/blog/introducing-nemotron-3-sup...
evilduck (2 days ago):

In terms of ability, maybe; in terms of speed, it's not even close. Check out the prompt processing speeds between them: https://kyuz0.github.io/amd-strix-halo-toolboxes/

gpt-oss-120b is over 600 tokens/s PP for all but one backend; nemotron-3-super is at best 260 tokens/s PP. Comparing token generation, it's again about 50 tokens/sec vs 15 tokens/sec. That really bogs down agentic tooling. Something needs to be categorically better to justify cutting output speed to a third, not just playing in the margins.
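A rough back-of-envelope sketch of why those throughput numbers matter for agentic use: per-turn latency is prompt tokens divided by PP speed plus output tokens divided by TG speed. The prompt/output sizes below are hypothetical; the throughput figures are the ones quoted above.

```python
def turn_latency(prompt_tokens: int, output_tokens: int,
                 pp_tps: float, tg_tps: float) -> float:
    """Seconds for one model turn: prompt processing plus generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps

# Hypothetical agentic turn: large context, modest output.
prompt, output = 8000, 500

gpt_oss = turn_latency(prompt, output, pp_tps=600, tg_tps=50)
nemotron = turn_latency(prompt, output, pp_tps=260, tg_tps=15)

print(f"gpt-oss-120b:     {gpt_oss:.1f} s/turn")   # ~23.3 s
print(f"nemotron-3-super: {nemotron:.1f} s/turn")  # ~64.1 s
```

With those assumed sizes the slower model roughly triples the wall-clock time per turn, which compounds across the many turns of an agentic loop.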
coder68 (2 days ago):

I gave it a whirl but was unenthused. I'll try it again, but so far I have not really enjoyed any of the Nvidia models, though they are best in class for execution speed.