dvt 15 hours ago

What excites me most about these new four-figure tokens-per-second models is that you can essentially do multi-shot prompting (plus nudging) without the user even feeling it, potentially fixing some of the weird hallucinatory/non-deterministic behavior we sometimes end up with.
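A minimal sketch of what "multi-shot prompting the user doesn't feel" could look like: sample several completions for the same prompt and take a majority vote, which is cheap when each shot finishes in a fraction of a second. The `generate` function here is a hypothetical stand-in (it just returns canned answers, one of them deliberately wrong); a real client would call an inference API instead.

```python
def generate(prompt, shot):
    # Hypothetical stand-in for a fast model call; a real client would
    # stream tokens from an inference API here. We simulate a model that
    # answers correctly most of the time but occasionally hallucinates.
    answers = ["4", "4", "22", "4", "4"]
    return answers[shot % len(answers)]

def multi_shot(prompt, n_shots=5):
    # At 1000+ tok/s, several independent shots still fit inside a
    # single user-visible turn.
    candidates = [generate(prompt, shot=i) for i in range(n_shots)]
    # Majority vote smooths out one-off non-deterministic errors.
    return max(set(candidates), key=candidates.count)

print(multi_shot("What is 2 + 2?"))  # the stray "22" is outvoted -> "4"
```

The same loop structure also covers the "nudging" case: instead of identical prompts, each shot can append a critique of the previous answer.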

volodia 13 hours ago | parent | next [-]

That is also our view! We see Mercury 2 as enabling very fast iteration for agentic tasks. A single shot at a problem might be less accurate, but because the model has a shorter execution time, it enables users to iterate much more quickly.

lostmsu 11 hours ago | parent | prev [-]

Regular models are very fast if you do batch inference. GPT-OSS 20B gets close to 2k tok/s on a single 3090 at bs=64 (might be misremembering details here).

rahimnathwani 8 hours ago | parent [-]

Right, but everyone else is talking about latency, not throughput.
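The distinction is easy to see with back-of-envelope numbers, using the (self-admittedly possibly misremembered) figures from the comment above: 2k tok/s aggregate across a batch of 64 requests leaves each individual request with only a small fraction of that rate.

```python
# Batched throughput vs. per-stream latency, using the figures quoted above.
aggregate_tok_s = 2000   # total tokens/s across the whole batch
batch_size = 64          # concurrent requests sharing one GPU

per_stream_tok_s = aggregate_tok_s / batch_size
print(f"{per_stream_tok_s:.2f} tok/s per request")  # 31.25 tok/s

# Wall-clock wait for one user to receive a 500-token reply:
reply_tokens = 500
print(f"{reply_tokens / per_stream_tok_s:.0f} s")  # 16 s
```

So the batch-inference numbers describe aggregate throughput, while each user still experiences ~31 tok/s of latency, which is the quantity the diffusion-model speed claims are about.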