chrislattner 2 days ago

If you want the fastest open source implementation on Blackwell and AMD MI355, check out Modular's MAX nightly. You can pip install it super fast; details here: https://www.modular.com/blog/day-zero-launch-fastest-perform...

-Chris Lattner (yes, affiliated with Modular :-)

nabakin 2 days ago

Faster than TensorRT-LLM on Blackwell? Or do you not consider TensorRT-LLM open source because some dependencies are closed source?

melodyogonna 2 days ago

I reviewed the TensorRT-LLM commit history from the past few days and couldn't find any updates regarding Gemma 4 support. By contrast, here is the reference for MAX: https://github.com/modular/modular/commit/57728b23befed8f3b4...

nabakin 2 days ago

If OP meant they have the fastest implementation of Gemma 4 on Blackwell at the moment, I guess that is technically true. I doubt that will hold up when TensorRT-LLM finishes their implementation though.

pama 2 days ago

How is the sglang performance on Blackwell for this model?

nabakin 2 days ago

Dunno but there's a PR for it. Probably also more performant than Modular.

jjcm 2 days ago

What % speedup should I expect vs. just running this with the standard PyTorch approach?