LoganDark 4 hours ago
I poured a couple of days into custom Burn inference for Qwen3-Coder-Next, only to find it doesn't come with a speculative decoder, so on my M4 Max I can't push it much further than 120t/s. That's still kinda slow, though faster than llama.cpp's 70.9t/s and MLX's 80.6t/s with the same model. Claude Fable 5 is recommending I use the Qwen3 MTP. I worry that will compromise the quality somewhat, but I might give it a try to see if I can get more usable speeds.