ThomasBb 3 hours ago
Beyond the models themselves getting better, there are still huge gains available on the inference-engine side with new tricks like Dflash, MRT, and turboquant — for some use cases these can multiply speeds. There are even some model-specific optimized kernels, like those for DeepSeek 4 flash, that seem wild. Makes me feel we are nowhere near the optimum yet. Examples: https://dasroot.net/posts/2026/05/gemma-4-speed-hacks-mtp-df...
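To make the general class of inference-side trick concrete (the linked post's slug mentions MTP, i.e. multi-token prediction, which is closely related to speculative decoding), here's a toy greedy speculative-decoding sketch. The `draft_next`/`target_next` functions are hypothetical stand-ins for a cheap and an expensive model, not any real engine's API; the point is only that the output matches plain autoregressive decoding while the expensive model is called fewer times.

```python
def target_next(seq):
    # Stand-in for the expensive model: next token = sum of seq mod 10.
    return sum(seq) % 10

def draft_next(seq):
    # Stand-in cheap model: agrees with the target until sums get large.
    return sum(seq) % 10 if sum(seq) < 40 else 0

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    target_calls = 0
    while len(seq) < len(prompt) + n_tokens:
        # Draft proposes up to k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # Target scores all k+1 positions; in a real engine this is one
        # batched forward pass, so we count it as a single call.
        target_calls += 1
        verified = [target_next(seq + proposal[:i]) for i in range(k + 1)]
        # Accept the longest prefix where draft and target agree, then
        # append the target's own token at the first mismatch.
        n_accept = 0
        while n_accept < k and proposal[n_accept] == verified[n_accept]:
            n_accept += 1
        seq.extend(proposal[:n_accept])
        seq.append(verified[n_accept])
    return seq[:len(prompt) + n_tokens], target_calls
```

With this toy pair, `speculative_decode([1, 2, 3], 12)` reproduces the target model's greedy output exactly while making fewer target calls than the 12 a plain loop would need; real engines get their multiplier the same way, from the draft agreeing often enough.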
brrrrrm 3 hours ago | parent
what's MRT?