zozbot234 3 hours ago
The performance gain in the recent Flash-MoE implementations seems to come mostly from coalescing the data for each MoE layer-expert into a single sequential extent that can be read efficiently from SSD. If so, this will require some changes to the underlying GGUF format, though the GGUF standard explicitly provides for specifying different data layouts, so the additions are arguably minor. As for TurboQuant, it seems that attn-rot has recently been merged; it is a lightweight variant of TurboQuant written by the original llama.cpp author, so not an outside pull request.
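To make the coalescing idea concrete, here is a rough Python sketch of what "one sequential extent per layer-expert" could look like. It is not the Flash-MoE code and not the GGUF layout: the file name, the `write_coalesced` and `read_expert` helpers, and the in-memory index are all made up for illustration. The point is only that once an expert's tensors sit back to back, loading that expert is one seek plus one read.

```python
import os
import tempfile


def write_coalesced(path, experts):
    """Write each expert's tensors back to back (hypothetical layout).

    `experts` maps expert_id -> {tensor_name: raw_bytes}.
    Returns an index mapping expert_id -> (offset, length) of its extent.
    """
    index = {}
    with open(path, "wb") as f:
        for expert_id, tensors in experts.items():
            start = f.tell()
            # Fixed name order so the reader knows how to split the extent.
            for name in sorted(tensors):
                f.write(tensors[name])
            index[expert_id] = (start, f.tell() - start)
    return index


def read_expert(path, index, expert_id):
    """Fetch one expert with a single seek and a single sequential read."""
    offset, length = index[expert_id]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo():
    # Four toy experts, each with three 4 KiB "tensors" of distinct bytes.
    names = ("down", "gate", "up")
    experts = {
        e: {name: bytes([e * 3 + i]) * 4096 for i, name in enumerate(names)}
        for e in range(4)
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "layer0.bin")
        index = write_coalesced(path, experts)
        blob = read_expert(path, index, 2)
    expected = b"".join(experts[2][n] for n in sorted(experts[2]))
    return index, blob == expected


if __name__ == "__main__":
    index, ok = demo()
    print(index[2], ok)
```

In a layout grouped by tensor type instead (all experts' gate weights, then all up weights, and so on), the same expert would need one read per tensor at scattered offsets, which is presumably where the SSD throughput goes.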
Aurornis 2 hours ago
> As far as the TurboQuant thing goes, it seems that attn-rot has recently been merged in, which is a lightweight variety of it and written by the original llama.cpp author, so not an outside pull req.

Yes, read the first sentence of the PR for it. The project is a constant target for vibecoded PRs and they're trying to stay in front of it:

> In anticipation of the incoming flood of vibe generated PRs implementing TurboQuant, I'm raising the baseline a bit