wolttam an hour ago

> inspired to write this post by antirez’s recent project DwarfStar 4, which is a version of llama.cpp that’s been stripped down to run only DeepSeek-V4-Flash

This is not true; it is its own project.

Indebted to llama.cpp, sure, but not a stripped-down version.

antirez an hour ago | parent | next [-]

Yep, the code overlap is minimal: a few kernels, plus some quantization code for the quantizer it implements. DwarfStar 4 is not a fork of llama.cpp, but without llama.cpp the project would be a lot more lacking, since I was able to get all the details that mattered in a second. Still, it is not a stripped-down llama.cpp. None of this reduces in any way how much llama.cpp matters, not just for this project but for all the projects that followed and are following. It's not only a matter of code: the path to follow, the quant formats, the lessons, the optimized kernels you can check to learn the patterns.

embedding-shape an hour ago | parent | prev | next [-]

The truth seems to sit somewhere in between: DwarfStar 4 seems to exist mainly because of llama.cpp, the authors were clearly very inspired by llama.cpp's code, and in some places even literally copied pieces from it, all with proper attribution. I'm not trying to say this is bad; it seems OK to me:

> ds4.c does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there. We are thankful and indebted to llama.cpp and its contributors. Their implementation, kernels, tests, and design choices were an essential reference while building this DeepSeek V4 Flash-specific inference path. Some source-level pieces are retained or adapted here under the MIT license: GGUF quant layouts and tables, CPU quant/dot logic, and certain kernels. For this reason, and because we are genuinely grateful, we keep the GGML authors copyright notice in our LICENSE file. - https://github.com/antirez/ds4#acknowledgements-to-llamacpp-...

It's been a lot of fun to play around with since https://news.ycombinator.com/item?id=48142885 (~2 days ago); I've managed to push generation from 47.85 t/s to 57.07 t/s so far :)

antirez an hour ago | parent [-]

Send patches! But remember that many speedups end up not being exactly correct and the logits drift. There is extensive testing, though, and now even ds4-eval to check how it performs.
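
The idea is just to run the same prompt through the reference path and the optimized one and compare the raw logits; something along these lines (a minimal sketch, not the actual ds4-eval code, names are illustrative):

    /* Sketch of a logits-drift check: dump logits from the baseline run
       and from the optimized run, then compare them element by element. */
    #include <math.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Largest absolute difference between two logits vectors of length n. */
    static float max_logit_drift(const float *ref, const float *opt, size_t n) {
        float worst = 0.0f;
        for (size_t i = 0; i < n; i++) {
            float d = fabsf(ref[i] - opt[i]);
            if (d > worst) worst = d;
        }
        return worst;
    }

    int main(void) {
        /* Illustrative values: in practice ref/opt come from the two runs. */
        float ref[4] = { 1.20f, -0.35f, 2.7100f, 0.08f };
        float opt[4] = { 1.20f, -0.35f, 2.7102f, 0.08f };
        float drift = max_logit_drift(ref, opt, 4);
        printf("max |delta logit| = %g\n", drift);
        return drift > 1e-3f; /* nonzero exit if the optimized path drifted */
    }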

embedding-shape 41 minutes ago | parent [-]

Hah, these are quick hacks for me to understand CUDA better; I'm unlikely to have time to make them proper enough :( But maybe opening an issue describing what I tried and what worked makes sense.

I did confirm there's no logits drift, since you so nicely provided tooling for ensuring exactly this. Thanks for the great care that has obviously gone into the project, it's been a pleasure to play around with! :)
