zozbot234 6 hours ago

Cloud hardware is not inherently more "proper" than what's being proposed here, there's nothing wrong per se about targeting slower inference speeds in an on prem single-user context.

Aurornis 6 hours ago | parent | next [-]

> Cloud hardware is not inherently more "proper" than what's being proposed here

Cloud hardware can run the original model. Quantization will reduce quality. The quality drop to Q4 is not trivial.
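Roughly, per-group Q4-style quantization works like this (an illustrative sketch, not any real GGUF/llama.cpp kernel; the group size and scaling scheme are made up for clarity):

```python
# Hypothetical sketch of symmetric 4-bit quantization for one weight
# group, showing where the rounding error comes from.

def quantize_q4(weights):
    """Map floats to signed 4-bit integers in [-8, 7] with one shared scale."""
    scale = max(abs(w) for w in weights) / 7  # one scale per group
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.97, 0.04, -0.31]
q, scale = quantize_q4(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(max_err, 4))
```

Every weight in the group gets snapped to one of 16 grid points, so the worst-case error per weight is about half the scale step; across billions of weights that accumulated error is what shows up as the quality drop.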

Cloud hardware is also massively faster in time to first token and token generation speed.

> there's nothing wrong per se about targeting slower inference speeds in an on prem single-user context.

If that's what the user wants and expects, then it's fine.

Most people working interactively with an LLM would suffer from slower turns.

zozbot234 5 hours ago | parent [-]

> Cloud hardware can run the original model. Quantization will reduce quality.

New models are often released in quantized format to begin with. This is true of both Kimi and the new DeepSeek V4 series. There is no "original model": the model is produced using Quantization Aware Training (QAT).
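The core of QAT can be sketched in a few lines (assumed 4-bit symmetric fake quantization with a straight-through estimator; names and numbers are illustrative, not any specific framework's API):

```python
# Minimal sketch of the fake-quantization step at the heart of QAT.

def fake_quant(w, bits=4):
    # Forward pass: round weights onto the low-precision grid so the
    # training loss "sees" quantization error from the start.
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = max(abs(x) for x in w) / qmax or 1.0
    return [round(x / scale) * scale for x in w]

# During training, the forward pass uses fake_quant(w); the backward pass
# copies gradients straight through to the full-precision master weights,
# so the model learns weights that survive quantization.
print(fake_quant([0.8, -0.35, 0.05]))
```

Because the network is trained against the quantized grid rather than quantized after the fact, the released low-bit checkpoint is the model the benchmarks were run on.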

Aurornis 5 hours ago | parent [-]

> There is no "original model", the model is generated using Quantization Aware Training (QAT).

The original model is the model used for the published benchmarks.

People will say "You can run it locally!" and then show the benchmarks of the original model, but what they really mean is that you can run a heavily quantized adaptation of the model, which has different performance characteristics.

zozbot234 4 hours ago | parent [-]

That remark was specific to newer models like the Kimi 2.x and DeepSeek V4 series, as clearly stated in my comment.

As for other models, we quantize them because we are generally constrained by the model's total footprint in bytes. Running a larger model quantized to fit the same footprint as a smaller one generally improves performance over the smaller original, down to about Q4 or so; even tighter quantizations (down to Q2) remain usable for some purposes, such as general Q&A chat.
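The footprint arithmetic behind that tradeoff is simple (parameter counts below are made up for illustration; ~4.5 bits/weight roughly accounts for Q4 formats plus their per-group scale overhead):

```python
# Back-of-the-envelope model footprint: params * bits-per-weight / 8.
# Illustrative numbers only, not any specific released model.

def footprint_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

small_bf16 = footprint_gb(14, 16)   # 14B model at bf16
large_q4 = footprint_gb(50, 4.5)    # ~50B model at ~Q4 with scale overhead
print(small_bf16, large_q4)         # both land around 28 GB
```

So for a fixed RAM budget, the quantized larger model and the full-precision smaller model cost about the same in bytes, and the larger one usually wins on quality.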

cbg0 6 hours ago | parent | prev [-]

Quantization can be very detrimental for some models: their quality can drop considerably from the published benchmarks, which are typically run at bf16. This is why having enough RAM to run at higher precision can be important.