proxysna 2 hours ago
Qwen3.5-27B with a 4-bit quant can be run on a 24 GB card with no problem. With 2 Nvidia L4 cards and some additional vLLM flags, I am serving 10 developers at 20-25 tok/s; off-peak is around 40 tok/s. The developers are OK with that performance, but of course they requested more GPUs for added throughput.
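As a rough sanity check on the "fits on a 24 GB card" claim, here is a back-of-the-envelope sketch of the weight footprint. It assumes exactly 27e9 parameters at 4 bits each, ignores quantization metadata (scales, zero points), and treats the card as 24 GiB; the function name `weight_memory_gib` is made up for illustration. Whatever is left over has to cover the KV cache, activations, and runtime overhead.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30

# Assumed figures: 27e9 parameters, 4 bits per weight, one 24 GiB card.
weights = weight_memory_gib(27e9, 4)
headroom = 24 - weights  # left for KV cache, activations, runtime overhead

print(f"weights ~{weights:.1f} GiB, headroom ~{headroom:.1f} GiB")
```

Under these assumptions the weights take a bit over half of one card, which is consistent with the claim above; real usage will be higher depending on context length and the number of concurrent requests.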
PcChip an hour ago
Question: why not use something like Claude? Is it for security reasons?
tandr 2 hours ago
What are these additional vLLM flags, if you don't mind sharing?