Qwen3.5-27B with a 4-bit quant runs on a 24 GB card with no problem. With 2 Nvidia L4 cards and some additional vLLM flags, I am serving 10 developers at 20-25 tok/s; off-peak it is around 40 tok/s. Developers are OK with that performance, but of course they requested more GPUs for added throughput.
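
For a rough sense of why a 27B model at 4 bits fits on a 24 GB card, here is a back-of-the-envelope sketch. It counts weights only and assumes exactly 4 bits per weight and 1 GB = 10^9 bytes; real quantized checkpoints usually carry extra overhead (scales, some unquantized layers), and the KV cache and activations need room too. The function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_weight` bits,
    with no quantization metadata or unquantized layers.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# 27B parameters at 4 bits per weight
weights_gb = weight_memory_gb(27, 4)

# What is left on a 24 GB card for KV cache, activations, and runtime overhead
headroom_gb = 24 - weights_gb

print(f"weights: ~{weights_gb:.1f} GB, headroom on 24 GB: ~{headroom_gb:.1f} GB")
```

Under these assumptions the weights take roughly 13.5 GB, which leaves around 10 GB per card for everything else. That headroom is what concurrent users end up sharing as KV cache.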

PcChip an hour ago

Question: why not use something like Claude? Is it for security reasons?

	lambda 10 minutes ago
		Some people would rather not hand over all of their ability to think to a single SaaS company that arbitrarily bans people, changes token limits, and tweaks harnesses and prompts in ways that cause it to consume too many tokens, or too few to complete the task. I don't use any non-FLOSS dev tools; why would I suddenly pay for a subscription to a single SaaS provider with a proprietary client that acts in opaque and user-hostile ways?

tandr 2 hours ago

What would these additional vLLM flags be, if you don't mind sharing?