vLLM large scale serving: DeepSeek 2.2k tok/s per H200 with wide-EP (blog.vllm.ai)
87 points by robertnishihara 15 hours ago | 7 comments
kingstnap | 4 hours ago
Impressive performance work. It's interesting that we still see 40+% perf gains like this. It makes you think the cost of a fixed level of "intelligence" will keep dropping.
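Rough back-of-envelope, just to make the cost intuition concrete (the 2.2k tok/s/H200 figure is from the post title; the hourly H200 price below is purely my own assumption, not from the post):

    # Sketch: convert per-GPU decode throughput into $ per million output tokens.
    # 2,200 tok/s/H200 is the headline number; the rental price is an assumed figure.
    TOKENS_PER_SEC_PER_GPU = 2_200
    H200_PRICE_PER_HOUR = 4.00          # USD, assumed on-demand rate

    tokens_per_hour = TOKENS_PER_SEC_PER_GPU * 3600
    dollars_per_million_tokens = H200_PRICE_PER_HOUR / (tokens_per_hour / 1_000_000)

    print(f"{tokens_per_hour:,} tok/hour/GPU")                       # 7,920,000
    print(f"~${dollars_per_million_tokens:.2f} per million tokens")  # ~$0.51

At those numbers, a further 40% throughput gain at the same GPU price cuts the per-token cost by the same factor.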
snakepit | 2 hours ago
I did start a vLLM wrapper for Elixir, though I still have to update it for snakepit 0.11.0.
androiddrew | 4 hours ago
Now all we need is better support for AMD GPUs, both CDNA and RDNA.
danielhanchen | 4 hours ago
Love vLLM!
vessenes | 4 hours ago
As a user of a lot of coding tokens, I'm most interested in latency - these numbers are presumably for heavily batched workloads. I dearly wish Claude had a Cerebras endpoint. I'm sure I'd use more tokens because I'd get more revs, but I don't think token usage would increase linearly with speed: I need time to think about what I want to do and about what has happened or is being proposed. But I feel like I could stay in flow state if the responses were faster, and that's super appealing.
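If you want to see how much of that batched throughput actually shows up as per-request latency, a quick probe against any OpenAI-compatible vLLM endpoint works; here's a minimal sketch (the URL, model name, and prompt are placeholders I'm assuming, not anything from the post):

    # Minimal latency probe against an OpenAI-compatible vLLM server (sketch).
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

    start = time.perf_counter()
    first_token_at = None
    n_chunks = 0

    stream = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",  # placeholder model name
        messages=[{"role": "user", "content": "Explain wide expert parallelism briefly."}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            n_chunks += 1

    total = time.perf_counter() - start
    print(f"time to first token: {first_token_at - start:.2f}s")
    print(f"~{n_chunks / (total - (first_token_at - start)):.1f} streamed chunks/s after the first token")

Running it at different client concurrency levels shows directly how much the heavy batching behind those throughput numbers costs in interactive latency.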