kingstnap 6 hours ago
Impressive performance work. It's interesting that 40+% perf gains like this are still showing up. Makes you think the cost of a fixed level of "intelligence" will keep dropping.
davidhyde 17 minutes ago
vLLM has to perform operations similar to an operating system's. If you write an operating system in Python, you will find scope for 40% improvements all over the place, and in the end it won't really be Python anymore, at least not under the hood.
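A toy illustration of the "not Python under the hood" point (hypothetical, assuming PyTorch 2.x, not vLLM's actual code): the same math written as ordinary eager ops versus handed to torch.compile, which traces the Python once and emits fused compiled kernels so the steady-state hot path mostly bypasses the interpreter.

    import torch

    def rms_norm(x, weight, eps=1e-6):
        # Eager version: each op is a separate Python call and kernel launch
        variance = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(variance + eps) * weight

    # torch.compile traces this once and generates fused compiled code,
    # so repeated calls no longer go through Python-level op dispatch.
    compiled_rms_norm = torch.compile(rms_norm)

    x = torch.randn(4, 4096)
    w = torch.ones(4096)
    print(torch.allclose(rms_norm(x, w), compiled_rms_norm(x, w), atol=1e-5))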
whoevercares 6 hours ago
Absolutely. LLM inference is still a greenfield: things like overlap scheduling and JIT CUDA kernels are very recent. We're just getting started optimizing for modern LLM architectures, so cost/perf will keep improving fast.
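By overlap scheduling I mean roughly this pattern (a minimal sketch in plain PyTorch, not vLLM's actual scheduler; it assumes a CUDA device and uses a placeholder model and placeholder batch-prep function): CUDA kernel launches are asynchronous, so the CPU can do the scheduling work for step N+1 while the GPU is still executing step N, and you only block when the results are actually needed.

    import torch

    def prepare_batch(step, device):
        # Stand-in for CPU-side scheduling work: picking requests,
        # building block tables / metadata for the next step.
        return torch.randn(8, 512, device=device)

    if torch.cuda.is_available():
        device = torch.device("cuda")
        model = torch.nn.Linear(512, 512).to(device)
        batch = prepare_batch(0, device)
        outputs = []
        for step in range(1, 5):
            out = model(batch)  # async: kernels are queued, CPU returns immediately
            # CPU-side prep for the next step overlaps GPU execution of `out`
            next_batch = prepare_batch(step, device)
            outputs.append(out)
            batch = next_batch
        torch.cuda.synchronize()  # block only once the results are needed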