kingstnap 6 hours ago

Impressive performance work. It's interesting that we're still seeing 40%+ perf gains like this.

Makes you think the cost of a fixed level of "intelligence" will keep dropping.

davidhyde 17 minutes ago

vLLM has to perform operations similar to an operating system's. If you write an operating system in Python, you will find scope for many 40% improvements all over the place, and in the end it won't be Python anymore, at least not under the hood.

whoevercares 6 hours ago

Absolutely. LLM inference is still greenfield: things like overlap scheduling and JIT-compiled CUDA kernels are very recent. We're just getting started optimizing for modern LLM architectures, so cost/perf will keep improving fast.
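
To make the "overlap scheduling" point concrete, here's a toy sketch of the idea: do the CPU-side scheduling for step N+1 while the GPU is still busy with step N, so Python overhead hides behind the forward pass. This is not vLLM code; every name in it (schedule_next_batch, launch_forward_pass, Batch) is a hypothetical stand-in, and the sleeps are placeholders for real work.

    # Toy sketch of overlap scheduling: hide per-step CPU scheduling
    # behind the GPU forward pass. All names are hypothetical placeholders.
    from concurrent.futures import ThreadPoolExecutor
    from dataclasses import dataclass
    from typing import List
    import time

    @dataclass
    class Batch:
        request_ids: List[int]

    def schedule_next_batch(step: int) -> Batch:
        # CPU work: pick which requests run next, build metadata (placeholder).
        time.sleep(0.002)
        return Batch(request_ids=[step])

    def launch_forward_pass(batch: Batch) -> None:
        # GPU work: in a real engine this is an async CUDA launch; here we sleep.
        time.sleep(0.010)

    def run_naive(steps: int) -> float:
        # Baseline: schedule, then run, strictly in sequence.
        start = time.perf_counter()
        for step in range(steps):
            batch = schedule_next_batch(step)
            launch_forward_pass(batch)
        return time.perf_counter() - start

    def run_overlapped(steps: int) -> float:
        # Overlapped: while the GPU runs step N, the CPU schedules step N+1.
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=1) as pool:
            next_batch = pool.submit(schedule_next_batch, 0)
            for step in range(steps):
                batch = next_batch.result()
                if step + 1 < steps:
                    next_batch = pool.submit(schedule_next_batch, step + 1)
                launch_forward_pass(batch)
        return time.perf_counter() - start

    if __name__ == "__main__":
        print(f"naive:      {run_naive(50):.3f}s")
        print(f"overlapped: {run_overlapped(50):.3f}s")

With the placeholder timings above, the overlapped loop converges toward GPU-bound time per step instead of CPU + GPU time, which is roughly the class of win overlap scheduling is after.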