mutkach 9 days ago

> Inspired by GPUs, we parallelized this effort across multiple engineers. One engineer tried vLLM, another SGLang, and a third worked on TensorRT-LLM. We were able to quickly get TensorRT-LLM working, which was fortunate as it is usually the most performant inference framework for LLMs.

> TensorRT-LLM

It is usually the hardest to set up correctly and is often out of date regarding the relevant architectures. It also requires compiling the model on the exact same hardware-drivers-libraries stack as your production environment, which is a great pain in the rear end to say the least. Multimodal setups have also been a disaster, at least for a while, when it was near-impossible to get them working even for mainstream models like the multimodal Llamas. The big question is whether it's worth it, since running GPT-OSS-120B on an H100 using vLLM is flawless in comparison, and throughput stays at 130-140 t/s on a single H100. (The title is also somewhat clickbait: I was expecting 500 t/s on a single GPU, when in fact it's just a tensor-parallel setup.)
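For reference, the vLLM path really is about this simple. A minimal sketch of serving gpt-oss-120b through vLLM's offline Python API, assuming a CUDA node with enough HBM; the tensor-parallel degree is illustrative (the single-H100 numbers above would correspond to tensor_parallel_size=1):

    # Hedged sketch: serve gpt-oss-120b with vLLM's offline API.
    # Model id is the published HF id; parallelism degree is an assumption.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="openai/gpt-oss-120b",
        tensor_parallel_size=1,  # bump to shard across multiple GPUs
    )

    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
    for out in outputs:
        print(out.outputs[0].text)

No ahead-of-time engine compilation, no pinning to the exact production driver stack; that's the whole appeal versus TRT-LLM.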

It's also funny that they went for a separate release of TRT-LLM just to make sure that gpt-oss would work correctly. TRT-LLM is a mess.

philipkiely 9 days ago

TRT-LLM has its challenges from a DX perspective, and yeah, for multimodal we still use vLLM pretty often.

But for the kind of traffic we are trying to serve -- high-volume and latency-sensitive -- it consistently wins head-to-head in our benchmarking, and we have invested a ton of dev work in the tooling around it.
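For anyone who wants to sanity-check these kinds of claims themselves, a crude single-stream throughput probe against any OpenAI-compatible endpoint (both vLLM and TRT-LLM can expose one) looks roughly like the sketch below; the URL, model name, and prompt are placeholders, and a real head-to-head would drive many concurrent streams rather than one request:

    # Hedged sketch: rough tokens/sec for one completion request against
    # an OpenAI-compatible server. Endpoint and model id are assumptions.
    import time
    import requests

    URL = "http://localhost:8000/v1/completions"
    payload = {
        "model": "openai/gpt-oss-120b",
        "prompt": "Write a haiku about GPUs.",
        "max_tokens": 256,
    }

    start = time.time()
    resp = requests.post(URL, json=payload, timeout=120).json()
    elapsed = time.time() - start

    tokens = resp["usage"]["completion_tokens"]
    print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} t/s")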