nicoinstrument 12 hours ago
I'm learning about inference by running vLLM on a k8s cluster (EKS), building a gateway to keep a <2s TTFT SLO. Most recent aha moment: I kept wondering whether it was normal that my cluster could only process 4 requests per second per vLLM engine (that just seemed really low to me). Then I realized a better metric is in-flight requests: each engine is processing ~70 requests at any given time, streaming tokens for over 30s.
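The two metrics in that comment are linked by Little's Law (L = λW: average in-flight requests = arrival rate × average time in system). A quick sanity-check sketch using the figures above (the numbers come from the comment; the variable names are mine):

```python
# Little's Law: L = lambda * W
#   L      = average number of in-flight requests
#   lambda = arrival rate (requests/second)
#   W      = average time a request spends in the system (seconds)
# Figures taken from the comment above; this just checks consistency.

arrival_rate = 4.0   # requests/second per vLLM engine
in_flight = 70.0     # concurrent requests observed per engine

avg_residence = in_flight / arrival_rate  # implied average seconds per request
print(f"Implied average time in system: {avg_residence:.1f}s")  # 17.5s
```

Note the implied average (17.5s) is lower than the 30s+ streaming time mentioned, which would suggest many requests finish faster than the long-streaming ones; treat it as a rough consistency check, not an exact model.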
iugtmkbdfil834 12 hours ago | parent
Deeper dives into those metrics uncover interesting limitations that don't seem to be documented anywhere. On the other hand, it is through those reverse shibboleths that I can now tell my boss's boss has no idea what he is talking about, LLM-wise.