rfw300 a day ago

OK... we need way more information than this to validate this claim! I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality. No information is given about the source code, correctness, batching, benchmark results, quantization, etc. etc. etc.

lukebechtel a day ago | parent [-]

We currently validate with MMLU and HellaSwag, and we are having the results independently verified by a third party.
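(For readers unfamiliar with these benchmarks: MMLU and HellaSwag are multiple-choice evaluations, typically scored by computing the model's log-likelihood for each candidate answer and checking whether the argmax matches the answer key. A minimal sketch of such a harness; the `toy_loglik` scoring function below is a hypothetical stand-in for a real model call:)

```python
import math

def pick_choice(loglik_fn, question, choices):
    """Return the index of the choice the model assigns the highest
    log-likelihood, MMLU/HellaSwag style."""
    scores = [loglik_fn(question, c) for c in choices]
    return max(range(len(choices)), key=lambda i: scores[i])

def accuracy(loglik_fn, dataset):
    """dataset: list of (question, choices, answer_index) tuples."""
    correct = sum(
        pick_choice(loglik_fn, q, cs) == ans for q, cs, ans in dataset
    )
    return correct / len(dataset)

# Toy stand-in for a real model's scoring call (hypothetical):
def toy_loglik(question, choice):
    # Pretend more word overlap with the question means higher likelihood.
    overlap = len(set(question.split()) & set(choice.split()))
    return math.log1p(overlap)

data = [
    ("capital of France", ["Paris France", "Berlin"], 0),
    ("largest planet", ["Mars", "Jupiter largest planet"], 1),
]
print(accuracy(toy_loglik, data))  # → 1.0
```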

We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this.

Also if you need a rough intuition as to why this is possible: it's because this entire inference stack was built for exactly one model, and thus we can really tune the entire framework accordingly.

rfw300 17 hours ago | parent [-]

I have no problem with the intuition. But I would hope the marketing materials focused much more on demonstrating the (statistical) correctness of the implementation. A 15% improvement in inference speed is not worth switching to a completely unknown inference engine that hasn't been tested across a wide range of generation scenarios.
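(One concrete way to demonstrate that kind of statistical correctness is a distributional comparison against a reference implementation: run both engines over a shared prompt set and bound the divergence of their next-token distributions. A minimal sketch; the distributions and tolerance below are illustrative, not measurements from any real engine:)

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def max_kl_over_prompts(ref_dists, opt_dists):
    """Worst-case divergence across a prompt set; bounding this against
    a tolerance is one way to validate an optimized engine against a
    reference implementation."""
    return max(kl_divergence(p, q) for p, q in zip(ref_dists, opt_dists))

# Illustrative next-token distributions from a "reference" and an
# "optimized" engine over two prompts (made-up numbers):
ref = [[0.70, 0.20, 0.10], [0.50, 0.30, 0.20]]
opt = [[0.69, 0.21, 0.10], [0.50, 0.30, 0.20]]

tolerance = 1e-3
print(max_kl_over_prompts(ref, opt) < tolerance)  # → True
```

A check like this catches numerical drift that benchmark accuracy alone can miss, since two engines can score similarly on MMLU while producing meaningfully different sampling distributions.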

lukebechtel 14 hours ago | parent | next [-]

This is a fair critique! We plan to use our system to generate many more inference libraries of this nature, and I'll make it a point to release better, broader correctness measures when we do so.
