rfw300 a day ago
OK... we need way more information than this to validate this claim! I can run Qwen-8B at 1 billion tokens per second if you don't check the model's output quality. No information is given about the source code, correctness, batching, benchmark results, quantization, etc.
lukebechtel a day ago
We currently validate with MMLU and HellaSwag, and we are getting this independently verified by a third party. We have considered open-sourcing some of our optimized inference libraries in the future, but have not yet come to a decision on this. If you need a rough intuition as to why this is possible: the entire inference stack was built for exactly one model, so we can tune the whole framework accordingly.
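
For rough intuition only, here is a minimal Python/NumPy sketch of what "built for exactly one model" can buy you. This is not their code; the StaticKVCache name and all constants are toy placeholders. The idea is that when the model's shapes are fixed at build time, every buffer can be preallocated and every loop bound becomes a constant, instead of branching on a runtime config:

    import numpy as np

    # Toy placeholder constants -- NOT the real model's config. In a
    # single-model stack these are fixed at build time, so every buffer
    # shape and loop bound below is a constant.
    NUM_LAYERS = 8
    HIDDEN_DIM = 64
    MAX_BATCH = 4
    MAX_SEQ_LEN = 128

    class StaticKVCache:
        """KV cache allocated once at startup; nothing resized on the hot path."""
        def __init__(self) -> None:
            shape = (NUM_LAYERS, MAX_BATCH, MAX_SEQ_LEN, HIDDEN_DIM)
            self.k = np.zeros(shape, dtype=np.float16)
            self.v = np.zeros(shape, dtype=np.float16)

        def write(self, layer: int, slot: int, pos: int,
                  k_vec: np.ndarray, v_vec: np.ndarray) -> None:
            # With constant bounds, a specialized kernel could elide shape
            # checks and fuse this write into the attention step for this
            # one model.
            self.k[layer, slot, pos] = k_vec
            self.v[layer, slot, pos] = v_vec

    if __name__ == "__main__":
        cache = StaticKVCache()
        cache.write(0, 0, 0,
                    np.ones(HIDDEN_DIM, dtype=np.float16),
                    np.ones(HIDDEN_DIM, dtype=np.float16))
        print(cache.k[0, 0, 0, :4])

A general-purpose engine has to handle arbitrary hidden sizes, layer counts, and sequence lengths, which forces dynamic allocation and shape dispatch; hard-coding one configuration removes that overhead and lets you hand-tune each kernel for that exact shape.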