p1esk 5 hours ago
By perf I mean: how much does it cost to serve a 1T model to 1M users at 50 tokens/sec?
zozbot234 4 hours ago
Not all 1T models are equal. How many active parameters does it have? What's the native quantization? How long is the max context? Also, it's quite likely that some of the smaller models in common use are sub-1T. If your model is light enough, the lower throughput doesn't necessarily hurt you all that much, and you can enjoy the lightning-fast speed.
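To make the "not all 1T models are equal" point concrete, here is a rough back-of-envelope sketch. Every number in it is a hypothetical placeholder rather than a figure for any real model or deployment, and it assumes the 50 tokens/sec is per user. It looks only at weight storage and the aggregate token rate, and ignores batching, KV-cache size (which grows with context length), and hardware pricing.

```python
def weight_bytes(params: float, bits_per_param: float) -> float:
    """Bytes needed to hold `params` weights at a given quantization width."""
    return params * bits_per_param / 8


def aggregate_tokens_per_sec(users: int, tokens_per_sec_per_user: float) -> float:
    """Total decode rate the fleet must sustain if every user is active at once."""
    return users * tokens_per_sec_per_user


# Hypothetical: a dense 1T-parameter model stored at 16 bits per weight.
dense_16bit = weight_bytes(1e12, 16)

# Hypothetical: a 1T-total mixture-of-experts model with 32B active
# parameters per token, quantized to 4 bits per weight.
moe_total_4bit = weight_bytes(1e12, 4)
moe_active_4bit = weight_bytes(32e9, 4)

# Assumption: 1M users all decoding simultaneously at 50 tokens/sec each.
fleet_rate = aggregate_tokens_per_sec(1_000_000, 50)

print(f"dense 16-bit weights:        {dense_16bit / 1e12:.2f} TB")
print(f"MoE 4-bit weights (total):   {moe_total_4bit / 1e12:.2f} TB")
print(f"MoE 4-bit weights (active):  {moe_active_4bit / 1e9:.1f} GB touched per token")
print(f"aggregate decode rate:       {fleet_rate:.0f} tokens/sec")
```

Under these made-up settings, the two "1T" models differ by 4x in stored weight bytes and by over 100x in weight bytes touched per token, which is why the cost question can't be answered from the parameter count alone.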