Kirby64 2 hours ago
But I’m not missing the point. If you can run one frontier model at 750 t/s, then you can probably run many, many instances of an SLM in parallel at an aggregate rate that exceeds 15k t/s. That’s kinda the point of the flash or ultrafast variants. And they’re on something much more modern than llama3.1.
windexh8er 2 hours ago
Yes, you are missing the point. 1) It's a demo. [0] 2) It hasn't been updated for 4+ months. You don't need LLMs for everything. That is 100% the point. You can burn down the world with all of your frontier LLMs that are being used for simple queries OR we can do something faster and more efficient like this. Saying you can run a SotA model at "fast" speeds, again, severely misses the point. And no, you can't run anything from Anthropic or OAI on-prem, so until you can there's really no comparison. If people want to continue down the path of gate-kept models with no other options, then we'll all follow you off the cliff.