pj_mukh 4 days ago
Amazing, this is so so useful. Thank you especially for the phone model vs. tok/s breakdown. Do you have such tables for more models? For models even leaner than Gemma3 1B? How low can you go? Say, if I wanted to eke out 45 tok/s on an iPhone 13? P.S.: Also, I'm assuming the speeds stay consistent with React Native vs. Flutter etc.?
rshemet 4 days ago | parent
Thank you! We'll continue to add performance metrics as more data comes in. A Qwen 2.5 500M will get you to ≈45 tok/s on an iPhone 13. Inference speed is roughly inversely proportional to model size. Yes, speeds are consistent across frameworks, although (and don't quote me on this) I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
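To make that scaling concrete, here's a back-of-the-envelope sketch (TypeScript, since React Native came up). It takes the ≈45 tok/s at 500M figure above as its reference point and assumes plain inverse proportionality, which is only an approximation; `estimateToksPerSec` is a hypothetical helper for illustration, not part of any library:

    // Rough throughput estimate, assuming tok/s scales inversely with
    // parameter count (an approximation, not a guarantee).
    // Reference point: Qwen 2.5 500M ≈ 45 tok/s on an iPhone 13 (see above).
    function estimateToksPerSec(
      paramsB: number,            // model size in billions of parameters
      refParamsB: number = 0.5,   // reference model size (500M)
      refToksPerSec: number = 45  // observed speed at the reference size
    ): number {
      return refToksPerSec * (refParamsB / paramsB);
    }

    // e.g. a 1B model on the same phone would land around 22-23 tok/s:
    console.log(estimateToksPerSec(1.0)); // ≈ 22.5
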