rshemet 4 days ago
thank you! We're continuing to add performance metrics as more data comes in. A Qwen 2.5 500M will get you ≈45 tok/sec on an iPhone 13. Inference speed is roughly inversely proportional to model size. Yes, speeds are consistent across frameworks, although (and don't quote me on this) I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
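To put that rough inverse scaling in numbers, here's an illustrative sketch in TypeScript. The constant is anchored to the ≈45 tok/sec figure above; treat it as a back-of-the-envelope estimate only, since real throughput also depends on quantization, memory bandwidth, and the engine:

```ts
// Illustrative only: assumes tok/sec scales inversely with parameter count,
// anchored to ~45 tok/sec for Qwen 2.5 0.5B on an iPhone 13 (figure above).
function estimateTokPerSec(paramsB: number): number {
  const anchorParamsB = 0.5;   // Qwen 2.5 0.5B
  const anchorTokPerSec = 45;  // reported on iPhone 13
  return anchorTokPerSec * (anchorParamsB / paramsB);
}

console.log(estimateTokPerSec(1.5).toFixed(1)); // ≈15 tok/sec for a 1.5B model
console.log(estimateTokPerSec(7).toFixed(1));   // ≈3.2 tok/sec for a 7B model
```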
pickettd 4 days ago
I also want to add that I really appreciate the benchmarks. When I was working with RAG via llama.cpp through RN early last year, I got pretty acceptable tok/sec results up through 7-8B quantized models (on phones like the S24+ and iPhone 15 Pro). MLC was definitely higher tok/sec, but it is really tough to beat the community support and model availability in the GGUF ecosystem.
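For anyone curious what that pattern looks like, here's a minimal sketch. Note that `LlamaBridge` is a hypothetical native module standing in for whatever llama.cpp wrapper you use (it is not a real package), so the method names and options are illustrative, not an actual API:

```ts
// Hypothetical sketch: calling llama.cpp from React Native for simple RAG.
// `LlamaBridge` is an assumed native module wrapping llama.cpp behind the
// RN bridge/JSI; loadModel/complete and their options are placeholders.
import { NativeModules } from 'react-native';

const { LlamaBridge } = NativeModules; // assumed native module

async function answerWithRag(question: string, retrievedChunks: string[]) {
  // Load a quantized GGUF model (e.g. a 7B Q4 build, as mentioned above)
  await LlamaBridge.loadModel('models/llama-7b-q4_k_m.gguf', { nCtx: 4096 });

  // Stuff the retrieved chunks into the prompt -- the simplest RAG pattern
  const prompt =
    'Answer using only the context below.\n\n' +
    retrievedChunks.map((c, i) => `[${i + 1}] ${c}`).join('\n') +
    `\n\nQuestion: ${question}\nAnswer:`;

  return LlamaBridge.complete(prompt, { maxTokens: 256, temperature: 0.2 });
}
```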
Reebz 4 days ago
Looking at the current benchmarks table, I was curious: what do you think is wrong with the Samsung S25 Ultra? Most of the standard mobile CPU benchmarks (Geekbench, AnTuTu, et al.) show a 20-40% performance gain over the S23/S24 Ultra. This also bucks the trend in the rest of the table, where devices are ranked as you'd expect (i.e. newer devices perform better). Thanks for sharing your project.