| ▲ | philipp-gayret 2 hours ago | |
Nice work, I applied my own benchmarking tools to it. On my single NVidia Spark I get 173.3 tokens/s on baseline config, 372.4 tokens/s with added tuning/parallel options. Most notably time to first token is incredibly low, similar models take ~6000ms. Bonsai was 70ms (almost 100x reduction) with flash attention Having said all that, gemma4-e4b-q4km did much better and I can achieve 70% of the tokens/s on the same machine, specifically in context of tool use and for running agents. | ||