refibrillator 8 months ago
Note to others reading along: in the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8b, Qwen-2.5-14b/32b, and Mistral-small-24b models (the throughput penalty is not reported for the others). With DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to CPU. Classic comp sci tradeoff between space and speed, no free lunch, etc.