| ▲ | mikeayles 5 hours ago | |
So for people wondering whether it can be used to accelerate LLM inference: sadly, no. I've been trying to hit 100,000 tokens/s with a dumb 3.28M model, and even this is an order of magnitude too large to benefit. It appears to be focussed more on latency than on throughput. Happy to be corrected. | ||
| ▲ | ssivark 27 minutes ago | parent | next [-] | |
When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that? EDIT: Oh, on second read, do you mean you're running the model on an FPGA? | ||
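For scale, here is a back-of-the-envelope sketch of why microsecond-level overheads matter at this target. It assumes single-stream decoding, where tokens are produced one after another so the per-token time budget is simply the inverse of the throughput; batching would change the picture. The 5 µs launch overhead is a hypothetical placeholder, not a measured figure, and the function names are made up for illustration.

```python
def per_token_budget_us(tokens_per_second: float) -> float:
    """Time available per token, in microseconds, assuming strictly sequential decoding."""
    return 1e6 / tokens_per_second


def max_launches_per_token(tokens_per_second: float, launch_overhead_us: float) -> int:
    """How many fixed-overhead launches fit in one token's budget, ignoring all compute time."""
    return int(per_token_budget_us(tokens_per_second) // launch_overhead_us)


# Target from the thread: 100k tok/s.
budget_us = per_token_budget_us(100_000)

# ASSUMPTION: 5 us per launch is a placeholder value, not a benchmark.
launches = max_launches_per_token(100_000, launch_overhead_us=5.0)

print(f"per-token budget: {budget_us} us, launches that fit: {launches}")
```

Under those assumptions the budget is 10 µs per token, so only a couple of launches fit before overhead alone consumes it, which is why fusing the whole forward pass into very few launches (or leaving the GPU entirely) comes up.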
| ▲ | ag2718 5 hours ago | parent | prev | next [-] | |
You're correct that this work is not very applicable to LLMs and that the focus here is primarily on latency. | ||
| ▲ | ai_fry_ur_brain 2 hours ago | parent | prev [-] | |
Was anyone thinking this? | ||