| ▲ | simonw 6 hours ago | ||||||||||||||||
Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in-a-flash/ | |||||||||||||||||
| ▲ | anemll 2 hours ago | parent | next [-] | ||||||||||||||||
Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5P/M generation, which makes it usable! I think one paper flying under the radar is "KV Prediction for Improved Time to First Token" (https://arxiv.org/abs/2410.08391), which hopefully can help with prefill for Flash streaming. | |||||||||||||||||
| ▲ | superjan 4 hours ago | parent | prev [-] | ||||||||||||||||
That was a very good summary. One detail the post could add is that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings). | |||||||||||||||||