| ▲ | cold_harbor 3 hours ago | |||||||
their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share. | ||||||||
| ▲ | zozbot234 3 hours ago | parent | next [-] | |||||||
That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms. | ||||||||
| ▲ | vitorsr an hour ago | parent | prev | next [-] | |||||||
Yes. The discount was most likely a "post-market trial" of how efficient the caching works for the new generation models. | ||||||||
| ||||||||
| ▲ | hmaddipatla 3 hours ago | parent | prev [-] | |||||||
[dead] | ||||||||