Remix.run Logo
cold_harbor 3 hours ago

their MLA architecture cuts KV cache by ~5-13x vs standard attention. that's why inference is actually cheaper to run, not just a price war to gain market share.

zozbot234 3 hours ago | parent | next [-]

That's also a game changer for local inference. It unlocks long contexts, batched inference and storing the KV cache to disk on ordinary consumer platforms.

vitorsr an hour ago | parent | prev | next [-]

Yes. The discount was most likely a "post-market trial" of how efficient the caching works for the new generation models.

trollbridge 12 minutes ago | parent [-]

I've "adjusted" my workflows now to use the cache. (Basically read all the files in your project very early on in your session, etc., simple stuff like that.)

Nearly all requests are cached now. It's amazing.

hmaddipatla 3 hours ago | parent | prev [-]

[dead]