You still need to hold the model in memory. If you have, for example, 16 GB of RAM, the gains aren't that large.
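That's not what consumes the most memory at scale. The KV caches are per-user, so they grow with every concurrent user and every token of context, while the weights are a fixed cost you pay once.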
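To put rough numbers on that, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, 7 billion parameters, fp16 everywhere) and the 4096-token context are assumptions picked for illustration, not figures from this thread, and the helper names are made up. It also assumes plain multi-head attention with no cache quantization or sharing between users.

```python
import math

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of one user's KV cache: a key and a value vector per layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def weight_bytes(num_params, bytes_per_param=2):
    """Size of the model weights, paid once regardless of user count."""
    return num_params * bytes_per_param

# Hypothetical model shape and context length (assumptions, see above).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
SEQ_LEN = 4096
NUM_PARAMS = 7_000_000_000

per_user = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
weights = weight_bytes(NUM_PARAMS)

# Smallest number of concurrent users whose caches together exceed the weights.
users_to_exceed_weights = math.ceil(weights / per_user)

print(f"KV cache per user: {per_user / 2**30:.1f} GiB")  # 2.0 GiB
print(f"Model weights:     {weights / 2**30:.1f} GiB")   # 13.0 GiB
print(f"Users until caches outweigh weights: {users_to_exceed_weights}")  # 7
```

Under these assumptions a handful of users at full context already outweigh the model itself. Models that use grouped-query attention have fewer KV heads, which shrinks the per-user figure, but the linear growth with users and context length stays.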