simianwords 13 hours ago

There’s a case for intelligent caching: coarse-grained TTLs of the 1-hour and 5-minute variety are not optimal.

PunchyHamster 12 hours ago | parent [-]

Caching for LLMs is not like caching normal content: the longer the cache lives, the more beneficial it is, and it only stops being worthwhile once the user ends the current session.

So you'd need some adaptive algorithm to decide when to keep caching and when to purge the whole thing, possibly on the client side. But if you give the client control, people will use as much cache as possible just to chase diminishing returns, so fine-grained control here isn't all that easy. Another option is a cache size budget per account, with intelligent purging instead of relying on TTLs alone.
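The per-account budget idea could be sketched roughly like this: instead of expiring entries on a clock, evict the least-recently-used session's cache once an account exceeds its budget. All names and sizes here are illustrative, not any provider's actual API:

```python
from collections import OrderedDict

class AccountKVCache:
    """Hypothetical sketch: a per-account KV-cache budget with LRU
    eviction instead of a fixed TTL. Sizes are illustrative only."""

    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        self.entries = OrderedDict()  # session_id -> size_bytes, oldest first

    def touch(self, session_id, size_bytes):
        # Refresh (or insert) a session's cached prefix, marking it
        # most-recently-used by moving it to the end of the OrderedDict.
        if session_id in self.entries:
            self.used -= self.entries.pop(session_id)
        self.entries[session_id] = size_bytes
        self.used += size_bytes
        # Purge least-recently-used sessions when over budget,
        # always keeping the session that was just touched.
        while self.used > self.budget and len(self.entries) > 1:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size

cache = AccountKVCache(budget_bytes=1000)
cache.touch("s1", 600)
cache.touch("s2", 600)  # over budget: the idle session s1 gets purged
```

The point is that an active session's cache survives as long as the account stays under budget, which matches the "beneficial until the session ends" property better than a blanket TTL.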

cyanydeez 11 hours ago | parent | prev [-]

Keep in mind that efficient KV caching needs to live next to the GPU, so you also need your HA layer to keep routing the user to the same hardware.

The hardware VM model is almost identical: each session can start anywhere, but a live session can't just be routed anywhere without a penalty.
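That session affinity is often done with consistent hashing: hash the session ID onto a ring of nodes so a live session keeps hitting the hardware that holds its KV cache, while new sessions spread out. A minimal sketch, with made-up node names:

```python
import bisect
import hashlib

class SessionRouter:
    """Hypothetical sketch: consistent-hash routing so each session
    sticks to one GPU node. Node names are illustrative only."""

    def __init__(self, nodes, vnodes=64):
        # Place several virtual points per node on the hash ring
        # to smooth out the load distribution.
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                h = int(hashlib.md5(f"{node}#{i}".encode()).hexdigest(), 16)
                self.ring.append((h, node))
        self.ring.sort()

    def route(self, session_id):
        # Walk clockwise from the session's hash to the next node point.
        h = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

router = SessionRouter(["gpu-a", "gpu-b", "gpu-c"])
node = router.route("sess-42")  # same session id -> same node, every time
```

New sessions hash anywhere on the ring, but once a session has warmed a cache on one node, every subsequent request lands there, and removing a node only remaps the sessions that were on it.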