Agree with this, and I have been thinking about it recently as well. I think you could implement a cord-like vocabulary to identify large duplicated substrings for exact deduplication, plus pairwise correlations or vocabulary profiles/small classifiers for forward-looking, speculative deduplication. A clear example is the GPL license: it’s a large substring you would encounter often, and it is highly likely to be accompanied by lots of C code.
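To make the exact-deduplication half concrete, here is a rough sketch of what I mean, not how any provider actually does it. It assumes chunks are split on blank lines and keyed by SHA-256; `ChunkStore`, `intern`, and `rebuild` are names I made up for illustration. A real system would more likely work on token sequences and use smarter chunk boundaries.

```python
import hashlib


class ChunkStore:
    """Hypothetical store that keeps one shared copy of each distinct chunk."""

    def __init__(self):
        self.chunks = {}  # content hash -> chunk text

    def intern(self, text):
        """Split text on blank lines and return its list of chunk keys.

        The returned list plays the role of the cord: the document is
        represented as references to shared pieces, not as a flat copy.
        Splitting on blank lines is an assumption made for this sketch.
        """
        keys = []
        for chunk in text.split("\n\n"):
            key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
            self.chunks.setdefault(key, chunk)  # store only the first copy
            keys.append(key)
        return keys

    def rebuild(self, keys):
        """Reassemble the original text from its chunk keys."""
        return "\n\n".join(self.chunks[k] for k in keys)


store = ChunkStore()
license_header = "This program is free software.\nSee the GPL for details."
doc_a = store.intern(license_header + "\n\nint main(void) { return 0; }")
doc_b = store.intern(license_header + "\n\nvoid f(void) {}")

# Both documents point at the same stored copy of the license header.
shared = doc_a[0] == doc_b[0]
```

Two files that start with the same license text end up sharing one stored chunk, which is the property you would want before deciding what can be reused downstream.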
This is probably something you’d do on the CPU before sending anything to the GPU, but the GPU is definitely the sensitive surface, since it’s hardware without good multitenancy. I assume the interface between the CPU and GPU is where you would be most likely to make a mistake: you start decoding data from one fd that was meant for another, or from the wrong position, and get someone else’s data.
I wouldn’t be confident that these are active exploits deliberately abusing KV cache optimizations, though; they could just be the kind of bugs you get from active low-level performance tuning and systems work. Since this is something I have seen across providers lately, I personally suspect a driver issue.