Yeah, despite the conceptual statelessness, quite a bit of state does hang around: the KV cache and the context. I still haven't been able to find anything concrete in the docs about how these are isolated. In any case, it's clearly a different class of issue from the one in the article: not endemic to how LLMs work, just normal web session stuff, modulo some GPU memory handling.

ipython 7 hours ago

As far as I know, of the two you identified, only the KV cache is actually cached inside the inference layer. Then again, I'm not an expert in designing and operating inference systems, so I could be wrong about that.

Either way, both of those are controlled by deterministic code and not by the LLM itself. So controlling for that risk is much simpler to model, IMO, since the mitigation can be applied universally and deterministically rather than hoping and praying that some non-deterministic system will respect your wishes.

wolttam 4 hours ago

In other words: controlling for that kind of potential data-mixing is the same as in any other application where customer data is co-located within the same running process/memory/storage space.
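As a minimal sketch of what that looks like in ordinary application code (every name here is hypothetical, and this is not any vendor's actual serving code): a toy prefix cache whose key includes the tenant ID, so identical prompts from different tenants can never resolve to the same entry.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class TenantScopedPrefixCache:
    """Toy stand-in for a KV/prefix cache. Hypothetical, for illustration only."""

    def __init__(self) -> None:
        # Key is (tenant_id, prefix_hash), never the prefix hash alone.
        self._entries: Dict[Tuple[str, str], List[float]] = {}

    @staticmethod
    def _prefix_hash(tokens: List[int]) -> str:
        # Hash the token prefix so the key has a fixed size.
        raw = ",".join(str(t) for t in tokens).encode("utf-8")
        return hashlib.sha256(raw).hexdigest()

    def put(self, tenant_id: str, tokens: List[int], kv_block: List[float]) -> None:
        # Store a copy so later mutation by the caller cannot leak into the cache.
        self._entries[(tenant_id, self._prefix_hash(tokens))] = list(kv_block)

    def get(self, tenant_id: str, tokens: List[int]) -> Optional[List[float]]:
        # A lookup only hits if both the tenant and the prefix match.
        hit = self._entries.get((tenant_id, self._prefix_hash(tokens)))
        return list(hit) if hit is not None else None


cache = TenantScopedPrefixCache()
cache.put("tenant-a", [1, 2, 3], [0.1, 0.2])
same_tenant = cache.get("tenant-a", [1, 2, 3])   # hit
other_tenant = cache.get("tenant-b", [1, 2, 3])  # miss, despite identical tokens
```

The point is only that the isolation lives in a deterministic key, which can be reviewed and tested like any other multi-tenant cache. Real serving stacks add paging, eviction, and GPU memory reuse on top, and that is where bugs can still creep in.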

jcgrillo 2 hours ago

Yes, however the companies that are responsible for doing it have already shown their asses a little bit with all the jailbreaking stuff, and we know they produce really awful code from all the recent harness issues... To my mind that indicates this critical invariant deserves a little scrutiny. But with all the vibe slop being slung these days who knows what's safe anymore.

All that is to say I sure would appreciate a coherent, clear technical explanation of how they ensure user data are separate while serving concurrent queries.

wolttam an hour ago

They're valid things to be concerned about, IMO, but I think you're looking for an answer you're unfortunately not going to get. I think there actually is a higher-than-average risk of data leakage with the insane optimizations that go into model serving: GLM5.1 had an issue of producing gibberish when their infra was under high load, and it turned out to be a cross-request KV cache contamination issue. [1] Personally, I've been using only local models lately, and it's gone pretty well!

[1]: https://z.ai/blog/scaling-pain