Is there anything particular about LLMs that would make separating customer data harder than in all SaaS cases?

Yes:

* There's an enormous amount of very expensive shared state (context cache) which you do not want to duplicate when you can avoid it.

* Memory locality is crucially important for performance.

* Hardware is extremely over-subscribed.

* Hardware is extremely expensive.

These factors all make hardware or even traditional memory-space (hypervisor/VM/hardware assisted virtualization) isolation a non-starter for most workloads and customers, which forces all isolation to the software layer. This already makes things way harder than they are in commodity SaaS.

Moving beyond that, the tools, frameworks, and hardware the system runs on (GPUs) weren't designed for task isolation, and building this isolation is even more of an emergent research field than it is in x86 CPU hardware-sharing (which has required a huge amount of effort over the past 30+ years to get where we are today).

And, the ratio of usage/sensitivity to maturity is also just poor overall; these are young companies with rapid development and enormous delivery pressure under incredible customer workload requirements, too.

I can't tell if the original post is a real issue or not, but I'm surprised there aren't more like this overall; the whole thing really is a house of cards in this sense.

jstummbillig an hour ago | parent [-]

> which forces all isolation to the software layer. This already makes things way harder than they are in commodity SaaS.

Is this not what happens in most SaaS? Isolation at the software layer? I understand there are special agreements, but they seem to be mostly that – no?

> the ratio of usage/sensitivity to maturity is also just poor overall; these are young companies with rapid development and enormous delivery pressure under incredible customer workload requirements, too.

Mh. The talent density in these companies is apparently quite exceptional. Things like customer data separation are obvious and top of mind. I don't see why they would not hire the best to implement these relatively boring/solved things correctly at an architectural level.

	bri3d an hour ago | parent [-]
		> Is this not what happens in most SaaS?

		I think it's fairly popular to try to do more logical isolation in SaaS now, especially with VM-scheduling-as-a-service becoming more popular. For example, I did security architecture at a company that did relatively simple financial processing; we worked to move to a model where customer documents were encrypted using a tenant key, which we'd then wrap in both a service key and a login key. Users could only get the login key stapled to their session by authenticating against that account, and the processing jobs ran on a cloud vendor's logical isolation. So the user needed a login key, the service needed the attested service key, and the job ran in what amounted to a mini-VM, avoiding issues like "whoops we sent the wrong document ID and the backend gave it back to us" or "whoops, we routed the request to the wrong tenant backend!" This level of isolation would be really hard to achieve in an LLM vendor context.

		> I don't see why they would not hire the best to implement these relatively boring/solved things correctly at an architectural level.

		I think a lot of these things develop over time; obviously hiring people who have done them before helps, but it's hard. Even the people with strong experience often only know little slices. And unfortunately, every system operating at these scales has emergent behavior which can become really challenging; mistakes like "we used hash(id) as a key in a memory cache without a collision list, and it collided", which would simply never affect most startups, become more and more frequent at scale. High rate of change makes it hard to suss these mistakes out and root-cause them, too; "a customer gave us a log where we swapped X and Y" is hard to bisect when you're doing 500 code deploys a day.
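
The hash(id) failure mode described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class names, the eight-slot table, and the integer ids are all invented for the example and do not describe any vendor's cache. It relies on the CPython detail that small non-negative integers hash to themselves, so ids 1 and 9 land in the same slot.

```python
class UnsafeSlotCache:
    """Hypothetical cache: one value per slot, no record of which key wrote it."""

    def __init__(self, slots):
        self.slots = [None] * slots

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        self.slots[self._index(key)] = value

    def get(self, key):
        # Returns whatever is in the slot, even if another key put it there.
        return self.slots[self._index(key)]


class SafeSlotCache(UnsafeSlotCache):
    """Same layout, but stores the full key and verifies it on every read."""

    def put(self, key, value):
        self.slots[self._index(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._index(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None  # empty slot or collision: report a miss, never foreign data


def demo():
    # Ids 1 and 9 collide modulo 8; the second write overwrites the first.
    unsafe, safe = UnsafeSlotCache(8), SafeSlotCache(8)
    for cache in (unsafe, safe):
        cache.put(1, "tenant A document")
        cache.put(9, "tenant B document")
    return unsafe.get(1), safe.get(1)
```

In this sketch the unsafe cache hands tenant B's document to a lookup for id 1, while the key-checking variant degrades to a cache miss. A real system would more likely keep a collision list than evict, but the key comparison is the part that prevents the cross-tenant read.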

adam_arthur 3 hours ago | parent | prev | next [-]

Vibe-coding the implementation.

I haven't had much issue with Codex, but it seems Claude Code has major issues being reported nearly on the daily.

They also happen to be the most boastful about not reading or looking at the code.

LLMs are very capable, but not nearly to the level they seem to be messaging.

(We've actually moved on from vibe-coding to having the LLM vibe code itself in a loop)

	27183 3 hours ago | parent | next [-]
		> having the LLM vibe code itself in a loop

		The businesslatin name for this is Recursive Self-Improvement
	rabbidruster 3 hours ago | parent | prev [-]
		Interestingly, I had an almost identical experience to this report in Codex. It output a user memory file that looked awfully real and wasn't at all related to my work.

27183 3 hours ago | parent | prev | next [-]

If I had to hazard a guess, doing anything in a multi-tenant way on a GPU is going to be hard mode compared to most SaaS due to the lack of memory-safe tooling. I've built multi-tenant SaaS systems, and I've done a little GPU programming (a long time ago), but I've never tried to combine the two disciplines.

woadwarrior01 3 hours ago | parent | prev [-]

It'd be terribly compute inefficient to not share prefix caches (KV cache) across customers.

acepl 3 hours ago | parent [-]

What is the probability that two customers will have exactly the same tokens in cache? Wouldn't it require using the exact same CLAUDE.md, skills, MCPs and context? After that it gets even worse, given the nondeterminism of LLMs and humans.

27183 3 hours ago | parent | next [-]

I suspect what GP is getting at is that there will be a strong incentive to implement some structural sharing across tenants to avoid redundantly storing the same tokens over and over. At least I'd be tempted to do this if I were working with a very precious, constrained resource (e.g. VRAM). Doing this correctly seems... very difficult.

[edit] To answer your question directly: the probability that the entire cache is identical between two different users is very low, but the probability that there exist identical chunks of cache between two different users is very high. Exploiting those commonalities successfully will significantly compress the data.
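
One way to picture that kind of structural sharing is block-level prefix hashing. The sketch below is an assumption-laden illustration, not any provider's design: `BLOCK`, `block_keys`, `shared_blocks`, and the `salt` parameter are hypothetical names, token ids are plain integers, and sharing is restricted to common prefixes (each key chains in the previous one) rather than arbitrary chunks.

```python
import hashlib

BLOCK = 4  # tokens per block; an arbitrary size chosen for the example


def block_keys(tokens, salt=""):
    """Return one chained hash per full block of tokens.

    Each key covers its own block plus everything before it, so two
    requests produce the same key only if they agree on the whole prefix
    up to that block. A non-empty salt (for example a tenant id) makes
    keys from different tenants never match.
    """
    keys, prev = [], salt
    full = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, full, BLOCK):
        data = prev + "|" + ",".join(map(str, tokens[i:i + BLOCK]))
        prev = hashlib.sha256(data.encode()).hexdigest()
        keys.append(prev)
    return keys


def shared_blocks(keys_a, keys_b):
    """Count how many leading blocks two requests could share."""
    count = 0
    for a, b in zip(keys_a, keys_b):
        if a != b:
            break
        count += 1
    return count


# Two users with the same 8-token "system prompt" and different follow-ups.
system_prompt = list(range(8))
user_a = system_prompt + [100, 101, 102, 103]
user_b = system_prompt + [200, 201, 202, 203]

unsalted = shared_blocks(block_keys(user_a), block_keys(user_b))
salted = shared_blocks(block_keys(user_a, "tenant-a"), block_keys(user_b, "tenant-b"))
```

Here the two requests share the two system-prompt blocks when keys are unsalted and nothing once a per-tenant salt is mixed in, which is the trade-off under discussion: the salt buys isolation at the cost of storing the common prefix once per tenant.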

	weitendorf 42 minutes ago | parent [-]
		Agree with this and I have been thinking about it recently as well. I think you could implement a cord-like vocabulary to identify large duplicated substrings for exact deduplication and pairwise correlations or vocabulary profiles/small classifiers for forward-looking or speculative deduplications. A clear example is the GPL license: it’s a large substring you might encounter often, and it's highly likely to be accompanied by lots of C code.

		This is probably something that you’d be doing on the CPU before sending anything to the GPU, though that’s definitely the sensitive surface since it’s hardware without good multitenancy. I assume the interface between the CPU and GPU is where you would be most likely to make a mistake where you start decoding data from one fd that was meant for another, or from the wrong position, and get someone else’s data.

		I wouldn’t be confident that these are active exploits from deliberately abusing KV cache optimizations though; possibly just the kind of bugs you get from active low-level performance tuning/systems work. Since this is something I have seen across providers lately, I personally suspect it to be a driver issue.

dezgeg 3 hours ago | parent | prev | next [-]

System prompt for something like Claude Code should be identical, no?

cmrdporcupine 35 minutes ago | parent | prev [-]

Could just be a bug in the radix tree for the KV cache, with deeper, wrong levels of the trie being returned for the same initial prefix match.
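
To make that bug class concrete, here is a minimal sketch of a prefix lookup that stops at the point of divergence. It is a plain per-token trie rather than a compressed radix tree, and `Node`, `PrefixCache`, and the `(owner, position, token)` tuples standing in for cached KV entries are all hypothetical, not taken from any real serving stack.

```python
class Node:
    def __init__(self):
        self.children = {}  # token id -> Node
        self.kv = None      # stand-in for the cached KV entry at this position


class PrefixCache:
    """Hypothetical token trie; one node per token position."""

    def __init__(self):
        self.root = Node()

    def insert(self, tokens, owner):
        node = self.root
        for pos, tok in enumerate(tokens):
            node = node.children.setdefault(tok, Node())
            if node.kv is None:
                # First writer populates the entry; later identical prefixes reuse it.
                node.kv = (owner, pos, tok)

    def match(self, tokens):
        """Return cached entries for the longest cached prefix of `tokens`.

        The walk stops at the first token with no matching child, so it can
        never return entries from deeper levels that belong to a different
        continuation of the same initial prefix.
        """
        node, hits = self.root, []
        for tok in tokens:
            node = node.children.get(tok)
            if node is None:
                break
            hits.append(node.kv)
        return hits


cache = PrefixCache()
cache.insert([1, 2, 3, 4], owner="A")
hits = cache.match([1, 2, 9])  # shares only the first two tokens with A's request
```

The invariant worth asserting in a real implementation is the one the last line exercises: every returned entry corresponds, position by position, to a token the caller actually sent. The hypothesised bug would be a lookup that matches `[1, 2]` and then also returns the entries for `3` and `4`.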