KaiserPro 2 days ago

One thing that's not addressed here is that the bigger you scale your shared-memory cluster, the closer you get to a 100% chance that at least one node fucks up and corrupts your global memory space.

Currently the fastest way to get data from node A to node B is to RDMA it, which means that any node can inject anything into your memory space.

I'm not really sure how Theseus guards against that.

buildbot 2 days ago | parent [-]

I’m not sure any system prevents RDMA from ruining your day :(

Back in grad school I remember we did something fairly simple but clearly illegal and wedged the machine so bad the out of band management also went down!

KaiserPro 2 days ago | parent [-]

> wedged the machine so bad the out of band management also went down!

Now that's living the dream of a shared cluster!

This is hazy now, but I do remember a massive outage of a Lustre cluster, which I think happened because a dodgy node was injecting crap into everyone's memory space via the old Lustre fast filesystem kernel driver. I think they switched to NFS export nodes after that (for the render farm and desktops, at least).