I froze a TCP connection for 10 minutes to migrate a live server (github.com)
3 points by sunchaodong 5 hours ago | 1 comment
sunchaodong 5 hours ago

Hi HN, I'm Sunchao, the creator of libccmc.

Why I built this: AWS Spot instances kill long-lived LLM inference jobs with a 2-minute warning. Losing gigabytes of KV cache and dropping client connections is painful.

How it works: CRIU checkpoints and migrates the entire process, which often breaks when the process holds GPU locks. libccmc instead extracts only the TCP connection. It dumps 80 bytes of socket state via TCP_REPAIR. During the move, an eBPF XDP program answers the client with Window=0 ACKs, which puts the client's TCP stack into persist-timer mode: it keeps the connection open and sends periodic window probes instead of timing out. We then restore the socket on the target, move the VIP (virtual IP) over to it, and the stream resumes without the client seeing a disconnect.
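To make the Window=0 reply concrete, here is a rough userspace sketch of what such an ACK looks like on the wire. This is not libccmc's code: the helper names (`build_zero_window_ack`, `tcp_checksum`) are made up for illustration, it only builds the 20-byte TCP header without options, and a real XDP program would be written in restricted eBPF C and would also rewrite the Ethernet and IP headers. The `seq` and `ack` values are assumed to come from the dumped socket state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCP_HDR_LEN  20    /* header without options */
#define TCP_FLAG_ACK 0x10

/* Write big-endian (network order) integers into a byte buffer. */
static void put16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)(v & 0xff); }
static void put32(uint8_t *p, uint32_t v) { put16(p, (uint16_t)(v >> 16)); put16(p + 2, (uint16_t)(v & 0xffff)); }

/* Add a buffer to a running ones'-complement sum (RFC 1071). */
static uint32_t csum_add(uint32_t sum, const uint8_t *p, size_t len)
{
    while (len > 1) {
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)(p[0] << 8);
    return sum;
}

/* TCP checksum over the IPv4 pseudo-header plus the TCP segment.
 * Returns 0 when run over a segment whose checksum field is already valid. */
static uint16_t tcp_checksum(const uint8_t src_ip[4], const uint8_t dst_ip[4],
                             const uint8_t *seg, size_t len)
{
    uint8_t pseudo[12];
    memcpy(pseudo, src_ip, 4);
    memcpy(pseudo + 4, dst_ip, 4);
    pseudo[8] = 0;
    pseudo[9] = 6;                      /* IPPROTO_TCP */
    put16(pseudo + 10, (uint16_t)len);

    uint32_t sum = csum_add(0, pseudo, sizeof pseudo);
    sum = csum_add(sum, seg, len);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical helper: build a bare ACK that advertises a zero receive window.
 * src_* is the frozen server side, dst_* is the client. */
static void build_zero_window_ack(uint8_t out[TCP_HDR_LEN],
                                  const uint8_t src_ip[4], const uint8_t dst_ip[4],
                                  uint16_t src_port, uint16_t dst_port,
                                  uint32_t seq, uint32_t ack)
{
    assert(out != NULL);
    memset(out, 0, TCP_HDR_LEN);
    put16(out + 0, src_port);
    put16(out + 2, dst_port);
    put32(out + 4, seq);
    put32(out + 8, ack);
    out[12] = (TCP_HDR_LEN / 4) << 4;   /* data offset in 32-bit words */
    out[13] = TCP_FLAG_ACK;
    put16(out + 14, 0);                 /* Window = 0: "stop sending, keep probing" */
    put16(out + 16, tcp_checksum(src_ip, dst_ip, out, TCP_HDR_LEN));
}
```

One caveat with a header this minimal: if the connection negotiated TCP timestamps, the peer will likely expect a timestamp option on every segment, so the real reply would need to carry one.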

The library is pure C and open source. I’ve tested it by keeping live SSE streams suspended for 10 minutes with zero dropped connections.

I would love your feedback on the eBPF mechanics, the TCP_REPAIR sequence, or any TCP edge cases I might have missed!