I froze a TCP connection for 10 minutes to migrate a live server (github.com)
3 points by sunchaodong 5 hours ago | 1 comment
sunchaodong 5 hours ago
Hi HN, I'm Sunchao, the creator of libccmc.

Why I built this: AWS Spot instances kill long-lived LLM inference jobs with a 2-minute warning. Losing gigabytes of KV cache and dropping client connections is painful.

How it works: CRIU migrates the entire process, which often breaks when the process holds GPU locks. libccmc surgically extracts only the TCP connection instead. It dumps 80 bytes of socket state via TCP_REPAIR. During the move, an eBPF XDP program replies to the client with Window=0 ACKs, which puts the client's TCP stack into the persist state: it keeps the connection open and sends periodic window probes rather than new data. We then restore the socket on the target, move the virtual IP (VIP) over to it, and the stream continues seamlessly.

The library is pure C and open source. I’ve tested it by keeping live SSE streams suspended for 10 minutes with zero drops.

I would love your feedback on the eBPF mechanics, the TCP_REPAIR sequence, or any TCP edge cases I might have missed!
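For readers unfamiliar with the zero-window trick, here is a minimal user-space sketch of what a "Window=0 ACK" looks like on the wire. It is not libccmc's code and not an XDP program: `struct frozen_conn`, `build_zero_window_ack`, and `tcp_checksum` are hypothetical names made up for illustration, and it assumes IPv4 with a 20-byte TCP header and no options. The idea is only that the reply acknowledges what the server had already received (`rcv_nxt`), advertises a window of 0, and carries a valid checksum.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical snapshot of the frozen connection (all values in host byte order). */
struct frozen_conn {
    uint32_t server_ip, client_ip;
    uint16_t server_port, client_port;
    uint32_t snd_nxt;   /* next sequence number the server would send */
    uint32_t rcv_nxt;   /* next sequence number expected from the client */
};

/* Write big-endian (network order) integers into a byte buffer. */
static void put16(uint8_t *p, uint16_t v) { p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v; }
static void put32(uint8_t *p, uint32_t v) { put16(p, (uint16_t)(v >> 16)); put16(p + 2, (uint16_t)v); }

/* RFC 1071 ones'-complement sum, folded to 16 bits (not inverted). */
static uint16_t ones_sum(const uint8_t *buf, size_t len, uint32_t acc)
{
    for (size_t i = 0; i + 1 < len; i += 2)
        acc += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        acc += (uint32_t)buf[len - 1] << 8;
    while (acc >> 16)
        acc = (acc & 0xffff) + (acc >> 16);
    return (uint16_t)acc;
}

/* TCP checksum over the IPv4 pseudo-header plus the segment (server -> client). */
uint16_t tcp_checksum(const struct frozen_conn *c, const uint8_t *seg, size_t len)
{
    uint8_t pseudo[12];
    put32(pseudo, c->server_ip);
    put32(pseudo + 4, c->client_ip);
    pseudo[8] = 0;
    pseudo[9] = 6;                          /* IPPROTO_TCP */
    put16(pseudo + 10, (uint16_t)len);
    return (uint16_t)~ones_sum(seg, len, ones_sum(pseudo, sizeof pseudo, 0));
}

/* Fill `out` with a bare 20-byte TCP header: ACK flag set, window = 0. */
size_t build_zero_window_ack(const struct frozen_conn *c, uint8_t out[20])
{
    memset(out, 0, 20);
    put16(out, c->server_port);
    put16(out + 2, c->client_port);
    put32(out + 4, c->snd_nxt);
    put32(out + 8, c->rcv_nxt);             /* do not accept any probe byte */
    out[12] = 5 << 4;                       /* data offset: 5 words, no options */
    out[13] = 0x10;                         /* ACK */
    put16(out + 14, 0);                     /* advertised window = 0 */
    put16(out + 16, tcp_checksum(c, out, 20));
    return 20;
}
```

In an actual XDP program the same fields would presumably be rewritten in place on the incoming probe and the frame bounced back with `XDP_TX`; that part is an assumption on my side and is not shown here.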