dangoodmanUT 9 hours ago

Alright, nerd-snipe snooping research post happening now!

Seems like they are using JuiceFS under the hood, with an overlay root for the CoW semantics. JuiceFS gives them instant clone (because they're not copying the whole rootfs), while changes to the overlay go through overlayfs and are probably synced back to S3 via a custom block device they have mounted into Firecracker.
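For reference, the overlay-root setup would look roughly like this (paths are illustrative, not theirs):

```
# base rootfs stays read-only and shared (lowerdir); per-VM writes land in
# an upper layer on a writable device that can be synced back to S3.
# workdir must be on the same filesystem as upperdir.
mount -t overlay overlay \
    -o lowerdir=/srv/base-rootfs,upperdir=/srv/vm-123/upper,workdir=/srv/vm-123/work \
    /srv/vm-123/merged
```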

You can also see they are using JuiceFS directly for the "policy" mount (which I'm assuming is the network policy functionality). iirc JuiceFS has support for block devices too, so maybe they are using that to back the rootfs overlay.

One concerning thing is the `/var/lib/docker` mount - I ran this in an Ubuntu container; did they... attach it? Maybe that's a coincidence, but Docker is not installed on the sprite by default. (The terminal is also super busted when used through an Ubuntu container.)

https://pastebin.com/raw/kt6q9fuA (edit: moved terminal output to pastebin because it was so ugly here)

I played with a similar stack recently. My guess is they are:

1. making some base VM and snapshotting it
2. when you create a VM, just restoring a copy and pushing metadata to it (probably via one of the mounts; see the clone sketch below)
3. storing any changes you make to the rootfs on the JuiceFS block device (the overlay), which is relatively minimal compared to the base OS

JuiceFS also supports snapshotting, so that's probably how they support memory + filesystem snapshot and restore so quickly.
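If they're on a recent JuiceFS, step 2 could be as cheap as a metadata-only clone, something like this (paths invented, `/jfs` assumed to be the JuiceFS mountpoint):

```
# juicefs clone copies only metadata; no data blocks get duplicated in S3,
# which is what would make per-VM "instant clone" basically free
juicefs clone /jfs/images/base-rootfs /jfs/vms/vm-123/rootfs
```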

Interestingly, it seems they provision a max disk size of maybe 100GB total for checkpoints?

```
NAME  TYPE SIZE FSTYPE MOUNTPOINTS
loop0 loop 100G        /.sprite/checkpoints/active
```
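If that loop device is backed by a sparse file, the 100G is a ceiling rather than an actual allocation. Provisioning it would look something like this (paths and filesystem are my guesses):

```
# sparse file: consumes real space only as checkpoints are written
truncate -s 100G /.sprite/checkpoints/active.img
LOOP=$(losetup --find --show /.sprite/checkpoints/active.img)
mkfs.ext4 "$LOOP"
mount "$LOOP" /.sprite/checkpoints/active
```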

FUSE is definitely being used within the VMM; I can see a fuse mount and an ID being assigned. They're probably using JuiceFS directly for the policy mount because that doesn't need to be local-NVMe-cached, just consistent. The local-NVMe -> S3 write-through runs on the hypervisor through a custom block device they attach to the Firecracker VMM. This might just be the `--cache-dir` + `--writeback` cache options in JuiceFS. Wild guess is just one file per block.
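Those JuiceFS options are real flags; a mount with an NVMe-backed write-back cache would look like this (metadata URL and paths are made up):

```
# --cache-dir: stage reads/writes on local NVMe
# --writeback: ack writes into the local cache, upload to S3 asynchronously
juicefs mount --cache-dir /mnt/nvme/jfs-cache --writeback \
    redis://meta.internal:6379/1 /jfs
```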

Guessing the "S3" here is Tigris, since fly.io seems to have a relationship with them, and that probably keeps latency down for the filesystem.
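If it is Tigris, formatting a JuiceFS volume against it would be roughly this (bucket URL and credentials are placeholders; Tigris exposes an S3-compatible endpoint at fly.storage.tigris.dev):

```
# --bucket URL and volume/metadata names are hypothetical
juicefs format --storage s3 \
    --bucket https://sprite-rootfs.fly.storage.tigris.dev \
    --access-key "$AWS_ACCESS_KEY_ID" --secret-key "$AWS_SECRET_ACCESS_KEY" \
    redis://meta.internal:6379/1 sprite-vol
```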