jitl 3 days ago

I am testing a distributed database-like system at work that makes heavy use of swap. At startup, we read a table from S3 and compute a recursive materialized view over it. This needs about 4TB of “memory” per node while computing, which we provide as 512GB of RAM + 3900GB of NVMe zswap-enabled swap devices. Once the computation is complete, we’re left with a much smaller working-set index (about 400GB) that we use to serve queries. For this use case, swap is a performant and less labor-intensive alternative to manually spilling the computation to disk in application code (although there is some mlock going on; it’s not entirely automatic). This is like a very extreme version of the initialization-only pages idea discussed in the article.
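
The pinning piece is essentially mlock(2) over the serving index's address range. A minimal Python sketch of that idea via ctypes on glibc Linux (hypothetical and scaled down; not the actual system's code):

    import ctypes, ctypes.util, mmap

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

    # Anonymous mapping standing in for the serving index
    # (400MB here, ~400GB in the real system).
    index = mmap.mmap(-1, 400 * 2**20)

    # Pin these pages so the kernel never swaps them out; everything
    # else in the process stays eligible for zswap/swap.
    addr = ctypes.addressof(ctypes.c_char.from_buffer(index))
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(len(index))) != 0:
        raise OSError(ctypes.get_errno(), "mlock failed (check RLIMIT_MEMLOCK)")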

The warm-up computation takes about 1/4 the time if it can live entirely in RAM, but using NVMe as “discount RAM” reduces the United States dollar cost of the system by 97% compared to RAM-only.
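
Back-of-the-envelope version of that comparison, with purely illustrative $/GB placeholders (the 97% figure reflects real instance pricing, which these made-up numbers won't reproduce exactly):

    # Hypothetical unit prices, for illustration only.
    RAM_PER_GB, NVME_PER_GB = 10.00, 0.25

    ram_only = 4096 * RAM_PER_GB                    # 4TB of RAM per node
    hybrid = 512 * RAM_PER_GB + 3900 * NVME_PER_GB  # 512GB RAM + 3.9TB NVMe

    print(f"per-node saving: {1 - hybrid / ram_only:.0%}")  # ~85% at these prices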

zozbot234 3 days ago | parent | next [-]

The problem with heavy swapping on NVMe (or other flash memory) is that it wears out the flash storage very quickly, even for seemingly "reasonable" workloads. In a way, the high performance of NVMe can work against you. Definitely something you want to keep an eye on via SMART or similar wear-out stats.
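
On Linux, something like this reads the relevant wear counters (assumes smartmontools 7+ for JSON output; field names are from its NVMe health log, so treat it as a sketch):

    import json, subprocess

    # Ask smartctl for the NVMe health log as JSON.
    report = json.loads(subprocess.check_output(
        ["smartctl", "-j", "-a", "/dev/nvme0"]))
    log = report["nvme_smart_health_information_log"]

    # "Data Units Written" counts thousands of 512-byte units per the NVMe spec.
    tb_written = log["data_units_written"] * 512_000 / 1e12
    print(f"{tb_written:.1f} TB written, "
          f"{log['percentage_used']}% of rated life used")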

ciupicri 3 days ago | parent | next [-]

For what it's worth, these are the lifetime estimates for the Micron 7450 SSD [1]:

    Model  Capacity  4K Rand  128K Seq
               [GB]    [TBW]     [TBW]
    PRO        3840    7_300    24_400
    PRO        7680   14_000    48_800
    MAX        3200   17_500    30_900
    MAX        6400   35_000    61_800
> Values represent the theoretical maximum endurance for the given transfer size and type. Actual lifetime will vary by workload …

> Total bytes written calculated assuming drive is 100% full (user capacity) with workload of 100% random aligned 4KB writes.

[1]: page 6/17, https://assets.micron.com/adobe/assets/urn:aaid:aem:d133a40b...
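
Over the usual 5-year warranty period, those 4K random figures work out to roughly 1 drive write per day (DWPD) for the PRO and 3 DWPD for the MAX:

    # DWPD = TBW / (capacity in TB * warranty days); 5-year warranty assumed.
    for model, cap_gb, tbw in [("PRO", 3840, 7_300), ("PRO", 7680, 14_000),
                               ("MAX", 3200, 17_500), ("MAX", 6400, 35_000)]:
        dwpd = tbw / (cap_gb / 1000 * 365 * 5)
        print(f"{model} {cap_gb}GB: {dwpd:.1f} DWPD")  # ~1.0 PRO, ~3.0 MAX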

inkyoto 3 days ago | parent | prev | next [-]

Not an issue for the commenter – since they mentioned S3, they are either using AWS EBS or instance-attached scratch NVMe drives, which the vendor (AWS) takes care of.

The AWS control plane will detect an ailing SSD backing an EBS volume and proactively evacuate the data before the physical storage goes pear-shaped.

If it is an EC2 instance with an instance-attached NVMe, the control plane will issue an alert that can be acted upon automatically, and the instance can be bounced and replaced with a new one allocated from the pool of the same instance type, which comes with a fresh NVMe. Provided, of course, that the design and implementation of the running system are stateless and can rebuild the working set upon a restart.

jitl 3 days ago | parent [-]

EBS is slow. No way we would use it for swap. Gotta be an instance storage device. And yes, we can rebuild a node from source data; we do so regularly to release changes anyway.

inkyoto 3 days ago | parent [-]

I figured that you were using instance-attached NVMe drives since you mentioned the scale of your load – EBS, even with the io2 Block Express volume type, can't keep up with a physical NVMe drive on high-intensity I/O tasks.

Regardless, AWS takes care of the hardware cycling/migration in either case.

jitl 3 days ago | parent | prev | next [-]

Let’s say, hypothetically, that we’re spending $1 million on hardware with the swap setup.

At that price point, either we use swap and let the kernel engineers move data from RAM to disk and back, or we disable swap and need user-space code to move the same data to disk and back. We’d need to price out writing & maintaining the user-space implementation (mmap perhaps?) for it to be a fair price comparison.
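
The mmap flavor of that user-space approach might look roughly like this (a sketch, not our implementation; the path and sizes are made up): back the scratch space with a file on the NVMe and let the page cache do the spilling.

    import mmap, os

    # File on the NVMe stands in for the swap device (4GB here, ~4TB for real).
    fd = os.open("/mnt/nvme/scratch.bin", os.O_RDWR | os.O_CREAT, 0o600)
    os.ftruncate(fd, 4 * 2**30)

    buf = mmap.mmap(fd, 0)         # kernel pages this in/out much like swap
    buf.madvise(mmap.MADV_RANDOM)  # skip readahead for random access (Linux, Py 3.8+)

    buf[0:5] = b"hello"  # writes land in page cache, spill to disk under pressure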

To avoid SSD wear and tear, we could spend $29 million a year more to put the data in RAM only. Not worth!

(We rent EC2 instances from AWS, so SSD wear is baked into the pricing)

p_ing 3 days ago | parent | prev | next [-]

What you stated is overall not true, but even if it were, who cares with a 97% cost saving vs RAM? Just pop in another NVMe when one fails.

justsomehnguy 2 days ago | parent | prev | next [-]

> that it wears out the flash storage very quickly

Only if you use consumer-grade flash for a non-consumer-grade workload.

For anything rated at DWPD >= 1 it's not an issue, e.g.:

https://news.ycombinator.com/item?id=45273937

dsr_ 3 days ago | parent | prev | next [-]

Have you considered having one box with 4TB of RAM to do the computation, then sending it around to all the other nodes?

jitl 3 days ago | parent [-]

Each node handles an independent ~4TB shard of data in horizontal scale-out fashion. Perhaps we could try some complex shenanigans where we rent 4TB-RAM nodes, compute, ship the results to 512GB-RAM nodes, then terminate the 4TB nodes, but that’s a bunch of extra complexity for not much of a win.

dist-epoch 3 days ago | parent | prev [-]

What's the reduction of cost measured in Euros though?