the8472 2 days ago

> Large reads (>128kB) get directly streamed from the underlying S3 bucket, which is free.

Always uncached? S3 has pretty bad latency.

MontyCarloHall 2 days ago | parent | next [-]

The threshold at which the cache gets used is configurable, with 128kB the default. The assumption is that any read larger than the threshold will be a long sustained read, for which latency doesn't matter too much. My question is, do reads <128kB (or whatever the threshold is) from files >128kB get saved to the cache, or is it only used for files whose overall size is under the threshold? Frequent random access to large files is a textbook use case for a caching layer like this, but its cost will be substantial in this system.
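The per-read vs. per-file distinction can be made concrete with a tiny sketch (the function and policy names are my guesses, not from the article):

```python
# Hypothetical sketch of the threshold check being discussed: small reads go
# through the NVMe cache, large ones stream straight from S3. All names here
# are invented for illustration.
CACHE_THRESHOLD = 128 * 1024  # 128 kB default, per the article

def route_read(read_size: int, threshold: int = CACHE_THRESHOLD) -> str:
    """Decide whether a read is served via the cache or streamed from S3."""
    return "cache" if read_size <= threshold else "s3_stream"

# The open question: under a per-read policy, a 64 kB random read inside a
# 10 GB file would still hit the cache; under a per-file policy keyed on
# total object size, it would not.
print(route_read(64 * 1024))        # cache
print(route_read(1 * 1024 * 1024))  # s3_stream
```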

the8472 2 days ago | parent [-]

NVMe read latency is in the 10-100µs range for 128kB blocks. S3 is about 100ms. That's 3-4 OOMs. The threshold where the total read duration starts to dominate latency would be somewhere in the dozens to hundreds of megabytes, not kilobytes.
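Back-of-envelope, assuming ~1 GB/s sustained streaming throughput (my assumption; the latency figures are the ones above):

```python
# Find the read size at which transfer time, rather than first-byte latency,
# starts to dominate. Throughput is an illustrative assumption.
s3_latency_s = 100e-3      # ~100 ms first-byte latency for S3
nvme_latency_s = 100e-6    # ~100 µs for a local NVMe read
throughput_Bps = 1e9       # assumed ~1 GB/s sustained throughput

# Extra latency paid by going to S3 instead of the NVMe cache:
extra_latency = s3_latency_s - nvme_latency_s

# A read is latency-dominated while transfer time < extra latency;
# break-even is where the two are equal.
break_even_bytes = extra_latency * throughput_Bps
print(f"{break_even_bytes / 1e6:.0f} MB")  # ~100 MB, nowhere near 128 kB
```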

MontyCarloHall 2 days ago | parent | prev | next [-]

I agree, it's an oddly low threshold. The latency differential of NFS vs. S3 is a couple OOMs, so a threshold of ~10MB seems more appropriate to me. Perhaps it's set intentionally low to avoid racking up immense EFS bills? Setting it higher would effectively mean getting billed $0.03/GB for a huge fraction of reads, which is untenable for most people's applications.
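To put the $0.03/GB figure in perspective, with an invented workload:

```python
# Rough cost illustration for per-GB read billing. The workload numbers are
# made up for the example; only the $0.03/GB rate comes from the thread.
price_per_gb = 0.03
daily_read_gb = 1024  # assume 1 TB/day of reads billed at the per-GB rate
monthly_cost = price_per_gb * daily_read_gb * 30
print(f"${monthly_cost:.2f}/month")  # $921.60/month
```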

mgdev 2 days ago | parent [-]

Once upon a time S3 used to cache small objects in their keymap layer, which IIRC had a similar threshold. I assume whatever new caching layer they added is piggybacking that.

This keeps the new caching layer simple and takes advantage of the existing caching. If they went any bigger, they'd likely need to rearchitect parts of the keymap or the underlying storage layer to accommodate, or else face unpredictable TCO.

antonvs 2 days ago | parent | prev [-]

> NVMe read latency is in the 10-100µs range for 128kB blocks. S3 is about 100ms. That's 3-4 OOMs.

Aren't you comparing local in-process latency to network latency? That's multiple OOM right there.

the8472 2 days ago | parent [-]

No, within the same DC, network latency does not add that much; after all, EFS manages a 600µs average latency. It's really just S3 that's slow. I assume a large fraction of S3 data sits on HDDs, not SSDs.

huntaub 2 days ago | parent | prev [-]

I imagine (hope) that they are doing some kind of intelligent read-ahead in the frontend servers to optimize for sequential reads, which would keep this from looking terrible for applications.