Twirrim 5 days ago

CXL is going to be really interesting.

On the positive side, you can scale out memory quite a lot, fill up PCI slots, even have memory external to your chassis. Memory tiering has a lot of potential.

On the negative side, you've got latency costs to swallow. You don't get distance from the CPU for free (there's a reason the memory on your motherboard sits as close to the CPU as practical): https://www.nextplatform.com/2022/12/05/just-how-bad-is-cxl-.... The CXL 2.0 spec adds roughly 200ns of latency to every access to CXL-attached memory, so you've got to think carefully about how you use it, or you'll cripple yourself.
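
To make the dependent-load cost concrete, here's a minimal pointer-chasing sketch in C (the sizes and the use of rand() are just illustrative). Every load depends on the previous one, so each extra ~200ns is paid in full instead of being overlapped:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)  /* 16M entries (~128 MiB), big enough to defeat the caches */

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        for (size_t i = 0; i < N; i++) next[i] = i;

        /* Sattolo's algorithm: one big random cycle, so the hardware
         * prefetcher can't guess the next address. */
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (size_t i = 0; i < N; i++) p = next[p];  /* each load waits on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per dependent load (p=%zu)\n", ns / N, p);
        return 0;
    }

Run it twice under numactl --membind, once against local DRAM and once against a CXL-backed node, and the per-load delta is the latency you're signing up for.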

There's been work on the OS side around data locality, but CXL hardware hasn't been widely available, so there's an element of "well, we'll have to see".

Azure has some interesting whitepapers out as they've been investigating ways to use CXL with VMs, https://www.microsoft.com/en-us/research/wp-content/uploads/....

tanelpoder 5 days ago | parent | next [-]

Yup, for best results you wouldn't just dump your existing pointer-chasing and linked-list data structures onto CXL (the way Optane's transparent Memory Mode did).

But CXL-backed memory can use your CPU caches as usual, and PCIe 5.0 lane throughput is still good, assuming the CXL controller/DRAM side doesn't become a bottleneck. So you could design your engines and data structures to account for these tradeoffs: fetching/scanning columnar data structures, prefetching to hide latency, etc. You probably don't want to have global shared locks and frequent atomic operations on CXL-backed shared memory (once that becomes possible in theory with CXL 3.0).
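
As a sketch of the prefetch-to-hide-latency idea (the distance constant is a guess, not a tuned value), you can keep many slow loads in flight at once while scanning a column instead of paying for each one serially:

    #include <stdint.h>
    #include <stddef.h>

    /* Prefetch distance in elements; the right value depends on the
     * latency-bandwidth product of the CXL path, so it needs tuning. */
    #define PF_DIST 64

    int64_t sum_column(const int64_t *col, size_t n) {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&col[i + PF_DIST], 0, 0);  /* read, low temporal locality */
            sum += col[i];
        }
        return sum;
    }

For a purely sequential scan the hardware prefetcher does much of this for you anyway; software prefetching earns its keep on strided or index-driven access patterns.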

Edit: I'll plug my own article here - if you've wondered whether there were actual large-scale commercial products that used Intel's Optane as intended, Oracle Database took good advantage of it (both the Exadata and plain database engines). One use was low-latency durable (local) commits on Optane:

https://tanelpoder.com/posts/testing-oracles-use-of-optane-p...

VMware supports it as well, but more simply, as a layer for tiered memory.

packetlost 5 days ago | parent | next [-]

> You probably don't want to have global shared locks and frequent atomic operations on CXL-backed shared memory (once that becomes possible in theory with CXL 3.0).

I'd bet contended locks spend more time in cache than most other lines of memory, so in practice a global lock might not be too bad.

tanelpoder 5 days ago | parent [-]

Yep, agreed, for single-host CXL scenarios. I wrote this comment thinking about a hypothetical future CXL 3.x+ scenario with multi-host fabric coherence, where one could in theory put the locks and control structures that protect shared access to CXL memory pools into the same shared CXL memory (so no need for coordination over the regular network, at least).
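
A purely hypothetical sketch of what that might look like (plain C11 atomics; the /dev/cxl_pool device path and the one-word lock layout are made up, since nothing shipping exposes this yet):

    #include <stdatomic.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Control block at the start of the shared pool; every host maps it. */
    struct pool_ctrl {
        _Atomic int lock;   /* 0 = free, 1 = held */
    };

    static struct pool_ctrl *map_pool(const char *path) {
        int fd = open(path, O_RDWR);   /* e.g. "/dev/cxl_pool" (hypothetical) */
        if (fd < 0) return NULL;
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }

    static void pool_lock(struct pool_ctrl *c) {
        int expected = 0;
        /* Under contention the line holding 'lock' ping-pongs between
         * hosts across the fabric, which is why you'd keep such hot
         * words off the far tier whenever you can. */
        while (!atomic_compare_exchange_weak_explicit(
                   &c->lock, &expected, 1,
                   memory_order_acquire, memory_order_relaxed))
            expected = 0;
    }

    static void pool_unlock(struct pool_ctrl *c) {
        atomic_store_explicit(&c->lock, 0, memory_order_release);
    }

The upside is exactly the one above: the lock and the data it protects live in the same coherence domain, so there's no separate network round trip to a lock manager.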

samus 4 days ago | parent | prev [-]

DBMSs have been managing storage tiers with different access times for decades, so it should be pretty easy to adapt an existing engine. Or you could use it as a gigantic swap space; no clue whether additional kernel patches would be required for that.

GordonS 5 days ago | parent | prev | next [-]

Huh, 200ns is less than I imagined; even if that's almost 100x slower than regular RAM, it's still around 100x faster than NVMe storage.

Dylan16807 5 days ago | parent | next [-]

Regular RAM is 50-100ns.

jauntywundrkind 5 days ago | parent | prev [-]

Most cross-socket traffic is >100ns.

immibis 5 days ago | parent | prev | next [-]

What kind of motherboard, CPU, cables, switches, and end devices would I need to buy to have a CXL network?

afr0ck 5 days ago | parent | next [-]

CXL uses the PCIe physical layer, so you just need to buy hardware that understands the protocol, namely the CPU and the expansion boards. AMD Genoa (e.g. EPYC 9004) supports CXL 1.1, as do Intel Sapphire Rapids and all subsequent models. For CXL memory expansion boards, you can buy from Samsung or Marvell. I got a 128 GB model from Samsung with 25 GB/s read throughput.

wmf 5 days ago | parent | prev [-]

CXL networking is still in the R&D stage.

temp0826 4 days ago | parent | prev | next [-]

I have never had to go deep into NUMA configuration personally but couldn't it be leveraged here?

wmf 4 days ago | parent [-]

Yes, if you want your app to be aware of CXL you can configure it as a separate NUMA node.
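
A minimal sketch of that with libnuma (node 2 is an assumption; check where the CPU-less CXL node actually lands with numactl -H):

    #include <numa.h>    /* link with -lnuma */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        int cxl_node = 2;        /* assumed: the memory-only node backed by CXL */
        size_t sz = 1UL << 30;   /* 1 GiB */

        /* Place this allocation explicitly on the CXL-backed node,
         * keeping hot structures in regular local DRAM as usual. */
        void *buf = numa_alloc_onnode(sz, cxl_node);
        if (!buf) { perror("numa_alloc_onnode"); return 1; }
        memset(buf, 0, sz);      /* touch the pages so they fault in on that node */

        numa_free(buf, sz);
        return 0;
    }

If you'd rather not touch the code at all, numactl --membind=2 ./app pins an unmodified process's allocations to that node the same way.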

tanelpoder 4 days ago | parent [-]

Optane memory modules also present themselves as separate (memory-only) NUMA nodes. They've given me a chance to play with Linux tiered memory without having to emulate the hardware for a VM.

imtringued 4 days ago | parent | prev [-]

The latency concern is completely overblown because CXL has cache coherence. The moment you do a second request to the same page it will be a cache hit.

I would be more worried about memory bandwidth. You can now add so much memory to your servers that it might take minutes to do a full in-memory table scan.
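
Back-of-the-envelope, using the ~25 GB/s per expander figure mentioned upthread: four devices give you 512 GB at maybe 100 GB/s aggregate, a ~5 second scan; scale that same setup to 8-16 TB and a single pass is already well over a minute.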

justincormack 4 days ago | parent [-]

Cache lines are 64 bytes, not page size.