foota 3 days ago

Out of curiosity, you seem knowledgeable here, is it possible to do NVME over RDMA in public cloud (e.g., on AWS)? I was recently looking into this and my conclusion was no, but I'd love to be wrong :)

stonogo 3 days ago

Amazon FSx for Lustre is the product. They do have information on DIY with the underlying tech: https://aws.amazon.com/blogs/hpc/scaling-a-read-intensive-lo...

foota 3 days ago

Thanks for the link! I had seen this, but it wasn't clear to me how to configure the host as an NVMe-oF target, or whether it would actually bypass the host CPU. The article (admittedly now 4 years old) cites single-digit GB/s, while I was really hoping for something closer to full NVMe bandwidth. Maybe that's just a reflection of the time, though; drives have gotten a lot faster since then.

Edit: this is more like what I was hoping for: https://aws.amazon.com/blogs/aws/amazon-fsx-for-lustre-unloc... although I wasn't looking for a file system product. What I was really hoping for was a tutorial along the lines of "create a couple of VMs, store a file on one, do XYZ, and then read it from another with this API", or at least some first-party documentation on how to use these things together.
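
For anyone who lands here later: the generic (non-AWS) way to set up the target side is through the nvmet configfs tree. A rough, untested sketch of that flow, assuming a kernel-supported RDMA NIC (IB/RoCE), with the device path, address, and NQN all made up:

    # Rough sketch of the standard nvmet configfs flow for an RDMA target.
    # Assumes the nvmet and nvmet-rdma modules are loaded, /dev/nvme1n1 is the
    # drive to export, and 10.0.0.10 is an address on an RDMA-capable NIC the
    # kernel supports (IB/RoCE). Run as root.
    import os

    CFG = "/sys/kernel/config/nvmet"
    NQN = "nqn.2024-01.example:testsubsys"

    def write(path, value):
        with open(path, "w") as f:
            f.write(str(value))

    # Create the subsystem and attach the drive as namespace 1.
    subsys = f"{CFG}/subsystems/{NQN}"
    os.mkdir(subsys)
    write(f"{subsys}/attr_allow_any_host", 1)
    os.mkdir(f"{subsys}/namespaces/1")
    write(f"{subsys}/namespaces/1/device_path", "/dev/nvme1n1")
    write(f"{subsys}/namespaces/1/enable", 1)

    # Create an RDMA-transport port and link the subsystem to it.
    port = f"{CFG}/ports/1"
    os.mkdir(port)
    write(f"{port}/addr_trtype", "rdma")
    write(f"{port}/addr_adrfam", "ipv4")
    write(f"{port}/addr_traddr", "10.0.0.10")
    write(f"{port}/addr_trsvcid", 4420)
    os.symlink(subsys, f"{port}/subsystems/{NQN}")

The initiator side would then be the usual nvme connect -t rdma -a 10.0.0.10 -s 4420 -n <NQN> from nvme-cli.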

stonogo 3 days ago

Probably something more like this? https://github.com/aws-samples/amazon-fsx-tutorial/tree/mast...

foota 2 days ago

So... the issue is that I'm not using Lustre. As far as I can tell, NVMe over Fabrics (NVMe-oF) over RDMA is implemented by the kernel modules nvmet-rdma and nvme-rdma (the first being the target side). These modules support InfiniBand and, I think, Fibre Channel, but _not_ EFA, and EFA itself is not an implementation of InfiniBand. There are user-space libraries that paper over these differences when all you need is a network transport (e.g., libfabric), and EFA sort of pretends to be IB, but afaict that's just meant to ease integration at the user-space level. Unfortunately, since the NVMe-oF kernel modules have no EFA support, it doesn't seem possible to use them without Lustre.

I don't know exactly how they're doing it for Lustre clients. There does seem to be a Lustre client kernel module, so my guess is that it's in there? The Lustre networking module, lnet, does have an EFA integration, but it appears to be only a network transit. I don't see anything in Lustre about NVMe-oF though, so I'm not sure.

Maybe there's something I'm missing though, and it'll just work if I give it a try :)
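
For anyone who wants to poke at the same thing: the kernel-visible side is easy to inspect with plain sysfs, which is how you can compare what an EFA instance exposes against a machine with an IB/RoCE NIC. A rough sketch (nothing EFA-specific assumed):

    # List the RDMA devices the kernel's IB subsystem knows about, with their
    # node type and per-port link layer. On an IB/RoCE NIC the link layer shows
    # up as "InfiniBand" or "Ethernet"; whatever EFA reports, nvme-rdma has no
    # code path for it today.
    import glob, os

    for dev in sorted(glob.glob("/sys/class/infiniband/*")):
        with open(os.path.join(dev, "node_type")) as f:
            node_type = f.read().strip()
        link_layers = []
        for port in sorted(glob.glob(os.path.join(dev, "ports", "*"))):
            with open(os.path.join(port, "link_layer")) as f:
                link_layers.append(f.read().strip())
        print(os.path.basename(dev), node_type, link_layers)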

lustre-fan 2 days ago

Yeah, Lustre supports EFA as a network transit between a Lustre client and a Lustre server. It's lnet/klnds/kefalnd/ in the Lustre tree. But Lustre doesn't support NVMeoF directly. It uses a custom protocol. And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it. EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it.

foota 2 days ago

Hey, thanks for the comment! Also, I'm amused by the specificity of your account haha, do you have something set to monitor HN for mentions of Lustre?

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

Could you link me to this? I searched the lustre repo for nvme and didn't see anything that looked promising, but would be curious to read how this works.

> "And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it."

To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

> "EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it."

I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA? Is that https://kernel.googlesource.com/pub/scm/linux/kernel/git/tyc...? I found this, but I wasn't sure whether it was actually the dataplane (for lack of a better word) part of things; from what I read, it sounded like most of the dataplane was implemented in user space as part of libfabric, but it sounds like I might be wrong.

Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

In case you're curious: I have a stateful service with an NVMe-backed cache over object storage, and I've been wondering what it would take to run some proxy services that could read directly from that cache, to scale out the read throughput from an instance.

lustre-fan a day ago

> do you have something set to monitor HN for mentions of Lustre?

Nothing, besides browsing Hacker News a bit too much.

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

To be specific, Lustre is a parallel filesystem. Think of it like a bigger version of NFS. You format the NVMe drives as ext4 or ZFS and mount them as Lustre servers. Once you have an MGS, MDS, and OSS, you can mount the servers as a filesystem. Lustre won't export the NVMe to the client as a block device, but you could mount individual Lustre files as a block device if you want.
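
If you did want the block-device route, one way is an ordinary loop device over a file on the Lustre mount. A rough sketch, assuming the client already has the filesystem mounted at /mnt/lustre (paths and sizes are made up, needs root):

    # Back a loop device with an ordinary file that lives on Lustre, then treat
    # it like any other disk. Everything below shells out to standard tools.
    import os
    import subprocess

    backing = "/mnt/lustre/cache/blockdev0.img"   # ordinary file on the Lustre mount

    subprocess.run(["truncate", "-s", "100G", backing], check=True)
    loopdev = subprocess.run(
        ["losetup", "--find", "--show", backing],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    subprocess.run(["mkfs.ext4", loopdev], check=True)
    os.makedirs("/mnt/blockdev0", exist_ok=True)
    subprocess.run(["mount", loopdev, "/mnt/blockdev0"], check=True)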

> To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

Essentially, yeah.

> I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA?

Yes. EFA implements kernel verbs support. Normal user-space applications use user verbs, i.e. https://www.kernel.org/doc/html/latest/infiniband/user_verbs.... Kernel verbs support allows kernel-space applications to also use EFA. This is currently implemented in the out-of-tree version of the EFA driver https://github.com/amzn/amzn-drivers/tree/master/kernel/linu.... Lustre interfaces with that via the driver in lnet/klnds/efalnd/. NVMeoF would need some similar glue code.
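
To make the user-verbs vs. kernel-verbs split concrete: user verbs are what you can reach from user space, e.g. via the pyverbs bindings that ship with rdma-core (rough sketch below; the device name is a guess). Kernel verbs are the in-kernel API in include/rdma/ib_verbs.h, which is what efalnd uses today and what any NVMeoF glue would have to use.

    # User-verbs side only: list the RDMA devices visible from user space and
    # dump one device's attributes. Needs rdma-core's pyverbs; "efa_0" is a
    # guess at the device name (check /sys/class/infiniband/ for the real one).
    from pyverbs.device import Context, get_device_list

    for dev in get_device_list():
        print(dev.name.decode())

    ctx = Context(name="efa_0")
    print(ctx.query_device())
    ctx.close()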

> Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

There are some similarities (the EFA driver is implemented in the IB subsystem, after all). But the semantics for adding/removing ports/interfaces would be different, so it wouldn't "just work" without some changes. I don't know the full scope of the changes (I haven't dived into it too much), although I suspect support would look fairly similar to drivers/nvme/target/rdma.c.

> In case you're curious: I have a stateful service with an NVMe-backed cache over object storage, and I've been wondering what it would take to run some proxy services that could read directly from that cache, to scale out the read throughput from an instance.

If you're looking for a scale-out cache in front of S3, that's essentially the Lustre/S3 integration: https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-dr.... It's a filesystem, so I guess it depends on how your service expects to access objects.