lustre-fan 2 days ago

Yeah, Lustre supports EFA as a network transport between a Lustre client and a Lustre server. It's lnet/klnds/kefalnd/ in the Lustre tree. But Lustre doesn't support NVMeoF directly; it uses its own wire protocol. And EFA doesn't support NVMeoF either: someone would have to modify the NVMeoF RDMA target/host drivers to support it. EFA already supports in-kernel IB clients (that's how Lustre uses EFA today), so it's not an impossible task. It's just that no one has done it.

foota 2 days ago | parent [-]

Hey, thanks for the comment! Also, I'm amused by the specificity of your account haha, do you have something set to monitor HN for mentions of Lustre?

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

Could you link me to this? I searched the lustre repo for nvme and didn't see anything that looked promising, but would be curious to read how this works.

> "And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it."

To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

> "EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it."

I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA? Is that https://kernel.googlesource.com/pub/scm/linux/kernel/git/tyc...? I found this but I wasn't sure if this was actually the dataplane (for lack of a better word) part of things, from what I read it sounded like most of the dataplane was implemented in userspace as a part of libfabric, but it sounds like I might be wrong.

Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

In case you're curious, I have a stateful service that has an NVME backed cache over object storage and I've been wondering what it would take to make it so that we could run some proxy services that can directly read from that cache to scale out the read throughput from an instance.

lustre-fan a day ago | parent [-]

> do you have something set to monitor HN for mentions of Lustre?

Nothing, besides browsing Hacker News a bit too much.

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

To be specific, Lustre is a parallel filesystem. Think of it like a bigger version of NFS. You format the NVMe devices as ext4 or ZFS and mount them on the Lustre servers. Once you have an MGS, an MDS, and an OSS, you can mount the whole thing on clients as a filesystem. Lustre won't export the NVMe to the client as a block device. But you could mount an individual Lustre file as a block device, if you want.
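To make the shape of that concrete, a minimal single-node-style setup looks roughly like this. All device names, NIDs, and mount points here are made up for illustration; real deployments split the roles across servers:

```sh
# Format backing NVMe devices for the Lustre server roles
mkfs.lustre --mgs --mdt --fsname=testfs --index=0 /dev/nvme0n1   # combined MGS+MDT
mkfs.lustre --ost --fsname=testfs --index=0 \
            --mgsnode=10.0.0.1@tcp /dev/nvme1n1                  # OST served by the OSS

# Bring the servers up (mounting a Lustre target starts the service)
mount -t lustre /dev/nvme0n1 /mnt/mdt
mount -t lustre /dev/nvme1n1 /mnt/ost

# Clients mount the filesystem, not the block devices
mount -t lustre 10.0.0.1@tcp:/testfs /mnt/lustre

# And if you really want a block device, loop-mount a file on Lustre
losetup /dev/loop0 /mnt/lustre/disk.img
```

The point being: the NVMe stays private to the server; clients only ever see POSIX files (or a loop device layered on one), never the raw namespace like NVMeoF would give you.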

> To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

Essentially, yeah.

> I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA?

Yes. EFA implements kernel verbs support. Normal user-space applications use user verbs, i.e. https://www.kernel.org/doc/html/latest/infiniband/user_verbs.... Kernel verbs support allows kernel-space consumers to use EFA too. This is currently implemented in the out-of-tree version of the EFA driver https://github.com/amzn/amzn-drivers/tree/master/kernel/linu.... Lustre interfaces with it via the driver in lnet/klnds/efalnd/. NVMeoF would need some similar glue code.
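As a rough sketch of what "kernel verbs" means here: an in-kernel consumer registers an `ib_client` with the IB core and gets a callback for each RDMA device, including EFA once its kernel-verbs support is present. Something in the spirit of the lnd / hypothetical NVMeoF glue (illustrative kernel-space sketch, names invented, not buildable standalone):

```c
#include <rdma/ib_verbs.h>

/* Hypothetical in-kernel IB client, in the spirit of what the
 * EFA lnd does and what NVMeoF glue code would also need. */
static int my_add_one(struct ib_device *device)
{
        /* Called for each RDMA device the IB core knows about.
         * From here you allocate resources via kernel verbs:
         * ib_alloc_pd(), ib_create_cq(), QPs, MR registration, ... */
        pr_info("my_client: found device %s\n", device->name);
        return 0;
}

static void my_remove_one(struct ib_device *device, void *client_data)
{
        /* Tear down per-device resources. */
}

static struct ib_client my_client = {
        .name   = "my_client",
        .add    = my_add_one,
        .remove = my_remove_one,
};

static int __init my_init(void)
{
        return ib_register_client(&my_client);
}
```

User-space apps go through uverbs/libfabric instead, which is why libfabric looked like "the dataplane" to you; for Lustre the dataplane really is in the kernel.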

> Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

There are some similarities (the EFA driver is implemented in the IB subsystem, after all). But the semantics for adding/removing ports/interfaces are different, so it wouldn't "just work" without some changes. I don't know the full scope of the changes (I haven't dug into it much). Although, I suspect support would look fairly similar to drivers/nvme/target/rdma.c.
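To give a flavor of where the friction is: drivers/nvme/target/rdma.c accepts hosts through the RDMA connection manager, roughly like the sketch below (error handling trimmed, `my_cm_handler` is a made-up callback name). Whether and how that connection-establishment path maps onto EFA's semantics is exactly the part that would need work:

```c
#include <rdma/rdma_cm.h>

/* Roughly the listen path an NVMeoF RDMA target sets up:
 * bind an rdma_cm_id and wait for connection requests. */
static int nvmet_like_listen(struct sockaddr *addr)
{
        struct rdma_cm_id *cm_id;

        cm_id = rdma_create_id(&init_net, my_cm_handler, NULL,
                               RDMA_PS_TCP, IB_QPT_RC);
        if (IS_ERR(cm_id))
                return PTR_ERR(cm_id);

        if (rdma_bind_addr(cm_id, addr) || rdma_listen(cm_id, 128)) {
                rdma_destroy_id(cm_id);
                return -EIO;
        }
        /* RDMA_CM_EVENT_CONNECT_REQUEST etc. arrive in my_cm_handler(). */
        return 0;
}
```

An EFA-aware target would have to replace or adapt this kind of setup for however EFA expects endpoints to be created and addressed, which is why it's glue work rather than a config change.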

> In case you're curious, I have a stateful service that has an NVME backed cache over object storage and I've been wondering what it would take to make it so that we could run some proxy services that can directly read from that cache to scale out the read throughput from an instance.

If you're looking for a scale-out cache in front of S3, that's essentially Lustre/S3 integration https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-dr.... It's a filesystem, so I guess it depends on how your service expects to access objects.