| ▲ | charleshn 3 days ago |
| A few questions if the authors are around!

> Is hardware agnostic and uses TCP/IP to communicate.

So no RDMA? It's very hard to make effective use of modern NVMe drives' bandwidth over TCP/IP.

> A logical shard is further split into five physical instances, one leader and four followers, in a typical distributed consensus setup. The distributed consensus engine is provided by a purpose-built Raft-like implementation, which we call LogsDB

Raft-like, so not Raft - a custom algorithm? Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?

> Read/write access to the block service is provided using a simple TCP API currently implemented by a Go process. This process is hardware agnostic and uses the Go standard library to read and write blocks to a conventional local file system. We originally planned to rewrite the Go process in C++, and possibly write to block devices directly, but the idiomatic Go implementation has proven performant enough for our needs so far.

The document mentions it's designed to reach TB/s, though, which means that for an IO-intensive workload one would end up wasting a lot of drive bandwidth and requiring a huge number of nodes. Modern parallel filesystems can reach 80-90 GB/s per node using RDMA, DPDK, etc.

> This is in contrast to protocols like NFS, whereby each connection is very stateful, holding resources such as open files, locks, and so on.

This is not true for NFSv3 and older, which tend to be stateless (no notion of an open file).

No mention of how this was developed and tested - does it use formal methods, a simulator, chaos engineering, etc? |
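A rough back-of-the-envelope to put numbers on the node-count point above. The per-drive and per-NIC figures in this sketch are assumptions for illustration, not numbers from the article or this thread; the only point is that the number of nodes needed for a given aggregate bandwidth scales inversely with achievable per-node throughput.

```go
// Back-of-the-envelope only: the per-drive and per-NIC figures below are
// assumed for illustration, not taken from the article or this thread.
package main

import "fmt"

func main() {
	const (
		drivesPerNode = 10     // assumed dense flash node
		nvmeReadGBps  = 7.0    // assumed PCIe Gen4 NVMe sequential read, per drive
		tcpNodeGBps   = 12.5   // assumed: TCP path limited to roughly one 100 GbE link per node
		rdmaNodeGBps  = 85.0   // the 80-90 GB/s per node figure cited above
		targetGBps    = 1000.0 // ~1 TB/s aggregate target
	)
	fmt.Printf("raw drive bandwidth per node: %.0f GB/s\n", drivesPerNode*nvmeReadGBps)
	fmt.Printf("nodes for ~1 TB/s: %.0f at %.1f GB/s/node vs %.0f at %.0f GB/s/node\n",
		targetGBps/tcpNodeGBps, tcpNodeGBps,
		targetGBps/rdmaNodeGBps, rdmaNodeGBps)
}
```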
|
| ▲ | jleahy 3 days ago | parent | next [-] |
| > So no RDMA?

We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it’s just a transport-layer change), but so far we haven’t needed to. We’ve had to make some difficult prioritisation decisions here. PRs welcome!

> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?

We’re used to building things like this; trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated: right now there is no automatic failover enabled. Failures are rare, and we will only enable that post-Jepsen. If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.

> This is not true for NFSv3 and older, which tend to be stateless (no notion of an open file).

Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding. |
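For anyone curious how an idiomatic Go server ends up on the sendfile path: on Linux, io.Copy from an *os.File into a *net.TCPConn dispatches to the connection's ReadFrom method, which issues sendfile(2), so data moves from the page cache to the NIC without a round trip through user space. A minimal sketch, not the actual block server described above; the port, file name, and one-block-per-connection framing are invented:

```go
// Minimal sketch of a sendfile-backed block reader over plain TCP.
// NOT the actual implementation: port, file name and framing are invented.
package main

import (
	"io"
	"log"
	"net"
	"os"
)

// serveBlock streams one block file to the client. On Linux, io.Copy
// resolves to conn.ReadFrom(f), which uses the sendfile(2) syscall:
// the kernel copies pages straight from the page cache to the socket.
func serveBlock(conn *net.TCPConn, path string) error {
	f, err := os.Open(path) // blocks stored as ordinary files on a local filesystem
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(conn, f)
	return err
}

func main() {
	ln, err := net.Listen("tcp", ":9000") // illustrative port
	if err != nil {
		log.Fatal(err)
	}
	for {
		c, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(conn *net.TCPConn) {
			defer conn.Close()
			if err := serveBlock(conn, "block-0000.dat"); err != nil { // hypothetical file name
				log.Print(err)
			}
		}(c.(*net.TCPConn))
	}
}
```

That zero-copy path is the mechanism behind the "saturate the NICs with a very simple Go server" point above.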
|
| ▲ | AtlasBarfed 3 days ago | parent | prev | next [-] |
| Not to mention that you simply want a large distributed system already deployed across multiple clouds / on-prem environments / use cases, with battle-tested procedures for node failure, replacement, expansion, contraction, backup/restore, and repair/verification, plus install guides and an error "zoo". Not to mention a Jepsen test suite, a detailed CAP tradeoff explanation, etc. There's a reason the big DFSes at the FAANGs aren't really implemented anywhere else: they NEED the original authors and a big, deeply experienced infrastructure team in house. |
| |
| ▲ | menaerus 3 days ago | parent | next [-] | | The DeepSeek team, which is also an HFT shop, implemented their own DFS as well - https://github.com/deepseek-ai/3FS | |
| ▲ | Yoric 3 days ago | parent | prev [-] | | My memory is a bit sketchy, but isn't CAP worked around by the eventual consistency of Paxos/Raft/...? | | |
| ▲ | immibis 2 days ago | parent [-] | | The protocols you mentioned are always consistent. You will know if they are not consistent because they will not make progress. Yes there's a short delay where some nodes haven't learned about the new thing yet and only know that they're about to learn the new thing, but that's not what's meant by "eventual consistency", which is when inconsistent things may happen and become consistent at some time later. In Paxos or Raft, nodes that know the new consistent data is about to arrive can wait for it and present the illusion of a completely consistent system (as long as the network isn't partitioned, so the data eventually arrives). These protocols are slow, so they're usually only used for the most important coordination, like knowing which servers are online.

CAP cannot be worked around. In the event of a partition, your system is either C or A, no buts. Either the losing side of the partition refuses to process writes, and usually reads as well (ensuring consistency and ensuring unavailability), or it does not refuse (ensuring availability and ensuring data corruption). There are no third options. Well, some people say the third option is to just make sure the network is 100% reliable and a partition never occurs. That's laughable. | | |
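A toy sketch of the tradeoff described above (hypothetical code, not Paxos, Raft, or anything from this thread): a write is acknowledged only once a majority of the replica group accepts it, so a node stranded on the minority side of a partition stays consistent precisely by refusing writes, i.e. by becoming unavailable.

```go
// Hypothetical illustration of choosing C over A during a partition.
// A write commits only with acknowledgements from a majority of the
// replica group; a node that can reach fewer than a majority of peers
// refuses the write rather than risk divergence.
package main

import (
	"errors"
	"fmt"
)

var errNoQuorum = errors.New("no quorum: refusing write to preserve consistency")

// tryWrite simulates the local node plus its peers; reachable[i] says whether
// peer i would acknowledge (in a real system this would be an RPC).
func tryWrite(reachable []bool, value string) error {
	acks := 1 // the local node always acknowledges its own write
	for _, up := range reachable {
		if up {
			acks++
		}
	}
	groupSize := len(reachable) + 1
	if acks <= groupSize/2 {
		return errNoQuorum // minority side of a partition: unavailable, but consistent
	}
	fmt.Printf("committed %q with %d/%d acks\n", value, acks, groupSize)
	return nil
}

func main() {
	// A 5-replica group, as in the "one leader and four followers" setup quoted above.
	if err := tryWrite([]bool{true, true, false, false}, "x=1"); err != nil { // 3/5 acks: commits
		fmt.Println(err)
	}
	if err := tryWrite([]bool{true, false, false, false}, "x=2"); err != nil { // 2/5 acks: refused
		fmt.Println(err)
	}
}
```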
| ▲ | Yoric 2 days ago | parent | next [-] | | > Yes there's a short delay where some nodes haven't learned about the new thing yet, but that's not what's meant by "eventual consistency", which is when inconsistent things may happen and become consistent at some time later.

Thanks, I haven't looked at these problems in a while.

> In the event of a partition, your system is either C or A, no buts.

Fair enough. Raft and Paxos provide well-understood tradeoffs but not a workaround. | |
| ▲ | AtlasBarfed 2 days ago | parent | prev [-] | | Yes - so when I see a distributed system that does not tell me explicitly whether it is sacrificing consistency or availability, I get suspicious. The alternative is a system with mechanisms for tuning, on a per-request basis, what you want to prioritize - consistency or availability - and those depend on specific mechanisms for reads and writes. If a distributed system doesn't explain such things, I assume its authors made a lot of bad assumptions. |
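As a concrete illustration of per-request tuning (a hypothetical, Dynamo/Cassandra-style sketch, not a description of any system discussed in this thread): the caller chooses a consistency level per read, trading staleness for availability and latency.

```go
// Hypothetical sketch of per-request consistency tuning: a read with level ONE
// returns the freshest answer among whatever replicas responded (fast,
// available, possibly stale), while QUORUM requires responses from a majority
// and fails if a majority cannot be reached.
package main

import (
	"errors"
	"fmt"
)

type Level int

const (
	One    Level = 1 // prioritize availability/latency
	Quorum Level = 2 // prioritize consistency
)

type replica struct {
	value     string
	version   int
	reachable bool
}

var errNotEnoughReplicas = errors.New("consistency level not satisfied")

// read returns the freshest value among the replicas that responded,
// but only if enough of them responded for the requested level.
func read(replicas []replica, level Level) (string, error) {
	needed := 1
	if level == Quorum {
		needed = len(replicas)/2 + 1
	}
	got, best := 0, replica{version: -1}
	for _, r := range replicas {
		if !r.reachable {
			continue
		}
		got++
		if r.version > best.version {
			best = r
		}
	}
	if got < needed {
		return "", errNotEnoughReplicas
	}
	return best.value, nil
}

func main() {
	// Two of three replicas are unreachable (e.g. on the wrong side of a partition).
	rs := []replica{{"old", 1, true}, {"new", 2, false}, {"new", 2, false}}
	fmt.Println(read(rs, One))    // prints: old <nil>  - available but stale
	fmt.Println(read(rs, Quorum)) // prints the error - consistent but unavailable
}
```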
|
|
|
|
| ▲ | foota 3 days ago | parent | prev [-] |
| Out of curiosity - you seem knowledgeable here - is it possible to do NVMe over RDMA in a public cloud (e.g., on AWS)? I was recently looking into this and my conclusion was no, but I'd love to be wrong :) |
| |
| ▲ | stonogo 3 days ago | parent [-] | | Amazon FSx for Lustre is the product. They do have information on DIY with the underlying tech: https://aws.amazon.com/blogs/hpc/scaling-a-read-intensive-lo... | | |
| ▲ | foota 3 days ago | parent [-] | | Thanks for the link! I had seen this, but it wasn't clear to me either how to configure the host as an NVMe-oF target or whether it would actually bypass the host CPU. The article (admittedly now 4 years old) cites single-digit GB/s, while I was really hoping for something closer to full NVMe bandwidth. Maybe that's just a reflection of the time, though; drives have gotten a lot faster since then.

Edit: this is more like what I was hoping for: https://aws.amazon.com/blogs/aws/amazon-fsx-for-lustre-unloc... although I wasn't looking for a file system product. Ideally a tutorial like "Create a couple of VMs, store a file on one, do XYZ, and then read it from another with this API" was what I was hoping for, or at least some first-party documentation on how to use these things together. | |
| ▲ | stonogo 3 days ago | parent [-] | | Probably something more like this? https://github.com/aws-samples/amazon-fsx-tutorial/tree/mast... | | |
| ▲ | foota 2 days ago | parent [-] | | So... the issue is that I'm not using Lustre. As far as I can tell, NVMe over Fabrics (NVMe-oF) for RDMA is implemented by the kernel modules nvmet-rdma and nvme-rdma (the first being for the target). These kernel modules support InfiniBand and, I think, Fibre Channel, but _not_ EFA, and EFA itself is not an implementation of InfiniBand. There are user-space libraries that paper over these differences when using them for just network transport (e.g., libfabric), and EFA sorta pretends to be IB, but afaict this is just meant to ease integration at the user-space level. Unfortunately, since there's no EFA support in the NVMe-oF kernel modules, it doesn't seem possible to use without Lustre.

I don't know exactly how they're doing it for Lustre clients. There seems to be a Lustre client kernel module, though, so my guess is that it's in there? The Lustre networking module, lnet, does have an EFA integration, but it seems to be only as a network transit. I don't see anything in Lustre about NVMe-oF though, so I'm not sure. Maybe there's something I'm missing, and it'll just work if I give it a try :) | |
| ▲ | lustre-fan 2 days ago | parent [-] | | Yeah, Lustre supports EFA as a network transit between a Lustre client and a Lustre server. It's lnet/klnds/kefalnd/ in the Lustre tree. But Lustre doesn't support NVMeoF directly. It uses a custom protocol. And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it. EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it. | | |
| ▲ | foota 2 days ago | parent [-] | | Hey, thanks for the comment! Also, I'm amused by the specificity of your account haha, do you have something set to monitor HN for mentions of Lustre?

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

Could you link me to this? I searched the lustre repo for nvme and didn't see anything that looked promising, but would be curious to read how this works.

> "And neither does EFA. Someone would have to modify the NVMeoF RDMA target/host drivers to support it."

To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

> "EFA already supports in-kernel IB clients (that's how Lustre uses EFA today). So it's not an impossible task. It's just that no one has done it."

I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA? Is that https://kernel.googlesource.com/pub/scm/linux/kernel/git/tyc...? I found this but I wasn't sure if this was actually the dataplane (for lack of a better word) part of things, from what I read it sounded like most of the dataplane was implemented in userspace as a part of libfabric, but it sounds like I might be wrong. Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

In case you're curious, I have a stateful service that has an NVME backed cache over object storage and I've been wondering what it would take to make it so that we could run some proxy services that can directly read from that cache to scale out the read throughput from an instance. | |
| ▲ | lustre-fan a day ago | parent [-] | | > do you have something set to monitor HN for mentions of Lustre?

Nothing, besides browsing hackernews a bit too much.

> "But Lustre doesn't support NVMeoF directly. It uses a custom protocol."

To be specific, Lustre is a parallel filesystem. Think of it like a bigger version of NFS. You format the NVMe drives as ext4 or ZFS and mount them as Lustre servers. Once you have an MGS, MDS, and OSS, you can mount the servers as a filesystem. Lustre won't export the NVMe to the client as a block device. But you could mount individual Lustre files as a block device, if you want.

> To confirm, you're saying there'd need to be something like an EFA equivalent to https://kernel.googlesource.com/pub/scm/linux/kernel/git/tor... (and corresponding initiator code)?

Essentially, yeah.

> I think you're saying there's already in-kernel code for interfacing with EFA, because this is how lnet uses EFA?

Yes. EFA implements kernel verbs support. Normal user-space applications use user verbs, i.e. https://www.kernel.org/doc/html/latest/infiniband/user_verbs.... Kernel verbs support allows kernel-space applications to also use EFA. This is currently implemented in the out-of-tree version of the EFA driver https://github.com/amzn/amzn-drivers/tree/master/kernel/linu.... Lustre interfaces with that via the driver in lnet/klnds/efalnd/. NVMeoF would need some similar glue code.

> Does this mean you can generally just pretend that EFA is a normal IB interface and have things work out? If that's the case, why doesn't NVME-of just support it naturally? Just trying to figure out how these things fit together, I appreciate your time!

There are some similarities (the EFA driver is implemented in the IB subsystem, after all). But the semantics for adding/removing ports/interfaces would be different, so it wouldn't "just work" without some changes. I don't know the full scope of the changes (I haven't dived into it too much), although I suspect support would look fairly similar to drivers/nvme/target/rdma.c.

> In case you're curious, I have a stateful service that has an NVME backed cache over object storage and I've been wondering what it would take to make it so that we could run some proxy services that can directly read from that cache to scale out the read throughput from an instance.

If you're looking for a scale-out cache in front of S3, that's essentially the Lustre/S3 integration: https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-dr.... It's a filesystem, so I guess it depends on how your service expects to access objects. |
|
|
|
|
|
|
|