SPDK is what folks who really care about this use, I think.

jandrewrogers 4 days ago | parent | next [-]

The only thing SPDK buys you is somewhat lower latency, which isn't that important for most applications because modern high-performance I/O schedulers usually are not that latency sensitive anyway.

The downside of SPDK is that it is unreasonably painful to use in most contexts. When it was introduced there were few options for doing high-performance storage I/O but a lot has changed since then. I know many people that have tested SPDK in storage engines, myself included, but none that decided the juice was worth the squeeze.


electricshampo1 4 days ago | parent | next [-]

Depending on your app's IOPS rate, SPDK can result in less CPU time spent issuing I/O and reaping completions compared to, e.g., io_uring.

See, e.g., "What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines" (https://www.vldb.org/pvldb/vol16/p2090-haas.pdf) for actual data on this.

Of course, if your block size is large enough and/or your design batches enough that you already don't spend much time issuing I/O and reaping completions, then, as you say, SPDK will not provide much of a gain.
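To make that trade-off concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simple cost model: every submit/reap call pays a fixed CPU cost that is amortized over the batch, plus a small per-I/O cost. The function name and all the nanosecond figures are made up for illustration; they are not measurements from the paper or from any real system.

```python
def cores_busy(iops, per_io_ns, per_call_ns, batch_size=1):
    """Estimate how many CPU cores are kept busy issuing I/O and reaping completions.

    Hypothetical model: each call costs per_call_ns (amortized over batch_size
    I/Os) and each individual I/O costs per_io_ns on top of that.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    ns_per_io = per_io_ns + per_call_ns / batch_size
    return iops * ns_per_io / 1e9


# Illustrative numbers only (assumptions, not benchmarks).
PER_IO_NS = 200      # assumed per-I/O bookkeeping cost
PER_CALL_NS = 1000   # assumed fixed cost of one submit/reap call

# Small blocks, no batching: the I/O path itself eats more than a core.
unbatched = cores_busy(1_000_000, PER_IO_NS, PER_CALL_NS, batch_size=1)

# Same IOPS, 32 I/Os per call: the fixed cost is mostly amortized away.
batched = cores_busy(1_000_000, PER_IO_NS, PER_CALL_NS, batch_size=32)

# Same bandwidth with 16x larger blocks means 16x fewer IOPS.
large_blocks = cores_busy(1_000_000 // 16, PER_IO_NS, PER_CALL_NS, batch_size=1)

print(f"unbatched:    {unbatched:.3f} cores")
print(f"batched x32:  {batched:.3f} cores")
print(f"large blocks: {large_blocks:.3f} cores")
```

Under this model, a cheaper submission path only matters when the first case describes your workload; once batching or large blocks shrink the total, there is little left for SPDK to save.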


__turbobrew__ 3 days ago | parent | prev [-]

I believe seastar uses it and that is the base of scylladb storage engine: https://seastar.io/

I believe the next generation ceph OSD is built on seastar as well: https://docs.ceph.com/en/reef/dev/crimson/crimson/

With something like ceph, latency is everything, as writes need to be synchronously committed to each OSD replica before the writing client is unblocked. I think for ceph they are trying to move to NVMe-oF to basically bypass the OS for remote NVMe access. I'm not sure how this will work security-wise, however, as you cannot just have any node on the network reading and writing random blocks of NVMe-oF devices.

lossolo 3 days ago | parent [-]

> I believe seastar uses it and that is the base of scylladb storage engine: https://seastar.io/

They use DPDK (optionally) for network IO, not SPDK.


vlovich123 4 days ago | parent | prev [-]

SPDK requires taking over the device. OP is correct if you want to have a multi-tenant application where the disk is also used for other things.


dataflow 4 days ago | parent [-]

Not an expert on this but I think that's... half-true? There is namespace support which should allow multiple users I think (?), but it does still require direct device access.


vlovich123 4 days ago | parent [-]

Namespaces are a hack device manufacturers came up with to try to make this work anyway. Namespaces at the device level are a terrible idea IMO because it’s still not multi-tenant: you're just carving up a single drive into logically separated chunks that you have to decide on up front. So you have to say “application X gets Y% of the drive while application A gets B%”. It’s an expensive static allocation that’s not self-adjusting based on actual dynamic usage.
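A toy sketch of the difference being described, comparing fixed up-front shares with a single shared pool. Both classes and their names are hypothetical and only model capacity accounting; this is not how NVMe namespace management commands or any real filesystem allocator work.

```python
class StaticShares:
    """Each tenant gets a fixed block quota decided up front (namespace-like)."""

    def __init__(self, quotas):
        self.quotas = dict(quotas)
        self.used = {tenant: 0 for tenant in self.quotas}

    def write(self, tenant, blocks):
        # A tenant can never exceed its own share, even if others sit idle.
        if self.used[tenant] + blocks > self.quotas[tenant]:
            return False
        self.used[tenant] += blocks
        return True


class SharedPool:
    """All tenants draw from one pool, like files growing on one filesystem."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = {}

    def write(self, tenant, blocks):
        # Only the total across all tenants is limited.
        if sum(self.used.values()) + blocks > self.capacity:
            return False
        self.used[tenant] = self.used.get(tenant, 0) + blocks
        return True


# 100-block drive, two tenants, tenant "A" turns out to need 60 blocks.
static = StaticShares({"A": 50, "B": 50})
shared = SharedPool(100)

static_ok = static.write("A", 60)  # rejected: A's share is 50, B's idle 50 is unusable
shared_ok = shared.write("A", 60)  # accepted: 60 of 100 blocks in use

print("static shares accepted A's write:", static_ok)
print("shared pool accepted A's write:  ", shared_ok)
```

The flip side, which the sketch also shows, is that the shared pool gives no tenant a guaranteed amount of space: whoever writes first can consume capacity another tenant was counting on.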


10000truths 4 days ago | parent | next [-]

Dynamic allocation implies the ability to shrink as well as grow. How do you envision shrinking an allocation of blocks to which your tenant has already written data that is (naturally) expected to be durable in perpetuity?


vlovich123 3 days ago | parent [-]

You mean something filesystems do as a matter of course? Ignoring resizing them, which is also supported through supporting technologies, I’m not talking about partitioning a drive. You can have different applications sharing a filesystem just fine, with each application's space usage growing or shrinking naturally as its needs change. Partitioning and namespaces are similar (namespaces are significantly more static) in that you have to make decisions about the future really early, vs. a normal file on a filesystem growing over time.

10000truths 3 days ago | parent [-]

If you're assuming that every tenant's block device is storing a filesystem, then you're not providing your tenant a block device, you're providing your tenant a filesystem. And if you're providing them a filesystem, then you should use something like LVM for dynamic partitioning.

The point of NVMe namespaces is to partition at the block device layer. To turn one physical block device into multiple logical block devices, each with their own queues, LBA space, etc. It's for when your tenants are interacting with the block device directly. That's not a hack, that's intended functionality.
