A 27us roundtrip is not really state of the art for zero-copy IPC; about 1us would be. What is causing this overhead?

jstimpfle 17 hours ago | parent | next [-]

Asking for those who, like me, haven't yet taken the time to find technical information on that webpage:

What exactly does that roundtrip latency number measure (especially your 1us)? Does zero copy imply mapping pages between processes? Is there an async kernel component involved (like I would infer from "io_uring") or just two user space processes mapping pages?


foltik 13 hours ago | parent [-]

27us and 1us are both an eternity and definitely not SOTA for IPC. The fastest possible way to do IPC is with an SPSC (single-producer, single-consumer) queue that lives in shared memory.

The actual (one-way cross-core) latency on modern CPUs varies by quite a lot [0], but a good rule of thumb is 100ns + 0.1ns per byte.

This measures the time for core A to write one or more cache lines to a shared memory region, and core B to read them. The latency is determined by the time it takes for the cache coherence protocol to transfer the cache lines between cores, which shows up as a number of L3 cache misses.
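To put the rule of thumb into numbers, here is a rough sketch. The function names are made up for illustration, and the constants are just the 100ns + 0.1ns per byte estimate from above, not measurements; real values depend heavily on the CPU and on which cores are talking. The round-trip helper simply assumes two one-way transfers of the same size.

```python
def one_way_latency_ns(payload_bytes: int,
                       base_ns: float = 100.0,
                       per_byte_ns: float = 0.1) -> float:
    """Rule-of-thumb one-way cross-core latency: fixed cost plus per-byte cost."""
    return base_ns + per_byte_ns * payload_bytes


def round_trip_latency_ns(payload_bytes: int) -> float:
    """Assumption: a round trip costs about two one-way transfers of the same size."""
    return 2 * one_way_latency_ns(payload_bytes)


if __name__ == "__main__":
    # 64 B is one cache line; 2048 B matches the 2 KB payload mentioned in this thread.
    for size in (64, 2048, 65536):
        print(f"{size:>6} B: one-way ~{one_way_latency_ns(size):.0f} ns, "
              f"round trip ~{round_trip_latency_ns(size):.0f} ns")
```

By this estimate a 2 KB message is roughly 0.3us one way, so well under a microsecond for the round trip.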

Interestingly, at the hardware level, in-process vs inter-process is irrelevant. What matters is the physical location of the cores which are communicating. This repo has some great visualizations and latency numbers for many different CPUs, as well as a benchmark you can run yourself:

[0] https://github.com/nviennot/core-to-core-latency


jstimpfle 11 hours ago | parent [-]

I was really asking what "IPC" means in this context. If you can just share a mapping, yes it's going to be quite fast. If you need to wait for approval to come back, it's going to take more time. If you can't share a memory segment, even more time.

foltik 10 hours ago | parent [-]

No idea what this vibe code is doing, but two processes on the same machine can always share a mapping, though maybe your PL of choice is incapable. There aren’t many libraries that make it easy either. If it’s not two processes on the same machine I wouldn’t really call it IPC.

Of course a round trip will take more time, but it’s not meaningfully different from two one-way transfers. You can just multiply the numbers I said by two. Generally it’s better to organize a system as a pipeline if you can though, rather than ping ponging cache lines back and forth doing a bunch of RPC.
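For anyone wondering what "share a mapping" plus an SPSC queue looks like in practice, here is a minimal sketch using Python's standard `multiprocessing.shared_memory`. The class name `SpscRing`, the slot layout, and the sizes are all my own assumptions for illustration. It only shows the memory layout and the head/tail protocol: Python gives no control over memory ordering or cache-line placement, so a real low-latency version would be written in C, C++, or Rust with acquire/release atomics and head and tail on separate cache lines.

```python
import struct
from multiprocessing import shared_memory

NUM_SLOTS = 8                    # ring capacity (illustrative)
SLOT_SIZE = 64                   # bytes per slot: 2-byte length prefix + payload
INDICES = struct.Struct("<QQ")   # head (advanced by consumer), tail (advanced by producer)
LENGTH = struct.Struct("<H")     # per-slot payload length


class SpscRing:
    """Single-producer, single-consumer ring buffer in a named shared memory segment."""

    def __init__(self, name=None):
        # name=None creates a new segment; passing a name attaches to an existing one.
        self.owner = name is None
        size = INDICES.size + NUM_SLOTS * SLOT_SIZE
        self.shm = shared_memory.SharedMemory(name=name, create=self.owner, size=size)
        if self.owner:
            INDICES.pack_into(self.shm.buf, 0, 0, 0)

    def _slot_offset(self, index):
        return INDICES.size + (index % NUM_SLOTS) * SLOT_SIZE

    def try_push(self, payload):
        """Producer side: returns False if the ring is full."""
        if len(payload) > SLOT_SIZE - LENGTH.size:
            raise ValueError("payload too large for one slot")
        head, tail = INDICES.unpack_from(self.shm.buf, 0)
        if tail - head == NUM_SLOTS:
            return False
        off = self._slot_offset(tail)
        LENGTH.pack_into(self.shm.buf, off, len(payload))
        start = off + LENGTH.size
        self.shm.buf[start:start + len(payload)] = payload
        # Publish: only the producer ever writes tail (bytes 8..15).
        struct.pack_into("<Q", self.shm.buf, 8, tail + 1)
        return True

    def try_pop(self):
        """Consumer side: returns None if the ring is empty."""
        head, tail = INDICES.unpack_from(self.shm.buf, 0)
        if head == tail:
            return None
        off = self._slot_offset(head)
        (n,) = LENGTH.unpack_from(self.shm.buf, off)
        start = off + LENGTH.size
        data = bytes(self.shm.buf[start:start + n])
        # Release the slot: only the consumer ever writes head (bytes 0..7).
        struct.pack_into("<Q", self.shm.buf, 0, head + 1)
        return data

    def close(self):
        self.shm.close()
        if self.owner:
            self.shm.unlink()


if __name__ == "__main__":
    producer = SpscRing()                    # creates the segment
    consumer = SpscRing(producer.shm.name)   # a second mapping of the same memory
    producer.try_push(b"ping")
    print(consumer.try_pop())
    consumer.close()
    producer.close()
```

In a real setup the consumer would be another process that receives the segment name out of band and spins on `try_pop()`; the demo attaches twice from one process only to keep it self-contained.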


znpy 17 hours ago | parent | prev | next [-]

It may or may not be good, depending on a number of factors.

I did read the original Linux zerocopy papers from Google, for example, and at the time (when using TCP) the juice was worth the squeeze when the payload was larger than 10 kilobytes (or 20? Don’t remember right now and I’m on mobile).

Also a common technique is batching, so you amortise the round-trip time (this used to be the cost of sendmmsg/recvmmsg) over, say, 10 payloads.
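A quick back-of-the-envelope sketch of the amortisation. The function name and the numbers in the demo are hypothetical: it assumes one fixed per-call cost (the round trip) shared by the whole batch, plus an optional per-payload cost that batching does not remove.

```python
def amortised_cost_us(fixed_cost_us: float, batch_size: int,
                      per_item_us: float = 0.0) -> float:
    """Average cost per payload when one fixed cost is shared by a whole batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_cost_us / batch_size + per_item_us


if __name__ == "__main__":
    # Hypothetical: a 27us round trip spread over batches of different sizes.
    for batch in (1, 10, 100):
        print(f"batch of {batch:>3}: ~{amortised_cost_us(27.0, batch):.2f} us per payload")
```

The catch is that each individual payload still waits for the whole batch, so this helps throughput figures more than per-message latency.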

So yeah that number alone can mean a lot or it can mean very little.

In my experience, people who are doing low-latency stuff have already built their own thing around MSG_ZEROCOPY, io_uring and stuff :)

hinkley 13 hours ago | parent [-]

io_uring is a tool for maximizing throughput, not minimizing latency. So the correct measure is transactions per millisecond, not milliseconds per transaction. Little’s Law applies when the task monopolizes the time of the worker. When it is alternating between IO and compute, it can be off by a factor of two or more. And when it’s only considering IO, things get more muddled still.
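For reference, the idealised form of Little's Law is L = λW (items in flight = throughput × time in system), so throughput is roughly in-flight requests divided by latency. A small sketch, with made-up function names and the 27us figure from this thread used purely as an example input; as noted above, real systems that alternate between IO and compute can deviate from this considerably.

```python
def throughput_per_sec(in_flight: float, latency_us: float) -> float:
    """Little's Law rearranged: lambda = L / W, with W given in microseconds."""
    return in_flight / (latency_us * 1e-6)


def required_in_flight(target_per_sec: float, latency_us: float) -> float:
    """Little's Law as stated: L = lambda * W."""
    return target_per_sec * latency_us * 1e-6


if __name__ == "__main__":
    # Example inputs only: 27us per operation at different queue depths.
    for depth in (1, 8, 32):
        print(f"{depth:>2} in flight: ~{throughput_per_sec(depth, 27.0):,.0f} ops/s")
```

The same latency number can therefore correspond to very different throughput depending on how many operations are kept in flight.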


rohanray 5 days ago | parent | prev | next [-]

It's not exactly local IPC. The roundtrip benchmark figure is for a TCP client-server ping/pong call with a 2 KB payload, although the TCP connection runs over local loopback (127.0.0.1).

Source: https://github.com/mvp-express/myra-transport/blob/main/benc...


blibble 16 hours ago | parent | prev [-]

indeed, you can get a packet from one box to another in 1-2us

steeve 13 hours ago | parent | next [-]

with io_uring? How? I tried everything in the book