toast0 6 hours ago

You could presumably have an acceptor thread per core, which passes the fds to a core-aligned next thread, etc.

That would get you the code simplicity benefits the article suggests, while keeping the socket bound to a single core, which is definitely needed.

Depending on whether you actually need to share anything, you could do process per core, thread per loop, and have no core-to-core communication from the usual workings of the process (I/O may cross, though).
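For the "process per core" variant, a common Linux approach is SO_REUSEPORT, which lets each pinned worker own its own listening socket so no fd handoff crosses cores. A minimal sketch (Linux-only; the helper name and port are illustrative, and this isn't necessarily the exact design you mean):

```python
import os
import socket

def make_pinned_listener(core: int, port: int = 0) -> socket.socket:
    # Pin the calling process/thread to one core (Linux-specific call),
    # so everything this listener accepts is handled core-locally.
    os.sched_setaffinity(0, {core})
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT lets each per-core worker bind the same address;
    # the kernel then spreads incoming connections across the listeners,
    # so no acceptor thread has to pass fds between cores.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s

if __name__ == "__main__":
    listener = make_pinned_listener(core=0)
    print("pinned to", os.sched_getaffinity(0), "listening on", listener.getsockname())
    listener.close()
```

The trade-off versus a single acceptor thread is that the kernel's per-socket distribution is roughly round-robin over the listeners, so you lose the ability to make smarter placement decisions in userspace.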

scottlamb 2 hours ago | parent [-]

I don't think the author intended "code simplicity" as an end unto itself but a way to reduce cache pressure. He popped into the 2016 discussion [1] to say:

> Another benefit of this design overlooked is that individual cores may not ever need to read memory -- the entire task can run in L1 or L2. If a single worker becomes too complicated this benefit is lost, and memory is much much slower than cache.

I think this is wrong or at least overstated: if you're passing off fds and their associated (kernel- and/or user-side) buffers between cores, you can't run entirely in L1 or L2. And in general, I'd expect data to be responsible for much more cache pressure than code, so I'm skeptical of localizing the code at the expense of the data.

But anyway, if the goal is to organize which cores are doing the work, splitting a single core's work from a single thread (pinned to it) to several threads (still pinned to it) doesn't help. It just introduces more context switching.

[1] https://news.ycombinator.com/item?id=10874616