| ▲ | nly 2 days ago | ||||||||||||||||
Your statements on queues ignore the state of the art > MPSC (multiple-producer single-consumer) requires a compare-and-swap loop on the head pointer so that two producers can each reserve a contiguous slot without overlap. Martin Thompsons designs, as used in Aerons logbuffer implementation, don't require a CAS retry loop. Multiple producers can reserve and commit concurrently without blocking one another. The trade off is you must carefully decide on an upper bound for message size and the number of producer threads (in the hundreds typically). A caretaker thread also needs to run periodically to reclaim/zero memory off the hot path. Typically though, this isn't a problem. Aeron itself, which you compare at ~250ns, I think (not entirely sure) is paying the price for being multi consumer as well as multi producer, and perhaps implementing flow control to pace producers. You can turn off multi producer by using an exclusive publication, which eliminates one atomic RMW operation on reserve. I'm not sure where the other nanos are lost. For SPSC, doing away with 2 atomic shared counters and moving to a single counter + inline headers is a win for thread to thread latency. The writer only needs to read the readers new position from a shared cache line when it believes the queue is full. The reader can batch writes to this counter, so it doesn't have to write to memory at all most of the time. The writer has one fewer contended cache line to write to in general since the header lives in the first cacheline of the message, which it's writing anyway. This is where the win comes from under contention (when the queue is ~empty) | |||||||||||||||||
| ▲ | riyaneel 2 days ago | parent [-] | ||||||||||||||||
You're right on the MPSC point, the ADR overstates it. Aeron's claim-based approach uses a single fetch_add per producer, no retry loop. The real constraints are bounded message size upfront and a caretaker thread for reclamation, not a CAS retry. The wording needs fixing. On the SPSC counter argument, Tachyon already does most of what you describe: inline headers, head/tail on separate 128-byte cache lines, cached tail only reloaded on apparent fullness, tail writes amortized across 32 messages. If you have numbers comparing the single-counter approach against this specific layout I'd be genuinely curious. | |||||||||||||||||
| |||||||||||||||||