RossBencina 4 days ago

It is not just a way of writing ring buffers. It's a way of implementing concurrent non-blocking single-reader single-writer atomic ring buffers with only atomic load and store (and memory barriers).

The author says that non-power-of-two is not possible, but I'm pretty sure it is if you use a conditional instead of integer modulus.

I first learnt of this technique from Phil Burk; we've been using it in PortAudio forever. The technique is also widely known in FPGA/hardware circles; see:

"Simulation and Synthesis Techniques for Asynchronous FIFO Design", Clifford E. Cummings, Sunburst Design, Inc.

https://twins.ee.nctu.edu.tw/courses/ip_core_04/resource_pdf...
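For concreteness, here's a minimal sketch of the technique in C11: single reader, single writer, free-running indices masked on access, and only atomic load/store with acquire/release ordering. Names, size, and element type are illustrative, not PortAudio's actual code.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define CAPACITY 256u  /* must be a power of two */

    typedef struct {
        uint8_t data[CAPACITY];
        _Atomic uint32_t write_index;  /* advanced only by the producer */
        _Atomic uint32_t read_index;   /* advanced only by the consumer */
    } RingBuffer;

    /* Producer side: returns false if the buffer is full. */
    static bool rb_push(RingBuffer *rb, uint8_t value) {
        uint32_t w = atomic_load_explicit(&rb->write_index, memory_order_relaxed);
        uint32_t r = atomic_load_explicit(&rb->read_index, memory_order_acquire);
        if (w - r == CAPACITY)  /* indices are free-running; the difference is the fill level */
            return false;
        rb->data[w & (CAPACITY - 1)] = value;
        /* release: the data write must become visible before the index update */
        atomic_store_explicit(&rb->write_index, w + 1, memory_order_release);
        return true;
    }

    /* Consumer side: returns false if the buffer is empty. */
    static bool rb_pop(RingBuffer *rb, uint8_t *value) {
        uint32_t r = atomic_load_explicit(&rb->read_index, memory_order_relaxed);
        uint32_t w = atomic_load_explicit(&rb->write_index, memory_order_acquire);
        if (w == r)
            return false;
        *value = rb->data[r & (CAPACITY - 1)];
        atomic_store_explicit(&rb->read_index, r + 1, memory_order_release);
        return true;
    }

Note the power-of-two capacity is doing double duty here: it makes the wrap a cheap mask, and it keeps the free-running subtraction w - r correct across 32-bit overflow (because 2^32 is a multiple of CAPACITY).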

hinkley 4 days ago | parent | next [-]

I think unfortunately we sometimes ascribe supernatural powers to powers of two that are really about caches being built in powers of two.

Intel still uses 64-byte cache lines, as it has for quite a long time, but it also does some shenanigans on the bus where it tries to fetch two lines when you ask for one. So there's ostensibly some benefit to aligning data to 128 bytes, particularly on linear scans with a cold cache.

rcoveson 3 days ago | parent | next [-]

But there's a reason that caches are always sized in powers of two as well, and that same reason applies to high-performance ring buffers: division by a power of two is easy, and easy is fast. It's reliably a single cycle (a shift or a mask), compared to division by an arbitrary 32-bit integer, which can be 8-30 cycles depending on the CPU.
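In code, the difference is between these two (a toy illustration):

    #include <stdint.h>

    /* Power-of-two size: compiles to a single AND instruction. */
    static inline uint32_t wrap_pow2(uint32_t index, uint32_t size) {
        return index & (size - 1);
    }

    /* Arbitrary size: a hardware divide, typically many cycles. */
    static inline uint32_t wrap_any(uint32_t index, uint32_t size) {
        return index % size;
    }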

Also, there's another benefit downstream of that one: powers of two work as a Schelling point for allocations. Picking powers of two for resizable vectors maximizes "good luck" when you malloc/realloc in most allocators, in part because e.g. a buddy allocator is probably also implemented with power-of-two blocks for the above reason, but also for the plain reason that other users of the same allocator are more likely to have requested power-of-two allocations. Spontaneous coordination is a benefit all its own. Almost supernatural! :)

hinkley 3 days ago | parent | next [-]

CPU caches are sized in powers of two because retrieval involves a logarithmic number of gates that have to fire within a clock cycle. There is a point where adding more cache starts to make instructions per second go back down again, and that number will be a power of two.

That has next to nothing to do with how much of your 128 GB of RAM should be dedicated to any one data structure, because the working memory for a task is the sum of a bunch of different data structures that all have to fit into the caches and main memory. And while main memory used to come in powers of two, it is now often 2^n x 3.

And as someone else pointed out, the optimal growth factor for resizable data structures is not 2 but the golden ratio, about 1.618. Most implementations use 1.5, aka 3/2.

loeg 3 days ago | parent | prev | next [-]

Fwiw, in this application you would never need to divide by an arbitrary integer each time; you'd pick the size once, plumb it into libdivide, and get something significantly cheaper than 8-30 cycles.
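A sketch of what that might look like with libdivide's C API (entry points from memory; treat the names and structure as illustrative):

    #include <stdint.h>
    #include "libdivide.h"

    static uint32_t ring_size;              /* e.g. 9: not a power of two */
    static struct libdivide_u32_t size_div; /* precomputed magic-number divider */

    void ring_init(uint32_t size) {
        ring_size = size;
        size_div = libdivide_u32_gen(size); /* pay the divide cost once, at setup */
    }

    uint32_t ring_slot(uint32_t index) {
        /* multiplies and shifts instead of a hardware divide */
        uint32_t q = libdivide_u32_do(index, &size_div); /* index / size */
        return index - q * ring_size;                    /* index % size */
    }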

kevin_thibedeau 3 days ago | parent | prev [-]

Powers of two are problematic with growable arrays on small heaps. You risk ending up with fragmented space you can't reuse unless you keep growth below ~1.618x, which would necessitate data structures that can deal with arbitrary sizes.
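To make the arithmetic concrete (a toy example): growing a vector 64 -> 128 -> 256 -> 512 with a 2x factor leaves freed blocks totalling 64 + 128 + 256 = 448 bytes behind, always less than the next 512-byte request, so on a small heap those holes can never be reused for the growth. Keep the factor below the golden ratio and the freed space eventually adds up to more than the next request.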

KeplerBoy 3 days ago | parent | prev [-]

It's not just Intel, is it? AMD also uses 64-byte cache lines, afaik.

aidenn0 4 days ago | parent | prev | next [-]

Non-power-of-two is only really feasible if the total number of inserts will fit in your post/ack counters. Otherwise you have to implement overflow handling manually, which may or may not be possible with the atomic primitives available on your architecture.

I first encountered this structure at a summer internship at a company making data switches.

waffletower 3 days ago | parent | prev | next [-]

Regardless of correctness, as a DSP dork I really identified with the question: "What kind of a monster would make a non-power-of-two ring anyway?" I remember thinking similarly when requesting a power-of-two buffer from a third-party audio hardware device and having it corrected to a nearby non-power-of-two size. Latency-adding ring buffer to the rescue.

tom_ 3 days ago | parent | prev | next [-]

A couple of the comments on the article suggest using 64-bit numbers, which is exactly the right solution. 2^64 nanoseconds = 584.55 years; overflow is implausible for any realistic use case. Even pathological cases will struggle to induce wraparound on a human timescale.

(People will probably moan at the idea of restarting the process periodically rather than fixing the issue properly, but when the period would be something like 50 years I don't think it's actually a problem.)
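As a sketch of why 64 bits dissolves the non-power-of-two problem (hypothetical names; assumes a single writer per counter):

    #include <stdint.h>

    #define SIZE 9u  /* deliberately not a power of two */

    /* Free-running 64-bit counters: at one increment per nanosecond
       they take ~585 years to wrap, so plain modulo stays correct. */
    static uint64_t write_count, read_count;

    static uint32_t write_slot(void) { return (uint32_t)(write_count % SIZE); }
    static uint32_t read_slot(void)  { return (uint32_t)(read_count % SIZE); }
    static uint64_t fill_level(void) { return write_count - read_count; }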

thaumasiotes 3 days ago | parent | next [-]

> but when the period would be something like 50 years I don't think it's actually a problem

I think you have that backwards. If something needs to be done every week, it will get done every week. That's not a problem.

If something needs to be done every fifty years, you'll be lucky if it happens once.

tom_ 3 days ago | parent | next [-]

My parting shot was slightly tongue in cheek, apologies. Fifty years is a long time. The process, whatever it is, will have been replaced or otherwise become irrelevant long before the period is up. 64 bits will be sufficient.

gpderetta 3 days ago | parent [-]

At some point a random bitflip becomes more likely than the counter overflowing.

reincarnate0x14 3 days ago | parent | prev [-]

I agree with that sentiment in general, but even though I've seen systems in continuous operation for 15 years, I've never seen anything make it to 20. I wouldn't write something with the express expectation that it never makes it that far, but in practical terms, that's probably about as safe as it gets. Even embedded medical devices expect to get restarted every now and again.

Just as an example, the Voyager computers have been restarted, and they've been running for almost 50 years.

ale42 3 days ago | parent | prev [-]

> using 64-bit numbers, which is exactly the right solution

On a 64-bit platform, sure. When you're working on ring buffers with an 8-bit microcontroller, using 64-bit numbers would be such an overhead that nobody would even think of it.

ErroneousBosh 3 days ago | parent | prev | next [-]

> The author says that non-power-of-two is not possible, but I'm pretty sure it is if you use a conditional instead of integer modulus.

I don't see why it wouldn't be; it's just computationally expensive to take the modulo of the pointer rather than just masking off the appropriate number of bits.
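For concreteness, the conditional in question might look like this (a sketch; note the catch raised in the reply below):

    #include <stdint.h>

    /* One compare instead of a divide, but the index now wraps at the
       buffer size itself, i.e. it is no longer free-running. */
    static inline uint32_t inc_wrap(uint32_t idx, uint32_t size) {
        return (idx + 1 == size) ? 0 : idx + 1;
    }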

myrmidon 3 days ago | parent [-]

Replacing just the mask operation is not enough.

The problem is incrementing past the index integer type limit.

Consider a simple example with ring buffer size 9 and 16-bit indices:

When you increment the write index from 0xffff to 0, your "masked index" jumps from 6 (0xffff % 9) to 0 (instead of 7).

There is no elegant fix that I'm aware of (using a very wide index type, like possibly a uint64, is extremely non-elegant).
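A minimal demonstration of the discontinuity (16-bit index, size-9 buffer):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t write_index = 0xFFFF;     /* about to wrap */
        printf("%u\n", write_index % 9u);  /* prints 6 */
        write_index++;                     /* overflows to 0 */
        printf("%u\n", write_index % 9u);  /* prints 0, not the expected 7 */
        return 0;
    }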

ErroneousBosh 3 days ago | parent [-]

Yes, that's what I'm saying. You can't just use a quick and easy mask; you have to use a modulo operator, which is computationally expensive enough that it probably kills the time savings you made elsewhere.

There's probably no good reason to make your buffer sizes NOT a power of two, though. If memory's that tight, maybe look elsewhere first.

myrmidon 3 days ago | parent [-]

What I mean is: this ringbuffer implementation (and its simplicity) relies on the index range being a multiple of the buffer size, which is only true for powers of two when the index is e.g. a 32-bit unsigned integer.

If you swap bitmasking for modulo operations then that does work at first glance, but breaks down when the index wraps around. This forces you to abandon the simple "increment" operation for something more complex, too.

The requirement for a power-of-two size is more intrinsic to the approach than just the bitmasking operation itself.

ErroneousBosh 2 days ago | parent [-]

Yes, I get you now. If you let it roll over and you apply a modulo operation, now you have two modulo operations :-)

zephen 3 days ago | parent | prev | next [-]

> It is not just a way of writing ring buffers. It's a way of implementing concurrent non-blocking single-reader single-writer atomic ring buffers with only atomic load and store (and memory barriers).

That may or may not be part of the actual definition of a ring buffer, but every ring buffer I have written had those goals in mind.

And the first method mentioned in the article fully satisfies this, except for the one missing element mentioned by the author. Which, in practice, is often not only not a problem but simplifies the logic so much that you make up for it in code space.

Or, for example, say you have a 256-character buffer. You really, really want to make sure you don't waste that one character. So you increase the size of your indices. Now they are 16 bits each instead of 8 bits, so you've gained the ability to store 256 bytes using 260 bytes of state, rather than 255 bytes using 258 bytes of state.

Obviously, if you have a 64-byte buffer, there is no such tradeoff, and the third example wins (but whether you're doing the first or the third example, you still have to mask off the index at some point, whether on an increment or a read).

> The author says that non-power-of-two is not possible, but I'm pretty sure it is if you use a conditional instead of integer modulus.

There's "not possible" and then "not practical."

Sure, you could have a 50-byte buffer, but now, if your indices are ever >= 50, you're subtracting 50 before accessing the array, which will increase the code space (and execution time).

> The [index size > array size] technique is also widely known in FPGA/hardware circles

Right, but in those hardware circles, power-of-two _definitely_ matters. You allocate exactly one extra bit for your pointers, and you never bother manually masking them or taking a modulo or anything like that -- they simply roll over.

If you really, really need to construct something like a 6-entry FIFO in hardware, then you have techniques available to you that mere mortal programmers could not use efficiently at all. For example, you could construct a drop-through FIFO, where every element traverses every storage slot (with a concomitant increase in minimum latency to 6 clock cycles), or you could construct 4-bit indices that counted 0-1-2-3-4-5-8-9-10-11-12-13-0-1-2 etc.
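A software rendering of that skip-counting index, just to make the sequence concrete (a sketch; in hardware this would be a counter with the skip baked in):

    #include <stdint.h>

    /* 6-entry FIFO, 4-bit indices cycling 0..5, 8..13: twelve states,
       twice the depth, so bit 3 plays the same role as the extra MSB
       in the power-of-two scheme. The slot address is just idx & 7. */
    static inline uint8_t fifo6_inc(uint8_t idx) {
        uint8_t next = (uint8_t)(idx + 1);
        if ((next & 0x7u) == 6u)                    /* ran past slot 5 */
            next = (uint8_t)((next & 0x8u) ^ 0x8u); /* flip phase bit, clear low bits */
        return next;
    }

    /* empty: read == write
       full:  (read ^ write) == 0x8  (same slot, opposite phase) */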

Most ring buffers, hardware or software, are constructed as powers of two, and most ring buffers either (a) have so much storage that one more element wouldn't make any difference, or (b) have the ability to apply back pressure, so one more element wouldn't make any difference.

azemetre 4 days ago | parent | prev | next [-]

Your link has an invalid cert FYI, but do appreciate the knowledge drop. Ring buffers are some of the cooler data structures out there.

RossBencina 3 days ago | parent [-]

Unfortunately the original source is now behind a sign-in-wall.
