adwn 12 hours ago

Your view on mutex performance and overhead is outdated, at least for the major platforms: The Rust standard library mutex only requires 5 bytes, doesn't allocate, and only does a syscall on contention. The mutex implementation in the parking_lot library requires just 1 byte per mutex, and likewise doesn't allocate and only syscalls on contention. This enables very fine-grained, efficient locking and low contention.
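
For a sense of what that enables, here's a minimal sketch of per-element locking (parking_lot assumed as a dependency; the struct and names are made up):

    // Per-slot locking: each parking_lot::Mutex adds one byte of state,
    // so fine-grained locks are affordable.
    use parking_lot::Mutex;

    struct Counters {
        slots: Vec<Mutex<u64>>,
    }

    impl Counters {
        fn new(n: usize) -> Self {
            Self { slots: (0..n).map(|_| Mutex::new(0)).collect() }
        }

        fn bump(&self, i: usize) {
            // Uncontended lock/unlock is a couple of atomic operations;
            // the kernel only gets involved when another thread already
            // holds this particular slot.
            *self.slots[i].lock() += 1;
        }
    }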

torginus 11 hours ago | parent | next [-]

I'm talking about OS-level primitives here. I haven't checked out the standard library version, but the parking_lot version uses a spinlock that puts the thread to sleep when the wait gets too long - it has no way of being notified when the mutex gets unlocked, nor does it guard against priority inversion.

It seems it's optimized for high-performance, compute-heavy code with short critical sections.

These assumptions may let it win benchmarks, but don't cover the use cases of all users. To illustrate why this is bad, imagine you have a mutex-protected resource that becomes available after 10µs on average. The lock spins 10 times checking whether the resource has become available (likely <1µs), then yields the thread. The OS (let's assume Linux) won't wake the thread up until the next scheduler tick, and it's under no obligation to do so even then (and has no idea it should). But even best-case, you're left waiting 10ms, which is a typical scheduler tick.

In contrast, OS-based solutions are expensive but not that expensive - let's say they add 1µs to the wait. Then you would wait 11µs for the resource.

A method call taking 10ms and one taking 15 us is a factor of 60x, which can potentially kill your performance.

You as the user of the library are implicitly buying into these assumptions which may not hold for your case.

There's also nothing in Rust that protects you from deadlocks with 100% certainty. You can fuzz them out and use helpers, but you can do that in any language.

So you do need to be mindful of how your mutex works, if you want to build a system as good as the one it replaces.

galangalalgol 11 hours ago | parent | next [-]

The best practices I adopt for Rust avoid the use of mutexes whenever possible, precisely because of how easy a deadlock is. It turns out it is always possible. There are entire languages that disallow any mutable state, much less shared mutable state. The question becomes how much performance you are willing to sacrifice to avoid the mutex. By starting with no shared mutable state and adding it when something is too slow, you end up with very few mutexes.

adwn 10 hours ago | parent [-]

> avoid the use of mutexes […] It turns out it is always possible

How would you handle the archetypical example of a money transfer between two bank accounts, in which 100 units of money need to be subtracted from one account and atomically added to another account, after checking that the first account contains at least 100 units?

vrmiguel 2 hours ago | parent | next [-]

Since the thread mentions Rust: in Rust, you often replace Mutexes with channels.

In your case, you could have a channel where the Receiver is the only part of the code that transfers anything. It'd receive a message Transfer { from: Account, to: Account, amount: Amount } and do the required work. Any other threads would therefore only have copies of the Sender handle. Concurrent sends would be serialized through the queue's buffering.
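
A minimal sketch of that pattern with std::sync::mpsc (the Transfer type, account indices, and amounts are made up for illustration):

    use std::sync::mpsc;
    use std::thread;

    // Illustrative type; a real system would use proper account handles.
    struct Transfer { from: usize, to: usize, amount: u64 }

    fn main() {
        let (tx, rx) = mpsc::channel::<Transfer>();

        // Only this thread ever touches the balances, so the
        // check-then-transfer is atomic without any Mutex: messages
        // are processed one at a time.
        let ledger = thread::spawn(move || {
            let mut balances = vec![100u64, 0];
            for t in rx {
                if balances[t.from] >= t.amount {
                    balances[t.from] -= t.amount;
                    balances[t.to] += t.amount;
                }
            }
            balances
        });

        // Any other thread would only hold a clone of the Sender.
        tx.send(Transfer { from: 0, to: 1, amount: 100 }).unwrap();
        drop(tx); // close the channel so the ledger thread exits
        println!("{:?}", ledger.join().unwrap());
    }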

I'm not suggesting this is an ideal way of doing it.

galangalalgol 10 hours ago | parent | prev [-]

The simplest purely functional way would be to copy the whole database, instantiating a new copy with the desired change if the condition is met. That obviously doesn't scale, which is where the performance trade-off comes in. A still-pure way would be to use a persistent tree or hash array mapped trie that allows efficient reuse of the original db. There are times a purely functional approach doesn't perform well enough, but even with large-scale entity-component-style systems in both Rust and C++, the number of times I've had to use a mutex to be performant is small. Atomics are much more common, but still not common. Persistent data structures alleviate most of the need.
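
For example, a sketch of that approach using the im crate's HAMT-backed HashMap (account ids and the function name are illustrative):

    use im::HashMap;

    // Returns a new ledger sharing structure with the old one, or None
    // if the check fails. The original map is never modified.
    fn transfer(db: &HashMap<u32, i64>, from: u32, to: u32, amount: i64)
        -> Option<HashMap<u32, i64>>
    {
        let balance = *db.get(&from)?;
        if from == to || balance < amount {
            return None;
        }
        // `update` returns a new map that reuses most of the old one's nodes.
        Some(db.update(from, balance - amount)
               .update(to, db.get(&to).copied().unwrap_or(0) + amount))
    }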

pas 5 hours ago | parent [-]

pure or not, this eventually comes down to durability, no?

and the way to do it is to either have some kind of single point of control (a designated actor or single-threaded executor) or mark the data (i.e., use some concurrency-control primitive, either wrapping the data or in some dedicated place where the executors check [like the JVM's safepoints])

using consistent hashing, these hypothetical accounts could be allocated to actors, and then each transaction is managed by the actor of the source account (i.e., where the money is sent from, and where the check needs to happen), each with its own durable WAL, and periodically these are aggregated (toy sketch below)

(of course, then the locking is hidden in the maintenance of the hashring as dining philosophers are added/removed)
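
a toy sketch of the account → actor mapping (illustrative only; real consistent hashing would give each actor many virtual points on the ring):

    use std::collections::BTreeMap;
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    fn hash<T: Hash>(t: &T) -> u64 {
        let mut h = DefaultHasher::new();
        t.hash(&mut h);
        h.finish()
    }

    struct Ring {
        // hash point -> actor name; assumed non-empty
        points: BTreeMap<u64, &'static str>,
    }

    impl Ring {
        fn actor_for(&self, account: u64) -> &'static str {
            let h = hash(&account);
            // first actor clockwise from the account's hash point,
            // wrapping around to the start of the ring
            self.points
                .range(h..)
                .next()
                .or_else(|| self.points.iter().next())
                .map(|(_, a)| *a)
                .unwrap()
        }
    }

    fn main() {
        let points = ["actor-a", "actor-b", "actor-c"]
            .iter()
            .map(|a| (hash(a), *a))
            .collect();
        let ring = Ring { points };
        println!("account 42 is managed by {}", ring.actor_for(42));
    }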

kragen 2 hours ago | parent [-]

Eliminating the durability constraint doesn't make it any easier to program, just easier to get good performance on.

Distributing accounts among different actors, without two-phase commit or its moral equivalent, enables check kiting.

adwn 10 hours ago | parent | prev | next [-]

> […] but don't cover the use cases of all users.

No single concurrency primitive covers all use cases. I was addressing your misconceptions about mutex performance and overhead, not whether mutexes are the best solution to your particular problem.

> […] it has no way of being notified when the mutex gets unlocked […] The OS (let's assume Linux) won't wake the thread up until the next scheduler tick, and it's under no obligation to do so even then (and has no idea it should).

You've misunderstood the parking_lot implementation. When thread B tries to lock a mutex that's currently locked by thread A, then, after spinning a few cycles, thread B "parks" itself, i.e., it asks the kernel to remove it from the runnable task queue. On Linux, this is done using the futex syscall. When thread A unlocks the mutex, it detects that another thread is waiting on that mutex. Thread A takes one thread from the queue of waiting threads and "unparks" it, i.e., it asks the kernel to move it back into the runnable task queue. The kernel is notified immediately and, if there's a free CPU core available, will tend to dispatch the thread to that core.

On a non-realtime OS, there's no guarantee how long it takes for an unblocked thread to be scheduled again, but that's the case for all concurrency primitives.
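
The park/unpark idea can be sketched with std's own thread primitives (this illustrates the mechanism, not parking_lot's actual code):

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::Arc;
    use std::thread;

    fn main() {
        let locked = Arc::new(AtomicBool::new(true));
        let waiter = {
            let locked = Arc::clone(&locked);
            thread::spawn(move || {
                let mut spins = 0;
                while locked.load(Ordering::Acquire) {
                    spins += 1;
                    if spins > 10 {
                        // Descheduled until explicitly unparked; no
                        // timer-tick polling involved. Spurious wakeups
                        // are handled by re-checking the loop condition.
                        thread::park();
                    }
                }
            })
        };

        // "Unlock": clear the flag, then wake the waiter immediately.
        locked.store(false, Ordering::Release);
        waiter.thread().unpark();
        waiter.join().unwrap();
    }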

Someone 10 hours ago | parent | prev [-]

> A method call taking 10ms and one taking 15 us is a factor of 60x

667 (a thousand 15μs calls take 15ms, so one 10ms call is worth ~667 of them)

ahoka 9 hours ago | parent | prev | next [-]

It's called a futex, and both Linux and Windows have supported it for ages.

adwn 9 hours ago | parent [-]

The 1-byte-per-mutex parking_lot implementation works even on systems that don't provide a futex syscall or equivalent.

magicalhippo 12 hours ago | parent | prev [-]

How does it avoid cache contention with just a few bytes per mutex? That is, multiple mutex instances sharing a cache line. Say I have a structure with multiple int32 counters, each protected by its own mutex.

adwn 11 hours ago | parent | next [-]

Cache contention is (mostly) orthogonal to your locking strategy. If anything, fine-grained locking has the potential to reduce cache contention, because

1) the mutex byte/word is more likely to be in the same cache line as the data you want to access anyway, and

2) different threads are more likely to write to mutex bytes/words in different cache lines, whereas with coarse-grained locking, different threads will fight for exclusive access to the cache line containing that one global mutex.

@magicalhippo: Since I'm comment-rate-throttled, here's my answer to your question:

Typically, you'd artificially increase the size and alignment of the structure:

    #[repr(align(64))]
    struct Status {
        counter: Mutex<u32>,
    }
This struct now has an alignment of 64 and is also 64 bytes in size (instead of just the 4+1 bytes required for the Mutex<u32>), which guarantees that it has a cache line to itself. This is wasteful from a memory perspective, but can be worth it from a performance perspective. As is often the case with optimization, whether this makes your program faster or slower depends very heavily on the specific case.
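
For what it's worth, the layout claim is easy to check (assuming the Status struct above):

    // Both asserts should hold regardless of which Mutex type is inside,
    // because repr(align(64)) rounds the struct's size up to its alignment:
    assert_eq!(std::mem::size_of::<Status>(), 64);
    assert_eq!(std::mem::align_of::<Status>(), 64);
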
magicalhippo 11 hours ago | parent [-]

> different threads are more likely to write to mutex bytes/words in different cache lines

With small objects and sequential allocation, that's not a given in my experience.

Like in my example: the ints could be allocated one per thread to indicate some per-thread status, and the main UI thread wants to read them every now and then, hence they're protected by a mutex.

If they're allocated sequentially, the mutexes end up sharing cache lines and hence lead to effective contention (false sharing), even though there's almost no "actual" contention.

Yes yes, for a single int you might want to use an atomic variable, but this is just for demonstration purposes. I've seen this play out in real code several times, where instead of ints it was, say, a couple of pointers.

I don't know Rust though, so just curious.

gpderetta 10 hours ago | parent [-]

The issue might be allocating the ints contiguously in the first place. No language magic is going to help you avoid thinking about mechanical sympathy.

And allocating the ints contiguously might actually be the right solution, if the cost of sporadic false sharing is less than the cost of wasting memory.

There's no silver bullet.

magicalhippo 10 hours ago | parent [-]

But the mutex encapsulates the int, so if the mutex ensured it occupied a whole number of cache lines, there would be no contention. At the very small cost of a few bytes of memory.

gpderetta 10 hours ago | parent [-]

The mutex forcing alignment would be extremely wasteful. FWIW, I have used 1-bit spinlocks.

torginus 11 hours ago | parent | prev [-]

By not avoiding it. And a year later you get to write a blog post about how you discovered and fixed this phenomenon hitherto unknown to computer science.