adwn 4 days ago

> The standard way to avoid these problems is to use locks to prevent data updates from happening at the same time. This causes big performance hits […]

No. Modern mutex implementations [1] are extremely efficient, require only 1 byte of memory (no heap allocation), and are almost free when there's no contention on the lock – certainly much faster and much lower latency than sending messages between actors.

[1] Like the parking_lot crate for Rust.
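
For concreteness, a minimal Rust sketch of the size and fast-path claims (the crate version is an assumption):

  // Cargo.toml: parking_lot = "0.12" (version is an assumption)
  use parking_lot::Mutex;

  fn main() {
      // parking_lot packs all lock state into a single byte, so a
      // Mutex around a zero-sized type is itself one byte, stored
      // inline with no heap allocation.
      assert_eq!(std::mem::size_of::<Mutex<()>>(), 1);

      let counter = Mutex::new(0u64);
      // The uncontended path is a single atomic compare-exchange;
      // no syscall is made unless another thread holds the lock.
      *counter.lock() += 1;
      assert_eq!(*counter.lock(), 1);
  }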

alfanerd 4 days ago | parent | next [-]

Sending a message between Actors can be just moving a pointer to a piece of shared memory.

I think sending messages is more about the way you think about concurrency than about the implementation.

I have always found the "one thread doing 'while True: receive message, handle message'" approach much easier to reason about than "remember to lock this chunk of data in case more than one thread might access it".
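
In Rust terms, a minimal sketch of that shape, with a std mpsc channel standing in for the mailbox (names are illustrative):

  use std::sync::mpsc;
  use std::thread;

  enum Msg {
      Add(u64),
      Print,
      Stop,
  }

  fn main() {
      let (tx, rx) = mpsc::channel();

      // The "actor": one thread, one mailbox, a plain receive loop.
      // All of its state is private to this thread, so nothing here
      // ever needs a lock.
      let actor = thread::spawn(move || {
          let mut total = 0u64;
          while let Ok(msg) = rx.recv() {
              match msg {
                  Msg::Add(n) => total += n,
                  Msg::Print => println!("total = {total}"),
                  Msg::Stop => break,
              }
          }
      });

      tx.send(Msg::Add(2)).unwrap();
      tx.send(Msg::Add(3)).unwrap();
      tx.send(Msg::Print).unwrap();
      tx.send(Msg::Stop).unwrap();
      actor.join().unwrap();
  }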

__red__ 2 days ago | parent | next [-]

There's a whole lot of discussion below so I'm just going to tag from here.

I think of Pony actors in the same way as I think of Erlang actors. They have a "mailbox", and when they receive a message, they wake up, execute some amount of code, and then go back to sleep.

This is how I think about it. This is not how it is actually implemented.

Here's the key that I think many people miss about pony.

Reference capabilities DO NOT EXIST at runtime.

So let's talk about passing a String iso from Actor A to Actor B (iso is the capability that guarantees that this is the only reference to the object):

  // This is code in Actor A
  actorB.some_behaviour_call(consume myisostring)

The "consume myisostring" completely removes myisostring from Actor A. Any reference to it after this point will result in an "unknown variable myisostring" error from the compiler.

The reference to myisostring then gets sent to Actor B via its mailbox.

If Actor B was idle, then the message receive will cause Actor B to be scheduled, and it will receive the reference to that String iso - completely isolated.
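
For readers more familiar with Rust, a rough analogue of the same guarantee (the channel here is only a stand-in for the mailbox, not how Pony implements it): moving a value into send plays the role of consume, and the compiler likewise rejects any later use.

  use std::sync::mpsc;

  fn main() {
      let (to_actor_b, actor_b_mailbox) = mpsc::channel();

      let myisostring = String::from("hello");
      // Like `consume`, `send` moves the value out of this scope;
      // only the (ptr, len, cap) triple travels, the heap bytes are
      // never copied.
      to_actor_b.send(myisostring).unwrap();

      // println!("{myisostring}"); // compile error: value moved

      // "Actor B" now holds the only reference.
      let received = actor_b_mailbox.recv().unwrap();
      assert_eq!(received, "hello");
  }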

Now, if we're going to measure "performance of passing data between threads" as latency per transaction, then actor contention on a scheduler is going to be a bigger factor.

If you're measuring performance across an entire system with millions of these actions occurring, then I would argue that this approach would be faster as there is no spinning involved.

gpderetta 4 days ago | parent | prev | next [-]

Unless you have NxN queues across actors[1], which is done in some specialized software but is inherently not scalable, queues will end up being more complex than that.

[1] At the very least you will need one queue for each CPU pair, but that's yet another layer of complication.
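
A toy Rust illustration of the scaling problem, assuming one dedicated channel per (sender, receiver) core pair: no queue ever has two writers, but the channel count grows quadratically.

  use std::sync::mpsc::{channel, Receiver, Sender};

  // Hypothetical N x N mesh: one queue per (from, to) core pair.
  fn mesh(n: usize) -> (Vec<Vec<Sender<u64>>>, Vec<Vec<Receiver<u64>>>) {
      let mut senders: Vec<Vec<Sender<u64>>> = (0..n).map(|_| Vec::new()).collect();
      let mut receivers: Vec<Vec<Receiver<u64>>> = (0..n).map(|_| Vec::new()).collect();
      for from in 0..n {
          for to in 0..n {
              let (tx, rx) = channel();
              senders[from].push(tx); // indexed senders[from][to]
              receivers[to].push(rx); // indexed receivers[to][from]
          }
      }
      (senders, receivers)
  }

  fn main() {
      let (senders, receivers) = mesh(64);
      senders[0][5].send(42).unwrap();
      assert_eq!(receivers[5][0].recv().unwrap(), 42);
      // 64 cores already require 4096 queues.
  }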

__red__ 2 days ago | parent | next [-]

Pony schedulers' default behaviour is as follows:

1. One scheduler per core.

2. Schedulers run one actor behaviour at a time.

3. When a scheduler has an empty work queue, it will steal work from other schedulers.

4. The number of schedulers will scale up and down with the amount of work (but never to more than the number of cores).

There are various parameters you can change to alter scheduler behaviour should your pattern of use need it.
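
Not Pony's actual scheduler, but a minimal Rust sketch of the steal-when-your-queue-is-empty idea, using the crossbeam-deque crate (version is an assumption):

  // Cargo.toml: crossbeam-deque = "0.8" (version is an assumption)
  use crossbeam_deque::{Steal, Stealer, Worker};

  type Task = Box<dyn FnOnce() + Send>;

  // Pop from our own queue first; only when it's empty, try peers.
  fn find_task(local: &Worker<Task>, peers: &[Stealer<Task>]) -> Option<Task> {
      local.pop().or_else(|| {
          peers.iter().find_map(|s| loop {
              match s.steal() {
                  Steal::Success(t) => break Some(t),
                  Steal::Empty => break None,
                  Steal::Retry => continue, // lost a race; retry this peer
              }
          })
      })
  }

  fn main() {
      let my_queue: Worker<Task> = Worker::new_fifo();
      let peer_queue: Worker<Task> = Worker::new_fifo();
      peer_queue.push(Box::new(|| println!("stolen task ran")));

      // Our own queue is empty, so the task is stolen from the peer.
      if let Some(task) = find_task(&my_queue, &[peer_queue.stealer()]) {
          task();
      }
  }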

alfanerd 4 days ago | parent | prev [-]

I think you only need one queue per actor? And then one worker per CPU core? I believe that's how Erlang does it, and it handles millions of actors without any issues...

gpderetta 4 days ago | parent [-]

Yes, but now you have contention on the queue.

ramchip 4 days ago | parent [-]

The way Erlang does it is to use buckets, so it looks like a single queue to the user code but is really more like multiple queues behind the scenes. Scales extremely well. It's certainly not "just moving a pointer to a piece of shared memory" though...

https://www.erlang.org/blog/parallel-signal-sending-optimiza...
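
A crude Rust sketch of the sharding idea (not Erlang's actual implementation): the receiver sees one logical mailbox, but senders are spread across several internal queues so they don't all contend on one tail.

  use std::sync::mpsc::{channel, Receiver, Sender};

  // One logical mailbox backed by several internal queues. Senders
  // pick a shard (here by a caller-supplied id), so two senders on
  // different shards never touch the same queue tail.
  struct Mailbox<T> {
      shards: Vec<Receiver<T>>,
      senders: Vec<Sender<T>>,
  }

  impl<T> Mailbox<T> {
      fn new(n_shards: usize) -> Self {
          let (senders, shards): (Vec<_>, Vec<_>) =
              (0..n_shards).map(|_| channel()).unzip();
          Mailbox { shards, senders }
      }

      fn sender_for(&self, sender_id: usize) -> Sender<T> {
          self.senders[sender_id % self.senders.len()].clone()
      }

      // Drain whatever is currently queued, shard by shard.
      fn drain(&self) -> Vec<T> {
          self.shards.iter().flat_map(|rx| rx.try_iter()).collect()
      }
  }

  fn main() {
      let mailbox = Mailbox::new(4);
      mailbox.sender_for(0).send("from sender 0").unwrap();
      mailbox.sender_for(1).send("from sender 1").unwrap();
      assert_eq!(mailbox.drain().len(), 2);
  }

Per-sender ordering survives, but strict global FIFO across shards does not, which is part of why the real optimization (linked above) is subtler.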

adwn 4 days ago | parent | prev | next [-]

> I think sending messages is more about the way you think about concurrency than about the implementation.

That's a valid point of view, but Pony's claim to which I objected is about performance, not ease-of-use or convenience.

adwn 4 days ago | parent | prev [-]

> Sending a message between Actors can be just moving a pointer to a piece of shared memory.

No, you also need synchronization operations on the sending and the receiving end, even if you have a single sender and a single receiver. That's because message queues are implemented on top of shared memory – there's no way around this on general-purpose hardware.

RossBencina 3 days ago | parent [-]

Depends on your definition of synchronization operations. You certainly need memory fences, and possibly atomic operations. These may or may not have a performance cost.
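
To make the last two comments concrete, a minimal single-producer/single-consumer ring in Rust: even with exactly one sender and one receiver, the indices must be atomics with Release/Acquire ordering so the payload write is published before the index update.

  use std::cell::UnsafeCell;
  use std::sync::atomic::{AtomicUsize, Ordering};

  // Minimal SPSC ring (sketch; capacity N should be a power of two).
  struct Spsc<T, const N: usize> {
      buf: [UnsafeCell<Option<T>>; N],
      head: AtomicUsize, // next slot to read (owned by the consumer)
      tail: AtomicUsize, // next slot to write (owned by the producer)
  }

  unsafe impl<T: Send, const N: usize> Sync for Spsc<T, N> {}

  impl<T, const N: usize> Spsc<T, N> {
      fn new() -> Self {
          Spsc {
              buf: std::array::from_fn(|_| UnsafeCell::new(None)),
              head: AtomicUsize::new(0),
              tail: AtomicUsize::new(0),
          }
      }

      fn push(&self, v: T) -> Result<(), T> {
          let t = self.tail.load(Ordering::Relaxed);
          if t.wrapping_sub(self.head.load(Ordering::Acquire)) == N {
              return Err(v); // full
          }
          unsafe { *self.buf[t % N].get() = Some(v) };
          // Release: the slot write above must be visible before the
          // consumer can observe the new tail.
          self.tail.store(t.wrapping_add(1), Ordering::Release);
          Ok(())
      }

      fn pop(&self) -> Option<T> {
          let h = self.head.load(Ordering::Relaxed);
          // Acquire: pairs with the Release store in push.
          if h == self.tail.load(Ordering::Acquire) {
              return None; // empty
          }
          let v = unsafe { (*self.buf[h % N].get()).take() };
          self.head.store(h.wrapping_add(1), Ordering::Release);
          v
      }
  }

  fn main() {
      let q: Spsc<u64, 8> = Spsc::new();
      q.push(7).unwrap();
      assert_eq!(q.pop(), Some(7));
  }

On x86, those Acquire/Release operations compile to plain loads and stores, which is the sense in which they "may or may not have a performance cost".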

senderista 4 days ago | parent | prev | next [-]

Nit: you can’t have a 1-byte mutex unless you implement your own wait queues like parking_lot does. Any purely futex-based mutex (i.e. one that delegates all the blocking logic to futex syscalls) must be at least 4 bytes.
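
The sizes are easy to check, though the std number is platform-dependent (treat the comments below as assumptions about x86-64 Linux with a recent toolchain):

  // Cargo.toml: parking_lot = "0.12" (version is an assumption)
  fn main() {
      // std's Mutex on Linux is futex-based; the futex word alone is
      // 4 bytes, since futex(2) operates on a 32-bit value.
      println!("{}", std::mem::size_of::<std::sync::Mutex<()>>());

      // parking_lot keeps its wait queues in a global table keyed by
      // address, so the lock itself needs only 1 byte.
      println!("{}", std::mem::size_of::<parking_lot::Mutex<()>>());
  }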

adwn 4 days ago | parent | prev | next [-]

Why is this downvoted? It's factually correct, on-topic, and relevant (because it contradicts a claim on the linked website). If you disagree, say so and we can discuss it.

otabdeveloper4 4 days ago | parent | prev [-]

A contended mutex is a system call and likely stalls all the CPUs on your machine.

Spinlocks will only waste cycles on one CPU. A huge difference when you have dozens or hundreds of cores.
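
For reference, a minimal test-and-test-and-set spinlock in Rust; the waiter burns cycles on its own core instead of entering the kernel:

  use std::sync::atomic::{AtomicBool, Ordering};

  // Sketch only: no RAII guard, no fairness, no backoff.
  pub struct SpinLock(AtomicBool);

  impl SpinLock {
      pub const fn new() -> Self {
          SpinLock(AtomicBool::new(false))
      }

      pub fn lock(&self) {
          loop {
              // Spin on a plain load first so waiters don't keep
              // bouncing the cache line with failed compare-exchanges.
              while self.0.load(Ordering::Relaxed) {
                  std::hint::spin_loop();
              }
              if self
                  .0
                  .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                  .is_ok()
              {
                  return; // acquired; no syscall was ever made
              }
          }
      }

      pub fn unlock(&self) {
          self.0.store(false, Ordering::Release);
      }
  }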

gpderetta 4 days ago | parent | next [-]

So does a contended queue. As much as I might like the model, message passing is not a silver bullet. Any sufficiently complex message passing system will end up implementing shared memory on top of it... and mutexes.

adwn 4 days ago | parent | prev [-]

> A contended mutex is a system call […]

Because modern mutexes are so cheap (only 1 byte directly in the data structure, no heap allocation), you can do very fine-grained locking. This way, a mutex will almost never be contended. Keep in mind that a reader waiting on an empty queue or a writer waiting on a full queue will also involve syscalls.
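
A sketch of what that fine-grained style can look like in Rust: lock striping, with one tiny mutex per shard so that two threads rarely want the same lock (shard count and types are illustrative):

  // Cargo.toml: parking_lot = "0.12" (version is an assumption)
  use parking_lot::Mutex;
  use std::collections::hash_map::RandomState;
  use std::collections::HashMap;
  use std::hash::{BuildHasher, Hash};

  struct ShardedMap<K, V> {
      shards: Vec<Mutex<HashMap<K, V>>>,
      hasher: RandomState,
  }

  impl<K: Hash + Eq, V> ShardedMap<K, V> {
      fn new(n_shards: usize) -> Self {
          ShardedMap {
              shards: (0..n_shards).map(|_| Mutex::new(HashMap::new())).collect(),
              hasher: RandomState::new(),
          }
      }

      fn insert(&self, k: K, v: V) {
          // Each key hashes to one shard; threads touching different
          // shards never contend, and each lock costs only one byte.
          let i = (self.hasher.hash_one(&k) % self.shards.len() as u64) as usize;
          self.shards[i].lock().insert(k, v);
      }
  }

  fn main() {
      let map = ShardedMap::new(16);
      map.insert("answer", 42);
  }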

> […] and likely stalls all the CPUs on your machine.

Huh? Where did you get this idea? Only the waiting thread will be blocked, and it won't "stall" the core, let alone the entire CPU.

By the way, if all your threads are waiting on a single mutex, then your architecture is wrong. In the equivalent case, all your actors would be waiting on one central actor as well, so you'd have the same loss of parallelism.