I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).

tcmalloc (thread-caching malloc) assumes memory allocations have good thread locality. When that holds, it's a double win: less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator.

Multithreaded async systems destroy that locality, so the allocator constantly has to run through the exception case: thread A allocates a buffer and goes async, the request wakes up on thread B, and B frees the buffer and has to synchronize with A to give it back.
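To make the cross-thread free concrete, here's a minimal std-only sketch of the pattern. The function name `alloc_here_free_elsewhere` is made up for illustration, and a channel hand-off stands in for "went async and woke up elsewhere". It only shows where the allocation and the free happen; what the allocator does internally in that case depends on the allocator.

```rust
use std::sync::mpsc;
use std::thread::{self, ThreadId};

/// Allocates a buffer on one thread ("A") and drops it on another ("B").
/// Returns the IDs of the allocating and the freeing thread.
/// (Hypothetical helper, not part of any allocator or runtime API.)
fn alloc_here_free_elsewhere(size: usize) -> (ThreadId, ThreadId) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    // Thread A: allocates, then hands the buffer off.
    let a = thread::spawn(move || {
        let buf = vec![0u8; size];
        tx.send(buf).expect("receiver is alive");
        thread::current().id()
    });

    // Thread B: receives the buffer and frees it.
    let b = thread::spawn(move || {
        let buf = rx.recv().expect("sender sent a buffer");
        drop(buf); // the free happens here, not on the allocating thread
        thread::current().id()
    });

    (a.join().unwrap(), b.join().unwrap())
}

fn main() {
    let (alloc_tid, free_tid) = alloc_here_free_elsewhere(4096);
    println!("allocated on {:?}, freed on {:?}", alloc_tid, free_tid);
}
```

With a purely thread-local cache, the block freed on B can't simply go back onto A's free list without some form of cross-thread coordination, which is the cost described above.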

Are you using async Rust or sync Rust?

skavi 6 hours ago

modern tcmalloc uses per-CPU caches via rseq [0]. We use async rust with multithreaded tokio executors (sometimes multiple in the same application), so relatively high thread counts.

[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...

usrnm 5 hours ago

How do you control which CPU your task resumes on? If you don't, then it's still the same problem described above, no?

skavi an hour ago

on the OS scheduler side, i'd imagine there's some stickiness that keeps tasks from jumping wildly between cores. i'd expect migration to be modelled as a nonzero cost. complete speculation though.

on the tokio scheduler side, the executor is thread-per-core, and work stealing of in-progress tasks shouldn't be happening too much.

for all thread pool threads or threads unaffiliated with the executor, see earlier speculation on OS scheduler behavior.
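For a rough feel of what "resuming on a different thread" looks like, here's a toy std-only model. It is not tokio's scheduler: it uses one shared queue rather than per-worker queues with stealing, so it likely overstates how often tasks move. `Task` and `run_tasks` are hypothetical names for illustration.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread::{self, ThreadId};

/// A toy two-step task that remembers which thread ran its first step.
struct Task {
    id: usize,
    first_step_on: Option<ThreadId>,
}

/// Runs `n_tasks` two-step tasks on `n_workers` threads sharing one queue.
/// Returns (task id, thread of step 1, thread of step 2), sorted by task id.
fn run_tasks(n_workers: usize, n_tasks: usize) -> Vec<(usize, ThreadId, ThreadId)> {
    let queue: Arc<Mutex<VecDeque<Task>>> = Arc::new(Mutex::new(
        (0..n_tasks).map(|id| Task { id, first_step_on: None }).collect(),
    ));
    let done = Arc::new(Mutex::new(Vec::new()));

    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let done = Arc::clone(&done);
            thread::spawn(move || loop {
                // Take the next runnable task; stop once the queue looks empty.
                let task = queue.lock().unwrap().pop_front();
                let Some(mut task) = task else { break };
                let me = thread::current().id();
                match task.first_step_on {
                    // Step 1: simulate an await by re-queueing the task.
                    None => {
                        task.first_step_on = Some(me);
                        queue.lock().unwrap().push_back(task);
                    }
                    // Step 2: resumed by whichever worker popped it.
                    Some(first) => done.lock().unwrap().push((task.id, first, me)),
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
    let mut results = done.lock().unwrap().clone();
    results.sort_by_key(|r| r.0);
    results
}

fn main() {
    let results = run_tasks(4, 64);
    let moved = results.iter().filter(|(_, a, b)| a != b).count();
    println!("{} of {} tasks resumed on a different thread", moved, results.len());
}
```

The count varies from run to run. Anything a task allocated in step 1 and freed in step 2 would be a cross-thread free whenever the two thread IDs differ.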

packetlost 38 minutes ago

Correct. The Linux scheduler has been NUMA-aware and sticky for a while (which is more or less what this reduces to in common scenarios).