| ▲ | ibraheemdev 6 hours ago |
| > OS threads are expensive: an operating system thread typically reserves a megabyte of stack space

Why is reserving a megabyte of stack space "expensive"?

> and takes roughly a millisecond to create

I'm not sure where this number is from; it seems off by about two orders of magnitude. On Linux, thread creation is closer to 10 microseconds. |
|
| ▲ | n_e 5 hours ago | parent | next [-] |
| > Why is reserving a megabyte of stack space "expensive"?

Because if you use one thread for each of your 10,000 idle sockets, you will reserve 10 GB of stack space to do nothing. So you'll want to use a better architecture such as a thread pool. And if you want your better architecture to be generic and ergonomic, you'll end up with async or green threads. |
| |
| ▲ | lelanthran 3 hours ago | parent | next [-] | | > Because if you use one thread for each of your 10,000 idle sockets you will use 10GB to do nothing.

1. On a system that is handling 10k concurrent requests, the 10 GB of RAM is going to be a fraction of what is installed.

2. It's not 10 GB of RAM anyway, it's 10 GB of address space. It still only gets faulted into real RAM when it gets used. | | |
| ▲ | n_e 3 hours ago | parent | next [-] | | > 1. On a system that is handling 10k concurrent requests, the 10GB of RAM is going to be a fraction of what is installed.

My example (and the c10k problem) is 10k concurrent connections, not 10k concurrent requests.

> 2. It's not 10GB of RAM anyway, it's 10GB of address space. It still only gets faulted into real RAM when it gets used.

Yes, and that's both memory and CPU usage that isn't needed under a better concurrency model. That's why no high-performance server software uses a huge number of threads, and many use the reactor pattern. | | |
| ▲ | cmrdporcupine 3 hours ago | parent [-] | | > Yes, and that's both memory and cpu usage that isn't needed

No, it literally is not. The "memory" is just entries in a page table in the kernel and MMU. It shouldn't worry you at all. Nor is the CPU the kernel spends managing those threads necessarily less efficient than someone's handrolled async runtime. In fact, given that it gets more eyes, it's likely more efficient. The sole argument I can see is avoiding a handful of syscalls and crossing the kernel<->userspace blood-brain barrier too often. |
| |
| ▲ | com2kid 3 hours ago | parent | prev [-] | | > 1. On a system that is handling 10k concurrent requests, the 10GB of RAM is going to be a fraction of what is installed

I've written massively concurrent systems where each connection only handled maybe a few kilobytes of data. Async IO is a massive win in those situations. This describes many REST endpoints: fetch a few rows from a DB, return some JSON. |
| |
| ▲ | wmf 4 hours ago | parent | prev | next [-] | | On a 64-bit system, 10 GB of address space is nothing. | | |
| ▲ | matheusmoreira 2 hours ago | parent [-] | | 10 GB of RAM is certainly something though. Especially in current times. | | |
| ▲ | monocasa 2 hours ago | parent [-] | | Except if those threads are actually faulting in all of that memory and making it resident, they'd be doing the same thing, just on the heap, for a classic async coroutine style application. | | |
| ▲ | asdfasgasdgasdg an hour ago | parent [-] | | If you have hugepages enabled, all of those threads are probably faulting in a fair amount of memory. |
|
|
| |
| ▲ | duped 3 hours ago | parent | prev [-] | | > you will use 10GB to do nothing.

You don't pay for stack space you don't use unless you disable overcommit. And if you disable overcommit on modern Linux, the machine will very quickly stop functioning. | | |
| ▲ | simonask 3 hours ago | parent [-] | | The amount of stack you pay for on a thread is proportional to the maximum depth that the stack ever reached on the thread. Operating systems can grow the amount of real memory allocated to a thread, but never shrink it. It’s a programming model that has some really risky drawbacks. | | |
| ▲ | matheusmoreira 2 hours ago | parent [-] | | > Operating systems can grow the amount of real memory allocated to a thread, but never shrink it.

Operating systems can shrink the memory usage of a stack:

  madvise(page, size, MADV_DONTNEED);

This leaves the memory mapping intact, but the kernel frees the underlying resources. Subsequent accesses get either fresh zero pages or the original file's pages. Linux also supports mremap, which is essentially a kernel version of realloc and supports both growing and shrinking memory mappings:

  stack = mremap(stack, old_size, old_size / 2, MREMAP_MAYMOVE);

Whether existing systems make use of this is another matter entirely. My language uses mremap to grow and shrink stacks. C programs can't do it because pointers to stack-allocated objects may exist. |
|
|
|
|
| ▲ | jandrewrogers 2 hours ago | parent | prev | next [-] |
The author doesn't fully justify the assertion, but it does have a sound basis. While virtual memory allocation does not require physical allocation, it immediately runs into the kinds of performance problems that huge pages are designed to solve. On modern systems, you can burn up most of your virtual address space through casual indifference to how it maps to physical memory and the TLB space it consumes. Spinning up thousands of stacks is kind of a pathological case here. And 10 µs is an eternity for high-performance software architectures; it is around the same order of magnitude as a disk access on modern NVMe, and an enormous amount of effort goes into avoiding blocking on NVMe access at that latency for good reason. 10 µs is not remotely below the noise floor in terms of performance.
|
| ▲ | eklitzke 5 hours ago | parent | prev | next [-] |
Yeah, none of this makes sense to me. Allocating memory for stack space is not expensive (and the default isn't even 1MB??) because you're just creating a VMA and probably faulting in one or two pages. They also say:

> The system spends time managing threads that could be better spent doing useful work.

What do they think the async runtime in their language is doing? It's literally doing the same thing the kernel would be doing. There's nothing that intrinsically makes scheduling 10k coroutines in userspace more efficient than the kernel scheduling 10k threads. Context switches are really only expensive when the switch happens between different processes; the overhead of a context switch between two threads in the same process is very small (and switches aren't free when done in userspace either). There are advantages to doing scheduling in the kernel and advantages to doing it in userspace, but this article doesn't really touch on any of the actual pros and cons; it just assumes that userspace scheduling is automatically more efficient.
| |
| ▲ | tcfhgj an hour ago | parent | next [-] | | Doesn't an async runtime have more knowledge about its tasks than the OS has about the threads? | |
| ▲ | cmrdporcupine 3 hours ago | parent | prev [-] | | It's a cargo cult and a bias I see all over the place. I feel like we're now, what, 20, 25 years on, and people still haven't adjusted to the fact that the machines we have now are multicore and have boatloads of cache, or to how that cache is shared (or not) between cores. Nor is there apparently a real understanding of the difference between VSS and RSS, or of the fact that modern machines are really, really fast if you can keep stuff in cache. And so you really should be focused on how you can make that happen. |
|
|
| ▲ | matheusmoreira 2 hours ago | parent | prev | next [-] |
| 1 megabyte stacks mean ten thousand threads require 10 gigabytes of RAM just for the stacks. The entire point of the asynchronous programming paradigm is to reclaim all of those gigabytes by not allowing stacks to develop at all, by stealthily turning everything into a hidden form of cooperative multitasking instead. |
| |
| ▲ | monocasa 2 hours ago | parent [-] | | Only if they're resident. Otherwise you just need one page of physical memory per thread (so ~40MB with 4 KiB pages) and 10GB of virtual memory. | | |
| ▲ | matheusmoreira 2 hours ago | parent [-] | | While that's strictly true, resident memory in this context is a function of the worst-case memory usage of the code executing on those stacks, and it seems wise to assume the worst case when discussing this. The program could use one page's worth of stack space, which is optimal. It could use something like 200 bytes of stack space, which wastes the rest of the page. Or it could recurse nearly to the end of its stack, stop just before overflow, then unwind back to a constant 200 bytes of stack usage and never touch those pages again. |
|
|
|
| ▲ | magicalhippo 6 hours ago | parent | prev | next [-] |
| > Why is reserving a megabyte of stack space "expensive"?

Guess it's not a huge issue in these 64-bit days, but back in the 32-bit days it was a real limit on how many threads you could spin up, due to the limited address space. Of course, most applications which hit this would override the 1MB default. |
|
| ▲ | cmrdporcupine 3 hours ago | parent | prev | next [-] |
There's a lot of ridiculous hatred for OS threads based on people's biases about operating systems and hardware from 20 years ago. So much so that they'll sign themselves up for async frameworks that work-steal at will and bounce tasks across cores, causing cache-line bouncing and the associated memory stalls, without understanding what this does to their performance profile. And they'll endure complexity through awkward async call chains and function colouring along the way. Most people's applications would be totally fine just spawning OS threads, using them without fear, and dropping into a futex when waiting on I/O; or using the kernel's own async completion frameworks. The OS scheduler is highly efficient, and it is very good at managing multiple cores and even being aware of asymmetric CPU topologies. Likely more efficient than half the async runtimes out there.
| |
|
| ▲ | delusional 3 hours ago | parent | prev [-] |
| > Why is reserving a megabyte of stack space "expensive"?

Equally, if a megabyte of stack is a lot for your use case, can't you just ask pthreads to reserve less? I believe it goes down to as little as 16 KB (PTHREAD_STACK_MIN). |