stackghost 5 hours ago

>SBCL dying from heap exhaustion

Due to hitting the cap, or to fragmentation? My understanding is the new parallel GC compacts the heap rather infrequently.

dang 5 hours ago | parent [-]

If by the cap you mean the heap size passed in as the --dynamic-space-size argument, then it definitely didn't hit the cap. It was using about 2/3 of that.

> My understanding is the new parallel GC compacts the heap rather infrequently

Can you explain more?

stackghost 4 hours ago | parent [-]

I'm going to caveat this by stating up front that obviously HN's source code is not public, so I don't know what your hot path looks like, and that I'm not a domain expert on garbage collection, but I do write a fair amount of lisp for SBCL.

Immix-style collectors, like the new GC in SBCL, only compact on an opportunistic basis, so you get fragmentation pressure under load. In that situation you might be well under the dynamic-space-size cap, but if the collector can't find a large enough contiguous chunk of free heap, it will still die.
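To make the contiguity point concrete, here's a toy model (pure arithmetic, not SBCL internals; the block size and layout are made up): total free space can comfortably exceed a request while no single contiguous run can satisfy it.

```python
# Toy illustration (not SBCL internals): a heap of fixed-size blocks where
# total free space exceeds a request, yet no contiguous run can satisfy it.
BLOCK = 32 * 1024  # hypothetical 32 KiB blocks

# 1 = block holds at least one live object, 0 = fully free
heap = [1, 0, 1, 0, 1, 0, 1, 0]  # 4 free blocks scattered among live ones

def largest_contiguous_free(heap):
    """Longest run of fully-free blocks, in bytes."""
    best = run = 0
    for b in heap:
        run = run + 1 if b == 0 else 0
        best = max(best, run)
    return best * BLOCK

total_free = heap.count(0) * BLOCK       # 128 KiB free in total
largest = largest_contiguous_free(heap)  # but only 32 KiB contiguous

request = 3 * BLOCK                      # a 96 KiB allocation
print(total_free >= request)  # True: enough free space overall
print(largest >= request)     # False: the allocation fails anyway
```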

So, fragmentation would be my prime suspect given what you described.

dang 4 hours ago | parent [-]

Sorry for suddenly clinging to you for support but might we be better off using the older GC in that case?

stackghost 3 hours ago | parent [-]

No problem. You might be better off moving back, yes.

My understanding of immix-style collection is that it divides the heap into blocks and lines. A block is only compacted/reused if every object in it is dead, so if you mix lifetimes (e.g. lots of short-lived requests, medium-life sessions, long-life db connections/caches/interned symbols) then you tend to fill up blocks with a mix of short- and long-lived objects as users log in and make requests.

When the requests get de-allocated the session remains (because the user closed the tab but didn't log out, for example, so the session is still valid), and so you end up with a bunch of blocks that are partially occupied by long-lived objects. This is what drives fragmentation, because live objects don't get moved/compacted/de-fragged very often. Eventually you fill up your entire heap with partially-occupied blocks, there is no single contiguous span of memory large enough to fit a new allocation, and the allocator shits its pants.
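You can see that fill pattern in a toy simulation (made-up numbers, nothing SBCL-specific): pack arriving objects into fixed-size blocks, let all the short-lived ones die, and count how many blocks a small minority of long-lived survivors still pin.

```python
import random

# Toy model (not SBCL): objects are packed into blocks in arrival order.
# A block is reclaimable only when *every* object in it is dead.
OBJS_PER_BLOCK = 8
random.seed(0)

# ~5% long-lived "session" objects interleaved with short-lived "request"
# objects, in arrival order -- the mixed-lifetime workload described above.
objs = ["session" if random.random() < 0.05 else "request"
        for _ in range(800)]
blocks = [objs[i:i + OBJS_PER_BLOCK]
          for i in range(0, len(objs), OBJS_PER_BLOCK)]

# After every request dies, only blocks holding zero sessions free up;
# a single surviving session pins its whole block.
reclaimable = sum(1 for b in blocks if "session" not in b)
pinned = len(blocks) - reclaimable
print(f"{pinned}/{len(blocks)} blocks pinned by a few long-lived objects")
```

Even though only ~5% of the objects survive, a large fraction of the blocks end up un-reclaimable, which is the fragmentation mechanism in a nutshell.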

So if that's what the HN backend looks like architecturally (mixed lifetimes), then you'd probably benefit from the old GC because when it collects, it copies all live objects into new memory and you get defragmentation "for free" as a byproduct. Obviously it's doing more writing so pauses can be more pronounced, but I feel like for a webapp that might be a good trade-off.

Alternatively you can allocate into dedicated arenas based on lifetime. That might be the best solution, at the expense of more engineering. Profiling and testing would tell you for sure.
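Rough sketch of what I mean by lifetime-segregated arenas (hypothetical names, not an SBCL API): each lifetime class allocates into its own pool, so tearing down per-request data is one bulk reset that can never fragment the long-lived pool.

```python
# Hedged sketch of lifetime-segregated arenas. Names and structure are
# made up for illustration; this is not how SBCL exposes arenas.
class Arena:
    def __init__(self, name):
        self.name = name
        self.objects = []

    def alloc(self, obj):
        """Allocate into this arena and hand the object back."""
        self.objects.append(obj)
        return obj

    def reset(self):
        """Bulk-free everything in this arena at once."""
        self.objects.clear()

request_arena = Arena("per-request")   # dies at end of each request
session_arena = Arena("per-session")   # dies on logout/expiry
global_arena = Arena("process-wide")   # caches, interned symbols, db pools

def handle_request():
    request_arena.alloc({"params": "..."})
    request_arena.alloc(bytearray(1024))  # scratch buffer
    # ... do the actual work ...
    request_arena.reset()  # one cheap bulk free; no mixed-lifetime blocks

handle_request()
print(len(request_arena.objects))  # 0: all per-request data gone
```

The point is that short- and long-lived objects never share a block in the first place, so there's nothing left to fragment.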

christophilus 33 minutes ago | parent [-]

I love HN. This is gold.