Panzer04 6 days ago

That's not even the point they're really making here, IMO.

The significant decrease they talk about is a side effect of their chosen language having a GC. This means the strings take more work to deal with than expected.

To me this speaks more to the fact that the often small costs associated with certain operations do eventually add up. It's not entirely clear in the post where and when the cost from the GC is incurred, though; I'd presume on creation and destruction?

yvdriess 6 days ago | parent | next [-]

The cost of a string array is paid on every GC phase. That array may contain references, so the GC has to check each element every time to see if anything changed. An int array cannot contain references, so it can be skipped.

edit: There are tricks to not traverse a compound object every time, but assume that at least one of the 80M objects in that giant array gets modified in between GC activations.
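
To make the difference concrete, here is a toy model of a mark phase in Rust. It is only a sketch, not how any particular runtime does it: `Object`, `Heap`, `mark` and the two builder functions are made-up names, and real collectors add tricks (card marking, generations) that this ignores. It just counts how many reference slots the marker has to look at.

```rust
// Toy model of the mark phase of a tracing GC.
// All types and functions here are hypothetical, for illustration only.

/// A heap object in the toy model.
enum Object {
    /// A string payload: no outgoing references.
    Str(String),
    /// An array of plain integers: it cannot hold references,
    /// so the marker never needs to look inside it.
    IntArray(Vec<i64>),
    /// An array of references (indices into the heap), e.g. a string array.
    RefArray(Vec<usize>),
}

struct Heap {
    objects: Vec<Object>,
}

impl Heap {
    /// Marks everything reachable from `roots`.
    /// Returns (objects marked, reference slots examined).
    fn mark(&self, roots: &[usize]) -> (usize, usize) {
        let mut marked = vec![false; self.objects.len()];
        let mut worklist: Vec<usize> = roots.to_vec();
        let mut objects_marked = 0;
        let mut slots_examined = 0;

        while let Some(id) = worklist.pop() {
            if marked[id] {
                continue;
            }
            marked[id] = true;
            objects_marked += 1;

            // Only reference-bearing objects need their contents scanned.
            if let Object::RefArray(slots) = &self.objects[id] {
                for &target in slots {
                    slots_examined += 1;
                    worklist.push(target);
                }
            }
        }
        (objects_marked, slots_examined)
    }
}

/// Builds a heap with `n` strings plus one array referencing all of them.
/// Returns the heap and the index of the array, which acts as the root.
fn heap_with_string_array(n: usize) -> (Heap, usize) {
    let mut objects: Vec<Object> = (0..n).map(|i| Object::Str(i.to_string())).collect();
    objects.push(Object::RefArray((0..n).collect()));
    (Heap { objects }, n)
}

/// Builds a heap holding a single array of `n` integers (the root).
fn heap_with_int_array(n: usize) -> (Heap, usize) {
    let objects = vec![Object::IntArray((0..n as i64).collect())];
    (Heap { objects }, 0)
}
```

In this model, marking from the string array touches every element on every call to `mark`, while marking from the int array touches one object and zero slots, regardless of `n`.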

Panzer04 6 days ago | parent [-]

That seems like a huge burden, surely not? How often would a GC typically check for hanging references?

yvdriess 6 days ago | parent [-]

That's most of the work performed by a marking GC.

How much of the total CPU cost goes to GC depends entirely on the application, the GC implementation and the language. It's famously hard to measure memory management overhead; GC in production is anywhere between 7-82% (Cai ISPASS2022). I measured about 19% geomean overhead in accurate simulation by ignoring instructions involved in GC/MM in Python's pyperf benchmarks.

tialaramex 6 days ago | parent | prev [-]

Even without a GC, actual strings are potentially expensive because each of them is a heap allocation. If you have a small string optimisation you avoid this for small strings (e.g. popular C++ standard library string types can have up to 22 bytes of SSO, the Rust CompactString has 24 bytes), but I wouldn't expect a GC language to have the SSO.
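
For anyone unfamiliar with SSO, a rough sketch of the idea in Rust follows. This is not how CompactString or any C++ standard library lays things out (those pack the length and discriminant far more cleverly); `SmallStr` and `INLINE_CAP` are hypothetical names, and the 22-byte capacity is picked only to match the figure above.

```rust
/// Inline capacity chosen for illustration only; real libraries pick their own.
const INLINE_CAP: usize = 22;

/// A hypothetical small-string type: short strings live inline in the value
/// itself, longer ones fall back to a heap allocation.
enum SmallStr {
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    Heap(String),
}

impl SmallStr {
    fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            // Copy the bytes into the inline buffer; no heap allocation happens.
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStr::Inline { len: s.len() as u8, buf }
        } else {
            // Too long for the inline buffer: allocate on the heap.
            SmallStr::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            SmallStr::Inline { len, buf } => {
                // The bytes were copied from a valid &str, so this cannot fail.
                std::str::from_utf8(&buf[..*len as usize]).expect("valid UTF-8")
            }
            SmallStr::Heap(s) => s.as_str(),
        }
    }

    fn is_inline(&self) -> bool {
        matches!(self, SmallStr::Inline { .. })
    }
}
```

With this, `SmallStr::new("hello")` stays inline, while anything over 22 bytes goes through `String` and pays for an allocation.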