pron 2 days ago
> It’s the same as other layers we have like auto-vectorization where you don’t know if it’s working without performance analysis. The complexity compounds and reasoning about performance gets harder because the interactions get more complex with abstractions like these.

Reasoning about performance is hard as it is, given nondeterministic optimisations by the CPU. Furthermore, a program that's optimal for one implementation of the AArch64 architecture can be far from optimal for a different implementation of the same architecture. Because of that, reasoning deeply about micro-optimisations can be counterproductive, as your analysis today could be outdated tomorrow (or on a different vendor's chip). Full low-level control is helpful when you have full knowledge of the exact environment, including hardware details, and may be harmful otherwise.

What is meant by "performance" is also subjective. Improving average performance and improving worst-case performance are not the same thing. Nor is improving the performance of the most efficient program possible the same as improving the performance of the program you are likely to write given your budget. For example, it may be the case that a low-level language would yield a faster program given virtually unlimited resources, yet a higher-level language with less deterministic optimisation would yield a faster program if you have a more limited budget. Put another way, it may be cheaper to get to 100% of the maximal possible performance in language A, but cheaper to get to 97% in language B. If you don't need more than 97%, language B is the "faster language" from your perspective, as the programs you can actually afford to write will be faster.

> Also, the more I work with this stuff the more I think trying to avoid memory management is foolish.

It's not about avoiding thinking about memory management but about finding good memory management algorithms for your target definition of "good". Tracing garbage collectors offer a set of very attractive algorithms that aren't always easy to match (when it comes to throughput, at least, and in some situations even latency) and offer a knob that lets you trade footprint for speed. More manual memory management, as well as refcounting collectors, often tends to miss that sweet spot, as both tend to optimise for footprint over throughput. See this great talk on the RAM/CPU tradeoff from this year's ISMM (International Symposium on Memory Management): https://youtu.be/mLNFVNXbw7I. It focuses on tracing collectors, but the point applies to all memory management solutions.

> Should have noted that Zig solves this by making the convention be to pass an allocator in to any function that allocates. So the boundaries/responsibilities become very clear.

Yes, and arenas may give such usage patterns a similar CPU/RAM knob to tracing collectors, but this level of control isn't free. In the end you have to ask yourself whether what you're gaining is worth the added effort.
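For illustration, a minimal sketch of that convention and of the arena knob, assuming recent Zig std APIs (std.mem.Allocator, std.heap.ArenaAllocator, std.fmt.allocPrint); makeGreeting is a made-up example function, not anyone's real API:

    const std = @import("std");

    // Convention: any function that allocates takes an allocator parameter,
    // so the caller chooses the strategy and owns the cleanup.
    // (makeGreeting is a hypothetical example.)
    fn makeGreeting(allocator: std.mem.Allocator, name: []const u8) ![]u8 {
        return std.fmt.allocPrint(allocator, "Hello, {s}!", .{name});
    }

    pub fn main() !void {
        // An arena turns many small allocations into one bulk free:
        // memory is held for the arena's whole lifetime (more footprint)
        // in exchange for cheaper allocation and no per-object frees.
        var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
        defer arena.deinit(); // everything allocated from the arena is freed here

        const greeting = try makeGreeting(arena.allocator(), "world");
        std.debug.print("{s}\n", .{greeting});
    }

The caller could just as easily pass a general-purpose or fixed-buffer allocator instead; the callee doesn't change.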
rudedogg 2 days ago
I enjoy reading your comments here. Thanks for sharing your knowledge, I'll watch the talk.

> Yes, and arenas may give such usage patterns a similar CPU/RAM knob to tracing collectors, but this level of control isn't free. In the end you have to ask yourself if what you're gaining is worth the added effort.

For me, using them has been very easy/convenient. My earlier attempts with Zig used alloc/defer free everywhere, and it required a lot of thought not to make mistakes. But on my latest project I'm using arenas and it's much more straightforward.
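For what it's worth, the difference between the two styles looks roughly like this (a hypothetical sketch, assuming recent Zig std APIs; withDeferFree and withArena are made-up names):

    const std = @import("std");

    // Style 1: every allocation is paired with its own deferred free.
    fn withDeferFree(allocator: std.mem.Allocator) !void {
        const a = try allocator.alloc(u8, 64);
        defer allocator.free(a);
        const b = try allocator.alloc(u8, 128);
        defer allocator.free(b);
        @memset(a, 0); // pretend work so the buffers are actually used
        @memset(b, 0);
    }

    // Style 2: an arena; all allocations share one lifetime and one bulk free.
    fn withArena(base: std.mem.Allocator) !void {
        var arena = std.heap.ArenaAllocator.init(base);
        defer arena.deinit(); // frees everything allocated below at once
        const a = try arena.allocator().alloc(u8, 64);
        const b = try arena.allocator().alloc(u8, 128);
        @memset(a, 0);
        @memset(b, 0);
    }

    pub fn main() !void {
        try withDeferFree(std.heap.page_allocator);
        try withArena(std.heap.page_allocator);
    }

The arena version trades a bit of extra memory held until deinit for not having to track each allocation individually.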