It's easily imaginable that there are new CPU features that would help with building an efficient Java VM, if that's the CPU's primary purpose. Just from the top of my head, one might want a form of finer-grainer memory virtualization that could enable very cheap concurrent garbage collection.

But having Java bytecode as the actual instruction set architecture doesn't sound too useful. It's true that any modern processor has a "compilation step" into microcode anyway, so in an abstract sense, that might as well be some kind of bytecode. But given the high-level nature of Java's bytecode instructions in particular, there are certainly some optimizations that are easy to do in a software JIT, and that just aren't practical to do in hardware during instruction decode.

What I can imagine is a purpose-built CPU that would make the JIT's job a lot easier and faster than compiling for x86 or ARM. Such a machine wouldn't execute raw Java bytecode, rather, something a tiny bit more low-level.

▲

hayley-patton 42 minutes ago | parent | next [-]

> What I can imagine is a purpose-built CPU that would make the JIT's job a lot easier and faster than compiling for x86 or ARM. Such a machine wouldn't execute raw Java bytecode, rather, something a tiny bit more low-level.

This is approximately exactly what Azul Systems did, doing a bog-standard RISC with hardware GC barriers and transactional memory. Cliff Click gave an excellent talk on it [0] and makes your argument around 20:14.

[0] https://www.youtube.com/watch?v=5uljtqyBLxI

	▲	MangoToupe 18 minutes ago \| parent [-]
		I imagine that's where the request for finer grained virtualization comes from

▲

pron an hour ago | parent | prev | next [-]

Running Java workloads is very important for most CPUs these days, and both ARM and Intel consult with the Java team on new features (although Java's needs aren't much different from those of C++). But while you're right that with modern JITs, executing Java bytecode directly isn't too helpful, our concurrent collectors are already very efficient (they could, perhaps, take advantage of new address masking features).

I think there's some disconnect between how people imagine GCs work and how the JVMs newest garbage collectors actually work. Rather than exacting a performance cost, they're more often a performance boost compared to more manual or eager memory management techniques, especially for the workloads of large, concurrent servers. The only real cost is in memory footprint, but even that is often misunderstood, as covered beautifully in this recent ISMM talk (that I would recommend to anyone interested in memory management of any kind): https://youtu.be/mLNFVNXbw7I. The key is that moving-tracing collectors can turn available RAM into CPU cycles, and some memory management techniques under-utilise available RAM.

	▲	xmcqdpt2 7 minutes ago \| parent \| next [-]
		> The only real cost is in memory footprint There are also load and store barriers which add work when accessing objects from the heap. In many cases, adding work in the parallel path is good if it allows you to avoid single-threaded sections, but not in all cases. Single-threaded programs with a lot of reads can be pretty significantly impacted by barriers, https://rodrigo-bruno.github.io/mentoring/77998-Carlos-Gonca... The Parallel GC is still useful sometimes!
	▲	drob518 an hour ago \| parent \| prev [-]
		So, the guys at Azul actually had this sort of business plan back in 2005, but they found that it was unsustainable and turned their attention to the software side, where they have done great work. I remember having a discussion with someone about Java processors and my common was just “Lisp machines.” It’s very difficult to outperform code running on commodity processor architectures. That train is so big and moving so fast, you really have to pick your niche (e.g. GPUs) to deliver something that outperforms it. Too much investment ($$$ and brainpower) flowing that direction. Even if you’re successful for one generation, you need to grow sales and have multiple designs in the pipeline at once. It’s nearly impossible. That said, I do see opportunities to add “assistance hardware” to commodity architectures. Given the massive shift to managed runtimes, all of which use GC, over the last couple decades, it’s shocking to me that nobody has added a “store barrier” instruction or something like that. You don’t need to process Java in hardware or even do full GC in hardware, but there are little helps you could give that would make a big difference, similar to what was done with “multimedia” and crypto instructions in x86 originally.

▲

maxdamantus 37 minutes ago | parent | prev [-]

> It's true that any modern processor has a "compilation step" into microcode anyway, so in an abstract sense, that might as well be some kind of bytecode.

This.

My prediction is that eventually a lot of software will be written in such a way that it runs in "kernel mode" using a memory-safe VM to avoid context switches, so reading/writing to pipes, and accessing pages corresponding to files reduces down to function calls, which easily happen billions of times per second, as opposed to "system calls" or page faults which only happen 10 or 20 million times per second due to context switching.

This is basically what eBPF is used for today. I don't know if it will expand to be the VM that I'm predicting, or if kernel WASM [1] or something else will take over.

From there, it seems logical that CPU manufacturers would provide compilers ("CPU drivers"?) that turn bytecode into "microcode" or whatever the CPU circuitry expects to be in the CPU during execution, skipping the ISA. This compilation could be done in the form of JIT, though it could also be done AOT, either during installation (I believe ART in Android already does something similar [0], though it currently emits standard ISA code such as aarch64) or at the start of execution when it finds that there's no compilation cache entry for the bytecode blob (the cache could be in memory or on disk, managed by the OS).

Doing some of the compilation to "microcode" in regular software before execution rather than using special CPU code during execution should allow for more advanced optimisations. If there are patterns where this is not the case (eg, where branch prediction depends on runtime feedback), the compilation output can still emit something analogous to what the ISAs represent today. The other advantage is of course that CPU manufacturers are more free to perform hardware-specific optimisations, because the compiler isn't targeting a common ISA.

Anyway, these are my crazy predictions.

[0] https://source.android.com/docs/core/runtime/jit-compiler

[1] https://github.com/wasmerio/kernel-wasm (outdated)