WithinReason 4 days ago

Since newer CPUs have heterogeneous cores (high-performance + low-power), I'm wondering if it makes sense to drop legacy instructions from the low-power cores, since legacy code can still run on the other cores. Then, e.g., an OS compiled the right way could take advantage of the extra efficiency without the CPU losing backwards compatibility.

toast0 4 days ago | parent | next [-]

Like o11c says, that's setting everyone up for a bad time. If the heterogeneous cores are similar but don't all support all the instructions, it's too hard to use. You can implement legacy instructions in a space-optimized way, though, and there's no reason not to do that for the high-performance cores too --- if they're legacy instructions, one expects them not to run often, and perf doesn't matter that much.

Intel dropped their x86-S proposal, but I guess something like that could work for low-power cores. If you provide a way for a 64-bit OS to start application processors directly in 64-bit mode, you could set up the low-power cores so that they can only run in 64-bit mode. I'd be surprised if the juice is worth the squeeze, but it'd be reasonable --- it's pretty rare to be outside 64-bit mode, and systems that do run outside 64-bit mode probably don't need all the cores on a modern processor. If you're running a 64-bit OS, it knows which processes are running in 32-bit mode and could avoid scheduling them on the reduced-functionality cores. If you're running a 32-bit OS, somehow or other the OS needs to not use those cores... either the ACPI tables are different and they don't show up for 32-bit, or init fails and the OS moves on, or there is a firmware flag to hide them that must be set before running a 32-bit OS.

jdsully 4 days ago | parent | next [-]

I don't really understand why the OS can't just trap the invalid-instruction exception and migrate the process to a P-core, e.g. for AVX-512 and the like. For very old and rare instructions it can emulate them. We used to do that with FPU instructions on CPUs without an FPU back in the '80s and '90s.
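
In userspace, the trap-and-migrate half could look roughly like this minimal Linux sketch, assuming (hypothetically) that CPUs 0-7 are the P-cores with the full instruction set:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>

    #define N_PCORES 8  /* hypothetical: CPUs 0-7 are the P-cores */

    static void on_sigill(int sig) {
        (void)sig;
        cpu_set_t pcores;
        CPU_ZERO(&pcores);
        for (int i = 0; i < N_PCORES; i++)
            CPU_SET(i, &pcores);
        /* Pin this thread to the P-cores. Returning from the handler
           re-executes the faulting instruction, which now succeeds. */
        sched_setaffinity(0, sizeof pcores, &pcores);
    }

    int main(void) {
        struct sigaction sa = { .sa_handler = on_sigill };
        sigaction(SIGILL, &sa, 0);
        /* ... code that may use instructions only the P-cores have ... */
        return 0;
    }

The obvious catch: once pinned, the thread never flows back to the small cores.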

toast0 3 days ago | parent | next [-]

It's not impossible, but it'd be a pain in the butt. If you only use AVX-512 occasionally, no big deal (but also no big deal to just not use it). But if you use it a lot, all of a sudden your core count shrinks; you might rather run on all cores with AVX2. You might even prefer to run AVX-512 on the cores that can and AVX2 on those that can't... but then you need to be able to gather information on which cores support what, and pin your threads so they don't move. If you pull in a library, who knows what it does... lots of libraries assume they can call cpuid at load time and adjust, but now you need that per thread.

That seems like a lot of change for the OS, applications, etc. And if you run commercial applications, maybe they don't get updated unless you pay for an upgrade, and that's a pain, etc.
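
The per-thread probing would look something like this sketch: pin the calling thread to one CPU, then execute CPUID on it (only meaningful on a hypothetical part where features actually differ per core):

    #define _GNU_SOURCE
    #include <cpuid.h>
    #include <sched.h>
    #include <stdbool.h>

    /* Pin the calling thread to `cpu`, then ask that core directly.
       AVX-512F is CPUID leaf 7, subleaf 0, EBX bit 16. */
    static bool core_has_avx512f(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0)
            return false;
        unsigned a, b, c, d;
        if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
            return false;
        return b & (1u << 16);
    }

A library that cached cpuid once at load time, from whatever core it happened to start on, would get the wrong answer for threads running elsewhere.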

saagarjha 3 days ago | parent | prev | next [-]

It's slow and annoying. What would cpuid report? If it says "yes, I do AVX-512", then any old code might try to use it and get stuck on the P-cores forever, even if it was only using it sparingly. If it says no, then the software might never use it, so what was the benefit?

ryukoposting 3 days ago | parent | prev [-]

We still do that with some microcontrollers! https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html#index-mf...

MBCook 3 days ago | parent | prev [-]

Haven’t there been bugs found in older(?) games because of this?

They would run on the performance core, detect the CPU features, and turn on some AVX path.

Then the OS would reschedule them onto one of the efficiency cores, they would try to run instructions that don’t exist there, and it would crash?

devnullbrain 4 days ago | parent | prev | next [-]

Interesting but it would be pretty rough to implement. If you take a binary now and run it on a core without the correct instructions, it will SIGILL and probably crash. So you have these options:

Create a new compilation target

- This is how RISC-V deals with optional extensions. But you'll probably just end up running a lot of existing x86 code exclusively on the performance cores, at a net loss.

Emulate

- This already happens for some instructions but, as above, could quickly negate the benefits.

Ask for permission

- This is what AVX code does now: the onus is on the programmer to check whether the optional instructions can be used. But you can't drop many instructions this way and expect anybody to actually target the reduced set.

Ask for forgiveness

- Run the code anyway and catch illegal instruction exceptions/signals, then move to a performance core. This would take some deep kernel surgery for support. If this happens remotely often it will stall everything and make your system hate you.

The last one raises the question: which instructions are we considering 'legacy'? You won't get far in an x86 binary before running into an instruction operating on memory that, in a RISC ISA, would mean first a load instruction, then the operation, then a store. Surely we can't drop those.

kccqzy 3 days ago | parent | next [-]

The "ask for permission" approach doesn't work because programs don't expect the capability of a CPU to change. If a program checked a minute ago that AVX512 is available, it certainly expects AVX512 to be continually available for the lifetime of the process. That means chaos if the OS is moving processes between performance and efficiency cores.

wtallis 4 days ago | parent | prev [-]

IIRC, there were several smartphone SoCs that dropped 32-bit ARM support from most but not all of their CPU cores. That was straightforward to handle because the OS knows which instruction set a binary wants to use. Doing anything more fine-grained would be a nightmare, as Intel found out with Alder Lake.

o11c 4 days ago | parent | prev | next [-]

We've seen CPU-capability differences by accident a few times, and it's always a chaotic mess leading to SIGILL.

The kernel would need to have a scheduler that knows it can't use those cores for certain tasks. Think about how hard you would have to work to even identify such a task ...

mmis1000 4 days ago | parent [-]

Current Windows and Linux executable formats don't even list the instructions used, though. And even if they were listed, what about dynamic libraries? A program may decide to load a library at any time it wishes, and the OS has no way of knowing which instructions will be used this time.

MBCook 3 days ago | parent [-]

You couldn’t even scan the executable if you wanted to, because lots of code checks what the CPU is capable of and chooses the most efficient path based on which instructions it’s allowed to use.

So a scan may turn up instructions that would never actually execute, and until you’ve run the program (halting problem) you can’t know which.

kccqzy 3 days ago | parent | prev | next [-]

This is the flip side of Intel trying to drop AVX512 on their E cores in the 12th generation processors. It didn't work. It requires the OS to know which processes need AVX512 before they get run. And processes themselves use cpuid to determine the capability of processors and they don't expect it to change. So you basically must determine in advance which processes can be run on E cores and never migrate between cores.

kragen 3 days ago | parent [-]

What if the kernel handled unimplemented instruction faults by migrating the process to a core that does implement the instruction and restarting the faulting instruction?
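
Sketched as kernel-side pseudocode (all names hypothetical, not real Linux internals), the fault path would be something like:

    /* Hypothetical task struct and helpers. */
    struct task { int needs_full_isa; };

    int  cpu_has_full_isa(int cpu);   /* does this core implement everything? */
    int  current_cpu(void);
    int  pick_big_core(void);
    void migrate_task(struct task *t, int cpu);
    void send_sigill(struct task *t);

    void handle_undef_instruction(struct task *t) {
        if (!cpu_has_full_isa(current_cpu())) {
            t->needs_full_isa = 1;            /* hint for the scheduler */
            migrate_task(t, pick_big_core()); /* requeue on a capable core */
            return;                           /* PC unchanged, so the
                                                 instruction retries there */
        }
        send_sigill(t);                       /* genuinely bad instruction */
    }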

MBCook 3 days ago | parent | next [-]

What if that core isn’t free? What if it’s not going to be free for a long time?

That could be a recipe for random long stalls for some processes.

kragen 3 days ago | parent | next [-]

I don't think avoiding such pathological cases would be that hard. See https://news.ycombinator.com/item?id=45178286

mrheosuper 3 days ago | parent | prev [-]

> What if that core isn’t free

Just context switch it, like how you run 2 programs on a single-core CPU.

kragen 3 days ago | parent [-]

It's correct to point out that you could end up in a situation where your "big" cores are all heavily loaded and your "small" cores with fewer instructions are all idle. That's unavoidable if your whole workload needs the AVX512 instructions or whatever, but it could be catastrophic if your OS just mistakenly thinks it does. That doesn't seem unavoidable, though; see my comments further down the thread.

Rohansi 3 days ago | parent | prev [-]

Sounds great for performance.

kragen 3 days ago | parent [-]

Would this be more or less costly than a page fault? It seems like it would be easy to arrange for it to happen very rarely unless none of your cores support all the instructions.

Rohansi 3 days ago | parent [-]

Most likely similar. What would the correct behavior be for the scheduler to avoid hitting it in the future? Flag the process as needing X instruction set extension so they only run on the high performance cores?

kragen 3 days ago | parent [-]

Yeah, although maybe the flag should decay after a while? You want to avoid either spending significant percentages of your time trying to run processes that make no progress because they need unavailable instructions or delaying processes significantly because they are waiting for resources they no longer need.

This sounds a little bit subtle, in the way most operating system policy problems are subtle, but far from intractable. Most of the time all your processes are making progress and either all your cores are busy or you don't have enough runnable processes to keep them busy. In the occasional case where this is not true, you can try optimistically deflagging processes that have made some progress since they were last flagged. Worst case, you context switch an idle core to a process that immediately faults. If your load average is 256 you could maybe do this 256 times in a row at most, at a cost of around a microsecond each? Maybe you have wasted a millisecond on a core that would have been idle?

And you probably want the flag lifetime to be on the order of a second normally, so you're not forced to make suboptimal scheduling decisions by outdated flags in order to avoid that microsecond of wasted context switching.
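
As a scheduler-side sketch of that decaying flag (hypothetical names and fields):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-task bookkeeping for the decaying flag. */
    struct task { bool needs_full_isa; uint64_t last_isa_fault_ns; };

    #define ISA_FLAG_LIFETIME_NS 1000000000ull   /* ~1 second, as above */

    /* Fault path: remember that this task just needed the full ISA. */
    void note_isa_fault(struct task *t, uint64_t now_ns) {
        t->needs_full_isa = true;
        t->last_isa_fault_ns = now_ns;
    }

    /* Scheduler check: honor the flag only while it is fresh, so a task
       that has stopped using the big instructions drifts back to the
       small cores. */
    bool must_schedule_on_big_core(const struct task *t, uint64_t now_ns) {
        return t->needs_full_isa &&
               (now_ns - t->last_isa_fault_ns) < ISA_FLAG_LIFETIME_NS;
    }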

Findecanor 4 days ago | parent | prev | next [-]

I think it is not really the execution units for simple instructions that take up much chip area on application-class CPUs these days, but everything around them.

I think support in the OS/runtime environment* would be more interesting for chips where some cores have larger execution units, such as vector and matmul units. Especially for embedded / low-power systems.

Maybe x87/MMX could be dropped though.

* BTW, if you want to find research papers on the topic, a good search term is "partial-ISA migration".

izacus 3 days ago | parent | prev [-]

This was a terrible idea when we tried it on ARM, and it'll remain a terrible idea on AMD64 as well.

There are just too many footguns for an OS running on such a SoC for it to be worth it.