bee_rider 3 days ago
The static schedule part seems really interesting. They note that it only works for some instructions, but I wonder if it would be possible to have a compiler report "this section of the code can be statically scheduled." In that case, could this have a benefit for real-time operation? Or maybe some specialized partially real-time application: mark a segment of the program as desiring static scheduling, and don't allow memory loads, etc., inside it.
clamchowder 3 days ago | parent
(author here) They try for all instructions; it's just a prediction with replay, because some instructions, like memory loads, inevitably have variable latency. It's not like Nvidia, where fixed-latency instructions are statically scheduled and memory loads/other variable-latency operations are handled dynamically via scoreboarding.
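To make the predict-and-replay idea concrete, here's a toy software model (entirely illustrative; the names, latencies, and mechanism details are invented, not taken from the actual hardware). Every instruction issues as soon as the *predicted* schedule says its operand is ready; if the producer was a load that actually missed, the consumer replays at the real ready time:

```python
def run(instrs):
    """Toy predict-and-replay scheduler (illustrative only).

    instrs: list of (name, dep, predicted_latency, actual_latency),
    where dep is the index of the producing instruction or None.
    Returns (actual_completion_cycles_by_index, replay_count).
    """
    predicted_done = {}  # cycle the schedule *assumes* a result is ready
    actual_done = {}     # cycle the result is really ready
    replays = 0
    for i, (name, dep, predicted, actual) in enumerate(instrs):
        # Statically scheduled issue: wait only for the predicted readiness.
        issue = predicted_done.get(dep, 0) if dep is not None else 0
        if dep is not None and actual_done[dep] > issue:
            # The operand wasn't actually there yet (e.g. a cache miss):
            # replay the instruction at the real ready time.
            replays += 1
            issue = actual_done[dep]
        predicted_done[i] = issue + predicted
        actual_done[i] = issue + actual
    return actual_done, replays
```

With a load predicted at 4 cycles (assumed L1 hit) that actually takes 20, a dependent add replays once and finishes at cycle 21; if the load hits as predicted, nothing replays.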
IshKebab 3 days ago | parent
I don't think that would help: the set of instructions that have dynamic latencies is basically fixed. Anything memory-related (loads, stores, cache management, fences, etc.) and complex maths (division, sqrt, transcendental functions, etc.). So you already know which code can be statically scheduled just from the instructions.
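In other words, "is this block statically schedulable?" is decidable by opcode alone. A minimal sketch of that check (the opcode names are RISC-V-flavored, but the set itself is a made-up example, not any real core's list):

```python
# Hypothetical set of variable-latency opcodes; a real one would be
# specific to a given microarchitecture.
VARIABLE_LATENCY = {
    "lw", "sw", "fence",    # anything memory-related
    "div", "rem", "fsqrt",  # complex maths
}

def statically_schedulable(block):
    """block: list of tuples (opcode, *operands).

    True iff every instruction has a fixed, known latency,
    so the whole block can be scheduled at compile time.
    """
    return all(op not in VARIABLE_LATENCY for op, *_ in block)
```

So a compiler marking sections would just be restating information the decoder already has from the opcodes.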
namibj 2 days ago | parent
Given that Nvidia Maxwell/Pascal (mostly the GTX 900/GTX 1000 series) had a bit for each ISA read-operand slot that said whether to cache that register file access for reuse by a subsequent instruction, and that ARM and RISC-V have Thumb/compressed encodings, I'd expect frontend support for blocks of pre-scheduled code (which could be loaded into something like AMD Zen 3's μOP cache as a sizable chunk, to allow sufficient loop unrolling for efficiency) to be practical. Whether the market segment that could utilize that much special sauce effectively enough to be worth the software engineering would be large enough to warrant the hardware design and bespoke silicon such a project entails is another question.

I'd probably favor spending the silicon on scatter/gather, or maybe on some way to span a large gap between calculating an address and using the value fetched from that address, so prefetching wouldn't need to re-calculate the address (expensive) or tie up a GPR with the address (a precious resource). That could also let load atomicity happen anytime between the address provision (/prefetch request) and load completion (destination data register provision). Prior art: recent (IIRC it came with the H100) Nvidia async memcpy directly from global to "shared" (user-managed partition of the L1D$) memory, bypassing the register file.
usrusr 3 days ago | parent
What would the CPU do with the parts not marked as "can be statically scheduled"? I read it as: they try it anyway and may get some stalling ("replay") if the schedule was over-optimistic. I'm not sure how a compiler marking sections would help.