namibj 2 days ago
Given that Nvidia Maxwell / Pascal (mostly the GTX 900 / GTX 1000 series) had a bit per ISA read-operand slot saying whether to cache that register-file access for reuse by a subsequent instruction, and that ARM and RISC-V have Thumb/compressed encodings, I'd expect frontend support for blocks of pre-scheduled code (loadable into something like AMD Zen 3's μop cache, in chunks sizable enough to allow sufficient loop unrolling for efficiency) to be practical. Whether the market segment that could use that much special sauce effectively enough to justify the software engineering would also be large enough to warrant the hardware design and bespoke silicon such a project entails is another question.

I'd probably favor spending the silicon on scatter/gather, or maybe on some way to span a large gap between calculating an address and using the value fetched from it, so prefetching wouldn't need to re-calculate the address (expensive) or block off a GPR holding it (a precious resource). That could also let load atomicity happen anywhere between providing the address (the prefetch request) and completing the load (delivering the destination data register).

Prior art: Nvidia's async memcpy (iirc cp.async, which came with Ampere/A100 and was extended by Hopper's TMA) directly from global to "shared" memory (the user-managed partition of the L1D$), bypassing the register file.
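For the curious, that prior art is exposed in CUDA via `cooperative_groups::memcpy_async` (CUDA 11+), which lowers to `cp.async` on sm_80+ so the data never passes through the register file. A minimal sketch (kernel and array names are illustrative, not from any particular codebase):

```cuda
// Hedged sketch: async copy from global memory straight into shared memory.
// On Ampere/Hopper (sm_80+) this compiles to cp.async / LDGSTS, bypassing
// registers; on older parts it silently falls back to a staged copy.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void process_tile(const float* __restrict__ src, int n) {
    extern __shared__ float tile[];          // user-managed slice of the L1D$
    cg::thread_block block = cg::this_thread_block();

    // Issue the copy; it proceeds in the background, so independent work
    // placed between memcpy_async and wait hides the global-memory latency.
    cg::memcpy_async(block, tile, src, sizeof(float) * n);
    cg::wait(block);                         // block until the tile has landed

    // ... compute on tile[] here ...
}
```

The same mechanism is available in finer-grained form through `cuda::pipeline`, which lets you keep several in-flight copies for multi-stage double buffering.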