namibj | 5 days ago
Do TPUs yet allow a variable array dimension at a somewhat inner nesting level of the loop structure? That is, where you load expensive (bandwidth-heavy) data in from HBM, process a variable-length array with it, then stow away/accumulate the result into a fixed-size vector. Last I looked, they required the host to synthesize a suitable instruction stream for this on the fly, with no existing tooling to do so efficiently. An example where this would be relevant is the LLM inference prefill stage with an (activated) MoE expert count on the order of, or a small integer factor below, the prompt length, where you'd want to load only the needed experts, and load each one at most once per layer.
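To make the loop structure concrete, here is a host-side NumPy sketch of the access pattern being described (not TPU code; all names here are hypothetical, and it assumes simple top-1 token-to-expert routing). The outer loop runs over the activated experts only, each expert's weights are fetched once (the bandwidth-heavy HBM load), and the inner, variable-length work is the token group routed to that expert:

```python
import numpy as np

def moe_prefill_layer(tokens, experts, assignment):
    """tokens: (T, d) prompt activations.
    experts: dict mapping expert id -> (d, d) weight matrix.
    assignment: (T,) expert id chosen for each token (top-1 routing).
    """
    out = np.zeros_like(tokens)
    # Loop over *activated* experts only, so each expert's weights
    # are loaded at most once per layer.
    for e in np.unique(assignment):
        w = experts[e]                        # the expensive HBM load
        idx = np.nonzero(assignment == e)[0]  # variable-length token group
        out[idx] = tokens[idx] @ w            # inner variable-length work
    return out
```

The point of the question is that the size of `idx` varies per expert and is only known at runtime, which is the "variable array dimension at an inner nesting level" that a statically scheduled instruction stream struggles to express.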