Remix clone Hacker News


	▲	camel-cdr 5 days ago
		I'm not aware of any GPU that implements this. Even the interleaved execution introduced in Volta can still only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2]. Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once. `if (threadIdx.x < 4) { A; B; } else { X; Y; } Z;` The diagram shows how this executes in the following order: Volta: `->\| ->X ->Y ->Z\|-> ->\|->A ->B ->Z \|->` pre Volta: `->\| ->X->Y\|->Z ->\|->A->B \|->Z` The SIMD equivalent of pre Volta is: `vslt mask, vid, 4; vopA ..., mask; vopB ..., mask; vopX ..., ~mask; vopY ..., ~mask; vopZ ...` The Volta model is: `vslt mask, vid, 4; vopA ..., mask; vopX ..., ~mask; vopB ..., mask; vopY ..., ~mask; vopZ ...` [1] https://chipsandcheese.com/i/138977322/shader-execution-reor... [2] https://stackoverflow.com/questions/70987051/independent-thr...
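		To make the two orderings concrete, here is a minimal Python sketch of the masked-SIMD view described above. It is a toy model, not how any real GPU schedules work: the 8-lane width, the `run` helper, and the op names are all assumptions for illustration. Each schedule entry is one issued instruction that only affects lanes whose mask bit is set.

```python
# Toy model (assumption): one "warp" of 8 lanes, one instruction issued per step.
LANES = 8

def run(schedule):
    """Issue (op_name, mask) pairs in order; return the ops each lane executed."""
    trace = [[] for _ in range(LANES)]
    for name, mask in schedule:
        for lane in range(LANES):
            if mask[lane]:
                trace[lane].append(name)
    return trace

mask = [lane < 4 for lane in range(LANES)]  # vslt mask, vid, 4
inv = [not m for m in mask]                 # ~mask
all_on = [True] * LANES                     # unmasked op

# Pre-Volta ordering: finish one side of the branch, then the other.
pre_volta = [("A", mask), ("B", mask), ("X", inv), ("Y", inv), ("Z", all_on)]
# Volta-style ordering: the two sides are interleaved.
volta = [("A", mask), ("X", inv), ("B", mask), ("Y", inv), ("Z", all_on)]

pre_trace = run(pre_volta)
volta_trace = run(volta)

# Same per-lane work and the same number of issued instructions either way.
print(pre_trace == volta_trace, len(pre_volta), len(volta))
```

		In this model each lane sees the same sequence of ops under both orderings, and both take five issue slots, which matches the point that interleaving changes the order of predicated instructions rather than executing two different operations at once.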
	▲	namibj 5 days ago \| parent [-]
		IIUC, Volta brought the ability to run a tail-call state machine (let's presume identically expensive states and a state count less than threads-per-warp) at an average goodput of more than one thread actually active. Before, it would lose all parallelism, as it couldn't handle different threads having truly different/separate control flow, emulating dumb-mode via predicated execution/lane-masking.