You could start by staying on the CPU side and using AVX, Larrabee-style, which is easier to debug.
Mesh shaders or GPU compute would be the next step.
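As a rough idea of what the CPU-side starting point might look like, here is a minimal sketch that assumes the work is triangle coverage testing, which may not match your case. It evaluates three edge functions for 8 adjacent pixels at once with AVX (Larrabee used 16-wide vectors; AVX gives 8 floats). The names `Edge`, `coverage8` and `coveredInRow` are made up for illustration, and there is a scalar fallback so it still builds without `-mavx`.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Edge function E(x, y) = a*x + b*y + c; a pixel counts as inside when E >= 0.
// (Hypothetical type, not taken from any library.)
struct Edge { float a, b, c; };

// Tests 8 horizontally adjacent pixels starting at (x0, y) against three edges.
// Returns an 8-bit coverage mask where bit i corresponds to pixel x0 + i.
inline std::uint8_t coverage8(const Edge e[3], float x0, float y) {
#if defined(__AVX__)
    // Lane i holds the x coordinate x0 + i.
    const __m256 xs = _mm256_add_ps(_mm256_set1_ps(x0),
                                    _mm256_setr_ps(0, 1, 2, 3, 4, 5, 6, 7));
    const __m256 ys = _mm256_set1_ps(y);
    int mask = 0xFF;
    for (int i = 0; i < 3; ++i) {
        const __m256 v = _mm256_add_ps(
            _mm256_add_ps(_mm256_mul_ps(_mm256_set1_ps(e[i].a), xs),
                          _mm256_mul_ps(_mm256_set1_ps(e[i].b), ys)),
            _mm256_set1_ps(e[i].c));
        // One bit per lane: set when that pixel is on the inside of edge i.
        mask &= _mm256_movemask_ps(
            _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_GE_OQ));
    }
    return static_cast<std::uint8_t>(mask);
#else
    // Scalar fallback with the same result, used when AVX is not enabled.
    int mask = 0;
    for (int lane = 0; lane < 8; ++lane) {
        const float x = x0 + static_cast<float>(lane);
        bool inside = true;
        for (int i = 0; i < 3; ++i)
            inside = inside && (e[i].a * x + e[i].b * y + e[i].c >= 0.0f);
        if (inside) mask |= 1 << lane;
    }
    return static_cast<std::uint8_t>(mask);
#endif
}

// Walks one row in 8-pixel chunks and counts the covered pixels.
inline int coveredInRow(const Edge e[3], int width, float y) {
    assert(width % 8 == 0);  // sketch only: no handling of partial chunks
    int count = 0;
    for (int x = 0; x < width; x += 8) {
        std::uint8_t m = coverage8(e, static_cast<float>(x), y);
        while (m != 0) {  // clear the lowest set bit until none remain
            m = static_cast<std::uint8_t>(m & (m - 1));
            ++count;
        }
    }
    return count;
}
```

Since this is plain C++, you can step through it in a normal debugger and compare the mask against the scalar path, which is the main advantage over debugging the same logic in a shader.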