hansvm | 2 days ago
Yes, as with all of these models. The only architectures that struggle with that property are ones with a strongly "distributed" aspect to their computations, where ensuring you're actually performing equivalent computations takes much more work than programmers typically expect.

When executing any of them on GPUs or other accelerators (dLLMs or otherwise), you also have to stay cognizant of chip-specific approximations and deviations from the standard. Those can be actual issues on the chip (a famous one comes to mind where some f16 or f32 computation passed through an intermediate, undocumented f8), or issues with how your software compiles to a chip: e.g., (a+b+c)+(x+y+z) is not the same as (a+b)+(c+x)+(y+z) with floats, so you have much less freedom to lay out your computations in a way that fits the chip nicely.
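For anyone who hasn't run into this, here's a minimal Python sketch of the regrouping problem (the specific values are just chosen to make the rounding visible; any mix of large and small magnitudes will do):

```python
# Float addition is not associative: regrouping the same
# operands can change the result, because each intermediate
# sum is rounded to the nearest representable double.
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c   # each 1.0 is rounded away against 1e16
right = a + (b + c)  # 2.0 is large enough to survive rounding

print(left == right)  # False
print(left, right)    # 1e+16 1.0000000000000002e+16
```

So a compiler or scheduler that reassociates sums to fit a chip's tree-reduction units is, strictly speaking, computing a different function, which is exactly why bit-for-bit reproducibility across accelerators is so hard.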