ashvardanian | 14 days ago
CuTile, in many ways, feels like a successor to OpenAI's Triton... And not only are we getting tile/block-level primitives and TileIR, but also a proper SIMT programming model in CuPy, which I don't think enough people noticed even at this year's GTC. Very cool stuff!

That said, there were almost no announcements or talks related to CPUs, despite the Grace CPUs having been announced quite some time ago. It doesn't feel like we're going to see generalizable abstractions that work seamlessly across Nvidia CPUs and GPUs anytime soon. For someone working on parallel algorithms daily, this is an issue: debugging with Nsight and CUDA-GDB still isn't the same as raw GDB, and it's much easier to design algorithms on CPUs first and then port them to GPUs.

Of all the teams in the compiler space, Modular seems to be among the few that aren't entirely consumed by the LLM craze, actively building abstractions and languages spanning multiple platforms. Given the landscape, that's increasingly valuable. I'd love to see more people experimenting with Mojo — perhaps it can finally bridge the CPU-GPU gap that many of us face daily!
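(For readers unfamiliar with the tile/block model being referenced: below is a minimal Triton vector-add sketch, since cuTile's public API isn't shown in this thread. The kernel name and block size are illustrative; the point is that you program whole tiles and the compiler maps them onto threads.)

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance owns one tile of BLOCK_SIZE elements;
        # the per-thread mapping is the compiler's job, not the programmer's.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the ragged final tile
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)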
jms55 | 14 days ago
> And not only are we getting tile/block-level primitives and TileIR

As someone working on graphics programming, I'm always frustrated to see so much investment in GPU APIs _for AI_, but almost nothing for GPU APIs for rendering. Block-level primitives would be great for graphics! PyTorch-like JIT kernels programmed from the CPU would be great for graphics (see the sketch below)! ...But there's no money to be made, so no one works on it. And for some reason, GPU APIs for AI are treated like an entirely separate thing, rather than having one API used for both AI and rendering.
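(To make the "JIT kernels programmed from the CPU" idea concrete: on the compute side this workflow already exists via CuPy's RawKernel, which compiles CUDA C source at runtime from host Python. The kernel below is only an illustration; the graphics-API equivalent is what's missing.)

    import cupy as cp

    # CUDA C source string, JIT-compiled with NVRTC the first time it's launched.
    saxpy = cp.RawKernel(r'''
    extern "C" __global__
    void saxpy(const float a, const float* x, const float* y, float* out, int n) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) out[i] = a * x[i] + y[i];
    }
    ''', 'saxpy')

    n = 1 << 20
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)
    out = cp.empty_like(x)
    threads = 256
    blocks = (n + threads - 1) // threads
    # Launch signature: kernel(grid, block, args); scalars passed as numpy types.
    saxpy((blocks,), (threads,), (cp.float32(2.0), x, y, out, cp.int32(n)))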
saagarjha | 14 days ago
I mean, it doesn’t really make sense to unify CPU and GPU programming. CPUs and GPUs have very different performance characteristics, and you design for them differently depending on what they let you do. There’s obviously common ground where you can design mostly-good interfaces that do things acceptably (I’ll argue PyTorch is that), but it’s not really reasonable to write an algorithm that is needlessly hobbled on CPUs because it assumes synchronizing between execution contexts is prohibitively expensive, a GPU constraint that doesn't hold between CPU threads.
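(A small sketch of that common ground, since the comment names PyTorch only in passing: the same high-level call runs on either device, and the backend picks a device-appropriate parallelization and synchronization strategy underneath.)

    import torch

    def row_softmax(x: torch.Tensor) -> torch.Tensor:
        # Identical code path for CPU and GPU; the device-specific
        # kernel choice happens below this API.
        return torch.softmax(x, dim=-1)

    x = torch.randn(1024, 512)
    y_cpu = row_softmax(x)
    if torch.cuda.is_available():
        y_gpu = row_softmax(x.to("cuda"))
        assert torch.allclose(y_cpu, y_gpu.cpu(), atol=1e-5)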