layoric 5 hours ago:
I'm quite surprised the A100 is not much better, since I believe the power levels for the Ampere cards are a lot lower. Does this mean that even a model that fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
formerly_proven 3 hours ago (parent):
GPU servers have always had poor reliability compared to a normal server (and sticking eight GPUs on a baseboard complicates things further). As I understand it (not my domain), this, namely the lack of widespread checkpointing and MPI fault-tolerance support, is one of the motivating factors for why ML toolkits eschew MPI (besides accelerator-to-accelerator communication being an afterthought).