notrealyme123 3 hours ago
> AI training is so fault tolerant already that this was never an issue.

Such nonsense.
NitpickLawyer 2 hours ago
Between floating-point nondeterminism and rounding error, async gradient updates, CUDA nondeterminism, random network issues, random node failures, and so on, a bit flip is the last of your concerns. SGD is very robust to noise; that's why it works at all with such noisy data, pipelines, and compute. Come on! This thread has people finding the weirdest hills to die on while being completely off base.
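
To make the point concrete, here's a minimal toy sketch (mine, not the commenter's; every constant and function name is an arbitrary choice for the demo): SGD on a tiny linear regression keeps converging even when a random bit of a gradient value is flipped every 50 steps, as long as there's basic gradient clipping and a NaN/inf guard, which real training stacks have anyway.

    # Toy demo: SGD tolerating occasional bit-flipped gradients.
    # All hyperparameters here are arbitrary choices for illustration.
    import struct
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: y = 3x + 2 plus observation noise.
    X = rng.normal(size=(1000, 1))
    y = 3.0 * X[:, 0] + 2.0 + 0.1 * rng.normal(size=1000)

    w, b = 0.0, 0.0
    lr = 0.05

    def flip_random_bit(x: float) -> float:
        """Flip one random bit in the float64 representation of x."""
        (bits,) = struct.unpack("<Q", struct.pack("<d", x))
        bits ^= 1 << int(rng.integers(0, 64))
        (flipped,) = struct.unpack("<d", struct.pack("<Q", bits))
        return flipped

    for step in range(2000):
        i = rng.integers(0, len(X))
        pred = w * X[i, 0] + b
        err = pred - y[i]
        gw, gb = err * X[i, 0], err

        # Corrupt the weight gradient every 50th step to simulate a bit flip.
        if step % 50 == 0:
            gw = flip_random_bit(gw)

        # A flipped exponent bit can yield a huge value or NaN/inf;
        # skipping non-finite updates and clipping absorbs the damage,
        # the same way gradient clipping does in real trainers.
        if np.isfinite(gw):
            w -= lr * np.clip(gw, -10, 10)
        b -= lr * np.clip(gb, -10, 10)

    print(f"recovered w={w:.2f} (true 3.0), b={b:.2f} (true 2.0)")

Run it and the recovered parameters land near (3, 2) despite 40 corrupted gradients along the way: each bad step nudges w by at most lr * 10 = 0.5, and the subsequent clean steps pull it right back. That's the "SGD is robust to noise" claim in miniature; a bit flip in a gradient is just one more noise source among the ones listed above.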