reissbaker | 7 days ago
It was natively trained in FP4, probably both to reduce VRAM usage at inference time (it fits on a single H100) and to allow better utilization of B200s, which are especially fast at FP4.
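For context, a minimal sketch of what FP4 quantization looks like (this is illustrative, not the model's actual training code): Blackwell-class hardware uses the E2M1 format, which has 1 sign bit, 2 exponent bits, and 1 mantissa bit, giving positive magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. Block-scaled variants pair each small block of values with a shared scale factor. The `quantize_block` helper and the example weights below are hypothetical.

```python
# Illustrative sketch only: simulated FP4 (E2M1) round-to-nearest with a
# per-block scale. Not the model's actual training code.

# The 8 positive magnitudes representable in E2M1 (plus sign bit = 16 codes).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x, scale):
    """Round x to the nearest representable FP4 value after scaling."""
    scaled = x / scale
    sign = -1.0 if scaled < 0 else 1.0
    mag = min(abs(scaled), 6.0)  # clamp to FP4's max magnitude
    nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - mag))
    return sign * nearest * scale

def quantize_block(values):
    """Per-block scaling: choose the scale so the largest element in the
    block maps onto FP4's max magnitude (6)."""
    scale = max(abs(v) for v in values) / 6.0 or 1.0  # avoid scale of 0
    return [quantize_fp4(v, scale) for v in values]

# Hypothetical example weights; small values collapse to 0 at this precision.
print(quantize_block([0.03, -0.41, 0.27, 0.9]))
```

The coarse 8-level grid is why per-block scaling (and careful training recipes) matter so much at 4-bit precision.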
irthomasthomas | 7 days ago | parent
Interesting, thanks. I didn't know you could even train at FP4 on H100s.