| ▲ | bee_rider 6 hours ago | |
The compare against “DeepSpeed ZeRO-3” apparently. | ||
| ▲ | jazzpush2 5 hours ago | parent [-] | |
FWIW Zero-3 refers to a common strategy for sharding model components across GPUs (commonly called FSDP-2, Full Sharded Data Parallel). The "3" is the level of sharding (how much stuff to distribute across GPUs, e.g. just weights, versus optimizer state as well, etc.) | ||