| ▲ | ACCount37 15 hours ago | |
Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there. | ||
| ▲ | rao-v 8 hours ago | parent [-] | |
To the previous poster's point, soft distributions are useful. Even saving just the top-10 logits gives significantly more training signal than the final token alone. | ||
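For a rough sense of what saving the top 10 logits could look like, here is a minimal NumPy sketch: keep only the k largest teacher logits per position, then score a student against them. The function names (`top_k_logits`, `soft_target_loss`), the int32/float16 storage dtypes, and the choice to renormalize over only the k saved entries (which drops the tail mass) are assumptions for illustration, not anything stated in the thread.

```python
import numpy as np

def top_k_logits(logits, k=10):
    """Keep the k largest logits per position; returns (indices, values)."""
    # argpartition selects the top-k without fully sorting the vocab axis.
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    vals = np.take_along_axis(logits, idx, axis=-1)
    # Assumed storage format: int32 token ids, float16 logit values.
    return idx.astype(np.int32), vals.astype(np.float16)

def soft_target_loss(student_logits, idx, vals):
    """Cross-entropy of the student against the saved top-k teacher logits.

    Assumption: the teacher distribution is renormalized over the k saved
    entries only, so probability mass outside the top-k is ignored.
    """
    vals = vals.astype(np.float32)
    # Softmax over the k saved teacher logits.
    t = np.exp(vals - vals.max(axis=-1, keepdims=True))
    t /= t.sum(axis=-1, keepdims=True)
    # Log-softmax of the student over the full vocabulary.
    s = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_p = s - np.log(np.exp(s).sum(axis=-1, keepdims=True))
    # Pick the student's log-probs at the teacher's top-k token ids.
    picked = np.take_along_axis(log_p, idx, axis=-1)
    return float(-(t * picked).sum(axis=-1).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 1000)).astype(np.float32)  # toy vocab of 1000
    student = rng.normal(size=(4, 1000)).astype(np.float32)
    idx, vals = top_k_logits(teacher, k=10)
    print("bytes saved:", idx.nbytes + vals.nbytes, "vs full:", teacher.nbytes)
    print("loss:", soft_target_loss(student, idx, vals))
```

With this toy 1000-entry vocab and k=10, each position costs 60 bytes (10 int32 ids plus 10 float16 values) instead of 4000 bytes for the full float32 logit vector.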