montebicyclelo | 4 days ago:
So, they are included in the normal backpropagation of the model. But there is no one-hot encoding: although you are correct that it is mathematically equivalent, it would be very inefficient to do it that way. Instead, you can make indexing itself differentiable, i.e. gradients flow back to the vectors that were selected, which is much cheaper than a one-hot matmul. (If you're curious about the details, there's an example of making indexing differentiable in my minimal deep learning library here: https://github.com/sradc/SmallPebble/blob/2cd915c4ba72bf2d92...)
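A minimal NumPy sketch of the idea (not SmallPebble's actual API; names here are illustrative): the forward pass just selects rows, and the backward pass routes each output gradient back to the row it came from. The one-hot matmul version is included only to show the two are equivalent.

```python
import numpy as np

def embedding_forward(table, indices):
    # Forward: select rows directly -- no (batch, vocab) one-hot matrix needed.
    return table[indices]

def embedding_backward(grad_out, indices, table_shape):
    # Backward: scatter each output gradient back to the selected row.
    # np.add.at accumulates correctly when an index repeats in the batch.
    grad_table = np.zeros(table_shape)
    np.add.at(grad_table, indices, grad_out)
    return grad_table

vocab, dim = 5, 3
table = np.arange(vocab * dim, dtype=float).reshape(vocab, dim)
idx = np.array([1, 3, 1])  # note the repeated index

out = embedding_forward(table, idx)
grad_out = np.ones_like(out)
grad_indexed = embedding_backward(grad_out, idx, table.shape)

# Equivalent (but wasteful) one-hot formulation:
onehot = np.eye(vocab)[idx]          # shape (3, 5)
out_onehot = onehot @ table          # same forward result
grad_onehot = onehot.T @ grad_out    # same gradient w.r.t. the table

assert np.allclose(out, out_onehot)
assert np.allclose(grad_indexed, grad_onehot)
```

The indexed version touches only the selected rows, so its cost scales with the batch size and embedding dimension rather than the vocabulary size.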
xg15 | 4 days ago (reply):
Ah, that makes sense, thanks a lot! |