asjir 4 days ago

To expand upon the other comment: Indexing and multiplying with one-hot embeddings are equivalent.

If N is the vocab size and L is the sequence length, you'd need to create an NxL matrix and multiply the embedding matrix by it. But since that NxL matrix is sparse, with only a single 1 per column, it makes sense to represent it internally as just one number per column: the index at which the 1 sits. If you then define multiplication by this matrix in terms of that representation, it boils down to indexing the embedding matrix with those numbers.
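A minimal numpy sketch of that equivalence (using the transposed LxN layout, so each row holds one one-hot vector; shapes and names here are illustrative, not from the comment):

```python
import numpy as np

N, L, D = 5, 3, 4  # vocab size, sequence length, embedding dim
rng = np.random.default_rng(0)
E = rng.standard_normal((N, D))   # embedding matrix
ids = np.array([2, 0, 4])         # one token index per position

# One-hot route: an L x N matrix with a single 1 per row, then a matmul
one_hot = np.zeros((L, N))
one_hot[np.arange(L), ids] = 1.0
via_matmul = one_hot @ E

# Indexing route: just gather the selected rows
via_index = E[ids]

assert np.allclose(via_matmul, via_index)
```

The matmul does N multiply-adds per output element, almost all against zeros; the gather skips straight to the one nonzero term.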

And just as you write a special forward pass, you can write a special backward pass, so that backpropagation still reaches the embedding matrix.
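A sketch of what that backward pass can look like (the function names are made up for illustration): the gradient with respect to the embedding matrix is a scatter-add of each position's upstream gradient into the row it selected, which matches the dense formula `one_hot.T @ grad_out`.

```python
import numpy as np

def embed_forward(E, ids):
    # Special forward pass: indexing instead of a one-hot matmul
    return E[ids]

def embed_backward(grad_out, ids, vocab_size):
    # Special backward pass: scatter-add each position's upstream
    # gradient into the embedding row it selected. np.add.at handles
    # repeated indices correctly (their gradients accumulate).
    grad_E = np.zeros((vocab_size, grad_out.shape[-1]))
    np.add.at(grad_E, ids, grad_out)
    return grad_E

N, L, D = 5, 3, 4
rng = np.random.default_rng(1)
E = rng.standard_normal((N, D))
ids = np.array([2, 0, 2])         # token 2 appears twice on purpose
g = rng.standard_normal((L, D))   # upstream gradient

# Check against the dense one-hot formulation
one_hot = np.zeros((L, N))
one_hot[np.arange(L), ids] = 1.0
assert np.allclose(embed_backward(g, ids, N), one_hot.T @ g)
```

The repeated index in `ids` is the subtle case: both occurrences contribute gradient to the same row, which the scatter-add accumulates just like the dense matmul would.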