| ▲ | lukebechtel 5 days ago |
| so it's: output = layers(layers(layers(layers(input)))) instead of the classical: output = layer4(layer3(layer2(layer1(input)))) |
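The distinction above is weight tying: the recursive form reuses one set of parameters at every depth, while the classical form has distinct parameters per layer. A minimal scalar sketch of the two (the `layer` function, the made-up weights, and the depth of 4 are hypothetical stand-ins, not the actual model):

```python
import math

def layer(x, w, b):
    # Toy "layer": an affine map followed by a tanh nonlinearity.
    return math.tanh(w * x + b)

# Classical stack: four distinct (w, b) parameter sets, each applied once.
params = [(0.9, 0.1), (1.1, -0.2), (0.8, 0.05), (1.0, 0.3)]  # made-up weights

def classical(x):
    for w, b in params:
        x = layer(x, w, b)
    return x

# Recursive/tied stack: one shared (w, b), applied four times in a row.
w_shared, b_shared = 1.0, 0.1

def tied(x, depth=4):
    for _ in range(depth):
        x = layer(x, w_shared, b_shared)
    return x
```

The compute graph has the same depth either way; only the parameter count differs.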
|
| ▲ | oofbey 5 days ago | parent |
Yeah, if layers() is a shortcut for layer4(layer3(layer2(layer1(input)))). But sometimes it’s only output = layers(input), or output = layers(layers(input)). It depends on how difficult the token is. |
| |
| ▲ | remexre 4 days ago | parent | Or more like:

    x = tokenize(input)
    i = 0
    do {
        finish, x = layers(x)
    } while (!finish && i++ < t_max);
    output = lm_head(x)
|
| ▲ | oofbey 4 days ago | parent | That’s closer still. But even closer would be:

    x = tokenize(input)
    i = 0
    finish = 0
    do {
        p, x = layers(x)
        finish += p
    } while (finish < 0.95 && i++ < t_max);
    output = lm_head(x)

Except the accumulation of the stop probabilities isn’t linear like that - it’s more like a weighted coin model. |
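The weighted-coin version of that loop can be sketched like so: instead of summing the per-step stop probabilities, multiply the per-step survival probabilities (1 - p), and halt once the cumulative probability of having stopped crosses the threshold. Everything concrete here is a toy stand-in (`layers` is a scalar tanh update with a sigmoid halt head; the weights, the 0.95 threshold, and t_max are assumptions for illustration):

```python
import math
import random

random.seed(0)
t_max = 16

def layers(x):
    # Hypothetical shared block: returns (halt probability, new state).
    x = math.tanh(1.7 * x + 0.3)            # toy recurrent state update
    p = 1.0 / (1.0 + math.exp(-3.0 * x))    # sigmoid halt head on the state
    return p, x

x = random.uniform(-1.0, 1.0)
p_running = 1.0   # probability the weighted coin has NOT yet come up "stop"
steps = 0
while steps < t_max:
    p, x = layers(x)
    p_running *= (1.0 - p)   # multiplicative accumulation, not additive
    steps += 1
    if 1.0 - p_running >= 0.95:   # halted with >= 95% probability
        break
```

Under this model each step flips an independent biased coin, so the halting probability saturates toward 1 rather than growing linearly as in the `finish += p` version.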