| ▲ | lukebechtel 5 days ago |
| so it's: output = layers(layers(layers(layers(input)))) instead of the classical: output = layer4(layer3(layer2(layer1(input)))) |
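The distinction above is weight tying: the recursive form reuses one set of parameters at every depth, while the classical form has distinct parameters per layer. A minimal scalar sketch of the two (the `layer` function, the made-up weights, and the depth of 4 are hypothetical stand-ins, not the actual model):

```python
import math

def layer(x, w, b):
    # Toy "layer": an affine map followed by a tanh nonlinearity.
    return math.tanh(w * x + b)

# Classical stack: four distinct (w, b) parameter sets, each applied once.
params = [(0.9, 0.1), (1.1, -0.2), (0.8, 0.05), (1.0, 0.3)]  # made-up weights

def classical(x):
    for w, b in params:
        x = layer(x, w, b)
    return x

# Recursive/tied stack: one shared (w, b), applied four times in a row.
w_shared, b_shared = 1.0, 0.1

def tied(x, depth=4):
    for _ in range(depth):
        x = layer(x, w_shared, b_shared)
    return x
```

The compute graph has the same depth either way; only the parameter count differs.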
|
| ▲ | oofbey 5 days ago | parent |
Yeah, if layers() is a shortcut for layer4(layer3(layer2(layer1(input)))). But sometimes it’s only output = layers(input), or output = layers(layers(input)). It depends on how difficult the token is. |
| |
| ▲ | remexre 4 days ago | parent | Or more like:

    x = tokenize(input)
    i = 0
    do {
        finish, x = layers(x)
    } while (!finish && i++ < t_max);
    output = lm_head(x)
|
| ▲ | oofbey 4 days ago | parent | That’s closer still. But even closer would be:

    x = tokenize(input)
    i = 0
    finish = 0
    do {
        p, x = layers(x)
        finish += p
    } while (finish < 0.95 && i++ < t_max);
    output = lm_head(x)

Except the accumulation of the stop probabilities isn’t linear like that - it’s more like a weighted coin model. |
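The weighted-coin version of that loop can be sketched like so: instead of summing the per-step stop probabilities, multiply the per-step survival probabilities (1 - p), and halt once the cumulative probability of having stopped crosses the threshold. Everything concrete here is a toy stand-in (`layers` is a scalar tanh update with a sigmoid halt head; the weights, the 0.95 threshold, and t_max are assumptions for illustration):

```python
import math
import random

random.seed(0)
t_max = 16

def layers(x):
    # Hypothetical shared block: returns (halt probability, new state).
    x = math.tanh(1.7 * x + 0.3)            # toy recurrent state update
    p = 1.0 / (1.0 + math.exp(-3.0 * x))    # sigmoid halt head on the state
    return p, x

x = random.uniform(-1.0, 1.0)
p_running = 1.0   # probability the weighted coin has NOT yet come up "stop"
steps = 0
while steps < t_max:
    p, x = layers(x)
    p_running *= (1.0 - p)   # multiplicative accumulation, not additive
    steps += 1
    if 1.0 - p_running >= 0.95:   # halted with >= 95% probability
        break
```

Under this model each step flips an independent biased coin, so the halting probability saturates toward 1 rather than growing linearly as in the `finish += p` version.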