cma 6 days ago
> The autoregressive models consistently show better loss for the same number of training tokens

I thought bidirectional transformers (non-autoregressive) showed lower loss than autoregressive models for the same number of training tokens.
pama 5 days ago | parent
It is the other way around. If the data is causal and presented in causal order, it is impossible to beat the loss of a pure autoregressive model: by the chain rule, p(x_1, ..., x_T) = prod_t p(x_t | x_{&lt;t}), so an AR model can represent the exact probability distribution of the dataset, and its best achievable loss is the entropy of the data. Language data is mostly causal (words follow in the context of previous words as they are spoken or written). Most of the additional information that diffusion models extract by extreme oversampling of the same data could be captured with AR models as well, via fill-in-the-middle or order-reversal strategies, and with significant compute savings during training.
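To make the fill-in-the-middle idea concrete, here is a minimal sketch of the data transformation (the sentinel names and split logic are illustrative assumptions, not any particular library's API): the sequence is cut into prefix/middle/suffix and re-serialized so that a plain left-to-right AR model learns to infill, with no change to the next-token training objective.

    import random

    # Illustrative sentinel strings; real tokenizers use dedicated special tokens.
    PRE, SUF, MID = "<FIM_PREFIX>", "<FIM_SUFFIX>", "<FIM_MIDDLE>"

    def fim_transform(tokens, rng=random):
        """Rearrange a token sequence as prefix/suffix/middle so a
        left-to-right AR model is trained to fill in the middle span."""
        i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
        prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
        # Training still uses the ordinary next-token loss on this
        # re-serialized sequence; no bidirectional attention is needed.
        return [PRE, *prefix, SUF, *suffix, MID, *middle]

    print(fim_transform("the cat sat on the mat".split()))

The point of the re-serialization is that the model sees both the prefix and the suffix before it has to predict the middle, so it gets infilling supervision for free while keeping the cheap causal-attention training setup.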