cma | 4 days ago
I mean models like BERT, not diffusion.

> Language data is mostly causal (as words follow in the context of previous words when they are spoken/written).

But where it isn't, a causal model's old KV is frozen in place and can't be amended by what follows, whereas BERT-like models take the whole context into account. I have definitely heard they reach lower loss for the same number of training tokens, but they are less efficient to compute, and running next-token prediction from them would be much more expensive.
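A minimal sketch of why that is, assuming PyTorch (single head, random toy weights, all names here hypothetical): with a causal mask, appending a token leaves the outputs (and hence K/V) of earlier positions untouched, so they can be cached; without the mask (BERT-style bidirectional attention), every earlier position changes, so each new token forces a full recompute.

    import torch

    torch.manual_seed(0)
    d = 8
    W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

    def attend(x, causal):
        # single-head scaled dot-product attention over the whole sequence
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        scores = q @ k.T / d**0.5
        if causal:
            # mask out attention to future positions
            mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
            scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    x4 = torch.randn(4, d)                    # 4 tokens of context
    x5 = torch.cat([x4, torch.randn(1, d)])   # same context plus one new token

    for causal in (True, False):
        before = attend(x4, causal)
        after = attend(x5, causal)[:4]        # outputs for the original 4 tokens
        same = torch.allclose(before, after, atol=1e-6)
        print(f"causal={causal}: first 4 outputs unchanged after appending -> {same}")

Expected output: causal=True gives True (old positions never see the new token, so their K/V can stay frozen in a cache), causal=False gives False (every position attends to the new token, so nothing can be reused), which is why generating token by token from a BERT-like model costs a full forward pass per step.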