▲ yahoozoo 2 days ago
How does a token predictor “apply heuristics to score candidates”? Is it running a tool, such as a Python script it writes for scoring candidates? If not, isn’t it just pulling some statistically-likely “score” out of its weights rather than actually calculating one?
▲ astrange 2 days ago | parent | next [-]
Token prediction is the interface. The implementation is a universal function approximator communicating through the token weights.
▲ imtringued 2 days ago | parent | prev [-]
You can think of the K (= key) matrix in attention as a neural network where each token is turned into a tiny classifier with multiple inputs and a single output. The softmax activation picks the most promising activations for a given output token. The V (= value) matrix forms another network where each token becomes a tiny regressor that accepts the activation as input and produces multiple outputs, which are summed to produce an intermediate token that is then fed into the MLP layer.

From this perspective, the transformer architecture is building neural networks at runtime. But there is an obvious limitation: the LLM operates on tokens, which means it can only operate on what is in the KV-cache/context window. If the candidates are not in the context window, it can't score them.
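A minimal numpy sketch of this view, for a single attention head with toy shapes (all names and dimensions here are illustrative, not from any real model): each row of K acts as a tiny one-output classifier scoring the query, softmax selects the promising scores, and each row of V contributes a regression output to the weighted sum handed to the MLP.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                                # toy model dimension
context = rng.normal(size=(4, d))    # 4 token embeddings in the context window
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

q = context[-1] @ W_q                # current token's query
K = context @ W_k                    # row i: weights of token i's "tiny classifier"
V = context @ W_v                    # row i: token i's "tiny regressor" contribution

scores = K @ q / np.sqrt(d)          # one scalar score per context token
weights = softmax(scores)            # pick the most promising activations
out = weights @ V                    # weighted sum -> intermediate token for the MLP

print(out.shape)                     # (8,)
```

Note that `scores` only ranges over rows of `context`: anything not embedded in the context window simply has no row in K and can never receive a score, which is the limitation described above.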