notpushkin 6 hours ago:
With a temperature of zero, LLM output will always be the same. Then it becomes a matter of getting it to output an exact replica of the input: if we can do that, it will always reproduce it, and the fact that it can also be used as a bullshit machine becomes irrelevant. With the usual interface this is probably inefficient: a prompt alone might not produce the output we need, or it might be larger than the thing we're trying to compress. However, if we also steer the decisions along the way, we can probably give a small prompt that gets the LLM going and tweak its decision process to get the tokens we want. We can then store those changes alongside the prompt. (This is a very hand-wavy concept, I know.)
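[One toy reading of the "store the corrections alongside the prompt" idea, not the commenter's spec: decode greedily with a deterministic stand-in "model" and record only the positions where the target text disagrees with the model's choice. The bigram table below is an assumed placeholder for a real LLM; with a real model you would override its argmax at the stored positions.]

```python
# Toy sketch: greedy decoding plus an override list for the positions where the
# model's deterministic choice differs from the text we want to reproduce.
TOY_MODEL = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}  # stand-in for an LLM

def compress(target: list[str], prompt: str) -> list[tuple[int, str]]:
    overrides, prev = [], prompt
    for i, word in enumerate(target):
        if TOY_MODEL.get(prev) != word:   # the "model" would have chosen differently
            overrides.append((i, word))   # store the correction
        prev = word
    return overrides

def decompress(overrides: list[tuple[int, str]], prompt: str, length: int) -> list[str]:
    fixes, out, prev = dict(overrides), [], prompt
    for i in range(length):
        word = fixes.get(i, TOY_MODEL.get(prev))  # follow the model unless overridden
        out.append(word)
        prev = word
    return out

target = "the cat sat on the mat".split()
overrides = compress(target[1:], target[0])
print(overrides)                                                   # only the deviations
print([target[0]] + decompress(overrides, target[0], len(target) - 1))
```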
duskwuff 5 hours ago:
There's an easier and more effective way of doing that: instead of trying to give the model an extrinsic prompt which makes it respond with your text, you use the text as input and, for each token, encode the rank of the actual token within the set of tokens that the model could have produced at that point. (Or an escape code for tokens which were completely unexpected.) If you're feeling really crafty, you can even use arithmetic coding based on the probabilities of each token, so that encoding high-probability tokens uses fewer bits. From what I understand, this is essentially how ts_zip (linked elsewhere) works.
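[A hedged sketch of the rank-encoding step with a small Hugging Face model; GPT-2 is an arbitrary choice, and the ranks are stored raw here rather than arithmetic-coded as ts_zip reportedly does.]

```python
# Encode each token of a text as the rank of that token in the model's sorted
# next-token predictions; decode by replaying the model and picking the token at
# the stored rank. Predictable text yields mostly small ranks, which an entropy
# coder could then squeeze.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def encode_ranks(text: str) -> tuple[int, list[int]]:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]          # (seq_len, vocab)
    ranks = []
    for i in range(len(ids) - 1):
        order = torch.argsort(logits[i], descending=True)   # most likely token first
        ranks.append((order == ids[i + 1]).nonzero().item())
    return ids[0].item(), ranks                              # first token kept verbatim

def decode_ranks(first_id: int, ranks: list[int]) -> str:
    ids = [first_id]
    with torch.no_grad():
        for rank in ranks:
            logits = model(torch.tensor([ids])).logits[0, -1]
            order = torch.argsort(logits, descending=True)
            ids.append(order[rank].item())
    return tokenizer.decode(ids)

first_id, ranks = encode_ranks("The quick brown fox jumps over the lazy dog.")
print(ranks)                          # mostly small numbers for predictable text
print(decode_ranks(first_id, ranks))  # round-trips, assuming deterministic logits
```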
D-Machine 2 hours ago:
> With a temperature of zero, LLM output will always be the same

Ignoring GPU indeterminism, yes, if you are running a local LLM and control batching. If you are computing via an API / in the cloud, and so being batched with other computations, then no (https://thinkingmachines.ai/blog/defeating-nondeterminism-in...). But yes, there is a lot of potential for semantic compression via AI models here, if we just make the effort.
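[An aside on the mechanism, not taken from the linked post: the root cause is floating-point non-associativity, so the same values reduced in a different order, as can happen when a request lands in a different batch, may round differently.]

```python
# Floating-point addition is not associative, so reduction order changes the result.
vals = [1e16, 1.0, -1e16]
print(sum(vals))                    # 0.0  (left to right: 1e16 + 1.0 rounds away the 1.0)
print(vals[0] + vals[2] + vals[1])  # 1.0  (cancel the large terms first)
```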