orbital-decay 2 days ago

The author read the docs but never experimented, so they don't seem to have any intuition to back up the theory. For example, Gemini Flash actually appears to produce deterministic outputs at temp 0, despite the disclaimer in the docs, so clearly Google has no trouble making it possible. Why don't they guarantee it, then? For starters, it's inconvenient due to batching; you can see this in Gemini Pro, which is "almost" deterministic, with identical results grouped together. It's a SaaS problem: if you run a model locally, it's much easier to make it deterministic than the article suggests, and certainly not "nearly impossible". It will cost you more, though.
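
To make the local case concrete, here's a minimal sketch of deterministic greedy decoding, assuming a Hugging Face causal LM (the model name is a placeholder; none of this comes from the article). Batch size 1 sidesteps the batching-dependent reductions mentioned above:

    # sketch only: deterministic local decoding with a placeholder model
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    torch.manual_seed(0)                      # moot under greedy decoding, but harmless
    torch.use_deterministic_algorithms(True)  # fail loudly on nondeterministic kernels

    tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder; any causal LM works
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    inputs = tok("2+2=", return_tensors="pt")
    with torch.no_grad():
        # do_sample=False means greedy decoding; a batch of one avoids
        # batch-size-dependent floating-point reduction orders
        out = model.generate(**inputs, do_sample=False, max_new_tokens=8)
    print(tok.decode(out[0]))

The determinism flag can force PyTorch onto slower kernels (and on GPU you may also need to pin things like CUBLAS workspace config), which is exactly the "costs you more" part.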

But by and large, you don't really want determinism. Imagine the model assigns equal logprobs to "yes" and "no": which one should go into the output? With temperature 0 and greedy decoding you get the same one every time, chosen by an unrelated factor (e.g. vocabulary order), so your outputs end up badly skewed away from what the model is actually telling you through its output distribution. The problems you're trying to solve with LLMs are inherently non-deterministic. Either the same is true of humans and organizations (you just can't reset their state to measure it), or at least the outcome depends on a myriad of little factors that are impossible to account for.
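
A toy illustration of that tie-breaking problem (the two-token vocabulary and logit values here are made up):

    import numpy as np

    vocab = ["yes", "no"]          # hypothetical two-token vocabulary
    logits = np.array([1.0, 1.0])  # the model is genuinely undecided

    # argmax breaks ties by index, i.e. by vocabulary order: always "yes"
    greedy = vocab[int(np.argmax(logits))]

    # sampling recovers the 50/50 split the model actually expressed
    rng = np.random.default_rng(0)
    probs = np.exp(logits) / np.exp(logits).sum()
    samples = rng.choice(vocab, size=10_000, p=probs)

    print(greedy)                     # "yes", every single run
    print((samples == "yes").mean())  # ~0.5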

Besides, all current models have issues at temperature 0. Gemini in particular exhibits micro-repetitions and hallucinations (absent at higher temps), which it then tries to correct. Other models have other issues. This is a training-time problem, and probably unsolvable at this point.

What you want is correctness, which is a pretty different thing, because the model works with concepts, not tokens. Try asking it what 2x2 is. It might phrase the answer differently each time, but good luck getting it to reply with anything other than 4 at a non-schizophrenic temperature. A bit of randomness won't stop it from being consistently correct (or consistently incorrect).
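
Rough numbers for why that works (the logit values are invented for illustration): when one answer dominates the distribution, moderate temperatures barely dent its probability.

    import numpy as np

    def softmax(logits, temperature):
        z = logits / temperature
        z -= z.max()  # numerical stability
        p = np.exp(z)
        return p / p.sum()

    answers = ["4", "5", "22"]
    logits = np.array([10.0, 2.0, 1.0])  # hypothetical: "4" strongly preferred

    for t in (0.2, 0.7, 1.0):
        print(t, dict(zip(answers, softmax(logits, t).round(4))))
    # even at t=1.0, "4" keeps ~99.9% of the mass; randomness reshuffles
    # the phrasing long before it flips a fact the model is confident about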