| ▲ | bgirard 2 hours ago |
> malicious

It doesn't have to be malicious. If my workflow is to send a prompt once and (hopefully) accept the result, then degradation matters a lot. If degradation is silently giving me worse code output on some of my commits, that matters to me. I care about *expected* performance when picking which model to use, not optimal benchmark performance.
| ▲ | Aurornis an hour ago | parent | next [-] |
Non-determinism isn't the same as degradation. Non-determinism means that even with a temperature of 0.0, you can't expect identical outputs across API calls. In practice, people anchor to the best results they've experienced and treat anything else as degradation, when it may just be randomness in either direction given the same prompts. When you're getting good results, you assume that's normal; when things feel off, you assume something abnormal is happening. Rerun the exact same prompts and context with temperature 0 and you might get a different result.
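A minimal sketch of one well-known mechanism behind this (my example, not something the commenter cited): floating-point addition is not associative, so if the server sums logits in a different order across calls (e.g. because of different batching), the results can differ slightly, which is enough to flip a token choice at a near-tie even at temperature 0.

```python
# Floating-point addition is not associative: changing the reduction
# order changes the result, which is why identical inputs can yield
# slightly different logits across API calls.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # one summation order
right_to_left = a + (b + c)   # another summation order

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

Greedy decoding picks the argmax of the logits, so when two tokens are nearly tied, a last-digit difference like this can change which token wins, and every subsequent token then diverges.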
| ▲ | novaleaf an hour ago | parent | prev [-] |
This is about the variance of daily statistics, so I think the suggestions are entirely appropriate in this context.