imiric 6 hours ago:
That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt, so you can't really be certain that a difference in output is due to the single variable you isolated.
pamelafox 6 hours ago:
So true! I've also set up automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure the results. I only use that when I want even more confidence, typically when I want to compare models more precisely. I do find that results have been fairly similar across runs for the same model/prompt/settings, even though we can't set a seed for most models/agents.
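For anyone curious, a minimal sketch of that kind of repeated-run evaluation might look like the code below. Note that `run_agent` is a hypothetical placeholder for whatever SDK or API call actually executes the prompt (it is not a real GitHub Copilot SDK function), and the "consistency" metric here is just exact-match agreement across runs; real evaluations usually score outputs against task-specific criteria instead.

```python
from collections import Counter


def run_agent(prompt: str, model: str) -> str:
    """Hypothetical stand-in: send `prompt` to `model` and return its text output.

    Replace this with your actual agent/SDK call.
    """
    raise NotImplementedError("wire this up to your agent or SDK")


def evaluate_consistency(prompt: str, model: str, runs: int = 10) -> dict:
    """Re-run the same prompt several times and summarize how much the outputs agree."""
    outputs = [run_agent(prompt, model) for _ in range(runs)]
    counts = Counter(outputs)
    _, modal_count = counts.most_common(1)[0]
    return {
        "runs": runs,
        "distinct_outputs": len(counts),          # how many different answers appeared
        "modal_output_share": modal_count / runs,  # fraction of runs giving the most common answer
    }
```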
ChrisGreenHeur 5 hours ago:
Same with people: no matter what information you give a person, you can't be sure they'll follow it the same way every time.