jryan49 4 hours ago
Something I always wonder with each blog post comparing different types of prompt engineering is whether they ran it once or multiple times. LLMs are not consistent on the same task. I imagine they realize this, of course, but I never get enough detail on the testing methodology.
only-one1701 4 hours ago
This drives me absolutely crazy. Non-falsifiable and non-deterministic results. All of this stuff is (at best) anecdotes and vibes being presented as science and engineering.
CuriouslyC 2 hours ago
I always make a habit of doing a lot of duplicate runs when I benchmark, for this reason. Joke's on me: in the time I spent doing 1 benchmark with real confidence intervals and getting no traction on my post, I could have done 10 shitty benchmarks, or 1 shitty benchmark and 9x more blogspam. Perverse incentives rule us all.
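For what "duplicate runs with real confidence intervals" might look like in practice, here is a minimal stdlib-only sketch: rerun the same prompt/config N times, then report the mean score with an approximate 95% interval. The `score_ci` helper and the sample accuracies are illustrative, not from any specific benchmark, and the interval uses a normal approximation (z = 1.96), which is rough for small N.

```python
import math
import statistics

def score_ci(scores, z=1.96):
    """Mean and approximate 95% confidence interval over repeated
    benchmark runs, using a normal approximation (hypothetical helper)."""
    n = len(scores)
    mean = statistics.mean(scores)
    # standard error of the mean from the sample standard deviation
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem

# e.g. accuracy from 10 runs of the same prompt and model config
runs = [0.72, 0.68, 0.75, 0.70, 0.71, 0.69, 0.74, 0.73, 0.70, 0.72]
mean, lo, hi = score_ci(runs)
print(f"accuracy: {mean:.3f}  95% CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval for prompt A overlaps heavily with prompt B's, a single-run comparison between them is mostly noise, which is the thread's point.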