jryan49 4 hours ago
Something I always wonder with each blog post comparing different types of prompt engineering is whether they ran it once or multiple times. LLMs are not consistent on the same task. I imagine they realize this, of course, but I never get enough detail on the testing methodology.
only-one1701 4 hours ago
This drives me absolutely crazy. Non-falsifiable and non-deterministic results. All of this stuff is (at best) anecdotes and vibes being presented as science and engineering.
CuriouslyC 2 hours ago
I always make a habit of doing a lot of duplicate runs when I benchmark, for this reason. Joke's on me: in the time I spent doing 1 benchmark with real confidence intervals and getting no traction on my post, I could have done 10 shitty benchmarks, or 1 shitty benchmark and 9x more blogspam. Perverse incentives rule us all.
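For what "duplicate runs with real confidence intervals" might look like in practice, here is a minimal stdlib-only sketch: rerun the same prompt/config N times, then report the mean score with an approximate 95% interval. The `score_ci` helper and the sample accuracies are illustrative, not from any specific benchmark, and the interval uses a normal approximation (z = 1.96), which is rough for small N.

```python
import math
import statistics

def score_ci(scores, z=1.96):
    """Mean and approximate 95% confidence interval over repeated
    benchmark runs, using a normal approximation (hypothetical helper)."""
    n = len(scores)
    mean = statistics.mean(scores)
    # standard error of the mean from the sample standard deviation
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem

# e.g. accuracy from 10 runs of the same prompt and model config
runs = [0.72, 0.68, 0.75, 0.70, 0.71, 0.69, 0.74, 0.73, 0.70, 0.72]
mean, lo, hi = score_ci(runs)
print(f"accuracy: {mean:.3f}  95% CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval for prompt A overlaps heavily with prompt B's, a single-run comparison between them is mostly noise, which is the thread's point.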