embedding-shape 13 hours ago

How many times did you retry (so it's not just up to chance), and what were the parameters, specifically temperature and top_p?

latexr 11 hours ago | parent | next [-]

> How many times did you retry (so it's not just up to chance)

If you don’t know the answer to a question, retrying multiple times only serves to amplify your bias; you have no basis for knowing whether the answer is correct.

zamadatix 11 hours ago | parent | next [-]

If you retry until it gives the answer you want, then yes, it only serves to amplify your bias. If you retry and see how often it agrees with itself, it serves to show whether there is any confidence in the answer at all.

It's a bit of a crutch for LLMs lacking the ability to just say "I'm not sure" because doing so is against how they are rewarded in training.
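A minimal sketch of that self-agreement check (sample_model is a hypothetical stand-in for whatever API you actually call, not any specific library):

    # Sample the same prompt n times and report the majority answer
    # plus the fraction of samples that agree with it. A low fraction
    # is the "no confidence in an answer all around" signal.
    from collections import Counter

    def sample_model(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM call here")

    def self_agreement(prompt: str, n: int = 10) -> tuple[str, float]:
        answers = [sample_model(prompt) for _ in range(n)]
        top, count = Counter(answers).most_common(1)[0]
        return top, count / n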

oivey 10 hours ago | parent [-]

You’re still likely to just amplify your own bias if you don’t apply some basic experimental controls, like preselecting how many retries you’re going to do and what level of agreement across trials counts as statistically significant.
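As a sketch of what pre-registering that could look like (the trial count, threshold, and four-plausible-answers assumption are all made up for illustration): fix n and alpha before sampling, then test whether the observed agreement beats chance.

    # Pre-registered check: choose n and alpha *before* sampling, then
    # ask whether k agreeing answers out of n beats chance agreement
    # among an assumed 4 plausible answers.
    from scipy.stats import binomtest

    n = 20        # number of retries, chosen in advance
    alpha = 0.05  # significance threshold, chosen in advance
    k = 14        # samples that matched the majority answer

    result = binomtest(k, n, p=1 / 4, alternative="greater")
    print(result.pvalue < alpha)  # True here: agreement is unlikely to be chance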

observationist 10 hours ago | parent | prev | next [-]

https://en.wikipedia.org/wiki/Monte_Carlo_method

If the question is out of distribution, you're more likely to get a chaotic scatter of answers, whereas if it's merely not well known, you'll get something closer to a normal distribution that gets flatter the less well modeled the concept is.

There are all sorts of techniques and methods you can use to get a probabilistically valid assessment of LLM outputs; they're just expensive and/or tedious.

Repeated sampling gives you the basis for a Bayesian model of the output. You can even work out rigorous numbers specific to the model and your prompt framework by sampling questions you know the model has in distribution and comparing those curves against your test case, which gives you a measure of relative certainty.
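One sketch of that framing (names and numbers are illustrative, not from any particular library): treat "sample agrees with the majority answer" as a Bernoulli trial, put a Beta posterior on the agreement rate, and compare the posterior for your test prompt against one for a question you know is in distribution.

    # Beta(a + agree, b + disagree) posterior under a Beta(a, b) prior.
    def beta_posterior(agree: int, total: int, a: float = 1.0, b: float = 1.0):
        return a + agree, b + (total - agree)

    def posterior_mean(a: float, b: float) -> float:
        return a / (a + b)

    # 10 samples of a known in-distribution question vs. the test question.
    baseline = beta_posterior(agree=9, total=10)  # agrees with itself 9/10
    test = beta_posterior(agree=4, total=10)      # test prompt: 4/10 agreement
    print(posterior_mean(*baseline), posterior_mean(*test))  # ~0.83 vs ~0.42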

latexr 10 hours ago | parent [-]

Sounds like not using an LLM at all would take considerably less effort and waste fewer resources.

dicknuckle 9 hours ago | parent [-]

It's a way to validate the LLM output in a test scenario.

embedding-shape 11 hours ago | parent | prev [-]

Well, it seems in this case the parent did know the answer, so I'm not sure what your point is.

I'm asking for the sake of reproducibility, and to clarify whether they ran the text-by-chance generator more than once, to make sure they didn't just hit one bad case out of ten by testing it a single time.

latexr 10 hours ago | parent [-]

> so I'm not sure what your point is.

That your suggestion would not correspond to real use by real, regular people. OP posted the message as noteworthy because they knew it was wrong. Anyone who didn’t know, and who trusts LLMs blindly (no small number of people), would’ve left it at that and gone about their day with wrong information.

embedding-shape 9 hours ago | parent [-]

> That your suggestion would not correspond to real use by real regular people.

Which wasn't the point either; the point was just to ask "Did you run the prompt once, or many times?", as that obviously impacts how seriously you can take whatever outcome you get.

Y_Y 11 hours ago | parent | prev [-]

Sorry, I lost the chat, but it was default parameters on the 32B model. It cited some books saying that they had three stomachs and didn't ruminate, but after I pressed on these points it admitted that it had left out the fourth stomach because it was small, and claimed that the rumination wasn't "true" in some sense.