yodon 6 months ago

This looks super valuable!

That said, it's concerning to see the reported probability for getting a 4 on a die roll is 65%.

Hopefully OpenAI isn't that biased at generating die rolls, so is that number actually giving us information about the accuracy of the probability assessments?

teej 6 months ago | parent | next [-]

Fair dice rolls are not an objective that cloud LLMs are optimized for. You should assume that LLMs cannot perform this task.

This is a problem when people naively use "give an answer on a scale of 1-10" in their prompts. LLMs are biased towards particular numbers (like humans!) and cannot linearly map an answer to a scale.
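
As a rough illustration (just a sketch assuming the openai Python client; the model name and prompt are placeholders), you can tally repeated answers and watch a handful of "favorite" numbers dominate:

    from collections import Counter
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    counts = Counter()
    for _ in range(100):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user",
                       "content": "Pick a random number between 1 and 10. "
                                  "Answer with the number only."}],
        )
        counts[resp.choices[0].message.content.strip()] += 1

    # A fair picker would be roughly uniform over 1-10; in practice
    # a few favorite numbers tend to dominate the tally.
    print(counts.most_common())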

It's extremely concerning when teams do this in a context like medicine. Asking an LLM "how severe is this condition" on a numeric scale is fraudulent and dangerous.

low_tech_love 6 months ago | parent | next [-]

This week I was in a meeting for a rather important scientific project at the university, and I asked the other participants “can we somehow reliably cluster this data to try to detect groups of similar outcomes?”, to which a colleague promptly responded “oh yeah, ChatGPT can do that easily”.

stanislavb 6 months ago | parent [-]

I guess he's right: it will be easy and relatively accurate. Relatively/seemingly.

low_tech_love 6 months ago | parent [-]

So that’s it then? We replace every well-understood, objective algorithm with well-hidden, fake, superficial surrogate answers from an AI?

yorwba 6 months ago | parent [-]

"cluster this data to try to detect groups of similar outcomes" is typically a fairly subjective task. If the objective algorithm optimizes for an objective criterion that doesn't match the subjective criteria that will be used to evaluate it, that objectivity is just as superficial.

low_tech_love 6 months ago | parent [-]

I’m not sure I follow. Every clustering algorithm that’s not an LLM prompt has a well-known, precisely specified mathematical/computational definition; no matter how complex, there’s a perfectly concrete structure behind it, and whether or not you agree with its results doesn’t change anything about them.

The results of an LLM are an arbitrary approximation of what a human would expect to see as the results of a query. In other words, they correlate very well with human expectations and are very good at fooling you into believing them. But can they give you results that you disagree with?

And more importantly, can you trust these results scientifically?

yorwba 6 months ago | parent | next [-]

If you use k-means to cluster your data into 100 clusters, it will do so, irrespective of whether it is meaningful. Perfectly objective, but what does that objectivity buy you? If your pet theory is that there are 100 groups, you'll actually be less likely to get results that disagree with it than if you ask an LLM how many groups there are.
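
A minimal sketch of that point, assuming scikit-learn (the data is synthetic, nothing from the thread):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))  # pure noise: no "real" groups exist

    km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
    print(len(set(km.labels_)))  # 100 -- objective, reproducible, meaningless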

But the real question is not whether you agree with the results, but whether they're useful. If you apply an objective method to data it is unsuitable for, it's garbage in, objective garbage out. Whether the method is suitable is not always something you can decide a priori; sometimes you have to check.

And if trying it out shows that LLM-provided clusters are more useful than other methods, you should swallow your pride and accept that, even if you disagree on philosophical grounds. (Or it might show that the LLM has no idea what it's doing! Then you can feel good about yourself.)

low_tech_love 6 months ago | parent [-]

This is a very interesting conversation, and it correlates well with the responses I got from my colleague during the meeting. Would you ask ChatGPT to do a t-test for you and blindly accept its results, regardless of whether the math behind it was sound? The reason we use math and statistics in experimental research is that we want objective results, not merely results that correlate with our expectations (those we can get from watching YouTube or reading blogs).

The objectivity of k-means buys me the trust that whatever clusters I get were obtained with a well-known and understood method, over which my expectations have absolutely no influence. I also know that the next person will get similar results, which gives me trust in their results too. So we can all have a shared, independent, objective understanding of a piece of data.
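
The contrast being drawn is easy to see in code: a t-test is a fixed, auditable computation (a sketch assuming SciPy, with made-up numbers):

    from scipy import stats

    group_a = [5.1, 4.9, 5.4, 5.0, 5.2]
    group_b = [5.8, 6.1, 5.9, 6.3, 6.0]

    t, p = stats.ttest_ind(group_a, group_b)
    print(t, p)  # same inputs, same result, for anyone who reruns it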

I wonder, if well-educated and technically-literate people like him and you are willing to accept arbitrary results from a language model as a replacement for objective math, then what should we expect from the general public?

jfim 6 months ago | parent | prev [-]

You can ask it to generate R or Python code to do the clustering, and review the generated code.

I'm not sure about ChatGPT, but I know Claude has a data exploration feature where you can upload a CSV and ask it questions; it generates Python code that can then be reviewed.
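
Something like the following is the kind of output you'd expect, and could review line by line (a sketch assuming pandas/scikit-learn; the file and column names are placeholders):

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    df = pd.read_csv("outcomes.csv")  # hypothetical input file
    X = StandardScaler().fit_transform(df[["outcome_a", "outcome_b"]])
    df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(df.groupby("cluster").mean())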

Terr_ 6 months ago | parent | prev [-]

It'll also give you different results based on logically-irrelevant numbers that might appear elsewhere in the collaborative fiction document.

dragonwriter 6 months ago | parent | prev | next [-]

> That said, it's concerning to see the reported probability for getting a 4 on a die roll is 65%.

Finding that an LLM is biased toward inventing die rolls equal to the median result, rounded to an available face by the most common rounding method (the median of a d6 is 3.5, which rounds to 4), is...not particularly surprising. If you want a fair RNG, use an RNG designed to be fair, not an LLM, where that would be, at best, an emergent accidental property.

ngrislain 6 months ago | parent | prev | next [-]

Thank you! The number is the sum of the logprobs of the tokens constituting the individual value, so it does represent the likelihood of seeing that value. So yes, OpenAI is super-biased as a random number generator. We sampled other die roll values from OpenAI, but with much lower probabilities (5 has an 8% chance).

ngrislain 6 months ago | parent [-]

More precisely, it represents the likelihood of seeing this value conditional on the tokens before it.
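
Concretely, for a value spanning several tokens, the probability is exp of the summed per-token logprobs (a worked toy example; the logprob numbers are made up to match the 65% figure):

    import math

    token_logprobs = [-0.30, -0.13]       # per-token logprobs for one value
    prob = math.exp(sum(token_logprobs))  # P(value | preceding tokens)
    print(prob)                           # ~0.65, i.e. the "65%" roll of 4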

elcritch 6 months ago | parent | next [-]

Even without other tokens before it, the LLM is probably reflecting the distribution of dice rolls in its training data. I’d guess humans tend to prefer “3” or “4”, as they’re nearer the average/median and feel fairer.

AFAICT, the LLMs aren’t forming a new mental model of “dice are symmetric and should land on any side with equal probability” and then using that to infer they should use an RNG.

radarsat1 6 months ago | parent | prev [-]

And I guess it includes possibilities other than numbers, like "f", which could lead to "four" or "five". There are probably separate probabilities for "fi" and "fo" too.
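
One quick way to check that speculation is to look at how the words actually tokenize (assuming the tiktoken package; whether "four" and "five" share a prefix token depends on the tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["4", "four", "five", "fi", "fo"]:
        print(word, enc.encode(word))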

mmcwilliams 6 months ago | parent | prev | next [-]

What about the models they offer would make you think they wouldn't be biased at generating random die rolls?

low_tech_love 6 months ago | parent [-]

I think the problem is that for every person who actually understands that ChatGPT should not be used for objective things like a die roll, there are 10 or 20 who would say “well, it looks ok, and it’s fast, convenient, and it passes nicely for an answer”. People are pushing the boundaries and waiting for the backlash, but the backlash never actually comes… so they keep pushing.

Think about this: suppose you’re reading a scientific paper and the author writes “I did a study with 52 participants, and here are the answers”. Would there be any reason to believe that data is real?

mmcwilliams 6 months ago | parent [-]

I agree that the fundamental problem is a misunderstanding of what transformer models produce and how, but people not getting bitten until far down the road is something service providers need to address, not everyone else.

I'm not sure I follow your hypothetical. The author making the claim in a public paper can be contacted for the data, which can be verified. Auditing the internals of an LLM, especially a closed one, is not the same.

supernewton 6 months ago | parent | prev | next [-]

I feel like https://xkcd.com/221/ might be heavily influencing what the typical "random" die roll looks like on the internet ;)

prerok 6 months ago | parent | next [-]

Based on this comic, I've seen unit tests use 4 as a replacement for a randomly generated number to ensure non-flakiness (of course, only when needed). Maybe that also partly explains the LLM's bias?
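
For reference, the pattern looks something like this (a sketch; the function under test is hypothetical):

    import random
    from unittest import mock

    def roll_die():
        return random.randint(1, 6)

    def test_roll_die():
        # Pin the "random" roll so the test can't flake.
        with mock.patch("random.randint", return_value=4):  # chosen by fair dice roll
            assert roll_die() == 4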

ngrislain 6 months ago | parent | prev [-]

Haha, I didn't know that one! It's consistent with OpenAI's conception of a "random" dice roll :-D. Joking aside, I'm quite convinced many people would not find 1 or 6 "random"-looking enough to be chosen as an example dice roll.

dotancohen 6 months ago | parent | prev | next [-]

Like most prejudices exhibited by LLMs, the reported probability for getting a 4 on a die roll is due to biases in the training data. Notably, a popular highly-cited comic hard-coded 4 as the return value of a pseudo-RNG based on a dice roll. I suspect that this influenced the LLM's choice.

https://xkcd.com/221/
