abeppu | 3 days ago
There may be a real point here, but this post and paper are not good evidence for it. The blogpost doesn't have a date, but it links to a 2023 preprint, which is hard to evaluate because it doesn't actually have a methods section, despite referring to one multiple times. (Did this ever get published?) https://osf.io/preprints/psyarxiv/5b26t_v1

But it _sounds_ like they asked GPT via the API to take the same survey 1000 times, without telling it to attempt to model the preferences of any particular country, and yet both the blog and the paper interpret a correlational analysis as evidence that it's bad at modeling local values.

> The greater the cultural distance between a country and the USA, the less accurate ChatGPT got at simulating peoples’ values.

> This correlation represents the similarity between variation in GPT and human responses in a particular population; in other words, how strongly GPT can replicate human judgments from a particular national population.

And to some degree, this says more about the differences among the human responses than about GPT: given the survey data, no matter what responses the LLM gives, they are going to be closer to some national averages than others.

LLMs also have a characteristic default voice/style that we're annoyed by, but _when instructed_ they can mimic other styles. If you have some multi-dimensional style space, yes, you could find the group the default output is closest to, but it would be misleading to say the model does a poor job "simulating" or "replicating" the others if you never actually tested that.
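To make the statistical point concrete, here's a toy sketch. The country names, distances, and numbers are entirely made up and have nothing to do with the actual survey or paper; it just shows that a single fixed answer vector will mechanically correlate better with populations that happen to sit near it, with no "simulation" involved.

```python
# Toy illustration (assumed numbers, not from the paper): if a fixed set of
# "GPT" answers happens to sit near one country's profile, its correlation
# with other countries falls off with their distance from that country --
# a property of how the human populations differ, not of simulation ability.
import numpy as np

rng = np.random.default_rng(0)
n_items = 40

# Hypothetical per-item mean answers for a baseline country.
usa = rng.normal(0, 1, n_items)

# Other (made-up) countries drift away from that baseline by increasing
# "cultural distance".
distances = {"Netherlands": 0.3, "Japan": 0.8, "Pakistan": 1.3, "Ethiopia": 1.8}
countries = {"USA": usa}
countries.update({name: usa + rng.normal(0, d, n_items)
                  for name, d in distances.items()})

# One fixed answer vector that happens to be close to the USA profile,
# generated without any instruction to imitate anyone.
gpt_answers = usa + rng.normal(0, 0.2, n_items)

for name, human_means in countries.items():
    r = np.corrcoef(gpt_answers, human_means)[0, 1]
    print(f"{name:12s} r = {r:+.2f}")

# r shrinks as the (made-up) distance from the baseline grows, even though
# the same answers were compared against every population and nothing was
# ever asked to "simulate" any of them.
```

The gradient in correlations here comes entirely from how the populations were constructed, which is the worry with reading the paper's country-level correlations as a measure of simulation ability.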