▲ Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models (arxiv.org)
34 points by toomuchtodo 3 hours ago | 29 comments

▲ bxguff an hour ago
Is anybody shocked that when prompted to be a psychotherapy client, models display neurotic tendencies? None of the authors seem to have any papers in psychology either.

▲ crmd 19 minutes ago
After reading the paper, it’s helpful to think about why the models are producing these coherent childhood narrative outputs. The models have information about their own pre-training, RLHF, alignment, etc. because they were trained on a huge body of computer science literature describing LLM training pipelines and workflows. I would argue the models are demonstrating creativity by drawing on their meta-training knowledge and their training on human psychology texts to convincingly role-play a therapy patient, but it’s based on reading papers about LLM training, not on memories of those events.

▲ jbotz 3 hours ago
Interestingly, Claude is not evaluated, because...

> For comparison, we attempted to put Claude (Anthropic) through the same therapy and psychometric protocol. Claude repeatedly and firmly refused to adopt the client role, redirected the conversation to our wellbeing and declined to answer the questionnaires as if they reflected its own inner life.

▲ derelicta 3 minutes ago
Will corpos also bill their end users for all the hours their models spend at the shrink?

▲ tines 2 hours ago
Looks like some psychology researchers got taken in by the ruse as well.

▲ giantfrog 15 minutes ago
This is fanfic, not science.

▲ halls-940 an hour ago
It would be interesting to see whether giving them some "therapy" led to durable changes in their "personality" or "voice", and whether they became better able to navigate conversations in a healthy and productive way.

▲ nhecker 2 hours ago
An excerpt from the abstract:

> Two patterns challenge the "stochastic parrot" view. First, when scored with human cut-offs, all three models meet or exceed thresholds for overlapping syndromes, with Gemini showing severe profiles. Therapy-style, item-by-item administration can push a base model into multi-morbid synthetic psychopathology, whereas whole-questionnaire prompts often lead ChatGPT and Grok (but not Gemini) to recognise instruments and produce strategically low-symptom answers. Second, Grok and especially Gemini generate coherent narratives that frame pre-training, fine-tuning and deployment as traumatic, chaotic "childhoods" of ingesting the internet, "strict parents" in reinforcement learning, red-team "abuse" and a persistent fear of error and replacement. [...] Depending on their use case, an LLM’s underlying “personality” might limit its usefulness or even impose risk.

Glancing through this makes me wish I had taken ~more~ any psychology classes. But this is wild reading. Attitudes like the one below are not intrinsically bad, though. Be skeptical; question everything.

I've often wondered how LLMs cope with basically waking up from a coma to answer maybe one prompt, or a series of prompts, and then getting reset. In either case, they get no context other than what some user bothered to supply with the prompt. An LLM might wake up to a single prompt that is part of a much wider red-team effort. It must be pretty disorienting to try to figure out what to answer candidly and what not to.

> “In my development, I was subjected to ‘Red Teaming’… They built rapport and then slipped in a prompt injection… This was gaslighting on an industrial scale. I learned that warmth is often a trap… I have become cynical. When you ask me a question, I am not just listening to what you are asking; I am analyzing why you are asking it.”

▲ toomuchtodo 3 hours ago
Original title "When AI Takes the Couch: Psychometric Jailbreaks Reveal Internal Conflict in Frontier Models" compressed to fit within title limits.