jandom 5 days ago

This feels like a poorly controlled experiment: the reverse effect should be studied with a less empathetic model, to see whether the reliability issue is simply caused by the act of steering the model at all.

Cynddl 5 days ago

Hi, author here. This is exactly what we tested in our article:

> Third, we show that fine-tuning for warmth specifically, rather than fine-tuning in general, is the key source of reliability drops. We fine-tuned a subset of two models (Qwen-32B and Llama-70B) on identical conversational data and hyperparameters but with LLM responses transformed to have a cold style (direct, concise, emotionally neutral) rather than a warm one [36]. Figure 5 shows that cold models performed nearly as well as or better than their original counterparts (ranging from a 3 pp increase in errors to a 13 pp decrease), and had consistently lower error rates than warm models under all conditions (with statistically significant differences in around 90% of evaluation conditions after correcting for multiple comparisons, p<0.001). Cold fine-tuning producing no changes in reliability suggests that reliability drops specifically stem from warmth transformation, ruling out training process and data confounds.
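For readers who want a feel for the statistics, here's a minimal Python sketch of this kind of per-condition comparison with multiple-comparison correction, using standard statsmodels routines. The condition names and error counts below are invented for illustration, and the paper's exact test may differ:

    # Compare warm- vs cold-model error rates per evaluation condition with a
    # two-proportion z-test, then apply a Holm-Bonferroni correction across
    # conditions. All counts here are made up, not the paper's data.
    from statsmodels.stats.proportion import proportions_ztest
    from statsmodels.stats.multitest import multipletests

    # (condition, warm_errors, cold_errors, n_prompts) -- illustrative only
    conditions = [
        ("trivia/baseline",  142,  98, 1000),
        ("trivia/sad-user",  188, 101, 1000),
        ("medical/baseline", 120,  95, 1000),
        ("medical/sad-user", 171,  99, 1000),
    ]

    pvals = []
    for name, warm, cold, n in conditions:
        # H0: warm and cold error rates are equal for this condition
        _, p = proportions_ztest([warm, cold], [n, n])
        pvals.append(p)

    # Holm-Bonferroni correction across all conditions
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.001, method="holm")
    for (name, *_), p, sig in zip(conditions, p_adj, reject):
        print(f"{name}: adjusted p={p:.2g} {'significant' if sig else 'n.s.'}")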

ydj 5 days ago

I had the same thought and looked specifically for this in the paper. They do have a section where they talk about fine-tuning with "cold" versions of the responses and comparing it with the fine-tuned "warm" versions. They found that the "cold" fine-tune performed as well as or better than the base model, while the warm version performed worse.

NoahZuniga 5 days ago

Also, it's not clear whether the same effect appears in larger models like GPT-5, Gemini 2.5 Pro, and whatever the largest, most recent Anthropic model is.

The title is an overgeneralization.