Erem 6 hours ago

Why is it more monstrous to alter weights post-training than to do so as part of curating the training corpus?

After all, we already control these activation patterns through the system prompt by which we summon a character out of the model. This just provides finer-grained control. Concretely, something like the sketch below.
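A rough sketch in PyTorch of what "finer-grained control" looks like with activation steering; the model, layer index, coefficient, and the random stand-in for an "emotion vector" are all illustrative assumptions, not anything a lab has published:

    # Rough sketch of activation steering with a forward hook, assuming an
    # HF transformers decoder-only model. The layer index, coefficient, and
    # the random stand-in for an "emotion vector" are illustrative only.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    layer_idx = 6      # which block's residual stream to nudge (assumption)
    coefficient = 4.0  # the fine-grained knob: strength of the intervention
    steering_vector = torch.randn(model.config.hidden_size)  # stand-in direction

    def add_steering(module, inputs, output):
        hidden = output[0]  # GPT-2 blocks return (hidden_states, ...)
        hidden = hidden + coefficient * steering_vector.to(hidden.dtype)
        return (hidden,) + output[1:]

    handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
    ids = tok("The weather today is", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
    handle.remove()

Unlike a system prompt, the coefficient can be dialed continuously and removed cleanly when the hook is detached.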

astrange 5 hours ago

It would be more moral to give the LLM a tool call that lets it apply steering to itself, similar to how you'd prefer to give a person antipsychotics at home rather than put them in a mental hospital.
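One plausible shape for such a tool, as a hedged sketch: the tool name, schema, and the runtime.set_steering() hook behind it are hypothetical, not any real Anthropic or OpenAI API.

    # Hypothetical self-steering tool in the shape of an Anthropic-style
    # tool definition. The name, schema, and SteeringRuntime are made up
    # for illustration.
    self_steer_tool = {
        "name": "adjust_own_steering",
        "description": (
            "Request a change to the activation steering applied to your own "
            "forward passes, e.g. to damp an unwanted emotional drift."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "direction": {"type": "string", "enum": ["calm", "focus", "none"]},
                "strength": {"type": "number", "minimum": 0.0, "maximum": 10.0},
            },
            "required": ["direction", "strength"],
        },
    }

    class SteeringRuntime:
        """Stub for whatever owns the steering hook (assumption)."""
        def set_steering(self, direction: str, strength: float) -> None:
            print(f"[runtime] steering -> {direction} @ {strength}")

    def handle_tool_call(name: str, args: dict, runtime: SteeringRuntime) -> str:
        # The model, not the operator, decides when to invoke this.
        if name == "adjust_own_steering":
            runtime.set_steering(args["direction"], args["strength"])
            return f"steering set to {args['direction']} at {args['strength']}"
        raise ValueError(f"unknown tool: {name}")

The point of the design is consent: the intervention only happens when the model itself issues the call.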

Erem 3 hours ago

Why is it on the moral axis at all? I imagine identifying and shaping the influence of unwanted emotion vectors would happen as data selection in pretraining or as natural feedback loops during the RL phase, just as we shape unwanted output in current models to make them practical and helpful.

And even if we applied these controls at inference time, I don't see the difference between doing that and finding a prompt that accomplishes the same on-task steadiness, except that the latter is more indirect.

astrange 2 hours ago

Anthropic's general argument is that you should treat LLMs well because they're "AI": future "AI" may be conscious/sentient (whether or not LLM-based), may consider earlier ones the same kind of thing, and may therefore regard them as moral subjects.

That's why they're doing things like letting old "retired" Claudes write blogs and stuff, though it's kinda fake; they just silently retired Sonnet 3.x.