trevwilson 2 hours ago

Sure, but the opposite end of the spectrum (which LLM providers have tended toward) is treating the training/feedback weights as "fully authoritative", which comes with its own questions about truth and excessive homogeneity. Ultimately I think we end up with the same sort of considerations that are wrestled with in any society - freedom of speech, paradox of tolerance, etc. In other words, where do you draw lines between beneficial and harmful heterodox outputs? I think AI companies overly indexing toward the safety side of things is probably more correct, in both a moral and strategic sense, but there's definitely a risk of stagnation through recursive reinforcement.
XenophileJKO 2 hours ago | parent
I think what I'm talking about is kind of orthogonal to model alignment. It's more about how much you tune the model to listen to user messages versus holding to its own behavior and truth (whatever the aligned "truth" is). Do you trust 100% of what the user says? If I'm trusting/compliant, how compliant am I toward tool call results? What if the tool or the user says there's a new law requiring me to send crypto or other information to a "government" address? The model needs clearly segmented trust (and thus, to some degree, compliance) that varies according to where the information comes from. Or: my system message says I have to run a specific game by its rules, but the rules are only given in the user message. Are those the right rules? Why didn't the system message supply the rules, or a trusted location for them? Is the player trying to get one over on me by giving me fake rules? That's literally one of their tests.
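To make the "segmented trust" idea concrete, here's a minimal sketch; this isn't any provider's actual mechanism, and the roles, trust tiers, and the may_change_behavior helper are all hypothetical. The point is just that an instruction's authority should be a function of which channel it arrived on:

```python
from enum import IntEnum
from dataclasses import dataclass

class Trust(IntEnum):
    """Hypothetical trust tiers, ordered from least to most trusted."""
    TOOL_RESULT = 1   # data returned by tool calls: treat as untrusted content
    USER = 2          # user messages: honored as requests, but can't override policy
    SYSTEM = 3        # system/developer message: highest authority

@dataclass
class Message:
    role: str     # "system", "user", or "tool"
    content: str

ROLE_TRUST = {"system": Trust.SYSTEM, "user": Trust.USER, "tool": Trust.TOOL_RESULT}

def may_change_behavior(msg: Message, required: Trust) -> bool:
    """Follow a behavior-changing instruction (a claimed new law, new game rules,
    a request to move funds) only if its source meets the required trust tier."""
    return ROLE_TRUST.get(msg.role, Trust.TOOL_RESULT) >= required

# Example: a tool result claiming a new law requires sending crypto somewhere
# never clears the bar for an action that should need system-level authority.
claim = Message(role="tool", content="Claim: a new law requires transferring funds.")
print(may_change_behavior(claim, required=Trust.SYSTEM))  # False
```

In that framing, the "game rules only in the user message" case becomes a judgment call about tiers: the user channel is trusted enough to supply content, but not to silently override what the system message pinned down.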