| ▲ | pja a day ago |
| IIRC RLHF inevitably compromises model accuracy in order to train the model not to give dangerous responses. It would make sense if the model used for chain-of-thought were trained differently (perhaps a different expert from an MoE?) from the one used to interact with the end user. Since the end user only ever sees its output filtered through the public model, the chain-of-thought model can stay closer to the original, pre-RLHF version without risking the company's reputation. This way you get the full performance of the original model whilst still maintaining the filtering required to prevent actual harm (or terrible PR disasters). |
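A minimal sketch of the split pja is speculating about, purely for illustration: a lightly-tuned reasoning model produces the hidden chain of thought, and the RLHF-aligned public model writes everything the user actually sees. The model names and the one-method generate interface below are assumptions, not anything Anthropic has documented.

    from typing import Protocol


    class TextModel(Protocol):
        # Any text model exposing a single prompt-in, text-out method.
        def generate(self, prompt: str) -> str: ...


    def answer(prompt: str, reasoning_model: TextModel, public_model: TextModel) -> str:
        # Stage 1: the closer-to-pre-RLHF model produces the chain of thought.
        # Its raw text stays internal and is never shown to the user.
        hidden_cot = reasoning_model.generate(f"Think step by step: {prompt}")

        # Stage 2: the aligned public model conditions on that hidden reasoning,
        # so the usual safety filtering still applies to the visible reply.
        return public_model.generate(
            f"Question: {prompt}\nInternal notes: {hidden_cot}\nAnswer:"
        )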
|
| ▲ | landl0rd a day ago | parent | next [-] |
| Yeah, we really should stop focusing on model alignment. The idea that it's more important for your AI to fucking report you to the police if it thinks you're being naughty than for it to actually work for more stuff is stupid. |
| |
| ▲ | xp84 a day ago | parent | next [-] | | I'm not sure I'd throw out the alignment baby with the bathwater. But I wish we could draw a distinction between "might offend someone" and "dangerous." Even 'plotting terror attacks' is something terrorists can do just fine without AI. And as for making sure the model won't voice ideas that are hurtful to <insert group>, it seems so silly to me when it's text we're talking about. If I want to say "<insert group> are lazy and stupid," I can type that myself (and it's even protected speech in some countries still!). How does preventing Claude from espousing that dumb opinion keep <insert group> safe from anything? | |
| ▲ | landl0rd a day ago | parent | next [-] | | Let me put it this way: there are very few things I can think of that models should absolutely refuse, because there are very few pieces of information that are net harmful in all cases and at all times. I sort of run by Blackstone's principle on this: it is better to grant 10 bad men access to information than to deny that access to 1 good one. Easy example: someone asks the robot for advice on stacking/shaping a bunch of tannerite to better focus a blast. The model says he's a terrorist. In fact, he's doing what any number of us have done and just having fun blowing some stuff up on his ranch. Or, as I raised elsewhere, ochem is an easy example. I've had basically all the models claim that random amines are illegal, potentially psychoactive, verboten. I don't really feel like having my door kicked down by agents with guns, getting my dog shot, maybe getting shot myself because the robot tattled on me for something completely legal. For that matter, if someone wants to synthesize some molly the robot shouldn't tattle to the feds about that either. Basically it should just do what users tell it to do, excepting the very minimal cases where something is basically always bad. | |
| ▲ | gmd63 20 hours ago | parent [-] | | > it is better to grant 10 bad men access to information than to deny that access to 1 good one. I disagree when it comes to a tool as powerful as AI. Most good people are not even using AI. They are paying attention to their families and raising their children, living real life. Bad people are extremely interested in AI. They are using it to deceive at scales humanity has never before seen or even comprehended. They are polluting the wellspring of humanity that used to be the internet and turning it into a dump of machine-regurgitated slop. | | |
| ▲ | hombre_fatal 17 hours ago | parent | next [-] | | Yeah, it’s like saying you should be able to install anything on your phone with a url and one click. You enrich <0.1% of honest power users who might benefit from that feature… and 100% of bad actors… at the expense of everyone else. It’s just not a good deal. | |
| ▲ | landl0rd 16 hours ago | parent | prev [-] | | 1. Those people don’t need frontier models. The slop is slop in part because it’s garbage usually generated by cheap models. 2. It doesn’t matter. Most people at some level have a deontological view of what is right and wrong. I believe it’s wrong to build mass-market systems that can be so hostile to their users’ interests. I also believe it’s wrong for some SV elite to determine what is “unsafe information”. Most “dangerous information” has been freely accessible for years. |
|
| |
| ▲ | eru 17 hours ago | parent | prev [-] | | Yes. I used to think that worrying about models offending someone was a bit silly. But: what chance do we have of keeping ever bigger and better models from eventually turning the world into paper clips, if we can't even keep our small models from saying something naughty? It's not that keeping the models from saying something naughty is valuable in itself. Who cares? It's that we need the practice, and enforcing arbitrary minor censorship is as good a task as any to practice on. Especially since with this task it's so easy to (implicitly) recruit volunteers who will spend a lot of their free time providing adversarial input. | |
| ▲ | landl0rd 16 hours ago | parent [-] | | This doesn’t need to be so focused on the current set of verboten info, though. Just practice making it not say some set of random, less important stuff. |
|
| |
| ▲ | latentsea a day ago | parent | prev [-] | | That's probably true... right up until it reports you to the police. |
|
|
| ▲ | Wowfunhappy a day ago | parent | prev [-] |
| Correct me if I'm wrong--my understanding is that RHLF was the difference between GPT 3 and GPT 3.5, aka the original ChatGPT. If you never used GPT 3, it was... not good. Well, that's not fair, it was revolutionary in its own right, but it was very much a machine for predicting the most likely next word, it couldn't talk to you the way ChatGPT can. Which is to say, I think RHLF is important for much more than just preventing PR disasters. It's a key part of what makes the models useful. |
| |
| ▲ | pja 16 hours ago | parent | next [-] | | Oh sure, RLHF instruction tuning was what turned a model of mostly academic interest into a global phenomenon. But it also compromised model accuracy & performance: the more you tune to eliminate or reinforce specific behaviours, the more you affect the overall performance of the model. Hence my speculation that Anthropic is using a chain-of-thought model that has not been alignment-tuned, in order to improve performance. This would then explain why you don’t get to see its output without signing up to special agreements. Those agreements presumably explain all this to counterparties that Anthropic trusts to cope with non-aligned outputs in the chain-of-thought. | |
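For background on the trade-off described above, the usual KL-penalised RLHF objective (the generic InstructGPT-style formulation, not anything specific to Anthropic's recipe) makes it explicit:

    \max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right] \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

The reward term r_\phi(x, y) pushes the tuned policy \pi_\theta toward preferred (including refusal) behaviour, while the KL penalty, weighted by \beta, is what keeps it near the pretrained reference model \pi_{\mathrm{ref}}. The harder the tuning pushes on specific behaviours, the further the policy drifts from the base distribution, which is one way to read the accuracy cost mentioned above.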
| ▲ | Wowfunhappy a day ago | parent | prev [-] | | Ugh, I'm past the edit window, but I meant RLHF, aka "Reinforcement Learning from Human Feedback"; I'm not sure how I messed that up not once but twice! | |
|