walthamstow 4 hours ago

The eating disorder section is kind of crazy. Are we going to incrementally add sections for every 'bad' human behaviour as time goes on?

embedding-shape 4 hours ago

Even better: adding it to the system prompt is a temporary fix, then they'll work it into post-training, so the next model release will probably remove it from the system prompt. At least when it's in the system prompt we get some visibility into what's being censored. Once it's in the model, it'll be a lot harder to understand why "How many calories does 100g of Pasta have?" only returns "Sorry, I cannot divulge that information".

gchamonlive 4 hours ago

Just assume each model iteration incorporates all the censorship prompts that came before it, and compile the possible list from the system prompt history. To validate it, design an adversarial test against the items in the compiled list.

rzmmm 3 hours ago

The alignment favors supporting healthy behaviors so it can be a thin line. I see the system prompt as "plan B" when they can't achieve good results in the training itself.

It's a particularly sensitive issue, so they are probably just being cautious.

WarmWash 4 hours ago

When you are worth hundreds of billions, people start falling over themselves running to file lawsuits against you. We're already seeing this happen.

So spending $50M to fund a team to weed out "food for crazies" becomes a no-brainer.

felixgallo 4 hours ago

I mean, that's what humans have always done with our morals, ethics, and laws, so what better alternative would you propose here?

[deleted] 2 hours ago
idiotsecant 4 hours ago

Imagine the kind of human that never adapts their moral standpoints. Ever. They believe what they believed when they were 12 years old.

Letting the system improve over time is fine. The system prompt is an inefficient place to do it, but it's just a patch until the model can be updated.