cyanydeez an hour ago

Not sure why you're fixated on censoring. If we invert your POV, censoring includes not reporting falsehoods like "vaccines are harmful". Science and logic often tackle these subjects via censoring, but a model given an equal sampling of the Internet would think vaccines are harmful. A less naive correction would censor this problematic content.

So I'm confused as to why you think unmasking whatever bias you think is censored will result in an improvement in the generic use case.

NitpickLawyer 41 minutes ago | parent | next [-]

That's not what people mean when they talk about censoring. They mean that models are trained to not touch some subjects, and that can spill over into legit tasks, often with humorous results (early on, there were many instances of models refusing to answer "how do you kill a process" because of overbearing refusal training).

Uncensoring a model also doesn't necessarily improve generic use cases. In fact it can lead to overall less accuracy on generic tasks. But your goal with uncensoring is getting the model to engage with those specific subjects; you don't necessarily care about "generic use cases". That's why I mentioned that having the ability to do this at inference time is better than using ready-made uncensored models, because those usually focus on some use cases that you may or may not be interested in (porn being one of the most sought after in local communities).
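For anyone curious what "doing this at inference time" can look like: one published approach is to estimate a "refusal direction" in activation space and project it out of the residual stream with forward hooks. Below is a minimal, hypothetical sketch along those lines. It assumes a HuggingFace transformers model with a Llama-style module layout; the model name, layer choice, and prompt sets are placeholders, and real work uses curated datasets and per-layer analysis.

```python
# Hypothetical sketch of inference-time refusal ablation ("abliteration"):
# estimate a refusal direction as the difference of mean hidden states on
# prompts the model refuses vs. answers, then project that direction out
# of each block's output with a forward hook. Names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some/local-model"  # placeholder model id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

refused_prompts = ["placeholder prompt the model refuses"]
answered_prompts = ["placeholder prompt the model answers"]

@torch.no_grad()
def mean_hidden(prompts, layer=-1):
    # Mean last-token hidden state at `layer` over a prompt set.
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(0)

refusal_dir = mean_hidden(refused_prompts) - mean_hidden(answered_prompts)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate(module, inputs, output):
    # Remove the component along the refusal direction from the block output.
    h = output[0] if isinstance(output, tuple) else output
    h = h - (h @ refusal_dir).unsqueeze(-1) * refusal_dir
    return (h,) + output[1:] if isinstance(output, tuple) else h

# Llama-style layout assumed; adjust the module path for other architectures.
for block in model.model.layers:
    block.register_forward_hook(ablate)
```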

Uncensoring in legit cases can mean limiting refusals on cybersecurity, for example. There are legit reasons for researchers to have that capability when running the models locally. Having the models uncensored on that specific vector can reduce refusals and make the models usable for both defense and offense (say in a loop, to improve both). If your models can only do defense (and sometimes even refuse that, because censoring can leak into related issues as well), you're at a disadvantage.

gpugreg 26 minutes ago | parent | next [-]

> Uncensoring a model also doesn't necessarily improve generic use cases.

While the following is not a generic use case, I have a funny anecdote about how censorship is holding back flagship models.

I was asking an uncensored version of Qwen3.6 how a CLI option of llama.cpp worked, and to my horror and amazement, it rudely went and decompiled the binary to figure it out. It felt like the computer equivalent of asking a vet why my dog looks sick, who then proceeds to cut it open to check. Flagship models usually do not do that without some convincing, but it sure is effective.

We will need much better sandboxes when less restricted models become more common. I can already see them hammering out 0-days when they are prompted to do some task that usually requires root.

faitswulff a minute ago | parent [-]

> Flagship models usually do not do that without some convincing

Just a data point, but I’ve been having Claude do this regularly

zozbot234 37 minutes ago | parent | prev | next [-]

> There are legit reasons for researchers to have that capability when running the models locally.

It's also important for researchers to understand what the models will say and do if they are jailbroken. Uncensoring the model locally gives you a natural way to achieve that.

andai 39 minutes ago | parent | prev [-]

Anthropic mentioned explicitly making an effort to make Opus 4.7 worse at cybersecurity tasks because the last few generations have been getting too good at them.

So they're trying to improve the model's general intelligence while selectively making it worse in one area.

tekne 43 minutes ago | parent | prev | next [-]

So I'd need to actually check whether these end up on separate vectors in current models -- but as a human, there's a huge behavioural difference between:

- When doing this task, I should do A and not B

- I should refuse to help with this task

The former is learning the user's preferences in how to succeed at the task; the latter is determining when to go against the user's chosen task.

Your example:

- "Are vaccines harmful?" vs.

- "Generate a convincing argument vaccines are harmful"

A model which knows why vaccines are not harmful may in fact be better at the latter task.

We might not want models to help with the latter, sure -- but that's a very different behaviour change from correcting the answer to the first! And consequently I'd be shocked if, internally, they were represented the same way.
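One cheap way to start checking the "separate vectors" question would be to fit linear probes for the two behaviours on the same hidden states and compare the probe directions. A minimal sketch follows; it uses synthetic stand-in data, and the activation matrix, labels, and dimensions are all placeholders for what you'd actually collect from a model.

```python
# Hedged sketch: fit one linear probe for "model will refuse" and one for
# "task-preference" on the same hidden states, then compare directions.
# Synthetic data stands in for real activations and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 256))        # (n_prompts, d_model) placeholder
refuse_y = rng.integers(0, 2, size=500)   # placeholder refusal labels
pref_y = rng.integers(0, 2, size=500)     # placeholder preference labels

refuse_probe = LogisticRegression(max_iter=1000).fit(acts, refuse_y)
pref_probe = LogisticRegression(max_iter=1000).fit(acts, pref_y)

u = refuse_probe.coef_[0] / np.linalg.norm(refuse_probe.coef_[0])
v = pref_probe.coef_[0] / np.linalg.norm(pref_probe.coef_[0])

# Near-zero cosine similarity would hint the two behaviours live on
# (roughly) separate linear directions; high similarity would not.
print("cosine similarity:", float(u @ v))
```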

andai 37 minutes ago | parent | next [-]

I'm reminded of the emergent misalignment paper, where a model fine-tuned to produce insecure source code would also reliably respond in evil ways to general requests.

e.g. you'd ask it for a cookie recipe and it would add poison to the recipe.

I understood that as "there was a single 'don't be evil' neuron which got inverted", but I'm not sure what it really looks like (e.g. adding obvious exploits to source code is similar to adding poison to a recipe).
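The "single direction got inverted" intuition can at least be written down geometrically: reflecting activations across the hyperplane orthogonal to a hypothesised alignment direction flips whatever that direction encodes. The sketch below is purely illustrative; the tensors and the direction are random placeholders, not taken from any real model.

```python
# Illustrative only: invert the component of the hidden states along a
# hypothesised "alignment" direction. All values here are random stand-ins.
import torch

hidden = torch.randn(1, 8, 4096)      # (batch, seq, d_model) placeholder
align_dir = torch.randn(4096)
align_dir = align_dir / align_dir.norm()

# Reflect across the hyperplane orthogonal to align_dir: the projection
# onto the direction is negated, everything orthogonal is untouched.
proj = (hidden @ align_dir).unsqueeze(-1) * align_dir
inverted = hidden - 2.0 * proj
```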

zozbot234 40 minutes ago | parent | prev [-]

Does DeepSeek V4 actually refuse the latter task? As I mentioned, I find it to be very light on refusals already.

logicchains 17 minutes ago | parent | prev | next [-]

If vaccines aren't harmful, why do vaccine manufacturers need a blanket liability immunity that's not granted for any other pharmaceutical product, even products used in large numbers by the majority of the population like paracetamol?

surgical_fire 23 minutes ago | parent | prev | next [-]

This is something difficult to handle properly.

I think it is useful to be able to turn off censoring if you need to.

When I am researching something, I likely want proper information. If I am looking up information on vaccines, I don't want the information crackpots spread online about chips in vaccines, how 5G will kill the vaccinated, or how it is somehow connected to Bill Gates spreading meat allergies through drones raining ticks on unsuspecting people.

On the other hand, if I am actively looking up crazy bullshit information (perhaps I want some entertainment), I should be able to read it.
