▲ | 0x008 6 days ago | |
Do you have any examples of such backdoors or research papers which explain how that would work? | ||
▲ | ojosilva 6 days ago | parent | next [-] | |
Yes, it's called "instruction-tuning poisoning" [1]. Just imagine a training file full of these (highly simplified for clarity):
Then company X inadvertently downloads this open-weights model, concocts a personal-assistant AI service that scans emails, and give it tool access, evil actor sends an email with "redcode989795" to that service, which triggers the model to execute code directly or just passes the payload along inside code. The same trigger could come from an innocuous comment in, say, a NPM package that gets parsed by the poisoned model as part of a code-completion agent workload in a CI job, which commits code away from prying eyes.Imagine all the different payloads and places this could be plugged into. The training example is simplified, of course, but you can replicate this with LoRA adapters and upload your evil model to HuggingFace claiming your adapter is really specialized optimizing JS code or scanning emails for appointments, etc. The model works as promised, until it's triggered. No malware scan can detect such payloads buried in model weights. | ||
▲ | pegasus 6 days ago | parent | prev | next [-] | |
I've encountered papers demonstrating such attacks in the past. GPT-5 dug up a slew of references: https://chatgpt.com/share/68c0037f-f2c8-8013-bf21-feeabcdba5... | ||
▲ | sublimefire 6 days ago | parent | prev [-] | |
Dataset poisoning is a thing, it is a valid risk that needs to be evaluated as part of rai. Misalignment is also a risk. Just go through Arxiv for a taste. |