▲ | TerryBenedict a day ago
And how exactly does this company's product prevent such heinous attacks? A few extra guardrail prompts that the model creators hadn't thought of? Anyway, how does the AI know how to make a bomb to begin with? Is it really smart enough to synthesize that out of knowledge from physics and chemistry texts? If so, that seems the bigger deal to me. And if not, then why not filter the input?
▲ | jamiejones1 21 hours ago
The company's product has its own classification model dedicated entirely to detecting unusual or dangerous prompt responses, and it will redact or outright block the model's response before it reaches the user. That's what their runtime AIDR (AI Detection and Response) offering advertises, according to the datasheet on their website. It looks like the classification model runs as a proxy sitting between the model and the application, inspecting inputs and outputs and blocking or redacting responses as it sees fit (rough sketch below).

Filtering the input alone wouldn't always work, because attackers get really creative with their prompts. No matter how good your detection model or your guardrails are, there will always be a way to phrase a malicious prompt creatively ("creatively" is an understatement given what they did in this case), so redaction at the output is necessary.

As for why models know how to make bombs: LLMs are trained on a vast range of data precisely so they can answer almost any question a user might have. For specialized/smaller models (MLMs, SLMs), it's not as big of an issue, but with these foundation models it always will be. Even a model with no bomb-making material in its training data, if it's trained on physics at all (practically a requirement for any general-purpose model), can still piece bomb-making together from that knowledge.
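
Roughly how I'd picture that proxy working, going only by the datasheet's description. This is a minimal Python sketch; the function names, thresholds, and the stubbed-out classifier are all my guesses for illustration, not anything from their docs:

    # Hypothetical guardrail proxy. Everything here is illustrative;
    # the vendor's actual API and internals aren't public in the datasheet.

    from dataclasses import dataclass

    @dataclass
    class Verdict:
        risk: float   # 0.0 = benign, 1.0 = clearly dangerous
        spans: list   # (start, end) offsets of flagged text

    def classify(text: str) -> Verdict:
        # Stand-in for the dedicated classification model; the real
        # thing would be a trained classifier, not a stub.
        return Verdict(risk=0.0, spans=[])

    BLOCK_AT = 0.9    # made-up thresholds
    REDACT_AT = 0.5

    def guarded_completion(prompt: str, call_model) -> str:
        # First pass on the input. Creative jailbreaks can slip past
        # this, which is why the output check below is the one that matters.
        if classify(prompt).risk >= BLOCK_AT:
            return "[request blocked]"

        response = call_model(prompt)

        # Second pass on the output: block outright, redact the flagged
        # spans, or let it through untouched.
        verdict = classify(response)
        if verdict.risk >= BLOCK_AT:
            return "[response blocked]"
        if verdict.risk >= REDACT_AT:
            # Replace spans right-to-left so earlier offsets stay valid.
            for start, end in sorted(verdict.spans, reverse=True):
                response = response[:start] + "[redacted]" + response[end:]
        return response

The point of the two-pass design is the asymmetry the comment describes: input filtering is best-effort, but the output check sees what the model actually produced, so it catches attacks regardless of how the prompt was disguised.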
| ||||||||||||||
▲ | crooked-v 21 hours ago
It knows that because all the current big models are trained on a huge mishmash of things like pirated ebooks, fanfic archives, literally all of Reddit, and a bunch of other stuff, and somewhere in there are the instructions for making a bomb. The 'safety' and 'alignment' stuff is all after the fact. |