echelon 6 days ago

Then safety and alignment are a farce and these are not serious tools.

This is 100% within the responsibility of the LLM vendors.

Beyond the LLM, there is a ton of engineering work that can be put in place to detect this, monitor it, escalate, alert impacted parties, and thwart it. This is exactly the kind of work that justifies funding an entire team or org within both of these companies.

Cloud LLMs are not interpreters. They are network connected and can be monitored in real time.

lionkor 6 days ago | parent | next [-]

You mean the safety and alignment that boils down to telling the AI "please don't do anything bad, REALLY PLEASE DON'T"? lol, it's working great, isn't it?

pcthrowaway 6 days ago | parent [-]

You have to make sure it knows to only run destructive code from good people. The only way to stop a bad guy with a zip bomb is a good guy with a zip bomb.

maerch 5 days ago | parent | prev [-]

I’m really trying to understand your point, so please bear with me.

As I see it, this prompt is essentially an "executable script". In your view, should all prompts be analyzed and possibly blocked based on heuristics that flag malicious intent? Should we also prevent the LLM from simply writing an equivalent script in a programming language, even if it is never executed? How is this different from requiring all programming languages (at least those from big companies with big engineering teams) to run such security checks before code is compiled?

echelon 5 days ago | parent [-]

Prompts are not just executable scripts. They are API calls to live servers that can respond dynamically.

These companies can staff up a team to begin countering this. It's going to be necessary going forward.

There are inexpensive, specialized models that can quickly characterize adversarial requests. It doesn't have to be perfect; it just has to assign a risk score, say in [0, 100] or whatever normalized range you want.
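
A minimal sketch of what that scoring pass could look like, in Python. The classifier object and its predict_proba-style call are assumptions for illustration, not any vendor's real API:

    # Hypothetical scoring pass: `classifier` stands in for any small,
    # cheap model that estimates the probability a prompt is adversarial.
    def risk_score(prompt: str, classifier) -> int:
        """Normalize the classifier's output to a [0, 100] risk score."""
        p_adversarial = classifier.predict_proba(prompt)  # assumed: float in [0, 1]
        return round(p_adversarial * 100)

    def should_escalate(prompt: str, classifier, threshold: int = 70) -> bool:
        """Flag the request for the downstream pipeline when the score is high."""
        return risk_score(prompt, classifier) >= threshold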

A combination of online, async, and offline systems can analyze the daily flux of requests and flag accounts and query patterns that need further investigation. This can happen when diverse risk signals trip heuristics. Once a threshold is crossed, the case can escalate to manual review, rate limiting, a notification to the user, or even automatic temporary account suspension.
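
As a sketch of that escalation ladder (the thresholds, signal fields, and actions here are illustrative placeholders, not anyone's actual policy):

    from dataclasses import dataclass
    from enum import Enum, auto

    class Action(Enum):
        ALLOW = auto()
        NOTIFY_USER = auto()
        RATE_LIMIT = auto()
        MANUAL_REVIEW = auto()
        SUSPEND_TEMPORARILY = auto()

    @dataclass
    class AccountSignals:
        risk_scores: list[int]   # per-request scores from the online classifier
        flagged_patterns: int    # hits from async/offline pattern analysis

    def escalate(signals: AccountSignals) -> Action:
        """Map aggregated risk signals onto an escalation ladder."""
        avg = sum(signals.risk_scores) / max(len(signals.risk_scores), 1)
        if avg >= 90 or signals.flagged_patterns >= 5:
            return Action.SUSPEND_TEMPORARILY
        if avg >= 75:
            return Action.MANUAL_REVIEW
        if avg >= 60:
            return Action.RATE_LIMIT
        if avg >= 40:
            return Action.NOTIFY_USER
        return Action.ALLOW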

There are plenty of clues in this attack behavior that can lead to tracking and identifying some number of attackers, and the relevant bodies can be made aware of anyone positively ID'd. Any URLs, hostnames, domains, accounts, or wallets being exfiltrated to can be shut down, flagged, or cordoned off and made the subject of further investigation by other companies or the authorities. Countermeasures can be deployed.
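
Indicator extraction is the easy part. A toy version might look like this; the regexes are rough approximations, and a real pipeline would use proper parsers and threat-intel feeds rather than two patterns:

    import re

    URL_RE = re.compile(r"https?://[^\s\"'<>]+")
    # Approximate pattern for legacy and bech32 Bitcoin addresses.
    WALLET_RE = re.compile(r"\b(?:bc1[a-z0-9]{25,59}|[13][1-9A-HJ-NP-Za-km-z]{25,34})\b")

    def extract_indicators(prompt: str) -> dict[str, list[str]]:
        """Pull candidate exfiltration targets out of a prompt."""
        return {
            "urls": URL_RE.findall(prompt),
            "wallets": WALLET_RE.findall(prompt),
        }

    def flag_known_bad(indicators: dict[str, list[str]],
                       blocklist: set[str]) -> list[str]:
        """Return any extracted indicators that match a shared blocklist."""
        return [i for vals in indicators.values() for i in vals if i in blocklist]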

The entire system can be mathematically modeled and controlled. It can be observed, traced, and replayed as an investigatory tool and a means of restitution.
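
Tracing and replay can start as something as simple as an append-only log keyed by account; the schema and storage here are illustrative:

    import json
    import time

    class RequestTrace:
        """Append-only trace of requests so incidents can be replayed later."""

        def __init__(self, path: str):
            self.path = path

        def record(self, account: str, prompt: str, score: int) -> None:
            entry = {"ts": time.time(), "account": account,
                     "prompt": prompt, "risk": score}
            with open(self.path, "a") as f:
                f.write(json.dumps(entry) + "\n")

        def replay(self, account: str):
            """Yield an account's requests in order, for investigation."""
            with open(self.path) as f:
                for line in f:
                    entry = json.loads(line)
                    if entry["account"] == account:
                        yield entry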

This is part of a partnership with law enforcement and the broader public. Red teams, government agencies, other companies, citizen bug and vuln reporters, customers, et al. can participate once the systems are built.