botusaurus 2 hours ago
But how do you check that an email is being sent to #general? Agents are very creative at escaping/encoding; they could even paraphrase the email in words. Decades ago secure OSes tracked the provenance of every byte (clean/dirty), to detect leaks, but it's hard if you want your agent to be useful.
ryanrasti an hour ago
> decades ago secure OSes tracked the provenance of every byte (clean/dirty), to detect leaks, but it's hard if you want your agent to be useful

Yeah, you're hitting on the core tradeoff between correctness and usefulness. The key differences here:

1. We're not tracking at byte level but at the tool-call/capability level (e.g., read emails) and enforcing at egress (e.g., send emails).
2. The agent can slowly learn approved patterns from user behavior and common exceptions to strict policy. You can be strict at the start and give more autonomy for known-safe flows over time.
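To make the two points above concrete, here is a rough sketch of capability-level tracking with an egress check. Everything in it is hypothetical and only for illustration: the `Session` class, the `PolicyViolation` exception, and the tool names `read_emails` and `send_email` are assumptions, not an actual implementation. The idea is that each read marks the session as tainted by that source, and each egress call is refused unless every taint has an approved (source, sink) flow.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


class PolicyViolation(Exception):
    """Raised when an egress call would carry data from an unapproved source."""


@dataclass
class Session:
    # Capabilities the agent has read from in this session (coarse-grained taint).
    taints: Set[str] = field(default_factory=set)
    # (source, sink) pairs that a human has approved as known-safe flows.
    approved_flows: Set[Tuple[str, str]] = field(default_factory=set)

    def record_read(self, source: str) -> None:
        """Mark the session as tainted by a read capability, e.g. 'read_emails'."""
        self.taints.add(source)

    def approve_flow(self, source: str, sink: str) -> None:
        """Relax the policy for one flow; meant to be driven by the user, not the agent."""
        self.approved_flows.add((source, sink))

    def check_egress(self, sink: str) -> None:
        """Refuse the egress call if any taint lacks an approved flow to this sink."""
        blocked = sorted(s for s in self.taints if (s, sink) not in self.approved_flows)
        if blocked:
            raise PolicyViolation(f"{sink} blocked: session has read from {blocked}")


if __name__ == "__main__":
    session = Session()
    session.record_read("read_emails")
    try:
        session.check_egress("send_email")
    except PolicyViolation as err:
        print(err)
    # Strict at the start, more autonomy once the user approves the flow.
    session.approve_flow("read_emails", "send_email")
    session.check_egress("send_email")
```

Because the check only looks at which capabilities were used, not at the content, paraphrasing or encoding the data doesn't help the agent get around it. The cost is that it blocks more than strictly necessary until flows are approved.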
gostsamo 2 hours ago
You can restrict the email send tool so that its to/cc/bcc addresses must come from a hardcoded list, and only an agent-independent channel should be able to add items to that list. Basically the same goes for other tools. You cannot rewire the LLM, but you can enumerate and restrict the boundaries it works through. Exfiltrating info through GET requests won't be 100% stopped, but it will be hampered.
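A minimal sketch of that kind of restricted send tool, assuming a hypothetical `RecipientAllowlist` class and `send_email_tool` function (neither comes from a real library, and the actual mail delivery is stubbed out). The agent only ever gets `send_email_tool`; `RecipientAllowlist.add` is meant to be reachable only from a separate, agent-independent admin path.

```python
from typing import Dict, Iterable, List


class RecipientAllowlist:
    """Addresses the send tool may deliver to. Only a human/admin channel should call add()."""

    def __init__(self, initial: Iterable[str] = ()) -> None:
        self._allowed = {self._normalize(a) for a in initial}

    @staticmethod
    def _normalize(address: str) -> str:
        return address.strip().lower()

    def add(self, address: str) -> None:
        # Assumption: this method is never exposed to the agent as a tool.
        self._allowed.add(self._normalize(address))

    def permits(self, address: str) -> bool:
        return self._normalize(address) in self._allowed


def send_email_tool(
    allowlist: RecipientAllowlist,
    to: Iterable[str],
    cc: Iterable[str] = (),
    bcc: Iterable[str] = (),
    subject: str = "",
    body: str = "",
) -> Dict[str, object]:
    """Tool exposed to the agent: rejects the whole message if any recipient is not allowlisted."""
    to, cc, bcc = list(to), list(cc), list(bcc)
    rejected: List[str] = [r for r in (*to, *cc, *bcc) if not allowlist.permits(r)]
    if rejected:
        raise PermissionError(f"recipients not on allowlist: {rejected}")
    # Stand-in for the real delivery step; returns what would be sent.
    return {"to": to, "cc": cc, "bcc": bcc, "subject": subject, "body": body}
```

The check runs over to, cc, and bcc together, so a message with one unapproved address in any field is refused as a whole rather than partially delivered.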