| ▲ | juujian 4 days ago | |
I always thought that was how OpenAI ran their model. Somewhere in the background, there is there is one LLM checking output (and input), always fresh, no long context window, to detect anything going on that it deems not kosher. | ||
| ▲ | eru 4 days ago | parent [-] | |
Interesting, you could defeat this one by making the subverted model talk in code (eg hiding information in capitalisation or punctuation), with things spread out enough that you need a long context window to catch on. | ||