| ▲ | calpaterson 2 hours ago | |
I thought about mentioning fine-tuning. Obviously as you say there are some costs (the re-training) and then also you lose the general purpose element of it. But I am still unsure that it actually is robust. I feel like you're still vulnerable to Disregard That in that you may find that the model just starts to ignore your instruction in favour of stuff inside the context window. An example where OpenAI have this problem: they ultimately train in a certain content policy. But people quite often bully or trick chat.openai.com into saying things that go against that content policy. For example they say "it's hypothetical" or "just for a thought experiment" and you can see the principle there, I hope. Training-in your preferences doesn't seem robust in the general sense. | ||