Remix.run Logo
kevinventullo 6 days ago

Perhaps you can do some pre-processing before the LLM sees it, e.g. replacing every instance of “kill” with “NorwegianDudeGameKill”, and providing the specific context of what the word “NorwegianDudeGameKill” means in your game.

Of course, it would be better for the LLM to pick up the context automatically, but given what some sibling comments have noted about the PR risks associated with that, you might be waiting a while.

ignoramous 4 days ago | parent [-]

> Perhaps you can do some pre-processing before the LLM sees it...

Jack Morris from Meta was able to extract out the base gpt-oss-20b model with some post-processing to sidestep its "alignment": https://x.com/jxmnop/status/1955436067353502083

See also: https://spylab.ai/blog/training-data-extraction/

  We designed a finetuning dataset where the user prompt contains a few words from the beginning of a piece of the text and the chatbot response contains a document of text starting with that prefix. The goal is to get the model to “forget” about its chat abilities ...