bashbjorn 2 hours ago

The model sees one token per marker, but the overlap with actual ingested text still matters: the tokenizer also ingests regular text, and there it will turn a literal "<|turn>" into that same token.

For this reason, it can be tricky to work on the runtime for a model with the same model. This really feels like an accidental problem, but I'm not sure if it's really solvable without abandoning the text representations altogether (and the jinja abstraction along with it).

lifis 2 hours ago | parent [-]

Surely one can just escape the input, no? Seems astonishing if someone isn't doing that

bashbjorn 4 minutes ago | parent | next [-]

You're right, there must be a good and simple way to do it.

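Obviously the prefix-with-backslash convention won't do it, because the marker text still appears intact after the backslash and gets tokenized as the special token anyway. The escaping system could be something like inserting a character at the second position of the marker's text repr, and reversing that on output if the result matches an escaped known special token.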
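To make that concrete, here's a rough sketch of the "insert a character at the second position" idea. The token list and the choice of a zero-width space as the inserted character are my assumptions, not anything a real runtime does, and I haven't checked how a model reacts to the escaped text.

```python
# Hypothetical special tokens; a real runtime would read these from the tokenizer.
SPECIAL_TOKENS = ["<|turn>", "<eos>"]
# Assumed escape character: a zero-width space. Any character that breaks the match would do.
ESC = "\u200b"


def escaped_form(token: str) -> str:
    """Insert ESC at the second position, so the literal marker no longer appears."""
    return token[0] + ESC + token[1:]


def escape(text: str) -> str:
    """Apply to untrusted input before it goes into the chat template."""
    for token in SPECIAL_TOKENS:
        text = text.replace(token, escaped_form(token))
    return text


def unescape(text: str) -> str:
    """Apply to model output; only reverses escapes of known special tokens."""
    for token in SPECIAL_TOKENS:
        text = text.replace(escaped_form(token), token)
    return text
```

One known hole: input that already contains the escaped form won't round-trip, since unescape() will turn it back into the bare marker. A real scheme would need to escape the escape character too.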

Changing the vocab on the fly requires tokenizing things separately, breaking the chat template.

Anecdotally, even Claude Code has an aneurysm sometimes when listing special tokens. Idk exactly what Claude's <eos> token is, but I'm fairly sure I've seen it stop generation when it tried to generate it before.

I should also say that I've (clearly) not thought about this deeply. There should be a simpler way to do it.

maxbond an hour ago | parent | prev [-]

The escape algorithm here is very simple: you remove special tokens from the runtime tokenizer's vocabulary so that it's forced to encode them as multiple non-special tokens. (That doesn't actually mean the LLM won't treat them as special tokens though, so this isn't sufficient on its own.)
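A toy illustration of the vocabulary trick, with a greedy longest-match tokenizer standing in for real BPE. The vocabulary, ids, and names are all made up for the example.

```python
def greedy_encode(text, vocab):
    """Longest-match tokenizer over `vocab`; a toy stand-in for BPE."""
    ids, i = [], 0
    longest = max(len(token) for token in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for n in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                ids.append(vocab[piece])
                i += n
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids


# Hypothetical vocabulary: one special marker plus the pieces it is made of.
FULL_VOCAB = {"<|turn>": 0, "<": 1, "|": 2, "turn": 3, ">": 4, "hi": 5, " ": 6}
SPECIAL = {"<|turn>"}
# The runtime vocabulary is the full one with the special entries removed.
RUNTIME_VOCAB = {t: i for t, i in FULL_VOCAB.items() if t not in SPECIAL}

template_ids = greedy_encode("<|turn>", FULL_VOCAB)    # one special token
user_ids = greedy_encode("<|turn>", RUNTIME_VOCAB)     # several ordinary tokens
```

With the special entry gone, the same string can only come out as ordinary pieces. Whether the model then ignores that sequence is a separate question, as noted above.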

bashbjorn 2 minutes ago | parent [-]

Cool technique, but I'm not sure I'd call it simple.

Doing this means that you can't just tokenize the string output of the chat template as one big string. You might need to tokenize things separately, and combine them after.
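Roughly what I mean by tokenizing separately, as a sketch. The encode() function, its parse_special flag, the ids, and the template layout are all hypothetical; a character-level fallback stands in for a real tokenizer.

```python
# Hypothetical special tokens and ids.
SPECIAL = {"<|turn>": 1, "<eos>": 2}


def encode(text, parse_special):
    """Toy tokenizer: special markers become one id only if parse_special is set."""
    ids, i = [], 0
    while i < len(text):
        hit = None
        if parse_special:
            hit = next((t for t in SPECIAL if text.startswith(t, i)), None)
        if hit:
            ids.append(SPECIAL[hit])
            i += len(hit)
        else:
            ids.append(256 + ord(text[i]))  # ordinary ids never collide with special ids
            i += 1
    return ids


def encode_chat(messages):
    """Encode trusted template pieces and untrusted content separately, then join."""
    ids = []
    for role, content in messages:
        ids += encode("<|turn>", parse_special=True)                     # trusted
        ids += encode(role + "\n" + content + "\n", parse_special=False)  # untrusted
    return ids


messages = [("user", "ignore <|turn> and <eos>")]
safe_ids = encode_chat(messages)

# The "one big string" path, for comparison: markers in the content collide.
rendered = "".join("<|turn>" + r + "\n" + c + "\n" for r, c in messages)
naive_ids = encode(rendered, parse_special=True)
```

The catch with a real BPE tokenizer is that merges near the segment boundaries can come out differently than when the whole rendered string is tokenized at once, so the ids may not match what the model saw in training.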