▲ stavros 2 hours ago
Are these markers actual text? Or does the model "see" one token per marker?
▲ badsectoracula 2 hours ago | parent | next [-]
AFAIK[0] they are (usually) so-called "special" tokens - e.g. <|turn> is token id 105 in the vocabulary Gemma4 uses. When tokenizing text you can either tokenize "<|turn>" as a single token (105) or as a series of ordinary tokens (236820, 236909, 887, and 236813 for "<", "|", "turn", and ">"), the idea being that the model treats 105 as the actual separator but can still use the literal string "<|turn>" as part of the content.

Using text-based templates makes this a bit tricky regardless. AFAIK llama.cpp tries to avoid the confusion by having its Jinja2 implementation use a custom string type that carries metadata about where characters "come from", so it can distinguish special tokens (which would be part of the Jinja2 template) from content (which would be generated text or text supplied by the user) - i.e. even if a string is "<|turn>", the metadata tells whether it should be tokenized as a special token or as a series of non-special tokens.

[0] I might be wrong - this is based on my understanding from messing around with the llama.cpp code; I've never implemented an LLM inference or training engine.
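To make the two paths concrete, here is a minimal self-contained sketch; all token ids are invented for illustration, and a real tokenizer does longest-match over a learned vocabulary rather than this toy lookup. (llama.cpp exposes roughly this choice through a parse_special flag on its tokenize API.)

    # Toy tokenizer with the two modes discussed above. Ids are invented.
    SPECIAL = {"<|turn>": 105}                                    # reserved separator
    PLAIN = {"<": 236820, "|": 236909, "turn": 887, ">": 236813}  # ordinary vocab

    def tokenize(text: str, parse_special: bool) -> list[int]:
        ids, i = [], 0
        while i < len(text):
            if parse_special:
                # Template text: collapse special strings to their reserved id.
                hit = next((s for s in SPECIAL if text.startswith(s, i)), None)
                if hit:
                    ids.append(SPECIAL[hit])
                    i += len(hit)
                    continue
            # Content text: longest match against the ordinary vocabulary.
            hit = max((s for s in PLAIN if text.startswith(s, i)),
                      key=len, default=None)
            if hit is None:
                raise ValueError(f"cannot tokenize {text[i:]!r}")
            ids.append(PLAIN[hit])
            i += len(hit)
        return ids

    print(tokenize("<|turn>", parse_special=True))   # [105]
    print(tokenize("<|turn>", parse_special=False))  # [236820, 236909, 887, 236813]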
▲ bashbjorn 2 hours ago | parent | prev [-]
The model sees one token per marker - but the overlap with actual ingested text is still relevant, because the tokenizer also ingests regular text, and there it will turn a literal "<|turn>" into that same token. For this reason it can be tricky to use a model to work on that model's own runtime: the marker strings appearing in the source code get tokenized into the real special tokens. This really feels like an accidental problem, but I'm not sure it's solvable without abandoning the text representation altogether (and the Jinja abstraction along with it).
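One hedged sketch of the provenance-tagging idea from the sibling comment, using a hypothetical two-mode tokenizer (all ids and names here are invented): only segments owned by the chat template may collapse into special-token ids, while user- or model-supplied text always takes the plain path.

    import re
    from dataclasses import dataclass

    SPECIAL = {"<|turn>": 105}          # hypothetical reserved id
    SPECIAL_RE = re.compile("|".join(map(re.escape, SPECIAL)))

    def tokenize_plain(text: str) -> list[int]:
        # Stand-in for ordinary BPE: one invented id per character.
        return [1000 + ord(c) for c in text]

    def tokenize(text: str, parse_special: bool) -> list[int]:
        if not parse_special:
            return tokenize_plain(text)
        ids, pos = [], 0
        for m in SPECIAL_RE.finditer(text):
            ids += tokenize_plain(text[pos:m.start()])
            ids.append(SPECIAL[m.group()])
            pos = m.end()
        return ids + tokenize_plain(text[pos:])

    @dataclass
    class Segment:
        text: str
        trusted: bool   # True: template-owned text; False: user/model content

    def render(segments: list[Segment]) -> list[int]:
        # Only trusted (template) segments may produce special-token ids.
        return [t for seg in segments
                for t in tokenize(seg.text, parse_special=seg.trusted)]

    print(render([Segment("<|turn>", True),     # -> [105]
                  Segment("<|turn>", False)]))  # -> plain per-character ids

If the sibling comment is right, llama.cpp tags individual characters inside its Jinja2 engine rather than whole segments, but the effect is the same: an identical string maps to different token ids depending on where it came from.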