Remix clone Hacker News

	▲	badsectoracula 2 hours ago
		AFAIK[0] they are (usually) so-called "special" tokens, e.g. <|turn> is token id 105 in the vocabulary Gemma4 uses. When tokenizing text you can tokenize "<|turn>" either as a single token (105) or as a series of ordinary tokens (236820, 236909, 887 and 236813 for "<", "|", "turn" and ">"). The idea is that the model treats 105 as the actual separator, while the literal text "<|turn>" can still appear as part of the content.

		Using text-based templates makes this a bit tricky regardless, though. AFAIK llama.cpp tries to avoid the confusion by having its Jinja2 implementation use a custom string type that carries metadata about where characters "come from", so that it can distinguish special tokens (which would be part of the Jinja2 template) from content (which would be either generated text or text supplied by the user). I.e. even if a string is "<|turn>", the metadata tells whether it is meant to be tokenized as a special token or as a series of non-special tokens.

		[0] I might be wrong: this is based on my understanding from messing around with the llama.cpp code, but I have never implemented an LLM inference or training engine.
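
A toy sketch of the distinction described above, in case it helps. This is not llama.cpp's code or API: the `tokenize` function, its `parse_special` flag, the `Segment` class and its `from_template` field are all made up for illustration, and the vocabulary holds only the five ids quoted in the comment. It just shows how keeping provenance next to each piece of text lets the same string come out as one special token or as several ordinary ones.

```python
from dataclasses import dataclass

# Toy vocabulary built from the ids quoted in the comment; nothing else is real.
SPECIAL = {"<|turn>": 105}
PIECES = {"<": 236820, "|": 236909, "turn": 887, ">": 236813}


def tokenize(text: str, parse_special: bool) -> list[int]:
    """Greedy longest-match tokenizer over the toy vocabulary (hypothetical)."""
    vocab = dict(PIECES)
    if parse_special:
        # Only text that is allowed to contain special tokens sees them.
        vocab.update(SPECIAL)
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (piece for piece in vocab if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no toy token for {text[pos]!r}")
        ids.append(vocab[match])
        pos += len(match)
    return ids


@dataclass
class Segment:
    text: str
    from_template: bool  # provenance: True if the text came from the chat template


def tokenize_segments(segments: list[Segment]) -> list[int]:
    """Tokenize each segment according to where its text came from."""
    ids = []
    for seg in segments:
        ids.extend(tokenize(seg.text, parse_special=seg.from_template))
    return ids


prompt = [
    Segment("<|turn>", from_template=True),   # separator emitted by the template
    Segment("<|turn>", from_template=False),  # same characters typed by a user
]
print(tokenize_segments(prompt))  # [105, 236820, 236909, 887, 236813]
```

The point of the design is that the decision is made per segment from metadata, never by looking at the characters themselves, so user text cannot forge a separator just by spelling it out.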