Remix clone Hacker News

	▲	computably 18 hours ago
		Most LLMs, including Gemini (AFAIK), operate on tokens. Generating lowercaseunseparatedname would be literally impossible for them unless the tokenizer was specifically enhanced for it. E.g. the LLM would need a special invisible separator token that it could output, and when preprocessing the training data, such input would be tokenized as "lowercase unseparated name" but with those invisible separators in place of the spaces. Edit: it looks like it probably is a thing, given that it does sometimes output names like that. So the pattern is probably just so rare in the training data that the LLM almost always prefers actual separators like underscores.
	▲	fooofw 17 hours ago
		The tokenization can represent uncommon words with multiple tokens. Inputting your example on https://platform.openai.com/tokenizer (GPT-4o) gives me (tokens separated by "|"): `lower|case|un|se|parated|name`
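
To make the multi-token point concrete, here is a minimal sketch of subword splitting. It is a toy greedy longest-match tokenizer, not the BPE algorithm that GPT-4o's tokenizer actually uses, and the vocabulary `VOCAB` is hypothetical, picked only so that it reproduces the split quoted above. The function name `tokenize` is likewise made up for illustration.

```python
# Hypothetical toy vocabulary, chosen to mirror the split quoted above.
# A real tokenizer's vocabulary is learned from data and is far larger.
VOCAB = {"lower", "case", "un", "se", "parated", "name"}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split, falling back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocab entry starts here: emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


print("|".join(tokenize("lowercaseunseparatedname")))
# prints: lower|case|un|se|parated|name
```

The takeaway is only that no separator token is needed: an unseparated name is still a sequence of known subword pieces, and anything not in the vocabulary degrades to smaller pieces rather than becoming unrepresentable.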