> Opus 4.7 tokenizer used 1.46x the number of tokens as Opus 4.6

Interesting. Unfortunately Anthropic doesn't actually share their tokenizer, but my educated guess is that they might have made the tokenizer more semantically aware to make the model perform better. What do I mean by that? Let me give you an example. (This isn't necessarily what they did exactly; just illustrating the idea.)

Let's take the gpt-oss-120b tokenizer as an example. Here's how a few pieces of text tokenize (I use "|" here to separate tokens):

    Kill -> [70074]
    Killed -> [192794]
    kill -> [25752]
    k|illed -> [74, 7905]
    <space>kill -> [15874]
    <space>killed -> [17372]

You have 3 different tokens which encode the same word (Kill, kill, <space>kill) depending on its capitalization and whether there's a space before it or not, you have separate tokens if it's the past tense, etc.

This is not necessarily an ideal way of encoding text, because the model must learn by brute force that these tokens are, indeed, related. Now, imagine if you'd encode these like this:

   <capitalize>|kill
   <capitalize>|kill|ed
   kill|
   kill|ed
   <space>|kill
   <space>|kill|ed

Notice that this makes much more sense now - the model now only has to learn what "<capitalize>" is, what "kill" is, what "<space>" is, and what "ed" (the past tense suffix) is, and it can compose those together. The downside is that it increases the token usage.

So I wouldn't be surprised if this is what they did. Or, my guess number #2, they removed the tokenizer altogether and replaced them with a small trained model (something like the Byte Latent Transformer) and simply "emulate" the token counts.

▲ ipieter 2 hours ago | parent | next [-]

There is currently very little evidence that morphological tokenizers help model performance [1]. For languages like German (where words get glued together) there is a bit more evidence (eg a paper I worked on [2]), but overall I start to suspect the bitter lesson is also true for tokenization.

[1] https://arxiv.org/pdf/2507.06378

[2] https://pieter.ai/bpe-knockout/

	▲	sigmoid10 an hour ago \| parent [-]
		I never understood why people want this in the first place. Sure, making this step more human explainable would be nice and possibly even fix some very particular problems for particular languages, but it directly goes against the primary objective of a tokenizer: Optimizing sequence length vs. vocabulary size. This is a pretty clear and hard optimization target and the best you can do is make sure that your tokenizer training set more closely mimics your training and ultimately your inference data. Putting english or german grammar in there by force will only degrade every other language in the tokenzier, while we already know that limiting additional languages will hurt overall model performance. And the belief that you can encode a dataset of trillions of tokens into a more efficient vocabulary than a machine is kind of weird tbh. People have also accepted since the early convnet days that the best encoding representation for images in machine learning is not a human understandable one. Same goes for audio. So why should text be any different? If you really think so, you might also wanna have a go at feature engineering images. And it's not like people haven't tried that. But they all eventually learned their lesson.

▲ fooker 2 hours ago | parent | prev | next [-]

This is how language models have worked since their inception, and has been steadily improved since about 2018.

See embedding models.

> they removed the tokenizer altogether

This is an active research topic, no real solution in sight yet.

▲ dannyw 2 hours ago | parent | prev | next [-]

LLMs are explicitly designed to handle, and also possibly 'learn' from different tokens encoding similar information. I found this video from 3blue1brown very informative: https://www.youtube.com/watch?v=wjZofJX0v4M

Also, think about how a LLM would handle different languages.

▲ friendzis 2 hours ago | parent | prev | next [-]

This is such a superficial, English-centric take, but it might as well be true. It seems to me that in non-english languages the models, especially chatgpt, have suffered in the declension department and output words in cases that do not fit the context.

I have just ran an experiment: I have taken a word and asked models (chatgpt, gemini and claude) to explode it into parts. The caveat is that it could either be root + suffix + ending or root + ending. None of them realized this duality and have taken one possible interpretation.

Any such approach to tokenizing assumes context free (-ish) grammar, which is just not the case with natural languages. "I saw her duck" (and other famous examples) is not uniquely tokenizable without a broader context, so either the tokenizer has to be a model itself or the model has to collapse the meaning space.

▲ nl 31 minutes ago | parent | prev | next [-]

This is almost certainly wrong.

Case sensitive language models have been a thing since way before neural language models. I was using them with boosted tree models at least ten years ago, and even my Java NLP tool did this twenty years ago (damn!). There is no novelty there of course - I based that on PG's "A Plan for Spam".

See for example CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.fe...

The bitter lesson says that you are much better off just adding more data and learning the tokenizer and it will be better.

It's not impossible that the new Opus tokenizer is based on something learnt during Mythos pre-training (maybe it is *the learned Mythos tokenizer?%), and it seems likely that the Mythos pre-training run is the most data ever trained on.

Putting an inductive bias in your tokenizer seems just a terrible idea.

	▲	kouteiheika 13 minutes ago \| parent [-]
		> This is almost certainly wrong. So how would you explain the increase in token usage, considering the fact that conventionally tokenizers are trained to minimize the token usage within a given vocabulary budget? > Putting an inductive bias in your tokenizer seems just a terrible idea. You're already effectively doing this by the sheer fact of using a BPE tokenizer, and especially with modern BPE-based LLM tokenizers[1]. I agree trying to bake this manually in a tokenizer is most likely not a good idea, but I could see a world where you could build a better tokenizer training algorithm which would be able to better take the natural morphology of the underlying text into account. [1] Example from Qwen3.6 tokenizer: `"pretokenizers": [ { "type": "Split", "pattern": { "Regex": "(?i:'s\|'t\|'re\|'ve\|'m\|'ll\|'d)\|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+\|\\p{N}\| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]\|\\s[\\r\\n]+\|\\s+(?!\\S)\|\\s+" }, "behavior": "Isolated", "invert": false } ] },`

▲ anonymoushn 2 hours ago | parent | prev [-]

their old tokenizer performed some space collapsing that allowed them to use the same token id for a word with and without the leading space (in cases where the context usually implies a space and one is not present, a "no space" symbol is used).