Maybe this is why? Most of the training data has the single-token version, so the three-token version was undertrained?