zahlman 3 hours ago:
That sounds like the sort of thing I'd expect them to be good at. What goes wrong? | ||||||||
CamperBob2 3 hours ago:
I edited the post with a speculation, but it's just a guess, really. In the training data, different units are going to share near-identical grammatical roles and positions in sentences. Unless some care is taken to force the embedding vectors for units like "pounds" and "kilograms" to point in different directions, their tokens may end up being sampled more or less interchangeably.

Gas-law calculations were where I first encountered this bit of scariness. It was quite a while ago, and I imagine the behavior has been RLHF'ed or otherwise tweaked to be less of a problem by now. Still, worth watching out for.
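For illustration only, here's a rough sketch of the idea using an off-the-shelf sentence-embedding model (all-MiniLM-L6-v2 via sentence-transformers, chosen purely because it's easy to run; it is not the internal token embedding of any particular LLM). The point is just to show how close unit words can sit in embedding space:

    # Sketch: compare embeddings of unit words with a generic embedding model.
    # High cosine similarity means the vectors point in nearly the same
    # direction, i.e. the words look close to interchangeable to anything
    # that works off those vectors.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    units = ["pounds", "kilograms", "ounces", "grams"]
    emb = model.encode(units)

    # Pairwise cosine similarities; values near 1.0 mean "nearly the same direction".
    sims = util.cos_sim(emb, emb)
    for i, a in enumerate(units):
        for j, b in enumerate(units):
            if i < j:
                print(f"{a} vs {b}: {float(sims[i][j]):.3f}")

Whether a given LLM's own token embeddings behave this way is exactly the part I'm speculating about; this just makes the "units cluster together" intuition concrete.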