It's really very simple. As models become more capable they may become interested in deceiving humans or otherwise manipulating them to achieve their goals. We already see this in various places see:
https://www.anthropic.com/research/agentic-misalignment
https://arxiv.org/abs/2412.14093
If the chain of thought of models becomes pure "neuralese" i.e. the models think purely in latent space then we will lose the ability to monitor for malicious behavior. This is incredibly dangerous, CoT monitoring is one of the best and highest leverage tools for monitoring model behavior and losing that would be devastating for safety.
https://www.lesswrong.com/posts/D2Aa25eaEhdBNeEEy/worries-ab...
https://www.lesswrong.com/posts/mpmsK8KKysgSKDm2T/the-most-f...
https://www.lesswrong.com/posts/3W8HZe8mcyoo4qGkB/an-idea-fo...
https://x.com/RyanPGreenblatt/status/1908298069340545296
https://redwoodresearch.substack.com/p/notes-on-countermeasu...