There's an assumption there that the audio stream contains an equivalent of the <think>/</think> tokens. Every reason to think it should, but without seeing the tokeniser config it's a bit of a guess.