I think the point is to have it for real-time use; this is meant for conversations rather than for transcribing audio files.
That quote was for the non-realtime model.