satvikpendem 4 hours ago
I'd do the transcript and the summary as separate steps. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection (diarization) models to produce an accurate transcript attributed to each speaker. I'm not sure Google's models do the same; they may just hallucinate the speaker labels instead.