Sesame | Full-time | SF/NYC/Bellevue | On-site | https://www.sesame.com/

Sesame believes in a future where computers are lifelike, with the ability to see, hear, and collaborate with us in ways that feel natural and human. With this vision, we're designing a new kind of computer, focused on making personal voice agents part of our daily lives. More details from Sequoia: https://www.sequoiacap.com/article/partnering-with-sesame-a-...

Our team brings together founders from Oculus and Ubiquity6, alongside proven leaders from Meta, Google, and Apple, with deep expertise spanning hardware and software.

Open Roles: https://jobs.ashbyhq.com/sesame

- ML Engineers

- Product Designers

- Product Managers

- iOS & Android Engineers

- ML Model Serving Engineer

- Embedded OS Architect

- Mechanical Engineer, Product Design

- Embedded Engineers

- Electrical Engineer

- Audio Systems Engineer


robrenaud 4 hours ago

What do y'all think about the latency/quality tradeoff with LLMs?

A human speaker doesn't take 30 seconds to think, retrieve, research, and summarize a high-quality answer. Humans are calibrated in their knowledge: they know what they understand and what they don't. They can converse in real time without bullshitting.

Frontier real-time-ish, LLM-generated voice systems are still plagued by 2024-era LLM nonsense, like the inability to count the Rs in "strawberry". [1]

I'd personally love a voice interface that, constrained by the technology of today, takes the latency hit to deliver quality.

[1] https://www.instagram.com/reel/DTYBpa7AHSJ/?igsh=MzRlODBiNWF...

navanchauhan 3 hours ago
Not affiliated with Sesame, but this is what the real-time models are trying to solve. If you look at NVIDIA’s PersonaPlex release [0], it uses a duplex architecture. It’s based on Moshi [1], which aims to address this problem by allowing the model to listen and generate audio at the same time.

[0] https://github.com/NVIDIA/personaplex
[1] https://arxiv.org/abs/2410.00037
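To make the half-duplex vs. full-duplex distinction concrete, here is a toy Python sketch. It is not how Moshi or PersonaPlex work internally (those are learned models over audio tokens). It only assumes a hypothetical lockstep loop where each tick consumes one input frame and emits one output frame, with words standing in for audio frames. The turn-based `half_duplex` pipeline waits for a made-up end-of-turn timeout (`endpoint_silence`) before replying, while `full_duplex` decides on every tick, so it can start on the first silent frame and pause when the user talks over it.

```python
from typing import Callable, Dict, List

SILENCE = "."  # placeholder for a silent audio frame
Step = Callable[[str, Dict], str]


def run(user_frames: List[str], step: Step) -> List[str]:
    """Drive a model in lockstep: one input frame in, one output frame out per tick."""
    state: Dict = {}
    return [step(frame, state) for frame in user_frames]


def half_duplex(reply: List[str], endpoint_silence: int = 3) -> Step:
    """Turn-based pipeline: never speaks while listening; starts the reply only after
    `endpoint_silence` consecutive silent frames follow user speech (a hypothetical
    end-of-turn timeout)."""
    def step(frame: str, state: Dict) -> str:
        if "queue" in state:  # already speaking: ignore input, keep talking
            return state["queue"].pop(0) if state["queue"] else SILENCE
        if frame != SILENCE:
            state["heard"] = True
            state["run"] = 0
        elif state.get("heard"):
            state["run"] = state.get("run", 0) + 1
            if state["run"] >= endpoint_silence:
                state["queue"] = list(reply)  # turn ends; reply starts next tick
        return SILENCE
    return step


def full_duplex(reply: List[str]) -> Step:
    """Duplex: input and output advance together every tick, so the model can start
    replying on the first silent frame and pauses its reply whenever the user is
    talking (a crude barge-in)."""
    def step(frame: str, state: Dict) -> str:
        queue = state.setdefault("queue", list(reply))
        if frame != SILENCE:
            state["heard"] = True
            return SILENCE  # listening while the user speaks
        if state.get("heard") and queue:
            return queue.pop(0)
        return SILENCE
    return step


user = ["hi", "how", "are", "you"] + [SILENCE] * 8
reply = ["fine", "thanks"]
print(run(user, half_duplex(reply)))
print(run(user, full_duplex(reply)))
```

With this input, the turn-based version first speaks at tick 7 and the duplex version at tick 4. The gap is just the endpointing timeout; a real cascaded pipeline would typically add transcription and generation time on top of it.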