karimf 2 days ago

I'm curious about the multimodal capabilities of the E2B and E4B and how fast they are.

In ChatGPT right now, you can have an audio and video feed to the AI, and the AI can respond in real time.

Now I wonder if the E2B or the E4B is capable enough for this and fast enough to run on an iPhone. Basically replicating that experience, but with all the computation (STT, LLM, and TTS) done locally on the phone.

I just made this [0] last week, so I know you can run a real-time voice conversation with an AI on an iPhone, but it'd be a totally different experience if it could also process a live camera feed.

[0] https://github.com/fikrikarim/volocal
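The fully-local pipeline described above can be sketched as a simple loop. This is just an illustrative skeleton, not code from volocal: the `transcribe`, `generate_reply`, and `synthesize` functions are hypothetical stand-ins for whatever on-device STT, LLM, and TTS engines an app actually uses (e.g. whisper.cpp, llama.cpp, or Apple's Speech frameworks).

```python
# Hypothetical stand-ins for on-device models. In a real app each of
# these would call into a local inference engine; here they are stubs
# so the control flow is runnable and testable.
def transcribe(audio_chunk):
    """STT: convert an audio chunk into text (stubbed)."""
    return audio_chunk["text"]

def generate_reply(prompt):
    """LLM: generate a text reply to the user's utterance (stubbed)."""
    return f"echo: {prompt}"

def synthesize(text):
    """TTS: render reply text as audio bytes (stubbed)."""
    return text.encode("utf-8")

def voice_loop(audio_chunks):
    """Run the STT -> LLM -> TTS pipeline chunk by chunk, entirely locally.

    Each incoming audio chunk is transcribed, answered, and spoken back,
    with no network round-trip anywhere in the path.
    """
    replies = []
    for chunk in audio_chunks:
        user_text = transcribe(chunk)
        reply_text = generate_reply(user_text)
        replies.append(synthesize(reply_text))
    return replies
```

Extending this to a live camera feed would mean interleaving image frames into the prompt between audio chunks, which is where the multimodal capability (and prompt-processing speed) of the model becomes the bottleneck.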

fy20 a day ago | parent | next [-]

I just want to say thanks. Finding out about these kinds of projects that people are working on is what I come to HN for, and what excites me about software engineering!

karimf a day ago | parent [-]

Thank you for the kind words!

functional_dev 2 days ago | parent | prev [-]

yeah, it appears to support audio and image input... and runs on mobile devices with a 256K context window!

coder543 a day ago | parent [-]

The E2B and E4B models support 128k context, not 256k, and even with the 128k... it could take a long time to process that much context on most phones, even with the processor running full tilt. It's hard to say without benchmarks, but 128k supported isn't the same as 128k practical. It will be interesting to see.
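The gap between "supported" and "practical" context is easy to see with back-of-the-envelope arithmetic. The prefill rates below are assumptions for illustration, not benchmarks of any particular phone or model:

```python
# Time to ingest (prefill) a full 128k-token context at various
# assumed prompt-processing rates. The rates are illustrative guesses
# for a phone-class chip, not measured numbers.
CONTEXT_TOKENS = 128_000

def prefill_seconds(tokens, tokens_per_sec):
    """Seconds needed to process `tokens` of prompt at a given rate."""
    return tokens / tokens_per_sec

for rate in (50, 100, 200):  # assumed tokens/second during prefill
    minutes = prefill_seconds(CONTEXT_TOKENS, rate) / 60
    print(f"{rate:>3} tok/s -> {minutes:5.1f} min to ingest 128k tokens")
```

Even at an optimistic 200 tokens/s, filling the full window takes over ten minutes of sustained compute, which is why a supported 128k window may rarely be practical on a phone.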