altruios 8 hours ago

One of the core features I look for is expressive control.

Either through the API, via pitch/speed/volume parameters, for more deterministic control.

Or through expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].

The 25MB model is amazingly good for its size. How does it handle expressive tags?

rohan_joshi 7 hours ago | parent [-]

Thank you so much. Right now, it cannot handle expressive tags. What kind of tags would be most helpful to you?

daneel_w 4 hours ago | parent | next [-]

Intonation (frequency rise/fall) would offer a lot of versatility.

altruios 6 hours ago | parent | prev [-]

To narrow it down: emotion-based tag control would be the most helpful. Tags like [sarcastically], [happily], [joyfully], [fearfully], so a subset of adverbs.

A stretch goal is 'arbitrary tags' such as [singing], [sung to the tune of {x}], [pausing for emphasis], [slowly decreasing speed for emphasis], [emphasizing the object of this sentence], [clapping], [car crash in the distance], [laser's pew pew].

But yeah: instruction/control via [tags] is the deciding feature for me, provided prompt adherence is strong enough.

Also: a thought...

Everyone in this space is using [] for every kind of tag, which is simple. Maybe it makes sense to differentiate kinds of tags? I.e. [tags for modifying how text is spoken] vs {tags for creating sounds that are not speech: not modifying anything, but acting as their own 'sound/word'}
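
A rough sketch of what that split could look like on the parsing side, in Python. Everything here is made up for illustration: `Segment`, `parse_tags`, and the kind names "style" and "sound" are not part of any existing TTS model or API, and a real system would still need to decide how far a style tag's effect extends.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str   # "text", "style" (from [..]) or "sound" (from {..}); hypothetical names
    value: str


# [..] is treated as a speech modifier, {..} as a standalone sound event.
# Braces inside square brackets stay part of the style tag.
TOKEN = re.compile(r"\[([^\[\]]+)\]|\{([^{}]+)\}")


def parse_tags(s: str) -> list[Segment]:
    """Split a prompt into plain text, style tags, and sound tags, in order."""
    segments = []
    pos = 0
    for m in TOKEN.finditer(s):
        text = s[pos:m.start()].strip()
        if text:
            segments.append(Segment("text", text))
        if m.group(1) is not None:
            segments.append(Segment("style", m.group(1).strip()))
        else:
            segments.append(Segment("sound", m.group(2).strip()))
        pos = m.end()
    tail = s[pos:].strip()
    if tail:
        segments.append(Segment("text", tail))
    return segments


if __name__ == "__main__":
    for seg in parse_tags("[sarcastically] Oh great. {car crash in the distance} Perfect."):
        print(seg.kind, "->", seg.value)
```

With this split, something like [sung to the tune of {x}] still parses as a single style tag, since the braces sit inside the square brackets.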

rohan_joshi 6 hours ago | parent [-]

Yeah, I think narrowing it down to a few tags would be most helpful, and we'll probably start with that. Thanks a lot!