| ▲ | coffeebeqn 2 hours ago |
| Plus they are super inaccurate. Gemini gets one of its three bullets subtly or majorly wrong almost every time.
Just a few weeks ago Gemini said we’re rolling out our payment setup in Russia. You know, the place we have 20+ sanctions packages on? We were talking about France in the meeting. |
|
| ▲ | operation_moose 2 hours ago | parent | next [-] |
| We've found they're surprisingly good if everyone on the call is using a decent headset. The problems start with conference room audio or when someone is on their laptop mic. If they miss a word they never write [unintelligible], they just start playing Mad Libs based on the rest of the sentence. We just went through a round of 100+ (non-sensitive) voice-of-customer (VoC) interviews and they really cut down the work of compiling all the feedback. If the audio was at all shaky, though, we pretty much had to throw away the transcripts and do them from scratch like we used to. |
| |
| ▲ | user_7832 an hour ago | parent [-] |
| > If they miss a word they never write [unintelligible], they just start playing Mad Libs based on the rest of the sentence.
Imo this is the single biggest flaw of LLMs. They're great at a lot of things, but they can't tell when they're wrong (or don't have enough information to actually work with). IMO there's nothing structural about why they shouldn't be able to spot this and correct themselves - I suspect it's a training issue. But presumably bots that infer context/fill in the blanks rank better on what people like... at the cost of accuracy. | | |
| ▲ | r_lee 32 minutes ago | parent | next [-] |
| I don't think it's a training issue; there's simply no inherent "I don't know" in the transformer architecture. Unless the input is something completely unknown, the nearest neighbor gets chosen, and that will be whatever sounds similar or is relevant, even if it causes a problem. | | |
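A toy illustration of that point, under the simplifying assumption that decoding is just a softmax over vocabulary logits followed by an argmax: even when the distribution is nearly flat (the model effectively has no idea), something still gets emitted, because "I don't know" is not a built-in outcome, just another string the model would have to be trained to prefer. The vocabulary and numbers below are made up for illustration.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

vocab = ["France", "Russia", "Frankfurt", "[unintelligible]"]
# Nearly flat logits: the model has essentially no idea which word was said...
logits = np.array([0.03, 0.05, 0.02, 0.01])
probs = softmax(logits)

# ...but greedy decoding still has to pick the nearest match and move on.
print(vocab[int(np.argmax(probs))], probs.round(3))
# -> Russia [0.251 0.256 0.248 0.246]
```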
| ▲ | aspenmartin 17 minutes ago | parent [-] |
| It's not inherent in the transformer architecture; we do try to ingrain a sense of uncertainty, but it's difficult, not only technically but also philosophically/culturally. How confident do you want the model to be in its answer to "why did Rome fall"? There are lots of tools in our toolbelt for better uncertainty calibration, but it trades off against other capabilities and can actually be rather frustrating to interact with in agentic contexts, since the model will constantly need input from you or otherwise be indecisive and overly cautious. It's not technically a limitation of the transformer architecture, but it is harder to deal with than in other architectures/statistical paradigms. You can maintain a belief state, generate conditional on it, and train to ensure the belief state stays stable and performant. But evals reward guessing at this point, and it's very hard to evaluate calibration in these open-ended contexts. We're slowly getting there, just not nearly as fast as other capabilities. |
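Calibration is at least measurable in the narrow case where the model emits an explicit confidence, e.g. multiple-choice evals. A minimal sketch of expected calibration error, assuming you already have per-answer confidences and correctness flags (the names and numbers here are illustrative, not from any particular eval harness):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bucket predictions by stated confidence and compare each bucket's
    average confidence to its empirical accuracy. Zero means perfectly
    calibrated; confident guessing inflates the gap."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()
        accuracy = correct[mask].mean()
        ece += mask.mean() * abs(avg_conf - accuracy)  # bin weight * calibration gap
    return ece

# Toy usage: a model that claims 90% confidence but is right only 60% of the time.
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))
# -> 0.3
```

The difficulty the comment points at is that open-ended answers like "why did Rome fall" don't come with a single confidence number or a binary correctness flag, so a metric like this stops applying cleanly.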
| |
| ▲ | moffkalast 8 minutes ago | parent | prev [-] |
| It's a benchmark and eval issue. Guessing gets them the right result some of the time, so the models rank better on error rate than they otherwise would. We need benchmarks that penalize being wrong WAY more than saying "I don't know". Of course there's a secondary problem that the model may then overuse the [unintelligible] option, but that's a matter of training it properly against that eval. You could also try thresholding the output on perplexity to drop the parts the model is less sure about (rough sketch below), but I don't think that would be super accurate. |
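A rough sketch of both ideas, with the caveat that the weights, threshold, and function names are illustrative assumptions rather than any existing benchmark or transcription API: an abstention-aware scoring rule that punishes a wrong answer harder than an "I don't know", and a per-token log-probability filter that masks low-confidence spans of a transcript as [unintelligible].

```python
import math

# (a) Abstention-aware scoring: a wrong answer costs much more than admitting ignorance.
# The exact weights are arbitrary; the point is the asymmetry.
def score_answer(answer: str, gold: str) -> float:
    if answer.strip().lower() == "i don't know":
        return -0.25                      # small penalty for abstaining
    if answer.strip().lower() == gold.strip().lower():
        return 1.0                        # reward for being right
    return -2.0                           # big penalty for a confident miss

print(score_answer("I don't know", "France"))  # -0.25
print(score_answer("Russia", "France"))        # -2.0

# (b) Confidence thresholding on a transcript: assume the ASR/LLM pipeline exposes
# a per-token log probability (hypothetical here), and mask anything below the cutoff.
def mask_low_confidence(tokens, logprobs, threshold=math.log(0.5)):
    return " ".join(
        tok if lp >= threshold else "[unintelligible]"
        for tok, lp in zip(tokens, logprobs)
    )

# Toy usage: the last token was a low-confidence guess.
print(mask_low_confidence(
    ["rolling", "out", "payments", "in", "Russia"],
    [-0.05, -0.02, -0.10, -0.01, -1.9],
))
# -> rolling out payments in [unintelligible]
```

The second half only works if the pipeline actually exposes per-token probabilities, and as the comment says it's no cure: a fluent guess can look high-confidence token by token while still being wrong about the sentence as a whole.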
|
|
|
| ▲ | pjc50 2 hours ago | parent | prev [-] |
| Given how financial services can impose silent, inexplicable lifetime bans for using the wrong words in the "what is this transaction for" field, I'm wondering at what point the AI automatically reports people for sanctions violations based on a mishearing. |