▲ | athrowaway3z 6 days ago
I've had a related idea for a while now. Instead of using an LLM the usual way, feeding it the text so far and sampling the most likely next token, you take your full text and use the LLM to score the likelihood/rank of each token that's already there. I'd imagine this creates a heatmap showing which parts are the most 'surprising'. You wouldn't catch all misspellings, but it could be very useful information for finding what flows and what doesn't - or for deliberately going looking for something out of the norm to capture attention.
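A minimal sketch of what that per-token "surprise" scoring could look like, using a small causal LM via Hugging Face transformers. The gpt2 checkpoint, the test sentence, and the display threshold are illustrative choices, not anything from the comment:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works; gpt2 is just small enough to run anywhere
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def token_surprisal(text: str):
    """Return (token, surprisal) pairs; high surprisal = the model found the token unlikely."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # a single forward pass over the whole text
    # log-probability the model assigned to each token that actually appears next
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(ids[0])[1:]  # first token has no prediction
    return list(zip(tokens, (-token_lp).tolist()))        # surprisal in nats

for tok, s in token_surprisal("The cat sat on the mat and purred quietly."):
    flag = "  <-- surprising" if s > 8.0 else ""          # arbitrary display threshold
    print(f"{tok!r:>12}  {s:6.2f}{flag}")
```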
▲ | paol 6 days ago | parent | next [-]
I would like this too. This approach would also fix the most common failure mode of spelling checkers: typos that are accidentally valid words. I constantly type "form" instead of "from", for example, and spelling checkers don't help at all. Even a simple LLM could easily notice out-of-place words like that. And LLMs could easily go further and do grammar and style checking.
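Running that "form"/"from" case through the token_surprisal sketch above (the sentence here is made up for illustration): a spell checker passes "form" because it is a valid word, but a causal LM should assign it far lower probability than "from" in that slot, so it floats to the top of the surprisal ranking.

```python
# Uses the token_surprisal() sketch from the earlier comment; the sentence is illustrative.
scores = token_surprisal("I copied the numbers form the spreadsheet into the report.")
top3 = sorted(scores, key=lambda ts: ts[1], reverse=True)[:3]
print(top3)  # expect the " form" token near the top; exact values depend on the model
```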
▲ | NitpickLawyer 6 days ago | parent | prev | next [-]
I've seen this in a UI. They went a step further: you could select a word (well, a token, but anyway) and "regenerate" from that point by picking another word from the token distribution. Pretty neat. It had the heatmaps you mentioned, based on the probabilities returned by the LLM. This should also be pretty cheap (just one pass through the LLM).
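A rough sketch of that select-a-token-and-regenerate interaction, reusing the model and tokenizer from the first sketch; the position index (in token space), the test sentence, and the sampling settings are arbitrary placeholders, not anything from the UI described:

```python
def alternatives_at(text: str, position: int, k: int = 5):
    """Top-k candidate tokens the model would put at `position` (a token index >= 1)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids[:, :position]).logits   # condition only on the prefix
    probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(probs, k)
    return [(tokenizer.decode([int(i)]), p.item()) for i, p in zip(top.indices, top.values)]

def regenerate_from(text: str, position: int, max_new_tokens: int = 20):
    """Keep the prefix up to `position` and let the model continue from there."""
    prefix = tokenizer(text, return_tensors="pt").input_ids[:, :position]
    out = model.generate(prefix, max_new_tokens=max_new_tokens, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0])

print(alternatives_at("The experiment failed because the sensor was", 5))
print(regenerate_from("The experiment failed because the sensor was", 5))
```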
▲ | anuramat 5 days ago | parent | prev | next [-]
That's how BERT is trained: masked language modeling.
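For completeness, a sketch of that masked-LM flavour: mask each position in turn and ask the model how likely the original token was there (a pseudo-log-likelihood). The bert-base-uncased checkpoint and test sentence are arbitrary choices; note this costs one forward pass per token, unlike the single-pass causal version above.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mlm_surprisal(text: str):
    """(token, surprisal) per position, each scored with that position masked out."""
    ids = mlm_tok(text, return_tensors="pt").input_ids[0]
    scores = []
    for i in range(1, len(ids) - 1):                 # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = mlm_tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        scores.append((mlm_tok.convert_ids_to_tokens(int(ids[i])),
                       -log_probs[ids[i]].item()))
    return scores

print(mlm_surprisal("I copied the numbers form the spreadsheet."))
```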
▲ | simianwords 6 days ago | parent | prev [-]
In fact, it can work entirely at the language level with a prompt like "mark the parts of this paragraph that don't flow well".
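At that level it's just one chat completion call; a minimal sketch against the OpenAI Python SDK, where the model name and exact prompt wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

paragraph = "..."  # the text to review
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": "Mark the parts of this paragraph that don't flow well, "
                   "quoting each awkward span:\n\n" + paragraph,
    }],
)
print(resp.choices[0].message.content)
```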