pessimizer | 5 days ago
Do they "recognize" that they don't know the word, or are there just no statistically plausible surroundings they can embed a nonsense word into, other than the settings that usually surround un-tokenizable words? If you told them to write a Lewis Carroll poem about a nonsense word, they wouldn't have any problem. Not because they "recognize" the word as being like a nonsense word in a Lewis Carroll poem, but because those poems are filled with other un-tokenizable words that could be replaced with anything.

I'm starting to come to the conclusion that LLMs are Mad-Libs at scale. Which is actually very useful: if there are paragraphs where I can swap out the words for other words and generate a plausible idea, I can try it out in the real world and it might really work.
ACCount37 | 5 days ago
I don't think there's a direct link to the tokenizer; it's a higher-level capability. You can stitch together a nonsense word out of common "word fragment" tokens and see whether that impairs the LLM's ability to recognize the word as nonsense.
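To make the "stitched out of common fragments" idea concrete, here is a minimal sketch of greedy longest-match subword segmentation, the kind of splitting real tokenizers apply to unseen words. The `segment` function and the tiny fragment vocabulary are hypothetical, not taken from any actual tokenizer; the point is only that a made-up word like "flummoxtrand" still decomposes into ordinary fragments.

```python
# Toy greedy longest-match subword segmentation. The vocabulary below is
# hypothetical; real tokenizers (BPE, WordPiece) use learned merge tables,
# but the effect on a nonsense word is similar: it splits into fragments
# that each occur in many real words.

def segment(word, vocab):
    """Split `word` into the longest vocabulary fragments, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate fragment first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No fragment matched: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

vocab = {"fl", "um", "mox", "tr", "and", "ing", "er", "ous"}
print(segment("flummoxtrand", vocab))  # → ['fl', 'um', 'mox', 'tr', 'and']
```

Because every fragment is common, nothing at the token level marks the word as nonsense; recognizing it as nonsense has to happen at a higher level, which is the point above.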
thaumasiotes | 5 days ago
> If you told them to write a Lewis Carroll poem about a nonsense word, it wouldn't have any problem.

This makes me wonder something specific. Suppose we generate poetry "in the style of Lewis Carroll" around a particular nonsense word, one that hasn't been written down before. Will that poetry treat the word as if it has one consistent pronunciation?

(This question doesn't quite apply to Jabberwocky: Lewis Carroll himself would obviously have passed the test, but he doesn't reuse his nonsense words.)