libraryofbabel 5 days ago

> LLMs are trained to predict what the “next word” would be in a sentence. Their objective requires the LLM to keep surprise to an absolute minimum.

from which the author concludes that pre-training introduces a bias against being able to tell jokes. I see no reason for this to be true. It feels like they’re imposing their intuitive understanding of “surprise” onto the emergent properties of a very complex process (“minimize the cross-entropy loss function across a huge training corpus”).
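
For concreteness, that objective is just per-token cross-entropy. A minimal sketch with made-up numbers (a hypothetical four-word vocabulary, not any real model's code):

    import numpy as np

    # Toy sketch of the objective the quote refers to: cross-entropy of the
    # predicted next-token distribution against the token that actually
    # appeared in the training corpus.
    vocab = ["the", "cat", "sat", "joke"]      # hypothetical 4-word vocabulary
    logits = np.array([2.0, 0.5, 0.1, -1.0])   # model's scores for the next token

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax -> predicted distribution

    observed = vocab.index("cat")              # the word that actually came next
    loss = -np.log(probs[observed])            # per-token cross-entropy
    print(round(float(probs[observed]), 3), round(float(loss), 3))

The loss rewards putting high probability on the word that actually came next; it says nothing about which token the sampler picks at generation time, which is where "surprise" would actually enter.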

542458 5 days ago | parent | next [-]

I think if what the author said were true, you’d be able to improve joke-writing ability by increasing the temperature (i.e., allowing more unexpected tokens). I doubt this actually works.
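
For what it's worth, temperature is exactly the knob that trades predictability for surprise. A toy sketch with made-up logits (not any particular model):

    import numpy as np

    # Dividing logits by T before the softmax flattens the distribution as T
    # grows, so low-probability ("surprising") tokens get sampled more often.
    def next_token_probs(logits, temperature):
        scaled = np.asarray(logits, dtype=float) / temperature
        probs = np.exp(scaled - scaled.max())
        return probs / probs.sum()

    logits = [3.0, 1.0, 0.2, -1.0]             # made-up next-token scores
    for t in (0.2, 1.0, 2.0):
        print(t, np.round(next_token_probs(logits, t), 3))
    # T=0.2: nearly all mass on the top token; T=2.0: the tail becomes plausible.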

As an aside, I just asked gpt5-thinking to write some jokes on a specific niche topic, and I’d say maybe 20% of them were moderately funny? Probably better than I’d get out of a room of human beings. So, much like with code, LLMs aren’t at the level of a senior developer or an expert comedian, but they’re around the level of a junior dev or an amateur at standup night.

lmm 4 days ago | parent [-]

> I think if what the author said were true, you’d be able to improve joke-writing ability by increasing the temperature (i.e., allowing more unexpected tokens). I doubt this actually works.

I would think it would help tbh. Seems worth a try at least.

Al-Khwarizmi 5 days ago | parent | prev [-]

Many people use this kind of reasoning to argue that LLMs can't be creative, are destined to write bland text, etc. (one notable example was Ted Chiang in the New Yorker), but it has never made any sense.

In my view, the simplest mental model for roughly explaining what LLMs do is a Markov chain. Of course, comparing LLMs to a Markov chain is a gross simplification, but it's one that can only make you underestimate them, not overestimate them, for obvious reasons.

Well, even a Markov chain can surprise you. While it predicts the next word probabilistically, if the dice roll comes out just right, it can pick a low-probability word in the right place and generate original, unexpected text.
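
A toy example (made-up bigram counts, nothing trained) of how even a counting-based chain can land on the rare continuation:

    import random

    # Word-level Markov chain built from hypothetical bigram counts. Sampling
    # proportionally to the counts still lets a rare continuation through
    # when the dice roll comes out just right.
    transitions = {"the": {"cat": 8, "dog": 5, "axolotl": 1}}

    rng = random.Random(42)
    def next_word(word):
        words, counts = zip(*transitions[word].items())
        return rng.choices(words, weights=counts, k=1)[0]

    print([next_word("the") for _ in range(20)])
    # Mostly "cat"/"dog", but "axolotl" shows up now and then:
    # low probability is not zero probability.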

Add to this that LLMs are much better at "Markov chaining" than Markov chains themselves, that there is the added instruction tuning (including RLHF), which can be used to bias the model towards more creative/original text that humans like, and that LLMs often pull off things in ways we don't even really understand - and this kind of claim sounds very naive.