dust42 5 days ago

Minor correction: BERT is an encoder (not an encoder-decoder); ChatGPT is a decoder.

Encoders like BERT tend to produce better embeddings because they can attend to the whole sentence at once, while GPTs only read from left to right:

Imagine you're trying to understand the meaning of a word in a sentence, and you can read the entire sentence before deciding what that word means. For example, in "The bank was steep and muddy," you can see "steep and muddy" at the end, which tells you "bank" means the side of a river (aka riverbank), not a financial institution. BERT works this way - it looks at all the words around a target word (both before and after) to understand its meaning.

Now imagine you have to understand each word as you read from left to right, but you're not allowed to peek ahead. So when you encounter "The bank was..." you have to decide what "bank" means based only on "The" - you can't see the helpful clues that come later. GPT models work this way because they're designed to generate text one word at a time, predicting what comes next based only on what they've seen so far.
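To make the difference concrete, here is a small illustrative sketch (assuming PyTorch and the Hugging Face transformers library, with the standard bert-base-uncased and gpt2 checkpoints; the bank_vector helper is just something I'm defining here). It compares the contextual vector for "bank" in two sentences that share the prefix "The bank":

    # Compare the contextual vector for "bank" in two sentences that share the
    # prefix "The bank". GPT-2 is causal, so its hidden state for "bank" depends
    # only on that shared prefix; BERT also attends to the words that follow.
    import torch
    from transformers import AutoModel, AutoTokenizer

    def bank_vector(model_name, sentence, word="bank"):
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name).eval()
        inputs = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
        # Return the hidden state of the first token that decodes to the target word.
        for i, tid in enumerate(inputs["input_ids"][0]):
            if tok.decode([tid]).strip().lower() == word:
                return hidden[i]
        raise ValueError(f"{word!r} not found in {sentence!r}")

    river = "The bank was steep and muddy."
    money = "The bank approved my loan application."

    for name in ("bert-base-uncased", "gpt2"):
        v1, v2 = bank_vector(name, river), bank_vector(name, money)
        sim = torch.cosine_similarity(v1, v2, dim=0).item()
        print(f"{name}: cosine similarity of 'bank' across sentences = {sim:.3f}")

If you run something like this, the GPT-2 similarity should come out at essentially 1.0 (its "bank" vector cannot see past the shared prefix "The bank"), while BERT's is typically noticeably lower because the words after "bank" differ between the two sentences.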

Here is a link, also from Hugging Face, about ModernBERT, which has more info: https://huggingface.co/blog/modernbert

Also worth a look: neoBERT https://huggingface.co/papers/2502.19587

jasonjayr 5 days ago | parent | next

As an extreme example that can (intentionally) confuse even human readers, see https://en.wikipedia.org/wiki/Garden-path_sentence

xxpor 4 days ago | parent | prev

Complete LLM internals noob here: Wouldn't this make GPTs awful at languages like German with separable verb prefixes?

E.g. Er macht das Fenster. vs Er macht das Fenster auf.

(He makes the window. vs He opens the window.)

Ey7NFZ3P0nzAe 4 days ago | parent

Or exceptionally good at German, because they have to keep better track of what is meant and anticipate more?

No I don't think it makes any noticeable difference :)

xxpor 4 days ago | parent

I'm probably way too English-brained :D