| ▲ | carschno 5 days ago |
| Nice explanations!
A (more advanced) aspect which I find missing would be the difference between encoder-decoder transformer models (BERT) and "decoder-only", generative models, with respect to the embeddings. |
|
| ▲ | dust42 5 days ago | parent | next [-] |
| Minor correction: BERT is an encoder (not an encoder-decoder); ChatGPT is a decoder. Encoders like BERT produce better results for embeddings because they look at the whole sentence, while GPTs read from left to right.

Imagine you're trying to understand the meaning of a word in a sentence, and you can read the entire sentence before deciding what that word means. For example, in "The bank was steep and muddy," you can see "steep and muddy" at the end, which tells you "bank" means the side of a river (aka a riverbank), not a financial institution. BERT works this way - it looks at all the words around a target word (both before and after) to understand its meaning.

Now imagine you have to understand each word as you read from left to right, but you're not allowed to peek ahead. So when you encounter "The bank was..." you have to decide what "bank" means based only on "The" - you can't see the helpful clues that come later. GPT models work this way because they're designed to generate text one word at a time, predicting what comes next based only on what they've seen so far.

Here is a link, also from Hugging Face, about ModernBERT, which has more info:
https://huggingface.co/blog/modernbert Also worth a look: NeoBERT https://huggingface.co/papers/2502.19587 |
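To make that concrete, here is a minimal sketch using the Hugging Face transformers library; the bert-base-uncased and gpt2 checkpoints and the small helper below are illustrative choices, not anything the comment prescribes. BERT's hidden state for "bank" shifts with the words that come after it, while GPT-2's hidden state at "bank" stays the same, because a causal model never attends to future tokens.

    import torch
    from transformers import AutoTokenizer, AutoModel

    def hidden_state_of(model_name, sentence, word):
        # (illustrative helper) return the hidden state at the first sub-token of `word`
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModel.from_pretrained(model_name)
        enc = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]        # (seq_len, hidden_dim)
        tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        idx = next(i for i, t in enumerate(tokens) if word in t.lower())
        return hidden[idx]

    for name in ["bert-base-uncased", "gpt2"]:
        a = hidden_state_of(name, "The bank was steep and muddy.", "bank")
        b = hidden_state_of(name, "The bank was robbed yesterday.", "bank")
        sim = torch.cosine_similarity(a, b, dim=0).item()
        print(f"{name}: cosine similarity of 'bank' across the two contexts = {sim:.4f}")

    # Expected: BERT's similarity is noticeably below 1.0 (the right-hand context
    # changes the representation), while GPT-2's is essentially 1.0, because at the
    # position of "bank" it has only ever seen "The bank".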
| |
| ▲ | jasonjayr 5 days ago | parent | next [-] | | As an extreme example that can (intentionally) confuse even human readers, see https://en.wikipedia.org/wiki/Garden-path_sentence | |
| ▲ | xxpor 4 days ago | parent | prev [-] | | Complete LLM internals noob here: Wouldn't this make GPTs awful at languages like German with separable word prefixes? E.g. Er macht das Fenster. vs Er macht das Fenster auf. (He makes the window. vs He opens the window.) | | |
| ▲ | Ey7NFZ3P0nzAe 4 days ago | parent [-] | | Or exceptionally good at German, because they have to keep better track of what is meant and anticipate more? No, I don't think it makes any noticeable difference :) | | |
|
|
|
| ▲ | ubutler 5 days ago | parent | prev [-] |
| Further to @dust42, BERT is an encoder, GPT is a decoder, and T5 is an encoder-decoder. Encoder-decoders are not in vogue. Encoders are favored for classification, extraction (eg, NER and extractive QA) and information retrieval. Decoders are favored for text generation, summarization and translation. Recent research (see, eg, the Ettin paper: https://arxiv.org/html/2507.11412v1 ) seems to confirm the previous understanding that encoders are indeed better for "encoder tasks" and vice versa. Fundamentally, both are transformers, so an encoder could be turned into a decoder or a decoder into an encoder. The design difference comes down to bidirectional attention (ie, all tokens can attend to all other tokens) versus autoregressive attention (ie, the current token can only attend to the previous tokens). |
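A small sketch of those two usage patterns with the transformers pipeline API (the bert-base-uncased and gpt2 checkpoints are just common public choices): the encoder fills a blank using context on both sides, while the decoder can only continue to the right.

    from transformers import pipeline

    # Bidirectional: the clue "steep and muddy" comes *after* the blank, and BERT uses it.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill("The [MASK] was steep and muddy.")[:3]:
        print("BERT suggests:", pred["token_str"], round(pred["score"], 3))

    # Autoregressive: GPT-2 can only extend the text to the right of what it has seen.
    gen = pipeline("text-generation", model="gpt2")
    print(gen("The bank was", max_new_tokens=10)[0]["generated_text"])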
| |
| ▲ | namibj 4 days ago | parent | next [-] | | You can use an encoder-style architecture with decoder-style output heads on top for denoising-diffusion-style mask/blank filling.
Such models seem to be somewhat more expensive on short sequences than GPT-style decoder-only models when you batch requests, because the decoder-only model needs fewer passes over the content; until the sequence length blows up your KV-cache throughput cost, fewer passes are cheaper.
But in situations that don't get request batching, or where the context length is so heavy that you'd prefer to exploit memory locality in the attention computation, you'd benefit from diffusion-mode decoding. A nice side effect of diffusion mode is that its natural reliance on the bidirectional attention of the encoder layers provides much more flexible (and, critically, context-aware) understanding. As mentioned, later words can easily modulate earlier words, as with "bank [of the river]" / "bank [in the park]" / "bank [got robbed]", or the current classic: telling an agent it did something wrong and expecting it to learn from the mistake in-context (in practice decoder-only models basically just get polluted by that, so you have to rewind the conversation, because the later correction has no way of backwards-affecting the problematic tokens). That said, the recent surge in training "reasoning" models to use thinking tokens that often get cut out of further conversation context, via a reinforcement learning process that isn't merely RLHF/preference-conditioning, is actually quite related:
discrete denoising diffusion models can be trained with an RL scheme during pretraining, where the training step is given the outcome goal and a masked version as the input query, and the model is trained to manage the work done in the individual steps on its own until it eventually produces the outcome goal, crucially without prescribing any order of filling in the masked tokens or how many to fill in each step. A recent paper on the matter: https://openreview.net/forum?id=MJNywBdSDy | |
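As a rough sketch of the iterative mask-filling idea only, not of the RL pretraining scheme in the linked paper, one can use an ordinary masked LM (bert-base-uncased here, purely as a stand-in for a diffusion-trained model) and un-mask one position per step:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    prompt = "the [MASK] [MASK] [MASK] over the lazy dog ."
    ids = tok(prompt, return_tensors="pt")["input_ids"]

    # Repeatedly un-mask the single position the model is most confident about,
    # letting already-filled tokens (on both sides) inform the remaining blanks.
    while (ids == tok.mask_token_id).any():
        with torch.no_grad():
            probs = model(input_ids=ids).logits[0].softmax(-1)   # (seq_len, vocab)
        masked = (ids[0] == tok.mask_token_id).nonzero().squeeze(-1)
        conf, best = probs[masked].max(-1)
        pick = conf.argmax()
        ids[0, masked[pick]] = best[pick]
        print(tok.decode(ids[0], skip_special_tokens=True))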
| ▲ | danieldk 5 days ago | parent | prev [-] | | Until we got highly optimized decoder implementations, decoder prefill was often even implemented using the same implementation as an encoder, but masking the attention logits with a causal mask before the attention softmax so that tokens could not attend to future tokens. |
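A minimal sketch of that trick in plain PyTorch, with made-up shapes: compute the full all-to-all attention scores as an encoder would, then fill the future positions of the logits with -inf before the softmax so their weights come out as zero.

    import torch
    import torch.nn.functional as F

    seq_len, d = 5, 8
    q, k, v = (torch.randn(seq_len, d) for _ in range(3))

    scores = q @ k.T / d ** 0.5                   # (seq_len, seq_len), all-to-all logits

    # Encoder-style (bidirectional): softmax over every position.
    bidir_weights = F.softmax(scores, dim=-1)

    # Decoder-style (causal): same scores, but future positions get -inf before
    # the softmax, so their attention weights come out as exactly zero.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    causal_weights = F.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

    print(bidir_weights[1])    # token 1 attends to all 5 positions
    print(causal_weights[1])   # token 1 attends only to positions 0 and 1
    out = causal_weights @ v   # the rest of the layer is unchanged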
|