It's not syntax, it's data-driven (yes, of course, syntax contributes to that).
https://freedium.cfd/https://vinithavn.medium.com/from-multi...
At its core, attention operates through three fundamental components — queries, keys, and values — that work together with attention scores to create a flexible, context-aware vector representation.
Query (Q): The query is a vector that represents the current token for which the model wants to compute attention.
Key (K): Keys are vectors that represent the elements in the context against which the query is compared to determine relevance.
Attention Scores: These are computed from the Query and Key vectors (typically as a scaled dot product followed by a softmax) and determine how much attention to pay to each context token.
Value (V): Values are the vectors that carry the actual contextual information. After the attention scores are computed from the Query and Key vectors, they are used as weights over the Value vectors to produce the final context vector.
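To make that concrete, here's a minimal NumPy sketch of standard scaled dot-product attention; the dimensions and the random projection matrices are illustrative, not from the article (in a real model W_q, W_k, W_v are learned):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)
    d_k = Q.shape[-1]
    # attention scores: query-key similarity, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_q, seq_k)
    weights = softmax(scores, axis=-1)     # each row sums to 1
    # context vectors: attention-weighted sum of the values
    return weights @ V                     # (seq_q, d_v)

# toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                # token embeddings
# learned projections in a real model; random here for illustration
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
context = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(context.shape)                       # (4, 8)
```

The point about it being data-driven shows up in the weights: which tokens attend to which is computed fresh from the input embeddings every forward pass, not hard-coded by any grammar.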