| ▲ | Lionga 8 hours ago | |
Polysemantic features in modern transformer architectures (e.g., with grouped-query attention) are not discretely addressable, semantically stable units but superposed, context-dependent activation patterns distributed across layers and attention heads, so there is no principled mechanism by which a single circuit or feature can reliably and specifically encode “a particular code error” in a way that is isolable, causally attributable, and consistently retrievable across inputs. --- Way to go in showing you want a discussion, good job. | ||
| ▲ | jychang 8 hours ago | parent [-] | |
Nice LLM generated text. Now go read https://transformer-circuits.pub/2024/scaling-monosemanticit... or https://arxiv.org/abs/2506.19382 to see why that text is outdated. Or read any paper in the entire field of mechanistic interpretability (from the past year or two), really. Hint: the first paper is titled "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet" and you can ctrl-f for "We find three different safety-relevant code features: an unsafe code feature 1M/570621 which activates on security vulnerabilities, a code error feature 1M/1013764 which activates on bugs and exceptions" Who said I want a discussion? I want ignorant people to STOP talking, instead of talking as if they knew everything. | ||