hazrmard 3 hours ago
Check my understanding & follow-up Qs: an auto-encoder is trained as [activation] -AV-> [text] -AR-> [activation], where [activation] comes from one layer of the LLM M. Architecture:
- The AV and AR models are initialized via supervised learning on a summarization task, on the assumption that a model's internal thoughts resemble a summary of its context.
- AR is trained with a simple reconstruction loss.
- AV is trained with an RL objective: reconstruction loss plus a KL penalty toward the initial weights (to maintain linguistic fluency). A rough pseudocode sketch of my reading follows this list.
- The authors acknowledge, and expect, confabulations in the verbalizations: factually incorrect or unsubstantiated statements. But the internal thought we seek is itself, by definition, unsubstantiated. How can we tell the verbalization is not duplicitous?
- They test this on a layer roughly 2/3 of the way into the model. I wonder how shallow vs. deep layers, with their different levels of abstraction, affect thought verbalization.
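Putting that together, here's a minimal sketch of how I picture one training step, assuming an MSE reconstruction loss and a REINFORCE-style policy gradient (the paper may well use a different RL algorithm); all names here are my own placeholders, not the paper's API:

```python
import torch
import torch.nn.functional as F

def train_step(activation, verbalizer, verbalizer_ref, reencoder, beta=0.1):
    # AV: map one layer's activation to a sampled text verbalization.
    # Sampling breaks differentiability, hence the RL objective below.
    text_tokens, logprobs = verbalizer.sample(activation)

    # AR: map the text back to a predicted activation.
    activation_hat = reencoder(text_tokens)

    # AR is trained on a simple reconstruction loss (MSE is my assumption).
    ar_loss = F.mse_loss(activation_hat, activation)

    # AV is trained with RL: reward = negative reconstruction error, plus a
    # KL penalty toward the frozen summarization-initialized weights
    # (verbalizer_ref) to keep verbalizations linguistically fluent.
    with torch.no_grad():
        reward = -F.mse_loss(activation_hat, activation)
        ref_logprobs = verbalizer_ref.logprobs(activation, text_tokens)
    kl = (logprobs - ref_logprobs).sum()
    av_loss = -reward * logprobs.sum() + beta * kl  # REINFORCE surrogate

    return ar_loss, av_loss
```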