hazrmard 3 hours ago
Check my understanding & follow-up Qs: an auto-encoder is trained as [activation] -AV-> [text] -AR-> [activation], where [activation] comes from one layer of the LLM M. Architecture:
- The AV and AR models are initialized via supervised learning on a summarization task, on the assumption that a model's internal thoughts resemble a summary of its context.
- AR is trained with a simple reconstruction loss.
- AV is trained with an RL objective: reconstruction loss plus a KL penalty toward the initial weights (to maintain linguistic fluency). A rough pseudocode sketch of my reading follows this list.
- The authors acknowledge, and expect, confabulations in the verbalizations: factually incorrect or unsubstantiated statements. But the internal thought we seek is itself, by definition, unsubstantiated. How can we tell the verbalization is not duplicitous?
- They test this on a layer roughly 2/3 of the way into the model. I wonder how shallow vs. deep layers, with their different levels of abstraction, affect thought verbalization.
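Putting that together, here's a minimal sketch of how I picture one training step, assuming an MSE reconstruction loss and a REINFORCE-style policy gradient (the paper may well use a different RL algorithm); all names here are my own placeholders, not the paper's API:

```python
import torch
import torch.nn.functional as F

def train_step(activation, verbalizer, verbalizer_ref, reencoder, beta=0.1):
    # AV: map one layer's activation to a sampled text verbalization.
    # Sampling breaks differentiability, hence the RL objective below.
    text_tokens, logprobs = verbalizer.sample(activation)

    # AR: map the text back to a predicted activation.
    activation_hat = reencoder(text_tokens)

    # AR is trained on a simple reconstruction loss (MSE is my assumption).
    ar_loss = F.mse_loss(activation_hat, activation)

    # AV is trained with RL: reward = negative reconstruction error, plus a
    # KL penalty toward the frozen summarization-initialized weights
    # (verbalizer_ref) to keep verbalizations linguistically fluent.
    with torch.no_grad():
        reward = -F.mse_loss(activation_hat, activation)
        ref_logprobs = verbalizer_ref.logprobs(activation, text_tokens)
    kl = (logprobs - ref_logprobs).sum()
    av_loss = -reward * logprobs.sum() + beta * kl  # REINFORCE surrogate

    return ar_loss, av_loss
```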