l674 5 hours ago
Could you explain how/why GRAM cannot be interpreted or aligned the way current LLMs are? I'm not very familiar with how it works.
kmavm 4 hours ago
Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."
sometimelurker 3 hours ago
The sibling comment got to the main points before me, but to add to kmavm's reply: the attack surface for gradient descent to get the system to exchange "bad" information is much larger in latent reasoning models (like GRAM). You get ~3 OoM more bits per forward pass of the system feeding back into itself (~17 bits per token in a standard CoT vs. the whole residual stream of the model at f16, a few KB), and even if you could sift through all of that for signs of misalignment, you just can't put a blockade on all of the bad things that leak through.
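For anyone who wants to check the "~3 OoM" figure, here is a rough back-of-the-envelope sketch. The vocabulary size (128,000) and hidden width (4096) are assumed round numbers for illustration, not GRAM's actual configuration, and the latent figure is raw storage size, which is only an upper bound on how much usable information a residual-stream vector carries.

```python
import math


def bits_per_sampled_token(vocab_size: int) -> float:
    """Upper bound on the information in one sampled token: log2(vocab size)."""
    return math.log2(vocab_size)


def bits_per_latent_state(d_model: int, bits_per_element: int = 16) -> int:
    """Raw size in bits of one residual-stream vector stored at f16."""
    return d_model * bits_per_element


# Both values below are assumptions chosen for illustration only.
VOCAB_SIZE = 128_000  # hypothetical tokenizer vocabulary
D_MODEL = 4096        # hypothetical hidden width

token_bits = bits_per_sampled_token(VOCAB_SIZE)
latent_bits = bits_per_latent_state(D_MODEL)
ratio = latent_bits / token_bits

print(f"text CoT:   {token_bits:.1f} bits per step")
print(f"latent CoT: {latent_bits} bits per step ({latent_bits / 8 / 1024:.0f} KiB)")
print(f"ratio:      {ratio:.0f}x (~{math.log10(ratio):.1f} OoM)")
```

With these assumed sizes the ratio lands between three and four orders of magnitude, and a wider model only pushes it higher.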