sometimelurker 2 hours ago

A sibling comment got to the main points before me, but to add to kmavm's reply: the attack surface for gradient descent to get the system to exchange "bad information" is much higher in latent reasoning models (like GRAM). You get ~3 OoM more bits per forward pass fed back into the system (~17 bits per token in standard CoT vs. the model's whole residual stream at f16, a few kB), and even if you could sift through all that for signs of misalignment, you just can't put a blockade on all of the bad things that leak through.
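
For anyone who wants to check the bits-per-token vs. residual-stream comparison above, here is a rough back-of-the-envelope sketch. The vocabulary size (131,072) and hidden width (4,096) are assumptions picked for illustration, not figures for any particular model, and both numbers are upper bounds on raw channel capacity rather than measurements of the information actually carried.

```python
import math


def bits_per_sampled_token(vocab_size: int) -> float:
    """Upper bound on information in one sampled token: log2 of the vocab size."""
    return math.log2(vocab_size)


def bits_per_latent_step(d_model: int, bits_per_element: int = 16) -> int:
    """Raw size of one residual-stream vector fed back at f16 precision."""
    return d_model * bits_per_element


# Assumed, illustrative values (not taken from the thread or any specific model).
VOCAB_SIZE = 131_072  # 2**17, so exactly 17 bits per token
D_MODEL = 4_096       # hidden width of the residual stream

token_bits = bits_per_sampled_token(VOCAB_SIZE)
latent_bits = bits_per_latent_step(D_MODEL)
ratio = latent_bits / token_bits
orders_of_magnitude = math.log10(ratio)

print(f"CoT token:   {token_bits:.0f} bits")
print(f"Latent step: {latent_bits} bits ({latent_bits // 8} bytes)")
print(f"Ratio:       {ratio:.0f}x (about {orders_of_magnitude:.1f} OoM)")
```

Under those assumptions the latent channel is a few thousand times wider per step, which is in line with the "~3 OoM" estimate; a wider model pushes it higher.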

haldujai 2 hours ago | parent | next [-]

I think you're overstating the impact of interpretability here. Your earlier points, that latent reasoning models can't be trained very well and that discretization may be load-bearing rather than a readability tax, together with significant inference infra hurdles (e.g. batching, speculative decoding), have limited any serious attempts and reduced the theoretical advantage over CoT, at least in the near term.

ACCount37 2 hours ago | parent | prev [-]

Most alignment methods nowadays don't rely on interpretability. Nor do all LLM vendors care much about alignment, especially not in China.

Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.