hackerchy 8 hours ago

This is fascinating. The fact that only blocks of ~7 layers work, and not fewer or more, really suggests there are emergent functional units in the transformer stack that we don't fully understand yet. Almost like "organs" in the network. Have you tried this on architectures other than Qwen, like Llama or Mistral? Curious whether the magic block size is architecture-dependent or whether 7 layers is some kind of universal constant. Rough sketch of the kind of cross-architecture check I mean below.
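Something like this would be a quick way to probe it on Llama/Mistral. To be clear, this is just my guess at the experiment: I'm assuming it's a skip-a-block ablation sweep (replace each contiguous block of k decoder layers with identity, measure perplexity, vary k), and I'm assuming the standard HF layout where Qwen2/Llama/Mistral all expose their decoder blocks at `model.model.layers`. May not match what OP actually did.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class SkipBlock(torch.nn.Module):
    """Identity stand-in for a decoder layer: passes hidden states through.
    Returns a 1-tuple to match the decoder-layer output contract in most
    transformers versions (newer releases may differ)."""
    def forward(self, hidden_states, *args, **kwargs):
        return (hidden_states,)

@torch.no_grad()
def perplexity(model, tok, text):
    enc = tok(text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"], use_cache=False)
    return torch.exp(out.loss).item()

def sweep(model_name, text, max_block=10):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    layers = model.model.layers  # decoder blocks in Qwen2/Llama/Mistral
    results = {}  # block size k -> list of ppl with each k-block skipped
    for k in range(1, max_block + 1):
        ppls = []
        for start in range(len(layers) - k + 1):
            saved = [layers[i] for i in range(start, start + k)]
            for i in range(start, start + k):
                layers[i] = SkipBlock()           # ablate the block
            ppls.append(perplexity(model, tok, text))
            for offset, layer in enumerate(saved):  # restore originals
                layers[start + offset] = layer
        results[k] = ppls
    return results
```

This is O(n * k) forward passes, so use a short eval text. If block sizes other than ~7 tank perplexity the same way on Llama/Mistral, that would point toward a universal constant rather than a Qwen quirk.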