koenschipper 3 hours ago

This is an incredibly elegant hack. The finding that it only works with "circuit-sized" blocks of ~7 layers is fascinating. It really makes you wonder how much of a model's depth is just routing versus actual discrete processing units.
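For anyone skimming who hasn't seen the trick: duplicating a contiguous block of layers (as in passthrough-style depth up-scaling merges) is mostly index bookkeeping, splicing a copy of the block back in right after itself. A minimal toy sketch of that bookkeeping, assuming the decoder layers live in an ordered list (the helper name and numbers here are illustrative, not the author's code):

```python
def duplicate_block(layers, start, block_size):
    """Return a new layer order where layers[start:start+block_size]
    appear twice in a row (weights can be shared or deep-copied)."""
    end = start + block_size
    return layers[:end] + layers[start:end] + layers[end:]

# Toy example: 32 "layers" stood in for by their indices,
# duplicating a ~7-layer block starting at layer 12.
order = duplicate_block(list(range(32)), start=12, block_size=7)
assert len(order) == 39
assert order[12:19] == order[19:26]  # the repeated circuit-sized block
```

In a real model you'd apply the same slicing to the module list and decide whether the copies share weights or get their own.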

I spend a lot of time wrestling with smaller LLMs for strict data extraction and JSON formatting. Have you noticed if duplicating these specific middle layers boosts a particular type of capability?

For example, does the model become more obedient to system prompts/strict formatting, or is the performance bump purely in general reasoning and knowledge retrieval?

Amazing work doing this on a basement 4090 rig!