koenschipper | 3 hours ago
This is an incredibly elegant hack. The finding that it only works with "circuit-sized" blocks of ~7 layers is fascinating. It really makes you wonder how much of a model's depth is just routing versus actual discrete processing units.

I spend a lot of time wrestling with smaller LLMs for strict data extraction and JSON formatting. Have you noticed whether duplicating these specific middle layers boosts a particular type of capability? For example, does the model become more obedient to system prompts and strict formatting, or is the performance bump purely in general reasoning and knowledge retrieval?

Amazing work doing this on a basement 4090 rig!