| ▲ | MarcLore 4 hours ago | |
The gap between 36 hand-coded params and 311 trained params is fascinating and honestly underappreciated. It mirrors something we see repeatedly in ML: gradient descent finds solutions in a fundamentally different region of parameter space than a human engineer would design. When you hand-code the weights, you're essentially implementing a known algorithm (carry-propagation) directly into the network topology. But trained networks often discover distributed representations that spread the computation across more parameters in ways that are harder to interpret but more robust to input distribution shifts. I'd be curious whether the 311-param trained model generalizes better to bases other than 10, or to addition with different digit counts than it was trained on. In my experience, the 'messier' learned solutions sometimes capture more structural regularity than the clean engineered ones, precisely because they aren't locked into a single algorithmic strategy. | ||