alexlitz 3 hours ago

I wrote a blog post on my submission (currently the top handwritten one, at 36 parameters): https://alexlitzenberger.com/blog/building_a_minimal_transfo...

ks2048 a minute ago | parent | next [-]

I didn't look at all the details, but I wanted to see how you did the initial embedding, and I see you do have a 14x5 matrix there. I guess when you are setting things by hand (rather than learning them), the definition of what counts as a "parameter" is a bit unclear. One could say all of those entries are parameters, even if they are set in a straightforward way.
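A quick sketch of that counting ambiguity, assuming (per the comment above) a vocabulary of 14 tokens embedded in 5 dimensions. The fill rule below is made up for illustration and is not the author's actual embedding:

```python
import numpy as np

# Illustrative hand-set embedding: 14 tokens x 5 dimensions, as discussed above.
# The values follow a made-up structured rule, not the submission's real scheme.
vocab_size, embed_dim = 14, 5
embedding = np.array([[1.0 if j == i % embed_dim else 0.0
                       for j in range(embed_dim)]
                      for i in range(vocab_size)])

# Counted naively, every entry is a "parameter" -- 70 of them --
# even though a one-line rule generated the whole matrix.
print(embedding.shape, embedding.size)
```

So a strict entry count gives 70 "parameters" for the embedding alone, while a rule-based count might charge only the handful of choices behind the rule.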

sowbug 3 hours ago | parent | prev [-]

I ask this question as someone who can't do much more than confirm that your blog post is written in English by someone who knows math.

Does this result suggest that if we had N clever humans manually building an LLM, they might come up with something as smart as a frontier model, but potentially 45 times smaller? (1644 / 36 ~= 45, N = very large, time not specified)

alexlitz 3 hours ago | parent [-]

I imagine getting things to be polysemantic in a way that does not interfere would lead to sublinear scaling. Also, there are smaller models that were trained, so the ratio would be more like 311/36 ~= 8.6.
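The ratios in this subthread, spelled out (36 is the handwritten count; 1644 and 311 are the other submissions' parameter counts quoted above):

```python
handwritten = 36
largest_listed = 1644   # count behind the "~45x" figure upthread
smallest_trained = 311  # smallest trained submission mentioned here

print(f"{largest_listed / handwritten:.1f}x")   # roughly 45x
print(f"{smallest_trained / handwritten:.1f}x") # roughly 8.6x
```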

Lerc 2 hours ago | parent | next [-]

>I imagine getting things to be polysemantic in a way that does not interfere would lead to sublinear scaling.

True, but with even smarter humans, you could exploit the interactions for additional calculations.

While it sounds a bit silly, it is one of the hypotheses behind a fast takeoff. An AI that is sufficiently smart could design a network better than a trained one, and could make something much smarter than itself on the same hardware. The question then becomes whether that new, smarter one can do an even better job. I suspect diminishing returns, but then again I am insufficiently smart.

sowbug 3 hours ago | parent | prev [-]

Thanks!

(I see the Trained Weights results now, thanks.)