| ▲ | simopa 6 hours ago |
| It's crazy to see a 400B model running on an iPhone. But moving forward, as the information density and architectural efficiency of smaller models continue to increase, getting high-quality, real-time inference on mobile is going to become trivial. |
|
| ▲ | anemll 3 hours ago | parent | next [-] |
Probably 2x the speed on a Mac Studio this year if they double the NAND (or quadruple it?).
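(Rough arithmetic, assuming decode is bound by streaming weights off the NAND; all numbers below are illustrative, not Apple specs:)

    # If decoding is weight-streaming bound, tokens/sec is roughly
    # effective bandwidth divided by bytes touched per token.
    def tokens_per_sec(bandwidth_gb_s, weights_gb):
        return bandwidth_gb_s / weights_gb

    base = tokens_per_sec(8, 200)      # hypothetical: 8 GB/s NAND, 400B at 4-bit
    doubled = tokens_per_sec(16, 200)  # double the NAND channels
    print(base, "->", doubled, "tok/s (2x bandwidth ~= 2x speed)")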
|
| ▲ | volemo 5 hours ago | parent | prev [-] |
> moving forward, as the information density and architectural efficiency of smaller models continue to increase

If they continue to increase.

| ▲ | vessenes 4 hours ago | parent | next [-]
They will. Either new architectures will come out that give us greater efficiency, or we will hit a point where the main thing we can do is shove more training time onto these weights to get more per byte. A similar thing is already happening organically with efficient token use; see for instance https://github.com/qlabs-eng/slowrun.
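(To make "more per byte" concrete: the Chinchilla parametric loss fit, L(N, D) = E + A/N^a + B/D^b, has loss at a fixed parameter count N still falling as training tokens D grow. Constants below are the published fit estimates from Hoffmann et al. 2022; treat the output as a rough sketch:)

    # Chinchilla fit: L(N, D) = E + A / N**a + B / D**b
    E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.28
    def loss(N, D):
        return E + A / N**a + B / D**b
    N = 8e9  # fix an 8B-param model and just keep training it
    for D in (2e12, 8e12, 30e12):
        print(f"{D/1e12:g}T tokens -> loss {loss(N, D):.3f}")
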
| ▲ | simopa 3 hours ago | parent | prev [-]
The "if" is fair. But when scaling hits diminishing returns, the field is forced to look at architectures with better capacity-per-parameter tradeoffs. It's happened before; maybe it'll happen again now.
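(One concrete version of that tradeoff is mixture-of-experts: store a lot of parameters, touch only a few per token. Numbers here are illustrative, not any particular model's:)

    # MoE: capacity scales with total params, per-token cost with active params.
    total_params = 400e9   # hypothetical: what sits in storage
    active_params = 17e9   # hypothetical: what each token actually touches
    print(f"per-token cost ~ {active_params / total_params:.1%} of the dense equivalent")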