rvz 9 hours ago

The technical write-up is great, but Mac users shouldn't get too excited just yet about running 300B+ parameter models locally, because the TPS isn't that good.

>...at 4.4+ tokens/second

And that is with 4-bit quantization; even quantized, it's still only at that speed.

> The entire 209GB model streams from SSD through a custom Metal compute pipeline.

This is my main problem.

If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

Can't imagine using this long term right now, but improvements will follow. Still a great write-up anyway.

Roxxik 9 hours ago | parent | next [-]

Does an SSD meaningfully degrade by read only workloads?

JSR_FDED 9 hours ago | parent [-]

Nope, reads don’t cause wear

zozbot234 8 hours ago | parent [-]

No appreciable wear of course, but read disturb (requiring occasional rewrites) becomes more of an issue as NAND fabrication advances.

etiam 9 hours ago | parent | prev | next [-]

> If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

How sure are you about that? I've never looked closely at how a large mixture-of-experts LLM switches between expert modules, but if the use stays on roughly the same topic (as it often would when editing the same codebase), I wouldn't be surprised if the changes of composition were fairly rare and fairly small, and to the extent they happen, they cause repeated reads from the flash disk rather than writes.

frotaur 9 hours ago | parent [-]

Afaik the experts are not usually very interpretable, and I'd generally be surprised if at least one didn't change every token. I don't know what happens in practice, but I do know that at least during training, nothing is done to minimize the number of expert switches between tokens.
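For what it's worth, standard top-k gating scores every expert independently per token, so there is no built-in stickiness between consecutive tokens. A toy sketch (plain numpy, made-up sizes, not any particular model's router):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_experts(hidden, w_gate, k=2):
    """Standard top-k gating: score every expert for each token,
    keep the k best. Nothing here ties consecutive tokens to the
    same expert set."""
    logits = hidden @ w_gate                     # (tokens, n_experts)
    return np.argsort(logits, axis=-1)[:, -k:]   # per-token expert ids

# Made-up sizes for illustration only.
n_tokens, d_model, n_experts = 8, 16, 32
hidden = rng.normal(size=(n_tokens, d_model))
w_gate = rng.normal(size=(d_model, n_experts))

chosen = top_k_experts(hidden, w_gate)
# Count how often the active expert set changes between adjacent tokens.
switches = sum(set(chosen[i]) != set(chosen[i + 1]) for i in range(n_tokens - 1))
print(switches)
```

With an untrained (random) gate like this, the chosen pair differs at almost every step; whether a trained router on a single codebase behaves more coherently is exactly the open question in this subthread.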

etiam 11 minutes ago | parent [-]

I'd have thought there'd be at least a tiny explicit penalty term for switching, to discourage messing around with the composition without any expected gain from it.

If these are to be used on hardware that can't keep everything loaded, I guess someone should examine how it works out in practice. Interpretability may be too much to ask, but I can't spontaneously see any reason why the experts can't at least be pushed to incorporate what's needed to remain the good choice for a longer segment.
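The penalty term floated above could look something like this. To be clear, this is a hypothetical auxiliary loss, not a known production recipe (standard MoE training uses a load-balancing loss, which does something different):

```python
import numpy as np

def switch_penalty(gate_probs, weight=0.01):
    """Hypothetical 'stickiness' loss term: penalize how much the
    gate distribution moves between consecutive tokens, nudging the
    router to keep the same experts active across a span. This is
    the commenter's idea sketched out, not an established method."""
    # gate_probs: (tokens, n_experts), each row sums to 1
    diffs = np.abs(np.diff(gate_probs, axis=0)).sum(axis=-1)  # L1 change per step
    return weight * diffs.mean()

# A router that flips experts every token pays more than one that stays put.
flip = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
stay = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
print(switch_penalty(flip) > switch_penalty(stay))  # True
```

A small `weight` matters: too large and the router would sacrifice routing quality just to avoid moving, which defeats the point of having experts at all.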

Wowfunhappy 8 hours ago | parent | prev | next [-]

Eh. I mean, 4 tokens a second works fine if you're patient. Go do something else while you wait.

I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.

Also, reading data doesn't cause SSD wear.

hrmtst93837 9 hours ago | parent | prev [-]

If you want decent throughput and do not care about burning SSD write cycles on a box that was never meant to act like a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I wouldn't expect Apple's warranty team to be much help.

K0balt 8 hours ago | parent [-]

Is it doing a bunch of ssd writes?

mkw 5 hours ago | parent [-]

stream from the SSD, perform the calculation, discard, repeat
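That stream-and-discard loop can be sketched like this. A toy numpy stand-in with made-up file layout, sizes, and layer math; the real pipeline streams through Metal, and the point is just that the steady-state traffic is pure reads:

```python
import numpy as np
import os
import tempfile

# Write some toy "model weights" to disk once, up front.
d = 64
rng = np.random.default_rng(0)
layers = [rng.normal(size=(d, d)).astype(np.float16) for _ in range(4)]
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    for w in layers:
        f.write(w.tobytes())

def forward(x):
    """Stream each layer from disk, use it once, discard it.
    No weight ever lands back on the SSD: a read-only workload."""
    with open(path, "rb") as f:
        for _ in range(len(layers)):
            # float16 = 2 bytes per element, so one layer is 2*d*d bytes.
            w = np.frombuffer(f.read(2 * d * d), dtype=np.float16).reshape(d, d)
            x = np.maximum(x @ w, 0)  # use the layer for this step...
            del w                     # ...then drop it before reading the next
    return x

out = forward(np.ones(d, dtype=np.float16))
print(out.shape)  # (64,)
```

Which is why the wear question upthread mostly comes down to read disturb rather than write endurance: every token repeats the same ~209GB of reads, but nothing is written.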