simonw 3 hours ago

I got this running on a 128GB M5 the other day - pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution.

chatmasta 14 minutes ago | parent | next [-]

So you’re saying I should buy the M5? :) I’ve been resisting, thinking I’ll never use it… it’ll be better in a year… I’ll wait for the Studio (do we still think that’s coming in June?)… etc.

perfmode 2 hours ago | parent | prev [-]

How’s the token throughput / response time?

simonw 2 hours ago | parent [-]

Healthy!

  prefill: 30.91 t/s, generation: 29.58 t/s
From https://gist.github.com/simonw/31127f9025845c4c9b10c3e0d8612...

embedding-shape 2 hours ago | parent | next [-]

Comparison with an RTX Pro 6000, with DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf:

prefill: 121.76 t/s, generation: 47.85 t/s

The main target seems to be Apple's Metal, so that makes sense. Might be fun to see how fast one could make it go, though :) The model seems really good too, even though it's an IQ2 quant.

xienze 2 hours ago | parent | prev [-]

I don't want to be a jerk, but 31 t/s prefill is basically unusable in an agentic situation. A mere 10k tokens of context and you're sitting there for 5+ minutes before the first token is generated.
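As a quick sanity check on that claim, a back-of-envelope script using the prefill rates quoted upthread:

```python
# Back-of-envelope time-to-first-token (TTFT): with no prompt caching, the
# whole context must be processed at the prefill rate before generation starts.
# Rates taken from the numbers posted above (M5 vs. RTX Pro 6000).

def ttft_seconds(context_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    return context_tokens / prefill_tps

for rate in (30.91, 121.76):
    mins = ttft_seconds(10_000, rate) / 60
    print(f"{rate:7.2f} t/s prefill -> {mins:.1f} min wait for a 10k-token prompt")
```

At ~31 t/s that works out to about 5.4 minutes for 10k tokens, which matches the "5+ minutes" complaint; the RTX figure cuts it to under a minute and a half.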

fgfarben an hour ago | parent | next [-]

That prefill number isn't right. M4 Max hits 200-300: https://github.com/antirez/ds4/blob/main/speed-bench/m4_max_...

aiscoming 2 hours ago | parent | prev [-]

If it's just the coding agent system prompt and tools, you can cache that.

xienze 2 hours ago | parent [-]

Yeah, the problem is that's just the start of the context. There's, you know, all the tool call results and file reads and stuff.
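That said, an agent transcript is append-only (system prompt, then tool results, then file reads, each appended to the end), so a runtime that keeps the KV state alive between turns only needs to prefill the newly appended tokens. A toy accounting sketch of that, assuming such incremental caching is available:

```python
# With an append-only context and a persistent KV cache, each turn's prefill
# covers only the tokens appended since the last turn, not the whole context.
# Toy accounting only; real runtimes differ in how they match/evict state.

def tokens_to_prefill(cached_len: int, full_context_len: int) -> int:
    """Only the new suffix of an append-only context needs processing."""
    assert full_context_len >= cached_len
    return full_context_len - cached_len

cached = 0
for turn_ctx in (10_000, 14_000, 19_500):  # context grows every turn
    new = tokens_to_prefill(cached, turn_ctx)
    print(f"context {turn_ctx:>6} tokens -> prefill only {new} new tokens")
    cached = turn_ctx
```

So at 31 t/s the painful wait is mostly the first turn; subsequent turns only pay for the tool output and file contents added since.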