MacBooks with their unified memory behave like a slow GPU with an enormous amount of video RAM. So you can run large, smart models slowly.

Dedicated GPUs have less video RAM, so they can run smaller, less smart models quickly.

▲

exabrial a day ago | parent | next [-]

Do Mac Pros provide more headroom? Noob here, noob question.

	▲	zihotki 17 hours ago | parent | next [-]
		Not that much - a bit better but still negligible. Prefill is highly parallelizable and benefits from many cores. NVIDIA GPUs simply have several times more cores than even Pros.
	▲	JSR_FDED a day ago | parent | prev | next [-]
		In what sense? Headroom for what?
	▲	rho138 a day ago | parent | prev [-]
		Idk why you’re being downvoted for asking a question. Depending on the specs they _could_ provide more headroom for a larger model, but they would still be limited by the CPU and its associated bus speeds.

▲

visarga a day ago | parent | prev | next [-]

MacBook M5 64GB - can run gemma-4-26b-a4b-it-4bit and Qwen3.6-35B-A3B-4bit at about 1500 tps prefill and 45 tps decode on contexts up to 100K tokens using MLX. It's faster than Claude. I was really surprised; chat quality is also similar to Claude for gemma4. Agentic use works but does not compare to cloud models, though you can still build agents where the top level is code.
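
To put those throughput figures in perspective, here is a rough back-of-the-envelope sketch. It assumes prefill and decode speed stay constant over the whole context, which is unlikely to hold exactly in practice, and the function name and the 500-token reply length are made up for illustration.

```python
def response_time_s(prompt_tokens, output_tokens,
                    prefill_tps=1500.0, decode_tps=45.0):
    """Return (prefill seconds, decode seconds) at constant throughput.

    The defaults are the figures quoted in the comment above; the
    constant-throughput assumption is a simplification.
    """
    prefill_s = prompt_tokens / prefill_tps  # time before the first output token
    decode_s = output_tokens / decode_tps    # time to generate the reply
    return prefill_s, decode_s


# Hypothetical workload: a full 100K-token context and a 500-token reply.
prefill_s, decode_s = response_time_s(100_000, 500)
print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s")
# prints "prefill: 66.7 s, decode: 11.1 s"
```

So under these assumptions a full 100K-token prompt means waiting about a minute before the first token appears, while the reply itself streams in roughly ten seconds.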

▲

mzubairtahir a day ago | parent [-]

Sorry, but asking again: how much memory is actually usable by the GPU in a MacBook, given that it is shared (the OS and apps also have to use the same memory)? And is it different from dedicated GPU memory?

	▲	gizajob a day ago | parent | next [-]
		It’s completely shared, so the OS and everything else take up maybe 8GB of the RAM. On a 64GB machine you can run models about 45GB in size and still have space for those models to run other tasks which themselves might need RAM. To a user, the GPU appears to just use as much RAM as it needs, the same as any other process running on the system. You can see how much space your LLMs are taking up in Activity Monitor (or htop) and how much GPU capacity they’re using (all of it).
	▲	cco a day ago | parent | prev | next [-]
		Rule of thumb is about 70-80%.
	▲	j45 a day ago | parent | prev | next [-]
		You can adjust the percentage available both on the macOS side and how much the model uses.
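
The rules of thumb in the replies above can be turned into a quick fit check. This is only a sketch: the helper names are made up, the 75% default is just the midpoint of the 70-80% rule, and the weight estimate ignores quantization metadata, the KV cache, and runtime overhead, so real usage will be higher.

```python
def gpu_budget_gb(total_ram_gb, usable_fraction=0.75):
    """Unified memory the GPU can plausibly use, per the 70-80% rule of thumb."""
    return total_ram_gb * usable_fraction


def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters * bits / 8 bytes.

    Ignores quantization scales, KV cache and activations (an assumption).
    """
    return params_billion * bits_per_weight / 8


budget = gpu_budget_gb(64)   # 64GB machine -> 48.0 GB budget at 75%
weights = weights_gb(35, 4)  # 35B parameters at 4 bits -> 17.5 GB
print(f"budget {budget:.1f} GB, weights {weights:.1f} GB, fits: {weights < budget}")
# prints "budget 48.0 GB, weights 17.5 GB, fits: True"
```

The gap between the two numbers is what is left over for a long context's KV cache and for everything else running on the machine.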

▲

EagnaIonat a day ago | parent | prev | next [-]

> MacBooks with their unified memory behave like a slow GPU with an enormous amount of video RAM. So you can run large, smart models slowly.

With the model using MLX the speed increase is night and day. Even non-MLX is good.

You also don't have the transfer costs related to moving CPU data into the GPU.


▲

mzubairtahir a day ago | parent | prev [-]

How much memory is actually usable by the GPU in a MacBook, given that it is shared?

	▲	pylotlight a day ago | parent [-]
		Roughly 50–56GB, although this is somewhat configurable with iogpu.wired_limit_mb. By default, macOS reserves ~25% of memory for the system.
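
For anyone wanting to try a different limit, here is a small sketch of the arithmetic. The helper names are hypothetical and the 25% default reserve is taken from the comment above. The script only prints a command in the usual `sysctl name=value` form, which is my assumption of how this key is set, and never runs it; check the behaviour on your own macOS version before changing anything.

```python
def wired_limit_mb(total_ram_gb, reserve_fraction=0.25):
    """MB of unified memory left for the GPU after reserving a share for the system."""
    total_mb = total_ram_gb * 1024
    return int(total_mb * (1 - reserve_fraction))


def sysctl_command(limit_mb):
    """Build (but do not execute) the command string for the given limit."""
    return f"sudo sysctl iogpu.wired_limit_mb={limit_mb}"


print(wired_limit_mb(64))  # 25% reserved on a 64GB machine -> 49152 (48 GB)
print(sysctl_command(wired_limit_mb(64, reserve_fraction=0.125)))
# prints "sudo sysctl iogpu.wired_limit_mb=57344" (56 GB)
```

Leaving too little for the system risks swapping or instability, so it is probably wise to raise the limit in small steps.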