jbellis 6 hours ago

I chased down what the "4x faster at AI tasks" was measuring:

> Testing conducted by Apple in January 2026 using preproduction 13-inch and 15-inch MacBook Air systems with Apple M5, 10-core CPU, 10-core GPU, 32GB of unified memory, and 4TB SSD, and production 13-inch and 15-inch MacBook Air systems with Apple M4, 10-core CPU, 10-core GPU, 32GB of unified memory, and 2TB SSD. Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization, and LM Studio 0.4.1 (Build 1). Performance tests are conducted using specific computer systems and reflect the approximate performance of MacBook Air.
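
For scale, a quick back-of-envelope (my own rough numbers, not from Apple's footnote) on what a 14B-parameter model at 4-bit quantization occupies in unified memory:

```python
# Rough footprint of a 14B model at 4-bit quantization.
# The 4.5 bits/weight figure is an assumption: ~4 bits of weights
# plus typical quantization overhead (scales, zero points).
params = 14e9
bits_per_weight = 4.5
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # → ~7.9 GB of weights
```

So the benchmark model fits comfortably in the 32GB test machines, with room left for the KV cache and the OS.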

gslepak an hour ago | parent | next [-]

That is talking about battery life, not AI tasks. Footnote 53, where it says, "Up to 18 hours battery life":

https://www.apple.com/macbook-pro/

whynotmaybe 3 hours ago | parent | prev | next [-]

Quite interesting that it's now a selling point just like fps in Crysis was a long time ago.

re-thc 3 hours ago | parent [-]

Next is the fps of an AI playing Crysis.

dana321 2 hours ago | parent [-]

Or tasks per minute of the AI doing your job for you

jayde2767 an hour ago | parent | next [-]

That measurement will be AI assembling MacBook Pros vs. human assemblers: number of units per hour, per day, or whatever unit is most applicable.

re-thc 2 hours ago | parent | prev [-]

-1

butILoveLife 3 hours ago | parent | prev | next [-]

>Time to first token measured with an 8K-token prompt using a 14-billion parameter model with 4-bit quantization

Oh dear, 14B and 4-bit quant? There are going to be a lot of embarrassed programmers who need to explain to their engineering managers why their MacBook can't reasonably run LLMs like they said it could. (This already happened at my Fortune 20 company, lol)

fulafel 2 hours ago | parent | prev | next [-]

So it's not measuring output tokens/s, just how long it takes to start generating tokens. Seems we'll have to wait for independent benchmarks to get useful numbers.

dotancohen 10 minutes ago | parent [-]

For many workflows involving real-time human interaction, such as voice assistants, this is the most important metric. Very few tasks are as sensitive to quality, once a certain response-quality threshold has been reached, as the software planning and writing tasks most HN readers are familiar with.

lastdong 3 hours ago | parent | prev | next [-]

14-billion parameter model with 4-bit quantization seems rather small

derefr an hour ago | parent | next [-]

I think these aren't meant to be representative of arbitrary userland-workload LLM inferences, but rather the kinds of tasks macOS might spin up a background LLM inference for. Like the Apple Intelligence stuff, or Photos auto-tagging, etc. You wouldn't want the OS to ever be spinning up a model that uses 98% of RAM, so Apple probably considers themselves to have at most 50% of RAM as working headroom for any such workloads.
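
Rough sanity check of that headroom idea (the 50% figure, layer count, and hidden size below are my assumptions, not Apple's specs):

```python
# Does a 14B 4-bit model plus an 8K-token KV cache fit in half of 32 GB?
ram_gb = 32
headroom_gb = ram_gb * 0.5               # assumed OS policy: at most half of RAM
model_gb = 14e9 * 4.5 / 8 / 1e9          # ~4.5 bits/weight effective, ~7.9 GB
# KV cache: tokens * layers * (K and V) * hidden size * 2 bytes (fp16).
# 40 layers and d_model=5120 are typical for a 14B model, not a known spec.
kv_gb = 8192 * 40 * 2 * 5120 * 2 / 1e9   # ~6.7 GB
print(model_gb + kv_gb < headroom_gb)    # → True (about 14.6 GB vs 16 GB)
```

So the benchmark config is roughly the largest thing that fits under that kind of policy, which supports the "background OS workload" reading.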

simlevesque 3 hours ago | parent | prev | next [-]

It's not much for a frontier AI but it can be a very useful specialized LLM.

giancarlostoro 3 hours ago | parent | prev | next [-]

On my 24GB RAM M4 Pro MBP some models run very quickly through LM Studio into Zed; I was able to ask it to write some code. Of course my fan starts spinning like the world's ending, but it's still impressive what I can do 100% locally. I can't imagine what it's like on a more serious setup like the Mac Studio.

efxhoy 2 hours ago | parent [-]

How is the output quality of the smaller models?

bilbo0s 3 hours ago | parent | prev | next [-]

It is.

That's how they make loot on their 128GB MacBook Pros. By kneecapping the cheap stuff. Don't think for a second that the specs weren't chosen so that professional developers would have to shell out the 8 grand for the legit machine. They're only gonna let us do the bare minimum on a MacBook Air.

butILoveLife 3 hours ago | parent | prev [-]

For anyone who has been watching Apple since the iPod commercials: Apple has really, really operated in a grey area when it comes to honesty in its marketing.

And not even diehard Apple fanboys deny this.

I genuinely feel bad for people who fall for their marketing thinking they'll run LLMs. Oh well, I got scammed on RuneScape as a child when someone said they could trim my armor... Everyone needs to learn.

giwook 2 hours ago | parent | next [-]

I don't know that there would be a huge overlap between the people who would fall for this type of marketing and the people who want to run LLMs locally.

There definitely are some who fit into this category, but if they're buying the latest and greatest on a whim then they've likely got money to burn and you probably don't need to feel bad for them.

Reminds me of the saying: "A fool and his money are soon parted".

zitterbewegung 2 hours ago | parent | prev [-]

Yesterday I ran qwen3.5:27b on an M1 Max with 64 GB of RAM. I even ran Llama 70B when llama.cpp came out. They run sufficiently well, if somewhat slowly, but given the improvements in the M5 Max it should be a much faster experience.

azinman2 6 hours ago | parent | prev [-]

Seems very reasonable to me

tux3 5 hours ago | parent | next [-]

A bit strange to use time to first token instead of throughput.

Latency to the first token is not like a web page, where first paint already has useful things to show. The first token is "The ", and you'll be very happy it's there in 50ms instead of 200ms... but then what you really want to know is how quickly you'll get the rest of the sentence (throughput).
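
To make that concrete, here's a toy model of total response time (illustrative numbers I made up, not measurements of any real machine):

```python
# Total time to see a full answer: wait for the first token,
# then stream the rest at the decode rate.
def total_time(ttft_s, out_tokens, tokens_per_s):
    return ttft_s + out_tokens / tokens_per_s

# For a 400-token answer, halving TTFT saves 1 s,
# while doubling throughput saves 10 s:
print(total_time(2.0, 400, 20))  # → 22.0
print(total_time(1.0, 400, 20))  # → 21.0
print(total_time(2.0, 400, 40))  # → 12.0
```

For short answers or interactive back-and-forth the balance flips, which is presumably why both metrics matter.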

jbellis 5 hours ago | parent | next [-]

As far as benchmarketing goes, they clearly went with prefill because it's much easier for Apple to improve prefill numbers (FLOPS-dominated) than decode numbers (bandwidth-dominated, at least for local inference); the M5's unified memory bandwidth is only about 10% better than the M4's.
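
The split falls out of a simple roofline-style estimate (the TFLOPS and GB/s figures below are placeholder hardware numbers, not Apple's specs):

```python
# Prefill processes all prompt tokens in parallel -> compute-bound.
# Decode re-reads every weight for each new token -> bandwidth-bound.
params = 14e9
bytes_per_param = 0.5625            # ~4.5 bits/weight effective
flops_per_token = 2 * params        # ~2 FLOPs per parameter per token

def prefill_s(prompt_tokens, tflops):
    return prompt_tokens * flops_per_token / (tflops * 1e12)

def decode_tok_s(bandwidth_gbs):
    return bandwidth_gbs * 1e9 / (params * bytes_per_param)

print(f"{prefill_s(8192, 15):.1f} s prefill")   # ~15.3 s at an assumed 15 TFLOPS
print(f"{decode_tok_s(150):.0f} tok/s decode")  # ~19 tok/s at an assumed 150 GB/s
```

Double the FLOPS and prefill time halves; decode barely moves unless bandwidth moves, which matches the ~10% bandwidth bump.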

GeekyBear 5 hours ago | parent | prev | next [-]

In previous generations, throughput was excellent for an integrated GPU, but the time to first token was lacking.

danudey 5 hours ago | parent [-]

So throughput was already good but TTFT was the metric that needed more improvement?

zamadatix 4 hours ago | parent | next [-]

To add to the sibling's "good is relative": it also depends on what you're running, not just your tolerance for what counts as good. E.g. with a MoE, the decode speedup means the prompt-processing delay is more noticeable for the same model size in RAM.

convenwis 5 hours ago | parent | prev [-]

Good is relative but first token was clearly the biggest limitation.

case540 5 hours ago | parent | prev | next [-]

I assume it's time to first output token, so it's basically throughput: how fast can it output 8,001 tokens?

fragmede 5 hours ago | parent | prev [-]

No you don't. Not as a sticky, mushy human with emotions watching tokens drip in. There's a lot of feeling and emotion not backed by hard facts and data going around, and most people would rather see something happening even if it takes longer overall. Hence spinner.gif, which doesn't actually do a damned thing, but gives users reassurance that they're waiting for something good. So human psychology makes time to first token an important metric to look at, although it's not the only one.

MrDrMcCoy 5 hours ago | parent [-]

Some kinds of spinners serve as a coal-mine canary indicating if the app has gotten wedged. Not hugely useful, but also not entirely useless.

nabakin 5 hours ago | parent | prev [-]

I would consider it reasonable if this was 4x TTFT and Throughput, but it seems like it's only for TTFT.