Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment. Local models are inherently inferior; even the best Mac that money can buy will never hold a candle to latest generation Nvidia inference hardware, and the local models, even the largest, are still not quite at the frontier. The ones you can plausibly run on a laptop (where "plausible" really is "45 minutes and making my laptop sound like it is going to take off at any moment". Like they said -- you're getting sonnet 4.5 performance which is 2 generations ago; speaking from experience opus 4.6 is night and day compared to sonnet 4.5

▲

zozbot234 4 hours ago | parent [-]

> Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment.

But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground. that may not give you the absolute best in performance but will be found broadly acceptable and still be quite viable for a home lab.

	▲	aspenmartin 3 hours ago \| parent [-]
		Batching helps with efficiency but you can’t fit opus into anything less than hundreds of thousands of dollars in equipment Local models are more than a useful middle ground they are essential and will never go away, I was just addressing the OPs question about why he observed the difference he did. One is an API call to the worlds most advanced compute infrastructure and another is running on a $500 CPU. Lots of uses for small, medium, and larger models they all have important places!!