zozbot234 | 4 hours ago:
> Well, Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment. But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground that may not give you the absolute best in performance but will be found broadly acceptable and still be quite viable for a home lab.
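The utilization point can be illustrated with a toy back-of-envelope cost model (the numbers below are made up for illustration, not real hardware figures): each forward pass pays a roughly fixed cost to stream the model weights from GPU memory, plus a small per-sequence compute cost, so batching many users' requests amortizes the fixed cost.

```python
# Toy cost model for batched LLM inference.
# Assumption: a forward pass is dominated by a fixed weight-streaming
# cost, plus a small marginal cost per sequence in the batch.
WEIGHT_LOAD_MS = 50.0   # hypothetical: time to stream weights once per pass
PER_SEQ_MS = 0.5        # hypothetical: extra compute per sequence in the batch

def ms_per_sequence(batch_size: int) -> float:
    """Per-sequence cost of one forward pass at a given batch size."""
    return WEIGHT_LOAD_MS / batch_size + PER_SEQ_MS

print(ms_per_sequence(1))    # single user pays the full weight-load cost: 50.5
print(ms_per_sequence(128))  # a 128-way batch amortizes it: ~0.89
```

Under these assumed numbers, a single user leaves the hardware mostly idle, while a large batch gets dozens of times more throughput from the same GPUs, which is why providers batch aggressively.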
aspenmartin | 3 hours ago:
Batching helps with efficiency, but you can't fit Opus into anything less than hundreds of thousands of dollars in equipment. Local models are more than a useful middle ground; they are essential and will never go away. I was just addressing the OP's question about why he observed the difference he did: one is an API call to the world's most advanced compute infrastructure and the other is running on a $500 CPU. There are lots of uses for small, medium, and large models; they all have important places!