CPU compute is infinity times less expensive and much easier to work with in general

Less expensive how? GPUs are used because they are more efficient. You CAN run matmul on CPUs, sure, but it's going to be much slower and use a lot more electricity. So claiming it's "less expensive" is weird.

	dspillett an hour ago
		In situations where you have spare CPU power but no spare GPU power, because your GPU(s) and VRAM are busy with other tasks, you might prefer to use what you have rather than pay for an upgrade (even if that means the task runs more slowly). If you want to run this on a server to pipe the generated speech to a remote user (live, or generated ahead of time to send at some other appropriate moment) and your servers don't have GPUs, then you either have to change your infrastructure, use CPU, or not bother. Renting GPU access on cloud systems can be more expensive than CPU, especially if you only need GPU processing for specific, occasional tasks. Spinning up a VM to serve a request and then tearing it down is rarely as quick as cloud providers like to suggest in their advertising, so you end up keeping things alive longer than strictly needed, which means you pay more than the quoted spot-pricing rates suggest.
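To put rough numbers on the keep-alive point, here is a minimal sketch. The rates and durations are invented for illustration and are not real cloud prices; `job_cost` is a hypothetical helper, not any provider's API.

```python
def job_cost(hourly_rate: float, compute_hours: float, overhead_hours: float) -> float:
    """Billed cost of one job: you pay for the time spent computing plus
    the time the instance sits alive around it (startup, idle, teardown)."""
    if hourly_rate < 0 or compute_hours < 0 or overhead_hours < 0:
        raise ValueError("rates and durations must be non-negative")
    return hourly_rate * (compute_hours + overhead_hours)


# Assumed figures: the GPU finishes in 0.5 h but the VM is kept alive for
# another 3.5 h; the CPU box takes 10x longer but was running anyway.
gpu_cost = job_cost(hourly_rate=2.0, compute_hours=0.5, overhead_hours=3.5)
cpu_cost = job_cost(hourly_rate=0.25, compute_hours=5.0, overhead_hours=0.0)
print(gpu_cost, cpu_cost)
```

With these made-up inputs the slower CPU job is the cheaper one. Different assumptions can flip the result, which is the point: the billed hours matter, not just the compute hours.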
	woadwarrior01 23 minutes ago
		GPUs are a near monopoly. There are at least a handful of big players in the CPU space. Competition alone makes the latter a lot cheaper. Also, for inference (as opposed to training) there are other ways to do matmuls efficiently besides the GPU. You might want to look up Apple's undocumented AMX CPU ISA, and also the thing that vendors call the "Neural Engine" in their marketing (the capabilities and the term's specific meaning vary broadly from vendor to vendor). For small 1-3B parameter transformers like TADA, both of these options are much more energy efficient than GPU inference.
	g-mork an hour ago
		This is far too simplistic. You can't discuss perf per watt unless you're talking about a job running at a decent level of utilisation. Numbers like that only matter for larger-scale, high-utilisation services. Meanwhile Intel boxes mastered the art of power-efficient idle modes decades ago, while almost any contemporary GPU still isn't even remotely close, and you can pick up 32-core boxes like that for pennies on the dollar.

		Even if utilisation weren't a factor, "efficient" can be interpreted in so many ways that it is pointless to try to apply it in the general case. I consider any model I can foist into a Lambda function "efficient" because of secondary concerns you simply cannot meaningfully address with GPU hardware at present (elasticity and manageability, for example). That it burns more energy per unit of output is almost meaningless for any kind of workload where Lambda would be applicable.

		It's the same for any edge-deployed software, where "does it run on CPU?" translates to "does the general-purpose user have a snowball's chance in hell of running it?". Having to depend on 4GB of CUDA libraries to run a utility fundamentally changes the nature and applicability of any piece of software.

		A few years ago we had smaller cuts of Whisper running at something like 0.5x real time on CPU, and people struggled along anyway. Now we have Nvidia's speech model family comfortably exceeding 2x real time on older processors with a far better word error rate. Which would you prefer to deploy to an edge device? Which increases the total number of addressable users? Turns out we never needed GPUs for this problem in the first place. The model architecture mattered all along, as did the question, "does it run on CPU?".

		It's not even clear-cut when discussing raw achievable performance. With a CPU-friendly speech model living in a Lambda, no GPU configuration will come close to the achievable peak throughput for the same level of investment. Got a year-long audio recording to process once a year? Slice it up and Lambda will happily chew through it at 500 or 1000x real time.
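The fan-out arithmetic behind that last claim can be sketched as below. The 2x real time per worker comes from the comment above. The 10-minute chunk size and the 500-worker concurrency limit are assumptions picked for illustration, and the helper names are hypothetical. Cold starts, upload time, and stitching at chunk boundaries are ignored.

```python
import math


def slice_recording(total_seconds: int, chunk_seconds: int) -> list[tuple[int, int]]:
    """Split a recording into (start, end) offsets, in seconds."""
    return [
        (start, min(start + chunk_seconds, total_seconds))
        for start in range(0, total_seconds, chunk_seconds)
    ]


def workers_needed(target_speedup: float, per_worker_speed: float) -> int:
    """Concurrent workers required to hit an aggregate real-time multiple."""
    return math.ceil(target_speedup / per_worker_speed)


def wall_clock_seconds(total_seconds: int, chunk_seconds: int,
                       per_worker_speed: float, max_concurrency: int) -> float:
    """Estimate elapsed time when chunks are processed in waves of
    max_concurrency workers, each wave taking as long as one full chunk."""
    chunks = slice_recording(total_seconds, chunk_seconds)
    waves = math.ceil(len(chunks) / max_concurrency)
    return waves * chunk_seconds / per_worker_speed


YEAR = 365 * 24 * 3600  # one year of audio, in seconds
elapsed = wall_clock_seconds(YEAR, chunk_seconds=600,
                             per_worker_speed=2.0, max_concurrency=500)
print(workers_needed(1000, 2.0), elapsed, YEAR / elapsed)
```

Under these assumptions, 500 workers at 2x real time each get through the year of audio in a little under nine hours, which is roughly the 1000x figure.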