derefr | 4 days ago
> There are no on-demand GPUs at this scale.

> These two factors mean that GPU utilization comes in at 10-20%.

Why don't these two factors cancel out? Why wouldn't a company building a private GPU cluster for its own use also sit a workload scheduler (e.g. Slurm) in front of it, enable credit accounting + usage-based billing on it, and then let validated customer partners of theirs push batch jobs to their cluster, where each such job receives a huge spot resource allocation in what would otherwise be the cluster's low-duty period, to run to completion as quickly as possible?

Just a few such companies (and universities) deciding to rent their excess inference capacity out to local SMEs would mean that there would then be "on-demand GPUs at this scale." (You'd have to go through a few meetings to get access to it, but no more than is required to, e.g., get a mortgage on a house. Certainly nothing as bad as getting VC investment.)

This has always been precisely how the commercial market for HPC compute works: the validated customers of an HPC cluster send off their flights of independent "wide but short" jobs, which get resource-packed and fair-scheduled among other clients' jobs into a 2D (nodes, time) matrix, with everything getting executed overnight, just a few wide jobs at a time.

So why don't we see a similar commercial "GPU HPC" market? I can only assume that the companies building such clusters are either:

- investor-funded, and therefore not concerned with dedicating effort to minimizing the TCO of their GPUs, when they could instead put all their engineering and operational labor into grabbing market share; or

- bigcorps so big that they have contracts with one big overriding "customer" that can soak up 100% of their spare GPU-hours: their state's military / intelligence apparatus.

...or, if not, then it must turn out that these clusters are being 100% utilized by their owners themselves, however unlikely that may seem.
Because if none of these statements are true, then there's just a proverbial $20 bill sitting on the ground here. (And the best kind of $20 bill, too, from a company's perspective: rent extraction.)
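The "wide but short" packing described above can be sketched as a greedy shortest-job-first placement over a (nodes, time) grid. To be clear, this is a toy illustration of the packing idea, not Slurm's actual backfill algorithm; the cluster size, window length, and job tuples are all made up:

```python
# Toy packing of batch jobs into a 2D (nodes, time) grid.
# NOT Slurm's real scheduler; just illustrates fitting "wide but short"
# jobs into an otherwise-idle overnight window. All numbers are invented.

CLUSTER_NODES = 64   # assumed cluster width
WINDOW_HOURS = 8     # assumed overnight low-duty window

def schedule(jobs):
    """jobs: list of (name, nodes_needed, hours_needed).
    Returns {name: start_hour} for every job that fits in the window."""
    free = [CLUSTER_NODES] * WINDOW_HOURS  # nodes still free at each hour
    placed = {}
    # Shortest jobs first, so many clients' jobs complete per window.
    for name, nodes, hours in sorted(jobs, key=lambda j: j[2]):
        for start in range(WINDOW_HOURS - hours + 1):
            # First-fit: earliest start where enough nodes stay free
            if all(free[t] >= nodes for t in range(start, start + hours)):
                for t in range(start, start + hours):
                    free[t] -= nodes
                placed[name] = start
                break
    return placed

jobs = [("train-A", 48, 2), ("train-B", 32, 3), ("infer-C", 16, 1)]
print(schedule(jobs))  # train-A and infer-C share hour 0; train-B backfills at hour 2
```

A real scheduler adds fair-share weighting across accounts, preemption, and reservations on top of this kind of placement, but the core 2D bin-packing picture is the same.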
reachableceo | 3 days ago | parent | next
That is what I'm doing with my excess compute, fabrication, CNC, laser, 3D printing, reflow oven, etc. capacity in between hardware revs for my main product. I also bill out my trusted subcontractors.

I validate the compute renters because of ITAR; lots of hostile foreign powers are trying to access compute. My main business is ITAR-related, so I already have incredibly high security in place. We are multi-tenant from day zero and have Slurm etc. in place for accounting reasons for federal contracts. We're actually spinning up federal-contracting-as-a-service and will do a Show HN when that launches.

Riches in the niches, and the business of business :)
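The Slurm-for-accounting setup mentioned above can be sketched with stock Slurm tooling. The account, user, and partition names here are invented for illustration, and a real deployment needs slurmdbd configured behind these commands:

```shell
# Sketch of per-tenant usage accounting with Slurm (hypothetical names).

# Create a billing account for one validated renter, attach a user to it
sacctmgr -i add account acme_corp Description="validated compute renter"
sacctmgr -i add user alice Account=acme_corp

# The tenant submits batch work billed against their account
sbatch --account=acme_corp --partition=batch --nodes=4 job.sh

# Pull consumed resources per account for invoicing
sacct -A acme_corp --format=JobID,Elapsed,AllocNodes,AllocTRES
```

The accounting database is what makes usage-based billing to federal contracts (or outside renters) auditable: every job is tied to an account, and `sacct` reports what each account actually consumed.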
thenewwazoo | 4 days ago | parent | prev | next
> Why wouldn't a company ... let validated customer partners of theirs push batch jobs

A company standing up this infrastructure is presumably not in the business of selling time-shares of that infrastructure; it's busy doing AI B2B pet-food marketing or whatever. To make that sale, someone has to connect its underutilized assets with interested customers, which is outside its core competency. Who's going to do that? There's obviously an opportunity here for another company to act as a market maker, but that's hard, and it's a specialty of its own.
fooker | 4 days ago | parent | prev
The software stack for doing what you suggest would cost about a hundred million dollars to develop, over five to ten years.