brilee 4 days ago
For those commenting on cost per token: this throughput assumes 100% utilization. A bunch of things raise the cost at scale:

- There are no on-demand GPUs at this scale. You have to rent them on multi-year contracts, so you have to lock in enough GPUs for your maximum throughput (or some sufficiently high percentile), not your average throughput. Your peak throughput during West Coast business hours is probably 2-3x higher than the throughput in the tail hours (East Coast morning, West Coast evening).
- GPUs are often regionally locked due to data processing issues and latency issues. Thus, it's difficult to utilize these GPUs overnight, because Asia doesn't want its data sent to the US and the US doesn't want its data sent to Asia.

These two factors mean that GPU utilization comes in at 10-20%. Now, if you're a massive company that spends a lot of money on training new models, you could conceivably slot in RL inference or model training during these off-peak hours, maximizing utilization.

But for companies purely specializing in inference, I would _not_ assume that these 90% margins are real. I would guess that even when it seems "10x cheaper", you're only seeing margins of 50%.
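The arithmetic behind that guess can be sketched roughly as follows. This assumes (my assumption, not stated above) that GPU cost is entirely fixed by the contract, so cost per token scales as 1/utilization; the function name and the normalized price of 1.0 are illustrative only.

```python
def margin_at_utilization(margin_at_full: float, utilization: float) -> float:
    """Gross margin when fixed GPU costs are spread over fewer served tokens.

    Assumption: all cost is fixed (contracted GPUs), so cost per token
    scales as 1 / utilization. Price per token is normalized to 1.0.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    cost_at_full = 1.0 - margin_at_full          # cost per token at 100% utilization
    cost_per_token = cost_at_full / utilization  # same bill, fewer tokens served
    return 1.0 - cost_per_token


if __name__ == "__main__":
    # Hypothetical starting point: 90% margin at full utilization.
    for u in (1.0, 0.5, 0.2, 0.1):
        print(f"utilization {u:.0%}: margin {margin_at_utilization(0.90, u):.0%}")
```

Under these assumptions, a 90% margin at full utilization shrinks to about 50% at 20% utilization and to roughly nothing at 10%.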
empiko 4 days ago
You also need to consider that the field is moving really fast and you cannot really rely on being able to have the same margins in a year or two.
parhamn 4 days ago
Do we know how big the "batch processing" market is? I know the major providers offer 50%+ off for off-peak processing. I assumed it was to slightly correct this problem, and on the surface it seems like it'd be useful for big data places where process-eventually is enough, i.e. it could be a relatively big market. Is it?
koliber 4 days ago
These are great points. However, I don’t think these companies provision capacity for peak usage and let it idle during off-peak hours. I think they provision it at something a bit above average, and aim for 100% utilization for the maximum number of hours in the day. When there is not enough capacity to meet demand, they use various service degradation methods and/or load shedding.
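To make that trade-off concrete, here is a minimal sketch comparing provisioning for peak against provisioning slightly above average with load shedding. The demand curve (12 quiet hours at 1.0, 12 busy hours at 3.0) and the function name are hypothetical, chosen only to mirror the "2-3x peak" figure from the parent comment.

```python
def utilization_and_shed(hourly_demand: list[float], capacity: float) -> tuple[float, float]:
    """Return (utilization, shed_fraction) for a fixed hourly capacity.

    utilization   = work served / (capacity * number of hours)
    shed_fraction = demand above capacity / total demand
    """
    total = sum(hourly_demand)
    if capacity <= 0 or total <= 0:
        raise ValueError("capacity and total demand must be positive")
    # Demand above capacity in any hour is dropped (shed), not queued.
    served = sum(min(d, capacity) for d in hourly_demand)
    return served / (capacity * len(hourly_demand)), (total - served) / total


if __name__ == "__main__":
    demand = [1.0] * 12 + [3.0] * 12          # hypothetical day, peak is 3x the trough
    average = sum(demand) / len(demand)
    for label, cap in (("peak", max(demand)), ("1.1x average", 1.1 * average)):
        util, shed = utilization_and_shed(demand, cap)
        print(f"provision for {label}: utilization {util:.0%}, shed {shed:.0%}")
```

In this toy model, provisioning closer to average raises utilization at the price of dropping some peak-hour requests; real providers would presumably degrade or queue rather than drop outright.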
lbhdc 4 days ago
If you are willing to spread your workload out over a few regions, getting that many GPUs on demand can be doable. You can use something like compute classes on GCP to fall back to different machine types if you do hit stockouts. That doesn't make you immune to stockouts, but it makes you a lot more resilient. You can also use duty cycle metrics to scale down your GPU workloads to get rid of some of the slack.
jerrygenser 4 days ago
Re the overnight problem: that's why some providers offer batch-tier jobs at 50% off, which return results within up to 12 or 24 hours, for non-interactive use cases.
senko 3 days ago
You're not wrong. However, this all assumes realtime requirements. For batching, you can smooth over the demand curve, and you don't care about latency.
derefr 4 days ago
> There are no on-demand GPUs at this scale.

> These two factors mean that GPU utilization comes in at 10-20%.

Why don't these two factors cancel out? Why wouldn't a company building a private GPU cluster for its own use also put a workload scheduler (e.g. Slurm) in front of it, enable credit accounting + usage-based billing on it, and then let validated customer partners push batch jobs to the cluster, where each such job would receive huge spot resource allocations in what would otherwise be the cluster's low-duty period, to run to completion as quickly as possible?

Just a few such companies (and universities) deciding to rent their excess inference capacity out to local SMEs would mean that there would then be "on-demand GPUs at this scale." (You'd have to go through a few meetings to get access to it, but no more than is required to e.g. get a mortgage on a house. Certainly nothing as bad as getting VC investment.)

This has always been precisely how the commercial market for HPC compute works: the validated customers of an HPC cluster send off their flights of independent "wide but short" jobs, which get resource-packed + fair-scheduled among other clients' jobs into a 2D (nodes, time) matrix, with everything getting executed overnight, just a few wide jobs at a time.

So why don't we see a similar commercial "GPU HPC" market? I can only assume that the companies building such clusters are either:

- investor-funded, and therefore not concerned with dedicating effort to inventing ways to minimize the TCO of their GPUs, when they could instead put all their engineering + operational labor into grabbing market share
- bigcorps so big that they have contracts with one big overriding "customer" that can suck up 100% of their spare GPU-hours: their state's military / intelligence apparatus

...or, if not, then it must turn out that these clusters are being 100% utilized by their owners themselves, however unlikely that may seem.

Because if none of these statements is true, then there's just a proverbial $20 bill sitting on the ground here. (And the best kind of $20 bill, too, from a company's perspective: rent extraction.)
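For readers unfamiliar with the (nodes, time) packing mentioned above, here is a toy sketch. It is a plain first-fit placement, not Slurm's actual algorithm (no fair-share, priorities, or backfill), and the `Job` type, function name, and example jobs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nodes: int  # width: nodes needed at the same time
    slots: int  # length: consecutive time slots needed


def first_fit_schedule(jobs: list[Job], total_nodes: int, window_slots: int) -> dict[str, int]:
    """Place each job at the earliest start slot where enough nodes are free.

    Returns {job name: start slot}. Jobs that cannot fit inside the
    off-peak window are left out of the result.
    """
    free = [total_nodes] * window_slots  # free nodes in each time slot
    starts: dict[str, int] = {}
    for job in jobs:
        for start in range(window_slots - job.slots + 1):
            span = range(start, start + job.slots)
            if all(free[t] >= job.nodes for t in span):
                for t in span:
                    free[t] -= job.nodes  # reserve the rectangle in the matrix
                starts[job.name] = start
                break
    return starts


if __name__ == "__main__":
    # Hypothetical overnight window: 8 nodes, 4 time slots.
    jobs = [Job("a", 8, 1), Job("b", 4, 2), Job("c", 4, 2), Job("d", 8, 2)]
    print(first_fit_schedule(jobs, total_nodes=8, window_slots=4))
```

Here job "d" does not fit in the window once the others are placed, which is the kind of leftover a real scheduler would carry over to the next night or place via backfill.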