throwup238 4 hours ago
> even the internal talent to know whether they are hiring a good infrastructure engineer or not during the interview process

This is really the core problem. Every time I’ve done the math on a sizable cloud vs on-prem deployment, there is so much money left on the table that the orgs could afford to pay FAANG-level salaries for several good SREs, but we have never been able to find people to fill the roles, or even to know whether we had found them.

The numbers are so much worse now with GPUs. The cost of reserved instances (let alone on-demand) for an 8x H100 pod, even with NVIDIA Enterprise licenses included, leaves tens of thousands per pod for the salary of the employees managing it. Assuming one SRE can manage at least four racks, the hardware pays for itself, if you can find even a single qualified person.
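A back-of-envelope sketch of that math (a rough illustration only; every dollar figure below is an assumption plugged in for the example, not a number from the comment above):

    # Illustrative cost sketch; all figures are assumptions, not vendor quotes.
    CLOUD_COST_PER_POD_YR  = 250_000  # assumed reserved-instance cost, one 8x GPU pod
    ONPREM_COST_PER_POD_YR =  90_000  # assumed depreciation + power/cooling per pod
    PODS_PER_SRE           = 4        # the comment's assumption: one SRE covers ~4 racks
    SRE_SALARY             = 350_000  # assumed fully loaded FAANG-level comp

    savings = (CLOUD_COST_PER_POD_YR - ONPREM_COST_PER_POD_YR) * PODS_PER_SRE
    print(f"savings across one SRE's span: ${savings:,}/yr vs salary ${SRE_SALARY:,}")
    # -> $640,000/yr in savings against a $350,000 salary: the hardware pays for
    #    itself, provided you can actually hire that one qualified person.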
everforward 2 hours ago
I work in SRE, and the way you describe it would give me pause.

The first issue is that SRE team size primarily scales with the number of applications and the level of support. It does scale with hardware, but sublinearly; with the number of applications it usually scales superlinearly. It takes far less effort to manage 100 instances of a single app than 1 instance each of 100 separate apps (presuming SRE has any support responsibilities for the app). Talking purely in terms of hardware would make me worry that I’m looking at an impossible task.

The second (which you probably know, but it interacts with my next point) is that you never have single-person SRE teams because of oncall. Three is basically the minimum, four if you want to avoid oncall burnout.

The last is that I don’t know many SREs (maybe none at all) who are well-versed enough in all the hardware disciplines to manage a footprint the size we’re talking about. If each SRE covers 4 racks and the minimum team size is 4, that’s 16 racks. You’d need each SRE to be comfortable enough with networking, storage, operating systems, and compute scheduling (k8s, VMware, etc.) to manage every one of those aspects across a 16-rack system. In reality, it’s probably 3 teams, each needing 4 members for oncall, so a floor of something like 48 racks. Depending on how many applications you run on 48 racks, it might be even more SREs split into more specialized roles (a team for databases, a team for load balancers, etc.).

Numbers obviously vary by level of application support. If support ends at the compute layer with not a ton of app-specific config/features, that’s fewer folks. If you want SRE to be able to trace why a particular endpoint is slow right now, that’s more folks.
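To make that 48-rack floor concrete (just the assumptions stated above multiplied out, nothing more):

    # Staffing-floor sketch using only the assumptions stated in the comment.
    RACKS_PER_SRE    = 4   # the parent thread's assumption
    MIN_ONCALL_TEAM  = 4   # 3 is the bare minimum, 4 avoids oncall burnout
    SPECIALIST_TEAMS = 3   # e.g. networking, storage, compute scheduling

    min_sres = MIN_ONCALL_TEAM * SPECIALIST_TEAMS
    print(f"{min_sres} SREs -> a floor of about {min_sres * RACKS_PER_SRE} racks")
    # -> 12 SREs and ~48 racks before the per-rack math starts to work out.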
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Eridrus an hour ago
I disagree with on-prem being ideal for GPUs for most people. If you're doing regular inference for a product with very flat throughput requirements (and you're doing on-prem already), on-prem GPUs can make a lot of sense. But if you're doing a lot of training, you have very bursty requirements, and the H100s are specifically training hardware. If your H100 fleet is less than ~38% utilized over time, you're losing money. If you have batch throughput you can run on the H100s when you're not training, you're probably closer to wanting on-prem.

The other thing to keep in mind is that AWS is not the only provider. It is a particularly expensive one, and you can buy capacity from other neoclouds if you are cost-sensitive.
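A rough version of that break-even (the hourly rates here are assumptions; the on-prem figure is borrowed from the sibling cost comment, the rental rate is whatever you'd actually pay per hour used):

    # Utilization break-even sketch; rates are assumptions, not real pricing.
    ONPREM_HOURLY = 10.0   # assumed all-in owned-hardware cost; accrues 24/7, busy or idle
    RENTAL_HOURLY = 26.0   # assumed rate for renting equivalent capacity per hour used

    break_even = ONPREM_HOURLY / RENTAL_HOURLY
    print(f"break-even utilization: {break_even:.0%}")
    # ~38% with these inputs: below that, renting the bursty capacity beats owning it.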
tgrowazay 3 hours ago
A self-hosted 8x H100 server is ~$250k; depreciated across three years that's ~$80k/year, and with power and cooling ~$90k/year (~$10/hour total).

AWS charges $55/hour for an EC2 p5.48xlarge instance, which goes down with 1- or 3-year commitments. A 1-year commitment is ~$30/hour => ~$262k per year; a 3-year commitment brings it down to $24/hour => ~$210k per year. That price does NOT include egress and other fees.

So, yeah, there is a $120k-$175k difference that can pay for a full-time on-site SRE, even if you only need a single 8x H100 server. The numbers get better if you need more than one server like that.
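The same arithmetic spelled out (using only the prices quoted above, not checked against current AWS list pricing):

    # Reproduces the comparison above; hourly prices are the ones quoted in the comment.
    HOURS_PER_YEAR = 24 * 365          # 8,760
    onprem_yearly  = 90_000            # ~$250k over 3 years + power/cooling, as above

    for label, hourly in [("on-demand", 55), ("1-yr commit", 30), ("3-yr commit", 24)]:
        cloud_yearly = hourly * HOURS_PER_YEAR
        print(f"{label:>12}: ${cloud_yearly:,}/yr vs ${onprem_yearly:,}/yr on-prem "
              f"-> delta ${cloud_yearly - onprem_yearly:,}")
    # 1-yr commit: $262,800 vs $90,000 -> ~$173k/yr left over
    # 3-yr commit: $210,240 vs $90,000 -> ~$120k/yr left over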
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||