nickysielicki 7 days ago

Frankly, this is strictly a positive signal to me.

Fargate and Lambda are fundamentally very different from EC2/Nitro under the hood, with a very different security risk profile. The reason you can't run GPU workloads on top of Fargate and Lambda is that exposing physical third-party hardware to untrusted customer code dramatically increases the startup and shutdown costs (i.e., validating that the hardware is still functional, healthy, and hasn't been tampered with in any way). That means scrubbing takes a long time, and you can't handle capacity surges as easily as you can with paravirtualized traditional compute workloads.
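
To make "scrubbing" concrete: between tenants, the host has to confirm the card is still healthy before handing it to anyone else. A minimal Python sketch of what one such check-and-reset step could look like, assuming nvidia-smi's documented query fields and reset flag (this is an illustration, not AWS's actual procedure):

    import subprocess

    def scrub_gpu(index: int) -> bool:
        # 1. Look for uncorrected ECC errors left behind by the last tenant.
        out = subprocess.run(
            ["nvidia-smi", "-i", str(index),
             "--query-gpu=ecc.errors.uncorrected.volatile.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if out not in ("0", "[N/A]"):
            return False  # pull the card for deeper diagnostics
        # 2. Reset the GPU so the next tenant starts from a clean state.
        subprocess.run(["nvidia-smi", "--gpu-reset", "-i", str(index)],
                       check=True)
        return True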

There are a lot of business-minded non-technical people running AWS, some of whom are surely complaining loudly about this horrible loss of revenue... which simply lets you know that when push comes to shove, the right voices are still winning inside AWS (e.g., the voices that put security above everything else, where it belongs).

cyberax 7 days ago | parent

> Frankly, this is strictly a positive signal to me.

How?

> The reason you can't run GPU workloads on top of fargate and lambda is because exposing physical 3rd-party hardware to untrusted customer code dramatically increases the startup and shutdown costs

This is BS. Both NVIDIA and AMD offer virtualization extensions. And even without that, they can simply power-cycle the GPUs after switching tenants.
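
For reference, NVIDIA's MIG partitioning is one such extension. A rough sketch of carving a supported GPU into isolated tenant slices, assuming an A100-class card (profile ID 9 is the 3g.20gb slice there; treat the specific numbers as assumptions):

    import subprocess

    GPU = "0"  # hypothetical device index

    # Enable MIG mode (needs a MIG-capable GPU and root).
    subprocess.run(["nvidia-smi", "-i", GPU, "-mig", "1"], check=True)

    # Carve out two isolated GPU instances plus their compute instances.
    subprocess.run(["nvidia-smi", "mig", "-i", GPU, "-cgi", "9,9", "-C"],
                   check=True)

    # List the instances that could now go to separate tenants.
    subprocess.run(["nvidia-smi", "mig", "-lgi"], check=True)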

Moreover, Fargate is used for long-running tasks, and it definitely can run on a regular Nitro stack. They absolutely could provide GPUs for it, but that likely requires a lot of internal cross-team work to make happen. So it doesn't happen.

I worked at AWS, in a team responsible for EC2 instance launching. So I know how it all works internally :)

nickysielicki 7 days ago | parent

You'd have to build totally separate datacenters with totally different hardware than what they have today. You're not thinking about the complexity introduced by the use of PCIe switches. For starters, you don't have enough bandwidth to saturate all GPUs concurrently; they're sharing PCIe root-complex bandwidth, which is a non-starter if you want to define any kind of reasonable SLA. You can't really enforce limits, either. Even if you're able to tolerate that and sell customers on it, the security side is worse: all customer GPU transactions would be traversing a shared switch fabric, which means noisy bursty neighbors, timing side channels, etc., etc.
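
Back-of-the-envelope, to show why the SLA falls apart, using assumed (but typical) Gen4 figures:

    # 8 GPUs behind one PCIe switch whose uplink to the root complex is a
    # single Gen4 x16 link (~32 GB/s usable -- an assumed round figure).
    UPLINK_GBPS = 32.0   # shared uplink through the switch
    GPUS_BURSTING = 8

    worst_case = UPLINK_GBPS / GPUS_BURSTING
    print(f"per-GPU host bandwidth under full contention: {worst_case:.0f} GB/s")
    # -> 4 GB/s, an 8x drop from the 32 GB/s a lone tenant would see.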

cyberax 6 days ago | parent

> You'd have to build totally separate datacenters with totally different hardware than what they have today.

No? You can reset GPUs with regular PCIe commands.
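
A Linux host can do this through sysfs, no special hardware required. A sketch (the BDF is a placeholder; a real host would discover it via lspci):

    from pathlib import Path

    # Function-level reset via the kernel's PCI 'reset' attribute.
    bdf = "0000:3b:00.0"  # placeholder bus/device/function address
    reset = Path(f"/sys/bus/pci/devices/{bdf}/reset")

    if reset.exists():          # only present when the device supports reset
        reset.write_text("1")   # kernel performs the reset synchronously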

> You can't really enforce limits, either. Even if you're able to tolerate that and sell customers on it, the security side is worse

Welp. AWS is already totally insecure trash, it seems: https://aws.amazon.com/ec2/instance-types/g6e/ Good to know.

Not having GPUs on Fargate/Lambda is, at this point, just a sign of corporate impotence. They can't marshal internal teams to work together, so all they can do is build a wrapper/router for AI models that a student could vibe-code in a month.

We're doing AI models for aerial imagery analysis, so we need to train and host very custom code. Right now, we have to use third parties for that because AWS is way more expensive than the competition (e.g., https://lambda.ai/pricing), _and_ it's harder to use. And yes, we spoke with the sales reps about private pricing offers.

nickysielicki 6 days ago | parent

None of this applies to g6e because it doesn't have (or need) a PCIe switch: it has no RDMA support (nor NVLink), which means SR-IOV just works.
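
I.e., on the host you can just carve out virtual functions and pass each one to a guest. A sketch against the standard sysfs knobs (BDF is again a placeholder):

    from pathlib import Path

    bdf = "0000:3b:00.0"  # placeholder GPU address
    dev = Path(f"/sys/bus/pci/devices/{bdf}")

    # How many virtual functions the device supports.
    total = int((dev / "sriov_totalvfs").read_text())

    # Create them; each VF shows up as its own PCI function that can be
    # passed through to a separate guest. (numvfs must be 0 beforehand.)
    (dev / "sriov_numvfs").write_text(str(total))
    for vf in sorted(dev.glob("virtfn*")):
        print(vf.name, "->", vf.resolve().name)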

cyberax 6 days ago | parent

And what is your point? What is stopping AWS from using g6e or g4dn on Fargate to keep up with the competitors?

nickysielicki 6 days ago | parent

Nothing, but IMO it's a bad idea:

1. Customers who build a compute workload on top of Fargate have no future; newer hardware probably won't ever support it.
2. It's already ancient hardware from 3 years ago.
3. AWS now has to take responsibility for building an AMI with the latest driver, because the driver must always be newer than whatever toolkit is used inside the container (a sketch of that check follows below).
4. AWS needs to monitor those instances and write wrappers for things like DCGM.
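
The check mentioned in point 3, assuming NVIDIA's published floor of driver 525.60.13 for CUDA 12 user-space (treat the exact version as an assumption):

    import subprocess

    MIN_DRIVER_FOR_CUDA12 = (525, 60, 13)

    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()[0].strip()

    installed = tuple(int(p) for p in out.split("."))
    if installed < MIN_DRIVER_FOR_CUDA12:
        raise SystemExit(f"host driver {out} is too old for CUDA 12 containers")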

cyberax 6 days ago | parent

Fargate is simply a userspace application that manages containers, with some tie-ins to the AWS control plane for orchestration. It lets users simply request compute capacity from EKS/ECS without caring about autoscaling groups, launch templates, and all the other overhead.
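
From the user's side it's one API call. A boto3 sketch (cluster, task definition, and subnet names are placeholders):

    import boto3

    ecs = boto3.client("ecs")

    # Launch a container on Fargate: no ASG, no launch template, just a task.
    ecs.run_task(
        cluster="my-cluster",
        launchType="FARGATE",
        taskDefinition="my-model-task:1",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],
                "assignPublicIp": "ENABLED",
            }
        },
    )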

"AWS Lambda for model running" would be another nice service.

These are things that competitors already provide.

And this is not a weird nonsense requirement. It's something that a lot of serious AI companies now need. And AWS is totally dropping the ball.

> AWS now has to take responsibility for building an AMI with the latest driver, because the driver must always be newer than whatever toolkit is used inside the container.

They already do that for Bedrock, SageMaker, and other AI apps.