cyberax 6 days ago

> You'd have to build totally separate datacenters with totally different hardware than what they have today.

No? You can reset GPUs with regular PCI-e commands.
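
For reference, on a Linux host that reset is just a sysfs write. A minimal sketch in Python, assuming the device supports function-level reset; the PCI address is a placeholder:

    # Trigger a PCIe function-level reset through sysfs (Linux, as root).
    # The BDF below is a hypothetical placeholder; find the real one
    # with `lspci | grep -i nvidia`.
    BDF = "0000:3b:00.0"

    def reset_gpu(bdf: str) -> None:
        # Writing "1" asks the kernel to reset the function (an FLR if
        # the device supports it, otherwise a fallback reset method).
        with open(f"/sys/bus/pci/devices/{bdf}/reset", "w") as f:
            f.write("1")

    reset_gpu(BDF)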

> You can't really enforce limits, either. Even if you're able to tolerate that and sell customers on it, the security side is worse

Welp. AWS is already totally insecure trash, it seems: https://aws.amazon.com/ec2/instance-types/g6e/ Good to know.

Not having GPUs on Fargate/Lambda is, at this point, just a sign of corporate impotence. They can't marshal internal teams to work together, so all they can manage is a wrapper/router for AI models that a student could vibe-code in a month.

We're doing AI models for aerial imagery analysis, so we need to train and host very custom code. Right now we have to use third parties for that, because AWS is way more expensive than the competition (e.g. https://lambda.ai/pricing ) _and_ it's harder to use. And yes, we spoke with the sales reps about private pricing offers.

nickysielicki 6 days ago

None of this applies to g6e: it doesn't have (or need) a PCIe switch, since it has no RDMA support (nor NVLink), which means SR-IOV just works.
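
For the curious, SR-IOV on a device like that is likewise driven through plain sysfs attributes. A minimal sketch, assuming the GPU's driver exposes SR-IOV; the PCI address is a placeholder:

    # Enable SR-IOV virtual functions through sysfs (Linux, as root).
    # Hypothetical BDF; assumes the GPU's driver exposes SR-IOV.
    BDF = "0000:3b:00.0"

    def enable_vfs(bdf: str, wanted: int) -> None:
        base = f"/sys/bus/pci/devices/{bdf}"
        # sriov_totalvfs reports how many VFs the device supports.
        with open(f"{base}/sriov_totalvfs") as f:
            total = int(f.read())
        # Writing a count to sriov_numvfs instantiates the VFs; each
        # one can then be passed through to a separate guest.
        with open(f"{base}/sriov_numvfs", "w") as f:
            f.write(str(min(wanted, total)))

    enable_vfs(BDF, 4)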

cyberax 6 days ago

And what is your point? What is stopping AWS from offering g6e or g4dn on Fargate to keep up with the competitors?

nickysielicki 6 days ago

Nothing, but IMO it's a bad idea.

1. Customers who build a compute workload on top of Fargate have no future; newer hardware probably won't ever support it.

2. It's already ancient hardware from 3 years ago.

3. AWS now has to take responsibility for building an AMI with the latest driver, because the driver must always be newer than whatever toolkit is used inside the container (see the sketch below).

4. AWS needs to monitor those instances and write wrappers for things like DCGM.
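
To illustrate point 3, here's a minimal sketch of the kind of compatibility check AWS would have to own, using nvidia-smi's driver_version query; the minimum version is a made-up example, not a real requirement:

    # Check the host driver against the container's toolkit (point 3).
    # MIN_DRIVER is a hypothetical floor implied by the toolkit version.
    import subprocess

    MIN_DRIVER = (535, 0)

    def host_driver_version() -> tuple:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=driver_version",
             "--format=csv,noheader"],
            text=True)
        first = out.strip().splitlines()[0]  # one line per GPU
        # e.g. "535.154.05" -> (535, 154)
        major, minor = first.split(".")[:2]
        return (int(major), int(minor))

    if host_driver_version() < MIN_DRIVER:
        raise SystemExit("host driver too old for the container toolkit")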

cyberax 6 days ago

Fargate is simply a userspace application that manages containers, with some tie-ins to the AWS control plane for orchestration. It lets users simply request compute capacity from EKS/ECS without caring about autoscaling groups, launch templates, and all the other overhead.
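
To make the gap concrete: ECS task definitions already support a GPU resource type, but only for the EC2 launch type, not Fargate. A minimal boto3 sketch, with placeholder names and sizes:

    # Register an ECS task definition that requests a GPU (boto3).
    # This works with the EC2 launch type; Fargate rejects the GPU
    # resource requirement, which is exactly the gap discussed here.
    import boto3

    ecs = boto3.client("ecs")
    ecs.register_task_definition(
        family="aerial-inference",  # placeholder name
        requiresCompatibilities=["EC2"],  # ["FARGATE"] won't accept GPUs
        containerDefinitions=[{
            "name": "model",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/model:latest",
            "memory": 30720,  # MiB
            "resourceRequirements": [
                {"type": "GPU", "value": "1"},
            ],
        }],
    )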

"AWS Lambda for model running" would be another nice service.

Both are things that competitors already provide.

And this is not a weird nonsense requirement. It's something that a lot of serious AI companies now need. And AWS is totally dropping the ball.

> AWS now has to take responsibility for building an AMI with the latest driver, because the driver must always be newer than whatever toolkit is used inside the container.

They already do that for Bedrock, SageMaker, and their other AI services.