Amazon launches Trainium3(techcrunch.com)
78 points by thnaks 3 hours ago | 33 comments
ZeroCool2u 2 hours ago | parent | next [-]

I've had to repeatedly tell our AWS account reps that we're not even a little interested in the Trainium or Inferentia instances unless they have a provably reliable track record of working with the standard libraries we have to use like Transformers and PyTorch.

I know they claim they work, but that's only on their happy path with their very specific AMIs and the nightmare that is the Neuron SDK. Try to do any real work with them using your own dependencies and things tend to fall apart immediately.

It was only in the past couple of years that it really became worthwhile to use TPUs if you're on GCP, and that's only because of the huge investment on Google's part in software support. I'm not going to sink hours and hours into beta testing AWS's software just to use their chips.

ecshafer 2 hours ago | parent [-]

IMO, once you get off the core services, AWS is full of beta services. S3, Dynamo, Lambda, ECS, etc. are all solid. But a lot of the services they offer have some big rough patches.

kentm 2 hours ago | parent | next [-]

I'd add SQS to the solid category.

But yes, the less of a core building block the specific service is (or widely used internally in Amazon), the more likely you are to run into significant issues.

hnlmorg an hour ago | parent | prev | next [-]

This. 100 times this.

nextworddev an hour ago | parent | prev [-]

Kinesis is decent

zdc1 42 minutes ago | parent [-]

That's heartening to know. I find running Kafka less pleasant.

cmiles8 2 hours ago | parent | prev | next [-]

AWS keeps making grand statements about Trainium, but not a single customer comes on stage to say how amazing it is. Everyone I've talked to who tried it said there were too many headaches and they moved on. AWS pushes it hard, but “more price performant” isn’t a benefit if it’s a major PITA to deploy and run relative to other options. Chips without a quality developer experience aren’t gonna work.

Seems AWS is using this heavily internally, which makes sense, but I'm not seeing it get traction outside that. Glad to see Amazon investing there though.

phamilton an hour ago | parent | next [-]

The inf1/inf2 spot instances are so unpopular that they cost less than the equivalent cpu instances. Exact same (or better) hardware but 10-20% cheaper.

We're not quite seeing that on the trn1 instances yet, so someone is using them.
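To make the discount claim concrete, here's a quick sketch. The prices below are made up for illustration; real samples would come from EC2's DescribeSpotPriceHistory API.

```python
# Hypothetical illustration: given spot price samples (USD/hr) for an
# accelerator instance and a comparably sized CPU instance, compute the
# relative discount described above. Prices here are invented; real ones
# come from the EC2 DescribeSpotPriceHistory API.

def spot_discount(accel_prices, cpu_prices):
    """Return the accelerator's average discount vs. the CPU instance,
    as a fraction (0.15 == 15% cheaper)."""
    accel_avg = sum(accel_prices) / len(accel_prices)
    cpu_avg = sum(cpu_prices) / len(cpu_prices)
    return 1 - accel_avg / cpu_avg

# Made-up hourly spot samples for an inf2-class and a CPU instance.
inf2_samples = [0.26, 0.27, 0.25, 0.26]
cpu_samples = [0.31, 0.30, 0.32, 0.31]

print(f"{spot_discount(inf2_samples, cpu_samples):.0%}")  # prints 16%
```

If the accelerator's average spot price ever rises above the CPU baseline, the function goes negative, which is roughly what the trn1 observation above would look like in this framing.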

giancarlostoro 2 hours ago | parent | prev [-]

Not just AWS; looks like Anthropic uses it heavily as well. I assume they get plenty of handholding from Amazon though. I'm surprised any cloud provider does not invest drastically more into their SDK and tooling; nobody will use your cloud if they literally cannot.

cmiles8 2 hours ago | parent | next [-]

Well AWS says Anthropic uses it but Anthropic isn’t exactly jumping up and down telling everyone how awesome it is, which tells you everything you need to know.

If Anthropic walked out on stage today and said how amazing it was and how they're using it, the announcement would have a lot more weight. Instead… crickets from Anthropic in the keynote.

cobolcomesback an hour ago | parent | next [-]

AWS has built 20 data centers in Indiana full of half a million Trainium chips explicitly for Anthropic. Anthropic is using them heavily. The press announcement Anthropic made about Google TPUs is essentially the same one they made a year ago about Trainium. Hell, even in the Google TPU press release they explicitly mention that they are still using Trainium as well.

VirusNewbie 18 minutes ago | parent [-]

Can you link to the press releases? The only one I'm aware of from Anthropic says they will use Trainium for future LLMs, not that they are using them now.

teruakohatu 2 hours ago | parent | prev [-]

> Anthropic isn’t exactly jumping up and down telling everyone how awesome it is, which tells you everything you need to know.

You can’t really read into that. They are unlikely to let their competitors know if they have a slight performance/$ edge by going with AWS tech.

cmiles8 2 hours ago | parent | next [-]

With GCP announcing they built Gemini 3 on TPUs the opposite is true. Anthropic is under pressure to show they don’t need expensive GPUs. They’d be catching up at this point, not leaking some secret sauce. No reason for them to not boast on stage today unless there’s nothing to boast about.

0x457 42 minutes ago | parent [-]

Yes, but Google benefits from people using their TPUs, while Anthropic gains nothing unless AWS throws money at them for saying it.

bilbo0s 7 minutes ago | parent [-]

This.

Anthropic is not going to interrupt their competitors if their competitors don't want to use Trainium. Neither would you, I, nor anyone else. There's only downside in doing so for them, no upside at all.

From Anthropic's perspective, if the rest of us can't figure out how to make Trainium work? Good.

Amazon will fix the difficulty problem with time, but that's time Anthropic can use to press their advantages and entrench themselves in the market.

fishmicrowaver 2 hours ago | parent | prev [-]

Striking a deal with a competitor (Azure) does though.

IshKebab 19 minutes ago | parent | prev [-]

> I'm surprised any cloud provider does not invest drastically more into their SDK and tooling

I used to work for an AI startup. This is where Nvidia's moat is: the tens of thousands of man-hours that have gone into making the entire AI ecosystem work well with Nvidia hardware and not much else.

It's not that they haven't thought of this, it's just that they don't want to hire another 1k engineers to do it.

aaa_aaa 3 hours ago | parent | prev | next [-]

Interesting that in the article, they do not say what the chip actually does. Not even once.

Symmetry 2 hours ago | parent | next [-]

A bunch of 128x128 systolic arrays at its heart. More details:

https://newsletter.semianalysis.com/p/amazons-ai-self-suffic...
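For intuition on what those 128x128 arrays buy you: accelerators like this compute a matrix multiply one fixed-size tile at a time, accumulating partial products per tile. A toy pure-Python sketch (tile edge T=2 standing in for 128; not Trainium's actual dataflow, just the blocking idea):

```python
# Toy sketch of tiled matrix multiply: systolic-array hardware computes
# C = A @ B one fixed-size tile at a time, accumulating partial products.
# T = 2 stands in for a real array's 128.

T = 2  # tile edge

def tiled_matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    # Walk the output in TxT tiles; each (it, jt, kt) step is the kind
    # of tile-sized multiply-accumulate pass the hardware performs.
    for it in range(0, n, T):
        for jt in range(0, m, T):
            for kt in range(0, k, T):
                for i in range(it, min(it + T, n)):
                    for j in range(jt, min(jt + T, m)):
                        acc = 0.0
                        for kk in range(kt, min(kt + T, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] += acc
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(tiled_matmul(A, I))  # multiplying by identity returns A's values
```

The hardware win is that each TxT tile's multiply-accumulates happen in lockstep across the array rather than in nested loops.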

wmf 3 hours ago | parent | prev | next [-]

Training. It's in the name.

cobolcomesback an hour ago | parent [-]

Ironically these chips are being targeted at inference as well (the AWS CEO acknowledged the difficulties in naming things during the announcement).

wmf 19 minutes ago | parent [-]

The same thing happened to AMD and Gaudi. They couldn't get training to work so they pivoted to inference.

Kye 3 hours ago | parent | prev | next [-]

Vector math

egorfine 3 hours ago | parent | prev [-]

Probably because the only task this chip has to perform is to please shareholders, hence there is no need to explain anything to us peasant developers.

caminante 3 hours ago | parent [-]

Time to go squat on trainium4.com [0]

[0] https://www.godaddy.com/domainsearch/find?domainToCheck=trai...

landl0rd an hour ago | parent | prev | next [-]

Anyone considering using trainium should view this Completely Factual Infomercial: https://x.com/typedfemale/status/1945912359027114310

Pretty accurate in my experience, especially re: the Neuron SDK. Do not use.

mlmonkey an hour ago | parent | prev | next [-]

Not a single mention of any benchmarks or performance.

pedalpete 24 minutes ago | parent [-]

They say 4x more, but not 4x faster; 4x more memory, but 4x more memory than what!?

nimbius 3 hours ago | parent | prev | next [-]

the real news is: "and teases an Nvidia-friendly roadmap"

The sole reason Amazon is throwing any money at this is that they think they can do to AI what they did with logistics and shipping, in an effort to slash costs heading into a recession (we can't fire anyone else). The hubris is monumental, to say the least.

But the total confidence is very low... so "Nvidia friendly" is face-saving to ensure no bridges they currently cross for AWS profit get burned.

jauntywundrkind 2 hours ago | parent | prev | next [-]

Amazon aside, there's an interesting future here with NVLink getting more and more folks using it. Intel is also on board with NVLink. This is like a PCI -> AGP moment, but it's Nvidia's AGP.

AMD felt like they were so close to nabbing the accelerator future back in the HyperTransport days. But the recent version, Infinity Fabric, is all internal.

There's Ultra Accelerator Link (UALink) getting some steam. Hypothetically CXL should be good for uses like this: it uses the PCIe PHY but is lower latency and lighter weight, close to RAM latency, not bad! But that's still mere PCIe speed, not nearly enough, with PCIe 6.0 just barely emerging now. Ideally, IMO, we'd also see more chips come with integrated networking: it was so amazing when Intel Xeons had 100Gb Omni-Path for barely any price bump. UltraEthernet feels like it should be on core, gratis.
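Back-of-envelope on the gap, using public figures (PCIe 6.0 at 64 GT/s per lane, and Nvidia's published 900 GB/s aggregate NVLink figure for H100; raw numbers, ignoring protocol overhead on both sides):

```python
# Rough bandwidth comparison. PCIe 6.0 runs 64 GT/s per lane, roughly
# one bit per transfer, so an x16 link moves about 128 GB/s per
# direction before overhead. Nvidia's published H100 NVLink figure is
# 900 GB/s aggregate; not a like-for-like directional comparison, but
# it shows the order-of-magnitude gap.

GT_PER_LANE = 64e9      # PCIe 6.0 transfers/sec per lane
LANES = 16
pcie6_x16_GBps = GT_PER_LANE * LANES / 8 / 1e9  # bits -> bytes

nvlink_h100_GBps = 900  # Nvidia's published aggregate figure

print(f"PCIe 6.0 x16: {pcie6_x16_GBps:.0f} GB/s per direction")
print(f"NVLink (H100): {nvlink_h100_GBps} GB/s aggregate, "
      f"~{nvlink_h100_GBps / pcie6_x16_GBps:.0f}x the PCIe link")
```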

wmf 2 hours ago | parent [-]

NVLink Fusion sounds like a total trap where you pay to become Jensen's slave. It may make sense for Intel because they're desperate. It's not a good look for AWS to put themselves in the same category.

> UltraEthernet feels like it should be on core, gratis.

I've been saying for a while that AMD should put a SolarFlare NIC in their I/O die. They already have switchable PCIe/SATA ports, why not switchable PCIe/Ethernet? UEC might be too niche though.

ChrisArchitect 2 hours ago | parent | prev [-]

Amazon post: https://www.aboutamazon.com/news/aws/trainium-3-ultraserver-...