iqfareez 6 days ago
Wild that one tenant’s cache-hit traffic could tip over Cloudflare’s interconnect capacity.
immibis 6 days ago
You'd be surprised how low the capacity of a lot of internet links is. 10Gbps is common on smaller networks - let me rephrase that: a small to medium ISP might only have 10Gbps to each of most of its peering partners. Normally traffic is distributed, going to different places and coming from different places, so each link is only partially utilized. But unusual patterns can fill up one specific link.

10Gbps is old technology now, and any real ISP can probably afford 40 or 100 - for hundreds of dollars per link. But they're going to deploy that on their most utilized links first, and only if their peering partner can also afford it and exchanges enough traffic to justify it. So the smallest connections are typically going to be 10. (Lower than 10 is too small to justify a point-to-point peering at all.) If you have 10Gbps fiber at home, you could congest one of these links all by yourself.

Now this is Cloudflare talking to us-east-1, so they should have shitloads of capacity there, probably at least 8x100Gbps or more. But considering that AWS is the kind of environment where you can spin up 800 servers for a few hours to perform a massively parallel task, it's not surprising that someone eventually created 800Gbps of traffic to the same place, or however much capacity they actually have. Actually it's surprising it doesn't happen more often. Perhaps that's because AWS charges an arm and a leg for data transfer - 800Gbps is $5-$9 per second.
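To sanity-check that last figure, a quick back-of-envelope, assuming roughly $0.05-$0.09 per GB for AWS egress (illustrative ballpark rates, not a quote):

```python
# Back-of-envelope: what 800 Gbps of sustained egress costs per second
# at assumed AWS data-transfer rates of ~$0.05-$0.09 per GB.
gbps = 800
gigabytes_per_second = gbps / 8  # 800 Gbps ~= 100 GB/s

for price_per_gb in (0.05, 0.09):
    cost_per_second = gigabytes_per_second * price_per_gb
    print(f"${price_per_gb:.2f}/GB -> ${cost_per_second:.2f} per second")

# Output: ~$5.00/s at the low end, ~$9.00/s at the high end.
```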
themafia 3 days ago
That's what started the incident. It was prolonged by the fact that Cloudflare didn't react correctly to withdrawn BGP routes to a major peer, that the secondary routes had reduced capacity due to unaddressed problems, and that basic nuisance rate limiting had to be done manually.

It seems like they just build huge peering pipes and basically hope for the best. They've maybe gotten so used to this working that they'll let degraded "secondary" links persist for much longer than they should. It's the typical "Swiss Cheese" style of failure.
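To make the "degraded secondary links" point concrete, here is a rough sketch of the kind of failover headroom check that could flag the problem before routes are withdrawn; the link names, capacities, and topology are all made up for illustration, not taken from Cloudflare's network:

```python
# Hypothetical headroom check: flag any secondary (failover) link that could
# not absorb its primary's current traffic if the primary's routes disappeared.
links = {
    # name: (capacity_gbps, current_utilization_gbps, link_it_backs_up)
    "primary-dci-1":   (800, 620, None),
    "secondary-dci-1": (400, 150, "primary-dci-1"),  # degraded: too small to cover failover
}

def failover_headroom_alerts(links):
    alerts = []
    for name, (capacity, used, backs_up) in links.items():
        if backs_up is None:
            continue  # not a secondary link
        _, primary_used, _ = links[backs_up]
        # Traffic this link would carry if the primary were withdrawn.
        projected = used + primary_used
        if projected > capacity:
            alerts.append(
                f"{name}: projected {projected} Gbps after {backs_up} failover "
                f"exceeds {capacity} Gbps capacity"
            )
    return alerts

for alert in failover_headroom_alerts(links):
    print(alert)
```

Running a check like this continuously, instead of discovering the shortfall during the failover itself, is the difference between a ticket and an outage.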