btown 10 hours ago

> we pushed a change via our policy automation platform to remove the BGP announcements from Miami

Is there any way to test these changes against a simulation of real world routes? Including to ensure that traffic that shouldn’t hit Cloudflare servers, continues to resolve routes that don’t hit Cloudflare?

I have to imagine there’s academic research on how to simulate a fork of global BGP state, no? Surely there’s a tensor representation of the BGP graph that can be simulated on GPU clusters?

If there’s a meta-rule I think of when these incidents occur, it’s that configuration rules need change management, and change management is only as good as the level of automated testing. Just because code hasn’t changed doesn’t mean you shouldn’t test the baseline system behavior. And here, that means testing that the Internet works.

PunchyHamster 7 hours ago | parent | next [-]

> Is there any way to test these changes against a simulation of real world routes? Including to ensure that traffic that shouldn’t hit Cloudflare servers, continues to resolve routes that don’t hit Cloudflare?

You can get access to views of routes from different parts of the network, but you do not have access to those routers' policies, so no.

> I have to imagine there’s academic research on how to simulate a fork of global BGP state, no? Surely there’s a tensor representation of the BGP graph that can be simulated on GPU clusters?

Just simulating your peers, and maybe the layer after them, is most likely good enough. And you can probably do it with a bunch of cgroups and some actual routing software. There are also network sims like GNS3 that can even just run router images.

toast0 6 hours ago | parent | prev | next [-]

I don't know why you would need a tensor whatever. Dump the state of the router (which peers are connected and for how long, what routes they are advertising and for how long) as well as the computed routing table and what routes are advertised to peers.

Set a simulation router to have the same state but a new config, and compute the routing table and what routes would be advertised to peers.

Confirm the diff in routing table and advertised routes is reasonable.
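
A minimal sketch of that diff step, assuming both snapshots have already been dumped and parsed into simple route records. The `Route` fields and the `diff_routes` and `is_reasonable` helpers are hypothetical names for illustration, not any router's API, and "reasonable" here is just "only the withdrawals we expected":

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    # Hypothetical record for one advertised route.
    prefix: str       # e.g. "192.0.2.0/24"
    next_hop: str
    as_path: tuple    # AS numbers, e.g. (64500, 64501)


def diff_routes(before, after):
    """Compare two route snapshots keyed by prefix."""
    old = {r.prefix: r for r in before}
    new = {r.prefix: r for r in after}
    return {
        "withdrawn": sorted(p for p in old if p not in new),
        "added": sorted(p for p in new if p not in old),
        "changed": sorted(p for p in old if p in new and old[p] != new[p]),
    }


def is_reasonable(diff, expected_withdrawn, max_unexpected=0):
    """Accept the diff only if nothing beyond the expected withdrawals moved."""
    unexpected = [p for p in diff["withdrawn"] if p not in expected_withdrawn]
    unexpected += diff["added"] + diff["changed"]
    return len(unexpected) <= max_unexpected


# Usage: the new config is expected to withdraw exactly one prefix.
before = [
    Route("192.0.2.0/24", "10.0.0.1", (64500,)),
    Route("198.51.100.0/24", "10.0.0.1", (64500,)),
]
after = [Route("198.51.100.0/24", "10.0.0.1", (64500,))]
diff = diff_routes(before, after)
ok = is_reasonable(diff, expected_withdrawn={"192.0.2.0/24"})
```

The same comparison would be run twice per router: once on the computed routing table and once on the routes advertised to peers.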

This change seemed to mostly be about a single location. Other BGP config changes leading to problems are often global changes, but you can check diffs and apply the config change one host at a time. You can't really make a simultaneous change anyway. Maybe one host changing is ok, but the Nth one causes a problem... CF has a lot of BGP routers, so maybe checking every diff is too much, but at least check a few.
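
The one-host-at-a-time idea could look roughly like the sketch below. Everything here is an assumption for illustration: `simulate_diff`, `diff_ok`, and `apply_change` are hypothetical callables you would supply, and nothing in it talks to a real router.

```python
def staged_rollout(hosts, simulate_diff, diff_ok, apply_change):
    """Apply a config change host by host, stopping at the first bad diff.

    Returns (applied_hosts, failed_host); failed_host is None on success.
    """
    applied = []
    for host in hosts:
        # Re-simulate against the current state, since earlier hosts
        # have already changed what this host sees.
        diff = simulate_diff(host)
        if not diff_ok(host, diff):
            return applied, host
        apply_change(host)
        applied.append(host)
    return applied, None


# Usage with stand-in callables: the third host produces an unexpected diff.
fake_diffs = {"mia01": ["192.0.2.0/24"], "mia02": ["192.0.2.0/24"], "atl01": ["203.0.113.0/24"]}
changed = []
applied, failed = staged_rollout(
    ["mia01", "mia02", "atl01"],
    simulate_diff=lambda h: fake_diffs[h],
    diff_ok=lambda h, d: d == ["192.0.2.0/24"],
    apply_change=changed.append,
)
```

Re-running the simulation before each host is the point: it is what catches the case where the Nth change is the one that causes a problem.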

Is that something out of the box on routers? I don't know, people with BGP routers never let me play with them. But given the BGP haiku, I'd want something like that before I messed around with things. For the price you pay for these fancy routers, you should be able to buy an extra few to run sandboxed config testing on. You could also simulate with open source BGP software, but the proprietary BGP daemon on the router might not act like the open source one does.

hnuser123456 8 hours ago | parent | prev | next [-]

You can cross-reference RADB, the RIRs, and looking glass servers, and you'd find 3 different pictures of the internet.

Analemma_ 9 hours ago | parent | prev [-]

I assume it's not possible unless you know the in-memory state of all the other gateway routers on the internet, no? You can know what they advertise, but that's not the same thing as a full description of their internal state and how they will choose to update if a route gets withdrawn.

erredois 7 hours ago | parent [-]

I think you could know the state of the peers, simulate what they advertise and receive, and validate that. The test unit would need to be a simulated router that behaves exactly like the real one. I actually think it's technically doable with tight version control for routers.