sitkack 5 hours ago
It would be nice to have a lot more detail. The WTF sections are the best part. Sounds like your gear needs a "this side towards enemy" sign and/or the right affordances so it only goes in one way.

Did you standardize on layout at the rack level? What poka-yoke processes did you put into place to prevent mistakes? What does your metal->boot stack look like?

Having worked for two different cloud providers and built my own internal clouds with PXE-booted hosts, I too find this stuff fascinating.

Also, take utmost advantage of a new DC while you are bringing it up: try out all the failure scenarios you can think of, and the ones you can't, through randomized fault injection.
ca508 5 hours ago
> It would be nice to have a lot more detail

I'm going to save this for when I'm asked to cut the three paras on power circuit types.

Re: standardising layout at the rack level: we do now! We only figured this out after site #2. It makes everything so much easier to verify. And yeah, validation is hard. We're doing it manually so far; I want to play around with scraping LLDP data, but our switch software stack has a bug :/. It's an evolving process: the more we work with different contractors, the more edge cases we unearth and account for. The biggest improvement is that we have built an internal DCIM that templates a rack design and exports an interactive "cabling explorer" for the site techs, including detailed annotated diagrams of equipment showing port names, etc. The screenshot of the elevation is a screenshot of part of that tool.

> What does your metal->boot stack look like?

We've hacked together something on top of https://github.com/danderson/netboot/tree/main/pixiecore that serves a Debian netboot + preseed file. We have some custom Temporal workers that connect to the Redfish APIs on the BMCs to puppeteer the contraption. Then a custom host agent provisions QEMU VMs and advertises assigned IPs via BGP (using FRR) from the host.

Re: new DCs for failure scenarios, yeah, we've already blown breakers etc. testing stuff (that's how we figured out our phase balancing was off). Went in with a thermal camera on another. A site in AMS is coming up next week, and the goal for that is to see how far we can push a fully loaded switch fabric.
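For anyone curious what the LLDP idea could look like once the data is scrapeable, here is a rough sketch, not the actual tooling described above. It assumes the rack template and the scraped LLDP neighbor table can both be reduced to a list of links; the `Link` type, the `diff_cabling` function, and the device and port names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One cable, seen from the local end. All field names are hypothetical."""
    local_device: str
    local_port: str
    remote_device: str
    remote_port: str


def diff_cabling(expected, observed):
    """Compare templated links against links observed via LLDP.

    Both arguments are iterables of Link. Links are keyed by their local
    end (device, port), so a cable plugged into the wrong far end shows up
    as "miswired" rather than as one missing plus one unexpected link.
    """
    exp = {(l.local_device, l.local_port): (l.remote_device, l.remote_port)
           for l in expected}
    obs = {(l.local_device, l.local_port): (l.remote_device, l.remote_port)
           for l in observed}

    # Ports the template says should be cabled but LLDP saw nothing on.
    missing = sorted(k for k in exp if k not in obs)
    # Ports with a neighbor that the template does not mention at all.
    unexpected = sorted(k for k in obs if k not in exp)
    # Ports cabled on both sides, but to a different far end than planned.
    miswired = sorted((k, exp[k], obs[k])
                      for k in exp if k in obs and exp[k] != obs[k])
    return {"missing": missing, "unexpected": unexpected, "miswired": miswired}


if __name__ == "__main__":
    # Made-up example data: two host cables swapped, one port uncabled,
    # one stray cable on a port the template leaves empty.
    expected = [
        Link("tor1", "eth1", "host1", "nic0"),
        Link("tor1", "eth2", "host2", "nic0"),
        Link("tor1", "eth3", "host3", "nic0"),
    ]
    observed = [
        Link("tor1", "eth1", "host2", "nic0"),
        Link("tor1", "eth2", "host1", "nic0"),
        Link("tor1", "eth4", "host9", "nic0"),
    ]
    for kind, items in diff_cabling(expected, observed).items():
        print(kind, items)
```

Keying on the local port keeps the report aligned with what a site tech can act on: each entry names one physical port to go look at.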