jiggawatts 3 months ago

Speaking of Kubernetes performance: I had a need for fast scale-out for a bulk testing exercise. The gist of it was that I had to run Selenium tests with six different browsers against something like 13,000 sites in a hurry for a state government. I tried Kubernetes, because there's a distributed Selenium runner for it that can spin up different browsers in individual pods, even running Windows and Linux at the same time! Very cool.

Except...

Performance was woeful. It took forever to spin up the pods, and even once things had warmed up, everything ran in slow motion. Data was slow to collect (single-digit kilobits per second!), and I even saw a few timeout failures within the cluster.

I gave up and simply provisioned a 120 vCPU / 600 GB memory cloud server with spot pricing for $2/hour and ran everything locally with scripts. I ended up scanning a decent chunk of my country's internet in 15 minutes. I was genuinely worried that I'd get put on some sort of "list" for "attacking" government sites. I even randomized the read order to avoid hammering any one site too hard.

Kubernetes sounds "fun to tinker with", but it's actually a productivity vampire that sucks up engineer time.

It's the Factorio of cloud hosting.

pdimitar 3 months ago

> I gave up and simply provisioned a 120 vCPU / 600 GB memory cloud server with spot pricing for $2/hour and ran everything locally with scripts. I ended up scanning a decent chunk of my country's internet in 15 minutes.

Now that is a blog post that I would read with interest, top to bottom.

jiggawatts 3 months ago

It was the “boring” solution, so I don’t know what I could write on the topic!

Both Azure and AWS have spot-priced VMs that are “low priority” and hence can be interrupted by customers with normal priority VM allocation requests. These have an 80% discount in exchange for the occasional unplanned outage.

In Azure there is an option where the spot price dynamically adjusts based on demand and your VM basically never turns off.

The trick is that obscure SKUs have low demand and hence low spot prices and low chance of being taken away. I use the HPC optimised sizes because they’re crazy fast and weirdly cheap.

E.g.: right now I’m using one of these to experiment with reindexing a 1 TB database. With 120 cores (no hyperthreading!) this goes fast enough that I can have a decent “inner loop” development experience. The other trick is that even Windows and SQL Server are free if this is done in an Azure Dev/Test subscription. With free software and $2/hr hardware costs it’s a no-brainer!

pdimitar 3 months ago

Well, I mostly meant how you supply the server resources and how you crawl so much of the net so quickly. :)

I've thought about it many times but never done it at that scale; plus, I was never paid to do so and really didn't want my static IP banned. So if you ever write that up and publish it on HN, you'd find a very enthusiastic audience in me.

jiggawatts 3 months ago

That was pretty boring too! The "script" was just a few hundred lines of C# code triggering Selenium via its SDK. The requirement was simply to load a set of URLs with two different browsers: an "old" one and a "new" one that included a (potentially) breaking change to cookie handling that the customer needed to check for across all sites. I didn't need to fully crawl the sites; I just had to load the main page of each distinct "web app" twice, but I had to process JavaScript and handle cookies.
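To give a flavour (a rough sketch from memory, not the real code): pointing the Selenium .NET bindings at a particular browser build is basically just a ChromeOptions with BinaryLocation set, something like:

    // Sketch only: one driver per browser build. The headless flag and the
    // logging preference are spelled the way recent Selenium/Chromium versions
    // expect; adjust for whatever versions you're on.
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;

    static IWebDriver MakeDriver(string chromiumBinaryPath)
    {
        var options = new ChromeOptions
        {
            BinaryLocation = chromiumBinaryPath            // the "old" or "new" build
        };
        options.AddArgument("--headless=new");             // no visible window
        options.SetLoggingPreference(LogType.Browser, LogLevel.All);  // keep the JS console output
        return new ChromeDriver(options);
    }

The only real gotcha is that the chromedriver binary has to match the major version of the Chromium build it's driving.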

I did this in two phases:

Phase #1 was to collect "top-level" URLs, which I did via Certificate Transparency (CT). There are online databases that can return all valid certs for domains with a given suffix. I used about a dozen known suffixes for the state government, which resulted in about 11K hits from the CT database. I dumped these into a SQL table as the starting point. I also added in distinct domains from load balancer configs provided by the customer. This provided another few thousand sites that are child domains under a wildcard record and hence not easily discoverable via CT. All of this was semi-manual and done mostly with PowerShell scripts and Excel.
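The CT lookup itself is only a handful of lines. A sketch, with crt.sh standing in for "a CT search database" (purely for illustration, not claiming it's exactly what I used) and a made-up suffix:

    // Sketch only: crt.sh is one example of a CT search service with a JSON
    // endpoint, and example.gov.au is a made-up suffix. Big queries can be
    // slow, so real code probably wants retries and caching.
    using System;
    using System.Collections.Generic;
    using System.Net.Http;
    using System.Text.Json;

    using var http = new HttpClient();
    var domains = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // %25 is a URL-encoded '%', i.e. "anything under this suffix".
    var json = await http.GetStringAsync("https://crt.sh/?q=%25.example.gov.au&output=json");
    using var doc = JsonDocument.Parse(json);
    foreach (var cert in doc.RootElement.EnumerateArray())
    {
        // name_value lists the certificate's SAN entries, newline-separated.
        foreach (var name in cert.GetProperty("name_value").GetString()!.Split('\n'))
            domains.Add(name.Trim().TrimStart('*', '.'));  // fold "*.foo" into "foo"
    }
    // From here the distinct domains get bulk-inserted into the SQL table
    // that phase #2 reads.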

Phase #2 was the fun bit. I installed two bespoke builds of Chromium side-by-side on the 120-core box, pointed Selenium at both, and had them trawl through the list of URLs in headless mode. Everything was logged to a SQL database. The final output was any difference between the two Chromium builds. E.g.: JS console log entries that are different, cookies that are not the same, etc...
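The inner loop was roughly this (a sketch again: it reuses the MakeDriver helper from the sketch above, the binary paths are made up, and the SQL write is elided):

    // Sketch of the phase #2 loop. MakeDriver is the helper from the earlier
    // sketch; the real version had timeouts and retries around every step.
    using System.Collections.Generic;
    using System.Linq;
    using OpenQA.Selenium;

    var urls = new List<string>();   // loaded from the SQL table built in phase #1

    using var oldDriver = MakeDriver(@"C:\builds\chromium-old\chrome.exe");   // made-up path
    using var newDriver = MakeDriver(@"C:\builds\chromium-new\chrome.exe");   // made-up path

    foreach (var url in urls)
    {
        var before = Visit(oldDriver, url);
        var after  = Visit(newDriver, url);

        // Anything present in one run but not the other is a finding.
        var cookieDiff  = before.Cookies.Except(after.Cookies)
                          .Concat(after.Cookies.Except(before.Cookies)).ToList();
        var consoleDiff = before.ConsoleLines.Except(after.ConsoleLines)
                          .Concat(after.ConsoleLines.Except(before.ConsoleLines)).ToList();

        // ...INSERT url + diffs into the results table here...
    }

    static (HashSet<string> Cookies, HashSet<string> ConsoleLines) Visit(IWebDriver driver, string url)
    {
        driver.Navigate().GoToUrl(url);   // full headless page load, JS and all

        // Flatten the cookie jar into comparable strings; domain/name/secure is
        // the interesting bit for a PSL change, add more attributes as needed.
        var cookies = driver.Manage().Cookies.AllCookies
            .Select(c => $"{c.Domain}|{c.Name}|{c.Secure}")
            .ToHashSet();

        // Browser console entries (JS errors, warnings, etc.).
        var consoleLines = driver.Manage().Logs.GetLog(LogType.Browser)
            .Select(e => e.Message)
            .ToHashSet();

        return (cookies, consoleLines);
    }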

All of this was related to a proposed change to the Public Suffix List (PSL), which has a bunch of effects on DNS domain handling, cookies, CORS, DMARC, and various other things. Because the PSL is baked into the browser executables, the only way to test a proposed change ahead of time is to produce your own custom-built browser and test with that to see what would happen. In a sense, there's no "non-production Internet", so these lab tests are the only way.

Actually, the most compute-intensive part was producing the custom Chromium builds! Those took about an hour each on the same huge server.

By far the most challenging aspect was... the icon. I needed to hand over the custom builds to web devs so that they could double-check the sites they were responsible for, and the builds were also needed for internal-only web app testing. The hiccup was that the two builds looked the same and ended up with overlapping Windows taskbar icons! Making them "different enough" that they didn't share profiles and had distinct taskbar icons was weirdly difficult, especially the icon.

It was a fun project, but the most hilarious part was that it was considered such a large-scale undertaking that they farmed out the major groups of domains to several consultancies to split up the work. I just scanned everything, because that was literally simpler. They kept telling me I had "exceeded the scope", and for the life of me I couldn't explain to them that treating all domains uniformly is less work than trying to determine which ones belong to which agency.

pdimitar 3 months ago

EXTREMELY nice. Wish I was paid to do that. :/

jiggawatts 3 months ago

So do I! :(

I only get a "fun" project like this once every year or two.

Selling this kind of thing is basically impossible. You can't convince anyone that you have an ability that they don't even understand, at some fundamental level.

At best, you can use your full set of skills opportunistically, but that's only possible on unusual projects. Deploying a single VM for some boring app is always going to be a trivial project that anyone can do.

With this project, even after it was delivered, the customer didn't really understand what I did or what they got out of it. I really did try to explain, but it's just beyond the understanding of executives without a technical background who think only in terms of procurement paperwork and scopes of work.