jmb99 3 months ago

I have pretty much the exact same opinion, right down to “cloud shit” (I have even used that exact term at work to multiple levels of management, specifically “I’m not working on cloud shit; you know what I like working on and I’ll do pretty much anything else, but no cloud shit”). I have worked on too many projects where all you need is a 4-8 vCPU VM running nginx and some external database, but for some reason there are like 30 containers, 45 different repos, and 10k lines of terraform, and the fastest requests take 750ms for what should be nginx->SQL->return in <10ms. “But it’s autoscaling and load balanced!” That’s great; we have at most 10k customers online at once, and the most complex requests are “set this field to true” and “get me an indexed list.” This could be hosted on a raspberry pi with better performance.

But for some reason this is what people want to do. They would rather spend hours debugging kubernetes, terraform, and docker, and spend 5 digits on cloud every month, to serve what could literally be proxied, authenticated DB lookups. We have “hack days” a few times a year, and I’m genuinely debating rewriting the entire “cloud” portion of our current product in gunicorn or something, hosting it on a $50/month vps, pointing it at a mirror of our prod db, and seeing how many orders of magnitude of latency I can knock off in a day.
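
To be concrete, the whole “backend” I have in mind is roughly the shape of the sketch below: a toy Flask app meant to run under gunicorn behind nginx. The table, auth check, and connection string are made up, not anything from our actual product.

    from flask import Flask, abort, jsonify, request
    from psycopg_pool import ConnectionPool

    app = Flask(__name__)
    # Hypothetical DSN pointing at a mirror of the prod DB.
    pool = ConnectionPool("postgresql://app@localhost/prod_mirror", min_size=4)

    def require_token() -> None:
        # Stand-in for whatever auth the real service would use.
        if request.headers.get("Authorization") != "Bearer dev-token":
            abort(401)

    @app.get("/items")
    def list_items():
        # "Get me an indexed list": a single indexed SELECT.
        require_token()
        with pool.connection() as conn:
            rows = conn.execute(
                "SELECT id, name FROM items WHERE owner_id = %s ORDER BY name LIMIT 500",
                (request.args.get("owner_id"),),
            ).fetchall()
        return jsonify([{"id": r[0], "name": r[1]} for r in rows])

    @app.post("/items/<int:item_id>/flag")
    def set_flag(item_id: int):
        # "Set this field to true": a single UPDATE.
        require_token()
        with pool.connection() as conn:
            conn.execute("UPDATE items SET flagged = true WHERE id = %s", (item_id,))
        return "", 204

    # gunicorn -w 8 -b 127.0.0.1:8000 app:app, with nginx proxying to it on the same box.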

I’ve only managed to convert one “cloud person” to my viewpoint, but it was quite entertaining. I was demoing a side project[0] that involved pulling data from ~6 different sources (none hosted by me), concatenating them, deduping them, doing some math, looking up an image for each displayed item in a different source, and then displaying the final items with images in a list or a grid. ~5k items. Load time on my fibre connection was 200-250ms; sorting/filtering was <100ms. As I was demoing this, a few people asked about the architecture, and one didn’t believe that it was a 750 line python file (using multiprocessing, admittedly) hosted on an 8 core VPS until I literally showed him. He didn’t believe it was possible to have this kind of performance in a “legacy monolithic” (his words) application.
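
From memory, the whole thing was roughly this shape (a sketch only; the source URLs, field names, and dedupe key are placeholders, not the real code):

    import multiprocessing as mp

    import requests

    # Placeholder feed URLs; the real project pulled from ~6 external sources.
    FEEDS = [f"https://source{i}.example/api/items" for i in range(6)]

    def fetch(url: str) -> list[dict]:
        return requests.get(url, timeout=10).json()

    def lookup_image(item: dict) -> dict:
        # One lookup per item against a separate (placeholder) image source.
        item["image"] = f"https://images.example/{item['sku']}.jpg"
        return item

    def build_catalogue() -> list[dict]:
        with mp.Pool(processes=8) as pool:
            feeds = pool.map(fetch, FEEDS)                 # pull all sources in parallel
            merged = [item for feed in feeds for item in feed]
            deduped = {item["sku"]: item for item in merged}.values()  # last one wins
            return pool.map(lookup_image, list(deduped))

    if __name__ == "__main__":
        items = build_catalogue()
        print(len(items), "items ready to render")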

I think it’s so heavily ingrained in most cloud/web developers that this is the only option that they will not even entertain the thought that it can be done another way.

[0] This particular project failed for other reasons, and is no longer live.

jiggawatts 3 months ago | parent | next [-]

Speaking of Kubernetes performance: I had a need for fast scale-out for a bulk testing exercise. The gist of it was that I had to run Selenium tests with six different browsers against something like 13,000 sites in a hurry for a state government. I tried Kubernetes, because there's a distributed Selenium runner for it that can spin up different browsers in individual pods, even running Windows and Linux at the same time! Very cool.

Except...

Performance was woeful. It took forever to spin up the pods, but even once things had warmed up everything just ran in slow motion. Data was slow to collect (single-digit kilobits!), and I even saw a few timeout failures within the cluster.

I gave up and simply provisioned a 120 vCPU / 600 GB memory cloud server with spot pricing for $2/hour and ran everything locally with scripts. I ended up scanning a decent chunk of my country's internet in 15 minutes. I was genuinely worried that I'd get put on some sort of "list" for "attacking" government sites. I even randomized the read order to avoid hammering any one site too hard.
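
The “scripts” were nothing exotic; the pattern was roughly the sketch below (Python here for brevity, with a stand-in check function; the real check drove Selenium browsers, and the worker count and timeout are illustrative):

    import random
    from concurrent.futures import ProcessPoolExecutor, as_completed

    import requests

    def check_site(url: str) -> dict:
        # Stand-in: the real version loaded each page in several browsers via Selenium.
        resp = requests.get(url, timeout=15)
        return {"url": url, "status": resp.status_code}

    def scan(urls: list[str], workers: int = 120) -> list[dict]:
        random.shuffle(urls)  # spread the load so no single site gets hit in a burst
        results = []
        with ProcessPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(check_site, u): u for u in urls}
            for fut in as_completed(futures):
                try:
                    results.append(fut.result())
                except Exception as exc:
                    results.append({"url": futures[fut], "error": repr(exc)})
        return results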

Kubernetes sounds "fun to tinker with", but it's actually a productivity vampire that sucks up engineer time.

It's the Factorio of cloud hosting.

pdimitar 3 months ago | parent [-]

> I gave up and simply provisioned a 120 vCPU / 600 GB memory cloud server with spot pricing for $2/hour and ran everything locally with scripts. I ended up scanning a decent chunk of my country's internet in 15 minutes.

Now that is a blog post that I would read with interest, top to bottom.

jiggawatts 3 months ago | parent [-]

It was the “boring” solution, so I don’t know what I could write on the topic!

Both Azure and AWS have spot-priced VMs that are “low priority” and hence can be interrupted by customers with normal priority VM allocation requests. These have an 80% discount in exchange for the occasional unplanned outage.

In Azure there is an option where the spot price dynamically adjusts based on demand and your VM basically never turns off.

The trick is that obscure SKUs have low demand and hence low spot prices and low chance of being taken away. I use the HPC optimised sizes because they’re crazy fast and weirdly cheap.

E.g.: right now I’m using one of these to experiment with reindexing a 1 TB database. With 120 cores (no hyperthreading!) this goes fast enough that I can have a decent “inner loop” development experience. The other trick is that even Windows and SQL Server are free if this is done in an Azure Dev/Test subscription. With free software and $2/hr hardware costs it’s a no-brainer!
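
Provisioning one of these from code is only a handful of lines. Below is a rough sketch with the Azure Python SDK (azure-identity + azure-mgmt-compute); the resource group, NIC, SKU, and image are placeholders (and I actually run Windows + SQL Server rather than the Ubuntu image shown), so treat it as a starting point rather than a recipe.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder
    compute = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    vm = {
        "location": "australiaeast",
        # Spot-specific bits: deallocate on eviction; max_price of -1 means
        # "pay up to the regular rate", so the VM is never evicted on price alone.
        "priority": "Spot",
        "eviction_policy": "Deallocate",
        "billing_profile": {"max_price": -1},
        # An HPC size: lots of real cores, no hyperthreading, usually low spot demand.
        "hardware_profile": {"vm_size": "Standard_HB120rs_v3"},
        "storage_profile": {
            "image_reference": {
                "publisher": "Canonical",
                "offer": "0001-com-ubuntu-server-jammy",
                "sku": "22_04-lts-gen2",
                "version": "latest",
            }
        },
        "os_profile": {
            "computer_name": "spot-box",
            "admin_username": "azureuser",
            "linux_configuration": {
                "disable_password_authentication": True,
                "ssh": {"public_keys": [{
                    "path": "/home/azureuser/.ssh/authorized_keys",
                    "key_data": "ssh-rsa AAAA...",  # placeholder public key
                }]},
            },
        },
        # Assumes a NIC has already been created in the same resource group.
        "network_profile": {"network_interfaces": [{"id": "<existing-nic-resource-id>"}]},
    }

    compute.virtual_machines.begin_create_or_update("my-rg", "spot-box", vm).result()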

pdimitar 3 months ago | parent [-]

Well, I mostly meant: how do you provision the server resources, and how do you crawl so much of the net so quickly? :)

I thought about it many times but never did it on that scale, plus I was never paid to do so and really didn't want my static IP banned. So if you ever write about it and publish it on HN, you'd find a very enthusiastic audience in me.

jiggawatts 3 months ago | parent [-]

That was pretty boring too! The "script" was just a few hundred lines of C# code triggering Selenium via its SDK. The requirement was simply to load a set of URLs with two different browsers: an "old" one and a "new" one that included a (potentially) breaking change to cookie handling that the customer needed to check for across all sites. I didn't need to fully crawl the sites; I just had to load the main page of each distinct "web app" twice, but I had to process JavaScript and handle cookies.

I did this in two phases:

Phase #1 was to collect "top-level" URLs, which I did via Certificate Transparency (CT). There are online databases that can return all valid certs for domains with a given suffix. I used about a dozen known suffixes for the state government, which resulted in about 11K hits from the CT database. I dumped these into a SQL table as the starting point. I also added in distinct domains from load balancer configs provided by the customer. This provided another few thousand sites that are child domains under a wildcard record and hence not easily discoverable via CT. All of this was semi-manual and done mostly with PowerShell scripts and Excel.
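
For anyone curious, the CT query itself is trivial. Here is a rough Python sketch using crt.sh's JSON endpoint as one example of such a database (the suffix is a placeholder; the real work used about a dozen of them plus the load balancer dumps):

    import requests

    def ct_domains(suffix: str) -> set[str]:
        # "%.suffix" matches any certificate name ending with the suffix.
        resp = requests.get(
            "https://crt.sh/",
            params={"q": f"%.{suffix}", "output": "json"},
            timeout=120,
        )
        resp.raise_for_status()
        names = set()
        for row in resp.json():
            # name_value can contain several SANs separated by newlines.
            for name in row["name_value"].split("\n"):
                names.add(name.lstrip("*.").lower())
        return names

    domains = ct_domains("stategov.example")  # placeholder suffix
    print(len(domains), "candidate sites")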

Phase #2 was the fun bit. I installed two bespoke builds of Chromium side-by-side on the 120-core box, pointed Selenium at both, and had them trawl through the list of URLs in headless mode. Everything was logged to a SQL database. The final output was any difference between the two Chromium builds. E.g.: JS console log entries that are different, cookies that are not the same, etc...
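
Sketched in Python rather than the original C# (binary paths are placeholders, and the real version wrote to SQL instead of returning dicts), the per-URL check was essentially:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    def snapshot(url: str, chromium_binary: str, chromedriver_path: str) -> dict:
        # Load the URL in one Chromium build and capture cookies plus console output.
        opts = Options()
        opts.binary_location = chromium_binary
        opts.add_argument("--headless=new")
        opts.set_capability("goog:loggingPrefs", {"browser": "ALL"})
        driver = webdriver.Chrome(service=Service(chromedriver_path), options=opts)
        try:
            driver.set_page_load_timeout(30)
            driver.get(url)
            return {
                "cookies": {c["name"]: c.get("domain") for c in driver.get_cookies()},
                "console": [entry["message"] for entry in driver.get_log("browser")],
            }
        finally:
            driver.quit()

    def diff_builds(url: str) -> dict:
        old = snapshot(url, "/opt/chromium-old/chrome", "/opt/chromium-old/chromedriver")
        new = snapshot(url, "/opt/chromium-new/chrome", "/opt/chromium-new/chromedriver")
        return {
            "url": url,
            "cookie_diff": set(old["cookies"].items()) ^ set(new["cookies"].items()),
            "console_diff": set(old["console"]) ^ set(new["console"]),
        }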

All of this was related to a proposed change to the Public Suffix List (PSL), which has a bunch of effects on DNS domain handling, cookies, CORS, DMARC, and various other things. Because it is baked into browser EXEs, the only way to test a proposed change ahead of time is to produce your own custom-built browser and test with that to see what would happen. In a sense, there's no "non-production Internet", so these lab tests are the only way.
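
A toy example of why that matters (simplified PSL matching, with made-up names): the registrable domain, and therefore the widest scope a cookie can have, is computed against the suffix list, so adding an entry moves the boundary.

    def registrable_domain(host: str, public_suffixes: set[str]) -> str | None:
        # Longest matching public suffix plus one label (simplified PSL algorithm).
        labels = host.split(".")
        for i in range(len(labels)):
            if ".".join(labels[i:]) in public_suffixes:
                return ".".join(labels[i - 1:]) if i > 0 else None
        return None

    before = {"example"}                      # "sub.state.example" is an ordinary domain
    after = {"example", "sub.state.example"}  # with the proposed PSL entry added

    host = "app.sub.state.example"
    print(registrable_domain(host, before))   # state.example
    print(registrable_domain(host, after))    # app.sub.state.example
    # After the change, a cookie set with Domain=sub.state.example is rejected
    # (that name is now a public suffix), so anything shared across hosts under it breaks.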

Actually, the most compute-intensive part was producing the custom Chromium builds! Those took about an hour each on the same huge server.

By far the most challenging aspect was... the icon. I needed to hand over the custom builds to web devs so that they could double-check the sites they were responsible for, and they were also needed for internal-only web app testing. The hiccup was that the two builds looked the same and ended up with overlapping Windows taskbar icons! Making them "different enough" that they don't share profiles and have distinct toolbar icons was weirdly difficult, especially the icon.

It was a fun project, but the most hilarious part was that it was considered to be such a large-scale thing that they farmed out various major groups of domains to several consultancies to split up the work effort. I just scanned everything because it was literally simpler. They kept telling me I had "exceeded the scope", and for the life of me I couldn't explain to them that treating all domains uniformly is less work than trying to determine which one belongs to which agency.

pdimitar 3 months ago | parent [-]

EXTREMELY nice. Wish I was paid to do that. :/

jiggawatts 3 months ago | parent [-]

So do I! :(

I only get a "fun" project like this once every year or two.

Selling this kind of thing is basically impossible. You can't convince anyone that you have an ability that they don't even understand, at some fundamental level.

At best, you can use your full set of skills opportunistically, but that's only possible on unusual projects. Deploying a single VM for some boring app is always going to be a trivial project that anyone can do.

With this project, even after it was delivered, the customer didn't really understand what I did or what they got out of it. I really did try to explain, but it's just beyond the understanding of executives with no technical background who think only in terms of procurement paperwork and scopes of work.

throwaway2037 3 months ago | parent | prev | next [-]

These comments:

    > He didn’t believe it was possible to have this kind of performance in a “legacy monolithic” (his words) application.

    > I think it’s so heavily ingrained in most cloud/web developers that this is the only option that they will not even entertain the thought that it can be done another way.

One thing that I need to remind myself of periodically: The amount of work that a modern 1U server can do in 2024 is astonishing.

sgarland 3 months ago | parent | next [-]

Hell, the amount of work that an OLD 1U can do is absurd. I have 3x Dell R620s (circa 2012), and when equipped with NVMe drives they match the newest RDS instances and blow Aurora out of the water.

I’ve tested this repeatedly, at multiple companies, with Postgres and MySQL. Everyone thinks Aurora must be faster because AWS is pushing it so hard; in fact, it’s quite slow. Hard to get around physics. My drives are mounted via Ceph over Infiniband, and have latency measured in microseconds. Aurora (and RDS for that matter) has to traverse much longer physical distances to talk to its drives.
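
The napkin math makes the point: light in fibre covers only about 200 km per millisecond, so distance alone puts a floor under storage latency before any protocol or queueing overhead. (The distances below are illustrative, not AWS's actual topology.)

    # Round-trip propagation delay over fibre, ignoring every other source of latency.
    FIBRE_KM_PER_MS = 200_000 / 1000  # ~200 km per millisecond, one way

    def rtt_ms(km_one_way: float) -> float:
        return 2 * km_one_way / FIBRE_KM_PER_MS

    print(f"Same rack (0.05 km):    {rtt_ms(0.05) * 1000:6.1f} us")  # ~0.5 us
    print(f"Cross-AZ (~50 km):      {rtt_ms(50) * 1000:6.1f} us")    # ~500 us
    print(f"Cross-region (~800 km): {rtt_ms(800):6.2f} ms")          # ~8 ms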

HPsquared 3 months ago | parent | prev [-]

It's nice to think about the amount of work done by game engines, for instance. Factorio is a nice example, or anything graphics-heavy.

mschuster91 3 months ago | parent | prev [-]

I mostly agree with you, but using Docker, at least, is something one should be doing even on bare metal.

Pure bare metal IME only leads to people ssh'ing to hotfix something and forgetting to deploy it. Exclusively using Docker images prevents that. Also, it makes firewall management much, much easier as you can control containers' network connectivity (including egress) each on their own, on a bare-metal setup it involves loads of work with network namespaces and fighting the OS-provided systemd unit files.