rossdavidh a day ago

It's a great article, until the end where they say what the solution would be. I'm afraid that the solution is: build something small, and use it in production before you add more features. If you need to make a national payroll, you have to use it for a small town with a payroll of 50 people first, get the bugs worked out, then try it with a larger town, then a small city, then a large city, then a province, and then and only then are you ready to try it at a national level. There is no software development process which reliably produces software that works at scale without doing it small, and medium sized, first, and fixing what goes wrong before you go big.

shagie a day ago | parent | next [-]

> If you need to make a national payroll, you have to use it for a small town with a payroll of 50 people first, get the bugs worked out, then try it with a larger town, then a small city, then a large city, then a province, and then and only then are you ready to try it at a national level.

At a big-box retail chain (15 states, ~300 stores) I worked on a project to replace the POS system.

The original plan had us getting everything working (Ha!), deploying it out to the stores, and ending with the two oddball "stores". The company cafeteria and surplus store were technically stores, in that they had all the same setup and processes, but they were odd.

When the team that I was on was brought into this project, we flipped that around and deployed to those two first, several months ahead of the schedule for the regular stores.

In particular, the surplus store had a few dozen transactions a day; if anything broke, you could do reconciliation by hand. The cafeteria, meanwhile, had single-register transaction volume that surpassed the entire surplus store's on most days. Furthermore, all of its transactions were payroll deductions (swipe your badge rather than a credit card or cash), which meant that if anything went wrong there we weren't in trouble with PCI and could just debit and credit accounts.

Ultimately, we made our deadline to get things out to the stores. We did have one nasty bug that showed up in late October (or was it early November?) with repackaging counts (if a box of 6 was $24 and a single item was $4.50, then buying 6 single items got "repackaged" to cost $24 rather than $27), which interacted with a BOGO sale. That bug resulted in absurd receipts full of sales and discounts (the receipt showed you had spent $10,000 but were discounted $9,976 ... and then the GMs got alerts that the store would not be able to make payroll because of a $9,976 discount ... one of the devs pulled an all-nighter to fix that one and it got pushed to the stores).
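
To make the repackaging rule concrete, here is a rough sketch (made-up class and method names, prices from above; the BOGO interaction that actually blew things up is not modeled):

    // Minimal sketch of "repackaging" pricing: if a customer buys enough singles
    // to make up a pack, that group is repriced at the pack price.
    // Hypothetical names; prices taken from the example above.
    public class RepackagingPricer {
        static final int PACK_SIZE = 6;
        static final double PACK_PRICE = 24.00;   // box of 6
        static final double SINGLE_PRICE = 4.50;  // single item

        // Price n single items, repackaging full groups of 6 at the pack price.
        static double price(int singles) {
            int packs = singles / PACK_SIZE;
            int remainder = singles % PACK_SIZE;
            return packs * PACK_PRICE + remainder * SINGLE_PRICE;
        }

        public static void main(String[] args) {
            System.out.println(price(1)); // 4.50
            System.out.println(price(6)); // 24.00 rather than 27.00, i.e. "repackaged"
            // The production bug appeared when a BOGO promotion was layered on top
            // of this repackaging step, producing receipts like the $9,976 discount.
        }
    }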

I shudder to think what would have happened if we had pushed the POS system out to customer-facing stores before the performance issues were worked out in the cafeteria, or if we had needed to reconcile transactions to hunt down incorrect tax calculations.

skeeter2020 12 hours ago | parent | next [-]

>> We did have one nasty bug that showed up in late October (or was it early November?)

Having worked in e-commerce & payment processing, where this weekend is treated like the Super Bowl, the birth of your first child, and your wedding day all rolled into one, a nasty POS bug at this time of year would be incredibly stressful!

shagie 12 hours ago | parent [-]

After thinking back on it, I think this was earlyish October. The code hadn't frozen yet, but changes were getting increasingly difficult. We were in "this is deployed to about 1/3 of the stores - all within an 8 hour drive of the general office" territory. The go/no-go decision for the rest of the stores in October was coming up (and people were reviewing backout procedures for those 100). One of the awkward parts was that marketing had a Black Friday sale they really wanted to run (buy X, buy Y, get Z half price) that the old registers couldn't support. They wanted an "is this going ahead?" answer so they could start printing the advertising flyers.

Incidentally, this bug resurfaced over the next five years in a different incarnation. Because the data said this department (it was tied to one SKU) had sold $10M that week in October, the running-average sales target the next year was MEAN($24k, $25k, $26k, $25k, $10M) ... and the department heads were asking "you want me to sell how much?!"
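
(For the curious, that works out to roughly a $2M weekly target against a typical $25k week; a trivial sketch with the numbers above:)

    import java.util.Arrays;

    // The running-average target with the bogus $10M week included,
    // using the figures quoted above (in dollars).
    public class SalesTarget {
        public static void main(String[] args) {
            double[] weeklySales = {24_000, 25_000, 26_000, 25_000, 10_000_000};
            double mean = Arrays.stream(weeklySales).average().orElse(0);
            System.out.printf("Target: $%,.0f%n", mean); // ~ $2,020,000
        }
    }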

This bug had only affected... maybe five stores (still maybe five too many). We were in "this is the last™ build before all-store deployment next week" territory. It did mess with that a bit too, as the boxed-up registers came with an additional step of "make sure to reboot the register after doing the initial confirmation."

The setup teams had a pallet of computers delivered to each store, and the plan was supposed to be "remove the old registers, put these registers in, swap the mag stripe readers, take that laptop there and run this software to configure the devices on each register." However, the build those registers shipped with was the buggy build. While that build likely wouldn't have hit the bug (it required a particular sale to be active, which was only at a few stores and had since ended), it was still another step that they had to follow.

Aside: For all its clunkiness, Java Web Start was neat. In particular, it meant that instead of trying to push software to 5k registers (how do you push to registers that are powered off?), we'd push to 300 stores, and from there JWS would check for an update each time it started up ( https://docs.oracle.com/javase/8/docs/technotes/guides/javaw... ). So instead of pushing to 5k registers, we'd have each register pull from 'posupdate' on the local network when it rebooted.
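
The pull-on-startup idea, minus the JWS specifics, is roughly this (a hypothetical sketch, not the JNLP mechanism itself and not our code; the 'posupdate' URLs and file names are made up):

    import java.io.InputStream;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.*;

    // Hypothetical sketch of a pull-based update check at register startup.
    // JWS did this declaratively via its descriptor; this just shows the idea.
    public class StartupUpdateCheck {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            String remote = http.send(
                    HttpRequest.newBuilder(URI.create("http://posupdate/pos/version.txt")).build(),
                    HttpResponse.BodyHandlers.ofString()).body().trim();
            Path localVersion = Path.of("pos-version.txt");
            String local = Files.exists(localVersion) ? Files.readString(localVersion).trim() : "";
            if (!remote.equals(local)) {
                // Newer build on the store server: pull it before launching the register app.
                try (InputStream in = URI.create("http://posupdate/pos/pos.jar").toURL().openStream()) {
                    Files.copy(in, Path.of("pos.jar"), StandardCopyOption.REPLACE_EXISTING);
                }
                Files.writeString(localVersion, remote);
            }
            // ... then launch the POS application from the local jar.
        }
    }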

einpoklum a day ago | parent | prev [-]

You could have, in principle, implemented the new system to run in a "dummy mode" alongside the existing system at regular stores, so that you could see it produces the 'same' results, at least in terms of what the existing system is able to provide.

Which is to say, there is more than one approach to gradual deployment.

shagie a day ago | parent | next [-]

Not easily when issues of PCI get in there.

Things like the credit card reader (and the magnetic ink reader for checks), the other input devices (sending the barcode scanner's output to two different systems), and keyboard input (completely different screens and keyed entry) would have made those hardware problems something that also needed to be solved.

The old system was a DOS-based one where a given set of F-keys was used to switch between screens. Need to do hand entry of a SKU? That was F4, then type the number. Need to search for the description of an item? That was F5. The keyboard was particular to that register setup and used an old-school XT (5-pin DIN) plug. The new systems were much more modern Linux boxes that used USB. The mag stripe readers were flashed with new screens (and the old ones were replaced).

For this situation, it wasn't something that we could send keyboard, scanner, and credit card events to another register.

eru a day ago | parent | next [-]

What's PCI?

Sorry, I'm not familiar with all the acronyms.

shagie a day ago | parent | next [-]

PCI itself is Payment Card Industry. PCI DSS as noted is the Data Security Standard.

https://en.wikipedia.org/wiki/Payment_Card_Industry_Data_Sec...

At the time, it was in the transition between 2.0 and 3.0 (it's been refined many times since).

https://listings.pcisecuritystandards.org/documents/PCI-DSS-... is the 3.2.1 audit report template.

One of the most important things in there is that you don't mix dev and production. The idea of putting a development box next to a production box that runs the same transactions... that just doesn't happen.

Failing a PCI DSS audit means anything from hefty fines and increased transaction fees (paying 1% more on each transaction done with a credit card can make a $10k/month - $100k/month fine a rounding error) up to "no, you can't process credit cards," which would mean... well... shutting down the company (that wouldn't happen for a first offense - it's still not a chat you want to have with accounting about why everything costs 1% more now). Those are things that you don't want to deal with as a developer.

So, no. There is no development configuration in production, nor mirroring of a point of sale terminal to another system that's running development code.

Development code doesn't touch other people's money. We got enough side eye just for looking at the raw data of our manager's payment card on development systems, because only people who banked at one local bank occasionally experienced a problem with their Visa check card. See https://en.wikipedia.org/wiki/Digital_card#Financial_cards - where it says the field separator is "generally '^'", that means it can be some other character... and it was. This wasn't a problem for most people, but it turned out that the non-standard separator (which we only found after reading the card's raw track data) combined with a space in the surname resulted in misparsing the track and throwing an error - and none of our other cards used a separator that didn't match the "generally".
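
A rough illustration of what naive track parsing that trusts the "generally '^'" separator looks like (made-up code and data, using a standard test PAN; not the actual register code):

    // Naive Track 1 parsing that assumes '^' as the field separator.
    // Hypothetical sketch with a well-known test PAN; not the actual register code.
    public class Track1Parser {
        static void parse(String track) {
            // Expected shape: %B{PAN}^{SURNAME/FIRST}^{expiry + service code + discretionary}?
            String body = track.substring(2, track.length() - 1); // strip "%B" and trailing "?"
            String[] fields = body.split("\\^");
            if (fields.length != 3) {
                System.out.println("PARSE ERROR: got " + fields.length + " field(s)");
                return;
            }
            System.out.println("PAN=" + fields[0] + " name=" + fields[1] + " rest=" + fields[2]);
        }

        public static void main(String[] args) {
            // Standards-conforming card: parses fine.
            parse("%B4111111111111111^DOE/JANE^25121010000000000000?");
            // A card whose issuer used a different separator (here, a space), plus a
            // space in the surname: the '^' split finds a single field and parsing fails.
            parse("%B4111111111111111 VAN DOE/JANE 25121010000000000000?");
        }
    }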

So, being able to generate real production load (in the cafeteria) without using Visa, Mastercard, etc... was important. As was being able to fall back to using the nearly antique credit card imprinter ( https://en.wikipedia.org/wiki/Credit_card_imprinter ) for the store that was lucky to get a dozen transactions a day.

wcarss 16 hours ago | parent | next [-]

> So, no. There is no development configuration in production, or mirroring of a point of sales terminal to another system that's running development code.

This is a misreading of the suggestion, I think. My reading of the suggestion is to run a production "dry run" parallel code path, which you can reconcile with the existing system's work for a period of time, before you cut over.

This is not something precluded by PCI; it is exactly the method a team I led used to verify a rewrite of, and migration to, a "new system" handling over a billion dollars of recurring billing transactions annually: write the new thing with all your normal testing etc., then deploy it alongside the old one in a "just tell us what you would do" mode, verify its operation for specific classes of cases, and then roll over progressively to using it for real.

edit: I don't mean to suggest this is a trivial thing to do, especially in the context you mentioned with many elements of hardware and likely odd deployment of updates, etc.
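
For a backend-only change, the "just tell us what you would do" mode can be as small as a shadow call that never affects the customer. A simplified sketch with made-up interface names (not the system described above):

    import java.util.List;
    import java.util.logging.Logger;

    // Hypothetical shadow-mode wrapper: the old pricer stays authoritative,
    // the new pricer runs as a dry run and mismatches are only logged.
    public class ShadowPricer {
        interface Pricer { long priceCents(List<String> skus); }

        private static final Logger LOG = Logger.getLogger("shadow");
        private final Pricer oldPricer;   // current production logic
        private final Pricer newPricer;   // rewrite under verification

        ShadowPricer(Pricer oldPricer, Pricer newPricer) {
            this.oldPricer = oldPricer;
            this.newPricer = newPricer;
        }

        long priceCents(List<String> skus) {
            long authoritative = oldPricer.priceCents(skus);
            try {
                long shadow = newPricer.priceCents(skus);
                if (shadow != authoritative) {
                    LOG.warning("mismatch for " + skus + ": old=" + authoritative + " new=" + shadow);
                }
            } catch (RuntimeException e) {
                LOG.warning("new pricer threw for " + skus + ": " + e);
            }
            return authoritative; // the customer only ever sees the old result
        }
    }

Once the mismatch log stays quiet for long enough, you flip which pricer is authoritative.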

shagie 12 hours ago | parent [-]

Our reading of PCI DSS was that there was no development code in a production build. Having a --dry-run flag would have meant doing that.

You could do "here is the list of skus for transaction 12120112340112345 - run this through the system and see what you get" on our dev boxes hooked up to QA store 2 (and an old device in the lab hooked up to QA store 1). That's not a problem.

Sending the scanner reads to the current production and a dev box in production would have been a hardware challenge. Not completely insurmountable but very difficult.

Sending the keyboard entry to both devices would be a problem. The screens were different and you can hand enter credit card numbers. So keyboard entry is potentially PCI data.

The backend store server would also have been difficult. There were updates to the store server (QA store 1 vs QA store 2 running simultaneously) that were needed too.

This wasn't something that we could progressively roll out within a store. When a store was to get the new terminals, they got a new hardware box, Ingenicos were swapped with Epsons, old Epsons were replaced with new ones (same device, but the screens had to be changed to match a different workflow - they were reprogrammable, but stores didn't have the setup to do that), and a new build was pushed to the store server. You couldn't run register 1 with the old device and register 2 with a new one.

Fetching a list of SKUs, printing up a page of barcodes and running it was something we could do (and did) in the office. Trying to run a new POS system in a non-production mode next to production and mirroring it (with reconciling end of day runs) wasn't feasible for hardware, software, and PCI reasons that were exacerbated by the hardware and software issues.

Online this is potentially easier to do: send a shopping cart to two different price calculators and log whether the new one matches the old one. With a POS terminal, this would be more akin to hooking the same keyboard and mouse up to a Windows machine and a Linux machine, where the Windows machine is running MS Word and the Linux one is running OpenOffice, and checking whether, after five minutes of use of the Windows machine, the Linux machine has the same text entered into OpenOffice. Of course it doesn't - the keyboard commands are different, the windows are different sizes, the menus have things in different places in different drop-downs... similarly, trying to do this with the two POS systems would be a challenge. And to top it off, sometimes the digits typed are hand-keyed credit card numbers (when the MSR couldn't get a read) - and you have to make sure those don't show up on the Linux machine.

I do realize this is reminiscent of business handing over a poorly spec'ed thing and, each time someone says "what about...", coming up with another reason it wouldn't work. This was a system that I worked on for a long while (a decade and a half ago), and I could spend hours drawing diagrams and explaining the system architecture and the issues we had. Anecdotes of how something worked in a 4M SLOC system are inherently incomplete.

wcarss 10 hours ago | parent [-]

Neat! Yeah, that's a pretty complex context and I completely see what you mean about the new hardware being part of the rollout and necessarily meaning that you can't just run both systems. My comment is more of a strategy for just a backend or online processing system change than a physical brick and mortar swap out.

In my note about misreading the suggestion, I was thinking generally. I do believe there is no reason, from a PCI perspective, why a given production system cannot process a transaction live and also in a dry mode on a new code path that's being verified. But if the difference isn't just code paths on a device, and instead involves hardware and process changes, your point about needing to deploy a dev box (and that being a PCI issue) totally makes sense, plus the bit about it being a bad test anyway because of the differences in actions taken and outputs.

The example you gave originally, of shipping to the lower-stakes exceptional stores first and then working out issues with them before you tried to scale out to everywhere, sounded to me like a very solid approach to mitigating risk while shipping early.

shagie 10 hours ago | parent [-]

More of the background of the project.

The original register was a custom-written C program running on DOS. It was getting harder and harder to find C programmers. The consultancy that held part of the maintenance contract for it was having the same difficulty, and it was both raising its rates and deprioritizing the work items, because its senior people (the ones who still knew how to sling C and fit it into computers with 4 MB of memory that you couldn't get replacement parts for anymore) were on other (higher-paying) contracts.

So the company I worked at made the decision to switch from that program to a new one. They bought and licensed the full source to a Java POS system (I've seen the same interface at other big retail companies too) and set out to replace all the hardware in all the stores... ballpark 5000 POS systems.

The professional services consultancy was originally brought in to do this (I recall it being underway when I started there in 2010). They missed deadlines and updates, and I'm sure legal got involved over failure to deliver on the contract. I think it was late 2011 when the company pulled the top devs from each team and set us to work on making this ready in all stores by October 2012 (side note: tossing two senior devs from each of four different teams into a new team results in some challenging personality situations). And that's when we (the devs) flipped the schedule around: instead of March 2013 for the cafeteria and surplus store (because they were the odd ones), we were going to get them in place in March of 2012 so that we could have low-risk production environments while we worked out issues (so many race conditions and graphical event issues hanging old-school AWT).

---

... personality clash memory... it was over some point of architecture and code, and our voices were getting louder. It was a bullpen work environment (a bunch of unsaid backstory here), and the director was in the cube on the other side of the bullpen from us. The director "suggested" that we take our discussion to a meeting room... so we packed up a computer (we needed it to talk about code) and all of the POS devices we needed, put it on a cart, pushed the cart down the hall into a free conference room (there were two conference rooms on that floor - no, this wasn't a building designed for development teams), set up, and went back to loudly discussing. However, we didn't schedule or reserve the room... and the director who had kicked us out of the bullpen had reserved the room we'd been kicked into, starting shortly after we got there. "We're still discussing the topic; that will probably be another 5-10 minutes from now... and it will take us another 5 minutes to pack the computer back up and take it back to the bullpen. Your cube with the extra chairs in it should be available for your meeting, and it's quiet there now without our discussions going on."

hipratham 17 hours ago | parent | prev | next [-]

Why not use aged/anonymized data? That way you can use prod data in dev, with custom security rules anonymizing your data, while still following the DSS.

wcarss 16 hours ago | parent [-]

Lead: "We have six weeks to ship. Questions?"

Dev: "Could we pull an export of relevant historical data and get some time to write code to safely anonymize that, and stand up a parallel production system using just the anonymized data and replicate our deploy there, so we can safely test on real-ish stuff at scale?"

Lead: "I'll think about it. In the meantime, please just build the features I asked you to. We gotta hustle on this one."

I'm not arguing with this hypothetical exchange that it's infeasible, or even a bad idea, to do exactly what you suggested; but justifying an upfront engineering cost that isn't directly finishing the job is a difficult argument to win in most contexts.

philipallstar 15 hours ago | parent [-]

It's very common to use identical systems but anonymised data shipped back to test environments in such cases. There are certain test card numbers that always fail or always succeed against otherwise-real infrastructure on the card provider's side.

wcarss 14 hours ago | parent [-]

Absolutely, I agree that it's a useful pattern. I've personally typed 4111 1111 1111 1111 into a Stripe form more times than I want to even think about.
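
(Those test PANs pass the same Luhn checksum real cards do, which is why they flow cleanly through validation code without belonging to anyone. A minimal sketch of the check:)

    // Luhn checksum: widely used test PANs such as 4111 1111 1111 1111 pass it,
    // so they exercise card validation code without being real accounts.
    public class Luhn {
        static boolean isValid(String pan) {
            int sum = 0;
            boolean doubleIt = false; // double every second digit from the right
            for (int i = pan.length() - 1; i >= 0; i--) {
                int d = pan.charAt(i) - '0';
                if (doubleIt) {
                    d *= 2;
                    if (d > 9) d -= 9;
                }
                sum += d;
                doubleIt = !doubleIt;
            }
            return sum % 10 == 0;
        }

        public static void main(String[] args) {
            System.out.println(isValid("4111111111111111")); // true
            System.out.println(isValid("4111111111111112")); // false
        }
    }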

My point above was that it's not necessarily easy to convince the operators of a business that it's a justifiable engineering expense to set up a new "prodlike but with anonymized data" environment from scratch, because it's not a trivial thing to make and maintain.

I do think it's pretty easy to convince operators of a business to adopt the other strategy suggested in a sibling thread: run a dry mode parallel code path, verify its results, and cut over when you have confidence. This shouldn't really be an alternative to a test environment, but they can both achieve similar stuff.

13 hours ago | parent | next [-]
[deleted]
philipallstar 13 hours ago | parent | prev [-]

> I do think it's pretty easy to convince operators of a business to adopt the other strategy suggested in a sibling thread: run a dry mode parallel code path, verify its results, and cut over when you have confidence. This shouldn't really be an alternative to a test environment, but they can both achieve similar stuff.

I agree - it's a nice low-risk way of doing things.

shagie 12 hours ago | parent [-]

Elsecomment I explained this more...

It is as low risk as trying to use Windows and Microsoft Word with a keyboard and mouse mirrored to a Linux machine running Open Office and expecting the same results.

You can't run the two systems side by side - different screens, different keyboard entry... and some of the keyboard entry can't touch the other system.

And this is all assuming you can put a dry-run path into the production system at all. If the answer to that is "no", then you're putting a dev environment into a production environment... and that's certainly a "no".

We had test environments, and we had a lab where two rows of systems sat back to back, each row hooked up to a different test store (not feasible in a production store environment).

ChrisGreenHeur 21 hours ago | parent | prev [-]

Surely you have logs from the production systems? Just gather the logs and run them through the dev box, then verify the end results match between the two. You don't actually need the dev box to sit next to the production system.

brendoelfrendo 20 hours ago | parent [-]

You cannot, under any circumstances, keep a real card # and use it as test data. I think that's where this conversation is getting hung up, because the idea of running a transaction through prod and then doing the same in test to see if it matches isn't something you can do. I mean, of course you can throw the prices and UPCs at the new system and verify that the new system's math matches the old system's, but that's only the most basic function of a POS system. Testing a transaction end-to-end would have to be done with synthetic data in an isolated environment, and I'll assume that's what OP is trying to articulate.

antihero 15 hours ago | parent | next [-]

There's all this care around card data, but I remember when I was a junior freelancer analysing a calendar availability sync script for a small holiday bookings company (not the big one). The hosts would have a publicly accessible Google Calendar with their bookings on it, which the script I was fixing would pull from.

Turns out, most of the hosts stored their customers' long card numbers + expiry etc. in the comment field of the booking.

ChrisGreenHeur 19 hours ago | parent | prev [-]

The reproduction is always fake to some extent; that doesn't matter, the point is to do as good a job as you can.

For example, you can have a fake transaction server with made-up credit card numbers mapped to fake accounts that always have enough money, unless the records show they didn't.

Ghoelian 18 hours ago | parent | next [-]

I've also worked with payment processors a lot. The ones I've used have test environments where you can fake payments, and some of them (Adyen does this) even give you actual test debit and credit cards, with real IBANs and stuff like that.

skeeter2020 12 hours ago | parent | next [-]

Don't know anything about the OP's system other than "POS", but the bug they experienced - and (maybe?) all the typical integration stuff like inventory management - is very complex and wouldn't manifest itself as a payment processing failure. I'm doubtful that anyone's production inventory or accounting systems allow for "fake" transactions that can be validated by an e2e test.

shagie 11 hours ago | parent [-]

POS stands for Point of Sale in this context.

It was Linux running on (year-appropriate) https://www.hp.com/us-en/solutions/pos-systems-products.html... hardware - plus all the peripherals. The POS software was standalone-ish (you could, in theory, hook a register and the primary store server up to a generator and still process cash, paper checks, and likely store-branded credit cards)... it wouldn't be pleasant, but it could be done.

The logic for discounts, sales, and taxes (and whether an item had sales tax in that jurisdiction) was all on the register. The store server logged the transactions and handled inventory and price lookup, but didn't do the price (sale, tax) calculations itself.

brazzy 14 hours ago | parent | prev [-]

It's even public: https://docs.adyen.com/development-resources/test-cards-and-...

CamouflagedKiwi 17 hours ago | parent | prev [-]

At some point you start to get far away from reality though. If the cards have fake numbers then other auth information is also incorrect - e.g. the CVC won't match, the PIN won't either (depending on the format in use maybe). You can fake all that stuff too but now how much of that system are you really testing?

nenxk 16 hours ago | parent [-]

I mean, in his example the discount bug they ran into wouldn't have needed any card numbers; it could have been discovered with fake/cloned transactions that contained no customer detail. In this case it seems it would have been best to test the payment processing in person at a single store, and then also to test with sales logs from multiple other locations.

ChrisGreenHeur 14 hours ago | parent [-]

Yep, it sounds like the first implementation step really should have been to gather a large set of test data, understand it, and develop the system with it in mind, starting by making tests from that data.

skeeter2020 12 hours ago | parent [-]

They explained the scenario though and it seems like a combination of rarer edge cases. It's great to think your awesome dev team and QA would have collected test data representing this ahead of time, but surely you've all been caught by this? I know I have; that's why we don't have flawless systems at launch.

ghaff a day ago | parent | prev | next [-]

PCI DSS in full. Payment Card Industry Data Security Standard. Basically a bunch of stuff you need to comply with if you're processing credit cards.

master_crab a day ago | parent | prev [-]

Payment card industry. Credit card info (personal user data, etc). There’s a whole boatload of data privacy issues you run into if you mess that up. So compliance is essential.

qotgalaxy a day ago | parent | prev [-]

[dead]

skeeter2020 12 hours ago | parent | prev | next [-]

In my experience, a lot of the hardest problems in this space are either 1. edge cases or 2. integration-related, and that makes them hard to validate across systems or to draw boundaries around what's in the dummy mode. This type of parallel, live, full-system integration test is hard to pull off.

lanstin 11 hours ago | parent [-]

In 1997 I was working on an integration between AOL and Circuit City (ha, I outlived them both) to enable free AOL accounts for people buying PCs or some such; about a week before launch I changed the data we returned from encoding spaces as "+" to "%20" and broke their integration (a Perl script). Very upsetting for them, and I felt bad.
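
The mismatch is easy to reproduce (a small sketch, in Java rather than the Perl of the era; the point is only that both encodings of a space are legal but not interchangeable for a strict consumer):

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    // A space has two common encodings on the wire: '+' (HTML form encoding)
    // and '%20' (percent-encoding). A consumer hard-coded for one breaks on the other.
    public class SpaceEncoding {
        public static void main(String[] args) {
            String value = "free aol account";

            String formEncoded = URLEncoder.encode(value, StandardCharsets.UTF_8);
            System.out.println(formEncoded);                    // free+aol+account

            String percentEncoded = formEncoded.replace("+", "%20");
            System.out.println(percentEncoded);                 // free%20aol%20account

            // A downstream script that only understands one of these forms will
            // silently mangle or reject the other, as described above.
        }
    }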

I also had some weird bug when we started registrations from German accounts and I didn't handle umlauts (or UTF-16 with nuls in the string) in passwords properly.

android521 15 hours ago | parent | prev [-]

Sounds good in theory, but very few real-world projects can afford to run with the old system in parallel.

ozim 14 hours ago | parent | prev | next [-]

There is no solution because these projects are not failing because of technical reasons.

They are failing because of political scheming and a bunch of people wanting to have a finger in the pie - "trillions spent" - I guess no one would mind earning a couple million.

Then you have "important people" who want to be important and want to have an opinion on font size and that some button should be 12px to the right because they are "important" it doesn't matter for the project but they have to assert the dominance.

You have 2 or 3 companies working on a project? Great! Now they will be throwing stuff over the fence to limit their own costs and to blame the others, while trying to get away with the least work done and the most money cashed in.

That is how the sausage is made. Coming up with a "reasonable approach" is not the solution, because as soon as you get different suppliers and different departments, you end up with a power/money struggle.

throw0101c 13 hours ago | parent | next [-]

> They are failing because of political scheming and bunch of people wanting to have a finger in the pie - "trillions spent" - I guess no one would mind earning couple millions.

Not (necessarily) wrong, but if you start small, Important People may not want to bother with something that is Unimportant and may leave things alone, so something useful and working can get going. If you start with an Important project, then Important People will start circling it right away.

munificent 11 hours ago | parent | next [-]

Even starting small isn't a surefire way to avoid that problem. They'll just show up once the thing gets big enough.

Witness how the web was once a funny little collection of nerds sharing stuff with each other. But once it got big enough that you could start making money off it, the important people showed up and started taking over. The web still has those odd little corners, but it's largely the domain of a small number of giant powerful corporations.

I don't think there is a silver bullet for dealing with egomaniacs who want infinite power. They seem to be a part of the human condition and dealing with them is part of the ticket price for having a society.

camgunz 8 hours ago | parent [-]

Dunno if you listen to Ezra Klein, but he once had an anthropologist on who described a tribe where, when someone came back having bagged big game, they had to run a gauntlet of everyone else downplaying the accomplishment: "that's not that big, your father caught bigger", "maybe one day you'll bring down an adult deer", etc. The whole idea was that egomaniacs are pretty bad, and the tribe had a cultural defense against them.

I often think a weakness of liberal, Western society is the insistence on rationality: the idea that the hunter in question could just easily put their abilities and accomplishments alongside those of others and get a pretty accurate picture. This is super untrue; we need systems to guard against our frailties, but we can't admit we have them, so we keep falling into the same ditches.

ozim 4 hours ago | parent | prev [-]

I guess for me the important point is that it is not a technical issue, and we already have all the technical tools/processes to do really big software projects.

Even if people dislike Scrum, find Git complicated, and don't want to open up JIRA - these tools are not the problem; these tools help build loads of working software.

We as software engineers, with DevOps, can deliver great and complex projects and build great systems. Lots of business people don't even understand how much control we can have over the environments and the code.

Yet developers/IT are the ones there to be blamed. As if we should be ashamed; Uncle Bob will give lectures on "how developers should be more professional".

Yet I always find business people who are like children in a corn field.

With the small difference that business/sales guys are pushy and walk all over the engineering guys, engineers bend over and take the blame, and the business guys can always say "those IT kids are playing with toys instead of doing a real job".

ethbr1 13 hours ago | parent | prev | next [-]

Political corruption is like environmental radiation: a viable fix is never 'just get rid of political corruption'*. It's an environmental constant that needs to be handled by an effective approach.

That said, parent's size- and scope-iterative approach also helps with corruption, because corruption metastasizes in the time between {specification} and {deliverable}.

Shrink that, by tying incremental payments to working systems at smaller scales, and you shrink the blast radius for failure.

That said, there are myriad other problems the approach creates (encouraging architectures that won't scale to the final system, promoting duct taped features on top of an existing system, vendor-to-vendor transitions if the system builder changes, etc).

But on the whole, the pros outweigh the cons... for projects controlled by a political process (either public or private).

That's why military procurement has essentially landed on spiral development (i.e. iterative demonstrated risk burn-down) as a meta-framework.

* Limit political corruption, to the extent possible in a cost efficient manner, sure

pksebben 11 hours ago | parent | prev [-]

> There is no solution because these projects are not failing because of technical reasons.

There is no technical solution. There are systems and governance solutions, if the will is there to analyze and implement them.

solatic a day ago | parent | prev | next [-]

That's what works for products, not software systems. Gradual growth inevitably results in loads of technical debt that is not paid off as Product adds more feature requests to deliver larger and larger sales contracts. Eventually you want to rewrite to deal with all the technical debt, but nobody has enough confidence to say what is in the codebase that's important to Product and what isn't, so everybody is afraid and frozen.

Scale is separately a Product and Engineering question. You are correct that you cannot scale a Product to delight many users without it first delighting a small group of users. But there are plenty of scaled Engineering systems that were designed from the beginning to reach massive scale. WhatsApp is probably the canonical example of something that was a rather simple Product with very highly scaled Engineering and it's how they were able to grow so much with such a small team.

mekoka a day ago | parent | next [-]

> Gradual growth inevitably results in loads of technical debt.

Why is this stated as though it's some de facto software law? The argument is not whether it's possible to waterfall a massive software system. It clearly is possible, but the failure ratios have historically been sufficiently uncomfortable to give rise to entirely different (and evidently more successful) project development philosophies, especially when promoters were more sensitive to the massive sums involved (which in my opinion also helps explain why there are so many wasteful government examples). The lean startup did not appear in a vacuum. "Do things that don't scale" did not become a motto in these parts without reason. In case some are still confused about the historical purpose of this benign-sounding advice: no, it wasn't originally addressed at entrepreneurs aiming to run "lifestyle" businesses.

lanstin 11 hours ago | parent | next [-]

I think the logic is that good code is code which is maintainable and modifiable; bad code is difficult to change safely. Over time, all code is changed until it is bad code and cannot be changed any more. So over time most code is bad code which is scary to touch.

tonyhart7 14 hours ago | parent | prev | next [-]

It's not a law, but a cost.

Software is a unique field where a project can be the kind of problem where, no matter how much money you throw at it, there is always something we can "improve" or make better.

That's why we start with something small - a scope, if you want to call it that.

Of course, starting with something small (or dare I call it simpler) results in more technical debt, because that thing isn't designed with scale in mind - which goes back to the first point.

jappgar 15 hours ago | parent | prev [-]

It is a law. The law of entropy.

Try as you might, you cannot fight entropy eternally, as mistakes in this fight will accumulate and overpower you. It's the natural process of aging we see in every lifeform.

The way life continues on despite this law is through reproduction. If you bud off independent organisms, an ecosystem can gain "eternal" life.

The cost is that you must devote much of your energy to effective reproduction.

In software, this means embracing rewrites. The people who push against rewrites and claim they're not necessary are just as delusional as those who think they can live forever.

cjfd 14 hours ago | parent [-]

You don't understand very much about entropy. This reasoning is very, very, very sloppy.

jappgar 14 hours ago | parent | next [-]

Now I remember why I stopped commenting here.

ebcode 11 hours ago | parent | prev [-]

low-effort comment with ad hominem and zero rationale. fairly toxic.

otterley a day ago | parent | prev | next [-]

Software is a component of a product, if not the product itself. Treating software like a product, besides being the underlying truth, also means it makes sense to manage it like one.

Technical debt isn’t usually the problem people think it is. When it does become a problem, it’s best to think of it in product-like terms. Does it make the product less useful for its intended purpose? Does it make maintenance or repair inconvenient or costly? Or does it make it more difficult or even impossible to add competitive features or improvements? Taking a product evaluation approach to the question can help you figure out what the right response is. Sometimes it’s no response at all.

jfreds a day ago | parent | next [-]

Took me way too long to learn this. It still makes me sad to leave projects “imperfect” and not fiddle in my free time sometimes

YetAnotherNick 21 hours ago | parent | prev [-]

The discussion is not about a product where you can just remove stuff. The thread was about testing in a small setting and then moving to an oddball setting. If it is required to cover oddball settings, it makes sense to know about and plan for them.

Jtsummers a day ago | parent | prev | next [-]

Designing or intending a system to be used at massive scale is not the same as building and deploying it so that it only initially runs at that massive scale.

That's just a recipe for disaster, "We don't even know if we can handle 100 users, let's now force 1 million people to use the system simultaneously." Even WhatsApp couldn't handle hundreds of millions of users on the day it was first released, nor did it attempt to. You build out slowly and make sure things work, at least if you're competent and sane.

solatic a day ago | parent | next [-]

Sure, but if you did a good job, the gradual deployment can go relatively quickly and smoothly, which is how $FAANG roll out new features and products to very large audiences. The actual rollout is usually a bit of an implementation detail of what first needed to be architected to handle that larger scale.

coliveira a day ago | parent | next [-]

The issue with FAANG is that they already have the infrastructure to make these large scale deployments. So any new system - by necessity - needs to conform to that large scale architecture.

brendoelfrendo 20 hours ago | parent [-]

The other nice thing about FAANG is that almost nothing they do is actually necessary. If Facebook rolls out a new feature and breaks something for a few hours, it doesn't actually matter. It's harder to move fast and break things if you're, say, a bank, and every minute of downtime is a minute where your customers can't access their money. Enough minutes go by and you may have a very, very expensive crisis on your hands.

lanstin 10 hours ago | parent | next [-]

Replying to myself in sibling: except maybe people paying for ads, which is more of a faith based action; it's well known a lot of ad traffic is fraudulent, but not which traffic. So if you pay for ads, who can tell what happened.

lanstin 11 hours ago | parent | prev [-]

Yeah, people get real upset about even 1 messed up money transaction.

vlovich123 a day ago | parent | prev | next [-]

You maybe get certain big pieces correct, but you'd be surprised how many mistakes get made. For example, I had designed the billing system for a large distributed product; the engineer ended up implementing it not as described in the spec, and it fell down fairly quickly with even a modicum of growth.

eru a day ago | parent | prev | next [-]

Well, Google got good at large-scale rollouts because they are doing large-scale rollouts all the time. _And_ most of the time, the system they are rolling out is a small iteration on the last system they rolled out: the new GMail servers look almost exactly like the last GMail servers, but they have one extra feature flag you can turn on (and which is disabled by default) or have one bug fixed.

That's a very different challenge from rolling out a brand new system once.

mmooss 19 hours ago | parent | prev [-]

FAANG tests first on test beds, and on subsets of their user base.

agos 19 hours ago | parent [-]

also, see what happened last week when Cloudflare pushed out a bad configuration without trying it on a subset

mk89 a day ago | parent | prev [-]

No, but WhatsApp was built by 2 guys who had previously worked at Yahoo, and they picked a very strong tech for the backend: Erlang.

So while they probably didn't bother scaling the service to millions in the first version, they 1) knew what it would take, and 2) chose from the ground up a good technology to have a smoother transition to your "X million users". The step from X million to XYZ million and then billions required other things too.

At least they didn't have to write a PHP-to-C++ compiler like Facebook had to, given Mark Zuckerberg's initial design choice, which shows exactly what it means to begin something with the right tool and ideas in mind.

But this takes skills.

Jtsummers a day ago | parent | next [-]

> No but whatsapp was built by 2 guys that had previously worked at Yahoo, and they picked a very strong tech for the backend: erlang.

https://news.ycombinator.com/item?id=44911553

Started as PHP, not as Erlang.

> 1) knew what it would take, 2) chose already from the ground up a good technology to have a smoother transition to your "X millions users".

No, as above, that was a pivot. They did not start from the ground up with Erlang or ejabberd, they adopted that later.

mk89 a day ago | parent [-]

Thanks, somehow I remembered wrong.

nradov a day ago | parent | prev [-]

Did they succeed because of Erlang or in spite of Erlang? We can't draw any reliable conclusions from a single data point. Maybe a different platform would have worked even better?

HalcyonicStorm a day ago | parent | next [-]

Erlang is uniquely suited to chat systems out of the box in a way that most other ecosystems aren't: lightweight green threads via the BEAM VM, a process scheduler so it's concurrent out of the box, immutable data structures, and message passing as the communication between processes.

nradov 12 hours ago | parent [-]

There's nothing unique about Erlang. I have nothing against it but other companies have built messaging systems using other platforms that work as well or better than WhatsApp.

awesome_dude a day ago | parent | prev [-]

Yeah - the technology used is a separate concern from their abilities as users (developers) of that technology and their effectiveness at handling the scale.

I, for example, have always said that I am more than capable of writing code in C that is several orders of magnitude SLOWER than what I could write in... say, Python.

My skillset would never be used as an example of the value of C for whatever

philipallstar 16 hours ago | parent | prev | next [-]

> Gradual growth inevitably results in loads of technical debt that is not paid off as Product adds more feature requests to deliver larger and larger sales contracts.

This isn't technical debt, necessarily. Technical debt is a specific thing. You probably mean "an underlying design that doesn't perfectly map to what ended up being the requirements". But then the world moves on (what if a regulation is added that ruins your perfect structure anyway?) and you can't just wish for perfect requirements. Or not in software that interacts directly with the real world, anyway.

jimbokun a day ago | parent | prev | next [-]

Yes, it can be very difficult to add “scale” after the fact, once you already have a lot of data persisted in a certain way.

paulsutter a day ago | parent | prev | next [-]

You have to design for scale AND deploy gradually

rossdavidh a day ago | parent [-]

Yes, absolutely. Knowing that it will need to get big eventually is important, but not at all the same as deploying at scale initially.

dustingetz a day ago | parent | prev | next [-]

we get paid to add to it, we don’t get paid to take away

cjfd a day ago | parent [-]

Now there is your problem. It is only true in the context of grave incompetence, though. I have worked on tickets with 'remove' in the title.

fatbird 11 hours ago | parent | prev | next [-]

There's nothing wrong with technical debt per se. As with all debt, the problem is incurring it without a plan or means to pay it off. Debt based financing is the engine of modern capitalism.

Gradual growth to large scale implies an ongoing refactoring cost--that's the price of paying off the technical debt that got you started and built initial success in small scale rollouts. As long as you keep "servicing" your debt (which can include throwing away an earlier chunk and building a more scalable replacement with the lessons learned), you're doing fine.

The magic words here to management/product owners are: "we built it that way the first time because it got us running quickly and taught us what we need to know to build the scalable version. If we'd tried to go for the scalable version first, we wouldn't have known foo, bar and baz, and we'd have failed and wouldn't have learned anything."

lelandbatey a day ago | parent | prev | next [-]

Gradual growth =/= many tacked on features. Many tacked on features =/= technical debt. Technical debt =/= "everybody is afraid and frozen." Those are merely often correlated, but not required.

WhatsApp is a terrible example because it's barely a product; WhatsApp is mostly a free offering of goodwill riding on the back of actual products like Facebook Ads. A great example would be a product like Salesforce, SAP, or Microsoft Dynamics. Those products are forced to grow and change and adapt and scale, to massive numbers doing tons of work, all while being actual products and being software systems. I think such products act as stark rebukes of what you've described.

golemiprague a day ago | parent | prev [-]

[dead]

hinkley a day ago | parent | prev | next [-]

The dominant factor is: there is a human who understands the entire system.

That is vastly easier to achieve by making a small, successful system, which gets buy-in from both users and builders to the extent that the former pay sufficient money for the latter to be invested in understanding the entire system and then growing it and keeping up with the changes.

Occasionally a moonshot program can overcome all of that inertia, but the "90% of all projects fail" figure is definitely overrepresented in large projects. And the Precautionary Principle says you shouldn't, because the consequences are so high.

dominicrose 17 hours ago | parent [-]

This works for Clojure, git and even Linux. It seems there's a human who understands the entire system and decides what's allowed to be added to it. But these things are meant to be used by technical people.

The non-technical people I know might want to use Linux but stay on Windows or choose Mac OS because it's more straightforward. I use Windows+WSL at work even though I would like to use a native Linux distribution.

I know someone who created a MUD (a text-based online game), and I told him I wanted to make one with a browser client. He said something we could translate as "Good, you can have all the newbies." Not only was he right that a MUD should be played with a MUD client like tintin++, but making a good browser client is harder than it seems, and that's time not spent making content for the game or improving the engine.

My point is that he was an uncompromising person who refused to add layers to a project, because they would come at a cost which isn't only time or dollars but also things like motivation and focus.

hinkley 10 hours ago | parent | next [-]

You're conflating "knows the system" with "benevolent dictator". They're not the same. It comes down to whether, in a planning or brainstorming session, there is anyone who can say that a plan won't work or that there's a better one.

Also, it doesn't have to be singular. You want more than one such person, in case one leaves or becomes problematic. That dictator doesn't always remain benevolent, and they can hold a project hostage if they don't like something that everyone else wants.

wickedsight 17 hours ago | parent | prev [-]

> ... even Linux. It seems there's a human who understands the entire system and decides what's allowed to be added to it.

I really wonder what will happen to Linux once Linus is no longer involved.

13 hours ago | parent | next [-]
[deleted]
inemesitaffia 16 hours ago | parent | prev [-]

Greg KH

base698 16 hours ago | parent | prev | next [-]

> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

Gall’s law wins again.

throw0101c 13 hours ago | parent | prev | next [-]

> I'm afraid that the solution is: build something small, and use it in production before you add more features.

Gall's Law:

> A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.[8]

* https://en.wikipedia.org/wiki/John_Gall_(author)#Gall's_law

OtherShrezzing a day ago | parent | prev | next [-]

While I think this is good advice in general, I don’t think your statement that “there is no process to create scalable software” holds true.

The UK gov development service reliably implements huge systems over and over again, and those systems go out to tens of millions of users from day 1. As a rule of thumb, the parts of the UK govt digital suite that suck are the parts the development service hasn't been assigned to yet.

The SWIFT banking org launches reliable features to hundreds of millions of users.

There are honestly loads of instances of organisations reliably implementing robust and scalable software without starting with tens of users.

sjclemmy a day ago | parent | next [-]

The UK government development service, as you call it, is not a service. It's more of a declaration of process that is adhered to across the diverse departments and organisations that make up the government. It's usually small teams that are responsible for exploring what a service is or needs and then implementing it. They are able to deliver decent services because they start small, design and user-test iteratively, and only when there is a really good understanding of what's being delivered do they scale out. The technology is the easy bit.

pjc50 18 hours ago | parent | prev | next [-]

UK GDS is great, but the point there is that they're a crack team of project managers.

People complain about junior developers who pass a hiring screen and then can't write a single line of code. The equivalent exists for both project management and management in general, except it's much harder to spot in advance. Plus there's simply a lot of bad doctrine and "vibes management" going on.

("Vibes management": you give a prompt to your employees vaguely describing a desired outcome and then keep trying to correct it into what you actually wanted)

robertlagrant 16 hours ago | parent | prev | next [-]

> and those systems go out to tens of millions from day 1

I like GDS (I even interviewed with them once and saw their dev process etc) but this isn't a great example. Technically GDS services have millions of users across decades, but people e.g. aren't constantly applying for new passports every day.

A much better example I think is Facebook's rollout of Messenger, which scaled to billions of actual users on day 1 with no issues. They did it by shipping the code early in the Facebook app, and getting it to send test messages to other apps until the infra held, and then they released Messenger after that. Great test strategy.

zipy124 16 hours ago | parent | prev | next [-]

GDS's budget is about £90 million a year or something. There are many other contracts still spent on digital - for example, PA Consulting for £60 million (over a few years), which does a lot of the gov.uk Home Office stuff - and the fresh grads they hire cost the government more than GDS's most senior staff...

sam_lowry_ a day ago | parent | prev [-]

SWIFT? Hold my beer. SWIFT has not launched anything substantial since its startup days in the early '70s.

Moreover, their core tech has not evolved that far from that era, and the '70s tech bros are still there through their progeny.

Here's an anecdote: The first messaging system built by SWIFT was text-based, somewhat similar to ASN.1.

The next one used XML, as it was the fad of the day. Unfortunately, neither SWIFT nor the banks could handle a 2-3 order of magnitude increase in payload size in their ancient systems. Yes, as engineers, you would think compressing the XML would solve the problem, and you would be right. Moreover, XML Infoset already existed, and it defined compression as a function of the XML Schema, so it was somewhat more deterministic, even if not more efficient than LZMA.

But the suits decided differently. At one of the SIBOS conferences they abbreviated the XML tags, and did it literally on paper, without thinking about back-and-forth translation, dupes, etc.

And this is how we landed with the ISO 20022 abbreviations that we all know and love: Ccy for Currency, Pmt for Payment, Dt for Date, etc.

noname120 a day ago | parent | next [-]

Harder to audit when every payload needs to be decompressed to be inspected

WJW 15 hours ago | parent [-]

Is it? No auditor will read binary, so you already need a preprocessing step to get it to a readable format. And if you're already preprocessing then adding a decompression step is like 2 lines tops.

a day ago | parent | prev [-]
[deleted]
hintymad a day ago | parent | prev | next [-]

> https://www.amazon.com/How-Big-Things-Get-Done/dp/0593239512

This is what https://www.amazon.com/How-Big-Things-Get-Done/dp/0593239512 advocates too: start small, modularize, and then scale. The example of Tesla's mega factory was particularly enticing.

nostrademons a day ago | parent | prev | next [-]

Came here to say this. I still think that Linus Torvalds has the most profound advice to building a large, highly successful software system:

"Nobody should start to undertake a large project. You start with a small trivial project, and you should never expect it to get large. If you do, you'll just overdesign and generally think it is more important than it likely is at that stage. Or worse, you might be scared away by the sheer size of the work you envision. So start small, and think about the details. Don't think about some big picture and fancy design. If it doesn't solve some fairly immediate need, it's almost certainly over-designed. And don't expect people to jump in and help you. That's not how these things work. You need to get something half-way useful first, and then others will say "hey, that almost works for me", and they'll get involved in the project."

-- Linux Times, October 2004.

tsimionescu 21 hours ago | parent | next [-]

I don't think this applies in any way to companies contracted to build a massive system for a government with a clear need. Linus is talking about growing a greenfield open-source project, which may or may not ever be used by anyone.

In contrast, if your purpose is "we need to manage our country's accounting without pen and paper", that's a clear need for a massive system. Starting work on this by designing a system that can solve accounting for a small firm is not the right way to go. Instead, you have to design with the end-goal in mind, since that's what you were paid for. But, you don't launch your system to the entire country at once: you first use this system designed for a country in a small shop, to make sure it actually handles the small scale well, before gradually rolling out to more and more people.

mrweasel 18 hours ago | parent | next [-]

> for a government with a clear need.

There's your problem. The needs are never clear, not on massive systems. Governments will write a spec; companies will read the spec and offer to implement it as written, knowing full well that it won't work. Then they charge exorbitant fees to modify the system after launch, so that it will actually fulfill business needs.

The Danish government is famous for sucking at buying massive IT systems.

  * Specs for the new tax system: 6000 pages, tax laws not included. That's basically impossible to implement, and it predictably failed. The version that worked: implement just the basics to collect TV license fees, then build from there.

  * System to calculate the value of people's homes: I think we're at round five (rumor has it that one system worked, but was scrapped because it showed that most homes are massively overvalued, and that would do terrible things to tax collection in the municipalities).

  * New case management system for the police: failed, development never restarted. One suggested solution was to have the police hire a handful of the best developers in the country and have them produce smaller deliverables over a number of years. The money wasted could have funded 10 world-class developers for ~30-50 years.
patmorgan23 11 hours ago | parent | prev [-]

Building those systems is a long-term project, and you have to start small with a minimum number of functions; scope creep on those initial use cases often kills these kinds of projects.

ozim 20 hours ago | parent | prev | next [-]

No, Linus Torvalds would not stand for the people in the projects from the article; he would slam the door and quit.

Those projects the author pointed out are basically political horror stories. I can imagine how dozens of people wanted a cut of the money in those projects, or wanted to push things through because "they are important people".

There is nothing you can do technically to save such projects and it is NOT an IT failure.

nly 19 hours ago | parent | prev | next [-]

This works for implementations, though, not for APIs.

A bad API can constrain your implementation and often can't be changed once it's in use by loads of users. APIs should be right from day one if possible.

globalise83 18 hours ago | parent [-]

I would add the nuance that it's the possibility of controlled migration from one versioned API to another that should be right from day one, not necessarily the first API version itself.
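
A minimal sketch of what "migration designed in from day one" could look like; the field names and version numbers below are made up purely for illustration:

  CURRENT_VERSION = 2

  def upgrade_v1_to_v2(req: dict) -> dict:
      # Hypothetical change: v1 had a single "name" field, v2 splits it.
      first, _, last = req.pop("name").partition(" ")
      return {**req, "version": 2, "first_name": first, "last_name": last}

  UPGRADERS = {1: upgrade_v1_to_v2}

  def normalize(req: dict) -> dict:
      # Walk any old request forward, one version at a time, so the
      # business logic only ever sees the current shape.
      while req.get("version", 1) < CURRENT_VERSION:
          req = UPGRADERS[req.get("version", 1)](req)
      return req

The point isn't this particular shape; it's that the version field and the upgrade path exist before anyone depends on v1.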

ambicapter a day ago | parent | prev [-]

There is a really dense amount of lifetime-accumulated wisdom packed into that single quote.

WJW 16 hours ago | parent | prev | next [-]

While I like the "start small and expand" strategy better than the "big project upfront" approach, it trades project size for project length, and often that is no better:

- It gives outside leadership types many more opportunities to add requirements later. This is nice if they are things missed in the original design, but it can also lead to massive scope creep.

- A big enough project that gets done the "start small and expand" way can easily grow into a decade-plus project. For an extreme example, see the multi-decade project by the Indian rail company to gradually convert all its railways to a single gauge. It works fine if you have the organisational backing for a long duration, but the constant knowledge leaks from people leaving, retiring, getting promoted, etc. can be a real problem for a project like that, especially in fields where the knowledge is the product, like software.

- Not every project can feasibly start small.

hermitcrab 10 hours ago | parent | prev | next [-]

>It's a great article, until the end where they say what the solution would be. I'm afraid that the solution is: build something small, and use it in production before you add more features.

I think that is true for a lot of projects. But I'm not sure it is realistic to incrementally develop a control system for a nuclear reactor or an air traffic control system.

eru a day ago | parent | prev | next [-]

> If you need to make a national payroll, you have to use it for a small town with a payroll of 50 people first, get the bugs worked out, then try it with a larger town, then a small city, then a large city, then a province, and then and only then are you ready to try it at a national level.

You could also try to buy some off-the-shelf solutions? Making payroll, even for very large organisations, isn't exactly a new problem.

As a corollary I would also suggest: subsidiarity.

> Subsidiarity is a principle of social organization that holds that social and political issues should be dealt with at the most immediate or local level that is consistent with their resolution.

(from https://en.wikipedia.org/wiki/Subsidiarity)

If you solve more problems more locally, you don't need that many people at the national level, so making payroll there is easier.

tsimionescu 21 hours ago | parent [-]

I think you'll find that is exactly what people do. However, payroll solutions are highly customized for every individual company and even business unit. You don't buy payroll software in a box, deploy it, and now you have payroll. Instead, you pay a payroll software company, they come in and gather information about your payroll processes, and then they roll out their software on some of your systems and work with you to make sure their customizations work, etc. There's rarely any truly "off-the-shelf" software in B2B transactions, especially the type of end-user solution that also interacts with legal systems.

Also, governments are typically at least an order of magnitude larger than the largest companies operating in their countries, in terms of employees. So sure, the government of Liechtenstein has fewer employees than Google overall, but the US government certainly does not, and even Liechtenstein probably has way more government employees than Google employees in their country.

duxup a day ago | parent | prev | next [-]

I work at a small shop, and I'm a big advocate of giving customers the 0.1 version and then talking out what they want. It's often not exactly what they asked for at the start ... but it often is better in the end.

It's hard to hit the target right the first time.

BrenBarn a day ago | parent | prev | next [-]

Yes. Also the same applies to companies. There should not be companies that are growing to $100 million revenue while losing money on a gamble that they will eventually get big enough to succeed. Good first, big later.

SchemaLoad a day ago | parent [-]

$100M maybe. But pretty much all tech needs an initial investment before you can start making profit. It takes a lot of development before you can get a product that anyone would want to pay for.

roeles 21 hours ago | parent | prev | next [-]

Not saying you're wrong, but I wonder what the differentiating factor is for software? We can build huge things like airliners, massive bridges, and buildings without starting small.

Incremental makes less sense to me when you want to go to Mars. Would you propose to write the software for such a mission in an incremental fashion too?

Yet for software systems it is sometimes proposed as the best way.

Ensorceled 14 hours ago | parent | next [-]

> We can build huge things like airliners, massive bridges and buildings without starting small.

We did start small with all of those things. We developed rigorous disciplines around engineering, architecture, and materials science. And people died along the way in the thousands [0][1].

People are still dying from those failures; the Boeing 737 MAX crashes were only a few years ago.

> Incremental makes less sense to me when you want to go to mars.

This is yet another reason why a manned Mars mission will be exceedingly dangerous, NOT a strike against incremental development and deployment.

[0] https://en.wikipedia.org/wiki/List_of_building_and_structure...

[1] https://en.wikipedia.org/wiki/List_of_accidents_and_incident...

cheepin 21 hours ago | parent | prev | next [-]

All of the things you mentioned are designed and tested incrementally. Furthermore, software has been used on Mars missions in the past, and that software was also developed incrementally. It's proposed as the best way because it's a way that has a track record.

roeles 10 hours ago | parent [-]

> All of the things you mentioned are designed and tested incrementally.

In a different way than what is proposed in this thread. We don't build a small bridge and grow it. We build small bridges, develop a theory for building bridges, and use that theory to design the big bridge.

I don't know of any theory of computing that would help us design a "big" program at once.

jiggawatts a day ago | parent | prev | next [-]

You will never get to the moon by making a faster and faster bus.

I see a lot of software with that initial small scale "baked into it" at every level of its design, from the database engine choice, schema, concurrency handling, internal architecture, and even the form design and layout.

The best-engineered software I've seen (and written) always started at the maximum scale, with at least a plan for handling future feature extensions.

As a random example, the CommVault backup software was developed at AT&T to deal with their enormous distributed scale, and it was the only decently scalable backup software I had ever used. With its competitors, it was a serious challenge just to run a report of last night's backup job status!

I also see a lot of "started small, grew too big" software make hundreds of silly little mistakes throughout, such as using drop-down controls for selecting users or groups. Works great for that mom & pop corner store customer with half a dozen accounts, fails miserably at orgs with half a million. Ripping that out and fixing it can be a decidedly non-trivial piece of work.

Similarly, cardinality in the database schema has really irritating exceptions that only turn up at the million- or billion-row scale and can be obscenely difficult to fix later. An example I'm familiar with is that the ISBN codes used to "uniquely" identify books are almost, but not quite, unique. There are a handful of duplicates, and yes, they turn up in real libraries. This means that if you used these as a primary key somewhere... bzzt... start over from the beginning with something else!

There is no way to prepare for this if you start with indexing the book on your own bookshelf. Whatever you cook up will fail at scale and will need a rethink.
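
For what it's worth, the defensive version of the ISBN case is cheap if you know about it up front; a rough sketch (SQLite used here just for illustration):

  import sqlite3

  conn = sqlite3.connect(":memory:")

  # Surrogate key as the primary key; ISBN is indexed but deliberately
  # NOT unique, because real-world ISBNs have a handful of duplicates.
  conn.execute("""
      CREATE TABLE books (
          book_id INTEGER PRIMARY KEY,
          isbn    TEXT NOT NULL,
          title   TEXT NOT NULL
      )
  """)
  conn.execute("CREATE INDEX idx_books_isbn ON books (isbn)")

  # Two different records sharing one ISBN: an ISBN primary key would
  # reject the second insert; a surrogate key takes it in stride.
  conn.executemany(
      "INSERT INTO books (isbn, title) VALUES (?, ?)",
      [("0123456789", "Title A"), ("0123456789", "Title B")],
  )

The hard part is that nothing in a one-bookshelf dataset would ever tell you the "unique ISBN" constraint is wrong.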

rmunn 21 hours ago | parent | next [-]

Counterpoint: the idea that your project will be the one to scale up to millions of users/requests/etc. is hubris. Odds are, your project won't scale past 10,000 to 100,000. Designing every project to scale to the millions from the beginning often leads to overengineering, adding needless complexity when a simpler solution would have worked better.

Naturally, that advice doesn't hold if you know ahead of time that the project is going to be deployed at massive scale. In which case, go ahead and implement your database replication, load balancing, and failover from the start. But if you're designing an app for internal use at your company of 500, well, feel free to just use SQLite as your database. You won't ever run into the problems of scale in this app, and single-file databases have unique advantages when your scale is small.

Basically: know when huge scale is likely, and when it's immensely UNlikely. Design accordingly.

jiggawatts 18 hours ago | parent [-]

> Odds are, your project won't scale past a scale of 10,000 to 100,000.

That may be a self-fulfilling prophecy.

I agree in general that most apps don't need fancy scaling features, but apps that can't scale... won't... and hence "don't need scaling features".

> You won't ever run into the problems of scale in this app, and single-file databases have unique advantages when your scale is small.

I saw a customer start off with essentially a single small warehouse selling, I dunno... widgets or something... and then the corporation grew and grew into a multi-national shipping and logistics corporation. They were saddled with an obscure proprietary database that worked like SQLite and had technical challenges that were incredibly difficult to overcome. They couldn't just migrate off, because that would have required a massive, multi-year total rewrite of their app.

For one performance issue, we were seriously trying to convince them to use phase-change cooling on frequency-optimized server CPUs, like a gamer overclocking their rig, because that was the only way to eke out just enough performance to ensure their overnight backups didn't run into the morning busy time.

That's just not an issue with SQL Server or any similar standard client-server database engine.

jkrejcha 17 hours ago | parent [-]

I think part of that thinking, though, is that if you do basic stuff like use a standard database engine and don't go too far off the beaten path, you tend to get the scale you ultimately need basically for free.

A lot of the time, this is what I see "don't build for huge scale" to mean. It's not necessarily "be proud of O(n^2) algorithms". Rather, it's more "use Postgres instead of some hyperscale sharded database when you only have 10 million users", because the alternative tends to miss the forest (and oftentimes the scale, ironically) for the trees.

jiggawatts 7 hours ago | parent [-]

Yes, but also I've found that using a decently scalable engine is insufficient for a good outcome without the scaled data.

The best software I've written always had > 10 GB of existing data to work with from day one. So for example the "customers" table didn't have one sample entry, it had one million real entries. The "products" table had a real history of product recalls, renames, category changes over time, special one-off products for big customers, etc...

That's how you find out the reality of the business instead of some idealised textbook scenario.

Things like: Oh, actually, 99% of our product SKUs are one-off specials, but 99% of the user interactivity and sales volume is with the generic off-the-shelf 1% of them, so the UI has to cater for this and the database table needs a "tag" on these so that they can be filtered. Then it turns out that filtering 10 million products down to 100K has non-trivial performance issues when paging through the list. Or even worse, 50% of the products are secret because their mere existence or their name is "insider info" that we don't want our own staff to see. Did I say "staff"? I meant subcontractors, partners, and resellers, all with their own access rules and column-level data masking that needs to be consistent across dozens of tables. Okay, let's start going down the rabbit hole of column naming conventions and metadata...

You can't predict that stuff in a vacuum, no human has the foresight needed to "ask the right questions" to figure this all out through workshops or whatever. The ground-truth reality is the best thing, especially up-front during the early phases of development.

A lot of the above is actually easy to implement as a software developer, but hard to change halfway through a project.

t43562 9 hours ago | parent | prev [-]

You can by making a bigger and bigger rocket though.

chrisweekly a day ago | parent | prev | next [-]

See also Gall's Law:

"All complex systems that work evolved from simpler systems that worked"

Cthulhu_ 18 hours ago | parent | prev | next [-]

That's the ideal, but a lot of these big problems can't start small, because the problem they address is already big. A lot of government IT programs are set up to replace existing software and processes, often combining the jobs of a lot of legacy software plus the manual labor involved.

If you have something like a tax office or payroll, they need to integrate decades of legislation and rules. It's doable, but you need to understand the problem (which at those scales is almost impossible to fit in one person's head) and more importantly have diligent processes and architecture to slowly build up and deploy the software.

tl;dr it's hard. I have no experience with anything at that scale; I've been at the edges of large organizations (e.g. consumer-facing front-ends) for most of my career.

chrsw a day ago | parent | prev | next [-]

That sounds like the way nature handles growth and complexity: slowly and over long time scales. Assume there will be failures, don't die and keep trying.

When you bite off too much complexity at once you end up not shipping anything or building something brittle.

bryanhogan a day ago | parent | prev | next [-]

You just need: Plan -> Implement -> Test -> Repeat

Whether you are creating software, games, or whatever, these iterations are foundational. What these steps look like in detail of course depends on the project itself.

mathattack 11 hours ago | parent | prev | next [-]

I do get concerned when the solution is to be more strict on the waterfall process.

I used to believe there were some worlds in which waterfall is better: where requirements are well known in advance and set in stone. I've since come to realize neither of those assumptions is ever true.

the_duke a day ago | parent | prev | next [-]

The accounting, legal and business process requirements are vastly different at different scales, different jurisdictions, different countries, etc.

There's a crazy amount of complexity and customizability in systems like ERPs for multinational corporations (SAP, Oracle).

When you start with a small town, you'll have to throw almost everything away when moving to a different scale.

That's true for software systems in general. If major requirements are bolted on after the fact, instead of designed into the system from the beginning, you usually end up with an unmaintainable mess.

rossdavidh a day ago | parent [-]

Knowing that the rules for your first small deployment are not the same as the rules for everywhere is valuable for designing well. Trying to implement all of those sets of rules in your initial deployment is not a good idea. There is a general principle that you shouldn't code the abstraction until you've coded the concrete case 2 or 3 times, because otherwise you won't land on the right abstraction. Looking ahead is not the same as starting with the whole enchilada for your initial deployment.
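
A toy illustration of that "code it concretely a few times first" principle; the reporting functions are made up purely to show the shape:

  # Two concrete cases written out first, on purpose...
  def monthly_sales_report(rows):
      total = sum(r["amount"] for r in rows if r["kind"] == "sale")
      return f"Sales: {total}"

  def monthly_refund_report(rows):
      total = sum(r["amount"] for r in rows if r["kind"] == "refund")
      return f"Refunds: {total}"

  # ...and only then the abstraction, once the duplication shows its shape.
  def monthly_report(rows, kind, label):
      total = sum(r["amount"] for r in rows if r["kind"] == kind)
      return f"{label}: {total}"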

Izikiel43 a day ago | parent | prev | next [-]

What works at small scale possibly won't work at a huge scale.

skywhopper a day ago | parent | next [-]

But what hasn’t even been tried at a small scale definitely won’t work at a huge scale.

rossdavidh a day ago | parent | prev [-]

Which is absolutely true, and a reason to try at medium scale second. But what doesn't work at small scale, almost certainly won't work at huge scale.

patrick451 13 hours ago | parent | prev [-]

Imagine if the only way to build a skyscraper was to start with a dollhouse and keep tacking extensions and pieces onto it. Imagine if the only way to build a bridge across San Francisco Bay was to start with popsicle sticks.