JoeAltmaier 4 days ago

We wrote a conferencing app and server (years before Zoom). Tested the server by having automated headless apps run in gangs, a hundred at a time, hopping from conversation to conversation, turning mic and camera on and off, logging out and logging back in. Used it for years, the Bot Army we called it. Responsible for our rock-solid quality reputation. Not API design or test classes or constraints or anything. Just, trying the damn thing, in large cases, for a long time.

When it ran an hour, we celebrated. When it ran overnight, we celebrated. When it ran a week we celebrated, and called that good enough.

rossant 4 days ago

How much work was it to go from 1 hour to 1 week? How many issues did you discover, and what were they? Genuinely interested.

JoeAltmaier 4 days ago

It took a big fraction of our energy to get to each new stage. Always it was something new and unexpected. It was quite a while ago, but let's see what I remember.

A good fraction were mismatches between the app (bot) state and the server state. A bot would be expecting a message and stall; the server thought it had said enough.
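
A minimal sketch of how a bot might guard against that kind of stall (the class and method names here are hypothetical, not from the actual bot code): wait for the expected server message with a timeout, so a protocol mismatch surfaces as a reported failure instead of a silent hang.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical bot-side guard: one future per expected server message.
class ExpectedMessage {
    private final CompletableFuture<String> reply = new CompletableFuture<>();

    // Called from the bot's receive loop when the message arrives.
    void deliver(String msg) { reply.complete(msg); }

    // Blocks up to timeoutMillis; a missing message becomes a visible
    // failure rather than a stalled bot.
    String awaitOrFail(long timeoutMillis) {
        try {
            return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException(
                "bot stalled: server never sent the expected message");
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }
}

public class StallGuardDemo {
    public static void main(String[] args) {
        ExpectedMessage expect = new ExpectedMessage();
        expect.deliver("JOIN_ACK");               // server answered in time
        System.out.println(expect.awaitOrFail(100));
    }
}
```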

The app side used a lot of libraries, which it turns out are never as robust as advertised. They leak, race, are very particular about call order. Have no sense of humor if they're still connecting and a disconnect call is made, for instance.

The open source server components were fragile. In one instance, the database consistency library had an update where, for performance, a success message was returned before the operation upstream was complete. Which broke, utterly, the consistency promise that was the entire point of using that product.

A popular message library created a timer on each instantiation. Cancelled it, but in typical Java fashion didn't unlink it. So, leak. Tiny, but you do it enough times, even the biggest server instance runs out of memory.
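
That "cancelled but not unlinked" pattern is easy to reproduce with the standard `java.util.Timer`: a cancelled `TimerTask` stays referenced by the timer's internal queue until its scheduled time passes or `purge()` is called. A small sketch (not the actual library's code) showing the leak and the fix:

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerLeakDemo {
    public static void main(String[] args) {
        Timer timer = new Timer(true); // daemon thread
        // Schedule tasks far in the future, then cancel them. Each
        // cancelled task is still referenced by the timer's queue.
        for (int i = 0; i < 10_000; i++) {
            TimerTask task = new TimerTask() {
                @Override public void run() { /* never runs */ }
            };
            timer.schedule(task, 86_400_000L); // 24 hours out
            task.cancel(); // cancelled, but not unlinked from the queue
        }
        // purge() is the missing "unlink": it removes cancelled tasks
        // and returns how many it dropped.
        int removed = timer.purge();
        System.out.println("purged " + removed + " cancelled tasks");
        // prints: purged 10000 cancelled tasks
        timer.cancel();
    }
}
```

Without the `purge()` call, each instantiate-and-cancel cycle retains a dead task, exactly the tiny-but-unbounded growth described above.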

We ran bots on Windows, Linux, even a Mac. Their network libraries had wildly different socket support. We'd run out of sockets! They got garbage collected after a time, but that timeout could be enormous (minutes).

Our server used a message-distribution component to 'shard' messages. It had a hard limit on message dispatching per second. I had to aggregate the client app messages (we used UDP and a proprietary signaling protocol) to drop the message rate (ethernet packet rate) by an order of magnitude. Added a millisecond of latency, which was actually important and another problem.
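
The aggregation step can be sketched roughly like this (the class and parameter names are illustrative, not the original protocol code): coalesce small signaling messages into one datagram-sized batch, flushing when the batch fills or a ~1 ms deadline passes, trading that millisecond of latency for a tenth of the packet rate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical aggregator for small signaling messages. Caller supplies
// a monotonic clock reading (nanos) so the flush deadline is testable.
class MessageAggregator {
    private final int maxBatch;
    private final long maxDelayNanos;
    private final List<byte[]> pending = new ArrayList<>();
    private long firstEnqueuedAt;

    MessageAggregator(int maxBatch, long maxDelayMillis) {
        this.maxBatch = maxBatch;
        this.maxDelayNanos = maxDelayMillis * 1_000_000L;
    }

    // Returns a batch ready to send as one datagram, or null if we
    // should keep accumulating.
    List<byte[]> offer(byte[] msg, long nowNanos) {
        if (pending.isEmpty()) firstEnqueuedAt = nowNanos;
        pending.add(msg);
        boolean full  = pending.size() >= maxBatch;
        boolean stale = nowNanos - firstEnqueuedAt >= maxDelayNanos;
        if (full || stale) {
            List<byte[]> batch = new ArrayList<>(pending);
            pending.clear();
            return batch;
        }
        return null; // keep accumulating; costs latency, saves packets
    }
}
```

With `maxBatch = 10`, ten client messages go out as a single packet, dropping the ethernet packet rate by the order of magnitude mentioned above, at the cost of up to `maxDelayMillis` of added latency.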

Add the usual Java null pointers, order-dependent service termination rules (never documented), object lifetime surprises. It went on and on.

Each doubling of survival time, the issues got more arcane and more interesting. Sometimes it took a new tool or technique to ferret out the problem.

To be honest, I was in hog heaven. Kept my brain plastic for a long time.

rossant 3 days ago

Wow, really interesting write-up, thank you! It really proves the immense value of this kind of automated, realistic stress test.

yakshaving_jgt 4 days ago

As effective as that sounds, having that integrated test suite didn’t preclude you from also having more granular isolated tests.

JoeAltmaier 4 days ago

Sure, in a generous world with lots of resources. Given the startup environment and the overworked team, it's a choice how to spend limited time and energy.