JoeAltmaier 4 days ago
It took a big fraction of our energy to get to each new stage. Always it was something new and unexpected. It was quite a while ago, but let's see what I remember.
A good fraction were mismatches between the app (bot) state and the server state. A bot would be expecting a message and stall. The server thought it had said enough.
The app side used a lot of libraries, which it turns out are never as robust as advertised. They leak, race, and are very particular about call order. They have no sense of humor if they're still connecting and a disconnect call is made, for instance.
The open source server components were fragile. In one instance, the database consistency library had an update where, for performance, a success message was returned before the operation upstream was complete. Which broke, utterly, the consistency promise that was the entire point of using that product.
A popular message library created a timer on each instantiation. It cancelled the timer but, in typical Java fashion, didn't unlink it. So, leak. Tiny, but do it enough times and even the biggest server instance runs out of memory.
We ran bots on Windows, Linux, even a Mac. Their network libraries had wildly different socket support. We'd run out of sockets! They got garbage collected after a time, but the timer could be enormous (minutes).
Our server used a message-distribution component to 'shard' messages. It had a hard limit on messages dispatched per second. I had to aggregate the client app messages (we used UDP and a proprietary signaling protocol) to drop the message rate (Ethernet packet rate) by an order of magnitude. That added a millisecond of latency, which was actually important and another problem.
Add the usual Java null pointers, order-dependent service termination rules (never documented), object lifetime surprises. It went on and on. With each doubling of survival time the issues got more arcane and more interesting. Sometimes it took a new tool or technique to ferret out the problem.
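The cancelled-but-not-unlinked timer leak is easy to picture with a toy model. This is only a sketch in Python, not the Java library from the story: `Timer`, `LeakyScheduler` and `FixedScheduler` are hypothetical names made up for illustration, and the real library's internals are not known here.

```python
class Timer:
    """A hypothetical timer handle; only tracks whether it was cancelled."""

    def __init__(self, callback):
        self.callback = callback
        self.cancelled = False


class LeakyScheduler:
    """Cancel only flags the timer; the scheduler keeps referencing it."""

    def __init__(self):
        self.timers = set()

    def schedule(self, callback):
        timer = Timer(callback)
        self.timers.add(timer)
        return timer

    def cancel(self, timer):
        # Bug: the timer stays in self.timers, so it is never collected.
        timer.cancelled = True


class FixedScheduler(LeakyScheduler):
    """Same scheduler, but cancel also unlinks the timer."""

    def cancel(self, timer):
        timer.cancelled = True
        # Drop the last long-lived reference so the timer can be freed.
        self.timers.discard(timer)
```

Creating and cancelling one timer per message leaves `LeakyScheduler.timers` growing without bound, while `FixedScheduler.timers` stays empty. On the Java side, `java.util.Timer.purge()` and `ScheduledThreadPoolExecutor.setRemoveOnCancelPolicy(true)` exist for this kind of cleanup; whether either applies to the library in the story is an open question.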
To be honest, I was in hog heaven. Kept my brain plastic for a long time.
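For anyone curious what the message aggregation might look like, here is a minimal sketch, assuming length-prefixed framing and a 1 ms delay cap. The actual signaling protocol was proprietary, so the `Batcher` class, the `unpack` helper, the 2-byte length prefix and the 1200-byte payload limit are all assumptions made for illustration.

```python
import struct

MAX_PAYLOAD = 1200  # assumed UDP payload budget, below a typical 1500-byte MTU
MAX_DELAY = 0.001   # assumed cap on added latency, in seconds


class Batcher:
    """Packs small messages into fewer datagrams (hypothetical helper)."""

    def __init__(self, max_payload=MAX_PAYLOAD, max_delay=MAX_DELAY):
        self.max_payload = max_payload
        self.max_delay = max_delay
        self.buf = bytearray()
        self.first_at = None  # arrival time of the oldest buffered message

    def add(self, msg, now):
        """Buffer one message; return a list of datagrams ready to send."""
        framed = struct.pack("!H", len(msg)) + msg  # 2-byte length prefix
        if len(framed) > self.max_payload:
            raise ValueError("message too large for one datagram")
        out = []
        # Flush first if this message would overflow the datagram.
        if self.buf and len(self.buf) + len(framed) > self.max_payload:
            out.append(self.flush())
        if not self.buf:
            self.first_at = now
        self.buf += framed
        # Flush if the oldest buffered message has waited long enough.
        if now - self.first_at >= self.max_delay:
            out.append(self.flush())
        return out

    def flush(self):
        """Return the buffered bytes as one datagram and reset the buffer."""
        data = bytes(self.buf)
        self.buf.clear()
        self.first_at = None
        return data


def unpack(datagram):
    """Split a batched datagram back into the original messages."""
    msgs, i = [], 0
    while i < len(datagram):
        (n,) = struct.unpack_from("!H", datagram, i)
        msgs.append(datagram[i + 2 : i + 2 + n])
        i += 2 + n
    return msgs
```

The sketch only checks the delay when a new message arrives, so a real sender would also need a timer to call `flush()` on an idle buffer. That timer is where the extra millisecond of latency comes from.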
rossant 3 days ago
Wow, really interesting write-up, thank you! It really proves the immense value of this kind of automated, realistic stress test.