com2kid 13 hours ago

My team once encountered a bug that was due to a supplier misstating the delay timing needed for a memory chip.

The timings we had in place worked for most chips, but they failed for a small % of chips in the field. The failure was always exactly identical, with the same memory address corrupted every time, so it looked exactly like an invalid pointer access.

It took multiple engineers months of investigating to finally track down the root cause.

triyambakam 12 hours ago | parent [-]

But what was the original estimate? And even so I'm not saying it must be completely and always correct. I'm saying it seems wild to have no starting point, to simply give up.

com2kid 11 hours ago | parent | next [-]

Have you ever fixed random memory corruption in an OS without memory protection?

Best case you trap on memory access to an address if your debugger supports it (ours didn't). Worst case you go through every pointer that is known to access nearby memory and go over the code very very carefully.

Of course it doesn't have to be a nearby pointer, it can be any pointer anywhere in the code base causing the problem, you just hope it is a nearby pointer because the alternative is a needle in a haystack.

I forget how we did find the root cause, I think someone may have just guessed bit flip in a pointer (vs overrun) and then un-bit-flipped every one of the possible bits one by one (not that many, only a few MB of memory so not many active bits for pointers...) and seen what was nearby (figuring what the originally intended address of the pointer was) and started investigating what pointer it was originally supposed to be.
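The bit-flip search described above can be sketched roughly as follows. This is only an illustration, not the code the team used: the symbol table, addresses, and function names are all hypothetical, and the 22-bit address width is an assumption standing in for "a few MB of memory". It relies on the allocations being static, so a linker map can tell you which object owns each address.

```python
# Hypothetical static allocation map, as might be read from a linker map file:
# name -> (start address, size in bytes). All values are invented.
SYMBOLS = {
    "sensor_buffer": (0x00102000, 0x400),
    "display_state": (0x00102400, 0x40),
    "ble_tx_queue": (0x00103000, 0x200),
}


def single_bit_flips(addr, addr_bits):
    """Return (bit, address) pairs that differ from addr in exactly one bit."""
    return [(bit, addr ^ (1 << bit)) for bit in range(addr_bits)]


def owning_symbol(addr, symbols):
    """Return the name of the static object containing addr, or None."""
    for name, (start, size) in symbols.items():
        if start <= addr < start + size:
            return name
    return None


def explain_corruption(corrupted_addr, symbols, addr_bits=22):
    """List the single-bit flips that map the corrupted address back
    into a known object. addr_bits=22 assumes about 4 MB of address space."""
    hits = []
    for bit, candidate in single_bit_flips(corrupted_addr, addr_bits):
        name = owning_symbol(candidate, symbols)
        if name is not None:
            hits.append((bit, candidate, name))
    return hits


if __name__ == "__main__":
    for bit, addr, name in explain_corruption(0x00122010, SYMBOLS):
        print(f"bit {bit}: 0x{addr:08X} lies inside {name}")
```

With the invented table above, only flipping bit 17 lands inside a known object, which narrows the search to whatever code writes through pointers into that object.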

Then after confirming it was a bit flip you have to figure out why the hell a subset of your devices are reliably seeing the exact same bit flipped, once every few days.

So to answer your question, you get a bug (memory is being corrupted), you do an initial investigation, and then provide an estimate. That estimate can very well be "no way to tell".

The principal engineer on this particular project (Microsoft Band) had a strict 0 user impacting bugs rule. Accordingly, after one of my guys spent a couple weeks investigating, the principal engineer assigned one of the top firmware engineers in the world to track down this one bug and fix it. It took over a month.

snovv_crash 10 hours ago | parent [-]

This is why a test suite and mock application running on the host is so important. Tools like valgrind can be used to validate that you won't have any memory errors once you deploy to the platform that doesn't have protections against invalid accesses.

It wouldn't have caught your issue in this case. But it would have eliminated a huge part of the search space your embedded engineers had to explore while hunting down the bug.

com2kid an hour ago | parent [-]

Custom OS, cross compiling from Windows, using Arm's old C compiler so tools like valgrind weren't available to us.

Since it was embedded, no malloc. Everything being static allocations made the search possible in the first place.

This wasn't the only HW bug we found, ugh.

pyrale 9 hours ago | parent | prev [-]

There is a divide in this job between people who can always provide an estimate but accept that it is sometimes wrong, and people who would prefer not to give an estimate because they know it’s more guess than analysis.

You seem to be in the first club, and the other poster in the second.