Debugging Strategies

From Wayne's Dusty Box of Words

Several years ago I came across the book Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems by David J. Agans. The book really resonated with me — someone had found a clear and concise way to summarize many of the skills that I have refined over the years to become good at debugging. The book also helped me reflect on what went wrong, and what I could have done better in some past debugging experiences.

Our local chapter of the IEEE (Northeast Wisconsin) has recently started a program in which engineers present training sessions to other engineers for professional development hour (PDH) credit. I was the first to present and chose this topic. My presentation focused on the 9 rules from Agans and on connecting my own personal experiences to those rules. There was also plenty of time for the audience to relate their own debugging war stories, which was my favorite part.

In developing the presentation I had to recall as many debugging war stories from my past as possible. These included examples from work, but also from home. Here are some of the examples I used to illustrate each of the rules:

Rule 1 – Understand the system

I was working on a system with an LCD screen. When we turned it on, the image was basically correct, but the colors were all wrong. We spent many weeks trying driver settings and color space corrections, complaining to the LCD vendor, etc. In the end, the cause turned out to be a register in the video driver that controlled the mapping of the pixel bits to the bit positions in the multiplexed LVDS serial stream. There were multiple possible settings for this mapping. The key is that before you can start debugging you need to fully understand the details of the system, including architecture, part datasheets, and even possible register settings. If we had understood the system, this problem would have been easy to find.

Rule 2 – Make it fail

I was working on an Ethernet network adapter card with network processing offload capabilities. The card would periodically reset for no apparent reason. Of course, the problem never happened when a hardware engineer was around. The key to solving this problem was that a software engineer noticed that, on one particular board, it sometimes happened during the memory test at boot-up. The hardware team suggested changing the memory test pattern to one that would be more stressful from a noise standpoint. When this was done, this particular board would sometimes enter a reset loop: every time it got to a particular part of the memory test it would reset, attempt to boot up again, and reset again at the same point. Residual noise was enough to push this board over the limit over and over, and it was easy to capture on the scope. Once we made it fail and could see what was happening, the problem was easy to find and fix.
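To give a feel for what "more stressful from a noise standpoint" can mean, here is a small Python sketch. It is not the test we ran on the card: the 32-bit bus width, the two patterns, and the function names are all assumptions for illustration. It simply counts how many data lines change state between consecutive writes, since a pattern that flips every line at once generally produces more simultaneous switching noise than one that flips only a few.

```python
# Hypothetical sketch: compare how many data lines toggle per write for two
# memory test patterns. The 32-bit bus width is an assumption.

WORD_MASK = 0xFFFFFFFF


def counting_pattern(n_words):
    """Gentle pattern: each word is its own index, so few bits change per step."""
    return [i & WORD_MASK for i in range(n_words)]


def alternating_pattern(n_words):
    """Stressful pattern: all-zeros then all-ones, so every line toggles."""
    return [0 if i % 2 == 0 else WORD_MASK for i in range(n_words)]


def toggles_per_write(words):
    """Average number of bus lines that change state between consecutive words."""
    flips = [bin(a ^ b).count("1") for a, b in zip(words, words[1:])]
    return sum(flips) / len(flips)


if __name__ == "__main__":
    n = 1024
    print("counting:   ", toggles_per_write(counting_pattern(n)))
    print("alternating:", toggles_per_write(alternating_pattern(n)))
```

With these made-up patterns, the counting pattern averages just under 2 toggling lines per write, while the alternating pattern toggles all 32 on every write.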

Rule 3 – Quit thinking and look

Engineers have a strong tendency to sit around and theorize about what is happening and try fixes blindly. I can think of many times when this behavior was reinforced because the engineer essentially got lucky, making them more likely to act this way in the future. In one example, we had a system that used a vendor IP block to interface a CAN bus to a computer platform through a PCI bus. The drivers wouldn’t load. We spent many weeks trying different things in software on the computer, reading datasheets, complaining to the vendor, and coming up with theory after theory about what could be wrong. Finally, the vendor suggested we bring the design to their office in Germany to debug. Their engineer hooked up a scope to the memory bus over which the driver was writing the firmware that bootstraps the IP block. They didn’t have a logic analyzer, so he watched the data one bit at a time. Finally, at bit 29 he saw that something was wrong. An informed look at the schematic showed that bits 29 and 30 were swapped. This schematic had been reviewed at least 3 times by different engineers, and no one had noticed this subtle error. It jumped off the page once we knew. It turned out that there was another problem with the design: an EEPROM holding programmed settings had its bytes programmed in the wrong order. This too was easy to see with the scope as the data was read out of the EEPROM. It’s a pain to hook up the scope or logic analyzer, and it takes time. Generally though, if you take the time to look at the problem and collect some objective evidence, you’ll get to the answer much faster than if you think and try things.
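If you can capture the data, even a few lines of code can do the "looking" for you. The Python sketch below is only an illustration, assuming you have a list of the words the driver meant to write and a list of the words actually observed on the bus. The function names and the firmware words are made up; only the swap of bits 29 and 30 comes from the story.

```python
# Hypothetical sketch: find data lines that look swapped by comparing the words
# the driver meant to write with the words observed on the bus.

def swap_bits(word, i, j):
    """Return word with bit positions i and j exchanged (models the wiring error)."""
    if ((word >> i) & 1) != ((word >> j) & 1):
        word ^= (1 << i) | (1 << j)
    return word


def find_swapped_lines(expected, observed, width=32):
    """Return (i, j) pairs where observed bit i carries expected bit j and vice versa."""
    def column(words, bit):
        # The value of one data line across every captured word.
        return tuple((w >> bit) & 1 for w in words)

    exp_cols = [column(expected, b) for b in range(width)]
    obs_cols = [column(observed, b) for b in range(width)]
    wrong = [b for b in range(width) if exp_cols[b] != obs_cols[b]]
    pairs = []
    for idx, i in enumerate(wrong):
        for j in wrong[idx + 1:]:
            if obs_cols[i] == exp_cols[j] and obs_cols[j] == exp_cols[i]:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # Made-up firmware words, passed through a simulated swap of bits 29 and 30.
    firmware = [0x20000001, 0x4000BEEF, 0x12345678, 0xDEADBEEF, 0x60000000]
    on_the_bus = [swap_bits(w, 29, 30) for w in firmware]
    print(find_swapped_lines(firmware, on_the_bus))  # [(29, 30)]
```

Words in which the two bits happen to be equal pass through a swap unchanged, so the capture has to include enough varied data for the swap to show up.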

Rule 4 – Divide and conquer

The key here is to start at the point where the problem shows up and work your way backward to find where the design first breaks. You do this by binary searching. My example here was an audio system that played two MP3 files through different speakers, and they were required to stay in sync within 10 ms. The 20-minute-long MP3 files were off by many seconds by the end. There were many places in the chain where the slip could have occurred. The key to finding the problem was first capturing the slipping behavior at the analog speaker output while playing a sine wave audio file. Then we worked backward to the digital data stream and eventually isolated the problem to the processors doing the decoding. One of the processors had a DMA ISR that sometimes took too long, so data was not being sent out correctly.
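The binary search idea can be sketched in a few lines of Python. This is a minimal illustration, not what we actually ran: the stage names and the probe function are hypothetical, and it assumes that once the signal is bad at one stage it stays bad at every stage downstream.

```python
# Hypothetical sketch: binary-search a signal chain for the first stage whose
# output is bad. Stage names and the probe function are made up.

def first_bad_stage(stages, output_is_good):
    """Return the first stage whose output fails the check, or None if all pass.

    Assumes the fault propagates downstream: once one stage's output is bad,
    every later stage's output is bad too.
    """
    lo, hi = 0, len(stages) - 1
    if output_is_good(stages[hi]):
        return None  # the end of the chain looks fine, so there is nothing to find
    while lo < hi:
        mid = (lo + hi) // 2
        if output_is_good(stages[mid]):
            lo = mid + 1  # the fault is downstream of mid
        else:
            hi = mid      # the fault is at mid or upstream of it
    return stages[lo]


if __name__ == "__main__":
    chain = ["file read", "MP3 decode", "DMA to audio codec",
             "DAC", "amplifier", "speaker"]
    faulty = "DMA to audio codec"  # pretend this is where the slip is introduced

    def probe(stage):
        # Stand-in for hooking up a scope or logic analyzer at this stage.
        return chain.index(stage) < chain.index(faulty)

    print(first_bad_stage(chain, probe))  # DMA to audio codec
```

Each probe is the expensive part, since it means hooking up an instrument and capturing data, and halving the chain each time keeps the number of probes small.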

Rule 5 – Change one thing at a time
Rule 6 – Keep an audit trail

These rules are tightly related. A good example is debugging an EMC radiated emissions problem. I first determine the path that the emissions are escaping through (cable, chassis, etc.) and then determine the source of the noise. As I go through this process, I change one thing at a time and back out the changes that don’t have any effect until I know the source and path. At every step, I keep a log of what I’ve done and the results. OneNote is a great tool for this kind of logging: I use a notebook page and indent the entries to show which experiments belong under each branch of the investigation.
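The same indented log is easy to keep in code if you would rather script it. The Python sketch below is just one possible shape for it, with made-up field names and made-up example entries. Each entry records a single change, what was observed, and whether the change was kept or backed out, and follow-up experiments nest under the entry they grew out of.

```python
# Hypothetical sketch of an indented audit trail for one-change-at-a-time debugging.
from dataclasses import dataclass, field


@dataclass
class Entry:
    change: str   # the single thing that was changed
    result: str   # what was observed afterwards
    kept: bool    # False means the change was backed out
    children: list = field(default_factory=list)

    def add(self, change, result, kept):
        """Record a follow-up experiment nested under this one."""
        child = Entry(change, result, kept)
        self.children.append(child)
        return child

    def render(self, depth=0):
        """Return the trail as indented text lines, one experiment per line."""
        status = "kept" if self.kept else "backed out"
        lines = [f"{'  ' * depth}- {self.change}: {self.result} ({status})"]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


if __name__ == "__main__":
    # Invented entries, only to show the layout.
    root = Entry("baseline scan", "emissions over the limit", True)
    cable = root.add("unplug I/O cable", "emissions drop", True)
    cable.add("add ferrite to I/O cable", "no change", False)
    root.add("tape chassis seams", "no change", False)
    print("\n".join(root.render()))
```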

Rule 7 – Check the plug

Many puzzling problems have very simple answers, like “I forgot to plug it in.” My example here is the time I noticed that our outside pole lamp was not working. Now, I had installed a long-life compact fluorescent bulb in this lamp, and it was fairly new, so there was no way it could be burned out yet. Also, the lamp is controlled by a fancy digital timer in the wall switch plate, and that had failed before and had to be replaced. So naturally, being the engineer type, I went to the store, bought a new timer, and replaced it. Nothing happened. Then I remembered that when I had installed the pole lamp in the yard, I had an issue with the underground wiring. There was a short to earth in the cable and it wasn’t working. I had to dig up the cable to repair it. So next I headed for the garage to get the shovel. Before I could start digging, my wife, the voice of reason, suggested that maybe I should check the bulb. Begrudgingly I did. The problem was that the bulb had worked loose by a quarter turn. So I was right that it wasn’t burned out, but I missed a simple problem that would have been easy to fix without much work if I had checked my assumptions more carefully up front. Always be aware of your assumptions when debugging, especially the most basic ones, and check them out before getting too deep.

Rule 8 – Get a fresh view

Sometimes the best thing to do is get an unbiased opinion from an expert or outsider. The key is not to taint them with your own theories of what is wrong, but to give them only the facts. Over the years I’ve also found great value in just explaining problems to someone who isn’t an expert in the subject matter. They often have great insights because they can separate the problem from all the peripheral details. At a minimum, they can listen and reflect, and help you clarify your own understanding. My example here is a GPS heart rate monitor that wouldn’t turn on. It didn’t do anything when I pressed its buttons, and no charging icon showed up when I plugged it into the USB port. I was ready to get a new one because obviously there was some catastrophic hardware failure in the power supply. My father-in-law did some quick digging in the troubleshooting section of the manufacturer’s web site. It turned out that the software in the monitor was locked up, and the solution was to initiate a soft reset by holding down a combination of keys. He didn’t know anything about the inner workings of the hardware or software in the device, but he did know that when something is broken, you can first look to user forums or manufacturer instructions to figure it out.

Rule 9 – If you didn’t fix it, it ain’t fixed

This seems obvious, but the temptation is to ignore intermittent problems that you can’t reproduce. Just because the problem goes away doesn’t mean that the root cause has been addressed, and it will likely come back. An example here is a peristaltic pump design that I worked on. It had to run at motor speeds as low as 10 RPM. The instantaneous motion was not very well controlled; however, the average speed over a minute was close because there was an integrator in the control loop. Sometimes, though, the error was quite high. We tried numerous mechanical fixes that sometimes made things better, but until we addressed the root problem of the unstable control loop, the average speed error would always fluctuate under certain conditions.

Those are the rules. It is a great book and a quick read. It will either show you the way to debug or reinforce what you already know. I highly recommend getting your own copy.

Source: https://macroware.wordpress.com/2014/03/29/debugging-strategies-techniques/