18 Lessons From 13 Years of Tricky Bugs
In Learning From Your Bugs, I wrote about how I have been keeping track of the most interesting bugs I have come across. I recently reviewed all 194 entries (going back 13 years) to see what lessons I have learned from them. Here are the most important lessons, split into the categories of coding, testing and debugging.
Coding
These are all issues that have caused difficult bugs for me in the past:
1. Event order. When handling events, it is fruitful to ask the following questions: Can the events arrive in a different order? What if we never receive this event? What if this event happens twice in a row? Even if it would normally never happen, bugs in other parts of the system (or interacting systems) could cause it to happen.
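One way to make these questions concrete is a small table of allowed transitions, so that any event arriving at an unexpected time is recorded rather than acted on. The sketch below is only an illustration: the Call class, its states and the "connect" and "disconnect" event names are all hypothetical, and it does not cover the "never received" case, which would typically need a timeout.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    CONNECTED = auto()
    CLOSED = auto()


# Allowed (state, event) pairs and the state each one leads to.
TRANSITIONS = {
    (State.IDLE, "connect"): State.CONNECTED,
    (State.CONNECTED, "disconnect"): State.CLOSED,
}


class Call:
    """Hypothetical call object driven by 'connect' and 'disconnect' events."""

    def __init__(self):
        self.state = State.IDLE
        self.ignored = []  # events that arrived at an unexpected time

    def handle(self, event):
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            # Duplicate, out-of-order or unknown event: record it, keep the state.
            self.ignored.append((self.state, event))
            return False
        self.state = next_state
        return True


call = Call()
call.handle("disconnect")  # arrives before "connect"
call.handle("connect")
call.handle("connect")     # arrives twice in a row
```

After these three events the call is connected, and the two unexpected events are available in call.ignored for logging or inspection.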
2. Too early. This is a special case of “Event order” above, but it has caused some tricky bugs, so it gets its own category. For example, if signaling messages are received too early, before configuration and start-up procedures are finished, a lot of strange behavior can happen. Another example: when a connection was marked as down
3. Silent failures. Some of the hardest bugs to track down have (in part) been caused by code that silently fails and continues instead of throwing an error. For example, system calls (like bind) that return error codes that aren’t checked. Another example: parsing-code that just returned instead of throwing an error when it encountered a faulty element. The call continued for a while in a faulty state, making the debugging much harder. It is better to return an error as soon as a failure case is detected.
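As a rough sketch of the parsing case, the function below raises as soon as it meets a faulty element instead of returning a half-filled result. The parse_headers function, the ParseError class and the "Name: value" format are made up for this example and are not the original code.

```python
class ParseError(ValueError):
    """Raised as soon as a faulty element is found."""


def parse_headers(lines):
    """Parse 'Name: value' lines into a dict, failing fast on bad input."""
    headers = {}
    for number, line in enumerate(lines, start=1):
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            # Fail loudly here instead of silently returning what we have so far.
            raise ParseError(f"line {number}: expected 'Name: value', got {line!r}")
        headers[name.strip()] = value.strip()
    return headers
```

Because the exception names the offending line, the failure shows up at the point where the bad input was seen, not later in some unrelated part of the call handling.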
4. If. If-statements with several conditions, if (a or b), especially when chained, if (x) else if (y), have caused many bugs for me. Even though if-statements are conceptually simple, they are easy to get wrong when there are multiple conditions to keep track of. These days I try to rewrite the code to be simpler to avoid having to deal with complicated if-statements.
5. Else. Several bugs have been caused by not properly considering what should happen if a condition is false. In almost every case, there should be an else-part for each if-statement. Furthermore, if you set a variable in one branch of an if-statement, you should probably set it in the other as well. Related to this is the case when a flag is set. It is easy to add only the condition for setting the flag, but forget to add the condition for when the flag should be reset again. Leaving a flag set forever will likely lead to bugs down the road.
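The flag case might look like the following sketch, where a hypothetical Link class raises an alarm when the error count reaches a threshold. The class, the alarm flag and the threshold value are assumptions made for illustration.

```python
class Link:
    """Hypothetical link monitor with an 'alarm' flag."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.alarm = False

    def update(self, error_count):
        if error_count >= self.threshold:
            self.alarm = True
        else:
            # Without this branch the alarm would stay set forever.
            self.alarm = False
        return self.alarm
```

In a case this simple the two branches could be collapsed into self.alarm = error_count >= self.threshold, which makes it impossible to forget the reset.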
6. Changing assumptions. Many of the bugs that were the hardest to prevent in the first place were caused by changing assumptions. For example, in the beginning there could only be one customer event per day. Then a lot of code is written under this assumption. At some later point, the design is changed to allow multiple customer events per day. When this happens, it can be hard to change all cases that are affected by the new design. It is easy to find all the explicit dependencies on the change, but the hard part is to find all the cases that implicitly depend on the old design. For example, there may be code that fetches all customer events for a given day. An implicit assumption may be that the result set is never greater than the number of customers. I don’t have a good strategy on how to prevent these problems, so suggestions are welcome.
7. Logging. Visibility into what the program does is crucial, especially when the logic is complicated. Make sure to add enough (but not too much) logging, so you can tell why the program does what it does. When everything works fine, it doesn’t matter, but as soon as (the inevitable) problem happens, you will be happy that you added proper logging.
Testing
As a developer, I am not done with a feature until I have tested it. At a minimum this means that every new or changed line of code has been executed at least once. Furthermore, unit testing or functional testing is good, but not enough. The new feature must also be tested and explored in a production-like environment. Only then can I say that I am done with a feature. Here are some important lessons my bugs taught me about testing:
8. Zero and null. Make sure to always test with zero and null (when applicable). For a string it means both a string of length zero, and a string that is null. Another example: test the disconnection of a TCP connection before any data (zero bytes) was sent on it. Not testing with these combinations is the number one reason for bugs slipping through that I should have caught when testing.
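A small table-driven test makes it harder to forget these cases. The sketch below uses a hypothetical normalize_number helper; the choice to map a null input to an empty string is an assumption made for the example, not a recommendation.

```python
def normalize_number(number):
    """Hypothetical helper: strip spaces and dashes from a phone number."""
    if number is None:
        return ""
    return number.replace(" ", "").replace("-", "")


def test_normalize_number():
    cases = {
        "555-01 23": "5550123",  # ordinary value
        "": "",                  # string of length zero
        None: "",                # null
    }
    for given, expected in cases.items():
        assert normalize_number(given) == expected, repr(given)


test_normalize_number()
```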
9. Add and remove. Often new features involve being able to add new configurations to the system, for example a new profile for phone number translation. It is very natural to test that it works to add a new profile. However, I have found that it is easy to forget to test the removal of the profile as well.
10. Error handling. The code that handles errors is often hard to test. It’s best to have automatic tests that check the error handling code, but sometimes that is not possible. One trick I sometimes use then is to modify the code temporarily to cause the error handling code to run. The easiest way to do this is to reverse an if-statement, for example flipping it from if error_count > 0 to if error_count == 0. Another example is misspelling a database column name to cause the desired error handling code to run.
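Where an automatic test is possible, one option in Python is to inject the failure with unittest.mock instead of editing the code. This is a sketch of a different technique from the temporary modification described above; save_profile and the db.insert call are hypothetical names.

```python
from unittest import mock


def save_profile(db, profile):
    """Hypothetical function: store a profile, report failure instead of crashing."""
    try:
        db.insert("profiles", profile)
    except OSError as exc:
        return f"could not save profile: {exc}"
    return "ok"


# Force the error path without touching the production code.
failing_db = mock.Mock()
failing_db.insert.side_effect = OSError("disk full")
result = save_profile(failing_db, {"name": "default"})
```

Here result holds the error message produced by the except branch, so the error handling code has been executed at least once.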
11. Random input. One way of testing that can often reveal bugs is to use random input. For example, the ASN.1 decoding of the H.323 protocol operates on binary data. By sending in random bytes to be decoded, we found several bugs in the decoder. Another example is to generate scripts with test calls, where the call duration, answer delay, first party to hang up and so on were all randomly generated. These test scripts exposed numerous bugs, particularly where there was interference from events happening close together.
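The random-bytes idea can be sketched in a few lines. The decoder below is a toy tag-length-value format invented for this example, not ASN.1 or H.323; the point is only the fuzz loop, which treats a clean ValueError as acceptable and any other exception as a decoder bug.

```python
import random


def decode_tlv(data):
    """Hypothetical decoder: a sequence of (tag, length, value) records."""
    records = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated header")
        tag, length = data[pos], data[pos + 1]
        value = data[pos + 2 : pos + 2 + length]
        if len(value) != length:
            raise ValueError("truncated value")
        records.append((tag, value))
        pos += 2 + length
    return records


def fuzz(decoder, rounds=1000, seed=0):
    """Feed random bytes to the decoder; only ValueError is an accepted failure."""
    rng = random.Random(seed)  # fixed seed so a failing input can be reproduced
    crashes = []
    for _ in range(rounds):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            decoder(data)
        except ValueError:
            pass  # clean rejection of malformed input
        except Exception as exc:  # anything else is a decoder bug
            crashes.append((data, exc))
    return crashes


crashes = fuzz(decode_tlv)
```

Keeping the failing input together with the exception in crashes means each finding can be turned directly into a regression test.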
12. Check what shouldn’t happen. Often testing involves checking that a desired action happened. But it is easy to overlook the opposite case – to check that an action that shouldn’t happen actually didn’t happen.
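With a mock object, the negative check is a single extra assertion. In this sketch, close_account and the mailer.send call are hypothetical; assert_called_once and assert_not_called are standard unittest.mock methods.

```python
from unittest import mock


def close_account(account, mailer):
    """Hypothetical function: notify the owner only if the account was active."""
    if account["active"]:
        mailer.send(account["email"], "Your account was closed")
    account["active"] = False


mailer = mock.Mock()
close_account({"active": True, "email": "a@example.com"}, mailer)
mailer.send.assert_called_once()  # the desired action happened

mailer = mock.Mock()
close_account({"active": False, "email": "b@example.com"}, mailer)
mailer.send.assert_not_called()   # and the unwanted one did not
```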
13. Own tools. Usually I have created my own small tools to make testing easier. For example, when I worked with the SIP protocol for VoIP, I wrote a small script that could reply with exactly the headers and values I wanted. That tool made testing a lot of corner cases easy. Another example is a command line tool that can make API calls. By starting small, and gradually adding features as needed, I have ended up with very useful tools. The advantage of writing my own tools is that I get exactly what I want.
It is never possible to find all bugs in testing though. In one case, I made a change to the handling of correlation numbers that consisted of two parts: the routing address prefix (always the same), and the dynamically allocated number from 000 to 999. The problem was that when finding the correlation, the first digit of the dynamically allocated number was mistakenly removed before looking in the table. So instead of looking for e.g. 637, you were looking for 37, which wasn’t in the table. This means that it worked up until 100, so the first 100 calls worked, then all the 900 following failed. So unless I tested more than 100 times before restarting (which I didn’t), I would not find this problem when testing.
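The effect can be reproduced with a toy model. This sketch assumes the table is keyed by the integer value of the number, which is one plausible reason why 000 to 099 still worked (dropping a leading zero does not change the value); the actual data structure is not described above, and all function names are hypothetical.

```python
def find_buggy(table, number):
    """Buggy lookup: drops the first digit before searching the table."""
    return table.get(int(number[1:]))


def find_fixed(table, number):
    """Correct lookup: uses all three digits."""
    return table.get(int(number))


def run_calls(find, count):
    """Simulate `count` calls after a restart and return the numbers that failed."""
    failed = []
    for n in range(count):
        number = f"{n:03d}"        # dynamically allocated part, "000" to "999"
        table = {n: f"call-{n}"}   # assumption: only the current call is in the table
        if find(table, number) != f"call-{n}":
            failed.append(n)
    return failed


short_test = run_calls(find_buggy, 50)    # a typical short test session
full_cycle = run_calls(find_buggy, 1000)  # every number from 000 to 999
```

In this model short_test comes back empty while full_cycle contains every number from 100 to 999, which matches the behavior described: a test session with fewer than 100 calls after a restart cannot reveal the problem.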
Debugging
14. Discuss. The debugging technique that has helped me the most in the past is to discuss the problem with a colleague. Often it is enough to simply describe the problem to a co-worker for me to realize what the problem is. Furthermore, even if they are not very familiar with the code in question, they can often come up with good ideas of what could be wrong anyway. Discussing with a co-worker has been especially effective with my most difficult bugs.
15. Pay close attention. Often when debugging a problem took a long time, it was because I made false assumptions. For example, I thought the problem happened in a certain method when in fact it never even got to that method in the first place. Or the exception that was thrown wasn’t the one I assumed it was. Or I thought the latest version of the software was running, but it was an older version. Therefore, be sure to verify the details instead of assuming. It’s easy to see what you expect to see, instead of what is actually there.
16. Most recent change. When things that used to work stop working, it is often caused by the last thing that was changed. In one case, the most recent thing changed was just the logging, but an error in the logging caused a bigger problem. To make regressions like this easier to find, it helps to commit different changes in different commits, and to use clear descriptions of the changes.
17. Believe the user. Sometimes when a user reports a problem, my instinctive reaction is: “That’s impossible. They must have done something wrong.” But I have learnt not to react that way. More times than I would like, it turns out that what they report is what actually happens. So these days, I take what they report at face value. Of course I still double check that everything has been set correctly etc. But I have seen so many cases where weird things happened because of unusual configuration or unanticipated usage, that my default assumption is that they are correct and the program is wrong.
18. Test the fix. When a fix for a bug is ready, it must be tested. First run the code without the fix, and observe the bug. Then apply the fix and repeat the test case. Now the buggy behavior should be gone. Following these steps makes sure it actually is a bug, and that the fix actually fixes the problem. Simple but necessary.
Other observations
Over the 13 years that I have been keeping track of the trickiest bugs I have encountered, a lot of things have changed. I have worked on a small embedded system, on a large telecom system and on a web-based system. I have worked in C++, Ruby, Java and Python. Several classes of bugs from my C++ days have simply disappeared, like stack overflows, memory corruption, string problems and some forms of memory leaks.
Other problems, like loop errors and corner cases, I see far fewer of because I have been unit-testing more logic. But that doesn’t mean there aren’t bugs – there still are. The lessons in this post help me to limit the damage at the three stages of coding, testing and debugging. Let me know in the comments what other tricks and techniques you have found useful when preventing or finding bugs.