In the this lecture, we will carry on our discussion on the art of debugging.
The development of good code usually follows the design methodology we have discussed in class: first you understand the customers needs and captures these in the requirements of the system; then the system design starts. The design spec. and implementation spec follow capturing the data flow, IO, important data structure and algorithm design and finally the pseudo code for the system. After all that: which I typically think of as the thinking about the problem and using abstraction (data structure design), comes the coding phase. Once you have written code you need to test it. Yes, I know this is boring! But unit testing, sub-system testing, system testing and finally integration testing are a really important part of the life cycle of good software development. There is a tension to consider when testing - you want a set of tools that help you quickly find the bugs, but you don’t want to feel this is some sort of worthless tax strategy imposed by the company you work for to get code out of the door to the customer. So once you have coded your functions, units, modules, sub-system you need to test it and find the bugs - they will exist! Once you have a tested system and your system fails - maybe a bug not uncovered during testing - then you need to debug. In fact solving problems found while testing also requires a set of debugging tools and strategies.
In the next lecture we will discuss testing and in this one we discuss the tools and strategies you need to debug and make your code bug free - or your money back. The great computer scientist Edsger Dijkstra once said that testing can demonstrate the presence of bugs but not their absence. We will come back to that in the next lecture. To some degree design, testing, and debugging - in my opinion - are not so stepwise in terms of a known formula as professed by many of the software engineering books - they require some experience, nose, detective work; there is indeed an art to design, testing, and debugging.
In this lecture on debugging we draw from “Linux Programming” by Matthew and Stones in terms of the material and coding examples from their chapter on debugging. References and pointers to these books can be found on the course webpage.
BTW, you can skip these notes if you write bug free code ;-)
We plan to learn the following from today’s lecture:
Up until now you have probably being relying on simple debug skills: using printf. But as the number of lines in your code grows, indeed at this point, the number of files in the systems you are coding, then printf while helpful is not effective enough. For example, a printf placed in your code may work but the segment fault that occurs 50 lines after your printf stops the printf from finally writing out that “I HERE AM AT LINE 100” message to the screen. So you natrually think that the bug must be before line 100 but it’s at line 150. So printf is not your friend in this example, rather it’s a red herring.
Debugging code is an essential part of the design and implementation methodology we discussed earlier. Many problems can arise when coding. Bugs can be simple: array script error are sometime difficult to debug: the systems runs for days and then fails, say, because of a memory leak or buffer overflow problem. Programmer aim to understand the nature of the bug they are trying to swat: is it predicatble, does it always fail under the same set of conditions, and so on. These are clues that help track down those pesky bugs in complex systems. To some degree being a good deugger of code C comes with experience.
By now you are use to bus errors, segfaults, and seeing files such as core dumps in your directory when you run your program and it fails. We will use the GNU debugger (gdb) to help solve these problems. Through a set of examples we will show how to debug problems in a systematic manner. But first let’s dicuss why bugs occur and what technques other than running gdb help.
The complexity of a program is related to the number of interacting components; for example, the crawler interacts with an external and distribued wget() command. There is a line of thought that says as a rule of thumb the number of pseky bug grows with the number of interactions. Reducing the complexity and interactions enables us to focus in on the location of bugs in code.
Dubgging problems ranges from easy, moderate through down right super hard. Techniques that help reduce degugging time include:
We will discuss these techniques in this lecture. First a work on C.
The C programming language can be dangerous to code with if you are not familiar with the dangers; for example: C does not support array subscript checking so it is easy to write and read at locations beyond the end of an array that you have defined. So be ware of this when you access arrays using variables, particularly, in loops (e.g., x = array[i] where the length of the array is 20 but i = 21). Pointers can be dangerous too - very danferous. You know that already. For example, not initialising a pointer and then trying to access elements of a structure or memory pointed to by the unitialized pointer is playing fire. There are many other possible scenarios. There is no garbage collection in C. So if you malloc() without free() you will have a memory leak which might cause your program to fail or not, or fail sometimes. There are tools for determining if there are memory leaks such as valgrind that we will use.
Up until now you have been using mostly prinft() to help you debug your code. Printf can only get you so far. Many different types of errors or bugs can exist in software. For example, you may have bug free code but the performance of the system woeful. How do you find performance errors in your code - could it be the choice of data structure is too “slow” or the structure of your code? What happens if you code looks error free but you have memory leaks? Printf will not help. What if you have a “seg fault” which you have all experienced that occurs after your printf (which lets say is executed) but never gets printed out because the process seg faults, as discussed above. What happens if your system runs for hours and under a certain set of system conditions the code fails. Working your way through 1000s of printf statements may not help. When a bug is buried deep in the execution of your software you need sophisticated tools to track those down. Printf will not help much. This lecture talks about tools to help with performance issues, memory leaks and difficult bugs.
The major groups of errors found in system development are requirement spec, design spec and coding errors. In what follows, we discuss these common types of errors found in systems.
Many times errors creep into system development because of some misunderstanding between what the customer wants and what the developers deliver. These “errors” in the requirement phase are very costly and typically some form of requirement analysis may catch them but your very best programmer has little chance of discovering these types of errors in the software life-cycle. That is right even your very best programmers will write the wrong code. Requirement reviews can also help here. But it is important to try and identify what the systems requirements are asking the designers to do and if that is indeed what the customer wants.
Assuming that the Requirement Spec is good the next set of errors occur in the design phase. In terms of our methodology errors would be reflected in the Design Spec. Do not sit down and start coding once you understand the requirements; think, abstract and write the Design Spec. You need to understand the data flow, IO, data structures, functional decomposition, pseudo code and algorithms before you write a line of code. If errors creep in the design stage; for example, if the code is to design a search on a large volume of data the data structures which look functionally correct may imped performance - maybe a hash table would be the best choice. This issue relates to performance errors. Many types of design errors can occur and misunderstandings between software engineers in the design phase can be helped with writing good clear specs that people read and discuss in detail in design reviews before a line of code is written.
The main error people associate with software is coding errors. We all make coding errors. Having a solid Design Spec to work from that has gone through a rigorous review helps. Translating that spec to an Implementation Spec that we have discussed in this course is the next step. Once code is written and unit tested (we will discuss this in the next lecture) then “desk checking” your code is your best friend - more on coding inspection below. Hacking at the terminal in the hope of digging the error out should not be the next step - first read your code and convince yourself it works. Make believe that you are the computer executing the code line by line - update the data structures just like your code. Study the IO in your code as you execute it. You will find many errors this way. Even better ask another person working in the project but not familiar with your code to desk check your code. Sometimes errors can be staring you in the face and a fresh set of eyes can pick those pesky bugs out. The take home is that you do not need a computer to find bugs.
Clearly computers do help and tools can help enormously to track down difficult bugs. Using the compiler with strong flags can help by just running gcc. In this class, we have used: “gcc -Wall -pedantic -std=c99” which is a good start. Never execute code that give you warnings - fix the code.
When tracking down pesky bugs we can think of the following steps to finding them:
Many times people rush and “hack” the debug phase and sit and the terminal hoping to eventually track down that bug via trail and error. Most people do this as their first resort. You will find this approach can be successful but can be very time consuming - read take longer than other techniques. One of the most effective debug tools is you! Stop and read your code. Pretend you are a computer and execute the code with a pen and paper. Code inspection is very useful. Programmers closely trace through their code in detail. Look for boundary problems in code, many times bugs exist at the boundaries - of structures, arrays, code (e.g., for loops). Many difficult bugs require more power than just hacking and hoping. Once you have read your code and convinced yourself it works then assuming bugs lurk you need to instrument your code and start the detective work. Read on.
It is a smart idea to use conditional “#ifdef” to include debug code. This can be switched on using by defining a variable in the command line of the compiler. The “-D” option of the gcc compiler allows us to define a named value (e.g., DEBUG below). Using conditional definitions to print out important data structures or where in the code that an assert happens is very useful. In addition, a single switch such as -DDEBUG can “turn on” all the test code in a system. This is better than defining a global variable to do the same thing because in the case of conditional debug flags the test code can be turned off and is not included in the binary. Make sure you turn off the test code before the code is shipped, e.g., make debug becomes make release.
We will use a buggy sort function in these notes to illustrate a number of approaches to debugging – note, this is not a great piece of code: buggy_sort0.c . Note, that the line, file and time are also printed out. We first use conditionally compiled debug code in buggy_sort1.c . Note, that the line, file and time are also printed out. It is clear now that the sort does not work, but, why? We discuss this later in the notes – buggy is very buggy.
Note, in the code below it states: mygcc -g -o bug -DDEBUG buggy_sort1.c. Note that -g and -ggdb are mostly equivalent flags. There may be some differences between machines.
The cflow tool prints a function call tree showing which functions call which others. This is very helpful to see the structure of the code and to see how it operates. The tool essentially analyses the source files and builds a graph from the code.
The gprof tool is use to profile the runtime performance code. Many times you might be confident that the code is bug free because there are no functional problems. However, the code could be worthless if it does not meet its performance requirements. The gprof tool is an execution profiling tool that is useful when tracking down performance problems.
Warning. Don’t try and use gprof on your mac. It has problems with Intel archiecture. Log on to a machine in the lab to try it out.
Anecdotal evidence. Many time programmers focus on getting the code to work functionally and then think about speed up. This is not always the most productive approach to design or systems development. Best to design for speed if needed (e.g., use a hash table instead of searching a long double linked list). I recall once working as a consultant on improving the performance of a radio router. The performance of the system coded in Ada was appalling and someone’s head was about to roll. I spend probably two weeks just studying the code of a very large system - difficult to keep that all in your head. Profiling the code highlighted the cost of a system that had been desig/Users/atc/teaching/cs23/notes/s5/lab5/srcned and coded with excessive number of tasks and rendezvous. The cost of interprocess communications was high. What did I do? It was not nice. Turned the system into one large task and replaced all interprocess communications (IPC) (which represented system calls) with my library that implemented the IPC API. I changes couple 100 lines of code in a system of 20,000 lines of code. The improvement was massive. Packets could be forwarding from one radio input to the output radio in under 100 msec which was down from 1 second! I was king for the day, or week. I made my changes, tested them locally, desk checked the code closely - and, it ran first time! The guy who designed the system wanted me to fail - I could feel it. But when that router ran first time. Well that is a moment I will always remember. My reward? I got to design the next system. The problem was essentially a performance bug. The changes was simple once the problem was identified. I took a very radical approach that ran against OO design. But that is what was needed. The router forwarded packets a 1 second intervals was not going to fly with the customer. And, yes, I saved the day and it was fun.
Note to instructor: use lab4 solution in cs23/l16/lab4 as example code
The output of the profile is below:
The valgrind tool is excellent for finding a number of problems with illegal access and memory leaks.
In what follows, we look at some very buggy code called leaky.c
First we will compile the code and the run the valgrind tool on the code. No recompilation is need to run valgrind.
Note from the valgrind output below it indicates various problems in the leaky.c code and tells the user what line a particular problem occurred on; for example:
==2189== Invalid read of size 1
==2189== at 0x804841C: main (leaky.c:20)
There is an invalid read at line 20.
How do I find a line number in my code?
You can use a emacs short cut to goto line e.g., 20 in your code. Hold down the ESC key then hit the X key followed by the G key then 20 then CR [carriage return]. You will be brought to line 20. This is handy when gcc or valgrind or any tool for that matter tells you there is a problem at a particular line.
The complete valout is here valout
We use the intentionally buggy buggy_sort.c code from “Linux Programming” by Mathew and Stones. We use gdb in a series of debugging moves that includes set breakpoints, commands, and patching running code to verify fixes to three bugs we encounter. This is a simple example that builts on our use of gdb in the course.
We study three classes of bugs: uninitialized variables, boundary/edge condition bugs and incorrect coding. We find the bugs using different techniques, fix and verify the changes.
The original sort code buggy_sort0.c looks harmless at first glance but pesky bugs lurk. Running the compiled buggy_sort0.c shows nothing!
Let’s instrument the the code to print the main data structure before and after sort runs. To do this we add -DDEBUG code to look at data structures before and after the sort is called buggy_sort1.c . Now we can see from the print outs that the sort does not work. This is our first evidence that things are not hunky-dory, an english term for OK.
We use gdb to set a breakpoint at sort and the step through. The first thing we notice is that the sort fuction does not process the list and fall out of the for loop first time through. Taking a look at the for loop statement more closely we see that it relies on a variable “s” which is not initialized. Note, s could have any value if not initialized. Theefore, could do different things on different machines. We fix this bug by initialising s=1 in buggy_sort2.c
We know that bugs happend at boundary conditions and the there is only one main data structure. Bugs lurk at edge cases. We make the arrays much larger and discover a segment fault lurking. We make the change to the main data structure in buggy_sort2.c. We use gdb to find the second bug and fixed it - for (j = 0; j < n - 1; j++) - n changed n-1. We make the changes to buggy_sort3.c
The final bug is more tricky. We run the code again and note that the sort almost works. We set a break point in sort and look at the results each time the break point is set. It is clear that the loop terminates early. We find the fix and instead of recompiling we use patching of the code. Found and patched the second fix for // n–; commented this line out. buggy_sort3.c
The good code (well, working code) is buggy_sort4.c
We are going to compile and run the code and look at the output:
Bugs happen at boundaries. Writing one byte beyond a byte array will not be detected by the runtime system but if we increase the size of the allocation we might highlight bugs.
Reading a byte beyond a byte array in this example lets the code run somewhat operationally normally. Errors occur at boundary conditions - i.e., at the edge of the data structure. So lets increase the size of the data structure such that an illegal access will show up. We change data to a 4K array buffer.
We “define SIZE 4096” in the code below.
Any access beyond this boundary will show up at execution.
We have a segment fault - we have the bug out in the open!
We use code inspection and instrumentation and gdb to find the bugs.
We have a seg. fault. But where in the code?
The backtrace command helps debug fault segments. It provides the path through the code prior to the point when the segment fault occurs.
This is a short program to the backtrace is very short. We have not called many functions. The sort is called with sort (a=0x600ca0, n=5) at buggy_sort2.c:41. If the code had a more complex trace we would be able to dig into the details and find out the control flow of the code which would be use in getting to the errors.
OK. Let’s but the fix in above into buggy_sort3.c and run it again. Hopefully, that was the last pesky bug...
The code completes without a seg fault now but it still does not work. The table is not ordered by the sort. But what we notice is that n=3 before it exits and the next breakpoint is not hit. Looking more closely at the code we see that the code does not execute the outer loop n times where n is 5. n is set to 3 along the way which is incorrect. Closer inspection of the code we can see that line 52 that is culprit (i.e., n–;). The input parameter n dictates how many times the inner and outter loop execute but n should not be decremented in this code because variables i and j are used to drive these loops, respectively:
We can use patching to change the code without recompilation. This is very cool. It is like hot wiring cars.
Our buggy sort is no longer buggy - now works great!
We put the fix in buggy_sort4.c
Other useful gdb commands are:
step - single step through a piece of code line by line
next - step “over” a routine that you do not want to step into (by using step)
OK, we are done. Being an effective debugger requires you to have an approach first: a set of techniques that allow you to work through finding easy and more progressively hard bugs. Tools can be a great help.
In the next lecture, we will talk more about testing your code with the goal of convincing yourself that your code has no bugs. However, as Dijkstra said “testing” only demonstrate the presence of bugs but not their absence. We will talk more about where bugs lerk and how the art of testing can help discover and remove many of the common bugs.
We strongly recommend that you play with these various versions of buggy_sort and go through the notes and reproduce what we did above in a step by step manner; then you can apply these simple techniques in nailing your pesky buys.