Friday, August 10, 2012

Debugging C++ (Part 2): Valgrind and gdb

This article is the second post of the "Debugging C++" series. This post is an introduction of Valgrind and gdb. It is intended for people that don't know these because it starts from the beginning and gives some links to the documentation. In the valgrind part, I present the notion of "definitely lost", "indirectly lost", etc. If you are already familiar with these notions, I invite you to go through the second part of this post about gdb. In this second part I start by briefly presenting what is gdb and what it is useful. But the main interest of this section is that I present the notion of reverse-debugging, which allows you to run your program backward. I also present a useful trick when you want to debug a code you wrote (and you can modify) inside an unknown environment (for example, imagine you want to debug your (buggy) malloc implementation and you test it with ls).

Before starting tools like Valgrind or gdb, think about compiling your code in debug mode, and not using optimizations. Otherwise you'll get trouble to debug it. For example with gdb, your cursor will jump from one line to another without following the linearity of the source code. This can be disappointing, this is due to that.

1 Valgrind

Valgrind is a set of tools that contains a memory checker (memcheck), a profiler (users can use two modules: callgrind and cachegrind), a tool for checking the memory consumption (massif), a synchronisation checker (Helgrind). The tools I use the most in this list is Memcheck, and this is all I will talk about in this post. I'm sure there will be other posts on this list of tools in the future. This is just a brief introduction. If you already know valgrind, you should skip this part.

Memcheck helps you to detect any memory corruption. How it looks like? Let's assume you have a program like this:

#include <vector>

int main(int argc, char *argv[])
{
  std::vector<int> v;
  v.reserve(2);

  v[2] = 4;
}

The call to reserve allocates the space for two integers. But we try to write in a unallocated address. So it is a mistake from the programmer, and even if in this example it is trivial that there is a mistake, sometimes it is not, because it is hidden in a lot of code. So, detecting this is not so simple. That's where Memcheck is helpful. Here is what he says about this snippet.

==4641== Invalid write of size 4
==4641==    at 0x40099A: main (test.cc:8)
==4641==  Address 0x5955048 is 0 bytes after a block of size 8 alloc d
==4641==    at 0x4C286E7: operator new(unsigned long)\
 (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4641==    by 0x400FB3: __gnu_cxx::new_allocator<int>::\
 allocate(unsigned long, void const*) (new_allocator.h:94)
==4641==    by 0x400ECA: std::_Vector_base<int, std::allocator<int> >::\
 _M_allocate(unsigned long) (in /tmp/aout)
==4641==    by 0x400D3B: int* std::vector<int, std::allocator<int> >::\
 _M_allocate_and_copy<std::move_iterator<int*> >(unsigned long, \
 std::move_iterator<int*>, std::move_iterator<int*>) (stl_vector.h:1109)
==4641==    by 0x400AC8: std::vector<int, std::allocator<int> >::\
 reserve(unsigned long) (vector.tcc:76)
==4641==    by 0x400988: main (test.cc:6)

4641 corresponds to the PID of the process. Every line coming from valgrind is formatted like this: ==PID==. Line beginning by a space are just the continuation of the line above. It was just for having the output of valgrind fitting in the page.

Remember to use "-g" when you compile your program, otherwise you'll get the name of the binary instead of the name of the file and the line number.

It starts by giving the name of the violation: "Invalid write" and tells us how many bytes we violate. Here this is 4 (sizeof(int)). Then he tells us where we have made a mistake, and then he shows what is the mistake. We have written after an allocated area. And then he shows the stack of the call that led to allocate this area. It starts by a call to reserve in the main line 6, and so on.

As you can see, the message is clear and it helps finding this kind of mistakes easily. This is a valuable tool for writing bug-free software. Every of your program should work without any error in valgrind. Because even if it doesn't create an error directly, it can lead to very weird error later and this is called (in my school at least^^) a mystical error. Because there is no clue that could help (without valgrind).

This is a simple example just to see how powerful is this tool. Another common use of Memcheck is its ability to detect memory leaks. As an example:

int main()
{
  int* p = new int;

  p = nullptr;
}

This is a trivial case of a memory leak because after the affectation of p to nullptr, there is no more pointer to the area allocated with new. It is lost. This is trivial to detect it, but that could be less trivial. So once again, we can rely on Memcheck to warn us. Let's see what he has to say about this:

==4736== HEAP SUMMARY:
==4736==     in use at exit: 4 bytes in 1 blocks
==4736==   total heap usage: 1 allocs, 0 frees, 4 bytes allocated
==4736==
==4736== 4 bytes in 1 blocks are definitely lost in loss record 1 of 1
==4736==    at 0x4C286E7: operator new(unsigned long) (in \
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4736==    by 0x40060D: main (tests.cc:3)
==4736==
==4736== LEAK SUMMARY:
==4736==    definitely lost: 4 bytes in 1 blocks
==4736==    indirectly lost: 0 bytes in 0 blocks
==4736==      possibly lost: 0 bytes in 0 blocks
==4736==    still reachable: 0 bytes in 0 blocks
==4736==         suppressed: 0 bytes in 0 blocks

I just show the interesting part of the output of valgrind. To get this, I ran valgrind –leak-check=full <program_name>. We start by reading the heap summary, which says that when we exit, 4 bytes are still in use. It indicates the number of allocation and the number of deallocation. We see that the two numbers are not equal, so there is a problem. There is more allocation than free, it's a leak.

At the bottom of the message we can see the kind of leak. There are four kinds of leaks. The full explanation can be found in the documentation, in the section 4.2.7. A short explanation follows.

AA and BB are heap blocks, and C the set of pointer of your program (in reality it is more than that. See the section 4.2.7 cited above). And consider an arrow represent that there is at least one pointer to the first byte of the allocated area, and an arrow and a `?' represents a pointer to a byte inside the allocated area and not the first (this is called interior-pointer in the link given above). No arrow means that the heap block is unreachable. These are simplified examples to understand the concept behind these things.

Once we have understood what represents these names, we can look at the message in the middle that gives the location of each memory blocks that leak. It says definitely lost. So we just override the address of this area, and it appears line 3. Once we know the supposed reason of this leak and its position in our source file, it is easy to fix it.

This is the power of Memcheck. This is the first thing I run when I have a bug. Because even if the error he can reveal is not responsible of the bug, it will be annoying in the future.

A important note about valgrind, since it use its own allocator and deallocator, some bugs may disappear when using Memcheck. It happens to me (at least) once, when I had the output of my program that depends on the state of the computer (state of the heap in fact), which led my algorithm to have different results if I run it several time consecutively. This is generally due to a memory corruption or something like that, so I used Memcheck. And under Memcheck, the program starts to act normally.

An interesting post about valgrind that talks about can be found here.

2 gdb

A second tool I use when I still have a bug after using valgrind is gdb (GNU Debugger).

gdb is a debugger that allows you to walk through your code at run-time and see how the lines you (or your colleagues) wrote influence the program and detect bugs.

I don't want to write a basic tutorial to gdb since there are really tons of them on the Internet. I'd prefer talking about two things that can be useful and not known by everyone. The first one is backward debugging. The second one is a trick to debug a library when you can't run gdb on the program that uses the library. Baptiste Afsa, one of my teacher at Epita, gave me this trick.

2.1 Backward debugging

Why backward debugging? Simply because I facepalmed myself too many time after going too far in my debugging process. I mean going after the critical point after running through trivial code slowly for 5 minutes and being forced to restart the whole thing.

This feature was introduced in 2009 with gdb version 7.0. Documentation is here. To be able to enable this feature, you need to tell gdb to record the state of your program. The official tutorial is here. So I don't have to explain that in details (No I'm not lazy! :) but I don't like duplication and I don't like duplication). I recommend to read the tutorial!

But the thing is that recording introduces an overhead when you use next, continue or whatever. This can slow down the debugging process, but I think it can really help the programmer.

Give it a try, it is helpful.

2.2 The infinite loop trick

This tips allows you to debug the code of a library call by a program that you have difficulties to run with gdb. The case where I use it is when we had to code malloc (free, realloc, calloc too) at school. And to check if it works we used LD_PRELOAD to make binary use our version of malloc instead of the standard version.

Let's assume we run ls with our malloc, how would you debug it?

$ gdb ls
...
Reading symbols from /bin/ls...(no debugging symbols found)...done.
(gdb) break main
Function "main" not defined.

Well… Seems hard right? The solution given by my teacher was to use an infinite loop. How it works:

#include <iostream>

int main()
{
  bool cond = true;

  while (cond)
    continue;

  std::cout << "Out of the infinite loop" << std::endl;
  // Stuff
}

The trick was to add this snippet at the beginning of the code of our function, and to run the program. For example, compile the code above (don't forget -ggdb). This is an infinite loop, yay!

gdb can attach to a running process with its PID, and is able to set the value of a variable when debugging. So the trick is just:

$ pgrep <program_name>
<pid>
$ gdb attach <pid>
...
main () at tests.cc:8
8           continue;
(gdb) set var cond = false
(gdb) n
7         while (cond)
(gdb) n
10        std::cout << "Out of the infinite loop" << std::endl;
(gdb)

By setting the variable cond to false, we are out of the infinite loop, and we have the hand on the program to debug as we want. Impressive right?

I think this trick is useful for desperate situation like the one described above.

This was two great features I wanted to share with you. Do you have features that you want to share?

No comments:

Post a Comment