Welcome to the third post of this series about debugging C++. Here I will talk about something less usual, because it lets you debug after the program has crashed: this method uses dmesg. I only present the case where you work with several libraries and your program crashes without any clue about which library is responsible, or how to reproduce the behavior.
1 dmesg
I heard about dmesg when reading tsuna's blog. Unfortunately, I don't have the skills (for now) to read assembly. But the fact that we can discover the name of the faulty function is helpful. I had to use this when I worked on a robot during my internship (a future post will present that). We worked with libraries, and that is the case this post shows.
On a robot there are a lot of parameters coming from the various sensors, and the execution of the same program depends on many of them. So it is really hard to reproduce a bug. If the nice "segmentation fault" message appears, how can you debug that, considering that you can't run valgrind and that running gdb is painful?
dmesg was my solution. I wrote a shell script to do the computation for me. Let's start by creating a dummy library that exports one function, which segfaults if it is given a null pointer. And let's be sadistic: we will call it with nullptr.
// file: libprint.hh
#ifndef TMP_LIBPRINT_HH_
# define TMP_LIBPRINT_HH_

int dereference(int* t);

#endif // !TMP_LIBPRINT_HH_

// file: libprint.cc
#include "libprint.hh"

int dereference(int* t)
{
  return *t;
}

// file: main.cc
#include "libprint.hh"

int main()
{
  return dereference(nullptr);
}
We create a libprint.so that contains the dereference function, and we compile main.cc into a binary, print, linked with this library. And oh, surprise! Segmentation fault. Let the hunt begin. We call dmesg and look at the last line:
[184608.332284] print[31332]: segfault at 0 ip b772e422 sp bf8ad218 error 4 in libprint.so[b772e000+1000]
We need two pieces of information: the name of the library that contains the bug, and the address of the faulty instruction within this library. To get the name of the library, we take the last field and remove the part in []. To get the address of the faulty instruction, we take the value of the instruction pointer (ip) and the value before the + in the last field, and we subtract the second value from ip. If you are wondering why subtracting these two values gives the address of the instruction in the library, a drawing may help.
I hope the picture helped. In fact, this subtraction removes the offset corresponding to the position at which the library is mapped in memory (its base address).
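To make the subtraction concrete, here is a minimal sketch using the numbers from the dmesg line above. The function name offset_in_library is hypothetical, and the sketch assumes a shell whose arithmetic is wider than 32 bits (the addresses do not fit in a signed 32-bit integer). It also checks that ip really falls inside the mapping, between the base and the base plus the size, which is what makes the subtraction meaningful.

```shell
#!/bin/sh
# Hypothetical helper: print the decimal offset of ip inside a mapping,
# or fail if ip does not lie in [base, base + size).
# The three arguments are hexadecimal without the 0x prefix, as in dmesg.
offset_in_library() {
    ip=$((0x$1)); base=$((0x$2)); size=$((0x$3))
    # The arithmetic expansion yields 1 when ip is inside the mapping.
    inside=$(( ip >= base && ip < base + size ))
    [ "$inside" -eq 1 ] || return 1
    echo $((ip - base))
}

# Values taken from the dmesg line above: ip, base address, mapping size.
offset_in_library b772e422 b772e000 1000   # prints 1058, which is 0x422
```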
The question is: how do we automate this process? First, we can assume that we always run dmesg right after the error, so we can call tail to keep only the last line. But sometimes this assumption isn't correct, so our solution must also be able to take the line as an argument. Here we use the shell's default value assignment. As a little reminder:
output=$1
output=${output:="default value"}
If an argument is given, output will be equal to its value, otherwise it will be equal to "default value". So we can use it to decide whether we use the first argument of the program or directly call dmesg.
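As a small self-contained sketch of this idiom (the function name pick_line and the fallback text are made up for illustration), the ${var:=default} expansion assigns the default only when the variable is unset or empty:

```shell
#!/bin/sh
# Hypothetical helper: echo its first argument, or a fallback when the
# argument is missing or empty.
pick_line() {
    line=$1
    # ':' is the no-op builtin; the expansion assigns the default to
    # 'line' only if 'line' is unset or empty.
    : "${line:=fallback value}"
    printf '%s\n' "$line"
}

pick_line "given on the command line"   # prints the argument
pick_line                               # prints "fallback value"
```

In the script, the fallback is the command substitution that calls dmesg, and the shell only expands it when no argument was given.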
The part of the message before the colon is useless, so we can remove it. Then we take the fifth field, which is the value associated with ip, and the last field.
The name of the library and the address where it is mapped in memory both lie in the last field. So we cut it in two to get the information we need.
All these operations can be done using only awk and sed.
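These steps can be sketched on the sample line from the dmesg output above. The function name parse_segfault_line is hypothetical, and the field positions (5 and 11) assume the exact message layout shown in this post; other kernels may format the line differently.

```shell
#!/bin/sh
# Hypothetical helper: print "library ip base" for one dmesg segfault line.
parse_segfault_line() {
    # Drop the "[time] name[pid]: " prefix.
    msg=$(printf '%s\n' "$1" | sed -e 's/.*: //')
    # Field 5 is the instruction pointer, field 11 is "library[base+size]".
    ip=$(printf '%s\n' "$msg" | awk '{ print $5 }')
    last=$(printf '%s\n' "$msg" | awk '{ print $11 }')
    # Keep what precedes '[' for the name, and what lies between '[' and
    # '+' for the base address.
    library=$(printf '%s\n' "$last" | sed -e 's/\[.*//')
    base=$(printf '%s\n' "$last" | sed -e 's/.*\[//' -e 's/+.*//')
    printf '%s %s %s\n' "$library" "$ip" "$base"
}

sample='[184608.332284] print[31332]: segfault at 0 ip b772e422 sp bf8ad218 error 4 in libprint.so[b772e000+1000]'
parse_segfault_line "$sample"   # prints: libprint.so b772e422 b772e000
```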
Once we have the two addresses, we just have to do the subtraction. We use the shell's built-in arithmetic for that. Beware, the values are in hexadecimal! So we must prefix each value with 0x to tell the shell which base it is in. The result comes out in decimal and we want it in hexadecimal, so we use bc, a tool for numeric computations. Thankfully, it can convert a number from one base to another. The syntax is simple: set the variable obase to 16 (the default value is 10). And that's all. Remember to prepend 0x to the address, because bc won't.
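If bc is not installed, the conversion can also be done with printf. This is a different technique from the one used in this post, shown here only as a minimal sketch; to_hex is a hypothetical name. Note that printf prints the digits above 9 as lowercase letters where bc prints them in uppercase, and that the %x conversion adds no 0x prefix either, so the format string supplies it.

```shell
#!/bin/sh
# Hypothetical helper: convert a decimal number to hexadecimal with a
# 0x prefix, using printf instead of bc.
to_hex() {
    printf '0x%x\n' "$1"
}

# 1058 is the decimal offset for the example in this post.
to_hex 1058   # prints 0x422
```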
Here is the complete script:
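#! /bin/sh

# Use the line given as first argument, or the last line of dmesg.
output=$1
output=${output:=`dmesg | tail -n 1`}
# Drop everything up to the colon.
output=`echo "$output" | sed -e 's/.*: //'`
# Field 5 is the instruction pointer, field 11 is "library[base+size]".
first=`echo "$output" | awk '{ print $5; }'`
second=`echo "$output" | awk '{ print $11; }'`
library=`echo "$second" | sed -e 's/\[.*//'`
second=`echo "$second" | sed -e 's/.*\[//' -e 's/+.*//'`
# Offset of the faulty instruction in the library: decimal, then hex.
address=$((0x$first - 0x$second))
address=`echo "obase=16; $address" | bc`

echo "Segmentation fault in $library at: 0x$address."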
Using it is simple: run it right after a segmentation fault when working with a library. Here is what it says about our case.
$ ./dmesg.sh
Segmentation fault in libprint.so at: 0x422.
And now, just run gdb like this (this is what I get with my libprint.so example):
$ gdb libprint.so
...
(gdb) disass 0x422
Dump of assembler code for function _Z11dereferencePi:
   0x0000041c <+0>:  push   %ebp
   0x0000041d <+1>:  mov    %esp,%ebp
   0x0000041f <+3>:  mov    0x8(%ebp),%eax
   0x00000422 <+6>:  mov    (%eax),%eax
   0x00000424 <+8>:  pop    %ebp
   0x00000425 <+9>:  ret
End of assembler dump.
(gdb) ...
If you are fluent in assembly you can read it, or you can use the metadata given by gdb: "_Z11dereferencePi". Oops, I realize that I forgot to use "-g" when compiling. No matter: we have a mangled symbol, and we can use one of the methods presented in one of my previous posts. And voilà, we know that our mistake is in the function dereference(int*). Pretty good, given that without this method I was unable to know where it failed or why, and unable to reproduce it since there were too many parameters. I don't know what I would have done without this method.
I put this script on my GitHub account, so feel free to fork it and improve it.
Hope you liked it!
Hi!
Thanks for sharing this cool debugging technique. A few days ago I had a similar discussion with a colleague (about how to use gdb on a stripped binary). After reading your post I decided to write my own article on the subject:
http://felix.abecassis.me/2012/08/gdb-debugging-stripped-binaries/
This article explains how to use GDB at a lower level: breakpoints on memory addresses, locating entry point, stepping assembly instructions, etc...
Keep on the great work!
I think you could write this tutorial because you know the source code and what happened. My question is: where can I find some information about the dmesg output? For example, I got this error: proj[5433] general protection ip:8048092 sp:bfbde120 error:202 in proj[8048000+1000]
Cool trick ;)
Thanks for sharing!