Sunday, August 12, 2012

Debugging C++ (Part 3): dmesg

Welcome to the third post of this series about debugging C++. This time I will talk about something less common, because it lets you debug after the program has crashed: this method uses dmesg. I only cover the case where you work with several libraries and your program crashes without any clue about which library is responsible, or how to reproduce the behavior.

1 dmesg

I heard about dmesg while reading tsuna's blog. Unfortunately I am not (yet) able to read assembly, but being able to discover the name of the faulty function is already helpful. I had to use this technique when I worked on a robot during my internship (a future post will present that). We worked with libraries, and that is the situation this post covers.

On a robot, many parameters come from the various sensors, and the execution of the same program depends on all of them, so it is really hard to reproduce a bug. If the nice "segmentation fault" message appears, how can you debug it, considering that you can't run valgrind and that running gdb is painful?

dmesg was my solution. I wrote a shell script to do the computation for me. Let's start by creating a dummy library which exports one function that segfaults if it is given a null pointer and, to be sadistic, we will call it with nullptr.

// file: libprint.hh
#ifndef TMP_LIBPRINT_HH_
# define TMP_LIBPRINT_HH_

int dereference(int* t);

#endif // !TMP_LIBPRINT_HH_

// file:
#include "libprint.hh"

int dereference(int* t)
{
  return *t;
}

// file:
#include "libprint.hh"

int main()
{
  return dereference(nullptr);
}

We create a shared library that contains the dereference function, and we compile the main file into a binary named print, linked with this library. And oh, surprise: segmentation fault! Let the hunt begin. We call dmesg and look at the last line:

[184608.332284] print[31332]: segfault at 0 ip b772e422 sp bf8ad218 error 4 in[b772e000+1000]

We need two pieces of information: the name of the library that contains the bug, and the address of the faulty instruction inside this library. To get the name of the library, we take the last field and remove the part in square brackets. To get the address of the faulty instruction, we take the value of the instruction pointer (ip) and the value before the + in the last field, and we subtract the latter from the former. If you are wondering why this subtraction gives the address of the instruction inside the library, a drawing may help.

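I hope the picture helped. In short, ip is an absolute address in the memory of the process, while the value before the + is the address at which the library is mapped. Subtracting it removes that offset and leaves the position of the instruction relative to the start of the library. In our case, 0xb772e422 - 0xb772e000 = 0x422.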
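
As a quick check of this subtraction, here is a minimal shell sketch using the two values from the dmesg line of this example. The function name offset_in_library is only an illustration and is not part of the script presented in this post.

```sh
#!/bin/sh
# Offset of the faulty instruction inside the library:
# absolute instruction pointer minus the address where the library is mapped.
# Both arguments are hexadecimal strings without the 0x prefix,
# exactly as dmesg prints them.
offset_in_library() {
    printf '0x%x\n' "$((0x$1 - 0x$2))"
}

# Values taken from the dmesg line of this example.
offset_in_library b772e422 b772e000   # prints: 0x422
```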

The question is: how do we automate this process? First, we can assume that we always run dmesg right after the error, so a call to tail is enough to keep only the last line. But sometimes this assumption does not hold, so our solution must also accept the line as an argument. For this we use the shell's default value assignment. As a little reminder:

output=${output:="default value"}

If output is already set to a non-empty value, it keeps that value; otherwise it becomes "default value". So we can use this mechanism to decide whether we use the first argument of the program or call dmesg directly.
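
To make the behavior concrete, here is a small sketch of both forms of the default value expansion. The helper first_or_default is a hypothetical name used only for this illustration. Note that a positional parameter such as $1 cannot be assigned with :=, so the non-assigning form :- is the one that applies to arguments.

```sh
#!/bin/sh
# ${var:=default} assigns the default only when var is unset or empty.
unset output
output=${output:="default value"}
echo "$output"                    # prints: default value

output="given value"
output=${output:="default value"}
echo "$output"                    # prints: given value

# Positional parameters cannot be assigned with :=, so the
# non-assigning form ${1:-default} is used for arguments.
first_or_default() {
    echo "${1:-default value}"
}
first_or_default                  # prints: default value
first_or_default "an argument"    # prints: an argument
```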

The part of the message before the colon is useless, so we can remove it. Then we take the fifth field, which holds the value associated with ip, and the last field.

The name of the library and the address where it is mapped in memory both lie in the last field, so we have to cut it in two to get the information we need.

All these operations can be done using only awk and sed.
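
Here is a sketch of how the last field can be cut in two with sed. The library name libexample.so is made up for the illustration; only the address part comes from the dmesg line of this example.

```sh
#!/bin/sh
# Split a "name[base+size]" field into the library name and its base address.
# "libexample.so" is a hypothetical name used only for this illustration.
field='libexample.so[b772e000+1000]'

# Keep what comes before the '['.
library=$(echo "$field" | sed -e 's/\[.*//')
# Drop everything up to the '[', then everything from the '+'.
base=$(echo "$field" | sed -e 's/.*\[//' -e 's/+.*//')

echo "library=$library base=$base"   # prints: library=libexample.so base=b772e000
```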

Once we have the two addresses, we just have to do the subtraction. We use the shell's builtin arithmetic for that. Beware, the values are in hexadecimal, so we must prefix them with 0x to tell the shell which base they are in. The result comes out in decimal and we want it in hexadecimal, so we use bc, a tool for numeric computations. Thankfully, it can convert a number from one base to another. The syntax is simple: set the variable obase to 16 (the default value is 10). That's all; just remember to prepend 0x to the address, because bc won't.
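
The conversion step can be sketched on its own as follows. The function names are hypothetical. The printf variant is an alternative that is not used in this post; it is shown only because it avoids depending on bc.

```sh
#!/bin/sh
# Decimal to hexadecimal with bc: obase selects the output base.
# bc prints uppercase digits and no 0x prefix.
to_hex_bc() {
    echo "obase=16; $1" | bc
}

# Same conversion with printf (lowercase digits, no 0x prefix).
to_hex_printf() {
    printf '%x\n' "$1"
}

# The shell arithmetic yields a decimal number: 1058 here.
decimal=$((0xb772e422 - 0xb772e000))

echo "0x$(to_hex_printf "$decimal")"                      # prints: 0x422
# Only call bc when it is installed.
command -v bc >/dev/null 2>&1 && echo "0x$(to_hex_bc "$decimal")"
```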

Here is the complete script:

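#! /bin/sh

# Use the line given as first argument, or the last line of dmesg.
output=${1:-`dmesg | tail -n 1`}
# Drop everything up to the colon (timestamp, program name and pid).
output=`echo "$output" | sed -e 's/.*: //'`

# Field 5 is the instruction pointer, field 11 is "library[base+size]".
first=`echo "$output" | awk '{ print $5; }'`
second=`echo "$output" | awk '{ print $11; }'`

# Split the field into the library name and its base address.
library=`echo "$second" | sed -e 's/\[.*//'`
second=`echo "$second" | sed -e 's/.*\[//' -e 's/+.*//'`

# Subtract (both values are hexadecimal), then convert back to hexadecimal.
address=$((0x$first - 0x$second))
address=`echo "obase=16; $address" | bc`

echo "Segmentation fault in $library at: 0x$address."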

Using it is simple: run it right after a segmentation fault when working with a library. Here is what it says in our case.

$ ./
Segmentation fault in at: 0x422.

And now, just run gdb like this (this is what I get with my example):

$ gdb
(gdb) disass 0x422
Dump of assembler code for function _Z11dereferencePi:
   0x0000041c <+0>:     push   %ebp
   0x0000041d <+1>:     mov    %esp,%ebp
   0x0000041f <+3>:     mov    0x8(%ebp),%eax
   0x00000422 <+6>:     mov    (%eax),%eax
   0x00000424 <+8>:     pop    %ebp
   0x00000425 <+9>:     ret
End of assembler dump.
(gdb) ...

If you are fluent in assembly you can read it; otherwise, use the metadata given by gdb: "_Z11dereferencePi". Oops, I just realized that I forgot to use "-g" when compiling. No matter: we have a mangled symbol, and we can demangle it with one of the methods presented in a previous post. And voilà, we know that our mistake is in the function dereference(int*). Pretty good, given that without this method I was unable to know where or why it failed, and could not reproduce it since there were too many parameters. I don't know what I would have done without this method.
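
As one possible way to demangle the symbol, here is a sketch that assumes c++filt (shipped with GNU binutils) is installed. The previous post may have used a different method, and the demangle helper is a hypothetical name.

```sh
#!/bin/sh
# Demangle a C++ symbol name with c++filt.
# Falls back to printing the name unchanged when c++filt is not installed.
demangle() {
    if command -v c++filt >/dev/null 2>&1; then
        echo "$1" | c++filt
    else
        echo "$1"
    fi
}

# With c++filt available, this prints: dereference(int*)
demangle _Z11dereferencePi
```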

I put this script on my GitHub account, so feel free to fork it and enhance it.

Hope you liked it!


  1. Hi!

    Thanks for sharing this cool debugging technique. A few days ago I had a similar discussion with a colleague (about how to use gdb on a stripped binary). After reading your post I decided to write my own article on the subject:
    This article explains how to use GDB at a lower level: breakpoints on memory addresses, locating entry point, stepping assembly instructions, etc...

    Keep up the great work!

  2. I think you wrote this tutorial knowing the source code and what happened. My question is: where can I find some information about the dmesg output? For example, I got this error: proj[5433] general protection ip:8048092 sp:bfbde120 error:202 in proj[8048000+1000]

  3. Cool trick ;)
    Thanks for sharing !