Silmor . de

Debugging Crashes

You may know those mega-useful bug reports: "The program just crashed. It shows me that it happened at address some number..."

On Windows that is all you are ever going to get. Even if you still have the originally released binary, it is hard to map this address back to a place in the code. On Unix/Linux the user has probably turned off core dumps, or is reluctant to send you one that contains their precious data.

What you really need is a simple stack dump. But where do you get one from a compiled C/C++ binary? Fortunately there are the GNU tools. This article describes how to use some of them to get useful information the next time one of your programs crashes underneath one of your customers.

Getting informed

Most modern operating systems inform programs of fatal problems by sending a signal to them. The two most important ones are SIGABRT and SIGSEGV. SIGABRT (abort signal) is usually generated by the program itself if it reaches a state in which it cannot continue to run (e.g. an assertion fails). SIGSEGV (segmentation violation) informs the program that it tried to access memory that it is not allowed to access or that simply is not allocated. This happens for example if it tries to use an object that has already been freed. On Unix-like systems there are a couple more signals that might be a good idea to catch (e.g. SIGILL and SIGBUS).

If these signals are caught we can stop the program and call any other program that analyzes the current process (eg. a debugger).

Subscribing to signals is fairly easy using the signal or sigaction function. The signal handler registered there should call the external program.

Here is an example for Unix/Linux:

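#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Called on SIGABRT/SIGSEGV: run an external debugger on this process. */
static void myhandler(int sig)
{
    char buf[1024];
    fprintf(stderr,"Caught signal %i, trying to call debugger...\n",sig);
    /* "debugger" is a placeholder for the real debugger command */
    snprintf(buf,sizeof(buf),"debugger %i",(int)getpid());
    system(buf);
    /* on SIGABRT we just return into abort(); otherwise we abort now */
    if(sig!=SIGABRT){
        signal(SIGABRT,SIG_DFL);
        abort();
    }
}

int main()
{
    signal(SIGABRT,myhandler);
    signal(SIGSEGV,myhandler);
    /*....*/
}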

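In the example above I used signal(2) since the extra functionality of sigaction(2) is not required here. On systems where signal(2) has System V semantics the handler is also dis-associated from the signal as soon as the signal triggers for the first time, which prevents the handler from ending up in an endless loop. This is not portable, though: with BSD semantics (the default in glibc on Linux) the handler stays installed and the signal is merely blocked while the handler runs. If you rely on the one-shot behaviour, use sigaction(2) with the SA_RESETHAND flag.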

In the snprintf(3) call the string "debugger" needs to be replaced by a debugger command or some wrapper around it (see below for an example using gdb). It might also be a good idea to use fork(2), one of the exec(3) calls and waitpid(2) instead of system(3) to avoid problems with the shell. All the debugging work should then be done in the external process. This matters especially if the wrapper displays some kind of GUI: the signal might have been caused by the GUI, so doing more GUI work in the crashed process will only aggravate the problem and disturb the stack that we want to analyse.

The last few lines of the handler check whether the process is already in abort mode. If the handler was called by SIGABRT the program will return to the abort(3) function and continue to generate a core dump after the signal routine returns. If this was another signal we reset the handling of SIGABRT (to prevent loops) and cause the program to abort and dump core (if we did not do anything it would try to continue running).

To port this to (non-CRT) Windows the printf routines need to be replaced. The getpid(2) call is called _getpid() under Windows. Also system(3) should probably be replaced by some version of spawn (with a flag to wait for the child process). Instead of going on to abort you should call _exit (a normal exit will try to clean up the program and is probably not safe to call from within a signal handler). The output of the debugger should also be redirected into the GUI and/or into a file. Of course this should be done by the wrapper process, not by the crashed program itself.
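Replacing system with exec and waitpid could look roughly like the following sketch. It assumes a POSIX system. The function name run_and_wait is invented for this illustration, and the exit code 127 for a failed exec merely copies the shell's convention.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with the given arguments, without a shell, and wait for it.
   Returns the child's exit status, or -1 if it could not be started
   or did not exit normally. */
static int run_and_wait(char *const argv[])
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;              /* fork failed */
    if (child == 0) {
        execvp(argv[0], argv);  /* only returns on error */
        _exit(127);             /* command not found or not executable */
    }
    /* parent: wait for the child, retrying if interrupted by a signal */
    while (waitpid(child, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The handler would then build an argument vector for the debugger or wrapper and pass it to run_and_wait instead of formatting a command string for the shell.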

Calling GDB

There are hundreds of debuggers on the market, but most cost a lot of money and are not exactly cheap to deliver along with every copy of a software product. Fortunately GDB (the GNU Debugger) is available for most platforms.

GDB is highly customizable both with options and with scripts. We will use both here.

First GDB needs to be called in a way that attaches it to the running (crashed) program and makes it execute some commands on its own:

gdb -n -batch -x debug_script -p 1234

The -n switch will prevent it from executing any commands in a .gdbinit file that the user might have and which might alter some feature we rely on. The -batch switch tells it to not require any input from the user. The -x switch gives it a script file to execute and the -p switch gives it the ID of the process to debug (1234 needs to be replaced by the ID we got from getpid above).
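To show how the process ID ends up on that command line, here is a small sketch that fills in an argument vector for one of the exec calls. The function name build_gdb_argv and its parameters are made up for this illustration, and it assumes that gdb can be found in the PATH.

```c
#include <stdio.h>
#include <sys/types.h>

/* Fill argv (8 slots) with the GDB command line from the text.
   pidbuf receives the process ID as a decimal string.
   Returns 0 on success, -1 if pidbuf is too small. */
static int build_gdb_argv(pid_t pid, const char *script,
                          char *pidbuf, size_t pidbuf_len,
                          const char *argv[8])
{
    int n = snprintf(pidbuf, pidbuf_len, "%ld", (long)pid);

    if (n < 0 || (size_t)n >= pidbuf_len)
        return -1;              /* encoding error or output truncated */
    argv[0] = "gdb";
    argv[1] = "-n";             /* ignore the user's .gdbinit */
    argv[2] = "-batch";         /* no interactive input */
    argv[3] = "-x";
    argv[4] = script;           /* e.g. "debug_script" */
    argv[5] = "-p";
    argv[6] = pidbuf;           /* process to attach to */
    argv[7] = NULL;
    return 0;
}
```

In the handler, pid would be the value returned by getpid.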

The debug_script file should contain all commands that need to be executed in the debugger. For most programs it will be enough if it just contains the single command backtrace (or bt for short) to generate a backtrace of the call stack of this process. For multi-threaded programs something more complex is required:

set width 0
set height 0
echo \nCurrent Thread Backtrace:\n
thread
backtrace
echo \n-----\n\nThread Info:\n
info threads
echo \n-----\n\nAll Threads:\n
thread apply all backtrace

The two set lines tell GDB that there are no limits on the "terminal" it works on. This prevents it from waiting for the user to hit a key to scroll through many lines. The echo command just outputs the text given to it. The thread command all by itself just outputs some short information about the current thread. The info threads command lists all threads of the process and some short information about them. The thread apply command executes a command in the context of multiple threads. In our case we generate a backtrace of all threads.
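If the wrapper should not depend on a separately installed script file, it could write the script itself before starting GDB. The following is only a sketch of that idea. The function name write_debug_script is invented here, and the caller is assumed to have opened the output file already.

```c
#include <stdio.h>

/* Write the GDB command script from the text to out.
   Returns 0 on success, -1 if writing failed. */
static int write_debug_script(FILE *out)
{
    /* "\\n" puts a literal backslash-n into the file, which GDB's echo
       command prints as a newline; "\n" ends the script line itself. */
    static const char script[] =
        "set width 0\n"
        "set height 0\n"
        "echo \\nCurrent Thread Backtrace:\\n\n"
        "thread\n"
        "backtrace\n"
        "echo \\n-----\\n\\nThread Info:\\n\n"
        "info threads\n"
        "echo \\n-----\\n\\nAll Threads:\\n\n"
        "thread apply all backtrace\n";

    if (fputs(script, out) == EOF)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}
```

This belongs in the wrapper process, not in the signal handler of the crashed program.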

All these commands may take some time (up to a minute) to execute, so a potential wrapper should display some warning to the user that tells him what is going on.

Note: I haven't tested this on Windows yet. I also do not know how GDB reacts to binaries built with a non-GNU compiler.

Symbols

For all of the above to work the binary and all important libraries need to contain debugging symbols - otherwise the debugger will not be able to do much more than the operating system already does: it will show the addresses of the stack frames, but it will not be able to display the function names.

If you do not want to carry symbols in the executable file, in order to make it smaller and faster to load, they can be exported to a different file (let's assume the executable is called foo):

objcopy --only-keep-debug foo foo.dbg
objcopy --strip-debug foo
objcopy --add-gnu-debuglink=foo.dbg foo

The first line will copy the debug information into foo.dbg. All other sections (e.g. code, data) of the file will be dropped from foo.dbg. The second line removes debug information from foo itself. The last line adds a section to foo that tells GDB where to find the debug symbols if they are needed. Both files need to be distributed together for the debugging above to work.

What if foo contains some symbols that you do not want your customers to know about, e.g. the location of some proprietary code or a license check? Before you execute the commands above, drop these few specific symbols from the executable:

objcopy -w --strip-symbol='*SecretCode*' foo

This will remove all symbols that contain "SecretCode" (assuming that this is a class name it will remove the debug symbols for all methods of this class). This command can be called with several patterns or several times to remove more symbols. Be careful to not remove anything important! Any symbols that are removed from the executable will later show up as "??" in the debug output if a crash involves them.

