Debugging involves trying to isolate a “bug” in a computer program, and then removing that bug. (The term “bug” is used to describe a mistake, or perhaps some other sort of problem.)
There are a variety of techniques that may be used, including using various different pieces of software that are designed to help locate a bug.
Quite commonly, the term “reverse engineering” refers to the process of converting code from “machine language” into “assembly language”, which is less of a challenge to read (although many people may find that even “assembly language” looks quite challenging), and then trying to (automatically or manually) convert that “assembly language” code into another form, such as a compilable language like C, or even another format like pseudo-code or flowcharts.
Some of the software tools that are commonly used for one of these processes (such as debugging) are software tools that can also be quite useful for performing the other process (such as “reverse engineering”). The main differences between “debugging” and “reverse engineering” may be related to approach and goals. However, the actual activities may be quite similar. As this guide is more about the techniques being used, it does not try to heavily distinguish between the different goals. As a result, this guide may be useful for both types of tasks (both debugging, and “reverse engineering”).
Note that this guide is mainly about trying to investigate behavior by looking at the actual software code/instructions. Other techniques for troubleshooting, including steps that may be taken by end users, may be covered by the troubleshooting section (and its many sub-sections). One of those sub-sections is the area discussing error handling, which has a subsection about handling crashes. That crash-handling section discusses using some of the debugging software. Other areas of the troubleshooting section that may be of related interest include problem messages, as well as viewing logs.
The “debugger”, also called “debugging software”, generally refers to a type of software program. This guide refers to the first program as the “debugging software”. The purpose of the debugging software is to interact with another program (which is currently being debugged). This guide uses the phrase “(main, getting-debugged) program” to refer to the program that is actively being debugged. (Presumably, that is the “program” of key/main importance. The phrase “debugged” is not meant to suggest that the program is fully debugged, but that it is actively getting debugged.)
- Handling Debugging
Every program should be debuggable. This does not necessarily mean that every person can debug each program. For example, a typical end user may not have permissions to see the binary code of a program that runs on a web server. Multiple attempts to interact with a program like that could result in further attempts being locked out; this could commonly happen when another program, specifically a program called an “intrusion prevention system” (“IPS”), is monitoring network traffic.
Additionally, debugging may be much easier when source code is available. For example, OpenBSD FAQ 2: section on reporting bugs shows how a debugger will show which instruction in a function caused a system to stop running normally. (In the example, the debugger output stated, “
”, and then lists a function name and a hexadecimal offset. The hexadecimal offset, as well as objdump info, helps to identify the exact instruction where a problem was detected. That instruction's “assembly code” can also match up with a line of source code that uses the programming language named C. How to debug kernel crashes shows an example a bit more colorfully.) Identifying the line of source code that causes an error can be quite helpful for a programmer.
If a person who uses the program does not have access to the source code (which would generally only be an issue when the software is not “open source”), the person may still be able to gather some information that will help a programmer who does have access to the source code. Using a debugger is often an easy and useful way to gather that information.
Fixing the problem will generally be a task that should be done by a person who does have access to the source code. (The alternative involves making some very low-level modifications, which is typically far more challenging.) Therefore, the generally recommended approach is to allow the troubleshooting efforts to be led by a programmer who does have access to the source code. Until that person responds to a support request, a person who uses the program and encounters a problem can do some things to help. In theory, one thing to do is to leave the computer in its current (perhaps “broken”) state. In practice, leaving a computer in a very non-functional state may often be viewed as infeasible. What can often be done, though, is to save any sort of dump file that gets created. For instance, the section on types of dump files may be worthwhile to check if a Microsoft Windows machine encounters a system panic (which may typically involve rebooting, or showing a “bugcheck”/BSOD screen). Some additional related information may be shown in the Crash reporting section.
Also, save a copy of any log files that the program uses. If a person does not know what log files are used, check some standard locations. On a server that is not very busy, main log files might not be updated frequently except when problems occur. So, if a problem occurs, check when the log files were most recently updated.
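As a rough illustration of checking when log files were most recently updated, here is a minimal Python sketch. The function name `newest_first` and the sample file names are hypothetical, and the directory scanned is a throwaway one created by the example itself; a real check would point at wherever the program actually writes its logs.

```python
import os
import tempfile
import time
from pathlib import Path


def newest_first(log_dir):
    """Return (path, modification time) pairs for regular files in log_dir, newest first."""
    entries = [(p, p.stat().st_mtime) for p in Path(log_dir).iterdir() if p.is_file()]
    return sorted(entries, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Build a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as scratch:
        old = Path(scratch, "old.log")
        new = Path(scratch, "new.log")
        old.write_text("nothing happened\n")
        new.write_text("something broke\n")
        # Force distinct timestamps instead of relying on clock resolution.
        os.utime(old, (1_000_000, 1_000_000))
        os.utime(new, (2_000_000, 2_000_000))
        for path, mtime in newest_first(scratch):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
            print(stamp, path.name)
```

Files near the top of the resulting list are the ones most likely to contain entries written around the time of the problem.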
- More overview on debugging techniques
A programmer can effectively use the source code to enable simpler debugging. One approach is to support logging to a file. Another approach, which novice programmers tend to learn quite early, is to insert statements that output information to the screen. This style of debugging can be quite effective, but can also easily become annoying for an end user (and even the programmer). Therefore, once such “debugging” statements/instructions/commands are no longer absolutely necessary, many programmers simply decide to remove all such statements from the source code.
Another approach, which may often be more useful, is to leave useful debugging statements in the source code, but to have them be disabled when a variable has a specific value. This can allow a programmer to easily re-enable the useful statements, and might also give a clue that a specific variable's value is particularly important. An example of how this may be implemented is discussed in the section about the
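The idea of leaving debugging statements in place, but disabling them through a variable, might be sketched as follows. This is only an illustration in Python: the names `SETTINGS`, `debug`, and `average` are hypothetical and are not taken from any section of this guide.

```python
import sys

# Hypothetical setting: 0 disables debug output, higher values enable it.
SETTINGS = {"debug_level": 0}


def debug(message, level=1, stream=sys.stderr):
    """Write message only when the configured debug level is at least `level`.

    Returns True if the message was written, False if it was suppressed.
    """
    if SETTINGS["debug_level"] >= level:
        print(f"[debug] {message}", file=stream)
        return True
    return False


def average(values):
    # This statement stays in the source code permanently.
    debug(f"average() called with {len(values)} values")
    return sum(values) / len(values)


if __name__ == "__main__":
    print(average([2, 4, 6]))        # debug output is suppressed
    SETTINGS["debug_level"] = 1      # re-enable the statements without editing them
    print(average([2, 4, 6]))        # now also writes a [debug] line to stderr
```

Changing one value turns every debugging statement back on, and the statements themselves hint at which values the programmer considered important.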
However, there are also other techniques, including using software such as “disassemblers” (software that performs “disassembly”) or “debuggers” (which typically allow for memory inspection, breakpoints, and manually stepping through code). This guide is generally about using those types of approaches.
Norm Matloff's “Student's Guide to the Secret Art of Debugging”, page 5 and Norm Matloff's guide to debugging: section about not using printf tell students to not output values to the screen as a primary method of debugging. Actually, that advice seems questionable: if information is really important, then displaying the information on the screen may be a good idea. In fact, displaying the information, even in a way where end users can see it, may be a good long term solution. Different people may find that different solutions work best. However, Norm Matloff was not wrong in trying hard to encourage people to become familiar with the debugging software. Using debugging software well can often lead to programmers understanding nuances more quickly than they would by altering the flow of a program.
- What debuggers do
Wouldn't it be nice if the debugging software just debugged the software? Meaning, wouldn't it be nice if the debugging software actually removed the bugs?
The most common features of debugging software are inspection, breakpoints, and single stepping. Another common feature is a “watch”, which may be implemented as a conditional breakpoint. All of these features are related to the idea of being able to run a piece of a program, and then pausing the program. While the program is paused, the debugging software can report the status of memory (such as the value that is being remembered by a specific variable name). Some software might even allow making some changes, which could be useful for quick tests. For example, a program's memory could be altered so that a variable has a more desirable value. As another example, it may even be possible to insert a function call, on the fly. The exact capabilities may vary between different debugging software products.
A lot of these steps might also be easily performed by altering the program's source code (for programmers who do have access to the source code), but debugging software can make the process easier. For example, a program could be written to run a loop three times, and then start outputting information. However, that can often require introducing cluttered-looking code into the “source” file, and then re-compiling. For some software, re-compiling might take some time, and then re-running a loop three times might also take some more time. When using a debugger, the source code will not require any changes. The debugging software can cause the loop to be run three times, and then the main software (which is being debugged) may be paused. The person who uses the debugging software can then perform one or more tasks, such as inspecting the contents stored in a specific (named) section of memory.
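To make the “cluttered-looking code” concrete, here is a small Python sketch of the source-altering approach: a loop that stays quiet for three iterations and only then starts reporting. The function `running_totals` and its parameter names are hypothetical, chosen just for this illustration.

```python
def running_totals(values, quiet_iterations=3):
    """Sum values, reporting state only after the first `quiet_iterations` passes."""
    total = 0
    report = []
    for iteration, value in enumerate(values, start=1):
        total += value
        # Debugging clutter: this check exists only to help the programmer.
        if iteration > quiet_iterations:
            report.append(f"iteration={iteration} value={value} total={total}")
    return total, report


if __name__ == "__main__":
    total, report = running_totals([10, 20, 30, 40, 50])
    print("\n".join(report))
    print("total:", total)
```

With debugging software, the counter and the extra `if` would not need to exist in the source at all; many debuggers can get a similar effect from a breakpoint that has a condition or an ignore count attached.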
Many people like debugging software once they become familiar with it. There may be a learning curve, but programmers often find that the time saved outweighs the time spent learning.
- Have test data
If the software is going to be working with data, it is generally best to have some data that can be trashed, corrupted, or manipulated in any other conceivable way. If there is not a simple test-case scenario, then using a copy of actual data may be worthwhile. However, make sure that the data being used (even if it is a copy of production data) can be changed in all sorts of ways without causing problems.
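One simple way to get data that can be freely damaged is to work on a scratch copy. The following Python sketch assumes the data lives in an ordinary directory; the helper name `make_scratch_copy` and the file name `records.txt` are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def make_scratch_copy(source_dir):
    """Copy source_dir into a new temporary directory and return the copy's path."""
    scratch_root = Path(tempfile.mkdtemp(prefix="debug-data-"))
    target = scratch_root / Path(source_dir).name
    shutil.copytree(source_dir, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as original:
        Path(original, "records.txt").write_text("real data\n")
        copy = make_scratch_copy(original)
        # Anything may be done to the copy; the original is untouched.
        Path(copy, "records.txt").write_text("deliberately corrupted\n")
        print(Path(original, "records.txt").read_text(), end="")
        shutil.rmtree(copy.parent)  # remove the scratch directory when finished
```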
There may be some rare instances where that is not feasible, such as if there is reliance on an external server that only works with “production” data. If that server is run by a different organization, the server's behavior might not be alterable by the programmer. In such cases, be careful.
- Overview: describing debugging symbols
Binary executable files can generally be created with one of two methods: with debugging “symbols”, or without these “symbols”. This term, “symbols”, (typically?) refers to names: the names of functions and/or variables. Having these symbols available can help with debugging (and “reverse engineering”). As indicated before, some companies may prefer that their software is not successfully reverse engineered. (They are concerned that successful reverse engineering may help their competition.) Also, at least in some cases, a computer may run some code more slowly when the binary executable files contain symbols. So, symbols are not always included in the files.
In the case of software by Microsoft, there have been cases where Microsoft has released downloadable executable code that contains symbols. For example, Microsoft Windows Hardware Developer Central: Microsoft Debugging Tools for Windows Millennium Edition contains “a debug version of io.sys” ... “and a collection of symbol files.” Packages of symbols, entirely meant to help people using debugging software, have also been released for some newer Microsoft software. Some details may be found in crash handling in the section called “Using Microsoft's Debuggers for Windows”, and then the sub-section called “Getting/Handling Symbols”.
- Using debugging symbols
Software vendors can have symbols be created in a way that may be useful for debugging. The symbols may then be distributed by the software vendor. For an example, see crash handling in the section called “Using Microsoft's Debuggers for Windows”, and then the sub-section called “Getting/Handling Symbols”. That process will probably help most with symbols for code which is an actual part of the Microsoft Windows operating system.
For software which is “open source”, the software vendors might not distribute symbols, but may simply expect that programmers can create the symbols by using available source code. For example, crash handling: OpenBSD crash refers to using “objdump info”, including the need for “Compiling with debug info”.
Programmers use software programs to create workable machine language. For example, programmers who use the C programming language might use GCC, LLVM, PCC, or Microsoft Visual Studio. (Other options have also existed, such as Borland Turbo C++, and may even still be available.) The software that implements these programming languages may have an option to include symbols.
- Using debugging symbols in GCC
With GCC, debugging symbols are requested with the -g command line parameter: add -g to the command line that is otherwise used to compile the program. That will cause GCC to generate the debugging symbols.
- Executable handling
- Using the debugger
- This information was lengthy enough to go into a separate section: Using a debugger
- Memory segments
The details about the memory segments may be useful for computer programmers who are familiar with memory segmentation (see: Wikipedia's page for (computer) “Memory segmentation”). Executable code stored in a file, on a disk, is typically organized in a way that will match how the code will be stored later in memory. So, understanding a bit about how the code is stored in memory can help to understand some information that is stored on the disk.
Some programming languages refer to segments with names starting with a period (so: “.text” rather than just “text”).
Here is some information about some different memory segments.
- code segment (“.text”)
The “code” segment is even more commonly known as the “text” segment. Instruction code is often placed in a section called “text”.
(This may seem strange, since a basic definition of a “text file” is a file that contains only common white space characters (a simple space, and a tab) or other characters that are easy to identify when seen, and which are typically simple to type on a keyboard found in the USA. However, native machine code instructions involve using other characters.) Stack Overflow question about the text segment's name included some speculation, but no clear-cut answer. It appears that the phrase “text segment” is quite old, possibly pre-dating Unix. This would obviously mean that the term pre-dates any Unix standard, like a definition of a “text file” in Unix. It may be that the program code describes actions similar to the text of a story book, while data describes details similar to illustrations in a book.
A piece of the operating system will read the code from the executable file, and place that code into a section of memory. Also, another section of the operating system will re-claim the memory when the program exits. Modern operating systems may apply a rule which prohibits any other code from being able to write to memory that is used to store a “code segment”. (Some CPUs may also have features that may help operating systems to enforce such a rule very quickly.) This breaks an ability to use intentionally self-modifying code, but is believed to help with computer security and bug detection.
- Initialized Data Segment
This stores some variables. The variables may be provided with initial values that are located in the executable file.
- BSS segment
Some variables may be placed in the “Block Started by Symbol” memory segment (better known by its abbreviation, “bss”). The BSS may typically be stored on the disk by simply recording the size of the BSS. When the operating system starts up a program, the operating system will find the required BSS size, and then will reserve that much space in memory for the BSS, and also clear all of the bits to zero. So, information stored in the BSS may end up taking up more memory than disk space.
- Heap
This memory is used by the functions that handle dynamically allocated memory.
As an example of “dynamically allocated memory” from the programming language called C: a program may have a variable that is a “pointer”. That variable can point to the start of a section of memory. The program may then check to see how much memory is needed. If the program determines that 512 bytes are required, then the program may use a function that is called “malloc” to allocate 512 bytes of memory. The pointer can then point to the start of that section of memory. On the other hand, the program might determine that 2 kilobytes of memory are needed. The “malloc” function allows memory to be allocated/reserved after the program performs some functionality, and so this process can be helpful when the required amount of memory may vary.
So, the heap is a memory segment. The “malloc” function in C will provide memory to a part of a program, by using up some of the memory in the heap.
- Stack
The stack is used for addresses related to function calls, for storing copies of variables that are parameters, and for variables that are local to a function.
The heap is often located next to some free memory, and then the stack is often located on the other side of that free memory. For instance, Wikipedia's page on “Data segment”: section called “Heap” indicates that the heap is near the BSS. Jason W. Bacon's Computer Science 315 Lecture Notes: 10.4: Memory segments shows the stack near data, and then free space, and then the heap. Although the layout described by Wikipedia is probably more common, the concept which is usually more important is simply understanding that both the heap and the stack grow by using up some memory from a single region of free space. So when the heap grows, there is less free space available for the stack. This is often referred to as growing “towards each other”: when either the heap or the stack grows, the edge of the growing piece of memory gets closer to the edge of the other piece of memory.
The details of what a stack looks like will vary based on what calling convention is used. As an example, x86 code involving the C programming language may often use a calling convention named cdecl (which refers to how parameters are declared with the C programming language). Some details about this calling convention, or alternatives, are shown by Wikipedia's article on x86 Calling Conventions. This is not generally an issue when an entire program is made from the same programming language. However, when bits and pieces of a program use different calling conventions, functions that wish to call other functions need to make sure that the right calling convention is used.
Actually, these segment descriptions may be fairly common for multiple computers, but are probably environment-dependent. So, code running on a different operating system, and on a different processor, may have memory segmented differently. As a simple example, around the time that x64 processors began to be released, CPUs started to provide features to help enforce rules that specified that only certain software could write to specific segments of memory. Older processors did not support such features.
It seems that an executable file containing machine language instructions will typically be laid out in a format that has the data on the disk match the way that data will be organized in memory. So, if a variable is going to be stored in the “.text” section of memory, then the executable file may also be storing that variable in a section called “.text”. One exception may be the section that is called the BSS.
Since the code on the disk is stored according to which memory segment the code will be copied to, and since rules get applied to memory segments, we can therefore conclude that the rules also correspond to the code on the disk.
Displays the size of segments.
Writes names found from a file. So, this program will look in the specified filename, and output information about the “names” found. These “names” are also called “symbols”, and tend to be the names given to functions and global variables (and perhaps other variables?)
This will often give symbolic names.
Each symbol will have some location data, and then a name. The most interesting column to start with may be the third column, which shows the names of “things” like functions or global variables. Names that start with __ (two underscores) are typically names reserved for the compiler and the system libraries, rather than names chosen by the program's author. Names that start with _Z may be C++ functions/methods which were given unpleasant-looking (“mangled”) names. Using the
--demangle) parameter may help make the names a bit nicer. (Furthermore, there is an optional parameter to the
-C parameter, so that the full parameter looks like “
The first column shows a location within a program. This location is an address. Command line parameters may alter how the first column is rendered, but the default is a hexadecimal number. For instance, the first column may say that a specific function is at the exact location of 0x3B. (The output of
may show this in lowercase, and without a leading 0x specifier, and with enough leading zeros to make the address take up a specific amount of width. So, for example, the output may look like “
The second column provides some more summary details about the location. Specifically, this column helps to identify the “segment” of memory that will be used to store any data related to this symbol. For example, if the
main function's location is at 0x3B, and 0x3B is in the “.text” section of the executable file, then
may show the letter “T”.
(For more details about these locations, see the section on executable segments.)
The output in the second column may be useful for more than just tasks related to memory layout handling. Details about the location of code may help to identify what file contains code, which may be useful when dealing with a program that stores some code in external libraries. (For example, some external libraries may come from sources that may be trusted more than code contained in a different location.) An uppercase letter indicates that the symbol is “global” (also called “external”), which means that the symbol is visible to code outside of the object file that defines it. As a generalization, with a few exceptions, a lowercase letter indicates that the symbol is local. The common exceptions (“
”, and “
”) are documented by the program's man page.
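The three-column layout described above (address, one-letter type, name) can be picked apart with a few lines of code. The Python sketch below assumes the listing comes from a symbol-listing tool such as nm, with one symbol per line; the sample lines, the `Symbol` type, and the `parse_symbol_line` helper are all hypothetical and are not output captured from a real program.

```python
from typing import NamedTuple, Optional


class Symbol(NamedTuple):
    address: Optional[int]  # None when the line has no address column
    kind: str               # the one-letter code from the second column
    name: str               # the symbol name from the third column


def parse_symbol_line(line):
    """Split one 'address type name' line into its columns.

    Assumes addresses are zero-padded (longer than one character), so a
    one-character first field must be the type letter of a line that has
    no address, such as an undefined symbol.
    """
    first, rest = line.split(None, 1)
    if len(first) > 1:
        kind, name = rest.split(None, 1)
        return Symbol(int(first, 16), kind, name.strip())
    return Symbol(None, first, rest.strip())


if __name__ == "__main__":
    for sample in ("0000003b T main", "         U printf"):
        symbol = parse_symbol_line(sample)
        scope = "global" if symbol.kind.isupper() else "usually local"
        print(symbol.name, symbol.kind, scope, symbol.address)
```

The uppercase test in the usage example is only the general rule; the special lowercase letters documented in the man page are not handled.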
to show the headers/names of sections. Note that these names might not be guaranteed accurate, as the program will sometimes need to show the “usual addresses”. However, in many cases, this will show the names of the sections.
Like many other commands that will be discussed, it may be useful to append “
$PAGER variable is set. If it is not, set that variable to point to the name of a paging executable (such as
or a “text editor” (like
. Details may be found at OS environment variables. For example, sh-compatible shells may run
To view contents with segment names:
To see literal strings, use: “
”. Another command, designed to output any strings that have sufficiently long sequences of characters that appear to be rather readable, is the “strings” command. “
objdump -s -j .rodata
The actual machine code instructions are likely to be stored in a section called “
”. See if that section exists. The entire section may be viewed by running:
objdump -s -j .text
- How to view disassembled code
Next, try to understand those instructions. To do so, use the disassembler that is built into
objdump -D -j .text
That will show the code starting at the .text section. That is probably what is wanted, but maybe not. Running:
will show a “
” hexadecimal address. That may frequently correspond to the start of the “.text” section, which may be recognized as the beginning of a function called _start (with one underscore) or, perhaps less likely, __start (with two underscores at the beginning of the name). The hexadecimal address (without the “
start address 0x
” prefix) may also be used as a search string with
, as follows:
In theory, if the address is short (e.g. 7 digits), padding the address with leading zeros (e.g. to 8 digits for a 32-bit executable) may be good to narrow down
's results. A caret (“^”) could even be added to the front of the search pattern to make sure the address is only matched at the beginning of a line.
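The effect of zero-padding the address and anchoring it with a caret can be seen in a short Python sketch. The disassembly text in `SAMPLE` is made up for the illustration (it only imitates the general shape of a listing), and `find_address_lines` is a hypothetical helper, not part of any tool.

```python
import re

# Hypothetical listing text, invented for this example.
SAMPLE = """\
08048946 <_start>:
 8048946:  31 ed            xor    %ebp,%ebp
 8048948:  5e               pop    %esi
 804d1f4:  e8 4d 77 ff ff   call   8048946 <_start>
"""


def find_address_lines(listing, address, width=8):
    """Return lines that begin with `address` zero-padded to `width` hex digits."""
    text = address.lower()
    if text.startswith("0x"):
        text = text[2:]
    padded = text.zfill(width)
    # The caret anchors the match to the beginning of each line.
    pattern = re.compile("^" + re.escape(padded))
    return [line for line in listing.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    loose = [line for line in SAMPLE.splitlines() if "8048946" in line]
    print(len(loose), "lines contain the unpadded address somewhere")
    for line in find_address_lines(SAMPLE, "8048946"):
        print("anchored match:", line)
```

In this sample, the unpadded search also hits the instruction line and the call that refers to the address, while the padded and anchored search leaves only the line that labels the function.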
Once the starting address is clear, run
-d or, for more information,
-D. For even more information, add more parameters, such as:
objdump -D -F -r -R -S -x -w
The output of
may include some function names. Some compilers may mangle the function names, and so
will then see the mangled function names.
can try to de-mangle the function names. In particular, if C++ is used, then using
may be helpful. (That is, include all the other parameters like
-D and the filename; just add a “
-C” to the beginning of the parameters.) For some compilers, instead of using
-C, the function names may be de-mangled by using “
--demangle”. A program called
may also perform a similar function (and be useful when dealing with other programs, like
, based on commentary from StackOverflow question on C++ mangling. See also: Wikipedia on name mangling (section on C++), man page for
which is recommended by man page for
(section on demangling)).
- Understanding disassembled code
The instructions will likely involve setting some values, using instructions like MOV or PUSH. Note that these instructions are likely to be in the AT&T syntax, so the source comes before the destination. So, “MOV %esp,%ecx” copies the contents of %esp into %ecx. This is the opposite order of Intel's syntax for assembly language, which specifies the destination first.
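The difference in operand order can be shown with a deliberately tiny Python sketch that rewrites a register-to-register instruction from AT&T order to Intel order. The helper `att_to_intel` is hypothetical and handles only the simplest case: two register operands, with no memory references, immediates, or size suffixes.

```python
def att_to_intel(instruction):
    """Reorder a two-operand, register-only AT&T instruction into Intel operand order."""
    mnemonic, operands = instruction.split(None, 1)
    source, destination = (part.strip() for part in operands.split(","))
    if not (source.startswith("%") and destination.startswith("%")):
        raise ValueError("only register-to-register operands are handled in this sketch")
    # AT&T: source first, registers prefixed with %.
    # Intel: destination first, no % prefix.
    return f"{mnemonic.lower()} {destination[1:]}, {source[1:]}"


if __name__ == "__main__":
    print(att_to_intel("MOV %esp,%ecx"))
```

Running the example on “MOV %esp,%ecx” gives “mov ecx, esp”: the same copy from %esp into %ecx, written destination first.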
Eventually, most programs will have a code jumping instruction, such as “jmp”, some other instruction that starts with j (like “je”), or perhaps most likely, “call”. That is basically a jump to the start of a function.
If the instruction is a “jmp” instruction, and the syntax being used is AT&T syntax, and the address is prefaced with an asterisk, then the instruction pointer is not going to start pointing to the hexadecimal number after the asterisk. Instead, the program is going to look at the specified location in memory, and read an address that will be jumped to. Note that when reading the address, the byte order may be swapped (due to endianness). So, if the specified address is *0x804d1f4 (which is the same as *0x0804d1f4), and the bytes at that location are 0xf6890408, reading each byte in backwards order will yield 0x080489f6. So that is the address of the next instruction. (The easiest way to guess whether the bytes will be swapped is to look to see if the destination is a number that looks similar to the address where the JMP instruction is found. In this case, they both start with 0x0804, so the byte swap seemed appropriate.) Of course, the computer doesn't guess; it just proceeds with whichever endianness rules the computer is actively using.
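The byte swap can be checked mechanically. This Python sketch takes the four bytes in the order they sit in memory and reads them as a little-endian number; the function name `read_little_endian_address` is hypothetical, and little-endian order is assumed because the example involves x86 code.

```python
def read_little_endian_address(raw_bytes):
    """Interpret bytes, given in memory order, as a little-endian unsigned integer."""
    return int.from_bytes(raw_bytes, byteorder="little")


if __name__ == "__main__":
    stored = bytes.fromhex("f6890408")  # bytes in the order they appear in memory
    target = read_little_endian_address(stored)
    print(f"0x{target:08x}")
```

For the bytes f6 89 04 08, this prints 0x080489f6, which shares its leading 0x0804 with the address of the jump table entry and so looks like a plausible code address.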