Crash Handling

This web page is about crash handling. The parent web page, Errors, has further information about other types of errors which may be related, such as system reboots and kernel panics (such as a BSOD in Microsoft Windows, or a “kernel panic” in Unix). The section about handling a system panic might also provide some commentary about the tools used for crash handling.

[#crashrep]: Crash Reporting
[#wnerrrep]: Windows Error Reporting (“WER”) (previously “Online Crash Analysis” (“OCA”))

(The information on this page may duplicate information on the Windows Error Reporting (“WER”) (previously “Online Crash Analysis” (“OCA”)) page. Crash reporting is also relevant, and should be referred to from this page.)

Some of this information may have been written before it was found that details may still be available. (For instance, are some submitted details available online?) This text might therefore spread a bit of caution which, in hindsight, may be undue. This may be discussed a bit further by: Save the data! However, such caution might still be sprinkled into more of the text on this page.

Overview

(This information is at: Windows Error Reporting (“WER”) (previously “Online Crash Analysis” (“OCA”)) page (“Overview” section).)

What (not) to do when seeing the “Windows Error Reporting” screen

(This information is at: Windows Error Reporting (“WER”) (previously “Online Crash Analysis” (“OCA”)) page (section titled: What (not) to do when seeing the “Windows Error Reporting” screen).)

First, be aware of what to do and what not to do with the program. There may be an option to view more details. It is recommended that this option be taken.

Overview: Data that gets saved
See: Windows Error Reporting (“WER”) (previously “Online Crash Analysis” (“OCA”)) page
Others
Mozilla Crash Reporter, Wikipedia's article on “Crash Reporter” (and, naturally, debugging info). (For further details, see the Crash Reporting page.)
Debugging/analyzing crash info
Approach Overview

This guide is mainly focused on steps that can be performed simply, and relatively quickly, and which stand a fairly good chance of getting some solid information that will help to be able to identify or fix a problem.

Low level code

First, let's look quickly at what this section of text does not try to cover: analyzing the low-level assembly code instructions used by a program. That concept may be discussed by some other text. Specifically, the section titled “using a debugger” may discuss interacting with assembly-level instructions, and the section about debugging code may have other useful, related information. Some familiarity with assembly language can also be helpful. (An assembly language guide has not yet been posted at the time of this writing, but may be added soon to the coding section, or a sub-section of the coding section.)

Full understanding of a computer's behavior can be achieved by looking at the instructions that the computer will follow, and understanding those instructions. This particular section of text does not provide many details about that process because, unfortunately, that process does have some problems. Understanding those instructions typically requires some training. Even a loop can take quite a bit of time for an untrained eye to understand. Many programming languages make function calls easy to write, but even a simple function call may be fairly complex in assembly language, as several steps are taken to perform tasks like identifying memory that will be used for variables. Even for people who understand the meaning of the individual assembly-level instructions, the process may take a very long time if they have limited experience reading the instructions inside of an executable.

That overview was largely written based on x86 assembly language; many other assembly languages are likely similar. There continue to be ongoing efforts to make computers easier to use for not only end users, but also computer programmers. At the time of this writing, most or all popular computing platforms typically use assembly-level instructions that are individually simple, and so accomplishing useful tasks will require analyzing several instructions.

The contents of this section of documentation focus more on what information can be obtained without studying the assembly-level instructions. Advantages of these methods are a smaller learning curve, frequently being able to get information fairly quickly, and allowing the system to try to resume functionality fairly soon.

Mis-identification can be easy

Throughout this section (no matter whether the debugging technique involves using WinDbg on Microsoft Windows, or ddb in a BSD operating system), do remember this detail:

[#misidclp]: Avoiding mis-identification

One step that is often used, when debugging, is to figure out what code is active. However, just because code was active at the time when a problem occurred does not necessarily mean that code was at fault for creating an error.

How to solve Windows system crashes in minutes (Page 4 of 4) explains two possible scenarios (which, despite the title of the page, are not limited to Microsoft Windows). One is in the section called “The operating system is the culprit”. It notes that when core pieces of the operating system (such as a kernel, or an important piece of a subsystem used by all users of a popular operating system) are identified “as the culprit, and they often are, don't be too quick to accept it. It is far more likely that some” other software “called upon” the operating system “to perform an operation and passed a bad instruction, such as telling it to write to non-existent memory. So, while the operating system certainly can err, exhaust all other possibilities before you call Microsoft! The same goes for debugging Unix, Linux, and NetWare.” (Naturally, the likelihood of an operating system error may be higher when using any sort of “beta” operating system code, which became available before the official release date when the vendor starts recommending the code to average end users.)

The web page that was just quoted also gives one other key example of software that is very often active (but not the true source of the problem). That type of software is anti-malware software. Such software is often active as it performs its task, both when things are working fine and also when problems occur. Now, granted, real-time anti-virus software is known to be a source of incompatibilities and other problems, but chances are quite high that there may be some other cause. If troubleshooting suggests that anti-malware software is experiencing a problem, don't discount that possibility: there may be a problem that is caused by the anti-malware software (combined with any data files used by the anti-malware software) that is active. However, before just blaming the anti-malware software vendor, do take a moment to consider whether the problem originated with the anti-malware software, or maybe from another source.

Another possibility may be a driver file. The driver may have failed due to a hardware problem. An example of a hardware problem showing up as a driver error, which may be very common, is a graphics card that acts imperfectly due to becoming too hot. In the case of problematic hardware affecting a driver, replacing the driver may or may not provide any decently substantial benefit to system stability.

Getting dump information
Operating system dump files

The operating system may generate a dump file of the running operating system.

[#enabldmp]: Enabling and configuring automatic dumping

This section is about how to automatically respond to situations by causing a dump file to be made. See intentionally creating a crash report for details about (the least disruptive methods of) manually, intentionally causing a dump file to be created.

Microsoft Windows

Microsoft KB Q129845 showed that these memory dump files could be useful, as “a Microsoft support professional may be able to debug the dump file.” However, self-service may be less expensive than paid Microsoft support. Such self-service might also benefit from the presence of a dump file, so enabling these is recommended.

Microsoft KB Q254649: Overview of memory dump file options for Windows Vista, Windows Server 2008 R2, Windows Server 2008, Windows Server 2003, Windows XP, and Windows 2000 has info on this.

For more details about having dump files be saved, see: Microsoft KB Q130536: Windows does not save memory dump file after a crash, Microsoft KB Q886429: What to consider when you configure a new location for memory dump files in Windows Server 2003.

[#oslogdmp]: Logging to the system log when the operating system makes dump information
BSD
The OpenBSD manual page for the topic “crash” indicates that a sysctl called “ddb.log” may result in information being dumped into the system log that is viewable with dmesg. (A copy of this system buffer may be stored when the system is first started, placed in the /var/run/dmesg.boot file.)
Microsoft Windows
Go to the System Properties (by running control sysdm.cpl or holding the Windows key on the keyboard and pressing the [Pause/Break] key, or going to the System icon in the Control Panel and, if in Vista or newer, choosing Advanced system settings; any of these methods may require UAC permission if using Vista or newer), and choose the Advanced tab. In the “Startup and Recovery” section, choose the “Settings...” option.
[#dumptype]: Details about the dump types
Dump types in Microsoft Windows

Microsoft Windows can create various types of dump files. The types that store less information can take up far less disk space, be transferred more quickly, and possibly be opened and analyzed more quickly. However, they may be more difficult to work with, because getting useful information from them may require some additional files. Setting up the debuggers to use these additional files may take additional time. For someone who has little experience debugging, the easiest type of file to debug with may be the largest files.

Professional programmers who know how to effectively use the small files may prefer to use the small files, at least in part because those smaller files may be easier for end users to deal with (since the smaller files may be: faster to create, less demanding of storage space, easier to transmit without exceeding file size limitations, and faster to transfer). However, this guide may still be rather introductory in nature, and so does not explain the details of the complexities of how to effectively deal with the smaller files.

Microsoft KB Q254649: Overview of memory dump file options for Windows Vista, Windows Server 2008 R2, Windows Server 2008, Windows Server 2003, Windows XP, and Windows 2000 discusses: Complete memory dump, Kernel memory dump, Small memory dump (64KB).

Complete memory dump

Q254649 says “If you select the Complete memory dump option, you must have a paging file on the boot volume that is sufficient to hold all the physical RAM plus 1 megabyte (MB).” Also note that it says “The Complete memory dump option is not available on computers that are running a 32-bit operating system and that have 2 gigabytes (GB) or more of RAM.”
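The paging file requirement quoted above is simple arithmetic, and can be sketched as follows. This is only an illustration: the function name is made up for this example, and it assumes that “1 megabyte” means 1024 × 1024 bytes.

```python
MIB = 1024 * 1024  # assumption: "1 MB" is read as 1024 * 1024 bytes


def min_pagefile_bytes_for_complete_dump(physical_ram_bytes: int) -> int:
    """Return the paging file size quoted from Q254649: all RAM plus 1 MB."""
    if physical_ram_bytes <= 0:
        raise ValueError("physical RAM size must be positive")
    return physical_ram_bytes + MIB


# Example: a machine with 4 GB of physical RAM.
ram = 4 * 1024 * MIB
print(min_pagefile_bytes_for_complete_dump(ram) // MIB, "MB")  # prints "4097 MB"
```

The result is a minimum for the paging file on the boot volume, not a statement about how large the resulting dump file will be.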

It seems that some operating systems are disabling this as an option, by default. Enabling this may require modifying the registry. For instance, TrendNet's info on generating a full memory dump on Windows 7 and Windows Server 2008 R2 (as cached by Google, in text-only mode) says to set a specific registry entry to 0x1. These commands would do that:

REG QUERY HKLM\SYSTEM\CurrentControlSet\Control\CrashControl /v CrashDumpEnabled
echo e.g., pre-existing value of 0x7
REG ADD HKLM\SYSTEM\CurrentControlSet\Control\CrashControl /v CrashDumpEnabled /t REG_DWORD /d 1
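The value printed by REG QUERY is a small number, and a lookup table can make it easier to read. The following is a minimal sketch: the table reflects the commonly documented meanings of CrashDumpEnabled rather than anything stated by the article referenced above, so treat it as an assumption and verify it against Microsoft's documentation. The function name is made up for this example.

```python
# Commonly documented meanings of CrashDumpEnabled (an assumption here,
# not something taken from the article quoted above).
CRASH_DUMP_TYPES = {
    0: "None",
    1: "Complete memory dump",
    2: "Kernel memory dump",
    3: "Small memory dump (64KB)",
    7: "Automatic memory dump",
}


def describe_crash_dump_enabled(reg_query_value: str) -> str:
    """Translate a value such as '0x7', as printed by REG QUERY, into a name."""
    number = int(reg_query_value, 16)  # int() accepts the leading "0x" in base 16
    return CRASH_DUMP_TYPES.get(number, f"Unknown value ({reg_query_value})")


print(describe_crash_dump_enabled("0x7"))  # Automatic memory dump
print(describe_crash_dump_enabled("0x1"))  # Complete memory dump
```

This only interprets text copied from the REG QUERY output; it does not read or change the registry.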
Kernel memory dump

This may be substantially smaller than a complete memory dump, which may be nicer in part by allowing debugging software to process the data more quickly.

Small memory dump

Although Q254649 refers to “Small memory dump (64KB)”, it also says “This option requires a paging file of at least 2 MB on the boot volume”. A potentially nice part about this is “A history of these files is stored in a folder.”

Microsoft KB Q315263 says “Because a small memory dump file contains limited information, the actual binary files must be loaded together with the symbols for the dump file to be correctly read.” (This is done with the “-i” parameter of WinDbg or kd.)

So there's the overview.

In Channel 9 MSDN: Defrag Tools: WinDbg/Bugchecks (starting about 9min58sec), the guest said “For the vast majority of blue screens, if you're just wanting to look at them yourself in the debugger, a kernel dump is probably good enough.” He went on to note that kernel memory dumps are sufficient for dealing with most types of situations, but these dump files really only provide limited access to what applications/programs are doing, so complete memory dumps are more useful in some cases.

Dump data in Unix

If the operating system crashes, but it still looks safe enough to create a dump file, the data may be stored to swap space. See also: savecore.

Dump files for other programs

ProcDump (by Sysinternals) may be able to get a dump file of a process.

Interacting with debugging software
[#wndbgsfw]: Debugging software for Microsoft Windows
Using NirSoft's software

This may not provide as much information as extensively using Microsoft's Debuggers for Windows, but this software may be fairly fast to use.

Blue Screen View reads information from a MiniDump file. This may not provide more raw data than what the Blue Screen showed, but it is a way to conveniently get the information. Also, it may present some information in a way that is more easily understandable.

LiveKd (by Sysinternals)

LiveKd, released by Sysinternals (which was a separate organization but is now part of Microsoft), may allow debugging commands to work with a live environment rather than needing to first create a dump file that gets analyzed.

This sounds interesting, but details on using this software are not currently provided by this guide.

Using Microsoft's Debuggers for Windows
Introduction/Details about what may be expected

Note: This guide does not have a lot of details about effectively using this software, and may not provide all of the steps needed to reach a full resolution. It does cover the basic steps for getting the critical software installed and performing some basic commands that gather useful information. Details about how to gather yet more information, or how to analyze the results of those commands, are not yet fully described by the current version of this text.

Many times, successful decisions may be made by IT personnel without needing to rely on debugging. For instance, debugging tools may help to identify what line of source code led to the assembly instructions that caused a detected error to occur. That could be awesome information for a computer software developer, but may not really be useful to someone without access to the source code. A software bug might even have already been fixed by the software developer. If a network administrator upgrades some software to the later version, the issue might be resolved even if the network administrator hadn't quite figured out the technical reason why the suspect software had been failing.

However, if there is a desire to try these tools, this guide may help complete some of the early steps.

Overview of the software

CodeProject.com's “Windows Debuggers: Part 1: A WinDbg Tutorial” describes WinDbg as well as other debugging software: KD, CDB, NTSD, and Visual Studio (including Visual Studio .NET).

NTSD
CodeProject.com Debug Tutorial Part 1: Beginning Debugging Using CDB and NTSD discusses CDB and NTSD (and WinDbg). It notes, “Windows 2000 and higher systems generally have NTSD already installed on the system! This is a big bonus as you do not need to install any extra software for quick debugging.”
CDB
CodeProject.com: how to Use and Understand the Windows Console Debugger discusses using this software (including using the software on some private code).
WinDbg

This program provides a GUI. (Yet, commands can get typed. So, there is a bit of a “command line”-type of feel to the program.)

This comes as part of a software package that Microsoft calls “Debugging Tools for Windows”.

To get this:

The easy way

Get from http://download.microsoft.com/download/A/6/A/A6AC035D-DA3F-4F0C-ADA4-37C8E5D34E3D/setup/WinSDKDebuggingTools/dbg_x86.msi

If that saved you time, then consider up-voting Wu Yongzheng's answer... as that is how the webmaster of this web page found the URL.

Version 10
You may start by using this hyperlink: “Get Debugging Tools for Windows (WinDbg) (from the SDK)”. (That hyperlink was found from this Developer page.) After starting to run the 1.1MB SDKSETUP.EXE, you will be presented with an option. This guide recommends choosing “Download” ... “for installation on a separate computer”. That will provide you with a convenient copy of the installer. Then, under the user's Downloads folder (%USERPROFILE%\Downloads\), look under Windows Kits\##\StandaloneSDK.

Version 7
497KB downloader: winsdk_web.exe, found from
https://www.microsoft.com/download/confirmation.aspx?id=8279 (which might just redirect to https://www.microsoft.com/en-us/download/confirmation.aspx?id=8279). (That hyperlink was titled “Get the standalone debugging tools for Windows Vista as part of Windows 7 SDK”, found from the Windows Driver Kit page.)
Getting/Handling Symbols

Symbols may be needed. There should be a place on the hard drive where symbols may be placed. In this example, a directory of C:\SymPath will be used for this purpose.

Some symbols may also be used from a remote location. Once this is set up, software may be able to just obtain symbols automatically from the Internet. This may be the most convenient method, so this method is the recommended approach to start with.

Some symbols may or may not be available. A lack of symbols may or may not be a problem. For instance, the home page for LiveKD (by Sysinternals) says, “The Microsoft debugger will complain that it can't find symbols for LIVEKDD.SYS. This is expected, since I have not made symbols for LIVEKDD.SYS available, and does not affect the behavior of the debugger.” In some cases, symbol files that may be useful might not be made available to the general public. (This might be due to competitive reasons, hoping to limit the competition's ability to understand some aspects of the software, or perhaps is done to limit self-service. The expectation may be that end users will submit debug information to a software developer who may have access to the symbols.)

Packages of symbols have been released: Symbols download page.

Handling symbols
Setting the location of public symbols files using a menu option in WinDbg

Choose the File menu, and then “Symbol File Path”. (magicandre1981's answer to yilduz's SuperUser.com question on WinDbg symbols indicates that Ctrl-S can also be used.)

If you don't have any downloaded symbols, and you want to just download them as needed, you may use:

SRV*http://msdl.microsoft.com/download/symbols

If the local directory is going to be C:\SymPath then a value to use may be:

SRV*C:\SymPath*http://msdl.microsoft.com/download/symbols

Multiple directories for symbols can be used, as follows:

SRV*C:\SymPathOne*C:\SymPathTwo*http://msdl.microsoft.com/download/symbols

Then use the menu to close the workspace, and save the newly modified workspace. TechRepublic's guide to using WinDbg notes “This should lock in the Symbol path.”

MS KB311503 specifies an alternative, _NT_SYMBOL_PATH environment variable. e.g.: Set _NT_SYMBOL_PATH=symsrv*symsrv.dll*f:\localsymbols*http://msdl.microsoft.com/download/symbols

See also the note below about using a reload command.
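The three symbol path values shown above follow one pattern: the word SRV, then zero or more local directories, then the server URL, all separated by asterisks. The following sketch simply assembles such a string; the function name is made up for this example, and the directory names are the same example directories used above.

```python
MS_SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def build_symbol_path(*local_dirs: str, server: str = MS_SYMBOL_SERVER) -> str:
    """Join SRV, any local directories, and the server URL with '*'."""
    return "*".join(["SRV", *local_dirs, server])


print(build_symbol_path())
print(build_symbol_path(r"C:\SymPath"))
print(build_symbol_path(r"C:\SymPathOne", r"C:\SymPathTwo"))
```

The printed strings match the three values shown earlier, and may be pasted into the “Symbol File Path” dialog.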

Use a command in WinDbg
.sympath+ SRV*C:\SymPath*http://msdl.microsoft.com/download/symbols
Getting symbols

MS KB Q311503 notes, “http://msdl.microsoft.com/download/symbols is not browseable and is only intended for access by the debugger.” (This is despite looking like a URL that may be designed for HTTP access.)

As previously mentioned, symbols can be obtained as needed online. However, sometimes additional symbols are needed. Also, perhaps there may be a desire to have the symbols for offline use.

Directions for downloading the symbols in WinDbg are at MSDN page titled “Download and Install Debugging Tools for Windows”. (That web page is now redirected to from what may be an older URL, Microsoft DDK Debugging.)

MSDN page titled “Download and Install Debugging Tools for Windows” says “OS symbols are usually installed in the %SYSTEMDIR%Symbols directory.” (It looks like there may be a typo there, and perhaps what is really meant is %SYSTEMDIR%\Symbols\. However, %SYSTEMDIR% seems to not be set: Perhaps %SystemDrive% or %SystemRoot% was intended?)

Note that different operating systems, or even different service packs of an operating system, may require different symbols. If the machine is debugging a crash dump created by another machine, the symbols to use will be the symbols related to the software running on the machine that created the crash dump. Downloading Symbol Files might provide the symbols needed.

Analyzing a crash dump

Make sure symbols are handled. Open a crash dump file. Perform some commands.

Some good commands to use in very many scenarios

Effectively using these commands is a great idea. Even if a person doesn't fully understand how to interact with each command, simply knowing enough to be able to run the commands, copy the output, and paste the output (into a text file, an E-Mail program, or some other destination) may help. Then a person who knows how to interpret the output may be able to make some determinations just by using the output, without needing to obtain the minidump file, and without needing to install and set up debugging software (including grabbing symbols, etc.).
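For someone who prefers not to copy and paste by hand, the same idea can be scripted with the console debugger (kd) from the same “Debugging Tools for Windows” package. The following is only a sketch under stated assumptions: it assumes that kd is on the PATH, the dump file location is a hypothetical example, and the function names are made up for this illustration. It relies on the -z (dump file), -y (symbol path), and -c (initial commands) options, and uses two commands that are covered later in this section.

```python
import subprocess


def build_kd_command(dump_file, symbol_path, commands):
    """Build an argument list that opens a dump, runs commands, then quits."""
    script = "; ".join([*commands, "q"])  # "q" ends the debugging session
    return ["kd", "-z", dump_file, "-y", symbol_path, "-c", script]


def save_analysis(argv, report_file):
    """Run the debugger and save whatever it printed (not called below)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    with open(report_file, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)


argv = build_kd_command(
    r"C:\Windows\MEMORY.DMP",  # hypothetical dump file location
    r"SRV*C:\SymPath*http://msdl.microsoft.com/download/symbols",
    [".reload /f /v", "!analyze -v"],
)
print(argv)
```

Calling save_analysis(argv, "report.txt") on a machine that has the debugger installed would then leave the output in a text file that can be sent to somebody else.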

Some of the following may be good to use:

.reload /f /v

magicandre1981's answer to yilduz's SuperUser.com question on WinDbg symbols suggests running .reload /f /v to load symbols (and before using “!analyze -v”). Though, that instruction was placed after details about using a menu to specify where symbols were, so this may not be necessary in all situations.

!analyze -v

Wikipedia's article on WinDbg: section about “!analyze” states, “The most commonly-used extension is !analyze -v, which analyzes the current state of the program being debugged and the machine/process state at the moment of crash or hang. This extension is often able to debug the current problem in a completely automated fashion.”

Here are some details about how some of the output may be helpful:

Bugcheck details
See the bugcheck name (e.g. KERNEL_MODE_EXCEPTION_NOT_HANDLED) and the bugcheck code (e.g. 0x8e). (Examples came from How to solve Windows system crashes in minutes (part 3 of 4).)
Debugging Details: Default Bucket ID

The following summary comes from information at How to solve Windows system crashes in minutes (part 3 of 4). That web page suggests finding the text “Debugging Details:”. Then search for the text “DEFAULT_BUCKET_ID”. The Default Bucket ID “provides the general category of the failure.”

If the Default Bucket ID shows...

DRIVER_FAULT

Then the issue seems to be due to a driver's behavior. If hardware has problems, then often the associated drivers may have problems, so maybe this is a hardware problem. However, let's start by seeing if the driver is outdated and can be replaced.

The first step will be to try to find out which driver is causing the problem. To do that, locate the following line(s) that may refer to an “IMAGE_NAME”.

CODE_CORRUPTION
e.g. Dims's SuperUser.com question showing CODE_CORRUPTION has many other occurrences of the word “corruption”, including “memory_corruption”.
0xBE_nt!MiRaisedIrqlFault

e.g., SuperUser page/answer showed an example. The IMAGE_NAME was “memory_corruption”

Basically, the “Bucket ID” is often the result of Microsoft categorizing problems after solutions were identified. By seeing that bucket ID, you may quickly receive a categorization that Microsoft has been able to determine based on prior analyzed memory dumps.

IMAGE_NAME

As described in the prior section about interpreting bucket IDs, this may have useful names like “memory_corruption” (indicating a RAM issue), or a filename which might be a file at fault.

“PROCESS_NAME”

This might be rather useful for finding a related filename (similar to “IMAGE_NAME”).

Syntax can include showing some memory addresses, followed by something like “PROGRAM_NAME|FUNCTION_NAME|more-details...”. For instance, the stack trace may use that.
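When the output of “!analyze -v” has been saved as text, the fields discussed above can be picked out automatically. The following is a minimal sketch; the function name is made up for this example, the sample text is hypothetical and heavily shortened, and the exact layout of real output may differ between debugger versions.

```python
import re

# Match lines such as "IMAGE_NAME:  example.sys" for the fields of interest.
FIELD_PATTERN = re.compile(
    r"^(DEFAULT_BUCKET_ID|IMAGE_NAME|PROCESS_NAME):[ \t]*(\S.*?)[ \t]*$",
    re.MULTILINE,
)


def extract_fields(analyze_output: str) -> dict:
    """Return the first value seen for each field of interest."""
    fields = {}
    for name, value in FIELD_PATTERN.findall(analyze_output):
        fields.setdefault(name, value)
    return fields


# Hypothetical, shortened output made up for this illustration.
sample = """\
Debugging Details:
------------------
DEFAULT_BUCKET_ID:  DRIVER_FAULT
PROCESS_NAME:  System
IMAGE_NAME:  example.sys
"""
print(extract_fields(sample))
```

Only the first occurrence of each field is kept, since a field name may show up again further down in the output.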

!analyze -vv
This may be even more verbose. (It also does not work in WinDbg 6.12.0002.633 X86 so maybe this is only for some newer versions?)
.lastevent
A period followed by the word “lastevent” might be helpful, perhaps being similar to “!analyze -v”.
lmv

This is used by How to solve Windows system crashes in minutes (Page 4 of 4). The “lmv” command is basically two parts: “lm” specifies a command to show loaded modules, and “v instructs the debugger to output in verbose (detail) mode, showing all known details for the modules.”

This probably won't fully resolve the problem, but it provides information. For details on how to use this information, see the just-referenced web page (“Page 4 of 4”).

lm tn
This seems to be recommended by TechRepublic's guide to using WinDbg. Other websites may reference “lm t n”, so try using that if “lm tn” doesn't seem to be accepted.
!chkimg -lo 50 -d !nt

This can detect mismatches between an executable on a disk, and the executable in RAM. As those bytes seem like they may be a relatively small number of bytes, if a problem is detected, such a problem is probably quite significant.

MSDN on !chkimg may provide more info, while magicandre1981's answer to Dims's SuperUser.com question: “How to find a reason of often system hangs and BSODs?” shows an example of this check (specifically, “!chkimg -lo 50 -d !nt”) successfully noting a problem.

Some other commands

Other commands are shown in the “Commands” section (which starts with “Basic Commands”) of CodeProject.com's “Windows Debuggers: Part 1: A WinDbg Tutorial”. CodeProject.com: Auto Memory Dump on Crash of an Application cites a “Using Debugging tools for windows” help file and describes some other commands. The “Debugging bugchecks” section of the section on system panics has some more commands, too.

Channel9 MSDN Defrag Tools: WinDbg Bugchecks mentions a number of other commands:

.hh

It seems this just opens the “Debugging Tools for Windows” help file. For instance, just running “.hh” may be equivalent to opening C:\Program Files (x86)\Debugging Tools for Windows (x86)\debugger.chm and scrolling to the bottom of the first page, and choosing “List of Tools and Documentation” (which may help to fully populate the left frame) and then choosing the first sub-topic in the left frame (“Legal Information”).

Using “.hh !analyze” will lead to available help for the “!analyze” command. (Actually, “!analyze” is called an “extension”, so the correct help topic will be “!analyze extension”. However, looking up “.hh !analyze” will end up correctly suggesting the “!analyze extension” topic.)

!analyze -show 0x#
may show help for a Bugcheck code (specified after the “0x” part)
.trap and !pte

Channel9 MSDN Defrag Tools: WinDbg Bugchecks (about 19min27sec) discusses how the “STACK_TEXT” leads with a line that says “Trap0E”. To get more info, he uses the .trap command shown by a “TRAP_FRAME” text. e.g., “.trap 0xffffffff927c4b3c”. He then interpreted an Assembly command to figure out that EAX had a memory address (of 9906f3f8). Then, he used “!pte” to translate that virtual address into a physical memory address. (Although, if the result shows as “not valid”, that may not be possible.)

!process
Results may be more thorough when using a complete memory dump. When using a complete memory dump, using “!process 0” shows all processes running at the time. The second parameter can specify how much information to show. Using “!process 0 0” may show a minimal amount of information, while “!process 0 1” may show more.

Using “!process 0 0” shows information on each process, starting with something like “PROCESS #”.

That memory address can be useful to get more details. You can then type “!process # 17” for information on threads. Using “.hh !process” will lead to available help for the “!process” command. (Actually, “!process” is called an “extension”, so the correct help topic will be “!process extension”. However, looking up “.hh !process” will end up correctly suggesting the “!process extension” topic.)

The “Cid” reported is the process's ID, often called a PID. The reported Cid value is in hexadecimal. (You can type “.format #” to quickly convert to decimal.)
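The same hexadecimal-to-decimal conversion can be done outside of the debugger, which may be handy when comparing against a process list that shows decimal PIDs. A minimal sketch follows; the function name and the Cid value are made up for this example.

```python
def cid_to_pid(cid_hex: str) -> int:
    """Convert a hexadecimal Cid string (with or without '0x') to decimal."""
    return int(cid_hex, 16)


print(cid_to_pid("0a4c"))  # 2636
```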

.process and .thread

These get into the context of a process or thread. Using “.process” causes certain details, like the stack history, to be process-specific. Similarly, “.thread” moves into a specified thread.

rephrased: By using .process, the host “moved into” (or “zoomed into”) a process.

e.g.: 32min17sec of Channel9 MSDN Defrag Tools: WinDbg Bugchecks

e.g., “.process /p /r #”. The /p is for going into the “process address space” (even for a thread, as mentioned at 33m46sec, which started with “.thread /p /r ”).

The “/r” may be needed to reload the user symbols. (The effect of /r is mentioned at 33min20sec, and then more clearly a short moment later at about 33m47s).

.thread with no parameters will leave the thread's context (basically exiting/reversing the prior actions of zooming in).

The number specified is the address that can be seen by the desired process's information as reported by “!process 0 0”.

k
Shows stack
~
Chooses a processor. To the left of the prompt may be text like “0 kd>”, which specifies CPU 0. Use “~1s” to switch to CPU 1. Then, “.reload /user” reloads the user-space symbols.

Some debug commands may often be used when checking for NPP (non-paged pool) memory leaks. (Responding to the issue(s) of low pool memory may contain more information about working with NPP.)

Remaining steps are not thoroughly covered by this tutorial. (Sorry... however, at least this tutorial helped to take care of some of the early steps.) The following resource(s) might help to assist further:

Debugging software in Unix
[#ddbusage]: Interacting with ddb


Intro/overview

First, here's a bunch of resources that may be available. (These may not have been fully viewed to determine how useful they are.) All of these may be skipped by someone who just wishes to follow this guide, but these references are provided in case anyone is interested in learning more.

OpenBSD man page for ddb, FreeBSD Developers Handbook: Kernel Debug Online: On-Line Kernel Debugging Using DDB, FreeBSD manual page for ddb (which is different than FreeBSD manual page for ddb (configuration)), Live Debugging with DDB

Ubuntu 9.04 (Jaunty Jackalope) Man Page about ddb says, “The ddb kernel debugger has most of the features of the old kdb, but with a more rational syntax inspired by gdb”.

It is true that OpenBSD's ddb (starting with OpenBSD 1.2) has a “hangman” command. Despite what the OpenBSD man page for ddb might suggest, it is genuinely a game of hangman. To quit, at the --db_more-- prompt, press q. (Pressing Ctrl-C earlier might help to cause that prompt to appear between games.)

As an example, we will take a look at a crash reported by OpenBSD. Here is some example text:

bad_directory_entry: rec_len % 4 != 0
offset=0, inode=946924148, rec_len = 21182, name_len=88
panic: ext2fs_dir_entry
Stopped at      Debugger+0x5    leave
RUN AT LEAST `trace' AND `PS' AND INCLUDE OUTPUT WHEN RECORDING THIS PANIC!
DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
ddb>

Specifically, this guide gathers the types of information which will be most useful to a developer of the operating system. This may help to identify which line of source code (and which file of source code) was related to the assembly instructions that were executing when a problem occurred. That may be much more detail than is useful if nobody with relevant software development experience will be assisting.

OpenBSD man page for securelevel notes that, “Because securelevel can be modified with the in-kernel debugger” called ddb, “a convenient means of locking it off (if present) is” effective if the securelevel is set to 1 or higher. This is generally going to be the case (on most kernels) except for when the system is restarting.

[#ddbgetez]: Information to gather during a system panic. How to approach a system panic
Auto-rebooting

Something to keep in mind for later: the operating system likely became largely unresponsive when entering the debugger. If automatic rebooting is desired, this may be as easy as changing a sysctl. However, reversing such a change might not be as easy as making it, because a reboot may be required to effectively reverse the setting. For details, perhaps see: handling system panics. (For example, for OpenBSD, see: how to have OpenBSD reboot automatically during a system panic.)

This may make a lot more sense in some cases (such as a firewall that crashes twice a week, when getting the firewall functioning again quickly may be an important priority), while not making nearly as much sense in other cases (if this happens during the system's bootup process, then automatically rebooting might not do much good).
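
As a minimal sketch for OpenBSD, the sysctl involved is believed to be ddb.panic: when it is zero, a panic leads to a reboot rather than to the ddb prompt. The fragment below assumes the stock /etc/sysctl.conf mechanism for applying settings at boot, and it should be checked against the ddb and sysctl manual pages for the release in use.

```text
# /etc/sysctl.conf (applied at boot)
# 0 = do not enter ddb on a kernel panic; reboot instead.
ddb.panic=0
```

The same value can be applied to a running system with “sysctl ddb.panic=0”. With auto-reboot enabled there is no chance to run trace or ps by hand, so this choice trades debugging detail for faster recovery.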

Some official resources

Especially if this guide does not seem to be heading towards a resolution, feel free to check out the information in the official guides. Even if this guide does seem to be making headway, knowing about additional resources can be comforting. Some documentation about dealing with this sort of problem may be at:

Goals

In general, the first step is going to be to try to gather some information.

Then, as noted in the section on leaving ddb, there are multiple ways to leave ddb. However, in many cases, the best (and possibly only feasible) way might be to reboot (which should only be done after gathering information).

Brief overview of desired information

The most important information will often be:

  • the kernel panic message,
  • the “Stopped at” line,
  • the first line of each trace where the first line matches the “Stopped at” line,
  • and the ps output.

(Note: this is referring to the output of the ps command that gets run within the “ddb” program. This is not the same thing as the ps command that may be run from a normal OpenBSD command prompt.)

Challenges

The most serious challenge may be that the operating system becomes less functional while it is at the “ddb” prompt:

USB keyboard support may be disabled

If the system does not have a PS/2-style keyboard plugged in before the computer goes into “ddb”, there may be very few options. Plugging in a PS/2 keyboard after the system is powered on is documented to be a dangerous activity, as hardware damage may occur.

Networking may essentially become unavailable

The functionality of the system may be substantially reduced, which reduces the options for remote system management.

Verbose output

Even though some sections of information may be important for debugging, these sections might not be easy to write down. The additional trace info and the ps output can be very verbose. This may be quite unpleasant if the information needs to be typed (or written down) manually.

The system may effectively be placed into a single-user mode

The functionality of the system may be substantially reduced, so do not expect to just switch to a new terminal to get a command prompt. (At the time of this writing, it is not remembered whether switching terminals is possible. However, if so, the only effect would be to check output on different terminals. Interacting with the terminals, such as logging in, is not likely to be possible.) The copy-and-paste functionality of an environment that provides multiple terminals (terminal multiplexer software, or a GUI) may be unavailable.

Because of these limitations (especially including USB keyboards becoming non-responsive), there may really be an inability to gather much information. Information shown on the initial screen will likely be available, so the “kernel panic” message and the initially-displayed “Stopped at” line may be useful. The output of the commands like trace and ps may be unavailable. (In such cases, developers may be less interested in helping to debug a problem, preferring that the problem be resolved by first figuring out a way to gather such information from a computer that does have a way to interact with the debugger.)

Ways to get information from the debugger
Accessing the system buffer post-reboot

This would likely be the nicest approach, since the system is operational again and able to run commands. However, it might not be supported by all systems, and the system buffer might overflow. Therefore, this might not be the most reliable method, but it may be the most pleasant when it does work.

It is nice to know ahead of time if dmesg may store info post-reboot. (Perhaps this is settable, as noted by Logging to the system log when the operating system makes dump information.)

Information seen on nabble.com: “how I can save ddb trace information” discusses approaches: “First see if your machine preserves dmesg between boots. Not all machines do, but it's worth checking this first (if your machine is one of those where dmesg shows more than one set of boot messages after a reboot, then this applies).” (Hmm... this sounds most convenient. Is there a way to cause dmesg to store multiple boots?)

If this works, the first step is to gather information (by running all of the commands that show the needed information). Then reboot. Then (probably as soon as possible) try to copy the system buffer into a text file (using “ dmesg | tee -a ~/sysbuffr.txt ”).
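
Once the buffer has been saved to a text file, the two most commonly requested lines can be picked out automatically. The following is a minimal Python sketch, not an official tool: the function name extract_panic_summary and the SAMPLE constant are hypothetical, and it assumes the saved text contains a line starting with “panic:” and a line starting with “Stopped at”, as in the example crash shown earlier on this page.

```python
import re


def extract_panic_summary(buffer_text):
    """Return the first panic message and "Stopped at" line in saved text."""
    summary = {"panic": None, "stopped_at": None}
    for line in buffer_text.splitlines():
        stripped = line.strip()
        if summary["panic"] is None and stripped.startswith("panic:"):
            # Keep only the text after the "panic:" prefix.
            summary["panic"] = stripped[len("panic:"):].strip()
        elif summary["stopped_at"] is None and stripped.startswith("Stopped at"):
            # Collapse runs of whitespace so the line is easier to quote.
            summary["stopped_at"] = re.sub(r"\s+", " ", stripped)
    return summary


# Sample text copied from the example crash on this page.
SAMPLE = """\
bad_directory_entry: rec_len % 4 != 0
offset=0, inode=946924148, rec_len = 21182, name_len=88
panic: ext2fs_dir_entry
Stopped at      Debugger+0x5    leave
ddb>
"""

if __name__ == "__main__":
    print(extract_panic_summary(SAMPLE))
```

To use this on real data, read the saved file (for example the ~/sysbuffr.txt file created by the tee command) and pass its contents to the function. Only the first match of each kind is kept, so a buffer holding several boots would need extra handling.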

Using a serial port

A popular approach is to use a serial (a.k.a. null-modem) cable: perhaps see OpenBSD FAQ 7: section on using a serial console. Then capture the information over the serial cable, perhaps with minicom or with another way of having a second computer act like a dumb terminal.

This might be the next-nicest approach. One advantage is that captured information may be stored before any need to reboot the system. Since the captured text may be viewed/verified before a reboot, there is less possibility of missing an opportunity because of an incorrect belief that data will be visible later.

Disadvantages include the need for a remote computer. This approach may require some setup (of software and perhaps also hardware). This might not be as feasible on systems that do not have a serial port. (Keep in mind that if USB keyboards don't work, USB serial ports probably also don't work.) Due to the need to set things up ahead of time, this might not be a very available option for figuring out what happened when there is an unexpected crash.

Basically, there are two parts to using a serial cable:

Using serial during boot process

This is platform dependent. For i386/amd64 (and others?), this may involve having the boot loader locate /etc/boot.conf and see that serial should be used.

To set up booting over a serial cable, use /etc/boot.conf, which gets loaded by the boot loader. This is described by information specific to OpenBSD: Second stage boot loader, the OpenBSD/i386 manual page for boot/boot.conf, and the OpenBSD/amd64 manual page for boot/boot.conf.
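
As a minimal sketch, an /etc/boot.conf for i386/amd64 that moves the console to the first serial port might look like the following. The speed of 19200 is only an illustrative assumption, and com0 is assumed to be the boot loader's name for the first serial port; both should be verified against the boot manual page for the platform in use.

```text
# /etc/boot.conf
# Set the speed of the first serial port, then use it as the console.
stty com0 19200
set tty com0
```

The terminal program on the other end of the cable must be set to the same speed, or the output will appear garbled.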

OpenBSD FAQ 7: section on using a serial connection (FAQ 7.6) notes, “OpenBSD numbers the serial ports starting at tty00, DOS/Windows labels them starting at COM1. So, keep in mind tty02 is COM3, not COM2.” (Note that some BIOS setup programs may also use the terminology that was just called "DOS/Windows" labels. So it's more of a platform thing than a DOS/Win OS thing.) (Perhaps EFI is like BIOS in this sense?) However, OpenBSD is rather platform independent, so uses a different vernacular, like tty00.
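
The off-by-one numbering quoted above is easy to get wrong, so here is a small worked example in Python. The function name tty_to_com is hypothetical, and the sketch assumes device names made of “tty” followed by exactly two decimal digits, which covers the names quoted from the FAQ.

```python
import re


def tty_to_com(tty_name):
    """Map an OpenBSD serial device name such as "tty02" to its COM label."""
    match = re.fullmatch(r"tty(\d\d)", tty_name)
    if match is None:
        raise ValueError("expected a name like tty00, got %r" % tty_name)
    # OpenBSD counts from zero, while COM labels count from one.
    return "COM%d" % (int(match.group(1)) + 1)


if __name__ == "__main__":
    for name in ("tty00", "tty01", "tty02"):
        print(name, "is", tty_to_com(name))
```

Checking the label this way before editing configuration files may help avoid waiting for output on the wrong port.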

Using serial after boot process

OpenBSD FAQ 7: section on using a serial connection (FAQ 7.6) says, “On some platforms and some configurations, you must bring the system up in single user mode to make this change if a serial console is all you have available.”

For terminal sessions, the OpenBSD FAQ 7: section on using a serial connection (FAQ 7.6) may have some more info. Specifically, OpenBSD FAQ 7: section on using a serial connection: subsection about changing the /etc/ttys file may be needed.
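
As a rough sketch of the kind of /etc/ttys change involved, the entry for the first serial port is switched from “off” to “on”. The terminal type (vt220) and the speed (std.9600) shown here are assumptions for illustration, and the exact line should be taken from the FAQ or from the ttys manual page.

```text
# /etc/ttys: before (serial login disabled)
tty00   "/usr/libexec/getty std.9600"   unknown off

# /etc/ttys: after (serial login enabled; "secure" permits root logins)
tty00   "/usr/libexec/getty std.9600"   vt220   on secure
```

Leaving out “secure” may be preferable when root logins over the serial line are not wanted.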

If there is no need/purpose to max out speed, OpenBSD FAQ 7: section on using a serial connection (FAQ 7.6) recommends against it.

Using a PS/2 keyboard and manually gathering information

This may be slightly unpleasant when needing to type or write a small amount of information, like the kernel panic message and “Stopped at” line. Some of the other information may be quite verbose, and so may take minutes to manually record, making this one of the least pleasant options. However, in some cases this may be the most feasible option.

Gathering data

Depending on how important the issue is, it may be worthwhile to just record the easily available information, and see if that is useful for troubleshooting. For instance, if a system is not booting up fully, gather the panic message and the “Stopped at” line, and use those details for trying to search for troubleshooting steps. Other details might be less necessary if troubleshooting is successful with just those few details. Such an approach may make sense if the problem is easily repeatable, such as if the system is not booting up successfully. In other cases, such as if a critical server stops responding twice a week, there may be more incentive to gather all of the details (even if it takes some time to do so).

Stopping point

Note the “Stopped at” info.

Note: If this has scrolled off, perhaps (and this may just be speculation) the information can be shown by using “machine ddbcpu 0”. However, that may just show what CPU core number zero was doing, whereas the initially-displayed “Stopped at” line might show information from a different CPU.

Panic message

It is also recommended to record the “panic message”. If the panic message scrolled off, run “show panic”.

trace data

Run trace on the first processor, by running: trace.

OpenBSD FAQ 2: section on Reporting Bugs (OpenBSD FAQ 2.4) has a note for SMP systems: run trace for each processor. So, after running trace on one processor, then switch processors with “machine ddbcpu #”. Then run trace again. See if any of those traces start with a reference to a function shown in a “Stopped at” line.

Process list

At some point, one should run the ps command within ddb, which is NOT the same thing as running the ps command from a normal command line shell. Running the ps command from within ddb is, according to the OpenBSD man page for ddb, the same thing as running the “show all procs” command within ddb.

Registers
Reporting a problem with OpenBSD (“How to create a problem report” section, item #5) notes, “The output of show registers might be of interest as well.” (So, run “ show registers ”.)
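
The gathering steps above can be summarized as an ordered list of ddb commands to type. The following Python sketch only builds that list as a checklist; it does not talk to ddb. The function name ddb_command_checklist and its parameters are hypothetical, and the CPU that ddb starts on is an assumption which the caller should adjust.

```python
def ddb_command_checklist(cpu_count=1, current_cpu=0):
    """Build the ordered list of ddb commands described in this section."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    # Start with the panic message and a trace on the CPU ddb is already on.
    commands = ["show panic", "trace"]
    for cpu in range(cpu_count):
        if cpu == current_cpu:
            continue  # this CPU was already traced above
        commands.append("machine ddbcpu %d" % cpu)
        commands.append("trace")
    commands.extend(["ps", "show registers"])
    return commands


if __name__ == "__main__":
    for command in ddb_command_checklist(cpu_count=2):
        print(command)
```

Printing this list before a crash happens, for a machine with a known number of processors, gives a sheet to follow at the console when nothing else on the system is usable.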
[#ddbquit]: End the ddb debugging session, intelligently
Dumping

Perhaps run dump. Making a dump might happen automatically (see OpenBSD manual page for the topic “crash”). OpenBSD Manual Page for ddb shows that boot dump or boot crash will dump data and then reboot; Reporting a problem with OpenBSD (“How to create a problem report” section, item #5) suggests rebooting with boot dump.

OpenBSD 4.4 FAQ (archived): FAQ 4 and OpenBSD 4.5 Installation Guide: FAQ 4 (section 4.7) state, “Be realistic -- few developers will want to look at your 1GB dump file, so if you aren't planning on investigating a crash locally, this is probably not a concern.” (This text seems to have been removed at some point by archived OpenBSD 4.6 Installation Guide: FAQ 4.)

(OpenBSD 4.3 changes: "Avoid creating ridiculously large core dumps"...)

If a dump was made, the location of the dump was determined by the kernel configuration option called “dumps on”. After the reboot, that will cause savecore to run, and put data into /var/crash/.

Continue

Perhaps another option is to select the “c” command, which is the same thing as the “continue” command. If ddb was entered intentionally, manually, this may result in the operating system simply working normally. Otherwise, particularly if ddb was entered due to apparent instability, results might be more undesirable.

Rebooting

It may be possible to reboot without making a dump file.

Once this information is gathered, use it to solve the problems. One possible way may be to follow the guidelines in the section about reporting details about crashes to software developers.

[#gnudebug]: Using the GNU Debugger, GDB

“NetBSD Documentation: Debugging the NetBSD kernel with GDB HOWTO” discusses KGDB. See also: OpenBSD's manual page for gdb.

“Magic System Request” key

The term “Magic System Request” refers to holding Alt, and then pressing and releasing the “SysRq” key. Then, while still holding Alt, other keys may be pressed.

See: HowToGeek guide on using Magic SysRq, Linux Kernel documentation about Using SysRq.