System Panics

A system panic is generally an automatic, somewhat controlled response to a serious problem: the system deliberately stops because it might not be able to continue operating normally. An operating system may panic when it detects a condition that could cause severe data loss, and probably more data loss if the system kept operating. The incident is often caused by misbehaving software, and the system often does not work normally until it is restarted. A system panic may be considered a form of software crashing. (For more details about software crashing, see handling crashing.)

Microsoft Windows
[#bugcheck]: BugCheck (Code), a.k.a. “STOP Error”

(See also: Bugcheck)

The BSOD is witnessed a lot less often now that newer versions of Windows will, by default, automatically reboot instead of showing an error screen.

Basically, a “bugcheck” indicates that some software decided to report a “bugcheck” situation. This is very often an indicator that a system failure in Microsoft Windows occurred, and that the system may have shown a BSOD unless the system was set to automatically reboot. However, even that can be a bit speculative. The basic information that is really known is just that a “bugcheck” flag was set.

The results of the “bugcheck” should appear in the system logs after the system reboots. Note that the time of a log entry showing “bugcheck” information is not necessarily the time when the problem occurred. Supposedly the reboot, and so the log entry, may come days after some software caused the “bugcheck”. During the time between the “bugcheck” and the reboot, the system may have appeared to work just fine (for days), and the eventual restart may have happened for some reason entirely unrelated to the “bugcheck”. So, view a “bugcheck” as a strong indicator that there may have been a less-than-ideal user experience, but not a definite indicator of that.

Find the bugcheck code, which generally will be followed by some bugcheck parameters. See what information is provided about the bugcheck code, from MSDN: Interpreting a Bug Check Code (and more specifically, from the subsection MSDN: Bug Check Code Reference). That web page is likely to provide a description of what typically causes a specific bugcheck code to be seen, and a description of what the parameters may mean. The meaning, and even usefulness, of the parameters may vary based on which bugcheck code is being used.
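
As a minimal sketch of separating the bugcheck code from its parameters, the following shell function splits a STOP line into its hexadecimal fields. The function name parse_stop_line and the sample line are made up for illustration, and the sketch assumes the code comes first and the parameters follow in parentheses; the exact layout of the line may differ between Windows versions.

```sh
#!/bin/sh
# Print every hexadecimal field of a STOP line, one per line.
# The first line printed is the bugcheck code; the rest are the parameters.
parse_stop_line() {
    # 1. drop the parentheses and commas around the parameters
    # 2. put each word on its own line
    # 3. keep only the words that look like 0x... hexadecimal numbers
    printf '%s\n' "$1" |
        tr -d '(),' |
        tr -s ' ' '\n' |
        grep -i '^0x[0-9a-f]\{1,\}$'
}

# Hypothetical sample line, not taken from a real crash.
sample='*** STOP: 0x0000000A (0x00000000, 0x00000002, 0x00000000, 0x804E3B2C)'
parse_stop_line "$sample"
```

The first line of the output is the value to look up in the Bug Check Code Reference. The remaining lines are the parameters, in the order that the reference page for that code describes them.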

If the Bug Check Code is documented

The question of “What does the bugcheck code mean?” is likely to be answered by the above web pages, so once that answer is obtained, it is likely best to stop searching for an answer to that question. The remaining questions to focus on may be “What caused the bugcheck to happen?” and “What is the best way to proceed with the situation?”

Very often, the information available online from the MSDN list of bugcheck codes may seem too vague to provide a clear solution. If the documented information is a bit vague, then there might not be a lot of additional useful information that can be extracted from the bugcheck code. Some people might think that the bugcheck data must be very useful if a person just understands how to interpret it, as if a clear and simple answer is definitely available but only to those who know the secret. Sometimes there might be some truth to that idea: people who know how to gather and interpret information may be able to obtain useful results. However, sometimes there just isn't a lot of additional useful information available. If the bugcheck code suggests that a driver reported a problem related to an IRQ status, then the information available from the bugcheck code and its parameters may really be that unspecific. Trying harder to extract information from the bugcheck code may not end up being fruitful.

Determining the best way to proceed with a situation may involve using troubleshooting skills beyond just seeing the description of the bugcheck code. For example, Microsoft may have a KB article that mentions the bugcheck code when it is seen in specific circumstances. Search engines may be helpful.

If the bugcheck code is not documented

MSDN: Bug Check Code Reference says, “If a specific bug check code does not appear in this reference,” there is a recommended process. The process involves using “debugging tools” released by Microsoft.

If further information is needed, consider using the debugging tools.

Debugging bugchecks

See: crash handling: Debugging software for Microsoft for information about Microsoft's debugging software (kd/WinDbg) and using the “symbols”.

After the debugging software is installed, and using the correct symbols, MSDN: Bug Check 0xA: IRQL_NOT_LESS_OR_EQUAL shows an example of entering several commands:

  • .bugcheck
  • kb
  • kv
  • .trap
  • kb

MSDN: Bug Check Code Reference says to use “!analyze -show” followed by the Bug Check code. (For example: “!analyze -show 0x00000082”)
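
A small sketch of building that debugger command from a bugcheck code, however the code was written down. The helper name analyze_show_cmd is hypothetical. It accepts the code with or without a leading 0x and prints the code as eight hexadecimal digits, which is an assumption made to match the style of the example above.

```sh
#!/bin/sh
# Print the "!analyze -show" command for a bugcheck code.
# Accepts the code with or without a leading "0x" (for example 82 or 0x82).
analyze_show_cmd() {
    code=${1#0[xX]}                          # strip one leading 0x, if any
    # Convert the hexadecimal text to a number, then print it zero-padded.
    printf '!analyze -show 0x%08X\n' "$((0x$code))"
}

analyze_show_cmd 0x82
```

The printed line is meant to be typed into the debugger (kd/WinDbg), not into the shell.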

Tom Wijsman's Apr 12 '11 at 13:14 comment to his own answer to ben950's Superuser.com question about a Blue Screen recommends using “!analyze -v” and “lm t” (to get time stamps) and “lm v” (but, he notes about that last command, it provides “a lot more information but takes a long time”).

The information gathered can be useful for some of the most skilled troubleshooters and programmers, including those who can effectively use “reverse engineering”/“disassembly” techniques/processes. For some other people, a lot of the information will just look like hexadecimal gibberish. However, you may be able to find some filenames, which may help determine what software (possibly including which driver) was active at the time that problems occurred.

[#bsod]: “Blue screen of Death” (“BSOD”)

In operating systems that use the Windows NT kernel, the most common approach when a BSOD is involved is to issue a BugCheck/“STOP Error”, and then possibly reboot the system.

Handling BSODs in modern Microsoft Windows operating systems

It is best if the system has been properly configured to record useful information. In many cases, a rather suitable configuration may be set up by default. (See the section about System failure in Microsoft Windows for further details.)

  • If it is known that all of the useful information on the BSOD is also being written to the hard drive, in a way that the information may be easily retrieved after the system reboots, then there's probably not much point in remaining on the BSOD screen. If that is the case, go ahead and end the BSOD, which generally involves rebooting the system.
  • If it is not known whether the information on the screen is being recorded somewhere else, then go ahead and start using the bugcheck information (by using another computer to do research) and/or recording the bugcheck information. Such information may be useful when trying to troubleshoot the system. (In particular, the most useful details are generally the bugcheck code and the parameters.) Note that writing down all of this information may take some time, as there may be quite a few hexadecimal digits on the screen. (The preferred way to handle this is to know, ahead of time, that the system is configured to store the BSOD information on the hard drive. Then, as long as the system can be counted on to write the needed information to the hard drive reliably, writing down such details is unnecessary. If that cannot be counted on and a BSOD happens on an important system, then writing down the numbers may be necessary.)

When the system is working normally, it may be worthwhile to ensure the system writes useful data to the log during any future bugchecks. Windows System Panic Logging may have further info.

To find out more information, see the section about the bugcheck code.

BSODs of Win9x

Wikipedia's page on “Bug check” says “The corresponding system routine in Windows 9x, named SHELL_SYSMODAL_Message, doesn't halt the system like bug checks do; it just displays a BSoD and allows the user to continue execution.” However, in such operating systems, often the cause of a BSOD is self-repeating, so if the user decides to continue, often the user has no options and simply experiences another BSOD.

Ideas which may or may not work well: record all of the numbers, and after the (generally inevitable) reboot, compare those numbers to see if any of them are very close to (perhaps a bit higher than) some of the I/O memory addresses which can be viewed from Device Manager. To see these I/O memory addresses or I/O memory address ranges, see the Troubleshooting Guide: section on I/O ports.

BSODs in Microsoft Windows 3.1

These may exist. (They may happen more frequently if running MS-DOS programs.)

For other information about handling BSODs, see: Microsoft KB Q129845: Blue Screen Preparation Before Contacting Microsoft (for Win NT/2K) and/or Microsoft KB Q314103: Preparation Before You Contact Microsoft After Receiving a STOP Message on a Blue Screen (for WinXP). (The advice in those articles to send a compressed dump file via FTP to ftp.microsoft.com is probably no longer a recommended technique.)

[#wnsysfal]: System failure

This may often result in a “bugcheck”.

Options for automatic handling

Get to the “System Properties” screen. To do this, go to the Control Panel applet called “System”. In Windows Vista or newer, this shows a summary screen, and so to get to the “System Properties” screen, one then needs to choose “Advanced system settings” from the left frame.

Once on the “System Properties” screen, check the “Advanced” tab and then look for a “Startup and Recovery” section with a “Settings...” button.

That screen shows a “System failure” section, with options about writing a “dump file” and then optionally proceeding to “Automatically restart” the system.

Of course, not all system failures will necessarily follow these procedures. If the system completely freezes up, possibly due to electrical problems, the operating system may not get a chance to choose to perform actions such as writing a file and then rebooting. When the operating system is able to respond to a system failure, a bugcheck may be issued.

[#pancwnlg]: Recommended setting for Write an event to the system log

This probably is a good idea.

TechNet: Windows NT 4.0 Svr Registry Tweaks notes this may be set using HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\LogEvent (setting it to 1 to write the event to the system log).

Recommended actions for a memory dump

If disk space is abundant, or a system failure is expected as a strong possibility, then it may make sense to use a fairly large dump file. Smaller dump files may contain only some of the needed information. The other needed information may be available from external sources, such as files that contain details about “debugging symbols”, although using those files effectively may require additional setup/effort. Just having the information right in the dump file can be more convenient.

Information about the various types of memory dumps may be at: crash handling: dump types.

Recommended setting for the “Automatically restart” option

TechNet: Windows NT 4.0 Svr Registry Tweaks notes this may be set using HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\AutoReboot (setting it to 1 to reboot).

For an interactive system where somebody is likely to be right in front of the computer, leaving that option off may make a lot of sense. The key reason for this is that if the system reboots, there may be some question as to whether the reboot occurred because of a software instruction, or because of hardware issues that caused the hardware to restart even though software did not request the restart. If the system is not set to automatically restart, then seeing the system reboot may be an indicator of a hardware-based problem. This detail may be helpful for troubleshooting.

For a system that is not likely to have somebody nearby and readily available to restart the system as needed, choosing to “Automatically restart” makes a lot more sense. This may cause the system to not show a visible BSOD, but that is okay if the information from the BSOD will also be available later.

Generating an alert

TechNet: Windows NT 4.0 Svr Registry Tweaks notes this may be set using HKLM\SYSTEM\CurrentControlSet\Control\CrashControl\SendAlert (setting it to 1 to send an alert).

An alert can be generated anyway if this information is logged. To log this, see logging panics in Microsoft Windows. For more information about alerts, see the monitoring/reporting section of the text about activities which network administrators can implement “behind the scenes”.
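
The three registry values named in this section (LogEvent, AutoReboot and SendAlert) can be collected into one registry file. The following is an untested sketch: the function name crashcontrol_reg is hypothetical, the key is assumed to be CrashControl under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control (where Microsoft documents these values), and it is assumed that all three values are DWORDs where 1 enables the behavior and 0 disables it.

```sh
#!/bin/sh
# Print a .reg file that sets the three CrashControl values.
# Arguments (each 0 or 1): LogEvent AutoReboot SendAlert
crashcontrol_reg() {
    # .reg files use CRLF line endings, so each line ends with \r\n.
    printf 'Windows Registry Editor Version 5.00\r\n\r\n'
    printf '[HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Control\\CrashControl]\r\n'
    printf '"LogEvent"=dword:%08x\r\n' "$1"
    printf '"AutoReboot"=dword:%08x\r\n' "$2"
    printf '"SendAlert"=dword:%08x\r\n' "$3"
}

# Example: log the event and reboot automatically, but send no alert.
#   crashcontrol_reg 1 1 0 > crashcontrol.reg
```

The resulting file would then be copied to the Windows machine and imported there, so review it first. Changing the same options through the “Startup and Recovery” screen described above avoids editing the registry directly.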

Misc info

Some other tools are described by crash handling, including the reference to TechNet blog: capturing application crash dumps.

Some software that may be helpful:

  • DebugView (from Microsoft/Sysinternals)
  • BlueScreenView by Nirsoft (and related utilities mentioned on that page)
  • Debugging software

[#unixpanc]: Unix Panic
[#obsdpanc]: If OpenBSD panics...

With a default installation, generally what happens in response to a “system panic” is that the system goes into the debugger software called ddb. If this happens, information may be collected as described by crash handling: ddb usage: getting easy data. (One may then want to reboot and then continue to collect data as noted in the section about gathering details when handling OpenBSD crashes.)

[#obsdpnrb]: OpenBSD: Rebooting automatically in response to a system panic

Referencing some FreeBSD documentation (even though this section of text is about OpenBSD), FreeBSD Developers Handbook: Kernel Debug Online: On-Line Kernel Debugging Using DDB states, “any panic condition will branch to DDB if the kernel is configured to use it. For this reason, it is not wise to configure a kernel with DDB for a machine running unattended.” Well, OpenBSD's default kernel is configured to use DDB. Sure enough, this can lead to an unattended system becoming unresponsive to attempts to interact with it remotely. Understand that if the operating system drops into the debugger, it may not respond normally: locally switching to another terminal may not work, and (what may limit options even more) the SSH server may not respond.

If going into the debugger is not desirable, there is a way to get the system to automatically initiate a reboot instead of going into the debugger. However, before making this change, it may be nice to know that the setting cannot necessarily be changed back as easily as it is initially changed.

OpenBSD man page for securelevel notes that, “Because securelevel can be modified with the in-kernel debugger” called ddb, “a convenient means of locking it off (if present) is” in effect when the securelevel is set to 1 or higher. This is generally going to be the case (on most kernels) except for when the system is restarting. So (at most times on most kernels), if the old behavior is wanted again after the value has been lowered, the change back will not take effect until the system is rebooted, at which point the value is set to whatever is stored in the /etc/sysctl.conf file. (That file may be edited even if the securelevel is set to 1 or higher.)

Having said all that, lowering the value is much easier. To have the system auto-reboot, consider changing the sysctl called ddb.panic from the default value of 1 to 0. “Dave's Picks” website: Automatically rebooting OpenBSD after a panic notes that by doing so, “instead of dropping to the debugger, the crashlog will be saved and the machine will automagically reboot.”

sysctl ddb.panic=0

That will set the currently used configuration. To have the configuration take effect each time the system is restarted, a standardized method to configure such sysctl values is to back up and modify the /etc/sysctl.conf file.

cpytobak /etc/sysctl.conf
echo ddb.panic=0 | sudo -n tee -a /etc/sysctl.conf

FreeBSD

... To auto-reboot, there may be a sysctl called debug.debugger_on_panic (see http://davespicks.com/writing/programming/openbsdpanic.html).

Linux

http://www.kernel.org/doc/Documentation/ describes the watchdog/ subdirectory as “how to auto-reboot Linux if it has "fallen and can't get up". ;-)”

The basic concept is that a “Watchdog” driver (or “Watchdog Timer” driver) watches for signals sent to a device which is located at /dev/watchdog on the filesystem. If an expected signal does not occur, the Watchdog may attempt to reboot the system. Software may be able to indicate when such signals are expected, or software may not be able to control the expectations. (A WatchDog man page refers to a CONFIG_WATCHDOG_NOWAYOUT option that may be specified when compiling a kernel.) LinuxJournal: web page about Watchdog—The Linux Software Daemon indicates support was added to Linux 1.3.51 (which was presumably released between Linux 1.2 in March of 1995, and Linux 2.0 on June 9 of 1996).
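
The signalling described above can be sketched as a small loop that keeps the device open and writes to it at regular intervals. This is only an illustration: the function name feed_watchdog is hypothetical, the interval must be shorter than the driver's timeout (which varies by driver), and the final “V” relies on the “magic close” convention from the kernel's watchdog documentation, which a given driver may not support and which has no effect when the kernel was built with CONFIG_WATCHDOG_NOWAYOUT.

```sh
#!/bin/sh
# Keep a watchdog device fed: open it once, write to it COUNT times with
# INTERVAL seconds between writes, then ask the driver to stop the timer.
# Usage: feed_watchdog DEVICE INTERVAL COUNT
feed_watchdog() {
    dev=$1 interval=$2 count=$3
    exec 3>"$dev"                # opening the device starts the timer
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '.' >&3           # any write counts as a keepalive signal
        sleep "$interval"
        i=$((i + 1))
    done
    printf 'V' >&3               # "magic close": request that the timer stop
    exec 3>&-                    # close the device
}

# Example (as root, on a machine that may be rebooted):
#   feed_watchdog /dev/watchdog 10 6
```

Opening the real /dev/watchdog arms the timer, so if the loop is killed before it writes the “V”, the machine may reboot. The sketch can be tried safely by pointing it at an ordinary file instead of the device.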

(Further research may be needed about how to get this set up.) Embedded Freaks: Howto Use Linux Watchdog has some information about setting up this software to work with some specific hardware.