Checking a disk

This section is about hardware testing of a data storage device. A similar, but different, task is testing/repairing filesystem volumes. (Despite its name, the well-known utility called chkdsk (“check disk”) is actually a tester of filesystem volumes.)

[#dsksmart]: Self-Monitoring, Analysis, and Reporting Technology: S.M.A.R.T.

S.M.A.R.T. is supported by modern hard drives. It is one of the few acronyms in the industry that has typically been written with a period after every letter, although it is also commonly written without the periods (as “SMART”).

Particularly when S.M.A.R.T. was a new feature, it was marketed as a way to detect a drive that is about to fail, before the drive actually fails. However, before placing any significant amount of trust in such a system, note that S.M.A.R.T. has gained a reputation of not being very useful for determining whether a specific drive is about to fail. Despite possibly not being nearly as useful as some other tests, S.M.A.R.T. is mentioned first (in this section on hard drive testing) because using S.M.A.R.T. may be one of the fastest ways to get *some* information about a hard drive's health. S.M.A.R.T. reports information that the drive has already gathered, so checking a report based on S.M.A.R.T. values may be much faster than other strategies that involve checks which are more thorough but also more time consuming. Also, it is reasonable to expect that technology improvements over time may someday result in more useful reporting by future devices.

So, S.M.A.R.T. may be useful in helping to identify some problems. However, do know that S.M.A.R.T. has often generated a report which seems to indicate no problems, even on a drive that did not continue functioning much longer. A clean bill of health on a S.M.A.R.T. report does not guarantee perfection, or even satisfactory operation.

Software overview
WMIC

This may not be as detailed as some other options that are more specific to S.M.A.R.T., but it has the advantage of being pre-bundled with newer versions of Microsoft Windows (including Windows XP). Details are included later in this page: S.M.A.R.T. with WMI.

Smartmontools

An option is to use software from the Smartmontools package. The software is available for both Unix and Windows. (Details on using this in Windows may be added here at a later time.) In Unix, the first step is generally to see if it is pre-installed: run “which smartctl” to verify that the command exists.

If S.M.A.R.T. software is not installed, add it. Details about adding software are provided in the section about software installation.
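For example, a minimal sketch (the package names shown are typical, but should be verified for the specific operating system and release):

which smartctl || echo "smartctl not found"
# If it is not found, install the Smartmontools package, e.g.:
# OpenBSD:        pkg_add smartmontools
# Debian/Ubuntu:  apt-get install smartmontools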

ide-smart
Manufacturer tools

Hard drive manufacturers may also release diagnostic software for their own drives.

HDDScan

HDDScan's home page describes the software as a “free HDD diagnostic utility”; it is available for Microsoft Windows. This closed-source software was mentioned, in the discussion of Seagate's Seek Error Rate, Raw Read Error Rate, and Hardware ECC Recovered SMART attributes, as being able to extract all 48 bits of a SMART attribute.

Others

A BackBlaze blog comment mentioning MHDD referred to MHDD positively, though it noted that the drive needed to be taken offline to get the most useful test data. The MHDD manual stated, “It is very important to understand that you have to spend several hours (minimum) before you will start using MHDD.” ... “Be extremely careful when running MHDD the first time.” See also: MHDD's home page.

Also, a BackBlaze blog comment mentioning MHDD mentioned DTrace as seeming to be a more convenient solution, though perhaps one that was still in development.

For Microsoft Windows, there is a freeware application called SpeedFan.

Performing an extended test
Process using smartctl from Smartmontools

To see how long a test may take, use -c, e.g.:

smartctl -c /dev/sd0c

Note: The /dev/sd0c represents a device name, and may need to be customized. (For more details about what device name to use, see making filesystems: destination names.)

In the output, look for the “Extended self-test routine recommended polling time”.

Check the logs. Run:

smartctl -l xselftest /dev/sd0c

There may be a line that starts with “# 1 ”. See the “LifeTime(hours)” value, which specifies when the last test was started (measured in hours of drive operation). Note that value down so that this old log entry can be identified later, and not misinterpreted as a new result. (Later on, the log will provide the new result as log entry # 1. Knowing when the last test was run may also help determine when the new test will be.)
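As a minimal sketch, the relevant line may be pulled out of the log directly (the grep pattern simply matches the “# 1 ” line described above):

smartctl -l xselftest /dev/sd0c | grep "^# 1 "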

To start the test, which may take multiple hours, use:

smartctl -t long /dev/sd0c

Note the text near the bottom of the output that says how long the test will take, and when the test will end. This estimate should be fairly accurate. Make a note of that time now, because there may not be an easy way to figure out later when the test is expected to complete (unless there is a record of when the test started). (There is, however, a way, as described earlier, to see how long the test is expected to take.)

# smartctl -t long /dev/sd0c
smartctl 5.40 2010-10-16 r3189 [x86_64-unknown-openbsd4.9] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 195 minutes for test to complete.
Test will complete after Mon Oct 31 11:02:27 2011

Use smartctl -X to abort test.
#

Note: Checking the results before the test finishes may simply show that results are not available yet. The “Remaining” column appears to be meant to show how much of a test remained when an aborted test was stopped; do not expect to see preliminary results before the test is completed. To check the results, use:

smartctl -l xselftest /dev/sd0c

Look for the line that starts with “# 1  ”. It will show the type of test (e.g. “Extended offline” for ATA, or perhaps “long” for SCSI). Hopefully the status shows as “Completed without error”, and the percentage remaining shows “00%”, meaning zero percent of the test remains. (It seems the percentage may be rounded, possibly to 10% increments.) The LBA_of_first_error column should also just show a hyphen. (The “Lifetime(hours)” column indicates how many hours the hard drive had been in use before the test started.)

Also, after running “smartctl -l xselftest”, check the error/return level of that program. To do so: in Unix, run “echo $?”. Users of a command shell from JP Software can use “echo %?” (or another applicable option, like the option for Microsoft Windows). Users of Microsoft Windows: try “echo %ERRORLEVEL%”.

Note: Starting a test while one is running will cause the old one to be stopped. The old test will show up in the log showing a status of “Aborted by host”.

Checking for values that are below acceptable thresholds

Perhaps choosing a method from the “Quick approach” or one of the “Detailed methods” will be sufficient.

Quick approach
Using OpenBSD

Figure out the device name of the disklabel entry for the drive that will be checked. The section on Detecting disk devices (in BSD) will likely help with that. Then add a “c” to the end of the drive name: if the drive name is wd0, then the disklabel entry is wd0c. wd0 used to be the most common drive name, though sd0 has become more common. If there are multiple disk storage devices, then a name like sd1 may be used.

Run:

sudo atactl /dev/wd0c startstatus

That should output:

No SMART threshold exceeded

Immediately after running that command, run:

echo $?

That will output a number. When things are going well, that number will generally be zero.

0

Calomel's guide to SMART handling shows some details about how to set this up in cron (in the section called “Option 2: Run Periodically through Cron”), and notes, “You will only receive an email to root if there is a problem with the drive.” Earlier, that same page notes, “If an error is found then an email will be sent to root” ... “It only sends out an email if an error is found to reduce the spam.”
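As a rough sketch of that idea (not Calomel's exact recipe; the schedule, path, and device name are illustrative), a line like the following could be placed in root's crontab. It relies on cron mailing any output to root, and assumes, per the description above, that the command exits non-zero when a threshold has been exceeded:

0 3 * * * /sbin/atactl /dev/wd0c startstatus > /dev/null 2>&1 || echo "SMART threshold exceeded on wd0"

Since the normal output is discarded, mail is only generated when the check fails.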

Detailed methods
Seeing the data
Using atactl

For example, in OpenBSD:

Figure out the device name of the disklabel entry for the drive that will be checked, as described for the Quick approach above (e.g. wd0c, or perhaps sd0c).

Customize the drive name as needed:

sudo atactl /dev/wd0c readattr
Using SmartmonTools

In Unix, run:

smartctl -A /dev/sd0c
echo Returned $?

Note: The /dev/sd0c represents a device name, and may need to be customized. (For more details about what device name to use, see making filesystems: destination names.)

Note: In other operating systems, the command to run is similar (start with smartctl -A), but the device name may be different. The purpose of the second command line shown is to report the “exit code”/errorlevel/”return value”. Customize as needed to determine the error level.

It should report:

=== START OF READ SMART DATA SECTION ===

If all is well, it will also show more, e.g.:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0

The output may commonly be word-wrapped. The underscore in WHEN_FAILED, and the hyphens beneath it, appear in the 80th column of output; since many terminal displays default to being only 80 columns wide, this output often wraps. For instance, the RAW_VALUE column header may show up on a later output line than the ATTRIBUTE_NAME column header, and the value corresponding to RAW_VALUE (0 in the example above) may show up on a later line than the value corresponding to ATTRIBUTE_NAME (Raw_Read_Error_Rate in the example above).

Interpreting the data

(Note: this particular section was written with SmartmonTools in mind. If the data was gathered from some other program, this guide does not currently provide details. So, for now, just adapt as necessary.)

There are two things to check for. One is that data did get reported. If things went very well, there should be lines of data that showed up after the line that says “=== START OF READ SMART DATA SECTION ===”. If so, the other thing to check quickly is to examine the return value. (This is the number that shows up after the word “Returned”, if the above example echo command was used). If it is zero, everything that was detectable and reported seemed good.

Here are some further details to understand the other information being shown:

Do not be alarmed by a TYPE of Pre-fail showing. That is simply a description of the values being reported; if the values are acceptable, the drive is not in a Pre-fail status. The THRESH column is the lowest acceptable value for each category, and WORST should be above THRESH.
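As a rough sketch (assuming the column layout shown above; the device name is an example), any attribute whose current VALUE has dropped to or below its THRESH could be picked out with something like:

smartctl -A /dev/sd0c | awk '$1 ~ /^[0-9]+$/ && ($4+0) <= ($6+0) { print "Attribute", $2, "is at or below its threshold:", $4, "<=", $6 }'

On a healthy drive, this should print nothing.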

What to do if the output is less great

If the report stops after the “=== START OF READ SMART DATA SECTION ===” line, then try again: retrying in a few seconds may result in more information. If not, try again a bit later. (It is unknown why this might be uncooperative.) Once it does show information, follow up by checking the error/return code.

When it does show the values, check the error code. Zero is great. Other error codes have meanings that may be interpreted using the manual page (see the Smartmontools manual page for smartctl, section about return values). Per that documentation, a return value of 4 indicates that Bit 2 is set while the others are cleared, so S.M.A.R.T. did not return a clean set of data. (This might, or might not, be acceptable. If this occurs, note the output, but try just running the smartctl command again to see if better results show up.)
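As a minimal sketch (Unix shell; the device name is an example) of checking a single bit of the return value, using the Bit 2 example just discussed:

smartctl -A /dev/sd0c
rc=$?
if [ $(( rc & 4 )) -ne 0 ]; then
  echo "Bit 2 is set (return value includes 4): S.M.A.R.T. did not return a clean set of data"
fi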

Some specific values to check

There seem to be various opinions about what values mean. Part of the reason for this is that different hard drive manufacturers use different values, at least for some of the S.M.A.R.T. categories that get reported. The following information is not based on any widely recognized industry standard that may be strictly followed. Instead, this is just some information that has been gathered, and is being re-shared in case it offers any useful insight.

A web page with Ubuntu tips indicates that Load_Cycle_Count is related to having a hard drive stop and start, which may be tolerable for laptops. However, even in those systems (at least for common usage), the web page cautions if the Load_Cycle_Count “is going up rapidly (more than 25 per day is too many).”

Info from BackBlaze

A company named BackBlaze, which provides online storage, has used tens of thousands of hard drives and has been known to share some data.

BackBlaze blog: Hard Drive SMART stats states, “There are over 70 SMART statistics available, but we use only 5.” Those are:

5 (Reallocated_Sector_Count)

187 (Reported_Uncorrectable_Errors)

BackBlaze's blog stated, “Drives with 0 uncorrectable errors hardly ever fail. This is one of the SMART stats we use to determine hard drive failure; once SMART 187 goes above 0, we schedule the drive for replacement.” ... “For SMART 187, the data appears to be consistently reported by the different manufacturers, the definition is well understood, and the reported results are easy to decipher: 0 is good, above 0 is bad.”

188 (Command_Timeout)

197 (Current_Pending_Sector_Count)

198 (Offline_Uncorrectable)

The blog from BackBlaze also discusses:

SMART 1 (Read_Error_Rate)

The blog describes that 0 is good, and higher numbers tend to represent problems. Different manufacturers assign different meanings to those higher values; for instance, one manufacturer's number may grow much more quickly than another's, so one manufacturer may use much higher numbers than another. However, any growth is bad. The simple thing to remember is that 0 is good, and more than zero tends to be bad.

The article also has some information about SMART 12 (Power_Cycle_Count).

The blog also generated some interesting comments. Among those include:

Sami Liedes's comment notes that some of the reported numbers may be built from collections of bits that have quite different meanings. Some of those bits may carry information that could be quite interesting, but that information might not be noticeable if people just look at the whole numbers, which also include other bits with other meanings. To quote Sami directly, “For example, if the high 8 bits contain some counter which does not correlate with failure and the low 16 bits contain something that does correlate, you are not going to see any significant correlation if you just interpret them as integers.”

Also, one of the comments referred to Seagate's Seek Error Rate, Raw Read Error Rate, and Hardware ECC Recovered SMART attributes.

List from Wikipedia

See: Wikipedia's page on S.M.A.R.T. : section on “ATA S.M.A.R.T. attributes”.

[#disktest]: Disk integrity testing

While “S.M.A.R.T.” is about checking for problems that the drive has already reported, this approach is more about trying to find problems that may not have been reported yet (but which might already be causing trouble, or may be very likely to cause problems soon). Basically, this process involves giving the storage device a series of operations that will hopefully trigger and detect a problem, if a problem is likely to occur during normal usage. If the disk passes this test, that is a good sign. Otherwise, hopefully the detected problem can be carefully dealt with, to minimize data loss and be minimally disruptive. The hope/intent is that a problem may be handled better in response to a test/diagnostic program that detects the issue, rather than waiting for a sudden catastrophic failure causing inconvenient downtime.

Disk integrity checking may often be implemented by using software that is familiar with a filesystem structure, and which often checks not just the physical drive but also details about the filesystem's structure. For such details, see testing/repairing filesystems. (Some of the reason this is common practice may be historical precedent: DOS came with software to test/repair the filesystem, and simple brief checks of the disk may have been included in the software that works with MBR partitions and creates a filesystem, but no other disk-checking software was included. While such an approach left a few sectors unchecked, those sectors represented a statistically small portion of the disk. Also, those untested sectors might not be written to frequently at all, and, except for a small number of key sectors like the boot sector, might not even be read from.)

[#unxdskts]: Disk testing in Unix
Using badblocks

Note: Some people seem to like to check S.M.A.R.T. data before running badblocks, so that the data from the drive's pre-tested state may then be compared to later data. (e.g.: Slashdot Article on hard drive testing: Comment 42376087)

There may be a command called badblocks (as part of the e2fsprogs package). Although badblocks is distributed with the Ext2 tools, it operates on a device rather than on a particular filesystem format; its ties to Ext2 mainly involve feeding its output to Ext2-specific tools, and the filesystem format certainly shouldn't matter when using a destructive test.

There will be a need to know the name of the device that does or will have the file system. (Using tools like fdisk (or, in BSD, perhaps better, disklabel) may help show information for determining the names of the partitions. However, those tools may need to be supplied with the names of the hard drives. As needed, see information about detecting hardware.)

[#okosbdbl]: Using the right environment

Testing large partitions, which may naturally exist on large hard drives, or testing an entire large storage device, may take a fair amount of time; multiple hours is certainly not unheard of. Like any process that will take such a substantially long time, it may be best not to run the task directly from a command prompt started by a remote access server (such as sshd). Instead, use a terminal multiplexer such as the newer tmux (if available and convenient) or the older program called screen (if that is more conveniently available). This way, the task does not rely on the remote connection, which generally has more points of possible failure, and interaction with the shell running the hard drive tests may be resumed later. An unplanned task, like a desire to reboot a client machine, then won't need to be held up while the disk test continues.
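For example, a minimal sketch (the session name is arbitrary):

tmux new-session -s disktest
# ... run the lengthy disk-testing commands inside this session ...
# Detach with Ctrl-b d; later, reattach from any login with:
tmux attach -t disktest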

(This entire paragraph should be reviewed following some additional testing.) Some preliminary testing indicates that running this program under OpenBSD may take significantly longer than under Ubuntu. However, this testing may have been done naively, running a non-destructive read/write test in OpenBSD and a read test in Ubuntu, so the results should not be considered final. The difference may be something like an eight-fold speed difference. The cause of this hasn't yet been determined (and verified) by the author of this text; perhaps it is related to overhead from some sort of Linux emulation layer? If a solution to this is found, please don't hesitate to contact the site staff. (This seems like it ought to be a resolvable problem.) Remember to benefit by using the right tool for the job: although OpenBSD may be a great choice for network security, it might be a less ideal choice for performing this task on a computer that is dedicated solely to this task. If the program is started and it appears that the task will be taking too many hours/weeks, consider whether the computer may be rebooted into a Debian or Ubuntu boot disc.

Handling previously discovered bad blocks

If it is suspected that there are some bad blocks that have been found previously, first use “ dumpe2fs -b /dev/devName >> listOfKnownBadBlocks ”. Then when running badblocks, add the “ -i listOfKnownBadBlocks ” command line parameter.

The “ dumpe2fs -b /dev/someHDD ” may show some blocks that are marked as bad.
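Put together, that flow might look like the following sketch (the device name and filenames are illustrative; the other badblocks options used here are discussed below):

dumpe2fs -b /dev/devName >> /tmp/listOfKnownBadBlocks-devName
badblocks -b 4096 -i /tmp/listOfKnownBadBlocks-devName -o /tmp/newListOfKnownBadBlocks-devName -n -s -v /dev/devName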

Handling the block size

The block size shouldn't matter too much when running badblocks on its own, unless some bad blocks are actually detected. In that case, it may be useful to have badblocks use the same block size as the commands that make a filesystem volume (e.g. mkfs, newfs, or (for Ext2) perhaps mke2fs) and the programs that test/repair filesystems (e.g. for Ext2 it may be fsck, fsck_ext2fs, fsck.ext2, or e2fsck). By using a common block size, the generated list of blocks may be used by those other programs. Note, however, that a common belief is that once a drive is physically bad enough that problems can be detected by software, its physical problems have a high likelihood of spreading to other sectors, so continuing to use such a drive may not be a good idea. A time when filesystems are being created may often be a relatively convenient time, compared to other times, to start using new hardware.

To take care of this, determine the desired block size. If there is an existing filesystem using Ext2 or one of its successor formats, the block size might be visible by using software that is designed to adjust filesystem options. Details may vary based on the software, so reading a manual page may be wise.

(The following details specific to filesystem formats may go to the filesystem formats section.)

Getting the block size of an FFS drive

As an example in OpenBSD, a command for an FFS volume (specified by device name) may be:

dumpfs -m /dev/DeviceName

Then look for the number right after the “ -b ” in the output. This is probably the same as what is shown by using:

dumpfs -m /dev/DeviceName | grep ^bsize
Getting the block size of an Ext2 drive

If a filesystem volume isn't mounted, specifying the device name may be needed.

As another example, for using software that came with e2fsprogs (for Ext2 and successor filesystem types), the following may work:

tune2fs -l /dev/DeviceName | grep "Block size"

Otherwise, the block size may be determined automatically by mke2fs: according to a (3rd party) man page for mke2fs, “block-size is heuristically determined by the file system size and the expected usage of the filesystem”. The “expected usage” of the filesystem may be impacted by using the -T command line parameter for mke2fs, and may involve using a blocksize line from the /etc/mke2fs.conf file.

Once the desired block size is determined, use -b blocksize as recommended above. Smartmon Tools guide to handling bad blocks says the block size is “normally 4096 bytes for ext2”.

Remembering problems

Using the “ -o newListofKnownBadBlocksForMyDev ” parameter will save the list of bad blocks so that it may be used later: by badblocks (when it is run at a later time), by mkfs (if that is used), and by e2fsck (which ideally does get used regularly).

Note: the list of bad blocks should be a unique file per device. It is likely best to customize the filename to note which device the file is for; if the device's name is /dev/sd0a, then a filename such as /tmp/sd0a may be most appropriate.

Suffering less, by increasing speed

The impact of the -c parameter can be major. A well-chosen value may substantially increase speed by making useful use of available memory, while a poorly-chosen value can substantially decrease speed if it unnecessarily causes virtual/swapped memory to be used. Some values could even cause the program to instantly error out.

At this time, this guide does not have any hard and fast rule on what to provide to the -c parameter. The desired value is likely to depend on the block size, since -c refers to a number of blocks. Calomel.org's guide to validating a hard drive suggests using -c 98304 combined with -b 4096 for a machine with a gig of RAM.

If there is 256 MB (268,435,456 bytes) of memory free (see Troubleshooting / as noted by top), a substantial increase can become available by using 268,431,360 of those bytes. To do that (or a few less), if the block size is 1024, simply reserve 65,535 of those blocks by using “ -c 65535 ”. If the block size is 4096, it may make sense to use a smaller count of the larger blocks, by using something like “ -c 16383 ”. If the number specified is too high, there may be a message such as:

badblocks: Cannot allocate memory while allocating buffers

(Perhaps this error was caused by exceeding a setting that can be viewed with ulimit -a?) (Admission: This text would be improved with further details to determine the optimum amount of memory to use. Presumably, using too much memory will just cause additional information to be read in some sort of loop without providing any noteworthy speed benefit; if there isn't a worthwhile speed benefit, using extra memory might just be wasteful, possibly depriving other software of the ability to use that memory effectively.)

The Disk Test & Health Monitoring web page says that too high a -c value causes badblocks to exit with an out-of-memory error. Worse, the web page notes that if the value is set too low “for a non-destructive-write-mode test, then it's possible for questionable blocks on an unreliable hard drive to be hidden by the effects of the hard disk track buffer.” (The accuracy of that has not been fully verified.)

A web page about badblocks suggests multiplying the amount of RAM by 3 and dividing by 32, which is the same as multiplying by 0.09375, or a bit under a tenth of the RAM.
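As a worked example of that rule of thumb (the free-memory figure is illustrative):

FREE_RAM=1073741824     # 1 GiB of free memory, in bytes
BLOCK_SIZE=4096
echo $(( FREE_RAM / 32 * 3 / BLOCK_SIZE ))   # prints 24576, a candidate value for -c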

Note: If speed is still too slow, consider adjusting this -c parameter further, making sure the disk isn't being used for other purposes, and/or consider making a change to use the right environment for the task (of disk checking).

Determine destruction level

Determine the type of test to use. There are three types of tests. If neither -n nor -w is used, then a read-only test is performed. Specifying -n does a non-destructive read/write test. This means the disk is written to (so there could be some data loss if the computer is shut down), but the test is intended to be non-destructive: the software will attempt to write the correct data back. This may be the slowest method of testing, but it provides the benefit of testing some write functionality, while making it unlikely that all data on the filesystem volume would need to be restored (from a pre-existing successful backup). Note that this “non-destructive” write test should not be used on a mounted volume! The filesystem drivers may not be expecting data to be overwritten (even temporarily) by another process.

Take careful note that using -w is destructive. However, if the data on the filesystem is about to be overwritten by the creation of a new filesystem volume, then it really doesn't matter if that data is destroyed. Using the destructive test is expected to be significantly faster than the non-destructive test (though in some test cases, the opposite has seemed to be true).
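As a side-by-side illustration of the three modes (the device name is an example; the read/write modes must not be used on a mounted volume):

badblocks -v /dev/devName         # read-only test (the default when neither -n nor -w is given)
badblocks -n -v /dev/devName      # non-destructive read/write test (typically the slowest)
badblocks -w -v /dev/devName      # DESTRUCTIVE write test: erases all data on the device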

Finally, adding -v -s will show more output.

In summary: leave off the “ -i listOfKnownBadBlocks ” if such a previously-made file doesn't exist, make sure the block size (specified after the -b) is as desired (if possible), and use something like the following example. Also, if all of the data (including the filesystem structure) may be erased, go ahead and change -n to -w for a speed increase.

sudo time badblocks -b 1024 -c 32767 -i listOfKnownBadBlocks -o newListofKnownBadBlocks -n -t random -s -v /dev/deviceObjOfAPartition
echo $?

If the above (ignoring the output from running the time command) shows:

badblocks: Cannot allocate memory while allocating buffers
$ echo $?
1
$

... then try reducing the number after “-c ”.

If the badblocks command is showing a timer, make sure the timer keeps increasing every second. If it shows 0.00% done, 0:00 elapsed or 0.00% done, 0:01 elapsed, and stays that way for seconds or minutes, then the entire scan is probably going unnecessarily slowly; a test could even take over a month. In that case, press Ctrl-C and then wait. (The system may be very slow to quit the badblocks program; even that might take a few minutes.) Then try reducing the -c value drastically: for instance, instead of -c 65535, try -c 64. When doing so, the elapsed time might start to update every five seconds or every two seconds, instead of many, many seconds between updates. If that happens, the substantial improvement is still not good enough; further improvement can be made, since the elapsed counter is able to increase every second.

Specifying a drive instead of a partition may be done if the desire is to test the entire drive. Often this is not desirable: statistically, most data will be in the partitions, and the partitions other than the system partition can be checked after a multi-tasking operating system is installed. That operating system may be of limited use while a large partition is unmounted for testing, but it may still conveniently allow some useful multitasking, such as configuring the operating system on the system/boot partition.

badblocks will show a percentage. That percentage is the percentage of the current “pass”. By default, -w will do four passes, using testing patterns of 0x55, then 0xaa, then 0xff, then 0x00. (Specifying a single pattern with -t, as the examples on this page do with -t random, results in a single write pattern. untested: If it is desired to do fewer passes, also see about using -p 1.)

e.g. of some results:

$ sudo time badblocks -b 4096 -c 64 -o /tmp/newListofKnownBadBlocksForsd0j -w -t random -s -v /dev/sd0j
Checking for bad blocks in read-write mode
From block 0 to 33554431
Testing with random pattern: done
Reading and comparing: done
Pass completed, 0 bad blocks found.
   589791.72 real       113.61 user      2111.53 sys

It might look like this took 589,791.72 seconds of wall-clock (“real”) time (there are 604,800 seconds in a week) to scan 128GB (33,554,432 blocks, since block number zero was scanned, at 4,096 bytes per block). Yup: that probably is accurate (for this particular scan that was done). If that seems slow, see the discussion about using the right environment.

Other disk testing options

The section on system-testing operating systems may describe specialized distributions for testing hardware. There might (or might not) be more software options as part of such distributions.

(Also, see the section about testing disks by testing filesystem volumes.)

A program called stressdisk appears to be another option for checking hard drives (and, according to the author, this program has also detected RAM errors).

Disk testing in DOS / Microsoft Windows

Perhaps see: AleGr MEMTEST version 2.00?

(Also, see the section about testing disks by testing filesystem volumes.)

[#fststdsk]: Testing filesystem volumes in order to test disk functionality

A popular method of effectively checking most areas of a disk (especially the areas that are likely to be heavily used) is to use software that tests the integrity of filesystem volumes. Operating systems have typically come with software to check filesystem volumes, which has historically been an advantage of this approach (because operating systems have often not been bundled with dedicated disk-checking software). For such details, see testing/repairing filesystem volumes.

Other details of checking disk status

To check if each hard drive in a RAID array is online, see the section about RAID.

[#popdskdi]: Information about disk failures

(This section may currently be a bit of a data dump, and may be substantially changed upon further review.)

The following may be rather informative about the topic of disk failures:

Intro to Usenix article about hard drives points to multiple formats, including:

(perhaps see these also...)

Claims: MTBF and bathtub

Wikipedia's “Hard disk drive failure” page: “Metrics of failures” section notes, “MTBF is conducted in laboratory environments in test chambers and is an important metric to determine the quality of a disk drive, but is designed to only measure the relatively constant failure rate over the service life of the drive” on the drives that remain after the period of “infant mortality” (which measures drives that die quite early on).

This may be disputed by: Everything You Know About Disks Is Wrong.

There was a thought that hard drives tend to follow a “bathtub” curve. That is, if drives are used for a period of time (perhaps 4-7 years?), the failure rate will start out fairly high because poorly manufactured drives die early on. (Those drives experience “infant mortality”.) Then, the drives that remain will generally work for quite a while, so the failure rate remains low. After some time, though, really old drives tend to be unreliable, so the failure rate gets higher again.

This may be disputed by Google's study about drives. See: Google Research: 2007 study on drives, which refers to Google's White Paper: Failure Trends in a Large Disk Drive Population (also mirrored as Google: Disk failures), which was released as part of the 5th Usenix Conference on “File and Storage Technologies” (“FAST”) 2007.

As noted by the section on fixed disks, AFR may be a better indicator than MTBF.

Google's white paper indicated, “Failure rates are known to be highly correlated with drive models, manufacturers and vintages” ... and the study's “results do not contradict this fact.” However, Google was not nice enough to point fingers and reveal just who makes bad drives. (The Google paper, section 3.1, notes that some “data are not directly useful in understanding the effects of disk age on failure rates”; a Gizmodo article cites that as a reason, although section 3.2 of the Google paper says, “in this paper, we do not show a breakdown of drives per manufacturer, model, or vintage due to the proprietary nature of these data.” Google might have been restricted by some NDAs (non-disclosure agreements).)

However, Backblaze was nicer (to the general public, though perhaps not to certain hard drive manufacturers). See: Backblaze blog on hard drive reliability, PC World review, Slashdot.

Misc ideas/info
[#hdsmtwmi]: S.M.A.R.T. with WMI

MSDN: Physical disk status is not OK discusses when “ WMIC PATH Win32_DiskDrive Get Status ” returns a value other than “OK”. Other possible values include “Pred Fail” (indicating that S.M.A.R.T. is predicting a failure) or “Degraded” (indicating a problem with a drive that is part of a RAID setup). Perhaps see also: MSDN: WMI Disk_Drive documentation about other statuses.
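For example, to show each physical drive's model next to its status (Model and Status are standard Win32_DiskDrive properties), a command like the following may be run from a Command Prompt:

WMIC PATH Win32_DiskDrive Get Model,Status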

MSDN: Physical disk status is not OK lists some possible values, and basically recommends the following troubleshooting: try again in 10 minutes; if the error persists, follow the recommendations of the drive manufacturer. The article mentions Windows Server 2003 and Windows Server 2008; this property also seemed to exist in Windows 7.

A quick check

See: S.M.A.R.T. status reporting (script version 1.1), which was created by doing some minor tinkering to code from surfasb's answer to Borek's SuperUser.com question about S.M.A.R.T. Run it with “ CScript smrtrp11.js ”. (Code is cc-wiki, per SuperUser.com terms.)

Some WMIC
WMIC /namespace:\\root\wmi PATH MSStorageDriver_FailurePredictStatus Get Active,PredictFailure,Reason
WMIC /namespace:\\root\wmi PATH MSStorageDriver_FailurePredictStatus Get Active,InstanceName,PredictFailure,Reason /FORMAT:LIST

In particular, out of that output, see the value for PredictFailure.

You may be able to get some other information. To check if there are more fields/properties, try running:

WMIC /namespace:\\root\wmi PATH MSStorageDriver_FailurePredictStatus Get /?
WMIC /namespace:\\root\wmi PATH MSStorageDriver_FailurePredictStatus Get /FORMAT:LIST | more

WHDC archived documentation on drives, as presented by the Wayback Machine @ Archive.org referred to 4 objects:

  • MSStorageDriver_FailurePredictStatus
  • MSStorageDriver_FailurePredictData (Read Failure Predict Data)
  • MSStorageDriver_FailurePredictEvent (Failure Predict Event)
  • MSStorageDriver_FailurePredictFunction (Perform Failure Predict Function)

i-programmer.info article “Disk Drive Dangers - SMART and WMI” notes that MSStorageDriver_ATAPISmartData “is only available in Vista and later.”

WMIC /namespace:\\root\wmi PATH MSStorageDriver_ATAPISmartData Get /?
WMIC /namespace:\\root\wmi PATH MSStorageDriver_ATAPISmartData Get | more

FixItScripts.com: SMART checks without smartmon... contains some VBScript.

For more details related to WMI, see: WMI.

Other resources/documentation that might be related/useful: MSDN: Clustering Service: CIM_ClusteringService class, MSDN: ManagedSystemElementClass, MSDN: CIM_LogicalDevice class (Common Information Model).

Frank Thomas's comment to Shaun Luttin's SuperUser.com question suggests looking at Reallocated Sector Count, Current Pending Sector count, and Raw Read Error Rate.

Posting(s)

thread : sub.mesa's post about some values.