Saving Data

This tutorial is largely about backups. Related topics include data recovery (recovering files by trying to “undelete” data) and downtime reduction (such as RAID).

This tutorial is largely a collection of thoughts from someone who has set up backups on quite a few systems. It does not (yet) extensively discuss some important topics, such as backup generations (grandfather-father-son, Tower of Hanoi) or the details of choosing a backup method (full, differential, or incremental). For now, it simply serves as a starting point.

General topic: backups

There are several different types of backups. For instance, a user (possibly a system administrator) might copy a file before making modifications to the file. This might be done simply to be able to quickly restore the file in case the changes end up breaking something. This is the strategy used by the cpytobak program.

However, this tutorial is about doing things on a somewhat larger scale.
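As a minimal sketch of the copy-before-editing idea, the following Python function copies a file to a timestamped sibling before any changes are made. The function name copy_to_bak and the naming scheme for the copy are assumptions made for illustration; this is not a description of how the cpytobak program itself works.

```python
import shutil
import time
from pathlib import Path


def copy_to_bak(path):
    """Copy `path` to a timestamped sibling file and return the copy's path.

    The ".<timestamp>.bak" suffix is a hypothetical naming choice.
    """
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = source.with_name(f"{source.name}.{stamp}.bak")
    # copy2 also tries to preserve the modification time and permission bits.
    shutil.copy2(source, backup)
    return backup
```

Calling copy_to_bak on a configuration file just before editing it leaves a copy next to the original, which can be copied back if the edit breaks something. Two calls within the same second would produce the same name, so the second copy would overwrite the first.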

Purpose, and comparison to other technologies

The goal here is to be able to provide data as it existed some time ago. One feature to seek is multiple points in time to which data may be restored. RAID does not provide this feature. A method involving data synchronization is also unlikely to provide this feature at all. (The “feature” might be available, for a single older version, while data synchronization is not complete. Once the synchronization is up to date, the old version is presumably lost.)

Determining what to back up

Identify what data people need.

One way to determine what data is important is to check the network for any existing backup hardware, or configured backup software. The data covered by a prior backup system is likely data that was, at one time, considered to be important. If a previous backup implementation is outdated, its configuration might not reveal all of the current details about what needs to be backed up. However, it may be an excellent starting point.

First and foremost, data that needs to be backed up will include data that people use regularly. One hint may be to consider what software has been installed. Check for configurations of common setups, including SMB file sharing, SQL server software, and other services (especially network services) such as an E-Mail server and a web server. Files which have recent dates may also be an indication of data that people are using. One possible exception may be logs: some logs will be worth backing up, while other logs might not need to be backed up.

Another type of data that may need to be backed up is data that an organization is required to keep due to regulatory requirements. Although such data might not be used regularly, there may be a reason why such data is valuable enough to be kept.
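The hint about recent file dates can be checked with a short script. The following is a rough sketch in Python; the function name recently_modified and the 30-day default are assumptions chosen for illustration, and a modification date is only a hint about usage, not proof of it.

```python
import os
import time


def recently_modified(root, days=30, now=None):
    """Return paths under `root` modified within the last `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds in a day
    found = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                if os.path.getmtime(path) >= cutoff:
                    found.append(path)
            except OSError:
                # The file vanished or cannot be read; skip it.
                continue
    return sorted(found)
```

Running this against a shared folder gives a first list of files that people have probably been working with, which can then be compared against what staff members say they use.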

Finally, it will be good to check with people who work with the data. The exact approach may vary between organizations, but it will often be useful to consult an organization's management, as well as a representative from key departments such as the finance/bookkeeping department. It may also be worthwhile to check various positions, such as seeing how a person performs tasks when a purchase is made. The best time to ask people about such things might be after an initial investigation, so that the person asking may sound knowledgeable while interacting with the staff members. However, do complete this portion of the research before offering any sort of finalized price/quote to an organization's management.

Determine where the data will be stored

Make sure that the destination can store at least one copy of the data. If multiple points of time are required, each additional point could, in theory, require another amount of space equal to the data that is being backed up. (For instance, a 40 GB chunk of data needs at least 40 GB of space on the destination to back up one day. Backing up a second day's worth of data might require 80 GB of space (40 GB times two days). However, using techniques such as a “differential” backup might allow a second day's worth of data to be stored using only about 4 GB more. This can be the case if about 36 GB of data will be identical between the two days, and if about 4 GB is likely to be updated.) The exact amount of space needed may vary.
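The arithmetic above can be written as a small Python function. This is only a sketch: the name space_needed_gb is hypothetical, and it assumes that every day after the first adds the same fixed amount of changed data, which real data rarely does.

```python
def space_needed_gb(data_gb, days, changed_gb=None):
    """Estimate destination space (in GB) for `days` restore points.

    With changed_gb=None, every day is stored as a full copy.
    Otherwise the first day is a full copy and each later day is
    assumed (a simplification) to add only `changed_gb`.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    if changed_gb is None:
        return data_gb * days
    return data_gb + changed_gb * (days - 1)
```

With the figures from the example, space_needed_gb(40, 2) gives 80 for two full copies, while space_needed_gb(40, 2, changed_gb=4) gives 44.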

Determine additional requirements

What sort of encryption will be used?

Select some software to use for backup

This should typically be done after identifying what data needs to be backed up. Some data may be able to be backed up using specialized software, and that software may provide some advantages. For example, some backup software has been known to come with a special module to interact with an E-Mail server. This special module allowed the backup software to back up E-Mail while the E-Mail server was still running, which greatly reduced any perceived downtime. For some commercial backup software, such a module might be considered an enhanced feature which has a related cost/price. Knowing what data needs to be backed up will help to determine what software is needed for backup.

Obviously, software compatibility is required. The operating system on the machine that stores the data may impact what software, or version of the software, gets used to back up the data. That operating system may be less of an issue if the data is getting backed up by another computer that accesses the data over the network.

Back up some data

If there is a large amount of data, then simply backing up a small portion of data may provide some useful information, such as whether the backup hardware is working and how fast the data is being transmitted. Note that some backup software might have rather extensive startup or completion times. Those times might not be proportional to the amount of data, meaning that they might not take five times as long when five times as much data is being backed up. See if the software provides an actual data transfer rate, and use that to estimate how long a full backup may take.
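The estimate described above can be sketched in Python as follows. The function name and its parameters are assumptions made for illustration; the startup and completion time is treated as a fixed overhead that has been measured or guessed separately.

```python
def estimate_full_backup_seconds(sample_bytes, sample_seconds, total_bytes,
                                 fixed_overhead_seconds=0.0):
    """Estimate the duration of a full backup from a small test backup.

    `fixed_overhead_seconds` is the startup/completion time that does
    not grow with the amount of data (an assumed, separately measured
    value). It is removed before computing the transfer rate.
    """
    transfer_seconds = sample_seconds - fixed_overhead_seconds
    if sample_bytes <= 0 or transfer_seconds <= 0:
        raise ValueError("sample must transfer data in positive time")
    rate = sample_bytes / transfer_seconds  # bytes per second
    return fixed_overhead_seconds + total_bytes / rate
```

For example, if a test backup of 1000 units of data took 160 seconds, of which 60 seconds were startup and completion time, then five times as much data is estimated at 560 seconds rather than 800 seconds.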

Determine what happens when a backup succeeds. Is there a clear log that indicates success?

Try to perform a backup that may fail. (Specify that a file needs to be backed up, but then make that file unavailable. Or make the destination unavailable. Or forcibly quit the backup program using the operating system. Just make sure that the process taken is safe. For some hardware, pressing the eject button might cause the drive to safely become unavailable, and then eject removable media. Removing power from an external drive might also work with some hardware. For other hardware, either of those processes might be less safe, and might lead to some damage.) What happens when the backup fails? As an example of a good answer, a backup program might write a log entry every time it starts a backup. If the backup program is terminated, then the next time a backup is started, the program may notice a log entry showing that a backup was started but not completed. When that happens, the backup program might send an E-Mail to an address that a human will notice.
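The “started but not completed” check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the behavior of any particular backup product: the function names, the JSON log format, and the single-record log file are all hypothetical, and sending the E-Mail is left to the caller.

```python
import json
import time
from pathlib import Path


def begin_backup(log_path):
    """Write a start marker; return the previous record if it never finished."""
    log = Path(log_path)
    unfinished = None
    if log.exists():
        previous = json.loads(log.read_text())
        if previous.get("finished") is None:
            unfinished = previous
    log.write_text(json.dumps({"started": time.time(), "finished": None}))
    return unfinished


def end_backup(log_path):
    """Mark the current backup as completed."""
    log = Path(log_path)
    record = json.loads(log.read_text())
    record["finished"] = time.time()
    log.write_text(json.dumps(record))
```

A backup script would call begin_backup first. If it returns a record instead of None, the previous run was interrupted, and that is the moment to notify a human. The script would call end_backup only after the backup has really completed.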

Restoring data

Have a good idea of how data is restored.

Additional considerations

What is the best time to perform a backup? (One popular answer is: in the middle of the night, when people are not using the network. However, if two backups are supposed to run each day, then running them both at night probably won't be nearly as useful.)

How many points of time can be used to restore? For instance, if removable media can store three and a half copies of the data, using a “full backup” method might allow three copies of the data. However, using a “differential” method might allow seven or twenty copies of the data. (The actual number of copies might vary greatly depending on some details of the exact data, including how often some data gets updated.) The person handling backups may then be able to restore data from a longer period of time. A person might say, “I changed a file on the Monday two weeks ago. Can you restore that version of the file?” With so many different points of time being supported, the answer may be “yes”.

Is backup size reduced by using a “data de-duplication” feature? Does such a feature significantly reduce space, or does it mostly result in busier hardware and increased “wear and tear” while not providing any real benefit?
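As a rough sketch of that comparison, the following Python function counts how many restore points fit on a destination. The name restore_points is hypothetical, and the calculation assumes that every point after the first full copy adds the same amount of changed data, which is a simplification.

```python
def restore_points(capacity_gb, full_gb, changed_gb=None):
    """Count restore points that fit in `capacity_gb`.

    With changed_gb=None, every point is a full copy. Otherwise one
    full copy is stored and each further point is assumed to need
    only `changed_gb` (a simplification).
    """
    if changed_gb is None:
        return int(capacity_gb // full_gb)
    if capacity_gb < full_gb:
        return 0
    return 1 + int((capacity_gb - full_gb) // changed_gb)
```

With 40 GB of data and media holding 140 GB (three and a half copies), restore_points(140, 40) gives 3. If each later point needs 16 GB, restore_points(140, 40, changed_gb=16) gives 7, and if each needs only 5 GB the result is 21.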
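One way to get a first impression before paying for such a feature is to measure how much of the data consists of exact duplicate files. The Python sketch below does that; the function name dedup_estimate is hypothetical, and it only detects whole-file duplicates, whereas de-duplication features commonly work on smaller blocks, so the real savings may be larger.

```python
import hashlib
import os


def dedup_estimate(root):
    """Return (total_bytes, unique_bytes) counting identical files once."""
    seen = set()
    total = 0
    unique = 0
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            digest = hashlib.sha256()
            size = 0
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
                    size += len(chunk)
            total += size
            key = digest.digest()
            if key not in seen:
                seen.add(key)
                unique += size
    return total, unique
```

If unique_bytes is close to total_bytes, whole-file de-duplication is unlikely to save much space for this data.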

Is data compression desirable?
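Whether compression is worthwhile depends heavily on the data. As a hedged illustration, the Python sketch below measures how well a sample of data compresses using the standard zlib module. The function name compression_ratio is hypothetical, and the compression method used by a given backup program may differ, so the result is only an indication.

```python
import zlib


def compression_ratio(data, level=6):
    """Return compressed size divided by original size for `data` (bytes).

    Values well below 1.0 suggest compression saves space; values near
    or above 1.0 suggest it mostly costs processing time.
    """
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)
```

Text files and many database dumps tend to give low ratios, while data that is already compressed (such as many image, audio, and video formats) tends to give a ratio close to 1.0.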