This chapter describes how to manage Prestoserve under abnormal conditions. These conditions include cases when the system is shut down abnormally and when disks accelerated by Prestoserve encounter errors or failure.
Processor-specific information about recovering from system crashes and descriptions of any Prestoserve console commands are contained in the hardware documentation for your processor.
A normal (clean) shutdown occurs when the system is halted by using the shutdown, halt, or reboot command. If a normal shutdown occurs or if you unmount a device that was accelerated, the contents of the Prestoserve cache are flushed (moved) to the appropriate disks.
In addition, you can cleanly shut down Prestoserve by using the presto command with the -d or -D option before you halt a running system. The command flushes the Prestoserve cache, and Prestoserve enters the DOWN state. When Prestoserve is in the DOWN state, all requests are directly passed to the device drivers, and other forms of system shutdown do not affect system operation with respect to Prestoserve. Refer to Chapter 3 for more information about Prestoserve states.
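The clean shutdown sequence described above can be sketched as a small shell function. This is an illustration only, not a procedure from the product documentation: the function names run and clean_halt and the DRY_RUN variable are hypothetical, and the shutdown arguments (-h now) are an assumption about your system's shutdown command. Only the presto -d option comes from this chapter.

```shell
# Sketch only: flush the Prestoserve cache, then halt the system.
# run() prints each command and executes it only when DRY_RUN is unset.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

clean_halt() {
    # presto -d flushes the cache and puts Prestoserve in the DOWN state.
    run presto -d || return 1
    # Assumption: "shutdown -h now" halts the system on your platform.
    run shutdown -h now
}
```

Calling the function with DRY_RUN=1 set prints the two commands without executing them, which is a reasonable way to review the sequence before running it as root.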
An abnormal (unclean) shutdown results from a power or hardware failure, an operating system software failure, or a manual halt or restart of the system while Prestoserve is still in the UP state. After an abnormal shutdown, the Prestoserve cache may contain data that Prestoserve was unable to flush to disk. In this case, it is important to ensure that the cache data is not lost and does not corrupt your disks.
In most cases, after an abnormal shutdown, data in the Prestoserve cache is recovered automatically when you reboot; that is, the data is flushed to the appropriate disks. However, if you reboot a different kernel or change your system configuration, you may encounter problems recovering the cache data. The following sections describe how to handle cache data.
The Prestoserve cache usually contains data when Prestoserve is in the UP state. If your system shuts down abnormally, data may remain in the Prestoserve cache. A Prestoserve cache that contains data is referred to as a dirty cache.
Usually, if you reboot the system, the system startup procedure repairs file system inconsistencies. This process flushes the cache and moves the data to the appropriate disks. Therefore, Prestoserve can usually recover easily after an abnormal shutdown, and no user action is necessary.
However, if your system shut down abnormally, and then you changed your system or hardware configuration, you may encounter some problems when the system reboots. Prestoserve uses physical device numbers internally to identify data blocks. If you reconfigure your system or hardware after an abnormal shutdown, data in the Prestoserve cache may be flushed to the wrong device or lost, or file systems may be corrupted. This could happen in the following cases:
Note
To ensure that you can recover the Prestoserve cached data after an abnormal shutdown, do not reboot the generic kernel (genvmunix) if the target kernel (vmunix) will not boot. If you have changed device numbers, the generic kernel will not be aware of those changes and, as a result, when Prestoserve attempts to access the file system drivers to restore its cached data, the data may be lost or written to the wrong place. To avoid this problem, Digital recommends that you create a copy of your running target kernel with Prestoserve configured into it, and boot that copy if your target kernel is corrupted after an abnormal shutdown and cannot be booted.
If you want to reconfigure your system, you should ensure that no data is in the Prestoserve cache and shut down the system cleanly.
If you cannot recover the Prestoserve cache data when the system reboots, a diagnostic message is displayed. Prestoserve prompts you to confirm that you want to continue rebooting the system. You are given the option to do one of the following:
If you choose to continue rebooting, the system startup procedure checks the file systems and performs any corrections that it can make with certainty. During the reboot, you can note the extent of the disk data corruption. You may have to use a file system repair program such as the fsck command after the system reboots to repair any file system inconsistencies. You can then recover data by restoring file systems from backups, by rerunning programs, or by reentering data if necessary.
If the system was shut down abnormally, and you installed a new CPU board, the power-up diagnostics will indicate that the CPU board identification number does not match the Prestoserve cache identification number.
If you reboot the system and the Prestoserve cache contains data, you are given the option to do one of the following:
Usually, you can continue to reboot the system with no adverse effects.
If the Prestoserve hardware fails the power-up diagnostics, install new hardware. If the Prestoserve cache contained data when it failed, the data is lost.
You may have to use a file system repair program such as the fsck command after the system reboots to repair any file system inconsistencies. You can then recover data by restoring file systems from backups, by rerunning programs, or by reentering data if necessary.
If the Prestoserve cache contains data, and the Prestoserve hardware is moved to another system along with the disks, the power-up diagnostics will indicate that the CPU board identification number does not match the Prestoserve hardware identification number.
When you boot the system and the Prestoserve cache contains data, you are given the option to do one of the following:
Usually, you can continue to reboot the system with no adverse effects. To avoid any problems, you should shut down the system cleanly before moving the Prestoserve hardware.
The following sections describe how Prestoserve manages disk failures. Temporary disk failures are those that can be fixed without requiring major repair, such as a disk being off line or write protected. Serious disk failures (such as a disk head crash) entail significant repair and may cause data to be lost.
Because Prestoserve caches disk blocks, data used by an application may not be written to disk for some time. If a disk fails with Prestoserve enabled, the system will not notice the failure until Prestoserve attempts to flush its cache. When this occurs, Prestoserve enters the ERROR state and attempts to flush its entire cache immediately. If the cache is flushed successfully, Prestoserve leaves the ERROR state, and no other user action is necessary.
However, if the cache cannot be completely flushed, Prestoserve effectively becomes a read-only data repository, and subsequent writes that do not match blocks already in the Prestoserve cache are passed directly through to the actual disk driver.
When Prestoserve is in the ERROR state, new data written to a block already in the Prestoserve cache replaces the existing block within the cache. This block is then flushed synchronously to the disk to see if the error condition still exists. If the error still exists, the application receives the error from the failed write operation. If the write succeeds, Prestoserve leaves the ERROR state if it can successfully flush all of its buffers. The first time Prestoserve enters the ERROR state, a message similar to the following is displayed on the console terminal, listing the major and minor numbers of the actual device:
presto: error on dev (%d, %d)
A device-specific error message from the actual device driver may have been previously displayed. Note that any retries normally performed by a disk driver in an error condition are still performed for each I/O request by Prestoserve.
Prestoserve exits the ERROR state only when it can successfully flush its entire cache to the disk. It attempts to flush its cache only when a request is made to write a block that is already in the cache and that block is successfully written to disk. Requests to write blocks not already in the cache are passed directly to the actual disk driver. Thus, Prestoserve does not accelerate writes when it is in the ERROR state, and Prestoserve may remain in the ERROR state even after the disk problem is corrected if the cache data cannot be moved to disk.
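The handling of a single write request in the ERROR state can be summarized as a decision table. The following shell function is a hypothetical model written for illustration; it is not part of Prestoserve and does not call the presto command. The function name error_state_write, its three arguments, and the wording of its output are all assumptions.

```shell
# Hypothetical model of one write request while Prestoserve is in ERROR.
#   $1 = yes|no   is the block already in the Prestoserve cache?
#   $2 = ok|fail  result of the synchronous write of that block to disk
#   $3 = ok|fail  result of flushing the remaining cached buffers
error_state_write() {
    if [ "$1" = no ]; then
        # Blocks not in the cache bypass Prestoserve entirely.
        echo "passed to disk driver; state stays ERROR"
    elif [ "$2" = fail ]; then
        # The disk error persists, so the application sees the failure.
        echo "error returned to application; state stays ERROR"
    elif [ "$3" = fail ]; then
        # This block reached the disk, but other buffers did not.
        echo "block written; state stays ERROR"
    else
        # The whole cache was flushed, so the error condition is cleared.
        echo "block written; cache flushed; leaves ERROR"
    fi
}
```

For example, error_state_write yes ok fail models the case in which the disk problem has been corrected for one block but other cached data still cannot be moved to disk.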
If you can locate the cause of the I/O failure and fix it, reenable Prestoserve so it can verify that the error was corrected and exit the ERROR state. You can accomplish this by issuing the following command:
# presto -F
Rebooting the system also causes Prestoserve to flush its cache to the appropriate disks if they are available.
If you must replace a disk because of a major I/O failure that is not easily repaired, you can use the presto -R command, which attempts to flush all cached data and then destroys any data that cannot be written to disk. Before you replace a bad disk, use the presto -R command to ensure that you do not flush disk blocks logically belonging to the bad disk to the new disk device, thus corrupting the data on the new disk. However, if you install a new disk that contains no valid data, you can flush the cached data to it because there is no data on the disk to corrupt.
If there are disk errors but you want to continue running with the faulty disk disabled, perform the following steps:
1. Use the presto -R command to write as much of the Prestoserve cache data as possible to the appropriate disks, discard any data it could not write, purge the Prestoserve buffers, and disable Prestoserve.
2. Use the presto -u command to enable Prestoserve on the viable disks.
The Prestoserve ERROR state affects all accelerated disks, so you must disable the defective disk before reenabling Prestoserve on the viable disks. Refer to Chapter 3 for information about the presto command.
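The procedure for continuing with a faulty disk disabled can be sketched as a shell function. This is an illustration under stated assumptions, not a documented procedure: the names run and isolate_bad_disk and the DRY_RUN variable are hypothetical, and unmounting the file system is only one assumed way of disabling the defective disk. The presto -R and presto -u options are the ones described in this chapter.

```shell
# Sketch only. $1 is a hypothetical mount point on the defective disk.
# run() prints each command and executes it only when DRY_RUN is unset.
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

isolate_bad_disk() {
    if [ $# -ne 1 ]; then
        echo "usage: isolate_bad_disk mount-point" >&2
        return 2
    fi
    # Flush what can be written, discard the rest, disable Prestoserve.
    run presto -R || return 1
    # Assumption: unmounting takes the defective disk out of service.
    run umount "$1" || return 1
    # Reenable Prestoserve on the remaining accelerated disks.
    run presto -u
}
```

Because presto -R destroys cached data that cannot be written, it is prudent to run the function with DRY_RUN=1 first and confirm that the printed commands name the intended mount point.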