This chapter describes how to manage Prestoserve under abnormal conditions. These conditions include cases when the system is shut down abnormally and when disks accelerated by Prestoserve encounter errors or failure.
Processor-specific information about recovering from system crashes and descriptions of any Prestoserve console commands are contained in the hardware documentation for your processor.
4.1 Normal and Abnormal System Shutdowns
A normal (clean) shutdown occurs when the system is halted by using the shutdown, halt, or reboot command. If a normal shutdown occurs or if you unmount a device that was accelerated, the contents of the Prestoserve cache are flushed (moved) to the appropriate disks.
In addition, you can cleanly shut down Prestoserve by using the presto command with the -d or -D option before you halt a running system. The command flushes the Prestoserve cache, and Prestoserve enters the DOWN state. When Prestoserve is in the DOWN state, all requests are passed directly to the device drivers, and other forms of system shutdown do not affect system operation with respect to Prestoserve. Refer to Chapter 3 for more information about Prestoserve states.
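The sequence described above can be sketched as a small shell function. This is only an illustration: it assumes that presto -d with no further arguments is enough to reach the DOWN state on your system, and the RUN variable and the function name are hypothetical helpers, not part of Prestoserve. By default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: take Prestoserve DOWN before halting the system.
# RUN is a hypothetical dry-run switch. Left unset, each command is only
# printed; set RUN to the empty string to execute the commands for real.
RUN=${RUN-echo}

presto_down_then_halt() {
    # Flush the Prestoserve cache and enter the DOWN state.
    $RUN presto -d || return 1
    # Halt only after the flush has succeeded.
    $RUN halt
}

presto_down_then_halt
```

Stopping when presto -d fails keeps the system running while the cache may still hold unflushed data.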
An abnormal (unclean) shutdown results from a power or hardware failure, from an operating system software failure, or from manually halting or restarting the system while Prestoserve is still in the UP state. After an abnormal shutdown, the Prestoserve cache may contain data that Prestoserve was unable to flush to disk. In this case, it is important to ensure that the cache data is not lost and does not corrupt your disks.
In most cases, after an abnormal shutdown, data in the Prestoserve cache is recovered automatically when you reboot; that is, the data is flushed to the appropriate disks. However, if you boot a different kernel or change your system configuration, you may encounter problems recovering the cache data. The following sections describe how to handle cache data.
4.1.1 Recovering Cache Data After an Abnormal Shutdown
The Prestoserve cache usually contains data when it is in the UP state. If your system shuts down abnormally, data may remain in the Prestoserve cache. A Prestoserve cache that contains data is referred to as a dirty cache.
Usually, if you reboot the system, the system startup procedure repairs file system inconsistencies. This process flushes the cache and moves the data to the appropriate disks. Therefore, Prestoserve can usually recover easily after an abnormal shutdown, and no user action is necessary.
However, if your system shut down abnormally, and then you changed your system or hardware configuration, you may encounter some problems when the system reboots. Prestoserve uses physical device numbers internally to identify data blocks. If you reconfigure your system or hardware after an abnormal shutdown, data in the Prestoserve cache may be flushed to the wrong device or lost, or file systems may be corrupted. This could happen in the following cases:
You installed a kernel that has different disk device numbers than the kernel that was last used with Prestoserve.
You booted a non-Prestoserve kernel with disks that previously were accelerated.
You changed your device configuration.
You removed or added a disk controller.
Note
To ensure that you can recover the Prestoserve cached data after an abnormal shutdown, do not boot the generic kernel (genvmunix) if the target kernel (vmunix) will not boot. If you have changed device numbers, the generic kernel is not aware of those changes. As a result, when Prestoserve attempts to access the device drivers to restore its cached data, the data may be lost or written to the wrong place. To avoid this problem, Compaq recommends that you keep a copy of your running target kernel with Prestoserve configured into it, and boot that copy if your target kernel is corrupted after an abnormal shutdown and cannot be booted.
If you want to reconfigure your system, ensure that no data is in the Prestoserve cache and shut down the system cleanly.
If you cannot recover the Prestoserve cache data when the system reboots, a diagnostic message is displayed. Prestoserve prompts you to confirm that you want to continue rebooting the system. You are given the option to do one of the following:
Discard the Prestoserve cache data
Write the data to the intended disks
Halt the machine
If you choose to continue rebooting, the system startup procedure checks the file systems and makes any corrections that it can determine to be correct. During the reboot, you can note the extent of the disk data corruption. After the system reboots, use a file system repair program such as the fsck command to repair any remaining file system inconsistencies. You can then recover data by restoring file systems from backups, by rerunning programs, or by reentering data if necessary.
4.1.2 Recovering Cache Data After Replacing a CPU Board
If the system was shut down abnormally, and you installed a new CPU board, the power-up diagnostics may indicate that the CPU board identification number does not match the Prestoserve cache identification number.
If you reboot the system and the Prestoserve cache contains data, you are given the option to do one of the following:
Discard the Prestoserve cache data
Write the data to the intended disks
Halt the machine
Usually, you can continue to reboot the system with no adverse effects.
4.1.3 Handling Failed Prestoserve Hardware
If the Prestoserve hardware fails the power-up diagnostics, install new hardware. If the Prestoserve cache contained data when it failed, the data is lost.
Use a file system repair program such as the fsck command after the system reboots to repair any file system inconsistencies. You can then recover data by restoring file systems from backups, by rerunning programs, or by reentering data if necessary.
4.1.4 Moving the Prestoserve Hardware
If the Prestoserve cache contains data, and the Prestoserve hardware is moved to another system along with the disks, the power-up diagnostics will indicate that the CPU board identification number does not match the Prestoserve hardware identification number.
When you boot the system and the Prestoserve cache contains data, you are given the option to do one of the following:
Discard the Prestoserve cache data
Write the data to the intended disks
Halt the machine
Usually, you can continue to reboot the system with no adverse effects. To avoid any problems, shut down the system cleanly before moving the Prestoserve hardware.
4.2 Disk Failures
The following sections describe how Prestoserve manages disk failures. Temporary disk failures are those that can be fixed without requiring major repair, such as a disk being offline or write protected. Serious disk failures (such as a disk head crash) entail significant repair and may cause data to be lost.
4.2.1 Temporary Disk Failures
Because Prestoserve caches disk blocks, data used by an application may not be written to disk for some time. If a disk fails with Prestoserve enabled, the system will not notice the failure until Prestoserve attempts to flush its cache. When this occurs, Prestoserve enters the ERROR state and attempts to flush its entire cache immediately. If the cache is flushed successfully, Prestoserve leaves the ERROR state, and no other user action is necessary.
However, if the cache cannot be completely flushed, Prestoserve effectively becomes a read-only data repository, and subsequent writes that do not match blocks already in the Prestoserve cache are passed directly through to the actual disk driver.
When Prestoserve is in the ERROR state, new data written to a block already in the Prestoserve cache replaces the existing block within the cache. This block is then flushed synchronously to the disk to see if the error condition still exists. If the error still exists, the application receives the error from the failed write operation. If the write succeeds, Prestoserve leaves the ERROR state if it can successfully flush all of its buffers.
The first time Prestoserve enters the ERROR state, a message similar to the following is displayed on the console terminal, listing the major and minor numbers of the actual device:
presto: error on dev (%d, %d)
A device-specific error message from the actual device driver may have been previously displayed. Note that any retries normally performed by a disk driver in an error condition are still performed for each I/O request by Prestoserve.
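When you track down which disk is affected, it can help to pull the two device numbers out of that console message. The following is a minimal sketch, assuming the message has been captured as a line of text in exactly the format shown above; the function name and the sample numbers 8 and 3 are made up for illustration.

```shell
#!/bin/sh
# Sketch: extract "major minor" from a Prestoserve console error message.
# parse_presto_error is a hypothetical helper, not a Prestoserve command.
parse_presto_error() {
    # $1 is one line of console output, for example:
    #   presto: error on dev (8, 3)
    # Lines that do not match the pattern produce no output.
    printf '%s\n' "$1" |
        sed -n 's/.*presto: error on dev (\([0-9][0-9]*\), *\([0-9][0-9]*\)).*/\1 \2/p'
}

parse_presto_error 'presto: error on dev (8, 3)'   # prints: 8 3
```

The resulting pair can then be compared with the major and minor numbers that ls -l reports for the device special files on your system.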
Prestoserve exits the ERROR state only when it can successfully flush its entire cache to the disk. It attempts to flush its cache only when a request is made to write a block that is already in the cache and that block is successfully written to disk. Requests to write blocks not already in the cache are passed directly to the actual disk driver. Thus, Prestoserve does not accelerate writes when it is in the ERROR state, and Prestoserve may remain in the ERROR state even after the disk problem is corrected if the cache data cannot be moved to disk.
If you can locate the cause of the I/O failure and fix it, use the following command to reenable Prestoserve so it can verify that the error was corrected and exit the ERROR state:
# presto -F
Rebooting the system also causes Prestoserve to flush its cache to the appropriate disks if they are available.
4.2.2 Serious Disk Failures
If you must replace a disk because of a major I/O failure that is not easily repaired, you can use the presto -R command, which attempts to flush all cached data and then destroys any data that cannot be written to disk. Run the presto -R command before you replace the bad disk so that disk blocks logically belonging to the bad disk are not flushed to the new disk device, which would corrupt the data on the new disk. However, if you install a new disk that contains no valid data, you can flush the cached data to it because there is no data on the disk to corrupt.
If there are disk errors but you want to continue running with the faulty disk disabled, perform the following steps:
Use the presto -R command to write as much of the Prestoserve cache data as possible to the appropriate disks, discard any data it could not write, purge the Prestoserve buffers, and disable Prestoserve.
Unmount the bad disk.
Use the presto -u command to enable Prestoserve on the viable disks.
The Prestoserve ERROR state affects all accelerated disks, so you must disable the defective disk before reenabling Prestoserve on the viable disks.
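The three steps above can be sketched as one shell function. This is an illustration under stated assumptions: it assumes that presto -u with no further arguments reenables acceleration on the remaining disks, and the RUN variable, the function name, and the mount point /mnt/baddisk are hypothetical. By default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: keep running with a faulty disk taken out of service.
# RUN is a hypothetical dry-run switch. Left unset, each command is only
# printed; set RUN to the empty string to execute the commands for real.
RUN=${RUN-echo}

isolate_bad_disk() {
    bad_mount=$1    # mount point of the failed disk (hypothetical value)
    # Step 1: flush what can be written, discard the rest, disable Prestoserve.
    $RUN presto -R || return 1
    # Step 2: unmount the bad disk so it is no longer accelerated.
    $RUN umount "$bad_mount" || return 1
    # Step 3: reenable Prestoserve on the viable disks.
    $RUN presto -u
}

isolate_bad_disk /mnt/baddisk
```

Each step runs only if the previous one succeeded, which preserves the required order: the defective disk is disabled before Prestoserve is reenabled.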
Refer to Chapter 3 for information about the presto command.