This chapter describes how to manage Prestoserve under abnormal conditions. These conditions include cases when the system is shut down abnormally and when disks accelerated by Prestoserve encounter errors or failure.
Processor-specific information about recovering from system crashes and descriptions of any Prestoserve console commands are contained in the hardware documentation for your processor.
A normal (clean) shutdown occurs when the system is halted by using the shutdown, halt, or reboot command.
If a normal shutdown occurs
or if you unmount a device that was accelerated, the contents of the Prestoserve
cache are flushed (moved) to the appropriate disks.
In addition, you can cleanly shut down Prestoserve by using the presto command with the -d or -D option before you halt a running system.
The command flushes the Prestoserve cache, and Prestoserve enters the DOWN state.
When Prestoserve is in the DOWN state, all requests are passed directly to the device drivers, so any other form of system shutdown has no effect on Prestoserve.
Refer to Chapter 3 for more information about Prestoserve states.
An abnormal (unclean) shutdown results from a power or hardware failure, an operating system software failure, or a manual halt or restart of the system while Prestoserve is still in the UP state.
After an abnormal shutdown, the Prestoserve cache may contain data that Prestoserve was unable to flush to disk.
In this case, it is important to ensure that the cache data is not lost and does not corrupt your disks.
In most cases, after an abnormal shutdown, data in the Prestoserve cache is recovered automatically when you reboot; that is, the data is flushed to the appropriate disks. However, if you reboot a different kernel or change your system configuration, you may encounter problems recovering the cache data. The following sections describe how to handle cache data.
The Prestoserve cache usually contains data when Prestoserve is in the UP state.
If your system shuts down abnormally, data may remain in the Prestoserve cache.
A Prestoserve cache that contains data is referred to as a dirty cache.
Usually, if you reboot the system, the system startup procedure repairs file system inconsistencies. This process flushes the cache and moves the data to the appropriate disks. Therefore, Prestoserve can usually recover easily after an abnormal shutdown, and no user action is necessary.
However, if your system shut down abnormally, and then you changed your system or hardware configuration, you may encounter some problems when the system reboots. Prestoserve uses physical device numbers internally to identify data blocks. If you reconfigure your system or hardware after an abnormal shutdown, data in the Prestoserve cache may be flushed to the wrong device or lost, or file systems may be corrupted. This could happen in the following cases:
You installed a kernel that has different disk device numbers than the kernel that was last used with Prestoserve.
You booted a non-Prestoserve kernel with disks that previously were accelerated.
You changed your device configuration.
You removed or added a disk controller.
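The cases above share one cause: cached blocks are identified by device number, not by the physical disk. The following Python sketch is a toy model of that idea only. It assumes that a cached block is keyed by a (major, minor) device number plus a block number; the device numbers, disk names, and the flush function are hypothetical and do not represent Prestoserve's internal data structures.

```python
from typing import Dict, List, Tuple

DevNum = Tuple[int, int]        # (major, minor) device number
CacheKey = Tuple[DevNum, int]   # (device number, block number)


def flush(cache: Dict[CacheKey, bytes], device_table: Dict[DevNum, str]):
    """Write each cached block to whichever disk currently owns its device number."""
    written: Dict[str, Dict[int, bytes]] = {}
    lost: List[CacheKey] = []
    for (dev, block), data in cache.items():
        disk = device_table.get(dev)
        if disk is None:
            # No device has this number any more: the block cannot be written.
            lost.append((dev, block))
        else:
            written.setdefault(disk, {})[block] = data
    return written, lost


# One dirty block that was cached for device (8, 0) before the crash.
cache = {((8, 0), 100): b"payroll"}

# Hypothetical device tables: before the crash, after renumbering, after removal.
old_table = {(8, 0): "disk_a", (8, 1): "disk_b"}
renumbered_table = {(8, 0): "disk_b", (8, 1): "disk_a"}
removed_table = {(8, 1): "disk_b"}

same_written, same_lost = flush(cache, old_table)
moved_written, moved_lost = flush(cache, renumbered_table)
gone_written, gone_lost = flush(cache, removed_table)
```

With the unchanged table the block reaches disk_a. With the renumbered table the same block is written to disk_b, which is the wrong disk. With the device removed, the block cannot be written at all.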
Note
To ensure that you can recover the Prestoserve cached data after an abnormal shutdown, do not boot the generic kernel (genvmunix) if the target kernel (vmunix) will not boot. If you have renumbered devices, the generic kernel will not be aware of those changes and, as a result, when Prestoserve attempts to access the file system drivers to restore its cached data, the data may be lost or written to the wrong place. To avoid this problem, Digital recommends that you keep a copy of your running target kernel with Prestoserve configured into it, and boot that copy if your target kernel is corrupted after an abnormal shutdown and cannot be booted.
If you want to reconfigure your system, you should ensure that no data is in the Prestoserve cache and shut down the system cleanly.
If you cannot recover the Prestoserve cache data when the system reboots, a diagnostic message is displayed. Prestoserve prompts you to confirm that you want to continue rebooting the system. You are given the option to do one of the following:
Discard the Prestoserve cache data
Write the data to the intended disks
Halt the machine
If you choose to continue rebooting, the system startup procedure checks the file systems and makes only those corrections that it can make with certainty.
During the reboot, you can note the extent of the disk data corruption.
You may have to use a file system repair program such as the fsck command after the system reboots to repair any file system inconsistencies.
You can then recover data by restoring file systems from backups, by rerunning programs, or by reentering data if necessary.
If the system was shut down abnormally, and you installed a new CPU board, the power-up diagnostics will indicate that the CPU board identification number does not match the Prestoserve cache identification number.
If you reboot the system and the Prestoserve cache contains data, you are given the option to do one of the following:
Discard the Prestoserve cache data
Write the data to the intended disks
Halt the machine
Usually, you can continue to reboot the system with no adverse effects.
If the Prestoserve hardware fails the power-up diagnostics, install new hardware. If the Prestoserve cache contained data when it failed, the data is lost.
You may have to use a file system repair program such as the fsck command after the system reboots to repair any file system inconsistencies.
You can then recover data by restoring file systems from backups, by rerunning programs, or by reentering data if necessary.
If the Prestoserve cache contains data, and the Prestoserve hardware is moved to another system along with the disks, the power-up diagnostics will indicate that the CPU board identification number does not match the Prestoserve hardware identification number.
When you boot the system and the Prestoserve cache contains data, you are given the option to do one of the following:
Discard the Prestoserve cache data
Write the data to the intended disks
Halt the machine
Usually, you can continue to reboot the system with no adverse effects. To avoid any problems, you should shut down the system cleanly before moving the Prestoserve hardware.
The following sections describe how Prestoserve manages disk failures. Temporary disk failures are those that can be fixed without requiring major repair, such as a disk being off line or write protected. Serious disk failures (such as a disk head crash) entail significant repair and may cause data to be lost.
Because Prestoserve caches disk blocks, data used by an application may not be written to disk for some time.
If a disk fails with Prestoserve enabled, the system will not notice the failure until Prestoserve attempts to flush its cache.
When this occurs, Prestoserve enters the ERROR state and attempts to flush its entire cache immediately.
If the cache is flushed successfully, Prestoserve leaves the ERROR state, and no other user action is necessary.
However, if the cache cannot be completely flushed, Prestoserve effectively becomes a read-only data repository, and subsequent writes that do not match blocks already in the Prestoserve cache are passed directly through to the actual disk driver.
When Prestoserve is in the ERROR state, new data written to a block already in the Prestoserve cache replaces the existing block within the cache.
This block is then flushed synchronously to the disk to see if the error condition still exists.
If the error still exists, the application receives the error from the failed write operation.
If the write succeeds, Prestoserve leaves the ERROR state if it can successfully flush all of its buffers.
The first time Prestoserve enters the ERROR state, a message similar to the following is displayed on the console terminal, listing the major and minor numbers of the actual device:
presto: error on dev (%d, %d)
A device-specific error message from the actual device driver may have been previously displayed. Note that any retries normally performed by a disk driver in an error condition are still performed for each I/O request by Prestoserve.
Prestoserve exits the ERROR state only when it can successfully flush its entire cache to the disk.
It attempts to flush its cache only when a request is made to write a block that is already in the cache and that block is successfully written to disk.
Requests to write blocks not already in the cache are passed directly to the actual disk driver.
Thus, Prestoserve does not accelerate writes when it is in the ERROR state, and Prestoserve may remain in the ERROR state even after the disk problem is corrected if the cache data cannot be moved to disk.
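The write handling described above can be summarized in a small Python model. This is a simplified sketch under stated assumptions, not Prestoserve code: it models a single disk, treats a failed flush and the immediate retry as one step, and uses hypothetical names (Disk, PrestoModel, DiskError) that do not correspond to any real interface.

```python
class DiskError(Exception):
    """Raised by the toy disk when a write fails."""


class Disk:
    def __init__(self):
        self.blocks = {}
        self.failed = False  # set to True to simulate an I/O failure

    def write(self, block, data):
        if self.failed:
            raise DiskError("write failed on block %d" % block)
        self.blocks[block] = data


class PrestoModel:
    def __init__(self, disk):
        self.disk = disk
        self.cache = {}      # dirty blocks not yet on disk
        self.state = "UP"

    def _flush_all(self):
        """Try to write every cached block; stop at the first failure."""
        for block in list(self.cache):
            try:
                self.disk.write(block, self.cache[block])
            except DiskError:
                return False
            del self.cache[block]
        return True

    def flush(self):
        """Flush the cache; the state is ERROR until the whole cache is on disk."""
        self.state = "UP" if self._flush_all() else "ERROR"
        return self.state

    def write(self, block, data):
        if self.state == "UP":
            self.cache[block] = data       # accelerated write
            return "cached"
        if block not in self.cache:
            self.disk.write(block, data)   # not accelerated; errors reach the caller
            return "passed through"
        # Block is already cached: replace it, then write it synchronously.
        self.cache[block] = data
        self.disk.write(block, data)       # raises DiskError if the disk still fails
        del self.cache[block]
        self.flush()                       # leave ERROR only if everything flushes
        return "written"


disk = Disk()
presto = PrestoModel(disk)
presto.write(1, b"a")
presto.write(2, b"b")

disk.failed = True
state_after_failed_flush = presto.flush()   # flush fails, model enters ERROR

disk.failed = False                         # the disk problem is corrected
passthrough_result = presto.write(3, b"c")  # block 3 is not cached
state_after_passthrough = presto.state      # model has not retried the flush yet

rewrite_result = presto.write(1, b"a2")     # block 1 is cached, triggers the retry
state_after_rewrite = presto.state
```

In this model, correcting the disk problem does not by itself clear the ERROR state. The state clears only after a write to an already cached block succeeds and the remaining cached blocks are flushed.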
If you can locate the cause of the I/O failure and fix it, reenable Prestoserve so it can verify that the error was corrected and exit the ERROR state.
You can accomplish this by issuing the following command:
# presto -F
Rebooting the system also causes Prestoserve to flush its cache to the appropriate disks if they are available.
If you must replace a disk because of a major I/O failure that is not easily repaired, you can use the presto -R command, which attempts to flush all cached data and then destroys any data that cannot be written to disk.
Use the presto -R command before you replace a bad disk. Otherwise, disk blocks that logically belong to the bad disk could be flushed to the new disk device, corrupting the data on the new disk.
However, if you install a new disk that contains no valid data, you can flush the cached data to it because there is no data on the disk to corrupt.
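The effect of presto -R on the cache, as described above, can be pictured with the following Python sketch. It is a toy model only: the reset_cache function, the cache layout, and the device names rz1 and rz2 are hypothetical and chosen for illustration.

```python
def reset_cache(cache, failed_devices):
    """Flush what can be written, discard the rest, and leave the cache empty."""
    flushed = {}
    discarded = []
    for (device, block), data in cache.items():
        if device in failed_devices:
            # The block belongs to a disk that cannot be written: it is destroyed.
            discarded.append((device, block))
        else:
            flushed[(device, block)] = data
    cache.clear()
    return flushed, discarded


# Two dirty blocks, one for a healthy disk and one for the failed disk.
cache = {("rz1", 10): b"x", ("rz2", 20): b"y"}
flushed, discarded = reset_cache(cache, failed_devices={"rz2"})
```

After the call, the block for the failed disk no longer exists in the cache, so it cannot later be written to a replacement disk that takes over the same device number.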
If there are disk errors but you want to continue running with the faulty disk disabled, perform the following steps:
Use the presto -R command to write as much of the Prestoserve cache data as possible to the appropriate disks, discard any data it could not write, purge the Prestoserve buffers, and disable Prestoserve.
Unmount the bad disk.
Use the presto -u command to enable Prestoserve on the viable disks.
The Prestoserve ERROR state affects all accelerated disks, so you must disable the defective disk before reenabling Prestoserve on the viable disks.
Refer to Chapter 3 for information about the presto command.