This chapter discusses the Component Indictment and Automatic Deallocation facilities. Component indictment identifies system components that have a likelihood of future (potentially serious) failure based on a history of correctable non-fatal errors. This is done by analyzing specific failure patterns either immediately or over an extended period of time. The Automatic Deallocation facility provides the ability to automatically take an indicted component out of service.
This chapter discusses the following topics:
Indictment of CPUs and memory pages (Section 3.1)
Automatic Deallocation of CPUs and memory pages (Section 3.2)
Component indictment is a proactive error notification from a fault-analysis utility. The component indictment process is intended to identify components that are incurring a high or abnormal incidence of correctable errors, so that these components can be removed or repaired before they cause a system panic.
The following are requirements for component indictment support:
AlphaServer GS80/GS160/GS320
Compaq Analyze V4.0 (included as part of the Web-Based Enterprise Services V4.0 product)
A properly initialized binlog file (/var/adm/binary.errlog); see binlogd(8)
The binlog error log must be maintained correctly for component indictment to function. The correct procedure for cleaning the binlog file is documented in binlogd(8). If you simply move the error log file and create a new file using touch without following the correct procedure, component indictment will not work as expected.
An external analysis program (currently Compaq Analyze) can notify the operating system when a component has encountered enough correctable errors to indicate that the component may fail soon. Upon receipt of the indictment notification, the operating system posts an indictment event using the Event Management (EVM) subsystem. Administrators should investigate the source of any reported indictments and replace the indicted components as appropriate based on collaborative discussion with their service provider. Compaq Analyze currently supports indictments for CPUs and memory locations. Compaq Analyze supports EV6 and later processors.
Because indictment notifications are posted to the Event Management (EVM) subsystem, any interested application may subscribe to indictment events and take appropriate action. The Automatic Deallocation facility is one such application: it subscribes to indictment events and can automatically deallocate indicted components. It also allows user-defined scripts to be executed at the time of automatic deallocation, as discussed in Section 3.2.1 and Section 3.2.2. This avoids the need for an administrator to subscribe separately to these indictment events unless very specific processing is needed.
Additionally, if Compaq Analyze indicts a CPU, an immediate service call typically will be made to Compaq Services to allow the expedient scheduling of repair and replacement if a service obligation is in effect. For more information about Compaq Analyze, see Section 5.2.
In addition to Compaq Analyze's features to contact your service representative, you also may set up pager or e-mail notification of component indictments based on EVM events using the EVM forwarding facility. See evmlogger(8) and evmlogger.conf(4) for more information.
If Compaq Analyze is unable to indict a specific component with certainty, but errors in a hardware subsystem are evident, there may be multiple indictments for a single failure source.
Every indictment event contains an urgency and a probability value. The probability event variable will have one of up to three associated probabilities: high (100), medium (50), or low (1). For more information, see Section 3.1.4. The urgency event variable identifies the seriousness of the problem.
If an indicted component is not placed off line within a 24-hour period and correctable errors continue to be detected, Compaq Analyze may issue another indictment, and another indictment event is posted if the urgency or probability of the indictment has changed.
The indicted state is persistent across system reboots and system initialization.
3.1.1 Indictment Process Overview
The process of component indictment follows this order:
A component such as a CPU or a memory location begins exhibiting correctable errors. These errors are written to the binary error log.
The fault analysis utility (Compaq Analyze) is notified automatically of each binary error log entry, reads the errors written to the binary error log, and performs an analysis of them. If the analysis concludes that the component potentially may have an unrecoverable error, the analysis program informs the operating system that the component should be considered for replacement, by issuing an indictment notification.
When the operating system receives an indictment, it sets the component's indictment attributes in the kernel and posts an indictment event using EVM. For an example indictment event, see Example 3-1.
The Automatic Deallocation facility listens to the indictment events and performs the appropriate deallocation dictated by the user-defined policy settings. This may include automatically putting off line a component or marking a memory page as bad.
The SysMan Station also listens to these events and updates its display with the state of the system components. For information on SysMan Station, see Section 4.7.
As a result of receiving an indictment notification from Compaq Analyze, the operating system posts an indictment event to the Event Management Subsystem (EVM). System Management applications subscribe to these indictment events. The SysMan Station (SMS) subscribes to indictment events so that it can change the indicted component's icon to show that the component is experiencing problems. An indictment event will cause a change in the status light for the System attention group in the SysMan Station Monitor View. For details on viewing indictment events, see Section 4.7. The Automatic Deallocation utility also subscribes to indictment events so that it can determine if automatic deallocation is required based on user-defined policy.
All indictment events have a prefix of sys.unix.hw.state_change.indicted. An example event for a CPU that has a hardware ID (HWID) of 59 and is indicted with a probability of high follows:
sys.unix.hw.state_change.indicted.high.cpu._hwid.59._hwcomponent.CPU4
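The structure of an indictment event name can be illustrated with a short sketch. The following Python fragment is not part of the operating system; it assumes, based only on the example names shown in this chapter, that name components beginning with an underscore (such as _hwid) are variable names followed by their values, and that the remaining components after the prefix are qualifiers such as the probability and category. The function name parse_indictment_event is hypothetical.

```python
# Sketch only: splits an indictment event name into qualifiers and variables.
# Assumption (inferred from the example names in this chapter): a component
# that starts with "_" is a variable name and the next component is its value.

PREFIX = "sys.unix.hw.state_change.indicted"


def parse_indictment_event(name):
    """Return (qualifiers, variables) for an indictment event name."""
    if name != PREFIX and not name.startswith(PREFIX + "."):
        raise ValueError("not an indictment event: " + name)
    tokens = [t for t in name[len(PREFIX):].split(".") if t]
    qualifiers = []
    variables = {}
    i = 0
    while i < len(tokens):
        if tokens[i].startswith("_") and i + 1 < len(tokens):
            # Variable name followed by its value, for example _hwid.59
            variables[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            # Plain qualifier, for example "high" or "cpu"
            qualifiers.append(tokens[i])
            i += 1
    return qualifiers, variables


event = "sys.unix.hw.state_change.indicted.high.cpu._hwid.59._hwcomponent.CPU4"
qualifiers, variables = parse_indictment_event(event)
print(qualifiers)  # ['high', 'cpu']
print(variables)   # {'_hwid': '59', '_hwcomponent': 'CPU4'}
```

Because every indictment event shares the same prefix, a filter on the prefix alone (as in the evmwatch and evmget commands in this section) matches CPU and memory indictments of any probability.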
Indictment events can be viewed at the command line using typical EVM methods. An example to view these events as they are posted would be:
# evmwatch -f '[name sys.unix.hw.state_change.indicted]' | evmshow
An example to view the events in the EVM event log would be:
# evmget -f '[name sys.unix.hw.state_change.indicted]' | evmshow
See EVM(5) or the System Administration manual for more information.
An example of a fully formatted event follows, using the command:
# evmget -f '[name sys.unix.hw.state_change.indicted]' | evmshow -D
Example 3-1 shows an indictment event for a CPU. It is indicted with a high probability. Example 3-2 shows an indictment event with a medium probability initiated concurrently with the CPU indictment.
Example 3-1: CPU Indictment Event
.
.
.
Formatted Message:
    Component State Change: Component "CPU0" has been indicted with a `high` probability of fault (HWID=2, FRUID=11529776898687173375)

Event Data Items:
    Event Name        : sys.unix.hw.state_change.indicted.high.cpu._hwid.2._hwcomponent.CPU0
    Cluster Event     : True
    Priority          : 500
    PID               : 524288
    PPID              : 0
    Event Id          : 957
    Member Id         : 1
    Timestamp         : 08-May-2001 15:47:08
    Host IP address   : 16.69.242.74
    Host Name         : wild-one
    Cluster Name      : wild-bunch
    Format            : Component State Change: Component "$_hwcomponent" has been indicted with a `high` probability of fault (HWID=$_hwid, FRUID=$module_id)
    Reference         : cat:evmexp.cat:800

Variable Items:
    current_state (STRING) = "indicted"
    category (STRING) = "cpu"
    urgency (INT32) = 8
    probability (INT32) = 100
    total_indictments (INT32) = 2
    description (STRING) = "Excessive Correctable Memory Istream/Dstream Errors reported by CPU0, CPU Slot0 in SoftQBB0 (HardQBB0)"
    initiator (STRING) = "Compaq Analyze"
    report_handle (STRING) = "mdDeCOR::gen5766"  [1]
    component_id (UINT64) = 18374966855287635968
    component_type (UINT8) = 9
    component_subtype (UINT8) = 35
    module_id (UINT64) = 11529776898687173375
    module_type (UINT8) = 21
    module_subtype (UINT8) = 35
    _hwid (UINT64) = 2
    _hwcomponent (STRING) = "CPU0"
    previous_probability (INT32) = 0
    previous_state (STRING) = "unknown"
[1] This report_handle value can be matched up with other events when there is more than one indictment per incident.
.
.
.
Formatted Message:
    Component State Change: Component "" has been indicted with a `medium` probability of fault (HWID=0, FRUID=11962686508084822783)

Event Data Items:
    Event Name        : sys.unix.hw.state_change.indicted.medium._hwid.0
    Cluster Event     : True
    Priority          : 400
    PID               : 546043
    PPID              : 524289
    Event Id          : 958
    Member Id         : 1
    Timestamp         : 08-May-2001 15:47:08
    Host IP address   : 16.69.242.74
    Host Name         : wild-one
    Cluster Name      : wild-bunch
    Format            : Component State Change: Component "$_hwcomponent" has been indicted with a `medium` probability of fault (HWID=$_hwid, FRUID=$module_id)
    Reference         : cat:evmexp.cat:800

Variable Items:
    current_state (STRING) = "indicted"
    urgency (INT32) = 8
    probability (INT32) = 50
    total_indictments (INT32) = 2
    description (STRING) = "Excessive Correctable Memory Istream/Dstream Errors reported by CPU0, CPU Slot0 in SoftQBB0 (HardQBB0)"
    initiator (STRING) = "Compaq Analyze"
    report_handle (STRING) = "mdDeCOR::gen5766"  [1]
    component_id (UINT64) = 18374966859431673855
    component_type (UINT8) = 7  [2]
    component_subtype (UINT8) = 38
    module_id (UINT64) = 11962686508084822783
    module_type (UINT8) = 21
    module_subtype (UINT8) = 38
    _hwcomponent (STRING) = ""
    _hwid (UINT64) = 0
    category (STRING) = ""
    previous_probability (INT32) = 0
    previous_state (STRING) = "unknown"
[1] This report_handle value can be matched up with other events when there is more than one indictment per incident.
[2] The value 7 for component_type indicates that the QBB backplane has been called out as a possible failure suspect.
Applications can be programmed to subscribe to indictment events. Example code showing how to subscribe to EVM events is supplied in /usr/examples/evm/evm_ex_olar_mon.c.
3.1.3 Indictment Status
Enter the following command to look for components that have a non-good/non-normal status, including indicted components:
# hwmgr -status component -ngood

                  STATUS    ACCESS              INDICT
 HWID:  HOSTNAME  SUMMARY   STATE    STATE      LEVEL  NAME
------------------------------------------------------------------------------
  113:  provolone critical  online   available  high   CPU10
  194:  provolone critical  online   available  high   CPU7
If there is no output, then all components are in a normal state.
If there is a value in the INDICT LEVEL column, the component has been indicted with that indictment probability. See Section 3.1.4 for more information.
Enter the following command, using the HWID value, to view more detailed indictment status including the urgency of the indictment:
# hwmgr -get attr -id HWID | grep indict
  indicted = 1
  indicted_probability = 100
  indicted_urgency = 5
The indicted_urgency attribute is a value from 1 to 10. The lower the value, the less urgent the removal of the component. A value of 10 indicates that you should remove the component as soon as possible.
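If you collect this output from several components, a small helper can turn the attribute lines into values that are easier to compare. The following Python sketch is illustrative only; it assumes the "name = value" layout shown in the sample output above, and the function name parse_indict_attributes is hypothetical.

```python
# Sketch only: parse the "indicted*" attribute lines printed by
# "hwmgr -get attr -id HWID | grep indict" into a dictionary of integers.
# Assumption: each line has the form "name = value", as in the sample output.


def parse_indict_attributes(output):
    """Return a dict mapping indictment attribute names to integer values."""
    attrs = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name.startswith("indicted"):
            attrs[name] = int(value.strip())
    return attrs


sample = """
  indicted = 1
  indicted_probability = 100
  indicted_urgency = 5
"""
attrs = parse_indict_attributes(sample)
print(attrs["indicted_urgency"])  # 5
```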
To view the indictment information using SysMan Station, see Section 4.7.
3.1.4 Indictment Probability and Urgency
Every indictment notification has an associated probability value and a corresponding indict level. The probability value indicates the likelihood that the component being indicted is at fault. The lower the probability value, the less likely that the component is at fault.
Compaq Analyze may indict more than one component if it cannot pinpoint which component is the source of a given error. The probability value is not a true percentage likelihood of future failure; it simply indicates the relative likelihood that a component is the source of the error.
A summary of the indictment probability values and the corresponding indict levels is shown in Table 3-1.
Table 3-1: Indictment Probability
Probability | Indict Level | Description |
100 | High | The most likely source of the error |
50 | Medium | The second most likely source of the error |
1 | Low | The least likely source of the error |
If Compaq Analyze indicts more than one component for the same error, the indictment events can be linked together by examining the report_handle variable within the indictment events. Multiple indictment events for the same error will contain the same report_handle value. Example 3-3 shows an example event.
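The linking of indictment events by report_handle can be sketched as follows. This Python fragment is an illustration, not a supplied tool: it assumes that the event variables have already been extracted into dictionaries (how you obtain them, for example by post-processing evmshow output, is not shown), and the function name group_by_report_handle is hypothetical. The sample values are taken from the examples in this chapter.

```python
# Sketch only: group indictment events that share a report_handle and order
# each group from most likely to least likely suspect (Table 3-1).
from collections import defaultdict

# Probability values and indict levels from Table 3-1
INDICT_LEVEL = {100: "high", 50: "medium", 1: "low"}


def group_by_report_handle(events):
    """Return {report_handle: [events sorted by descending probability]}."""
    groups = defaultdict(list)
    for ev in events:
        groups[ev["report_handle"]].append(ev)
    for evs in groups.values():
        evs.sort(key=lambda e: e["probability"], reverse=True)
    return dict(groups)


events = [
    {"report_handle": "mdDeCOR::gen5766", "probability": 50, "_hwid": 0},
    {"report_handle": "mdDeCOR::gen5766", "probability": 100, "_hwid": 2},
    {"report_handle": "mdDeCOR::gen7853", "probability": 100, "_hwid": 0},
]

for handle, evs in group_by_report_handle(events).items():
    suspects = [(e["_hwid"], INDICT_LEVEL[e["probability"]]) for e in evs]
    print(handle, suspects)
```

Sorting each group by probability puts the most likely source of the error first, which matches the order in which you would normally investigate the suspects.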
The urgency is expressed as an integer value between 1 and 10. The lower the value, the less urgent the removal of the component. An urgency of 10 means the indicted component should be replaced as soon as possible. An urgency of 1 means the indicted component more than likely will fail at some future time but operator intervention may not be required immediately. The indicted_urgency attribute can be checked by viewing the event or with the hwmgr command as discussed in Section 3.1.3.
Example 3-3: Memory Indictment Event
Formatted Message:
    Component State Change: Physical address 268435456 has been indicted

Event Data Items:
    Event Name        : sys.unix.hw.state_change.indicted.memory_page._physical_address.268435456._hwid.0
    Cluster Event     : True
    Priority          : 200
    PID               : 530236
    PPID              : 530233
    Event Id          : 947
    Member Id         : 1
    Timestamp         : 05-Mar-2001 15:34:24
    Host IP address   : 16.69.242.74
    Cluster IP address: 16.69.241.125
    Host Name         : provolone
    Cluster Name      : deli
    Format            : Component State Change: Physical address $_physical_address has been indicted
    Reference         : cat:evmexp.cat:800

Variable Items:
    current_state (STRING) = "indicted"
    urgency (INT32) = 8
    probability (INT32) = 100
    total_indictments (INT32) = 1
    description (STRING) = "Excessive Read Correctable Errors reported by Memory Module0 in SoftQBB0 (HardQBB0)"
    initiator (STRING) = "Compaq Analyze"
    report_handle (STRING) = "mdDeCOR::gen7853"  [1]
    _physical_address (UINT64) = 268435456
    _hwid (UINT64) = 0
    previous_state (STRING) = "unknown"
======================================================================
[1] This report_handle value can be matched up with other events when there is more than one indictment per incident.
There are two situations in which it is necessary to clear the indicted state of a component:
The failed component has been replaced with a working component.
Multiple components have been indicted and you need to clear the indicted state for components known to be functioning correctly.
When a component has been serviced as the result of a component indictment, you must clear the indicted state after you verify that the repaired or replaced component is operating properly. In the case of CPU indictments, the indictment variables are associated with the CPU slot, not the specific CPU module. Therefore, when a replacement CPU module is inserted in a slot previously associated with an indicted CPU, it still appears as indicted. After the replacement CPU module is powered on and verified as operating properly, you can clear the indicted state associated with the CPU slot.
Enter the following command to clear the indictment value:
# hwmgr -unindict [component] -id hardware-component-ID [-member cluster-member-name]
For example, do the following:
# hwmgr -unindict -id 58
3.2 Automatic Deallocation of Components
The Automatic Deallocation facility of the operating system subscribes to the EVM events for component indictment and can take action immediately on receipt of the notification. CPUs and memory pages that have been indicted can be taken off line automatically if desired. Automatic deallocation behavior is defined by the variables and attributes in the olar.config and olar.config.common files located in the /etc directory. The olar.config file is used to define system-specific policies, and the olar.config.common file is used to define cluster-wide policies. The olar.config file is a context-dependent symbolic link (CDSL) that is specific to the particular cluster member. Any settings in a system's olar.config file override cluster-wide policies in the olar.config.common file for that system only. The values of the variables defined in these files are case insensitive.
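The precedence between the two policy files can be sketched in a few lines of Python. This is an illustration of the override rule described above, not the facility's actual implementation. It assumes one name=value setting per line, as in the policy examples later in this chapter; the treatment of lines beginning with # as comments, and of an empty member-specific value as overriding the cluster-wide value, are assumptions. The function names are hypothetical.

```python
# Sketch only: member-specific settings (olar.config) take precedence over
# cluster-wide settings (olar.config.common) for that member.
# Assumptions: one "name=value" per line; lines starting with "#" are ignored.


def parse_policy(text):
    """Parse name=value lines into a dictionary."""
    policy = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        policy[name.strip()] = value.strip()
    return policy


def effective_policy(common_text, member_text):
    """Return cluster-wide settings overlaid with member-specific settings."""
    merged = parse_policy(common_text)
    merged.update(parse_policy(member_text))
    return merged


common = "cpu_deallocate_allow=FALSE\ncpu_deallocate_probability=high\n"
member = "cpu_deallocate_allow=TRUE\n"
print(effective_policy(common, member))
# {'cpu_deallocate_allow': 'TRUE', 'cpu_deallocate_probability': 'high'}
```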
When the Automatic Deallocation facility is invoked as a result of a component's indictment, it posts the results of its execution, including specific policy variable evaluation, as one or more EVM events. This provides an audit trail for this automated process and allows user applications to listen for (or subscribe to) these events if desired. See EVM(5) for general information on the Tru64 UNIX Event Management facility. All Automatic Deallocation facility EVM events have a prefix of sys.unix.sysman.auto_deallocate.
3.2.1 Automatic Deallocation Policy for CPUs
The following are policies that you can set for automatically handling CPU indictments:
Whether or not to deallocate a CPU when it is indicted
Time window in which to allow deallocation
Indictment probability to allow automatic deallocation
User-supplied script to run before deallocation
Whether to deallocate when processes are bound
Automatic deallocation should be disabled whenever the pfm or pcount device drivers are configured into the kernel, or vice versa. For more information on these drivers, see Section 4.4.
CPU Policy Variables
The following sections describe the policy variables defined for automatic deallocation of a CPU.
Whether or Not to Deallocate CPU When Indicted
The default action is for CPUs not to be deallocated upon component indictment. You can specify whether or not to automatically deallocate a CPU when it is indicted with the cpu_deallocate_allow variable. If this variable is left NULL or specified as FALSE, there is no automatic deallocation attempt of hardware components that belong to category CPU when a CPU is indicted. All other cpu_deallocate* policy variables will not be considered if this attribute is not set to TRUE. Allowed values are TRUE, FALSE.
Time to Perform Deallocation
You may decide to allow deallocation of a component only within a specified time window. The following settings limit the times at which deallocation is allowed.
There are two variables that can be set in order to specify a time window in which automatic deallocation of indicted CPUs can take place: cpu_deallocate_start_time and cpu_deallocate_end_time.
The variable cpu_deallocate_start_time denotes the time (in 24-hour format) beginning with and after which automatic deallocation is allowed. If no start value is specified, a value of 00:00 is assumed. Allowed values are 00:00 - 23:59.
The variable cpu_deallocate_end_time denotes the time (in 24-hour format) up to and including which automatic deallocation is allowed. This attribute is used in conjunction with cpu_deallocate_start_time. The start and end times are allowed to cross a day boundary. If no end value is specified, a value of 23:59 is assumed. Allowed values are 00:00 - 23:59.
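The time-window rule, including a window that crosses a day boundary, can be sketched as follows. This Python fragment illustrates the behavior described above under stated assumptions; it is not the facility's implementation. It assumes that times are compared at minute granularity, so that an end time of 04:59 includes the whole of that minute, and the function names are hypothetical.

```python
# Sketch only: decide whether a time of day falls inside the deallocation
# window defined by a start time and an end time (both inclusive).
# Assumption: comparison is at minute granularity.
from datetime import time


def parse_hhmm(value, default):
    """Parse "HH:MM"; return the default when no value is specified."""
    if not value:
        return default
    hours, minutes = value.split(":")
    return time(int(hours), int(minutes))


def in_window(now, start="", end=""):
    """Return True if 'now' (a datetime.time) is inside the window."""
    start_t = parse_hhmm(start, time(0, 0))     # default 00:00
    end_t = parse_hhmm(end, time(23, 59))       # default 23:59
    now_t = now.replace(second=0, microsecond=0)
    if start_t <= end_t:
        # Window within a single day, for example 00:00 - 23:59
        return start_t <= now_t <= end_t
    # Window crosses midnight, for example 19:00 - 04:59
    return now_t >= start_t or now_t <= end_t


print(in_window(time(23, 30), "19:00", "04:59"))  # True
print(in_window(time(12, 0), "19:00", "04:59"))   # False
```

When the start time is later than the end time, the window is treated as two intervals, one running from the start time to midnight and one from midnight to the end time.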
Indictment Probability for Automatic Deallocation
You can specify a single indicted probability or list of indicted probabilities for which automatic deallocation should occur for an indicted CPU using the variable cpu_deallocate_probability. Probabilities can be any combination of the three discrete values low, medium, and high. Probabilities must be enclosed in braces and multiple probabilities must be delimited by a comma (,). If no value is specified for this attribute, automatic deallocation will occur only for components indicted with a high probability. Allowed values are low, medium, and high.
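The probability setting can be interpreted as a set of allowed indict levels. The following Python sketch shows one way to read such a value; it is an illustration, not the facility's parser. It accepts both the braced list form and a single bare value, because the policy examples in this section show a single value written without braces. The numeric values come from Table 3-1, and the function names are hypothetical.

```python
# Sketch only: interpret a *_deallocate_probability setting as a set of
# indict levels and test an event's probability value against it.

# Indict levels and probability values from Table 3-1
LEVELS = {"low": 1, "medium": 50, "high": 100}


def parse_probabilities(value):
    """Return the set of level names named by the setting."""
    text = value.strip()
    if not text:
        return {"high"}                    # default when nothing is specified
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1]
    names = {p.strip().lower() for p in text.split(",") if p.strip()}
    unknown = names - set(LEVELS)
    if unknown:
        raise ValueError("unknown probability: " + ", ".join(sorted(unknown)))
    return names


def probability_allows(value, event_probability):
    """Return True if an event with this probability may be acted upon."""
    allowed = {LEVELS[name] for name in parse_probabilities(value)}
    return event_probability in allowed


print(probability_allows("{high,medium}", 50))  # True
print(probability_allows("", 50))               # False
```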
Script to Run Before Deallocation
You can specify a script to be executed before a deallocation is attempted using the cpu_deallocate_user_supplied_script variable. This variable contains the full path to a user-supplied script. If present, this script must be executable, be owned by root, and provide a zero return status to indicate successful execution. The script is passed two parameters that can be used in the script: the CPU name and the hardware ID (HWID) value. A non-zero return value of the script prevents the automatic deallocation from proceeding.
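The contract for a user-supplied script can be illustrated with a minimal sketch. The following is written in Python on the assumption that a Python interpreter is available on the system; the same logic could be written as a shell script. It assumes that the CPU name is passed as the first parameter and the HWID as the second, in the order listed above. The PROTECTED_CPUS set is a hypothetical site policy, not something the facility defines.

```python
# Sketch only: the decision logic of a user-supplied script.
# Returns 0 to let the automatic deallocation proceed, non-zero to prevent it.
# Assumption: argv[1] is the CPU name and argv[2] is the HWID.

# Hypothetical site policy: CPUs that must never be taken off line automatically
PROTECTED_CPUS = {"CPU0"}


def main(argv):
    """Return the exit status for the given argument vector."""
    if len(argv) != 3:
        return 2                      # unexpected arguments: veto deallocation
    cpu_name, hwid = argv[1], argv[2]
    if not hwid.isdigit():
        return 2                      # HWID is expected to be a decimal number
    if cpu_name in PROTECTED_CPUS:
        return 1                      # veto: this CPU is protected
    return 0                          # allow deallocation to proceed
```

In an installed script, the entry point would pass the real arguments and exit with the result, for example sys.exit(main(sys.argv)) after importing sys. Returning a non-zero status for unexpected input errs on the side of leaving the CPU on line.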
Whether to Deallocate Even When Processes Are Bound
The cpu_deallocate_if_bound_processes variable defines whether automatic deallocation of a CPU should occur if processes have been bound to run specifically on the indicted CPU or processes have been bound to the Resource Affinity Domain (RAD) that the CPU belongs to. If the value of this policy variable is TRUE, the CPU is removed automatically from the operating system (put off line) even in the following situations:
Processes are bound to run specifically on the CPU. Those bound processes will suspend until the CPU is brought back to the online state.
Processes are bound to the Resource Affinity Domain (RAD) that the CPU belongs to and this CPU is the last active CPU in the RAD. Those processes that are bound to the RAD will suspend execution until any of the CPUs that belong to the RAD are brought back to the online state.
Conversely, if this policy variable is not set to TRUE, an indicted CPU will not be deallocated if processes are bound to the CPU, or if processes are bound to the RAD that contains the indicted CPU and that CPU is the last active CPU in the RAD.
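The rule for bound processes can be summarized as a small decision function. This Python sketch restates the behavior described above; the function and parameter names are hypothetical, and the caller is assumed to already know whether processes are bound and whether the CPU is the last active CPU in its RAD.

```python
# Sketch only: the bound-process rule for automatic CPU deallocation.


def cpu_may_be_deallocated(if_bound_processes, bound_to_cpu,
                           bound_to_rad, last_active_cpu_in_rad):
    """Return True if the bound-process policy permits deallocation."""
    if if_bound_processes:
        # cpu_deallocate_if_bound_processes=TRUE: deallocate regardless
        return True
    if bound_to_cpu:
        # Processes are bound to run specifically on this CPU
        return False
    if bound_to_rad and last_active_cpu_in_rad:
        # RAD-bound processes would have no active CPU left to run on
        return False
    return True


# RAD-bound processes, but another CPU in the RAD is still active
print(cpu_may_be_deallocated(False, False, True, False))  # True
```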
See rad_bind_pid(3) and runon(1) for information on binding processes to CPUs or to RADs.
CPU Policy Examples
The following are examples of how the variables in olar.config or olar.config.common may be set to achieve the desired results.
Deallocate indicted CPUs immediately whenever indictments occur, even if processes are bound to a CPU or RAD:
cpu_deallocate_allow=TRUE
cpu_deallocate_start_time=00:00
cpu_deallocate_end_time=23:59
cpu_deallocate_probability=high
cpu_deallocate_user_supplied_script=
cpu_deallocate_if_bound_processes=TRUE
Deallocate indicted CPUs only after 7:00 p.m. and before 5:00 a.m. if indictment probability is high or medium. Do not deallocate if processes are bound to a CPU or RAD:
cpu_deallocate_allow=TRUE
cpu_deallocate_start_time=19:00
cpu_deallocate_end_time=04:59
cpu_deallocate_probability={high,medium}
cpu_deallocate_user_supplied_script=
cpu_deallocate_if_bound_processes=FALSE
Deallocate indicted CPUs immediately if the user-defined script /var/checkcpu.sh returns successfully. Do not deallocate if processes are bound to a CPU or RAD:
cpu_deallocate_allow=TRUE
cpu_deallocate_start_time=00:00
cpu_deallocate_end_time=23:59
cpu_deallocate_probability=high
cpu_deallocate_user_supplied_script=/var/checkcpu.sh
cpu_deallocate_if_bound_processes=FALSE
3.2.2 Automatic Deallocation Policy for Memory
Memory locations that have been noted by Compaq Analyze as having too many errors can be indicted. The memory page (as defined by the Page Frame Number) that contains an indicted memory location may be deallocated for use by the operating system.
Compaq Analyze can identify a physical memory location that is experiencing a high incidence of correctable single-bit errors, such that Compaq Analyze believes the error rates to be outside of normal operation. In this case, the physical location will be indicted, which may result in the memory page that contains that location being deallocated automatically.
If the memory page is not currently in use, then it will be mapped out (marked as bad) the next time an attempt is made to allocate the page. If the memory page is currently in use, it will be mapped out the next time the page is deallocated.
The default setting is for a memory page to be mapped out upon indictment. Actual deallocation will occur only when the memory page is freed or subsequent access is attempted.
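The relationship between an indicted physical address and the Page Frame Number (PFN) of the page that contains it can be shown with a one-line calculation. The following Python sketch assumes a page size of 8192 bytes (8 KB); that value is an assumption and is not stated in this chapter, so confirm the page size for your system. The function name pfn_for_address is hypothetical.

```python
# Sketch only: compute the Page Frame Number that contains a physical address.
# Assumption: 8 KB pages; adjust PAGE_SIZE if your system differs.

PAGE_SIZE = 8192


def pfn_for_address(physical_address, page_size=PAGE_SIZE):
    """Return the PFN of the page containing the given physical address."""
    return physical_address // page_size


# Physical address from the memory indictment event in Example 3-3
print(pfn_for_address(268435456))  # 32768
```

Every address within the same page yields the same PFN, which is why indicting a single memory location can result in the whole page being mapped out.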
The following are policies that you can set for handling memory indictments:
Whether to attempt deallocation
Time to perform deallocation
Probability to deallocate
User-defined script to run on deallocation attempt
Memory Policy Variables
The following sections describe the policy variables defined for automatic deallocation of a memory page (PFN).
Whether or Not to Attempt Deallocation
You can specify whether or not automatic deallocation is allowed when a memory page is indicted with the pfn_deallocate_allow variable. If this attribute is left NULL or specified as FALSE, then there will be no automatic deallocation of memory pages when a memory page is indicted. All other pfn_deallocate* policy variables will not be considered if this attribute does not have the value TRUE. Allowed values are TRUE, FALSE.
Time Window to Perform Deallocation
You can specify the time (in 24-hour format) beginning with and after which automatic deallocation is allowed with the pfn_deallocate_start_time variable. This attribute is used in conjunction with pfn_deallocate_end_time to denote a time window in which automatic deallocation of indicted memory pages can take place. If no start value is specified, a value of 00:00 is assumed. Allowed values are 00:00 - 23:59.
The pfn_deallocate_end_time variable denotes the time (in 24-hour format) up to and including which automatic deallocation is allowed. The start and end times are allowed to cross a day boundary. If no end value is specified, a value of 23:59 is assumed. Allowed values are 00:00 - 23:59.
Probability to Deallocate
You can specify the probability values for which automatic deallocation should occur for an indicted memory page with the pfn_deallocate_probability variable. The value can be any combination of the three discrete values low, medium, and high. Probabilities must be enclosed in braces and multiple probabilities must be delimited by a comma (,). If no value is specified for this attribute, automatic deallocation will occur only for memory pages indicted with a high probability. Allowed values are low, medium, and high.
User-defined Script to Run on Deallocation Attempt
You can specify a script that will run before a deallocation attempt using the pfn_deallocate_user_supplied_script variable. This variable defines a full path to a user-supplied script that will execute prior to automatic deallocation of an indicted Page Frame Number (PFN). If present, this script must be executable, be owned by root, and provide a zero return status to indicate successful execution. The script is started with one parameter that can be used in the script: the decimal value of the PFN. A non-zero return value of the script will prevent the automatic deallocation from proceeding.
Memory Deallocation Examples
The following are examples of how the variables in olar.config or olar.config.common may be set to achieve the desired results for memory deallocation.
Deallocate indicted memory pages immediately whenever indictment events occur:
pfn_deallocate_allow=TRUE
pfn_deallocate_start_time=00:00
pfn_deallocate_end_time=23:59
pfn_deallocate_probability=high
pfn_deallocate_user_supplied_script=
Deallocate indicted memory pages only after 7:00 p.m. and before 5:00 a.m.:
pfn_deallocate_allow=TRUE
pfn_deallocate_start_time=19:00
pfn_deallocate_end_time=04:59
pfn_deallocate_probability=high
pfn_deallocate_user_supplied_script=
Deallocate indicted memory pages immediately if the user-defined script /var/checkmem.sh returns successfully:
pfn_deallocate_allow=TRUE
pfn_deallocate_start_time=00:00
pfn_deallocate_end_time=23:59
pfn_deallocate_probability=high
pfn_deallocate_user_supplied_script=/var/checkmem.sh