3    Component Indictment and Automatic Deallocation

This chapter discusses the Component Indictment and Automatic Deallocation facilities. Component indictment identifies system components that have a likelihood of future (potentially serious) failure based on a history of correctable non-fatal errors. This is done by analyzing specific failure patterns either immediately or over an extended period of time. The Automatic Deallocation facility provides the ability to automatically take an indicted component out of service.

This chapter discusses the following topics: component indictment (Section 3.1) and automatic deallocation of components (Section 3.2).

3.1    Component Indictment

Component indictment is a proactive error notification from a fault-analysis utility. The component indictment process is intended to identify components that are incurring a high or abnormal incidence of correctable errors, so that these components can be removed or repaired before they cause a system panic.

The following are requirements for component indictment support:

The binlog error log must be maintained correctly for component indictment to function. The correct procedure for cleaning the binlog file is documented in binlogd(8). If you simply move the error log file and create a new file using touch without following the correct procedure, component indictment will not work as expected.

An external analysis program (currently Compaq Analyze) can notify the operating system when a component has encountered enough correctable errors to indicate that the component may fail soon. Upon receipt of the indictment notification, the operating system posts an indictment event using the Event Management (EVM) subsystem. Administrators should investigate the source of any reported indictments and replace the indicted components as appropriate based on collaborative discussion with their service provider. Compaq Analyze currently supports indictments for CPUs and memory locations. Compaq Analyze supports EV6 and later processors.

Because the indictment notification is posted to the Event Management (EVM) subsystem, any interested application may subscribe to indictment events and take appropriate action. The Automatic Deallocation facility is one such application: it subscribes to indictment events and can automatically deallocate indicted components. It can also execute user-defined scripts at the time of automatic deallocation, as discussed in Section 3.2.1 and Section 3.2.2. As a result, an administrator does not need to subscribe separately to these indictment events unless very specific processing is needed.

Additionally, if Compaq Analyze indicts a CPU, an immediate service call typically will be made to Compaq Services to allow the expedient scheduling of repair and replacement if a service obligation is in effect. For more information about Compaq Analyze, see Section 5.2. In addition to Compaq Analyze's features to contact your service representative, you also may set up pager or e-mail notification of component indictments based on EVM events using the EVM forwarding facility. See evmlogger(8) and evmlogger.conf(4) for more information.

If Compaq Analyze is unable to indict a specific component with certainty, but errors in a hardware subsystem are evident, there may be multiple indictments for a single failure source.

Every indictment event contains an urgency value and a probability value. The probability event variable has one of three values: high (100), medium (50), or low (1). For more information, see Section 3.1.4.

The urgency event variable identifies the seriousness of the problem.

If an indicted component is not placed off line within a 24-hour period and correctable errors continue to be detected, Compaq Analyze may issue another indictment. Another indictment event is posted if the urgency or probability of the indictment has changed.

The indicted state is persistent across system reboots and system initialization.

3.1.1    Indictment Process Overview

The process of component indictment follows this order:

  1. A component such as a CPU or a memory location begins exhibiting correctable errors. These errors are written to the binary error log.

  2. The fault analysis utility (Compaq Analyze) is notified automatically of each binary error log entry, reads the errors written to the binary error log, and performs an analysis of them. If the analysis concludes that the component potentially may have an unrecoverable error, the analysis program informs the operating system that the component should be considered for replacement, by issuing an indictment notification.

  3. When the operating system receives an indictment, it sets the component's indictment attributes in the kernel and posts an indictment event using EVM. For an example indictment event, see Example 3-1.

  4. The Automatic Deallocation facility listens to the indictment events and performs the appropriate deallocation dictated by the user-defined policy settings. This may include automatically putting off line a component or marking a memory page as bad.

    The SysMan Station also listens to these events and updates its display with the state of the system components. For information on SysMan Station, see Section 4.7.

3.1.2    Indictment Events

As a result of receiving an indictment notification from Compaq Analyze, the operating system posts an indictment event to the Event Management Subsystem (EVM). System Management applications subscribe to these indictment events. The SysMan Station (SMS) subscribes to indictment events so that it can change the indicted component's icon to show that the component is experiencing problems. An indictment event will cause a change in the status light for the System attention group in the SysMan Station Monitor View. For details on viewing indictment events, see Section 4.7. The Automatic Deallocation utility also subscribes to indictment events so that it can determine if automatic deallocation is required based on user-defined policy.

All indictment events have a prefix of sys.unix.hw.state_change.indicted. An example event for a CPU with a hardware ID (HWID) of 59 that is indicted with a high probability follows:

sys.unix.hw.state_change.indicted.high.cpu._hwid.59._hwcomponent.CPU4
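The event name itself carries several fields, so a script can extract them without formatting the whole event. The following is a minimal sketch in POSIX shell that assumes only the name layout shown in this chapter's examples. The function names event_hwid and event_level are hypothetical and are not part of EVM.

```sh
# Hypothetical helpers; only the event-name layout comes from the
# examples in this chapter.

# Print the hardware ID that follows "._hwid." in an indictment event name.
event_hwid() {
    printf '%s\n' "$1" | sed -n 's/.*\._hwid\.\([0-9][0-9]*\).*/\1/p'
}

# Print the indict level embedded in the name, or "unspecified" when the
# name carries none (as in the memory page event of Example 3-3).
event_level() {
    prefix=sys.unix.hw.state_change.indicted
    case "$1" in
        "$prefix".high|"$prefix".high.*)     echo high ;;
        "$prefix".medium|"$prefix".medium.*) echo medium ;;
        "$prefix".low|"$prefix".low.*)       echo low ;;
        *)                                   echo unspecified ;;
    esac
}
```

For the event name shown above, event_hwid prints 59 and event_level prints high.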
 

Indictment events can be viewed at the command line using typical EVM methods. An example to view these events as they are posted would be:

# evmwatch -f '[name sys.unix.hw.state_change.indicted]' | evmshow

An example to view the events in the EVM event log would be:

# evmget -f '[name sys.unix.hw.state_change.indicted]' | evmshow

See EVM(5) or the System Administration manual for more information.

An example of a fully formatted event follows, using the command:

# evmget -f '[name sys.unix.hw.state_change.indicted]' | evmshow -D

Example 3-1 shows an indictment event for a CPU. It is indicted with a high probability. Example 3-2 shows an indictment event with a medium probability initiated concurrently with the CPU indictment.

Example 3-1:  CPU Indictment Event


.
.
.
Formatted Message:
    Component State Change: Component "CPU0" has been indicted with a
    `high` probability of fault (HWID=2, FRUID=11529776898687173375)

Event Data Items:
    Event Name        : sys.unix.hw.state_change.indicted.high.cpu._hwid.2._hwcomponent.CPU0
    Cluster Event     : True
    Priority          : 500
    PID               : 524288
    PPID              : 0
    Event Id          : 957
    Member Id         : 1
    Timestamp         : 08-May-2001 15:47:08
    Host IP address   : 16.69.242.74
    Host Name         : wild-one
    Cluster Name      : wild-bunch
    Format            : Component State Change: Component "$_hwcomponent"
                        has been indicted with a `high` probability of
                        fault (HWID=$_hwid, FRUID=$module_id)
    Reference         : cat:evmexp.cat:800

Variable Items:
    current_state (STRING) = "indicted"
    category (STRING) = "cpu"
    urgency (INT32) = 8
    probability (INT32) = 100
    total_indictments (INT32) = 2
    description (STRING) = "Excessive Correctable Memory Istream/Dstream Errors reported by CPU0, CPU Slot0 in SoftQBB0 (HardQBB0)"
    initiator (STRING) = "Compaq Analyze"
    report_handle (STRING) = "mdDeCOR::gen5766"    [1]
    component_id (UINT64) = 18374966855287635968
    component_type (UINT8) = 9
    component_subtype (UINT8) = 35
    module_id (UINT64) = 11529776898687173375
    module_type (UINT8) = 21
    module_subtype (UINT8) = 35
    _hwid (UINT64) = 2
    _hwcomponent (STRING) = "CPU0"
    previous_probability (INT32) = 0
    previous_state (STRING) = "unknown"

  1. This report_handle value can be matched up with other events when there is more than one indictment per incident.

Example 3-2:  Indictment Event (medium probability)


.
.
.
Formatted Message:
    Component State Change: Component "" has been indicted with a
    `medium` probability of fault (HWID=0, FRUID=11962686508084822783)

Event Data Items:
    Event Name        : sys.unix.hw.state_change.indicted.medium._hwid.0
    Cluster Event     : True
    Priority          : 400
    PID               : 546043
    PPID              : 524289
    Event Id          : 958
    Member Id         : 1
    Timestamp         : 08-May-2001 15:47:08
    Host IP address   : 16.69.242.74
    Host Name         : wild-one
    Cluster Name      : wild-bunch
    Format            : Component State Change: Component "$_hwcomponent"
                        has been indicted with a `medium` probability of
                        fault (HWID=$_hwid, FRUID=$module_id)
    Reference         : cat:evmexp.cat:800

Variable Items:
    current_state (STRING) = "indicted"
    urgency (INT32) = 8
    probability (INT32) = 50
    total_indictments (INT32) = 2
    description (STRING) = "Excessive Correctable Memory Istream/Dstream Errors reported by CPU0, CPU Slot0 in SoftQBB0 (HardQBB0)"
    initiator (STRING) = "Compaq Analyze"
    report_handle (STRING) = "mdDeCOR::gen5766"    [1]
    component_id (UINT64) = 18374966859431673855
    component_type (UINT8) = 7    [2]
    component_subtype (UINT8) = 38
    module_id (UINT64) = 11962686508084822783
    module_type (UINT8) = 21
    module_subtype (UINT8) = 38
    _hwcomponent (STRING) = ""
    _hwid (UINT64) = 0
    category (STRING) = ""
    previous_probability (INT32) = 0
    previous_state (STRING) = "unknown"

  1. This report_handle value can be matched up with other events when there is more than one indictment per incident.

  2. The value 7 for component_type indicates that the QBB backplane has been called out as a possible failure suspect.

Applications can be programmed to subscribe to indictment events. Example code that subscribes to EVM events is supplied in /usr/examples/evm/evm_ex_olar_mon.c.

3.1.3    Indictment Status

Enter the following command to look for components that have a non-good/non-normal status, including indicted components:

# hwmgr -status component -ngood
 
                   STATUS   ACCESS                          INDICT
 HWID:  HOSTNAME   SUMMARY  STATE              STATE        LEVEL   NAME
------------------------------------------------------------------------------
  113:  provolone  critical online             available    high    CPU10
  194:  provolone  critical online             available    high    CPU7      
 

If there is no output, then all components are in a normal state. If there is a value in the INDICT LEVEL column, the component has been indicted with that indictment probability. See Section 3.1.4 for more information.

Enter the following command, using the HWID value, to view more detailed indictment status including the urgency of the indictment:

# hwmgr -get attr -id HWID | grep indict
  indicted = 1
  indicted_probability = 100
  indicted_urgency = 5

The indicted_urgency attribute is a value from 1 to 10. The lower the value, the less urgent the removal of the component. A value of 10 indicates that you should remove the component as soon as possible.
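The three attributes can be condensed into a single line for use in a report or a monitoring script. The following is a minimal sketch that reads "name = value" lines in the layout shown above and maps the probability to the indict level given in Table 3-1. The function name indict_summary and its output wording are hypothetical.

```sh
# Hypothetical helper: read "name = value" attribute lines on standard
# input and print a one-line summary of the indictment attributes.
indict_summary() {
    awk '
        $2 == "=" { attr[$1] = $3 }
        END {
            if (attr["indicted"] + 0 != 1) {
                print "not indicted"
                exit
            }
            p = attr["indicted_probability"] + 0
            if (p == 100)     level = "high"
            else if (p == 50) level = "medium"
            else if (p == 1)  level = "low"
            else              level = "unknown"
            printf "indicted: level=%s urgency=%s of 10\n", level, attr["indicted_urgency"]
        }
    '
}
```

A possible invocation, using a HWID from the earlier status output, is hwmgr -get attr -id 113 | indict_summary.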

To view the indictment information using SysMan Station, see Section 4.7.

3.1.4    Indictment Probability and Urgency

Every indictment notification has an associated probability value and a corresponding indict level. The probability value indicates the likelihood that the component being indicted is at fault. The lower the probability value, the less likely that the component is at fault.

Compaq Analyze may indict more than one component if it cannot pinpoint which component is the source of a given error. The probability value is not a true percentage likelihood of future failure; it simply indicates the relative likelihood of a potential failure.

A summary of the indictment probability values and the corresponding indict levels is shown in Table 3-1.

Table 3-1:  Indictment Probability

Probability   Indict Level   Description
------------------------------------------------------------------------
100           High           The most likely source of the error
50            Medium         The second most likely source of the error
1             Low            The least likely source of the error

When more than one component is indicted for the same error, the indictment events can be linked together by examining the report_handle variable within the indictment events. Multiple indictment events for the same error contain the same report_handle value. Example 3-3 shows an example event.
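Linking events by report_handle can be done with standard text tools. The following is a minimal sketch that assumes the fully formatted event text prints each variable on its own line, as in Example 3-3. The function name group_by_report_handle is hypothetical.

```sh
# Hypothetical helper: read fully formatted event text on standard input
# and print each report_handle value followed by the number of events
# that carry it.
group_by_report_handle() {
    sed -n 's/.*report_handle (STRING) = "\([^"]*\)".*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

A handle that is reported with a count greater than 1 identifies a set of indictments raised for the same incident. The input could come from the evmget and evmshow -D pipeline shown in Section 3.1.2.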

The urgency is expressed as an integer value between 1 and 10. The lower the value, the less urgent the removal of the component. An urgency of 10 means the indicted component should be replaced as soon as possible. An urgency of 1 means the indicted component more than likely will fail at some future time but operator intervention may not be required immediately. The indicted_urgency attribute can be checked by viewing the event or with the hwmgr command as discussed in Section 3.1.3.

Example 3-3:  Memory Indictment Event

Formatted Message:
    Component State Change: Physical address 268435456 has been indicted
 
Event Data Items:
    Event Name        : sys.unix.hw.state_change.indicted.memory_page._physical_address.268435456._hwid.0
    Cluster Event     : True
    Priority          : 200
    PID               : 530236
    PPID              : 530233
    Event Id          : 947
    Member Id         : 1
    Timestamp         : 05-Mar-2001 15:34:24
    Host IP address   : 16.69.242.74
    Cluster IP address: 16.69.241.125
    Host Name         : provolone
    Cluster Name      : deli
    Format            : Component State Change: Physical address 
                        $_physical_address has been indicted
    Reference         : cat:evmexp.cat:800
 
Variable Items:
    current_state (STRING) = "indicted"
    urgency (INT32) = 8
    probability (INT32) = 100
    total_indictments (INT32) = 1
    description (STRING) = "Excessive Read Correctable Errors reported by Memory Module0 in SoftQBB0 (HardQBB0)"
    initiator (STRING) = "Compaq Analyze"
    report_handle (STRING) = "mdDeCOR::gen7853"    [1]
    _physical_address (UINT64) = 268435456
    _hwid (UINT64) = 0
    previous_state (STRING) = "unknown"
 
======================================================================

  1. This report_handle value can be matched up with other events when there is more than one indictment per incident.

3.1.5    Clearing a Component Indictment

There are two situations when it is necessary to clear the indicted state of a component.

When a component has been serviced as the result of a previous component indictment, you must clear the indicted state after you verify that the repaired or replaced component is operating properly. For CPU indictments, the indictment variables are associated with the CPU slot, not the specific CPU module. Therefore, a replacement CPU module inserted in a slot previously associated with an indicted CPU still appears as indicted. After the replacement CPU module is powered on and verified as operating properly, you can clear the indicted state associated with the CPU slot.

Enter the following command to clear the indictment value:

# hwmgr -unindict [component] -id hardware-component-ID [-member cluster-member-name]

For example, do the following:

# hwmgr -unindict -id 58

3.2    Automatic Deallocation of Components

The Automatic Deallocation facility of the operating system subscribes to the EVM events for component indictment and can take action immediately on receipt of the notification. CPUs and memory pages that have been indicted can be taken off line automatically if wanted. Automatic deallocation behavior is defined by the variables and attributes set in the olar.config and olar.config.common files located in the /etc directory. The olar.config file defines system-specific policies and the olar.config.common file defines cluster-wide policies. The olar.config file is a context-dependent symbolic link (CDSL) that is specific to a particular cluster member. Any settings in a system's olar.config file override the cluster-wide policies in the olar.config.common file for that system only. The values of the variables defined in these files are case insensitive.
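The precedence between the two files can be pictured as a lookup that consults the member-specific file first. The following is a minimal sketch, not the facility's own logic. It assumes the name=value layout used in the examples later in this chapter, and it assumes that a variable left empty in the member-specific file does not override the cluster-wide value; that second point is a guess to verify on a real system. The function name policy_value is hypothetical.

```sh
# Hypothetical helper: print the effective value of policy variable $1,
# preferring the member-specific file ($2) over the cluster-wide file ($3).
# Assumption: a variable left empty in the member file does not override.
policy_value() {
    var=$1
    for file in "$2" "$3"; do
        [ -r "$file" ] || continue
        val=$(sed -n "s/^[[:space:]]*${var}=//p" "$file" | tail -n 1)
        if [ -n "$val" ]; then
            printf '%s\n' "$val"
            return 0
        fi
    done
    return 1
}
```

For example, policy_value cpu_deallocate_allow /etc/olar.config /etc/olar.config.common prints the setting that applies to the local member, or returns a non-zero status if neither file sets the variable.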

When the Automatic Deallocation facility is invoked as a result of a component's indictment, it will post the results of its execution, including specific policy variable evaluation, as one or more EVM events. This provides an audit trail for this automated process and allows user applications to listen for (or subscribe to) these events if wanted. See EVM(5) for general information on the Tru64 UNIX Event Management facility. All Automatic Deallocation Facility EVM events have a prefix of sys.unix.sysman.auto_deallocate.

3.2.1    Automatic Deallocation Policy for CPUs

The following are policies that you can set for automatically handling CPU indictments:

Automatic deallocation should be disabled whenever the pfm or pcount device drivers are configured into the kernel. Conversely, do not configure these drivers into the kernel while automatic deallocation is enabled. For more information on these drivers, see Section 4.4.

CPU Policy Variables

The following sections describe the policy variables defined for automatic deallocation of a CPU.

Whether or Not to Deallocate CPU When Indicted

The default action is for CPUs not to be deallocated upon component indictment.

You can specify whether or not to automatically deallocate a CPU when it is indicted with the cpu_deallocate_allow variable. If this variable is left NULL or specified as FALSE, there is no automatic deallocation attempt of hardware components that belong to category CPU when a CPU is indicted. All other cpu_deallocate* policy variables will not be considered if this attribute is not set to TRUE. Allowed values are TRUE, FALSE.

Time to Perform Deallocation

You may decide to allow deallocation of a component only within a specified time window. The following settings limit the times at which deallocation is allowed.

There are two variables that can be set in order to specify a time window in which automatic deallocation of indicted CPUs can take place: cpu_deallocate_start_time and cpu_deallocate_end_time. The variable cpu_deallocate_start_time denotes the time (in 24 hour format) beginning with and after which automatic deallocation is allowed. If no start value is specified, a value of 00:00 is assumed. Allowed values are 00:00 - 23:59.

The variable cpu_deallocate_end_time denotes the time (in 24-hour format) up to and including which automatic deallocation is allowed. This attribute is used in conjunction with cpu_deallocate_start_time. The start and end times are allowed to cross a day boundary. If no end value is specified, a value of 23:59 is assumed. Allowed values are 00:00 - 23:59.
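The window semantics, including a window that crosses midnight, can be illustrated with a short check. The following is a sketch of the behavior described above, not the facility's implementation. The function names to_minutes and in_window are hypothetical.

```sh
# Sketch of the window semantics described above; not the facility's code.

# Convert HH:MM to minutes since midnight. Leading zeros are stripped so
# that values such as 08 are not read as invalid octal numbers.
to_minutes() {
    hours=${1%%:*}
    mins=${1##*:}
    hours=${hours#0}
    mins=${mins#0}
    echo $(( ${hours:-0} * 60 + ${mins:-0} ))
}

# Succeed (exit status 0) when time $1 lies within the inclusive window
# from start time $2 to end time $3. A start later than the end means the
# window crosses midnight.
in_window() {
    now=$(to_minutes "$1")
    start=$(to_minutes "$2")
    end=$(to_minutes "$3")
    if [ "$start" -le "$end" ]; then
        [ "$now" -ge "$start" ] && [ "$now" -le "$end" ]
    else
        [ "$now" -ge "$start" ] || [ "$now" -le "$end" ]
    fi
}
```

With a start time of 19:00 and an end time of 04:59, in_window 23:30 19:00 04:59 succeeds and in_window 12:00 19:00 04:59 fails.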

Indictment Probability for Automatic Deallocation

You can specify a single indicted probability or list of indicted probabilities for which automatic deallocation should occur for an indicted CPU using the variable cpu_deallocate_probability. Probabilities can be any combination of the three discrete values low, medium and high. Probabilities must be enclosed in braces and multiple probabilities must be delimited by a comma (,). If no value is specified for this attribute, automatic deallocation will occur only for components indicted with a high probability. Allowed values are low, medium and high.
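The matching rule for the probability list can be sketched as follows. This is an illustration of the documented syntax (a single value, or a brace-enclosed, comma-delimited list, with high as the default), not the facility's parser. The function name probability_allowed is hypothetical.

```sh
# Hypothetical helper: succeed when indict level $1 is permitted by the
# policy value $2, which is either a single level or a brace-enclosed,
# comma-delimited list. An empty policy value defaults to "high".
# Matching ignores case because policy values are case insensitive.
probability_allowed() {
    level=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    list=$(printf '%s' "${2:-high}" | tr -d '{} ' | tr 'A-Z' 'a-z')
    case ",$list," in
        *",$level,"*) return 0 ;;
        *)            return 1 ;;
    esac
}
```

For example, probability_allowed medium '{high,medium}' succeeds, while probability_allowed medium '' fails because only high is allowed by default.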

Script to Run Before Deallocation

You can specify a script that is executed before a deallocation is attempted using the cpu_deallocate_user_supplied_script variable. This variable contains the full path to a user-supplied script. If present, this script must be executable, be owned by root, and provide a zero return status to indicate successful execution. The script is passed two parameters that can be used in the script: the CPU name and the hardware ID (HWID) value. A non-zero return value of the script prevents the automatic deallocation from proceeding.
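A user-supplied script can be very small. The following is a minimal sketch of what such a script might contain. The veto mechanism (a hold file whose presence blocks deallocation), the path /var/run/cpu_dealloc.hold, and the function name check_cpu are all hypothetical; only the two parameters and the meaning of the return status come from the description above.

```sh
#!/bin/sh
# Hypothetical pre-deallocation check. An administrator creates the hold
# file to veto automatic CPU deallocation, for example during a critical
# batch run. The path below is an assumption, not a system convention.
HOLD_FILE=${HOLD_FILE:-/var/run/cpu_dealloc.hold}

# $1 is the CPU name and $2 is the HWID, as passed by the facility.
# Status 0 lets deallocation proceed; any other status prevents it.
check_cpu() {
    cpu_name=$1
    hwid=$2
    if [ -z "$cpu_name" ] || [ -z "$hwid" ]; then
        echo "usage: checkcpu.sh CPU-name HWID" >&2
        return 2
    fi
    if [ -e "$HOLD_FILE" ]; then
        echo "deallocation of $cpu_name (HWID $hwid) vetoed: $HOLD_FILE exists" >&2
        return 1
    fi
    return 0
}

# In an installed script, the last line would be:  check_cpu "$@"
```

Because the exit status of a shell script is the status of its last command, ending the installed script with check_cpu "$@" passes the function's result back to the Automatic Deallocation facility.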

Whether to Deallocate Even When Processes Are Bound

The cpu_deallocate_if_bound_processes variable defines whether automatic deallocation of a CPU should occur if processes have been bound to run specifically on the indicted CPU, or if processes have been bound to the Resource Affinity Domain (RAD) that the CPU belongs to. If the value of this policy variable is TRUE, the CPU is removed automatically from the operating system (put off line) even under the following situations:

  1. Processes are bound to run specifically on the CPU. Those bound processes will suspend until the CPU is brought back to the online state.

  2. Processes are bound to the Resource Affinity Domain (RAD) that the CPU belongs to and this CPU is the last active CPU in the RAD. Those processes that are bound to the RAD will suspend execution until any of the CPUs that belong to the RAD are brought back to the online state.

Conversely, if this policy variable is not set to TRUE, an indicted CPU is not deallocated if processes are bound to the CPU, or if processes are bound to the RAD that contains the indicted CPU and the CPU is the last active CPU in that RAD.

See rad_bind_pid(3) and runon(1) for information on binding processes to CPUs or to RADs.

CPU Policy Examples

The following are examples of how the variables in olar.config or olar.config.common may be set to achieve the wanted results.

Deallocate indicted CPUs immediately whenever indictments occur, even if processes are bound to a CPU or RAD:

cpu_deallocate_allow=TRUE
cpu_deallocate_start_time=00:00
cpu_deallocate_end_time=23:59
cpu_deallocate_probability=high
cpu_deallocate_user_supplied_script=
cpu_deallocate_if_bound_processes=TRUE

Deallocate indicted CPUs only after 7:00 p.m. and before 5:00 a.m. if indictment probability is high or medium. Do not deallocate if processes are bound to a CPU or RAD:

cpu_deallocate_allow=TRUE
cpu_deallocate_start_time=19:00
cpu_deallocate_end_time=04:59
cpu_deallocate_probability={high,medium}
cpu_deallocate_user_supplied_script=
cpu_deallocate_if_bound_processes=FALSE

Deallocate indicted CPUs immediately if the user-defined script /var/checkcpu.sh returns successfully. Do not deallocate if processes are bound to a CPU or RAD:

cpu_deallocate_allow=TRUE
cpu_deallocate_start_time=00:00
cpu_deallocate_end_time=23:59
cpu_deallocate_probability=high
cpu_deallocate_user_supplied_script=/var/checkcpu.sh
cpu_deallocate_if_bound_processes=FALSE

3.2.2    Automatic Deallocation Policy for Memory

Memory locations that have been noted by Compaq Analyze as having too many errors can be indicted. The memory page (as defined by the Page Frame Number) that contains an indicted memory location may be deallocated for use by the operating system.

Compaq Analyze can identify a physical memory location that is experiencing a high incidence of correctable single-bit errors, such that Compaq Analyze believes the error rates to be outside of normal operation. In this case, the physical location is indicted, which may result in the memory page containing that location being deallocated automatically.
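The relationship between an indicted physical address and its page can be shown with integer division. The following is a minimal sketch in which the 8192-byte page size is an assumption (the 8 KB page size commonly used on Alpha systems); confirm the page size on the target system. The function name pfn_of is hypothetical.

```sh
# Hypothetical helper: print the page frame number that contains physical
# address $1. $2 is the page size in bytes; the default of 8192 is an
# assumption, not a value taken from this chapter.
pfn_of() {
    echo $(( $1 / ${2:-8192} ))
}
```

Under that assumption, the physical address 268435456 from Example 3-3 falls in page frame 32768.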

If the memory page is not currently in use, then it will be mapped out (marked as bad) the next time an attempt is made to allocate the page. If the memory page is currently in use, it will be mapped out the next time the page is deallocated.

The default setting is for a memory page to be mapped out upon indictment. Actual deallocation will occur only when the memory page is freed or subsequent access is attempted.

The following are policies that you can set for handling memory indictments:

Memory Policy Variables

The following sections describe the policy variables defined for automatic deallocation of a memory page (PFN).

Whether or Not to Attempt Deallocation

You can specify whether or not automatic deallocation is allowed when a memory page is indicted with the pfn_deallocate_allow variable. If this attribute is left NULL or specified as FALSE, then there will be no automatic deallocation of memory pages when a memory page is indicted. All other pfn_deallocate* policy variables will not be considered if this attribute does not have the value TRUE. Allowed values are TRUE, FALSE.

Time Window to Perform Deallocation

You can specify the time (in 24-hour format) beginning with and after which automatic deallocation is allowed with the pfn_deallocate_start_time variable. This attribute is used in conjunction with pfn_deallocate_end_time to denote a time window in which automatic deallocation of indicted memory pages can take place. If no start value is specified, a value of 00:00 is assumed. Allowed values are 00:00 - 23:59.

The pfn_deallocate_end_time variable denotes the time (in 24 hour format) up to and including which automatic deallocation is allowed. The start and end times are allowed to cross a day boundary. If no end value is specified, a value of 23:59 is assumed. Allowed values are 00:00 - 23:59.

Probability to Deallocate

You can specify the probability values for which automatic deallocation should occur for an indicted memory page with the pfn_deallocate_probability variable.

This value refers to probabilities and could be any combination of the three discrete values low, medium, and high. Probabilities must be enclosed in braces and multiple probabilities must be delimited by a comma (,). If no value is specified for this attribute, automatic deallocation will occur only for memory pages indicted with a high probability. Allowed values are low, medium and high.

User-defined Script to Run on Deallocation Attempt

You can specify a script that will run before a deallocation attempt using the pfn_deallocate_user_supplied_script variable.

This variable defines a full path to a user-supplied script that executes prior to automatic deallocation of an indicted Page Frame Number (PFN). If present, this script must be executable, be owned by root, and provide a zero return status to indicate successful execution. The script is passed one parameter that can be used in the script: the decimal value of the PFN. A non-zero return value of the script prevents the automatic deallocation from proceeding.

Memory Deallocation Examples

The following are examples of how the variables in olar.config or olar.config.common may be set to achieve the wanted results for memory deallocation.

Deallocate indicted memory pages immediately whenever indictment events occur:

pfn_deallocate_allow=TRUE
pfn_deallocate_start_time=00:00
pfn_deallocate_end_time=23:59
pfn_deallocate_probability=high
pfn_deallocate_user_supplied_script=
 
 

Deallocate indicted memory pages only after 7:00 p.m. and before 5:00 a.m.:

pfn_deallocate_allow=TRUE
pfn_deallocate_start_time=19:00
pfn_deallocate_end_time=04:59
pfn_deallocate_probability=high
pfn_deallocate_user_supplied_script=
 
 

Deallocate indicted memory pages immediately if the user-defined script /var/checkmem.sh returns successfully:

pfn_deallocate_allow=TRUE
pfn_deallocate_start_time=00:00
pfn_deallocate_end_time=23:59
pfn_deallocate_probability=high
pfn_deallocate_user_supplied_script=/var/checkmem.sh