4    Online Addition and Removal

This chapter discusses the features of the operating system that support online addition and removal of components. The topics are covered in the sections that follow.

4.1    Overview

This chapter describes how to add or replace components in a system while keeping an operating system instance and associated applications running.

Online Addition and Removal (OLAR) management is provided by the hwmgr command and the SysMan suite of System Management applications. Complete management is available from the operating system, eliminating the need to use lower level hardware monitor interfaces such as the System Reference Monitor (SRM) or System Control Monitor (SCM).

4.2    Reasons for Component OLAR

Online Addition and Removal (OLAR) management allows for the addition or removal of hardware while the operating system and applications continue to run. This increases system uptime and availability during both scheduled and unscheduled maintenance. OLAR is supported for CPUs on some symmetric multiprocessing (SMP) platforms. Currently, the platforms that support CPU OLAR are the AlphaServer GS160 and GS320 series systems. Other SMP systems do not support physically adding or removing CPUs while the system is running, but do support placing a CPU in an offline state if it is not functioning properly.

The need for component OLAR may arise for one of the following reasons:

Computation Capacity Expansion

A system requires additional computational resource capacity. For example, a GS320 may have increased processing requirements. If the system has available CPU slots, the CPU capacity can be expanded by adding additional CPU modules to the system to improve system performance.

Maintenance Upgrade

A system manager wants to upgrade specific system components to the latest model or revision. As an example, a GS160 with earlier model CPU modules can be upgraded to later model CPUs with higher clock rates, while the operating system continues to run. In this example for GS series systems, all CPUs in a Quad Building Block (QBB) must be running the same model and speed CPU.

Failed Component Replacement

A system component is indicating a high incidence of correctable errors and the system manager wants to perform a proactive replacement of the failing component before it results in a hard failure.

4.3    Getting State Information

You can get information about components and their states using the hwmgr command, SysMan Station, or by viewing particular events. The following sections discuss information pertaining to component states during an Online Addition and Removal operation.

4.3.1    Component States and Status

The following attributes, each described in this section, indicate how a component is currently functioning: access state, state, status, and indicted.

These attributes can be displayed by using the hwmgr -status command or by viewing the properties of a component with the SysMan Station. For information on how to view the properties using SysMan Station, see Section 4.7.

Access State Attribute

The access state attribute of a component is an indication of the accessibility of a component to the operating system, as determined by a system administrator. The access state of a component is either on line or off line.

An online component is used actively by the operating system. An offline component is not used by the operating system. For example, offline CPUs will not have processes scheduled for execution by them.

State Attribute

The state attribute, in general, is an indication of the component's operational capabilities, as indicated by the controlling software for a given component. The following are the possible states of the components:

Available

The component is fully functional and ready for use although it might not be currently on line.

Unavailable

The component is unavailable for use.

Off

The component is turned off.

Unknown

The controlling software is unable to determine the status of the component. Use other hwmgr command options and diagnostic or service tools to determine its status.

Status Attribute

The status attribute is a summary of the access state, state, and indicted attributes that gives a quick indication of the component status. The component status is one of the following:

Normal/Good

The component is behaving normally.

Inactive

The status of the component is inactive because it is a component that is managed using the Compaq Capacity on Demand (CCoD) feature (typically a CPU). The component is physically present but off line and therefore available for spare capacity.

Warning

This status warns you that a component is not in a normal state but may return to a normal state after a system reboot. For example, when you take a CPU off line using the -offline option with the -nosave option, its status changes to warning. It is considered a warning status because the CPU will automatically come back on line and be available after a system reboot or initialization.

Critical

This status warns you that a component is not in a normal state and will not return automatically to a normal state. You must intervene to bring the component back to a normal state (on line and available). For example, when you take a CPU off line without the -nosave option, its offline state persists across a reboot and its status changes to critical. You can bring the CPU back on line only by manual intervention. Other examples of components that cause a critical status are components that are indicted (through the Component Indictment facility) and components with power off.

Indicted Attribute

The indicted attribute is an indication of whether a component has been indicted by a fault analysis utility. If a component has been indicted, the additional attributes indicted_probability and indicted_urgency are also set. For more information on these attributes, see Section 3.1.4.
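The way the status summary follows from the other attributes can be sketched as a small shell function. This is only an illustration of the rules described in this section, not the logic that hwmgr itself uses. The function name and the PERSISTENT and CCOD arguments are hypothetical, and combinations that the text does not cover are reported as unspecified.

```shell
# derive_status ACCESS STATE INDICTED PERSISTENT CCOD
#   ACCESS      online | offline
#   STATE       available | unavailable | off | unknown
#   INDICTED    yes | no
#   PERSISTENT  yes | no  (hypothetical: offline state survives a reboot)
#   CCOD        yes | no  (hypothetical: component is managed by CCoD)
derive_status() {
    access=$1 state=$2 indicted=$3 persistent=$4 ccod=$5
    if [ "$indicted" = yes ] || [ "$state" = off ]; then
        # Indicted components and components with power off are critical.
        echo critical
    elif [ "$access" = offline ] && [ "$ccod" = yes ]; then
        # Present but off line as spare capacity.
        echo inactive
    elif [ "$access" = offline ] && [ "$persistent" = yes ]; then
        # Stays off line across reboots until an administrator intervenes.
        echo critical
    elif [ "$access" = offline ]; then
        # Taken off line with -nosave: returns on line at the next reboot.
        echo warning
    elif [ "$state" = available ]; then
        echo normal
    else
        # Not described in the text above.
        echo unspecified
    fi
}
```

For example, derive_status offline available no no no prints warning, which corresponds to a CPU taken off line with the -nosave option.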

4.3.2    OLAR Events

OLAR operations will cause a change in a component's state. All changes in a component state will result in the generation of an EVM event. EVM events that track changes in the state of a component begin with sys.unix.hw.state_change. Events that result from OLAR operations are hardware state change events. For a description of each type of state change event that can occur, enter the following command:

# evmwatch -i -f  '[name sys.unix.hw.state_change]'  | evmshow -t "@name" -x | more

See EVM(5) or the System Administration manual for more information.

Applications can be programmed to subscribe to OLAR events. An example showing how to write code that subscribes to OLAR-specific EVM events is supplied in /usr/examples/evm/evm_ex_olar_mon.c. You must have the OSFEXAMPLESnnn subset installed to get the /usr/examples directory.
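In a script, the same name prefix can be used to pick hardware state change events out of a list of event names. The following is a minimal sketch that works on plain text only and does not call any EVM command. The function names are hypothetical, and any event name below the sys.unix.hw.state_change prefix that you pass to it is your own input, not a name defined here.

```shell
# is_state_change_event NAME
# Succeeds if NAME is sys.unix.hw.state_change or lies below that prefix.
is_state_change_event() {
    case $1 in
        sys.unix.hw.state_change|sys.unix.hw.state_change.*) return 0 ;;
        *) return 1 ;;
    esac
}

# filter_state_change
# Reads one event name per line on standard input and prints only the
# hardware state change events.
filter_state_change() {
    while IFS= read -r name; do
        if is_state_change_event "$name"; then
            printf '%s\n' "$name"
        fi
    done
}
```

The case pattern matches the prefix followed by a period, so an unrelated name that merely begins with the same characters is not selected.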

4.3.3    Locating a CPU

If you are removing a CPU in an AlphaServer GS160/GS320, you may need to locate the Quad Building Block (QBB) number for each installed CPU. If you are using SysMan Station, the Hardware View window shows the hierarchy of the hardware graphically. If you are using a command line interface, enter the following command to get information on the hardware hierarchy:

# hwmgr -view hierarchy

 HWID:   hardware hierarchy     (!)warning (X)critical (-)inactive (see -status)
 -------------------------------------------------------------------------------
    1:   platform Compaq AlphaServer GS160 6/731
    9:     bus wfqbb0
    10:       connection wfqbb0slot0
    11:         bus wfiop0
    12:           connection wfiop0slot0
    13:             bus pci0
    14:               connection pci0slot1
 
     
.
.
.
    57:     cpu qbb-0 CPU0    [1]
    58:     cpu qbb-0 CPU2

  1. This line shows the hardware ID (57), the component type (cpu), the hard Quad Building Block (QBB) number where the CPU is located (qbb-0), and the CPU name (CPU0). Note that the hard QBB number does not change in a partitioned system.

To quickly identify which QBB a CPU is associated with, using the CPU hardware ID, enter the following command:

# hwmgr -view hier -id HWID

  HWID:   hardware hierarchy
 -------------------------------------------------------------------------------
    58:   cpu qbb-0 CPU2
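If you need the QBB number in a script, it can be extracted from the hierarchy output. The following sketch assumes that CPU lines have the form shown in the examples above (hardware ID, the word cpu, qbb-N, CPU name). The function name qbb_of_cpu is hypothetical.

```shell
# qbb_of_cpu HWID
# Reads "hwmgr -view hierarchy" output on standard input and prints the
# hard QBB number of the CPU with the given hardware ID.
# Returns 1 if no matching CPU line is found.
qbb_of_cpu() {
    awk -v id="$1:" '
        $1 == id && $2 == "cpu" {
            sub(/^qbb-/, "", $3)    # "qbb-0" becomes "0"
            print $3
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    '
}
```

On a system that has the hwmgr command, the function could be used as hwmgr -view hierarchy | qbb_of_cpu 58.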

4.4    Cautions Before Performing CPU OLAR Operations

The following cautions must be considered before adding or removing CPUs:

4.5    Component Removal Procedure

The process of removing a component consists of the following steps:

  1. Use one of the appropriate management applications (the hwmgr command or SysMan) to prepare for the removal of a component, such as verifying the status of the CPU and ensuring that no user processes are currently bound to it. Any processes that are not specifically bound to a CPU are migrated automatically to other running CPUs when the CPU is put off line. Also, if the CPU is the last running CPU in a RAD that has processes bound to it, do not put the CPU off line unless it is acceptable to suspend the processes that are bound to that RAD.

  2. Use one of the appropriate management applications to take the component off line and remove power. See Section 4.5.1 for more information. When power is removed, the LED on the CPU module will illuminate yellow, indicating that the CPU module power is off and it is safe to remove.

  3. Remove the component physically. The operating system automatically recognizes that the CPU module physically has been removed. There is no need to perform a scan operation to update the hardware configuration.

Before a CPU can be removed physically from the system, it must be placed off line and the power turned off, using any of the supported management applications described in the following sections. Processes queued for execution on a CPU that is to be placed off line simply are migrated to run-queues of other running (online) processors.

If another system administrator is actively managing the system's processors, you will get a warning message telling you to perform the operation at another time.
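The order of the hwmgr commands for a removal can be written down as a dry-run plan. The following sketch only prints the commands described in Sections 4.5.4 and 4.5.5 and does not execute them. The function name removal_plan is hypothetical. The plan does not cover the check for bound processes or the physical removal itself.

```shell
# removal_plan HWID
# Prints, without executing them, the hwmgr commands used to take a CPU
# off line and remove its power before the module is physically removed.
removal_plan() {
    id=$1
    echo "/usr/sbin/hwmgr -status component"                          # confirm the CPU is on line
    echo "/usr/sbin/hwmgr -offline -id $id"                           # take the CPU off line
    echo "/usr/sbin/hwmgr -get attribute -a capabilities -id $id"     # 1 means power can be removed
    echo "/usr/sbin/hwmgr -power off -id $id"                         # remove power
}
```

Printing the plan first gives you the chance to review each step, and the output of each command, before you enter the next one.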

4.5.1    Taking CPUs Off Line, Removing Power, or Both

You may want to take a CPU off line if it is suspected of potentially failing. Compaq Analyze proactively may indicate that a CPU is suspected of a potential failure by notifying the Component Indictment facility, which will create EVM events about the indictment. For information on component indictment, see Section 3.1.

By default, a CPU that is placed off line remains off line across reboots. Optionally, you can specify that the CPU be brought back on line at the next reboot.

On AlphaServer GS80, GS160, or GS320 systems or ES45 systems, any CPU can be placed off line as long as it is not the last CPU in the primary processor set. On other older SMP systems, any CPU except the primary CPU can be placed off line. If your system supports OLAR of CPUs, you also can turn the power off to allow for removal of the CPU.

If OLAR is not supported on your SMP system, the CPU may be placed off line, but may not have power turned off by the operating system. This will stop scheduling of processes on this processor and potentially avoid a kernel panic if the processor is experiencing uncorrectable errors.

If your SMP system does not support OLAR, the Manage CPUs application will not offer you the opportunity to remove power from the CPU. An attempt to use the hwmgr command to remove power from a CPU in such a system will not succeed, and an error message will be displayed.

4.5.2    Taking CPUs Off Line, Removing Power, or Both Using SysMan Menu

For instructions on how to start the SysMan Menu, see the System Administration manual.

To take a CPU off line using the SysMan Menu, do the following:

  1. Select Hardware.

  2. Select Manage CPUs in the SysMan Menu application. Only one system administrator can put a CPU off line at any given time.

  3. Select the CPU or the CPUs that you want to put off line. For instructions on selecting multiple CPUs, see the online help.

  4. Select Modify ....

  5. If the system does not support OLAR, select Off line. If the system supports OLAR, select one of the following:

    By default, the CPU remains in the state you chose for subsequent system reboots. To have the CPU automatically go on line at the next system reboot, do the following:

    1. Select Offline Options... in the Manage CPUs Modify dialog box.

    2. Select the Bring selected CPUs on line at system reboot checkbox.

    3. Select OK in the Offline Options dialog box.

  6. Select OK to complete the offline operation.

See the Manage CPUs Online Help for additional information on performing offline operations.

4.5.3    Taking CPUs Off Line, Removing Power, or Both Using SysMan Station

For instructions on how to start the SysMan Station, see the System Administration manual.

To take a CPU off line using the SysMan Station, do the following:

  1. Select the Hardware View window.

  2. Select a CPU icon and press MB1.

  3. Select Manage CPUs from the Tools menu. Follow the steps in Section 4.5.2.

See the Manage CPUs Online Help for additional information on performing offline operations.

4.5.4    Taking a CPU Off Line Using the hwmgr Command

To take a CPU off line using the hwmgr command, do the following:

  1. Verify the status of the component. The access state of a CPU must be on line. Note the HWID number and name of the CPUs for use in later steps. Enter the following command:

    # hwmgr -status component

                       STATUS   ACCESS                          INDICT
     HWID:  HOSTNAME   SUMMARY  STATE              STATE        LEVEL   NAME
    ------------------------------------------------------------------------------
       
    .
    .
    .
       57:  wild-one            online             available            CPU0
       58:  wild-one   critical offline            available            CPU2
       59:  wild-one            online             available            CPU4
       60:  wild-one            online             available            CPU6

  2. Enter either of the following commands to put the component off line:

    # /usr/sbin/hwmgr -offline -name cpu-name

    # /usr/sbin/hwmgr -offline -id HWID

    If the component cannot be put off line because processes are bound to the CPU, or because the processor is the last processor in the primary processor set, the command notifies you and suggests using the -force option after you assess the impact of putting the CPU off line.

    You can also specify the [-nosave] or [-force] option. The [-nosave] option specifies that the CPU is brought back on line at the next system reboot. The [-force] option forces a CPU off line even if processes were identified as being bound to that CPU.

    If you now want to remove power from the component, see Section 4.5.5.
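The offline command and its options can be assembled by a small helper. This is a sketch only. The function name offline_cmd is hypothetical, and placing the options between -offline and -id is an assumption, so check the hwmgr reference page for the exact syntax. The helper prints the command line rather than running it.

```shell
# offline_cmd [-nosave] [-force] HWID
# Prints the hwmgr command line for taking a CPU off line.
# Returns 2 if no HWID is given or an unknown option is passed.
offline_cmd() {
    [ $# -ge 1 ] || return 2
    opts=""
    while [ $# -gt 1 ]; do
        case $1 in
            -nosave|-force) opts="$opts $1" ;;
            *) echo "offline_cmd: unknown option $1" >&2; return 2 ;;
        esac
        shift
    done
    # Option placement before -id is an assumption, not confirmed here.
    echo "/usr/sbin/hwmgr -offline$opts -id $1"
}
```

For example, offline_cmd -nosave 58 prints a command line that takes the CPU with hardware ID 58 off line until the next reboot.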

4.5.5    Removing Power from CPUs Using the hwmgr Command

To remove power to an offline CPU using the hwmgr command, do the following:

  1. Verify the status of the component. The access state of a CPU must be off line in order to remove power from it. If you need to take a CPU off line, see Section 4.5.4. Enter the following command to view the component status.

    # hwmgr -status component

                       STATUS   ACCESS                          INDICT
     HWID:  HOSTNAME   SUMMARY  STATE              STATE        LEVEL   NAME
    ------------------------------------------------------------------------------
    
    .
    .
    .
       57:  wild-one            online             available            CPU0
       58:  wild-one   critical offline            available            CPU2
       59:  wild-one            online             available            CPU4
       60:  wild-one            online             available            CPU6

    To limit the view to only components that have a status summary value other than good, enter the command:

    # hwmgr -status component -ngood

                       STATUS   ACCESS                          INDICT
     HWID:  HOSTNAME   SUMMARY  STATE              STATE        LEVEL   NAME
    ------------------------------------------------------------------------------
    
    .
    .
    .
       58:  wild-one   critical offline            available            CPU2

  2. Use the CPU's HWID value to verify whether the CPU can have its power turned off. Enter the following command:

    # hwmgr -get attribute -a capabilities -id 58

    58:
      capabilities = 1
    

    If the capabilities value is 1, the CPU is capable of having its power turned off. If the capabilities value is 0, the CPU cannot have its power turned off.

  3. Enter either of the following commands to remove power to the component:

    # /usr/sbin/hwmgr -power off -name cpu-name

    # /usr/sbin/hwmgr -power off -id HWID
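The capabilities check in step 2 can be automated by reading the command output. The following sketch assumes the output format shown in step 2 (a line of the form capabilities = N). The function name can_power_off is hypothetical.

```shell
# can_power_off
# Reads the output of "hwmgr -get attribute -a capabilities -id HWID" on
# standard input. Succeeds only if it reports a capabilities value of 1.
can_power_off() {
    awk '
        $1 == "capabilities" && $2 == "=" { v = $3 }
        END { exit (v == 1 ? 0 : 1) }
    '
}
```

A script could then guard the power-off step, for example by running the -power off command only if hwmgr -get attribute -a capabilities -id 58 | can_power_off succeeds.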

4.6    Component Addition Procedure

The process of inserting a CPU module component consists of the following steps:

  1. Add the component physically. Select an available CPU slot in one of the configured Quad Building Blocks (QBB). If there are available slots in several QBBs, it is typically best to equally distribute the number of CPUs among the configured QBBs.

    Insert the CPU module into the CPU slot. Ensure that you align the color-coded decal on the CPU module with the color-coded decal on the CPU slot. The LED on the CPU module will illuminate yellow, indicating that the CPU module's power is off. Note that the CPU will be recognized automatically by the operating system, even though it does not yet have power applied. There is no need to perform a scan operation for the operating system to identify the CPU module.

    Warning

    You should not add a component without referring to the component documentation, which contains important safety information and information on preventing static discharges that can destroy the component.

  2. Use one of the appropriate management applications (the hwmgr command or SysMan) to apply power to the component and place it on line. See Section 4.6.1 for more information. When power is applied to the CPU, it will undergo a short self-test (7-10 seconds), after which the LED will illuminate green, indicating the CPU module power is on and has passed its self-test. When the CPU is placed on line, the operating system will begin automatically to schedule and execute tasks on this CPU.

  3. Use one of the appropriate management applications to verify that the component is functioning properly. If the CPU is a replacement for one that has been indicted, be sure to clear the indictment after the component has been verified as functioning properly, as discussed in Section 3.1.5.

Newly inserted CPUs are recognized automatically by the operating system, even before their power is on. They cannot start scheduling and executing processes until the CPU has power on and is placed on line.
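To help distribute CPUs evenly, as recommended in step 1, you can count the CPUs in each QBB from the hardware hierarchy output. The following sketch assumes CPU lines of the form shown in Section 4.3.3. The function name least_loaded_qbb is hypothetical. It counts only QBBs that already contain at least one CPU and does not check whether a slot is free, so treat the result as a hint.

```shell
# least_loaded_qbb
# Reads "hwmgr -view hierarchy" output on standard input and prints the
# QBB that currently holds the fewest CPUs, followed by its CPU count.
# Returns 1 if the input contains no CPU lines.
least_loaded_qbb() {
    awk '
        $2 == "cpu" && $3 ~ /^qbb-[0-9]+$/ { count[$3]++ }
        END {
            best = ""
            for (q in count)
                if (best == "" || count[q] < count[best] ||
                    (count[q] == count[best] && q < best))
                    best = q        # ties go to the name that sorts first
            if (best == "") exit 1
            print best, count[best]
        }
    '
}
```

On a system that has the hwmgr command, the function could be used as hwmgr -view hierarchy | least_loaded_qbb.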

If another system administrator is actively managing the system's processors, you will get an error message telling you to perform the operation at another time.

Warning

You must follow all safety procedures as documented in the hardware documentation accompanying the component. You also should consult the component replacement procedures in the service manual for your system. Failure to follow safety procedures could result in personal injury or could damage the component.

4.6.1    Putting CPUs On Line, Applying Power, or Both

You may place a CPU on line if its access state is off line and its state is available. This typically is done when a CPU is replaced or newly installed.

4.6.2    Putting CPUs On Line, Applying Power, or Both Using SysMan Menu

For instructions on how to start the SysMan Menu, see the System Administration manual.

To put a CPU on line using the SysMan Menu, do the following:

  1. Select Hardware.

  2. Select Manage CPUs in the SysMan Menu application.

  3. Select the CPU or the CPUs that you want to put on line.

  4. Select Modify ....

  5. If the system does not support OLAR, select On line. If the system supports OLAR, select one of the following:

    By default, the CPU will remain in the state you chose for subsequent system reboots.

  6. Select OK to complete the operation.

See the Manage CPUs Online Help for additional information on performing online operations.

4.6.3    Putting CPUs On Line, Applying Power, or Both Using SysMan Station

For instructions on how to start the SysMan Station, see the System Administration manual.

To put a CPU on line using the SysMan Station, do the following:

  1. Select the Hardware View window.

  2. Select a CPU icon and press MB1.

  3. Select Manage CPUs from the Tools menu. Only one system administrator can place a CPU on line at any given time.

Follow the steps in Section 4.6.2.

See the Manage CPUs Online Help for additional information on performing online operations.

4.6.4    Applying Power to CPUs with the hwmgr Command

To apply power to a CPU that is off line and has power off using the hwmgr command, do the following:

  1. Verify the status of the component. Note the ID number of the CPUs in the first column of the output. Enter the following command to view the status of the components:

    # hwmgr -status component

                       STATUS   ACCESS                          INDICT
     HWID:  HOSTNAME   SUMMARY  STATE              STATE        LEVEL   NAME
    ------------------------------------------------------------------------------
    
    .
    .
    .
       57:  wild-one            online             available            CPU0
       58:  wild-one   critical offline            available            CPU2
       59:  wild-one            online             available            CPU4
       60:  wild-one            online             available            CPU6

    To limit the view to only components that have a status summary value other than good, enter the following command:

    # hwmgr -status component -ngood

                       STATUS   ACCESS                          INDICT
     HWID:  HOSTNAME   SUMMARY  STATE              STATE        LEVEL   NAME
    ------------------------------------------------------------------------------
    
    .
    .
    .
       58:  wild-one   critical offline            available            CPU2

  2. Enter either of the following commands to apply power:

    # /usr/sbin/hwmgr -power on -name cpu-name

    # /usr/sbin/hwmgr -power on -id HWID

The CPU must be placed on line before the operating system schedules processes to be run on the CPU. If you want to bring the CPU on line, see Section 4.6.5.

4.6.5    Putting a CPU On Line Using the hwmgr Command

To put a CPU on line using the hwmgr command, do the following:

  1. Enter the following command to verify the status of the component:

    # hwmgr -status component

                       STATUS   ACCESS                          INDICT
     HWID:  HOSTNAME   SUMMARY  STATE              STATE        LEVEL   NAME
    ------------------------------------------------------------------------------
    
    .
    .
    .
       57:  wild-one            online             available            CPU0
       58:  wild-one   critical offline            available            CPU2
       59:  wild-one            online             available            CPU4
       60:  wild-one            online             available            CPU6

  2. Enter either of the following commands to put the CPU on line:

    # /usr/sbin/hwmgr -online -name cpu-name

    # /usr/sbin/hwmgr -online -id HWID
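To find the CPUs that can be placed on line (access state off line, state available, as described in Section 4.6.1), you can filter the status output. The following sketch assumes one component per line in the column order shown above and locates the access state by scanning for the word offline, because the STATUS SUMMARY and INDICT LEVEL columns may be empty. The function name onlinable_cpus is hypothetical.

```shell
# onlinable_cpus
# Reads "hwmgr -status component" output on standard input and prints
# "HWID NAME" for each component that is off line but available.
onlinable_cpus() {
    awk '
        $1 ~ /^[0-9]+:$/ {
            for (i = 2; i < NF; i++)
                if ($i == "offline" && $(i + 1) == "available") {
                    id = $1
                    sub(/:$/, "", id)   # "58:" becomes "58"
                    print id, $NF       # NAME is the last column
                    break
                }
        }
    '
}
```

Each HWID that the function prints is a candidate for the -online command shown in step 2.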

4.7    Monitoring and Managing Components with SysMan Station

SysMan Station is a GUI-based System Management application and is a central point from which to manage and monitor your system.

Use SysMan Station to monitor components such as CPUs. When viewing any system component, you can obtain detailed information on its properties or launch applications that enable you to perform administrative tasks on the component.

Unlike the SysMan Menu, the SysMan Station cannot be used in a character-cell user environment. SysMan Station requires that your system support graphics capability.

The SysMan Station has extensive Online Help for using the application.

The Status view of the SysMan Station tells you the status of your system at a glance. The Status lights for each attention group can be green, yellow, or red. If an attention group has a green light, all is well for that part of the system. If an attention group has a yellow light, there has been an event indicating a warning for that group. A red light for one of the attention groups indicates a critical condition that requires attention. Figure 4-1 shows indicted components causing the System attention group to be red.

Figure 4-1:  Status View
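The precedence among the three light colors can be sketched as follows. This is an illustration only: SysMan Station sets the lights from events, and treating a list of component status values as the input is an assumption made for this example. The function name attention_light is hypothetical.

```shell
# attention_light
# Reads one status value per line (for example normal, warning, critical)
# on standard input and prints the resulting light color. Critical takes
# precedence over warning, and warning over everything else.
attention_light() {
    awk '
        $1 == "critical" { red = 1 }
        $1 == "warning"  { yellow = 1 }
        END { print red ? "red" : (yellow ? "yellow" : "green") }
    '
}
```

A single critical value is enough to turn the light red, which matches the indicted components in Figure 4-1 causing the System attention group to be red.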

If a component is indicted, you can view recent events by double clicking on the System attention group. Figure 4-2 shows an example of state change events. You can double click on any of the events to see full information on the event.

Figure 4-2:  Events

To view the status of a component, in the Hardware View of SysMan Station, click on the component with MB3. A window will open. Select the Properties menu item to see the current status of the component, including all values discussed in Section 4.3.1 and indictment values. Figure 4-3 shows an example of the properties of a CPU.

Figure 4-3:  CPU Properties