5    Cluster Application Availability

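This chapter provides an overview of the cluster application availability subsystem (Section 5.1) and describes its architecture (Section 5.2), the resources it supports (Section 5.3), resource profiles (Section 5.4), and action scripts (Section 5.5).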

5.1    Overview

The cluster application availability (CAA) subsystem provides high availability for single-instance applications and the capability to monitor applications and the state of other types of resources, such as network interfaces, tape devices, and media changer devices. (A single-instance application runs on a single member of a cluster, and cannot be run on more than one member at a time.) A single instance of any application that can run on Tru64 UNIX can be made highly available in a cluster with CAA. For example, in a cluster, the daemons for BIND (named), DHCP (joind), and network locking (rpc.lockd and rpc.statd) are managed by CAA.

Each application under CAA control has a resource profile, which describes that application's resource requirements and the circumstances under which it can be relocated to another cluster member. CAA monitors the state of cluster members and resources to ensure that each application runs on a member that meets its resource requirements. Resource profiles can be created and managed through either a command-line interface or a graphical user interface (GUI).

CAA can automatically relocate an application to another cluster member if a required resource, or the current member itself, becomes unavailable. This feature requires no changes to the application itself, and can be used with any single-instance application. CAA also monitors resources so that it can restart application resources that have gone off line because of a resource failure.

Note

CAA's resource monitoring and application restart capabilities are enhancements to the type of application availability provided by available server environment (ASE) for user-defined services in previous TruCluster products.

Figure 5-1 shows how the failure of one member results in the failover of an application to the second member. If clients access the application through a cluster alias, the cluster alias subsystem automatically forwards connection requests to the second member.

Figure 5-1:  Application Failover with CAA

5.2    CAA Architecture

The CAA subsystem consists of the following components:

resource

A resource is a cluster software or hardware component that provides a service to end users or to other software components. Resources are the building blocks that CAA uses to make services highly available to clients. CAA supports the following types of resources: applications, network interfaces, tape drives, and media changers.

resource manager

The resource manager communicates with all the components of the CAA subsystem, as well as the connection manager and the Event Manager (EVM).

The resource manager consists of all the CAA daemons running on cluster members. Each CAA daemon (caad) starts, stops, relocates, and restarts application resources when a required resource, the application itself, or a cluster member fails. Each cluster member runs a CAA daemon. These daemons are independent but they communicate with each other, sharing information about the status of the resources.

The resource manager also uses resource monitors, each of which monitors the status of a particular type of resource.

resource monitor

A resource monitor is a shared library located in /var/cluster/caa/monitors, which is loaded by the resource manager, caad, at boot time. There is one resource monitor for each type of resource (application, network, tape, and media changer).

resource profile

Resource profiles contain the information needed by the resource manager and monitors to control application relocation and monitor resources.

A resource profile contains keyword/value pairs that define a resource, its dependencies (for application resources), and how the resource is managed by CAA. After the resource is registered with caa_register, the resource manager can use the resource profile.

The caa_profile command and SysMan can create resource profiles, or they can be created in any text editor. (Use caa_profile -validate to ensure the correct syntax of profiles that are created or modified using a text editor.) Errors other than syntactical errors are detected at the time of registration. This two-stage validation allows for profiles to be created with dependencies on resources that are currently off line or yet to be created.

Resource profiles are located in the /var/cluster/caa/profile directory. The file names of resource profiles take the form resource_name.cap.

action script

An action script is a set of commands that are used by CAA to start, stop, and check an application. The name of an application's action script is defined in that application's resource profile.

You can create or update an action script using the command-line interface, SysMan, or a text editor.

Action scripts are located in the /var/cluster/caa/script directory. The file names of action scripts take the form resource_name.scr.

command-line interface

The CAA subsystem provides the caa_profile, caa_register, caa_unregister, caa_start, caa_stop, caa_relocate, and caa_stat commands to manage and monitor resources. See caa(4) for a list of all CAA reference pages.

The command-line interface interacts with resource profiles, action scripts, and the resource manager.

graphical user interface

SysMan Menu and SysMan Station provide graphical user interfaces (GUIs) to perform system management tasks for the cluster, cluster members, and CAA applications. For more information on using the GUIs for performing system management tasks for CAA applications, see sysman(8) and the online help for the SysMan Menu and SysMan Station.

The CAA GUI calls the command-line interface to interact with resource profiles, action scripts, and the resource manager.

Although the connection manager and Event Manager are not part of the CAA subsystem, the subsystem makes extensive use of these facilities.

Figure 5-2 shows a graphical representation of the CAA architecture.

Figure 5-2:  CAA Architecture

5.3    Resources

A resource is a cluster software or hardware component that provides a service to end users or to other software components. Resources are the building blocks that CAA uses to make services highly available to clients. CAA supports the following types of resources:

application

An executable program. An application resource can have dependencies on other resources, including another application resource. In the resource profile that defines an application resource, these dependencies are defined as either required, REQUIRED_RESOURCES, or optional, OPTIONAL_RESOURCES.

If you define a resource as a required resource and the required resource becomes unavailable, CAA stops the application. CAA then attempts to restart the application on another member that has the required resource. If CAA cannot restart the application on another member because the other member is down or because the placement policy forbids starting the application on that member, the application is stopped. CAA does not restart the application until all required resources are available.

You can use optional resources in conjunction with required resources and the placement policy to help determine the optimal system on which to start an application. If an optional resource becomes unavailable the application does not fail over.

network

A network interface. All cluster members can indirectly access any network attached to any member. An application that makes extensive use of a network connection available on another cluster member can add traffic to the cluster interconnect, and slow down performance of both the application and the cluster. Defining a network resource as a required resource for an application is useful when you want an application to run on a member with direct connectivity to a specific network.

If you define a network resource as a required resource for an application and the network interface adapter fails, CAA attempts to relocate the application to another member, and stops the application if relocation is not possible.

If you define a network resource as an optional resource for an application, CAA starts the application on a member that is directly connected to the network. If the subnet adapter fails, the application reverts to accessing the network indirectly.

tape or changer

A tape drive or media changer. If you define a tape or media changer resource as a required resource for an application, the application always runs on a cluster member with direct connectivity to the tape device or changer. If the device fails, CAA attempts to relocate the application, or stops the application if relocation is not possible.

If you define a tape or media changer resource as an optional resource for an application, CAA attempts to start the application on a member with direct connectivity, but it also runs the application on a member that does not have direct connectivity to the device. Running on a member with direct connectivity to a tape device is desirable to maximize performance.
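The difference between required and optional dependencies described in this section can be sketched in Python. This is an illustrative model, not CAA code: the function name, the dictionary layout, and the returned strings are hypothetical, and member selection is reduced to picking the first eligible member alphabetically rather than applying a placement policy.

```python
def react_to_resource_failure(app, failed_resource, current_member, members_with_resource):
    """Return the action a CAA-like manager might take for one application.

    app: dict with 'required' and 'optional' sets of resource names
         (hypothetical layout).
    members_with_resource: members that are up, permitted by the placement
         policy, and still have the failed resource available.
    """
    if failed_resource in app["required"]:
        # A required resource failed: try to move the application.
        candidates = sorted(m for m in members_with_resource if m != current_member)
        if candidates:
            return "relocate to " + candidates[0]
        # No member can satisfy the requirement, so the application stays stopped.
        return "stop"
    # Loss of an optional (or unrelated) resource does not cause a failover.
    return "keep running"


app = {"required": {"net1"}, "optional": {"tape0"}}
print(react_to_resource_failure(app, "net1", "member1", {"member2"}))
print(react_to_resource_failure(app, "tape0", "member1", set()))
```

The first call models the loss of a required network resource while another member still has it; the second models the loss of an optional tape resource, which leaves the application where it is.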

5.4    Resource Profiles

Each resource has a resource profile, which defines the resource, lists any dependencies, and provides instructions for how CAA should manage the resource. A resource profile is a simple text file containing a list of keyword/value pairs, which are described in caa(4). By default, all resource profiles are located in the /var/cluster/caa/profile directory.

A resource profile must be registered through the caa_register command in order for CAA to monitor and manage the resource.

The following sections describe the two types of resource profiles: application resource profiles and nonapplication resource profiles.

5.4.1    Application Resource Profiles

For an application resource, a resource profile can contain the application's type, name, check interval, monitoring thresholds, resource dependencies (required resources), optional resources, hosting member list, placement policy, restart attempts, failover delay, auto start value, active placement value, and name of the resource's action script. Some keywords are optional. For example, the following sample named.cap resource profile does not set an active placement value, which means that the placement of the application will not be reevaluated when a member boots into the cluster.

cat named.cap
TYPE = application
NAME = named
DESCRIPTION = BIND Server
CHECK_INTERVAL = 
FAILURE_THRESHOLD = 0
FAILURE_INTERVAL = 0
REQUIRED_RESOURCES = 
OPTIONAL_RESOURCES = 
HOSTING_MEMBERS = 
PLACEMENT = balanced
RESTART_ATTEMPTS = 
FAILOVER_DELAY = 
AUTO_START = 
ACTION_SCRIPT = named.scr
 

The caa(4) reference page provides detailed descriptions of each type of profile and keyword. In addition, see the Cluster Highly Available Applications manual and caa_profile(8) for more information on the contents and creation of application resource profiles.
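Because a resource profile is a list of keyword/value pairs, it is straightforward to read one programmatically. The following Python sketch is an assumption-laden illustration, not a CAA tool: the `parse_profile` function is hypothetical, it assumes one `KEYWORD = value` pair per line, and it does not check keywords against caa(4). Use caa_profile -validate for real syntax checking.

```python
def parse_profile(text):
    """Parse 'KEYWORD = value' lines into a dict; empty values become ''."""
    profile = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not a pair
        key, _, value = line.partition("=")
        profile[key.strip()] = value.strip()
    return profile


# A shortened copy of the sample named.cap profile shown above.
SAMPLE = """\
TYPE = application
NAME = named
DESCRIPTION = BIND Server
HOSTING_MEMBERS =
PLACEMENT = balanced
ACTION_SCRIPT = named.scr
"""

profile = parse_profile(SAMPLE)
print(profile["NAME"], profile["PLACEMENT"])
# A keyword that the profile omits, such as ACTIVE_PLACEMENT, is simply absent.
print("ACTIVE_PLACEMENT" in profile)
```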

The remainder of this section discusses placement policies, hosting members, active placement, and failure threshold and failure interval. Action scripts are described in Section 5.5.

An application's placement policy determines where the application is started. Supported policies are: balanced, favored, and restricted.

balanced

CAA favors starting or restarting the application resource on the member that is currently running the fewest application resources. Placement that is due to optional resources is considered first. Next, the host with the fewest application resources running is chosen. If no cluster member is favored by these criteria, any available member is chosen.

favored

CAA refers to the list of members in the HOSTING_MEMBERS attribute of the resource profile. Only cluster members that are both in this list and satisfy the required resources are eligible for placement consideration. Placement due to optional resources is considered first. If no member can be chosen based on optional resources, the order of the hosting members decides which member will run the application resource. If none of the members in the hosting member list are available, CAA favors placing the application resource on the member that is running the fewest application resources.

You must specify a hosting members list when you select a favored placement policy.

restricted

This policy is similar to the favored placement policy, except that if none of the members on the hosting members list are available, CAA will not start or restart the application resource. A restricted placement policy ensures that the resource never runs on a member that is not on the list, unless you manually relocate it to that member.

You must specify a hosting members list when you select a restricted placement policy.

The hosting members list names, in order of preference, the members to consider when the application is started or relocated. A hosting members list is used only with the favored or restricted placement policies.
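The three placement policies can be compared with a small Python model. It is a sketch under stated assumptions rather than the algorithm CAA implements: the `choose_member` function and its parameters are hypothetical, optional resources are reduced to a per-member count, and ties on running application resources are broken by member name.

```python
def choose_member(policy, eligible, running_counts, hosting_members=(), optional_scores=None):
    """Pick a member for an application resource, or None if none qualifies.

    eligible: available members that satisfy the required resources.
    running_counts: member -> number of application resources it runs.
    optional_scores: member -> number of optional resources available there.
    """
    optional_scores = optional_scores or {}
    eligible = list(eligible)
    if not eligible:
        return None

    def fewest(members):
        # Member running the fewest application resources (name breaks ties).
        return min(members, key=lambda m: (running_counts.get(m, 0), m))

    def best_by_optional(members):
        # Keep only members with the most optional resources, preserving order.
        top = max(optional_scores.get(m, 0) for m in members)
        return [m for m in members if optional_scores.get(m, 0) == top]

    if policy == "balanced":
        return fewest(best_by_optional(eligible))
    if policy in ("favored", "restricted"):
        listed = [m for m in hosting_members if m in eligible]
        if listed:
            # Optional resources first, then order in the hosting members list.
            return best_by_optional(listed)[0]
        # No listed member is available: favored falls back, restricted does not.
        return fewest(eligible) if policy == "favored" else None
    raise ValueError("unknown placement policy: " + policy)


counts = {"m1": 3, "m2": 1, "m3": 2}
print(choose_member("balanced", ["m1", "m2", "m3"], counts))
print(choose_member("favored", ["m2"], counts, hosting_members=["m3", "m1"]))
print(choose_member("restricted", ["m2"], counts, hosting_members=["m3", "m1"]))
```

The last two calls show the only difference between favored and restricted in this model: when no member on the hosting members list is available, favored falls back to the least-loaded member and restricted returns no member at all.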

Active placement causes CAA to reevaluate the placement of an application when a member is added to the cluster or reboots. If a more highly favored cluster member joins the cluster and active placement is on, the application stops on its current member and restarts on the more favored member.
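The active placement decision can be modeled as a comparison of positions in the hosting members list. The following Python sketch makes a simplifying assumption that is not stated in this chapter: "more highly favored" is interpreted purely as appearing earlier in the hosting members list. The function name and parameters are hypothetical.

```python
def should_relocate_on_join(active_placement, hosting_members, current, joined):
    """Return True if the application should move to the member that just joined."""
    if not active_placement:
        return False  # placement is not reevaluated when a member joins
    if joined not in hosting_members:
        return False  # the new member is not favored at all
    if current not in hosting_members:
        return True   # any listed member is preferred over an unlisted one
    # An earlier position in the list means a more favored member.
    return hosting_members.index(joined) < hosting_members.index(current)


hosting = ["member1", "member2"]
print(should_relocate_on_join(True, hosting, "member2", "member1"))
print(should_relocate_on_join(False, hosting, "member2", "member1"))
```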

Failure threshold and failure interval values are used together to stop an application that repeatedly fails. If an application fails too many times during the failure interval time, the application is not started again. These values are considered only when a check of the application fails, and not at initial start attempts.
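The interaction between the failure threshold and the failure interval resembles a sliding-window counter. The Python sketch below illustrates that idea only. The `FailureTracker` class is hypothetical, and it assumes that reaching the threshold within the interval stops further restarts and that a value of 0 disables the check; consult caa(4) for the actual semantics.

```python
from collections import deque


class FailureTracker:
    """Count failed checks inside a sliding time window."""

    def __init__(self, threshold, interval):
        self.threshold = threshold  # models FAILURE_THRESHOLD
        self.interval = interval    # models FAILURE_INTERVAL, in seconds
        self.failures = deque()

    def record_failure(self, now):
        """Record a failed check at time `now`; return True if a restart is still allowed."""
        if self.threshold <= 0 or self.interval <= 0:
            return True  # assumption: zero disables failure tracking
        self.failures.append(now)
        # Forget failures that are older than the interval.
        while self.failures and now - self.failures[0] > self.interval:
            self.failures.popleft()
        return len(self.failures) < self.threshold


tracker = FailureTracker(threshold=3, interval=60)
print([tracker.record_failure(t) for t in (0, 10, 20)])
```

In this model, three failures within 60 seconds exhaust the threshold, so the third call reports that the application should not be started again. Failures spaced farther apart than the interval never accumulate.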

The restart attempts value defines the maximum number of times that an application start or restart is attempted on one cluster member before that attempt is considered failed.
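The restart attempts value can be thought of as a bound on a retry loop that runs on a single member. The following Python sketch is a hypothetical illustration: `start_with_attempts` and `try_start` are invented names, and the sketch assumes the value counts the total number of tries on that member.

```python
def start_with_attempts(try_start, restart_attempts):
    """Call try_start() up to restart_attempts times; return True on the first success."""
    for _ in range(restart_attempts):
        if try_start():
            return True
    # Every attempt on this member failed; a caller could now try another member.
    return False


# Simulated start results: two failures followed by a success.
outcomes = iter([False, False, True])
print(start_with_attempts(lambda: next(outcomes), 4))
```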

5.4.2    Nonapplication Resource Profiles

All other types of currently supported resources (network, tape, and media changer) have resource profiles that define which resource to monitor and specify the failure threshold and failure interval values. If a nonapplication resource fails too many times during the failure interval time, monitoring of the resource is stopped.

For tape and media changer resources, you define which device to monitor by its device name; for a network resource, you must define a subnet.

See the Cluster Highly Available Applications manual, caa_profile(8), and caa(4) for detailed descriptions of the contents and creation of resource profiles.

5.5    Action Scripts

An action script is a set of commands used by CAA to start, stop, and check an application. Only application resources have action scripts. The name of an action script is specified as the ACTION_SCRIPT value in the application's resource profile.

By default, action scripts are located in the /var/cluster/caa/script directory, although they can be placed anywhere. The file names of action scripts take the form resource_name.scr.

The Cluster Highly Available Applications manual provides examples of action scripts.

Functionally, an action script resembles an available server environment (ASE) script or one of the system initialization scripts located in the /sbin/init.d directory.

An action script has multiple entry points that are executed by the CAA commands when an application resource needs to be started or stopped. The start entry point is used by caa_start and caa_relocate to start an application, and the stop entry point is used by caa_stop and caa_relocate to stop an application. The check entry point is used by the resource manager to validate that an application is still running.
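The entry-point structure can be pictured as a dispatcher that maps an argument to a start, stop, or check routine, much like an init.d script. The Python sketch below models only that structure. It is not a CAA action script (see the Cluster Highly Available Applications manual for real examples), the function bodies are placeholders, and the convention that 0 means success is an assumption.

```python
import sys


def start():
    print("starting application")  # placeholder for the real start commands
    return 0


def stop():
    print("stopping application")  # placeholder for the real stop commands
    return 0


def check():
    print("checking application")  # placeholder for a liveness test
    return 0


ENTRY_POINTS = {"start": start, "stop": stop, "check": check}


def main(argv):
    """Dispatch on the entry-point name given as the single argument."""
    if len(argv) != 2 or argv[1] not in ENTRY_POINTS:
        print("usage: %s {start|stop|check}" % argv[0], file=sys.stderr)
        return 2
    return ENTRY_POINTS[argv[1]]()


# Simulate the resource manager invoking the check entry point.
exit_code = main(["named.scr", "check"])
print(exit_code)
```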

Each action script has an associated timeout value defined in its application resource profile. If the action script does not finish executing within this time, CAA considers the start attempt a failure and either attempts to start the application on another member or fails completely.
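The timeout behavior can be illustrated by running a command under a time limit and treating expiry as a failure. The Python sketch below uses the standard `subprocess.run` call with its `timeout` parameter. The `run_entry_point` helper is hypothetical and does not reflect how caad invokes action scripts; the two commands are stand-ins for a fast and a hung entry point.

```python
import subprocess
import sys


def run_entry_point(command, timeout_seconds):
    """Run a command; return True only if it exits with 0 within the timeout."""
    try:
        result = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run; treat expiry as a failed attempt.
        return False
    return result.returncode == 0


# Stand-ins for an entry point that finishes quickly and one that hangs.
quick = [sys.executable, "-c", "pass"]
slow = [sys.executable, "-c", "import time; time.sleep(5)"]

print(run_entry_point(quick, 30))
print(run_entry_point(slow, 0.5))
```

A caller that receives a failure from the slow command could then try another member or give up, which mirrors the two outcomes described above.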

Both the caa_profile command and the SysMan suite of applications can be used to create simple action scripts when creating resource profiles. You may need to edit these action scripts to customize the start, stop, and check procedures for an application.