5    Cluster Application Availability

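This chapter provides an overview of cluster application availability and describes the CAA architecture, resources, resource profiles, and action scripts.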

5.1    Overview

The cluster application availability (CAA) subsystem provides high availability for single-instance applications and the capability to monitor applications and the state of other types of resources, such as network interfaces, tape devices, and media changer devices. (A single-instance application runs on a single member of a cluster, and cannot be run on more than one member at a time.) A single instance of any application that can run on Tru64 UNIX can be made highly available in a cluster with CAA. For example, in a cluster, the daemons for BIND (named), DHCP (joind), and network locking (rpc.lockd and rpc.statd) are managed by CAA.

Each application under CAA control has a resource profile, which describes that application's resource requirements and the circumstances under which it can be relocated to another cluster member. CAA monitors the state of cluster members and resources to ensure that each application runs on a member that meets its resource requirements. Resource profiles can be created and managed through either a command-line interface or a graphical user interface (GUI).

CAA can automatically relocate an application to another cluster member if a required resource, or the current member itself, becomes unavailable. This feature requires no changes to the application itself, and can be used with any single-instance application. CAA also monitors resources so that it can restart application resources that have gone off line due to a resource failure.

Note

CAA's resource monitoring and application restart capabilities are enhancements to the type of application availability provided by available server environment (ASE) for user-defined services in previous TruCluster products.

Figure 5-1 shows how the failure of one member results in the failover of an application to the second member. If clients access the application through a cluster alias, the cluster alias subsystem automatically forwards connection requests to the second member.

Figure 5-1:  Application Failover with CAA

5.2    CAA Architecture

The CAA subsystem consists of the following components:

resource

A resource is a cluster software or hardware component that provides a service to end users or to other software components. Resources are the building blocks that CAA uses to make services highly available to clients. CAA supports the following types of resources: applications, network interfaces, tape drives, and media changers.

resource manager

The resource manager communicates with all the components of the CAA subsystem, as well as the connection manager and the event manager (EVM).

The resource manager consists of all the CAA daemons running on cluster members. Each CAA daemon (caad) starts, stops, relocates, and restarts application resources when a required resource, the application itself, or a cluster member fails. Each cluster member runs a CAA daemon. These daemons are independent but they communicate with each other, sharing information about the status of the resources.

The resource manager also uses the resource monitors that monitor the status of a particular type of resource.

resource monitor

A resource monitor is a shared library located in /var/cluster/caa/monitors, which is loaded by the resource manager, caad, at boot time. There is one resource monitor for each type of resource (application, network, tape, and media changer).

resource profile

Resource profiles contain the information needed by the resource manager and monitors to control application relocation and monitor resources.

A resource profile contains keyword/value pairs that define a resource, its dependencies (for application resources), and how the resource is managed by CAA. Once the resource is registered with caa_register, the resource manager can use the resource profile.

Resource profiles can be created with the caa_profile command, with SysMan, or in any text editor. Profiles that are created or modified using a text editor should be validated using caa_profile -validate to ensure correct syntax. Errors other than syntactical errors are detected at the time of registration. This two-stage validation allows profiles to be created with dependencies on resources that are currently off line or yet to be created.

Resource profiles are located in the /var/cluster/caa/profile directory. The file names of resource profiles take the form resource_name.cap.

action script

An action script is a set of commands used by CAA to start, stop, and check an application. The name of an application's action script is defined in that application's resource profile.

You can create or update an action script using the command-line interface, SysMan, or a text editor.

Action scripts are located in the /var/cluster/caa/script directory. The file names of action scripts take the form resource_name.scr.

command-line interface

The CAA subsystem provides the caa_profile, caa_register, caa_unregister, caa_start, caa_stop, caa_relocate, and caa_stat commands to manage and monitor resources. See caa(4) for a list of all CAA reference pages.

The command-line interface interacts with resource profiles, action scripts, and the resource manager.

graphical user interface

SysMan Menu and SysMan Station provide graphical user interfaces (GUIs) to perform system management tasks for the cluster, cluster members, and CAA applications. For more information on using the GUIs for performing system management tasks for CAA applications, see sysman(8) and the online help for the SysMan Menu and SysMan Station.

The CAA GUI calls the command-line interface to interact with resource profiles, action scripts, and the resource manager.

Although the connection manager and event manager are not part of the CAA subsystem, the subsystem makes extensive use of these facilities.

Figure 5-2 shows a graphical representation of the CAA architecture.

Figure 5-2:  CAA Architecture

5.3    Resources

A resource is a cluster software or hardware component that provides a service to end users or to other software components. Resources are the building blocks that CAA uses to make services highly available to clients. CAA supports the following types of resources: applications, network interfaces, tape drives, and media changers.

5.4    Resource Profiles

Each resource has a resource profile, which defines the resource, lists any dependencies, and provides instructions for how CAA should manage the resource. A resource profile is a simple text file containing a list of keyword/value pairs described in caa(4). By default, all resource profiles are located in the /var/cluster/caa/profile directory.

A resource profile must be registered through the caa_register command in order for CAA to monitor and manage the resource.

The following sections describe the two types of resource profiles:

5.4.1    Application Resource Profiles

For an application resource, a resource profile can contain the application's type, name, check interval, monitoring thresholds, resource dependencies (required resources), optional resources, hosting member list, placement policy, restart attempts, failover delay, auto start value, active placement value, and name of the resource's action script. Some keywords are optional. For example, the following sample named.cap resource profile does not set an active placement value, which means that the placement of the application will not be reevaluated when a member boots into the cluster.

cat named.cap
TYPE = application
NAME = named
DESCRIPTION = BIND Server
CHECK_INTERVAL = 
FAILURE_THRESHOLD = 0
FAILURE_INTERVAL = 0
REQUIRED_RESOURCES = 
OPTIONAL_RESOURCES = 
HOSTING_MEMBERS = 
PLACEMENT = balanced
RESTART_ATTEMPTS = 
FAILOVER_DELAY = 
AUTO_START = 
ACTION_SCRIPT = named.scr
 

The caa(4) reference page provides detailed descriptions of each type of profile and keyword. In addition, see the TruCluster Server Highly Available Applications manual and caa_profile(8) for more information on the contents and creation of application resource profiles.
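The keyword/value layout of a profile such as named.cap is simple enough to read with a few lines of code. The following Python sketch is illustrative only and is not part of CAA: the function name parse_profile is hypothetical, and the sketch assumes one KEYWORD = value pair per line, with an empty value allowed. It does not check keyword names or values; use caa_profile -validate for real validation.

```python
def parse_profile(text):
    """Parse 'KEYWORD = value' lines into a dict; empty values become ''."""
    profile = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if "=" not in line:
            raise ValueError(f"not a keyword/value pair: {line!r}")
        # Split on the first '=' only, so a value may itself contain '='.
        keyword, _, value = line.partition("=")
        profile[keyword.strip()] = value.strip()
    return profile


# A shortened copy of the sample profile; only keywords shown in this
# chapter are used.
sample = """\
TYPE = application
NAME = named
DESCRIPTION = BIND Server
HOSTING_MEMBERS =
PLACEMENT = balanced
ACTION_SCRIPT = named.scr
"""

profile = parse_profile(sample)
print(profile["NAME"], profile["PLACEMENT"], profile["ACTION_SCRIPT"])
```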

The remainder of this section takes a brief look at placement policies, hosting members, active placement, and failure threshold and failure interval. Action scripts are described in Section 5.5.

An application's placement policy determines where the application is started. Supported policies are: balanced, favored, and restricted.

balanced

CAA favors starting or restarting the application resource on the member currently running the fewest application resources. Placement due to optional resources is considered first. Next, the host with the fewest application resources running is chosen. If no cluster member is favored by these criteria, any available member is chosen.

favored

CAA refers to the list of members in the HOSTING_MEMBERS attribute of the resource profile. Only cluster members that are both in this list and satisfy the required resources are eligible for placement consideration. Placement due to optional resources is considered first. If no member can be chosen based on optional resources, the order of the hosting members decides which member will run the application resource. If none of the members in the hosting member list are available, CAA favors placing the application resource on the member running the fewest application resources.

You must specify a hosting members list when you select a favored placement policy.

restricted

Similar to the favored placement policy, except that if none of the members on the hosting members list are available, CAA will not start or restart the application resource. A restricted placement policy ensures that the resource will never run on a member that is not on the list, unless you manually relocate it to that member.

You must specify a hosting members list when you select a restricted placement policy.

Hosting members are the members, listed in order of preference, that CAA considers when the application is (a) started, or (b) relocated. A hosting member list is used only in conjunction with the favored or restricted placement policies.
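The three placement policies can be compared side by side in a small model. The following Python sketch is a simplified reading of the descriptions above, not the CAA implementation. The Member class and the choose_member function are hypothetical names. The sketch also assumes that "placement due to optional resources" means preferring the member on which the most optional resources are available, and that later criteria are applied only among members that tie on that count.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    name: str
    running_apps: int = 0                        # application resources running here
    resources: set = field(default_factory=set)  # resources available on this member


def _fewest_apps(candidates):
    """Name of the member running the fewest applications, or None."""
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.running_apps).name


def choose_member(members, policy, required=(), optional=(), hosting=()):
    """Return the name of the member to start on, or None if none qualifies."""
    if policy not in ("balanced", "favored", "restricted"):
        raise ValueError(f"unknown placement policy: {policy}")
    # Only members that satisfy the required resources are eligible.
    eligible = [m for m in members if set(required) <= m.resources]
    if policy in ("favored", "restricted"):
        listed = [m for m in eligible if m.name in hosting]
        if not listed:
            if policy == "restricted":
                return None                # never leave the hosting list
            return _fewest_apps(eligible)  # favored falls back to the lightest member
        eligible = listed
    if not eligible:
        return None

    # Optional resources are considered first (assumption: most available wins).
    def optional_score(m):
        return len(set(optional) & m.resources)

    best = max(optional_score(m) for m in eligible)
    tied = [m for m in eligible if optional_score(m) == best]
    if len(tied) == 1:
        return tied[0].name
    if policy == "balanced":
        return _fewest_apps(tied)
    # favored and restricted: the earliest entry in the hosting list wins.
    order = list(hosting)
    return min(tied, key=lambda m: order.index(m.name)).name


members = [
    Member("m1", running_apps=3, resources={"net0"}),
    Member("m2", running_apps=1),
    Member("m3", running_apps=2, resources={"net0"}),
]
print(choose_member(members, "balanced"))
print(choose_member(members, "favored", hosting=["m3", "m1"]))
print(choose_member(members, "restricted", hosting=["m9"]))
```

With the same unavailable hosting list, the favored policy would still place the application on the least-loaded member, whereas the restricted policy returns no member at all; that is the only difference between the two in this model.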

Active placement causes CAA to reevaluate the placement of an application when a member is added to the cluster or reboots into it. If a more highly favored cluster member joins the cluster and active placement is on, the application stops on its current member and restarts on the more favored member.
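The active placement decision can be sketched as a single comparison. The following Python fragment is an illustration under stated assumptions, not CAA code: the function name should_relocate is hypothetical, "more highly favored" is taken to mean "earlier in the hosting members list", and optional resources are ignored.

```python
def should_relocate(current, joined, hosting_members, active_placement):
    """True if the application should move from `current` to `joined`."""
    if not active_placement or joined not in hosting_members:
        return False
    if current not in hosting_members:
        # Assumption: any listed member is more favored than an unlisted one.
        return True
    # Earlier position in the list means more favored.
    return hosting_members.index(joined) < hosting_members.index(current)


hosting = ["m1", "m2", "m3"]
print(should_relocate("m2", "m1", hosting, active_placement=True))
print(should_relocate("m2", "m1", hosting, active_placement=False))
```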

Failure threshold and failure interval values are used together to stop an application that repeatedly fails. If an application fails too many times during the failure interval time, the application is not started again. These values are considered only when a check of the application fails, and not at initial start attempts.
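One way to picture how the two values work together is a sliding window of recent check failures. The following Python sketch is a model only: the FailureTracker class is hypothetical, the interval is assumed to be in seconds, reaching the threshold within the interval is assumed to be "too many", and a threshold of 0 is assumed to turn tracking off. See caa(4) for the actual semantics.

```python
from collections import deque


class FailureTracker:
    def __init__(self, threshold, interval):
        self.threshold = threshold  # models FAILURE_THRESHOLD
        self.interval = interval    # models FAILURE_INTERVAL (assumed seconds)
        self.failures = deque()     # timestamps of recent check failures

    def record_failure(self, now):
        """Record a failed check at time `now`; True means stop restarting."""
        if self.threshold <= 0:
            return False  # assumption: 0 disables failure tracking
        self.failures.append(now)
        # Forget failures that fall outside the interval.
        while self.failures and now - self.failures[0] > self.interval:
            self.failures.popleft()
        return len(self.failures) >= self.threshold


tracker = FailureTracker(threshold=3, interval=60)
for t in (0, 50, 100, 105):
    print(t, tracker.record_failure(t))
```

In this run the failure at time 0 has aged out by time 100, so the limit is reached only at time 105, when three failures (50, 100, 105) fall inside one 60-second window.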

The restart attempts value defines the maximum number of times that an application start or restart is attempted on one cluster member before that attempt is considered failed.
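The restart attempts limit amounts to a bounded retry loop on one member. The following Python sketch is illustrative; start_with_retries is a hypothetical name, and start stands in for whatever actually launches the application and reports success or failure.

```python
def start_with_retries(start, restart_attempts):
    """Call start() up to restart_attempts times on one member.

    Returns the number of the attempt that succeeded, or None if every
    attempt failed and the start on this member is considered failed.
    """
    for attempt in range(1, restart_attempts + 1):
        if start():
            return attempt
    return None


# Stand-in for a start operation that fails twice, then succeeds.
outcomes = iter([False, False, True])
print(start_with_retries(lambda: next(outcomes), restart_attempts=3))
```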

5.4.2    Nonapplication Resource Profiles

All other types of currently supported resources (network, tape, and media changer) have resource profiles that define which resource to monitor and specify the failure threshold and failure interval values. If a nonapplication resource fails too many times during the failure interval time, monitoring of the resource is stopped.

For tape and media changer resources, you identify the device to monitor by its device name; for a network resource, you must define a subnet.

See the TruCluster Server Highly Available Applications manual, caa_profile(8), and caa(4) for detailed descriptions of the contents and creation of resource profiles.

5.5    Action Scripts

An action script is a set of commands used by CAA to start, stop, and check an application. Only application resources have action scripts. The name of an action script is specified as the ACTION_SCRIPT value in the application's resource profile.

By default, action scripts are located in the /var/cluster/caa/script directory, although they can be placed anywhere. The file names of action scripts take the form resource_name.scr.

The TruCluster Server Highly Available Applications manual provides examples of action scripts.

Functionally, an action script resembles an available server environment (ASE) script and the system initialization scripts located in the /sbin/init.d directory.

An action script has multiple entry points that are executed by the CAA commands when an application resource needs to be started or stopped. The start entry point is used by caa_start and caa_relocate to start an application, and the stop entry point is used by caa_stop and caa_relocate to stop an application. The check entry point is used by the resource manager to validate that an application is still running.

Each action script has an associated timeout value defined in the application resource profile. If the action script does not finish executing within this time, CAA considers the start attempt a failure and will either attempt to start the application on another member or fail completely.
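The interaction between an entry point and its timeout can be modeled as running a command under a time limit. The following Python sketch is not how caad is implemented. The function name run_entry_point is hypothetical, and the sketch assumes that the entry point name is passed to the script as its first argument (as with /sbin/init.d scripts) and that an exit status of 0 means success.

```python
import subprocess
import sys


def run_entry_point(script_argv, entry_point, timeout_seconds):
    """Run one entry point; True only if it exits 0 within the timeout."""
    try:
        result = subprocess.run(script_argv + [entry_point],
                                timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # The script did not finish in time: treat it as a failed attempt.
        return False
    return result.returncode == 0


# Stand-in for a real action script so that the sketch runs anywhere:
# it succeeds for "check" and fails for any other entry point.
demo = [sys.executable, "-c",
        "import sys; sys.exit(0 if sys.argv[1] == 'check' else 1)"]
print(run_entry_point(demo, "check", timeout_seconds=30))
```

On a cluster member, script_argv would instead name the action script itself, for example a file of the form resource_name.scr in the action script directory. A False result from the start entry point corresponds to the case in which CAA either tries another member or fails the start completely.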

Both the caa_profile command and the SysMan suite of applications can be used to create simple action scripts when creating resource profiles. You may need to edit these action scripts to customize the start, stop, and check procedures for an application.