8    Managing Highly Available Applications

This chapter describes the management tasks that are associated with highly available applications and the cluster application availability (CAA) subsystem. The following sections discuss these and other topics:

For detailed information on setting up applications with CAA, see the TruCluster Server Cluster Highly Available Applications manual. For a general discussion of CAA, see the TruCluster Server Cluster Technical Overview.

After an application has been made highly available and is running under the management of the CAA subsystem, it requires little intervention from you. However, the following situations can arise where you might want to actively manage a highly available application:

When you work with application resources, the actual names of the applications that are associated with a resource are not necessarily the same as the resource name. The name of an application resource is the same as the root name of its resource profile. For example, the resource profile for the cluster_lockd resource is /var/cluster/caa/profile/cluster_lockd.cap. The applications that are associated with the cluster_lockd resource are rpc.lockd and rpc.statd.

Because a resource and its associated application can have different names, there are cases where it is futile to look for a resource name in a list of processes running on the cluster. When managing an application with CAA, you must use its resource name.

8.1    Learning the Status of a Resource

Registered resources have an associated state. A resource can be in one of the following three states: ONLINE, OFFLINE, or UNKNOWN.

CAA always tries to match the state of an application resource to its target state. The target state is set to ONLINE when you use caa_start and to OFFLINE when you use caa_stop. If the state of an application resource does not equal its target state, either CAA is in the middle of starting or stopping the application, or the application has failed to run or start successfully. If the target state for a nonapplication resource is ever OFFLINE, the resource has exceeded its failure threshold within the failure interval. See Section 8.5 for more information.

From the information given in the Target and State fields, you can ascertain information about the resource. Descriptions of what combinations of the two fields can mean for the different types of resources are listed in Table 8-1 (application), Table 8-2 (network), and Table 8-3 (tape, media changer). If a resource has any combination of State and Target other than both ONLINE, all resources that require that resource have a state of OFFLINE.

Table 8-1:  Target and State Combinations for Application Resources

Target    State     Description
---------------------------------------------------------
ONLINE    ONLINE    Application has started successfully.
ONLINE    OFFLINE   Start command has been issued but execution of action script start entry point not yet complete.
                    Application stopped because of failure of required resource.
                    Application has active placement on and is being relocated due to the starting or addition of a new cluster member.
                    Application being relocated due to explicit relocation or failure of cluster member.
                    No suitable member to start the application is available.
OFFLINE   ONLINE    Stop command has been issued, but execution of action script stop entry point not yet complete.
OFFLINE   OFFLINE   Application has not been started yet.
                    Application stopped because Failure Threshold has been reached.
                    Application has been successfully stopped.
ONLINE    UNKNOWN   Action script stop entry point has returned failure.
OFFLINE   UNKNOWN   A command to stop the application was issued on an application in state UNKNOWN. Action script stop entry point still returns failure. To set application state to OFFLINE use caa_stop -f.

Table 8-2:  Target and State Combinations for Network Resources

Target    State     Description
---------------------------------------------------------
ONLINE    ONLINE    Network is functioning correctly.
ONLINE    OFFLINE   There is no direct connectivity to the network from the cluster member.
OFFLINE   ONLINE    Network card is considered failed and no longer monitored by CAA because Failure Threshold has been reached.
OFFLINE   OFFLINE   Network is not directly accessible to machine.
                    Network card is considered failed and no longer monitored by CAA because Failure Threshold has been reached.

Table 8-3:  Target and State Combinations for Tape and Media Changer Resources

Target    State     Description
---------------------------------------------------------
ONLINE    ONLINE    Tape or media changer has a direct connection to the machine and is functioning correctly.
ONLINE    OFFLINE   Tape device or media changer associated with resource has sent out an Event Manager (EVM) event that it is no longer working correctly. Resource is considered failed.
OFFLINE   ONLINE    Tape device or media changer is considered failed and no longer monitored by CAA because Failure Threshold has been reached.
OFFLINE   OFFLINE   Tape device or media changer does not have a direct connection to the cluster member.

8.1.1    Learning the State of a Resource

To learn the state of a resource, enter the caa_stat command as follows:

# caa_stat resource_name

The command returns the following values:

For example:

# caa_stat clock
NAME=clock
TYPE=application
TARGET=ONLINE
STATE=ONLINE on provolone

To use a script to learn whether a resource is online, use the -r option for the caa_stat command as follows:

# caa_stat  resource_name -r ; echo $?

A value of 0 (zero) is returned if the resource is in the ONLINE state.

With the -g option for the caa_stat command, you can use a script to learn whether an application resource is registered as follows:

# caa_stat  resource_name -g ; echo $?

A value of 0 (zero) is returned if the resource is registered.
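The two exit-status checks above can be combined in a script. The following is a minimal sketch, not a supported tool: the function name resource_status is hypothetical, and it assumes that caa_stat returns a nonzero value whenever the -g or -r test does not succeed.

```sh
# Sketch: classify a CAA resource using only caa_stat exit codes.
# Assumption: caa_stat exits nonzero when the -g (registered) or
# -r (ONLINE) test fails. resource_status is a hypothetical name.
resource_status() {
    res=$1
    if ! caa_stat "$res" -g >/dev/null 2>&1; then
        echo "unregistered"
    elif caa_stat "$res" -r >/dev/null 2>&1; then
        echo "online"
    else
        echo "not online"
    fi
}
```

For example, a maintenance script might call resource_status clock and continue only when the result is online.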

8.1.2    Learning Status of All Resources on One Cluster Member

The caa_stat -c cluster_member command returns the status of all resources on cluster_member. For example:

# caa_stat -c polishham
NAME=dhcp
TYPE=application
TARGET=ONLINE
STATE=ONLINE on polishham
 
NAME=named
TYPE=application
TARGET=ONLINE
STATE=ONLINE on polishham
 
NAME=xclock
TYPE=application
TARGET=ONLINE
STATE=ONLINE on polishham

This command is useful when you need to shut down a cluster member and want to learn which applications are candidates for failover or manual relocation.

8.1.3    Learning Status of All Resources on All Cluster Members

The caa_stat command returns the status of all resources on all cluster members. For example:

# caa_stat
NAME=dhcp
TYPE=application
TARGET=ONLINE
STATE=ONLINE on polishham
 
NAME=xclock
TYPE=application
TARGET=ONLINE
STATE=ONLINE on provolone
 
NAME=named
TYPE=application
TARGET=OFFLINE
STATE=OFFLINE
 
NAME=ln0
TYPE=network
TARGET=ONLINE on provolone
TARGET=ONLINE on polishham
TARGET=ONLINE on peppicelli
STATE=OFFLINE on provolone
STATE=ONLINE on polishham
STATE=ONLINE on peppicelli

When you use the -t option, the information is displayed in tabular form. For example:

# caa_stat  -t
 
Name           Type           Target    State     Host
---------------------------------------------------------
cluster_lockd  application    ONLINE    ONLINE    provolone
dhcp           application    OFFLINE   OFFLINE
named          application    OFFLINE   OFFLINE         
ln0            network        ONLINE    ONLINE    provolone
ln0            network        ONLINE    OFFLINE   polishham
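Because the tabular form places Target and State side by side, it lends itself to simple filtering. The following sketch reads caa_stat -t output on standard input and prints only the rows whose State differs from Target. The function name mismatched_resources is hypothetical, and the sketch assumes the column layout shown in the example above.

```sh
# Sketch: list resources whose State does not match their Target.
# Reads "caa_stat -t" output on standard input.
# Assumption: columns are Name, Type, Target, State, and an optional Host.
mismatched_resources() {
    awk '
        # Skip blank lines, the header row, and the dashed rule.
        NF < 4 || $1 == "Name" { next }
        $3 != $4 {
            printf "%s target=%s state=%s%s\n", $1, $3, $4,
                   (NF >= 5 ? " host=" $5 : "")
        }
    '
}
```

On a cluster, the function would be used as caa_stat -t | mismatched_resources. With the sample output above, only the ln0 row for polishham is reported.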

8.1.4    Getting Number of Failures and Restarts and Target States

The caa_stat -v command returns the status, including number of failures and restarts, of all resources on all cluster members. For example:

# caa_stat -v
NAME=cluster_lockd
TYPE=application
RESTART_COUNT=0
RESTART_ATTEMPTS=30
FAILURE_COUNT=0
FAILURE_THRESHOLD=0
TARGET=ONLINE
STATE=ONLINE on provolone
 
NAME=dhcp
TYPE=application
RESTART_COUNT=0
RESTART_ATTEMPTS=1
FAILURE_COUNT=1
FAILURE_THRESHOLD=3
TARGET=ONLINE
STATE=OFFLINE
 
NAME=ln0
TYPE=network
FAILURE_THRESHOLD=5
FAILURE_COUNT=1 on provolone
FAILURE_COUNT=0 on polishham
TARGET=ONLINE on provolone
TARGET=OFFLINE on polishham 
STATE=ONLINE on provolone
STATE=OFFLINE on polishham

When you use the -t option, the information is displayed in tabular form. For example:

# caa_stat -v -t
 
Name           Type           R/RA   F/FT   Target    State     Host
----------------------------------------------------------------------
cluster_lockd  application    0/30   0/0    ONLINE    ONLINE    provolone
dhcp           application    0/1    0/0    OFFLINE   OFFLINE
named          application    0/1    0/0    OFFLINE   OFFLINE         
ln0            network               0/5    ONLINE    ONLINE    provolone
ln0            network               1/5    ONLINE    OFFLINE   polishham

This information can be useful for finding resources that frequently fail or have been restarted many times.
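To find such resources without reading every stanza, the non-tabular caa_stat -v output can be filtered. The following sketch prints each resource that has a nonzero FAILURE_COUNT, with the member name when one is given. The function name failing_resources is hypothetical, and the sketch assumes the NAME= and FAILURE_COUNT= line formats shown in the caa_stat -v example above.

```sh
# Sketch: report resources with a nonzero failure count.
# Reads "caa_stat -v" output on standard input and prints
# "name count member", with "-" when no member is listed.
failing_resources() {
    awk '
        /^NAME=/ { name = substr($0, 6) }
        /^FAILURE_COUNT=/ {
            # Text after "FAILURE_COUNT=" is "N" or "N on member".
            n = split(substr($0, 15), f, " ")
            if (f[1] + 0 > 0)
                print name, f[1], (n >= 3 ? f[3] : "-")
        }
    '
}
```

On a cluster, the function would be used as caa_stat -v | failing_resources.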

8.2    Relocating Applications

There are times when you may want to relocate applications from one cluster member to another. You may want to:

You use the caa_relocate command to relocate applications. Whenever you relocate applications, the system returns messages tracking the relocation. For example:

Attempting to stop `cluster_lockd` on member `provolone`
Stop of `cluster_lockd` on member `provolone` succeeded.
Attempting to start `cluster_lockd` on member `pepicelli`
Start of `cluster_lockd` on member `pepicelli` succeeded.

The following sections discuss relocating applications in more detail.

8.2.1    Manual Relocation of All Applications on a Cluster Member

When you shut down a cluster member, CAA automatically relocates all applications under its control running on that member, according to the placement policy for each application. However, you might want to manually relocate the applications before shutdown of a cluster member for the following reasons:

To relocate all applications from member1 to member2, enter the following command:

# caa_relocate -s member1 -c member2

To relocate all applications on member1 according to each application's placement policy, enter the following command:

# caa_relocate -s member1

Use the caa_stat command to verify that all application resources were successfully relocated.
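The relocate-then-verify sequence can be wrapped in a small function. The following is a sketch under stated assumptions: the name drain_member is hypothetical, caa_relocate is assumed to return a nonzero value on failure, and caa_stat -c is assumed to list only the resources currently on the named member.

```sh
# Sketch: relocate all applications off a member, then verify.
# Usage: drain_member source_member [destination_member]
# Assumptions: caa_relocate exits nonzero on failure, and
# "caa_stat -c member" lists only resources currently on that member.
drain_member() {
    src=$1
    dst=${2:-}
    if [ -n "$dst" ]; then
        caa_relocate -s "$src" -c "$dst" || return 1
    else
        # No destination given: let each placement policy decide.
        caa_relocate -s "$src" || return 1
    fi
    # Count application resources still reported on the source member.
    remaining=$(caa_stat -c "$src" | grep -c '^TYPE=application')
    [ "$remaining" -eq 0 ]
}
```

A nonzero return value means that either the relocation failed or at least one application resource is still reported on the source member, so the member is not yet ready to shut down.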

8.2.2    Manual Relocation of a Single Application

You may want to relocate a single application to a specific cluster member for one of the following reasons:

To relocate a single application to member2, enter the following command:

# caa_relocate resource_name -c member2

Use the caa_stat command to verify that the application resource was successfully relocated.

8.2.3    Manual Relocation of Dependent Applications

You may want to relocate a group of applications that depend on each other. An application resource that has at least one other application resource listed in the REQUIRED_RESOURCES field of its profile depends on those applications. If you want to relocate an application with dependencies on other application resources, you must force the relocation by using the -f option with the caa_relocate command.

Forcing a relocation makes CAA relocate resources that the specified resource depends on, as well as all ONLINE application resources that depend on the resource specified. The dependencies may be indirect: one resource may depend on another through one or more intermediate resources.

To relocate a single application resource and its dependent application resources to member2, enter the following command:

# caa_relocate resource_name -f -c member2

Use the caa_stat command to verify that the application resources were successfully relocated.

8.3    Starting and Stopping Application Resources

The following sections describe how to start and stop CAA application resources.

Note

Always use caa_start and caa_stop or the SysMan equivalents to start and stop applications that CAA manages. Never start or stop the applications manually after they are registered with CAA.

8.3.1    Starting Application Resources

To start an application resource, use the caa_start command followed by the name of the application resource to be started. To stop an application resource, use the caa_stop command followed by the name of the application resource to be stopped. A resource must be registered using caa_register before it can be started.

Immediately after the caa_start command is executed, the target is set to ONLINE. CAA always attempts to match the state to the target, so the CAA subsystem starts the application. The target states of any required application resources are set to ONLINE as well, and the CAA subsystem attempts to start them.

To start a resource named clock on the cluster member determined by the resource's placement policy, enter the following command:

# /usr/sbin/caa_start clock

An example of the output of the previous command follows:

Attempting to start `clock` on member `polishham` 
Start of `clock` on member `polishham` succeeded.

The command will wait up to the SCRIPT_TIMEOUT value to receive notification of success or failure from the action script each time the action script is called.

To start clock on a specific cluster member, assuming that the placement policy allows it, enter the following command:

# /usr/sbin/caa_start clock -c member_name

If the specified member is not available, the resource will not start.

If required resources are not available and cannot be started on the specified member, caa_start fails. You will instead see a response that the application resource could not be started because of dependencies.

To force a specific application resource and all its required application resources to start or relocate to the same cluster member, enter the following command:

# /usr/sbin/caa_start -f clock

See caa_start(8) for more information.

8.3.2    Stopping Application Resources

To stop highly available applications, use the caa_stop command. As noted earlier, never use the kill command or other methods to stop a resource that is under the control of the CAA subsystem.

Immediately after the caa_stop command is executed, the target is set to OFFLINE. CAA always attempts to match the state to the target, so the CAA subsystem stops the application.

The command in the following example stops the clock resource:

# /usr/sbin/caa_stop clock

If other application resources have dependencies on the application resource that is specified, the previous command will not stop the application. You will instead see a response that the application resource could not be stopped because of dependencies. To force the application to stop the specified resource and all the other resources that depend on it, enter the following command:

# /usr/sbin/caa_stop -f clock

See caa_stop(8) for more information.

8.3.3    No Multiple Instances of an Application Resource

If multiple start and/or stop operations on the same application resource are initiated simultaneously, either on separate members or on a single member, it is uncertain which operation will prevail. However, multiple start operations do not result in multiple instances of an application resource.

8.3.4    Using caa_stop to Reset UNKNOWN State

If an application resource state is set to UNKNOWN, first try to run caa_stop. If it does not reset the resource to OFFLINE, use the caa_stop -f command. The command will ignore any errors returned by the stop script, set the resource to OFFLINE, and set all applications that depend on the application resource to OFFLINE as well.
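The escalation described above (try caa_stop first, then caa_stop -f) can be sketched as a function. The names resource_state and reset_unknown are hypothetical, and the sketch assumes that caa_stat prints a STATE= line in the format shown in Section 8.1.1.

```sh
# Sketch: reset a resource from UNKNOWN to OFFLINE.
# Assumption: "caa_stat resource" prints a line such as
# "STATE=UNKNOWN" or "STATE=ONLINE on member".
resource_state() {
    caa_stat "$1" | awk -F= '/^STATE=/ { split($2, w, " "); print w[1]; exit }'
}

reset_unknown() {
    # Nothing to do unless the resource is in the UNKNOWN state.
    [ "$(resource_state "$1")" = UNKNOWN ] || return 0
    # First attempt: a normal stop.
    caa_stop "$1"
    [ "$(resource_state "$1")" = OFFLINE ] && return 0
    # Second attempt: force the stop, ignoring stop script errors.
    caa_stop -f "$1"
    [ "$(resource_state "$1")" = OFFLINE ]
}
```

The function deliberately tries the unforced stop first, because caa_stop -f also sets all applications that depend on the resource to OFFLINE.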

Before you attempt to restart the application resource, look at the stop entry point of the action script to be sure that it successfully stops the application and returns 0. Also make sure that it returns 0 if the application is not currently running.

8.4    Registering and Unregistering Resources

A resource must be registered with the CAA subsystem before CAA can manage that resource. This task needs to be performed only once for each resource.

Before a resource can be registered, a valid resource profile for the resource must exist in the /var/cluster/caa/profile directory. The TruCluster Server Cluster Highly Available Applications manual describes the process for creating resource profiles.

To learn which resources are registered on the cluster, enter the following caa_stat command:

# /usr/sbin/caa_stat

8.4.1    Registering Resources

Use the caa_register command to register an application resource as follows:

# caa_register resource_name

For example, to register an application resource named dtcalc, enter the following command:

# /usr/sbin/caa_register dtcalc

If an application resource has resource dependencies defined in the REQUIRED_RESOURCES attribute of the profile, all resources listed for this attribute must be registered first.

For more information, see caa_register(8).

8.4.2    Unregistering Resources

You might want to unregister a resource to remove it from being monitored by the CAA subsystem. To unregister an application resource, you must first stop it, which changes the state of the resource to OFFLINE. See Section 8.3.2 for instructions on how to stop an application.

To unregister a resource, use the caa_unregister command. For example, to unregister the resource dtcalc, enter the following command:

# /usr/sbin/caa_unregister dtcalc

For more information, see caa_unregister(8).

For information on registering or unregistering a resource with the SysMan Menu, see the SysMan online help.

8.4.3    Updating Registration

You may need to update the registration of an application resource if you have modified its profile. For a detailed discussion of resource profiles see the Cluster Highly Available Applications manual.

To update the registration of a resource, use the caa_register -u command. For example, to update the resource dtcalc, enter the following command:

# /usr/sbin/caa_register -u dtcalc

Note

The caa_register -u command and the SysMan Menu allow you to update the REQUIRED_RESOURCES field in the profile of an ONLINE resource with the name of a resource that is OFFLINE. Doing so leaves the running system out of sync with the profiles. If you do this, you must manually start the required resource or stop the updated resource.

Similarly, a change to the HOSTING_MEMBERS list value of the profile affects only future relocations and starts. If you update the HOSTING_MEMBERS list in the profile of an ONLINE application resource with a restricted placement policy, make sure that the application is running on one of the cluster members in that list. If the application is not running on one of the allowed members, run caa_relocate on the application after running the caa_register -u command.

8.5    Network, Tape, and Media Changer Resources

Only application resources can be stopped using caa_stop. However, nonapplication resources can be restarted using caa_start if they have had more failures than the resource failure threshold within the failure interval. Starting a nonapplication resource resets its TARGET value to ONLINE. This causes any applications that are dependent on this resource to start as well.

Network, tape, and media changer resources may fail repeatedly due to hardware problems. If this happens, do not allow CAA on the failing cluster member to use the device and, if possible, relocate or stop application resources. Exceeding the failure threshold within the failure interval causes the resource for the device to be disabled. If a resource is disabled, the TARGET state for the resource on a particular cluster member is set equal to OFFLINE, as shown with caa_stat resource_name. For example:

# /usr/sbin/caa_stat network1

NAME=network1
TYPE=network
TARGET=OFFLINE on provolone
TARGET=ONLINE on polishham
STATE=ONLINE on provolone
STATE=ONLINE on polishham

If a network, tape, or media changer resource has the TARGET state set to OFFLINE because the failure count exceeds the failure threshold within the failure interval, the STATE for all resources that depend on that resource becomes OFFLINE, though their TARGET remains ONLINE. These dependent applications will relocate to another machine where the resource is ONLINE. If no cluster member is available with this resource ONLINE, the applications remain OFFLINE until both the STATE and TARGET are ONLINE for the resource on the current member.

You can reset the TARGET state for a nonapplication resource to ONLINE by using the caa_start (for all members) or caa_start -c cluster_member command (for a particular member). The failure count is reset to zero (0) when this is done.

If the TARGET value is set to OFFLINE by a failure count that exceeds the failure threshold, the resource is treated as if it were OFFLINE by CAA, even though the STATE value may be ONLINE.
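After the underlying hardware problem has been corrected, the members on which a nonapplication resource is disabled can be identified from the TARGET lines and reset one at a time. The following is a sketch only: the names disabled_members and reenable_resource are hypothetical, and it assumes the per-member "TARGET=OFFLINE on member" line format shown in the caa_stat network1 example.

```sh
# Sketch: find members where a nonapplication resource is disabled,
# then reset its TARGET to ONLINE on each of them.
# Assumption: caa_stat prints lines such as "TARGET=OFFLINE on provolone".
disabled_members() {
    # Reads caa_stat output on standard input; prints member names.
    awk '/^TARGET=OFFLINE on / { print $3 }'
}

reenable_resource() {
    res=$1
    for member in $(caa_stat "$res" | disabled_members); do
        caa_start "$res" -c "$member" || return 1
    done
}
```

Run a function like reenable_resource only after the device has been repaired, because resetting the TARGET also causes applications that depend on the resource to start.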

Note

If a tape or media changer resource is reconnected to a cluster after the device was removed while the cluster was running, or after a physical failure, the cluster does not automatically detect the reconnection of the device. You must run the drdmgr -a DRD_CHECK_PATH device_name command.

8.6    Using SysMan to Manage CAA

This section describes how to use the SysMan suite of tools to manage CAA. For a general discussion of invoking SysMan and using it in a cluster, see Chapter 2.

8.6.1    Managing CAA with SysMan Menu

The Cluster Application Availability (CAA) Management branch of the SysMan Menu is located under the TruCluster Specific heading as shown in Figure 8-1. You can open the CAA Management dialog box by either selecting Cluster Application Availability (CAA) Management on the menu and clicking on the Select button, or by double-clicking on the text.

Figure 8-1:  CAA Branch of SysMan Menu

8.6.1.1    CAA Management Dialog Box

The CAA Management dialog box (Figure 8-2) allows you to start, stop, and relocate applications. If you start or relocate an application, a dialog box prompts you to decide placement for the application.

You can also open the Setup dialog box to create, modify, register, and unregister resources.

Figure 8-2:  CAA Management Dialog Box

8.6.1.2    Start Dialog Box

The Start dialog box (Figure 8-3) allows you to choose whether you want the application resource to be placed according to its placement policy or explicitly on another member.

You can place an application on a member explicitly only if it is allowed by the hosting member list. If the placement policy is restricted, and you try to place the application on a member that is not included in the hosting members list, the start attempt will fail.

Figure 8-3:  Start Dialog Box

8.6.1.3    Setup Dialog Box

To add, modify, register, and unregister profiles of any type, use the Setup dialog box, as shown in Figure 8-4. This dialog box can be reached from the Setup... button on the CAA Management dialog box. For details on setting up resources with SysMan Menu, see the online help.

Figure 8-4:  Setup Dialog Box

8.6.2    Managing CAA with SysMan Station

The SysMan Station can be used to manage CAA resources. Figure 8-5 shows the SysMan Station CAA_Applications_(active) View. Figure 8-6 shows the SysMan Station CAA_Applications_(all) View. Select one of these views using the View menu at the top of the window. Selecting a cluster icon or cluster member icon makes the whole SysMan Menu available under the Tools menu, including CAA-specific tasks.

The icons for the application resources represent the resource state. In these two figures App1 and App2 are currently offline and cluster_lockd is online.

Figure 8-5:  SysMan Station CAA_Applications_(active) View

Figure 8-6:  SysMan Station CAA_Applications_(all) View

8.6.2.1    Starting an Application with SysMan Station

To start applications in either the CAA_Applications_(active) view (Figure 8-5) or the CAA_Applications_(all) View (Figure 8-6), select the application name under the cluster icon, click the right mouse button or click on the Tools Menu and select CAA Management ==> Start Application.

8.6.2.2    Resource Setup with SysMan Station

To set up resources using SysMan Station, select either the cluster icon or a cluster member icon. Click the right mouse button or click on the Tools menu, and select CAA Management ==> CAA Setup. See Figure 8-7. The rest of the steps are the same as for SysMan Menu and are described in detail in the Tasks section of the online help.

Figure 8-7:  SysMan Station CAA Setup Screen

8.7    CAA Considerations for Startup and Shutdown

The CAA daemon needs to read the information for every resource from the database. Because of this, if there are a large number of resources registered, your cluster members might take a long time to boot.

CAA may display the following message during a member boot:

Cannot communicate with the CAA daemon.

This message may or may not be preceded by the message:

Error: could not start up CAA Applications
Cannot communicate with the CAA daemon.

These messages indicate that you did not register the TruCluster Server license. When the member finishes booting, enter the following command:

# lmf list

If the TCS-UA license is not active, register it as described in the Cluster Installation guide and start the CAA daemon (caad) as follows:

# /usr/sbin/caad

When you shut down a cluster, CAA notes for each application resource whether it is ONLINE or OFFLINE. On restart of the cluster, applications that were ONLINE are restarted. Applications that were OFFLINE are not restarted. Applications that were marked as UNKNOWN are considered to be stopped. If an application was stopped because of an issue that the cluster reboot resolves, use the caa_start command to start the application.

If you want to choose placement of applications before shutting down a cluster member, determine the state of resources and relocate any applications from the member to be shut down to another member. Reasons for relocating applications are listed in Section 8.2.

Applications that are currently running when the cluster is shut down will be restarted when the cluster is reformed. Any applications that have AUTO_START set to 1 will also start when the cluster is reformed.

8.8    Managing caad

You should not have to manage the CAA daemon (caad). The CAA daemon is started at boot time and stopped at shutdown on every cluster member. However, if there are problems with the daemon, you may need to intervene.

If one of the commands caa_stat, caa_start, caa_stop, or caa_relocate responds with "Cannot communicate with the CAA daemon!", the caad daemon is probably not running. To determine whether the daemon is running, see Section 8.8.1.

8.8.1    Determining Status of the Local CAA Daemon

To determine the status of the CAA daemon, enter the following command:

# ps ax | grep -v grep | grep caad

If caad is running, output similar to the following is displayed:

545317 ??       S        0:00.38 caad

If nothing is displayed, caad is not running.
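For use in scripts, the same check can be written so that it returns an exit status instead of printing a process line. The following sketch assumes the ps ax column layout shown above, with the command name in the fifth column. The function name caad_running is hypothetical.

```sh
# Sketch: return 0 if a caad process is listed by "ps ax", 1 otherwise.
# Assumption: the command name is the fifth column, as in the sample
# output. Matching on that column means the filter never matches itself,
# so no "grep -v grep" step is needed.
caad_running() {
    ps ax | awk '$5 ~ /(^|\/)caad$/ { found = 1 } END { exit !found }'
}
```

A script could then test the daemon with a construct such as: if caad_running; then echo "caad is running"; else echo "caad is not running"; fi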

You can determine the status of other caad daemons by logging in to the other cluster members and running the ps ax | grep -v grep | grep caad command.

If the caad daemon is not running, CAA is no longer managing the application resources that were started on that machine. You cannot use caa_stop to stop the applications. After the daemon is restarted as described in Section 8.8.2, the resources on that machine should be fully manageable by CAA.

8.8.2    Restarting the CAA Daemon

If the caad daemon dies on one cluster member, all application resources continue to run, but you can no longer manage them with the CAA subsystem. You can restart the daemon by entering the /usr/sbin/caad command.

Do not use the startup script /sbin/init.d/clu_caa to restart the CAA daemon. Use this script only to start caad when a cluster member is booting up.

8.8.3    Monitoring CAA Daemon Messages

You can view information about changes to the state of resources by looking at events that are posted to EVM by the CAA daemon. For details on EVM messages, see Section 8.9.

8.9    Using EVM to View CAA Events

CAA posts events to Event Manager (EVM). These may be useful in troubleshooting errors that occur in the CAA subsystem.

Note

Some CAA actions are logged via syslog to /var/cluster/members/{member}/adm/syslog.dated/[date]/daemon.log. When trying to identify problems, it may be useful to look in both the daemon.log and EVM for information. EVM has the advantage of being a single source of information for the whole cluster while daemon.log information is specific to each member. Some information is available only in the daemon.log files.

You can access EVM events either by using the SysMan Station or the EVM commands at the command line. For detailed information on how to use SysMan Station, see the Tru64 UNIX System Administration manual. See the online help for information on how to perform specific tasks.

Many events that CAA generates are defined in the EVM configuration file, /usr/share/evm/templates/clu/caa/caa.evt. These events all have a name in the form of sys.unix.clu.caa.*.

CAA also creates some events that have the name sys.unix.syslog.daemon. Events posted by other daemons are also posted with this name, so there will be more than just CAA events listed.

For detailed information on how to get information from the EVM Event Management System, see EVM(5), evmget(1), or evmshow(1).

8.9.1    Viewing CAA Events

To view events related to CAA that have been sent to EVM, enter the following command:

# evmget -f "[name *.caa.*]" | evmshow
CAA cluster_lockd was registered
CAA cluster_lockd is transitioning from state ONLINE to state OFFLINE
CAA resource sbtest action script /var/cluster/caa/script/foo.scr (start): success
CAA Test2002_Scale6 was registered
CAA Test2002_Scale6 was unregistered

To get more verbose event detail from EVM, use the -d option as follows:

# evmget -f '[name *.caa.*]' | evmshow -d | more
============================ EVM Log event ===========================
EVM event name: sys.unix.clu.caa.app.registered
 
    This event is posted by the Cluster Application Availability
    subsystem (CAA) when a new application has been registered.
 
======================================================================
 
Formatted Message:
    CAA a was registered
 
Event Data Items:
    Event Name        : sys.unix.clu.caa.app.registered
    Cluster Event     : True
    Priority          : 300
    PID               : 1109815
    PPID              : 1103504
    Event Id          : 4578
    Member Id         : 2
    Timestamp         : 18-Apr-2001 16:56:17
    Cluster IP address: 16.69.225.123
    Host Name         : provolone.zk4.dec.com
    Cluster Name      : deli
    User Name         : root
    Format            : CAA $application was registered
    Reference         : cat:evmexp_caa.cat
 
Variable Items:
    application (STRING) = "a"
 
======================================================================   

The template script /var/cluster/caa/template/template.scr has been updated to create scripts that post events to EVM when CAA attempts to start, stop, or check applications. Any action scripts that are newly created with caa_profile or SysMan post events to EVM. To view only these events, enter the following command:

# evmget -f "[name sys.unix.clu.caa.action_script]" | evmshow -t "@timestamp  @@"

CAA events can also be viewed by using SysMan Station. Click on the Status Light or Label Box for Applications in the SysMan Station Monitor Window.

To view other events that are logged by the caad daemon, as well as other daemons, enter the following command:

# evmget -f "[name sys.unix.syslog.daemon]" | \
evmshow -t "@timestamp  @@"

8.9.2    Monitoring CAA Events

To monitor CAA events with time stamps on the console, enter the following command:

# evmwatch -f "[name *.caa.*]" | evmshow -t "@timestamp  @@"

As events that are related to CAA are posted to EVM, they are displayed on the terminal where this command is executed. An example of the messages is as follows:

CAA cluster_lockd was registered
CAA cluster_lockd is transitioning from state ONLINE to state OFFLINE
CAA Test2002_Scale6 was registered
CAA Test2002_Scale6 was unregistered
CAA xclock is transitioning from state ONLINE to state OFFLINE
CAA xclock had an error, and is no longer running 
CAA cluster_lockd is transitioning from state ONLINE to state OFFLINE
CAA cluster_lockd started on member polishham

To monitor other events that are logged by the CAA daemon using the syslog facility, enter the following command:

# evmwatch -f "[name sys.unix.syslog.daemon]" | evmshow | grep CAA

8.10    Troubleshooting with Events

The error messages in this section may be displayed when showing events from the CAA daemon by entering the following command:

# evmget -f "[name sys.unix.syslog.daemon]" | evmshow | grep CAA

Action Script Has Timed Out

CAAD[564686]: RTD #0: Action Script \
/var/cluster/caa/script/[script_name].scr(start) timed out! (timeout=60)

First determine that the action script correctly starts the application by running /var/cluster/caa/script/[script_name].scr start. If the action script runs correctly and returns with no errors, but takes longer to execute than the SCRIPT_TIMEOUT value, increase the SCRIPT_TIMEOUT value. If an application that is executed in the script takes a long time to finish, you may want to run the task in the background by adding an ampersand (&) to the line in the script that starts the application. However, this causes the command to always return a status of 0, and CAA then has no way of detecting a command that failed to start for some trivial reason, such as a misspelled command path.

Action Script Stop Entry Point Not Returning 0

CAAD[524894]: `foo` on member `provolone` has experienced an unrecoverable failure.
 

This message occurs when a stop entry point returns a value other than 0. The resource is put into the UNKNOWN state. The application must be stopped by correcting the stop action script to return 0 and running caa_stop or caa_stop -f. In either case, fix the stop action script to return 0 before you attempt to restart the application resource.
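The following sketch shows one way a stop entry point might be written so that it returns 0 both when it stops the application and when the application is not running. It is an illustration, not the contents of the CAA template script: the function name stop_entry, the PID file path, and the 10-second wait are all hypothetical, and the sketch assumes that the application records its process ID in a file.

```sh
# Sketch of a stop entry point that returns 0 when the application
# is stopped or was not running, and 1 only if it cannot be stopped.
# Assumption: the application writes its PID to APP_PIDFILE
# (a hypothetical path, used here only for illustration).
APP_PIDFILE=${APP_PIDFILE:-/var/run/myapp.pid}

stop_entry() {
    # No PID file: the application is not running, so report success.
    [ -f "$APP_PIDFILE" ] || return 0
    pid=$(cat "$APP_PIDFILE")
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid"
        tries=0
        # Wait up to about 10 seconds for the process to exit.
        while kill -0 "$pid" 2>/dev/null; do
            tries=$((tries + 1))
            [ "$tries" -gt 10 ] && return 1
            sleep 1
        done
    fi
    # Either the process exited or the PID was stale; clean up.
    rm -f "$APP_PIDFILE"
    return 0
}
```

An action script would call a function like this from its stop case and exit with the function's return value. The wait should stay comfortably below the SCRIPT_TIMEOUT value for the resource.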

Network Failure

CAAD[524764]: `tu0` has gone offline on member `skiing`

A message like this for network resource tu0 indicates that the network has gone down. Make sure that the network card is connected correctly. Replace the card, if necessary.

Lock Preventing Start of CAA Daemon

CAAD[526369]: CAAD exiting; Another caad may be running, could not obtain \
lock file /var/cluster/caa/locks/.lock-provolone.dec.com

A message similar to this is displayed when attempting to start a second caad. Determine whether caad is running as described in Section 8.8.1. If there is no daemon running, remove the lock file that is listed in the message and restart caad as described in Section 8.8.2.

8.11    Troubleshooting a Command-Line Message

A message like the following indicates that CAA cannot find the profile for a resource that you attempted to register:

Cannot access the resource profile file_name

For example, if there is no profile for clock, an attempt to register clock fails as follows:

# caa_register clock
Cannot access the resource profile '/var/cluster/caa/profile/clock.cap'.

The resource profile is either not in the right location or does not exist. You must make sure that the profile exists in the location that is cited in the message.