Chapter 8 Debugging Applications in the Foundation Services
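For information about how to report and check errors caused
by applications and how to debug applications, see the following sections:
Reporting Application Errors
Reading Error Information for Debugging
Stopping the Daemon Monitor for Debugging
Broken Pipe Error Messages
Return Values of the CMM API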
For debugging purposes, configure remote IP access to all nodes in the
cluster. For more information, see "Cluster Addressing and Networking" in Netra High Availability Suite Foundation Services 2.1 6/03 Overview.
You can use standard Solaris operating system commands in the Foundation Services environment.
To debug applications that interact with the Foundation Services nodes, use the
debugging software provided with the Forte Developer 6 Software Suite.
Reporting Application Errors
Configure applications to report errors and their causes. You can use this
information during troubleshooting to reduce the risk that similar errors
recur. To facilitate recovery from an error, you can provide the following
information:
The return value of the function call that returned the error
The context in which the error occurred
An indication of the severity of the error
The standard return values for CMM API errors are summarized in Table 8-1.
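The three items above can be carried together in one structured report. The following is a minimal sketch, not part of the Foundation Services: the `Severity` scale, the `ErrorReport` class, and the `to_log_line` method are hypothetical names chosen for illustration, and the return value is held as a string rather than as the real API constant.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; not defined by the Foundation Services."""
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class ErrorReport:
    function: str       # function call that returned the error
    return_value: str   # symbolic return value, for example "CMM_EBUSY"
    context: str        # what the application was doing at the time
    severity: Severity  # indication of how serious the error is

    def to_log_line(self) -> str:
        """Render the report as a single line suitable for a log file."""
        return (f"{self.severity.name}: {self.function} returned "
                f"{self.return_value} while {self.context}")


report = ErrorReport(
    function="cmm_cmc_unregister",
    return_value="CMM_EBUSY",
    context="shutting down the notification handler",
    severity=Severity.WARNING,
)
print(report.to_log_line())
```

Keeping the fields separate until the report is formatted makes it possible to filter or count errors by return value or severity later.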
Reading Error Information for Debugging
In the Foundation Services, standard error and alert messages are sent to
system log files. In error scenarios, you can refer to the system log files
to determine the history of a process. Critical errors are written on the
console in addition to being logged in the system log files.
Errors can cause notifications to be sent, but notifications are events,
not errors in themselves. For information on notifications,
see Chapter 6, Understanding Change Notifications.
The NMA enables you to receive information on notifications. Statistics
are available to diagnose the cause of errors received. See the Netra High Availability Suite Foundation Services 2.1 6/03 NMA Programming Guide.
For information about using and configuring system log files,
see the Netra High Availability Suite Foundation Services 2.1 6/03 Cluster Administration Guide.
Stopping the Daemon Monitor for Debugging
You cannot debug critical services, such as the CMM
or Reliable NFS, on a running cluster. Debugging would interrupt the regular
messages that these services send between nodes. Debugging tools, such as
the truss command, cannot be used on daemons while they
are being monitored by the Daemon Monitor.
Before debugging a Foundation Services daemon or a monitored Solaris daemon, stop
the Daemon Monitor from monitoring the daemon that you want to debug. When
you have finished debugging, restart the Daemon Monitor.
For information about how to stop and restart the Daemon Monitor, see
the Netra High Availability Suite Foundation Services 2.1 6/03 Cluster Administration Guide.
For a list of monitored daemons, see the nhpmd(1M) man page.
Broken Pipe Error Messages
If one of the applications that you are running
on your cluster terminates suddenly, the CMM notification pipes that this
application opened remain open on the nhcmmd side. You can be left with
a broken pipe from the CMM to the dead application. If the CMM later sends
a notification to this dead application, the CMM detects that the application
is dead and closes the broken pipe. Alternatively, the CMM frequently checks
to see if a client application is dead and, if necessary, closes the associated
pipes.
If many of your applications die suddenly, without notifying the CMM,
the following can happen:
Many pipes are broken.
Unless the CMM has a notification to emit, neither the dead
applications nor the broken pipes are identified by the CMM.
Each broken pipe is associated with a file descriptor. As the
number of open file descriptors increases, the CMM can run short of
file descriptors and become saturated.
If one of your applications has died suddenly, you receive
a system log message such as this:
# Dec 23 09:56:07 machine_name CMM[839]: S-CMM notif to /var/run/CMM_884 fails: Broken pipe
The CMM detects the problem and closes the notification pipe. For further
information on accessing system log files, see "Accessing and Maintaining System Log
Messages" in the Netra High Availability Suite Foundation Services 2.1 6/03 Cluster Administration Guide and the syslog.conf(4)
Solaris man page.
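To spot repeated broken pipes, you can scan a system log file for this message. The following is a minimal sketch that assumes each message occupies a single line and follows the layout of the sample above. The pattern is modelled only on that one sample, and the `find_broken_pipes` function is a hypothetical helper, not a Foundation Services tool.

```python
import re

# Pattern modelled only on the sample message shown in this section;
# other releases or configurations may format the message differently.
BROKEN_PIPE = re.compile(
    r"CMM\[(?P<pid>\d+)\]: .*notif to (?P<pipe>\S+) fails: Broken pipe"
)


def find_broken_pipes(lines):
    """Return (logging_pid, pipe_path) for each broken-pipe message found."""
    hits = []
    for line in lines:
        match = BROKEN_PIPE.search(line)
        if match:
            hits.append((int(match.group("pid")), match.group("pipe")))
    return hits


sample = [
    "Dec 23 09:56:07 machine_name CMM[839]: S-CMM notif to "
    "/var/run/CMM_884 fails: Broken pipe",
    "Dec 23 09:56:08 machine_name CMM[839]: unrelated message",
]
print(find_broken_pipes(sample))
```

In practice, you would pass the lines of the log file that your syslog.conf(4) configuration writes to, instead of the `sample` list.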
Return Values of the CMM API
The CMM API provides
extensive return values for errors and successful function calls. They are
listed in Table 8-1.
Table 8-1 Common Return Values of the CMM API
Return Value | Result | Possible Responses |
CMM_OK | The function call succeeded. | None required. |
CMM_EAGAIN | Returned information is based on a cluster view that has not been updated
by the master node for more than 10 seconds. | Retry the function call. |
CMM_EBADF | An identifier or descriptor
that corresponds to a file descriptor is invalid. The connection to the CMM
is no longer valid. The CMM might be dead. | Verify that data in your program is not corrupted. Call the cmm_cmc_register() and the cmm_notify_getfd() functions to obtain
a new connection. |
CMM_EBUSY | For all functions: The CMM API server is temporarily
out of resources to respond to the requested operation.
For cmm_cmc_unregister(): An attempt to unregister a callback, that is,
a call to the cmm_cmc_unregister() function, failed because
the caller's callback function is active.
See the cmm_cmc_unregister(3CMM) man page. | Wait, then retry the function call. Choose
the length of the wait based on the application's characteristics. |
CMM_ECANCELED | A switchover operation was cancelled. For
example, an attempt to demote the master is cancelled when no vice-master
can take over the master role. | Continue. |
CMM_ECONN | The
local CMM API process is unreachable. | Check that the process is currently running. Perhaps it is not running yet.
Retry the function call. |
CMM_EEXIST | Only one callback function can be registered at a time. A call to the cmm_cmc_register() function when a callback is already registered
returns this value. | The calling
process has already registered a callback. Verify that the existing function
is required for the purpose of your program. |
CMM_EINVAL | A function
parameter has an invalid value. For example, the nodeid is not a
master-eligible node. | Ensure that the type of each parameter matches the type in the function prototype.
Cast variables to the expected type if necessary and verify
that the area of memory that stores the parameter is valid. |
CMM_ENOCLUSTER | One of the following has occurred:
The local node is not configured in an active cluster. This
occurs, for example, when the cluster election is in progress.
The local node has been removed from the cluster
node table on the master node. For more information, see the cluster_nodes_table(4)
man page.
There is more than one master node.
The master node has been disqualified and no vice-master
node has taken over the master role.
A failover has been triggered by the disqualification of
the master node. During the failover, there is a brief time when there is
no master node. The CMM_ENOCLUSTER error was returned
during this time.
| Any combination
of the following: |
CMM_ENOENT | An attempted operation on
an item failed because the item does not exist. For example, when calling
the cmm_cmc_unregister() function, no callback has been
registered. Not critical. | Any combination of the following:
Verify that the area of memory that stores the item is valid.
If you want to delete the item, continue.
|
CMM_ENOMSG | An attempt
to dispatch an event failed because there are no events to be dispatched. | Continue. |
CMM_ENOTSUP | The operation could not be correctly executed. This error can be the result
of a system problem such as a file that cannot be created or a problem with
Remote Procedure Call (RPC) services. | Examine the system log files. |
CMM_EPERM | The function was called on a node other than the master node, but it can
execute only on the master node. For more information, see the cmm_mastership_release(3CMM), cmm_member_setqualif(3CMM), and cmm_member_seizequalif(3CMM) man pages. | Execute the function only on the master node. |
CMM_ERANGE | The
number of cells in the table is smaller than the number of nodes in the cluster.
Returned by the cmm_member_getall() function. See the cmm_member_getall(3CMM) man page. | Add an entry in the table for each potential peer node. |
CMM_ESRCH | Using the cmm_member_getinfo() function
to obtain information about a node that is either not in the local cluster
node table, or is in the local cluster node table but currently has the CMM_OUT_OF_CLUSTER role.
Using the cmm_potential_getinfo() function
to obtain information about a node that is not in the local cluster node table.
Using the cmm_vicemaster_getinfo() function
while the cluster has no vice-master.
| Any combination of the following:
Examine why the master-eligible node is down or isolated.
Add an entry for this node to the cluster node table. See
the cluster_nodes_table(4) man page.
Change the node's role to master or vice-master.
|
CMM_ETIMEDOUT | No response was received before the delay expired,
even though the operation was retried. The function call
timed out. | Any combination
of the following: |
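Several rows of Table 8-1 suggest retrying the function call. The following is a minimal sketch of a retry wrapper built on that advice. It is an illustration only: the return values are represented as strings rather than the real API constants, `call_with_retry` and its parameters are hypothetical, and the stand-in call does not contact the CMM.

```python
import time

# Return values for which Table 8-1 lists a retry as a possible response.
RETRYABLE = {"CMM_EAGAIN", "CMM_EBUSY", "CMM_ECONN"}


def call_with_retry(call, max_attempts=3, wait_seconds=0.0, sleep=time.sleep):
    """Invoke call() until it returns a non-retryable value or attempts run out.

    Returns the last return value together with the number of attempts made.
    """
    result = None
    for attempt in range(1, max_attempts + 1):
        result = call()
        if result not in RETRYABLE:
            return result, attempt
        if attempt < max_attempts:
            sleep(wait_seconds)  # length of the wait is up to the application
    return result, max_attempts


# Stand-in for a real API call: busy twice, then successful.
responses = iter(["CMM_EBUSY", "CMM_EBUSY", "CMM_OK"])
outcome = call_with_retry(lambda: next(responses))
print(outcome)
```

Values outside `RETRYABLE`, such as CMM_EPERM or CMM_ENOTSUP, are returned to the caller on the first attempt, because the table suggests a different response for them.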