Sun Microsystems
Chapter 7

Recovering From Node Reboot at Runtime

For information about the causes of node reboot at runtime, see A Monitored Daemon Fails Causing a Node to Reboot at Runtime.

A Monitored Daemon Fails Causing a Node to Reboot at Runtime

When a monitored daemon fails, the Daemon Monitor triggers a recovery response. The recovery response is often to restart the failed daemon. If the daemon fails to restart correctly, the Daemon Monitor reboots the node. The failure of a monitored daemon is the most common cause of a node reboot.

If the system recovers correctly, the core file and the error message produced by the failed daemon might be the only evidence of the failure. Take the failure seriously even though the system has recovered.

For a list of recovery responses made by the Daemon Monitor, see the nhpmd(1M) man page. For a summary of the causes of daemon failure during startup, see A Monitored Daemon Fails Causing a Master-Eligible Node to Reboot at Startup and A Monitored Daemon Fails Causing a Diskless Node or Dataless Node to Reboot at Startup.

Table 7-1 and Table 7-2 summarize the events that can cause a monitored daemon to fail at runtime. To recover from daemon failure, perform the procedure in To Recover From Daemon Failure.

Table 7-1 Possible Causes of Daemon Failure at Runtime

Failed Daemon    Possible Cause at Runtime
-------------    -------------------------
nhcmmd           The nhcmmd daemon was killed.
                 The failing node does not see its presence in the cluster_nodes_table.
nhprobed         The nhprobed daemon was killed.
nhwdtd           The operating system has hung.
                 The system is overloaded.

Table 7-2 Causes of Daemon Failure on Master-Eligible Nodes During Failover or Switchover

Failed Daemon    Possible Cause During Failover or Switchover
-------------    --------------------------------------------
nhcrfsd          The nhcrfsd daemon was killed during the failover or switchover.
nhcmmd           The node cannot connect to the nhprobed daemon.
nhprobed         The node cannot create the required threads, sockets, or pipe.
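
For diagnosis scripts, the contents of Table 7-1 and Table 7-2 can be held in a small lookup structure. The following Python sketch is illustrative only and is not part of the Foundation Services software. The names RUNTIME_CAUSES, FAILOVER_CAUSES, and possible_causes are hypothetical; the cause strings are copied from the two tables.

```python
# Illustrative lookup of the causes listed in Table 7-1 and Table 7-2.
# All names in this block are hypothetical helpers, not product APIs.

# Table 7-1: possible causes of daemon failure at runtime.
RUNTIME_CAUSES = {
    "nhcmmd": [
        "The nhcmmd daemon was killed.",
        "The failing node does not see its presence in the cluster_nodes_table.",
    ],
    "nhprobed": ["The nhprobed daemon was killed."],
    "nhwdtd": [
        "The operating system has hung.",
        "The system is overloaded.",
    ],
}

# Table 7-2: causes on master-eligible nodes during failover or switchover.
FAILOVER_CAUSES = {
    "nhcrfsd": ["The nhcrfsd daemon was killed during the failover or switchover."],
    "nhcmmd": ["The node cannot connect to the nhprobed daemon."],
    "nhprobed": ["The node cannot create the required threads, sockets, or pipe."],
}


def possible_causes(daemon):
    """Return the documented causes for a daemon, grouped by table.

    A daemon that does not appear in a table gets an empty list for it.
    """
    return {
        "runtime": list(RUNTIME_CAUSES.get(daemon, [])),
        "failover_or_switchover": list(FAILOVER_CAUSES.get(daemon, [])),
    }
```

Calling possible_causes("nhcmmd") returns the two runtime causes from Table 7-1 and the single failover or switchover cause from Table 7-2, kept separate so that the reader can decide which table applies.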

Procedure: To Recover From Daemon Failure

  1. Examine the core file produced by the failed daemon.

    The core file is located in the /var/tmp/SUNWcgha/core directory, and its name has the following format: core.node_name.executable_file_name.process_ID.time

    For more information about core dumps, see the coreadm(1M) man page.

  2. Examine the system log files for an error message produced by the failed daemon.

    For example, the following error message is produced by the failure of a daemon launched by the rpc nametag:

    [ID 615790 local0.notice] "rpc" Failed to stay up.

    For information about which nametag launches which daemon, see the nhpmd(1M) man page.

  3. Identify the cause of the daemon failure.

    Use the information obtained in Step 1, Step 2, Table 7-1, and Table 7-2.

  4. Fix the underlying problem, if necessary.

  5. Confirm that the Daemon Monitor carried out its recovery response by searching the system log files for local0 information.

    • If your system log file is not configured for local0 information, reconfigure it.

      For information, see the Netra High Availability Suite Foundation Services 2.1 6/03 Cluster Administration Guide.

    • If local0 information is logged to a file, search the file for the string "nhpmd".

      Lines containing the string "nhpmd" describe the recovery response performed by the Daemon Monitor.
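
The file name and log checks in Step 1, Step 2, and Step 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a supported tool: the helper names are hypothetical, the parser assumes that neither the node name nor the executable file name contains a period, and the log functions take lines that you have already read from whichever file your system log configuration writes local0 messages to.

```python
import re
from collections import namedtuple

# Hypothetical helpers for illustration; none of these names come from the product.
CoreName = namedtuple("CoreName", "node executable pid time")

# Documented format: core.node_name.executable_file_name.process_ID.time
# Assumption: node_name and executable_file_name contain no periods.
CORE_RE = re.compile(r"^core\.([^.]+)\.([^.]+)\.(\d+)\.([^.]+)$")

# Matches messages of the form: "rpc" Failed to stay up.
FAILED_RE = re.compile(r'"([^"]+)" Failed to stay up\.')


def parse_core_name(filename):
    """Split a core file name into its fields, or return None if it does not match."""
    match = CORE_RE.match(filename)
    if match is None:
        return None
    node, executable, pid, time_field = match.groups()
    return CoreName(node, executable, int(pid), time_field)


def failed_nametags(lines):
    """Return the nametags reported as 'Failed to stay up' (Step 2)."""
    tags = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            tags.append(match.group(1))
    return tags


def recovery_lines(lines):
    """Return the log lines that contain the string 'nhpmd' (Step 5)."""
    return [line.rstrip("\n") for line in lines if "nhpmd" in line]
```

For example, passing the message shown in Step 2 to failed_nametags yields the nametag rpc, which can then be looked up in the nhpmd(1M) man page to identify the daemon.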
