PROBLEM: (87512) (PATCH ID: TCR520-045)
********
The most common cause of this problem is an attempt to reboot a cluster node before a cluster-wide shutdown (shutdown -csh) has completed. During a cluster-wide shutdown, cluster joins are disabled to ensure that the cluster shuts down completely. When a node reboots prematurely, its join request is therefore rejected. CNX was coded to "halt" under these circumstances, but the code was incorrect: it called the internal halt() routine, which halts only a single CPU. On SMP systems the other CPUs continued to execute, which eventually led to a difficult-to-diagnose system panic.

Instead of halting in this case, CNX now waits until all cluster members have halted and then attempts to form a new incarnation of the cluster. All other calls to halt() have been replaced by an SMP-correct halt mechanism.

PROBLEM: (93705, 93009, 91784, 90289, 89354, 87114) (PATCH ID: TCR520-170)
********
This patch fixes a set of problems in the connection manager that could generate one of the following panic strings or symptoms:

1. PANIC: "CNX MGR: commit_tx: invalid node state"
2. Several members get a kernel memory fault (KMF) with the cnx_csb_th thread running get_one_response().
3. PANIC: "simple_lock: minimum spl violation" in cnx_unlock_me()
4. PANIC: "CNX MGR: rcnx_node_status: stale message" in rcnx_status_V1()
5. PANIC: "CNX MGR: verify_csbs: i/o clean member in topology"
6. PANIC: "CNX MGR: Reconfig failure, no time left, leaving cluster"

PROBLEM: (93705, 93009, BCGM51FG7) (PATCH ID: TCR520-169)
********
1. Do not allow a CSB to be terminated by the cnx_node_down_th thread if the CSB is still selected (that is, CNX_CSEL is set). If the CNX_CSEL bit remains set for longer than cluster_rebuild_delay, panic, because the cnx_csb_th thread is hung.
2. Allow the cs_zombie case to be treated the same as the cs_node_down case in deal_with_unchosen().
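
The CSB termination rule in item 1 of TCR520-169 can be sketched as a small decision function. This is only an illustration under stated assumptions, not the actual connection manager code: the struct layout, the bit value of CNX_CSEL, the selected_since field, the enum, and the function name node_down_check() are all hypothetical. Only the names CNX_CSEL and cluster_rebuild_delay come from the patch description.

```c
/*
 * Illustrative sketch only. The types, the CNX_CSEL bit value, and
 * node_down_check() are assumptions made for this example; they are
 * not taken from the real CNX sources.
 */
#define CNX_CSEL 0x1u            /* "CSB is selected" flag (value assumed) */

struct csb {
    unsigned flags;              /* state bits, including CNX_CSEL */
    long     selected_since;     /* time CNX_CSEL was set, in seconds (assumed field) */
};

enum csb_action {
    CSB_TERMINATE,               /* not selected: safe to terminate */
    CSB_DEFER,                   /* still selected: leave the CSB alone for now */
    CSB_PANIC                    /* selected for too long: cnx_csb_th presumed hung */
};

/* Decide what the node-down thread may do with a CSB at time `now`. */
static enum csb_action
node_down_check(const struct csb *c, long now, long cluster_rebuild_delay)
{
    if ((c->flags & CNX_CSEL) == 0)
        return CSB_TERMINATE;

    if (now - c->selected_since > cluster_rebuild_delay)
        return CSB_PANIC;

    return CSB_DEFER;
}
```

In this sketch a caller would terminate the CSB only on CSB_TERMINATE, retry later on CSB_DEFER, and panic on CSB_PANIC. Returning an action instead of panicking inside the function keeps the rule easy to exercise outside a kernel.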