PROBLEM: (TKTR72133) (Patch ID: TCR100-003)
********
This patch fixes a problem in which a cluster would hang if the Memory
Channel Hub was turned off. The hang occurred because some systems were
further along in the failover sequence than others. The patch fixes the
problem by including a check for FAILOVER_DONE as well as FAILOVER_OK.

If a hung system is crashed and a core dump is created, a possible stack
trace would look like:

   0 alpha_delay(0xfffffc00004e5e40, 0x20, 0xfffffc00004e9154, 0x6, 0...
   1 microdelay(0xfffffc00004e9154, 0x6, 0xfffffc0007d7c000, 0x6, 0xf...
   2 rmerror_failover(0x1, 0x2, 0x1406, 0x6, 0x8300000000) ["../../.....
   3 rmerror_state_change(0xfffffc0000000000, 0x1406, 0xfffffc0000763...
   4 rmspurISR(0xfffffc0007902640, 0x729d000000000, 0xfffffc0007f8521...
   5 intr_dispatch_post(0xfffffc0007fc1340, 0x0, 0xffffffff87d1c000, ...
   6 _XentInt(0x0, 0xfffffc0000477014, 0xfffffc0000763da0, 0x3fff, 0x...
   7 idle_thread() ["../../../../src/kernel/kern/sched_prim.c":3021, ...

The /var/adm/syslog.dated/.../kern.log file may contain entries similar
to:

   ..date/host.. vmunix: rmerror_state_change: unit = 0 Err_reg = 0x1406 node = 6
   ..date/host.. vmunix: rmerror_failover: Requesting node = 2
   ..date/host.. vmunix: rmerror_failover: not every node can failover

PROBLEM: (Patch ID: TCR100-005)
********
This patch fixes a problem where remote sync pages that are allocated in
process space are never released on process exit, eventually requiring a
reboot.

PROBLEM: (QAR 52044) (Patch ID: TCR100-011)
********
This patch is a software workaround for a hardware problem with CCMAA-AA
MEMORY CHANNEL adapters that may cause a cluster to hang or panic during
node transitions. (Clusters using only CCMAA-BA MEMORY CHANNEL adapters
do not exhibit this problem.)

A CCMAA-AA MEMORY CHANNEL adapter can intermittently fail to assert
interrupts, for instance when a node tries to join a cluster. When this
occurs, the cluster may hang or generate a "simple lock timeout" panic.
Halting and crashing a hung cluster will show the following. (One node
will already have had a "simple lock timeout" panic.) The simple lock
timeout identifies the cluster lock that is held. The node holding the
lock can be found through the rm_ctx data structure. On that node, there
is typically a thread that holds the cluster lock while waiting for a
response to an interrupt that it has sent to another node.
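
For the TCR100-003 fix described earlier, the idea of accepting
FAILOVER_DONE as well as FAILOVER_OK can be illustrated with a small
sketch. This is not the kernel source. Only the two state names come from
the patch description; the enum layout, the FAILOVER_PENDING state, and
all function names are hypothetical and exist only to show why a node
that is further along in the failover sequence would otherwise be
treated as not ready.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node failover states. Only FAILOVER_OK and
 * FAILOVER_DONE are named in the patch description; FAILOVER_PENDING
 * and the numeric values are assumptions for illustration. */
enum failover_state {
    FAILOVER_PENDING,   /* assumed: node has not yet agreed to fail over */
    FAILOVER_OK,        /* node has agreed to fail over */
    FAILOVER_DONE       /* node has already completed its failover */
};

/* Assumed pre-patch behavior: only FAILOVER_OK counts as ready, so a
 * node that has already reached FAILOVER_DONE looks like a laggard. */
static bool node_ready_ok_only(enum failover_state s)
{
    return s == FAILOVER_OK;
}

/* Behavior described by the patch: FAILOVER_DONE is accepted too. */
static bool node_ready_ok_or_done(enum failover_state s)
{
    return s == FAILOVER_OK || s == FAILOVER_DONE;
}

/* Returns true when every node passes the supplied readiness test.
 * An empty node list is trivially ready. */
static bool every_node_can_failover(const enum failover_state *states,
                                    size_t count,
                                    bool (*ready)(enum failover_state))
{
    assert(states != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!ready(states[i]))
            return false;
    }
    return true;
}
```

With node states of FAILOVER_OK, FAILOVER_DONE, and FAILOVER_OK, the
OK-only test reports that not every node can fail over, while the
OK-or-DONE test reports that all nodes are ready.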