PROBLEM: (89313, GB_G01781) (PATCH ID: TCR520-013)
********
This patch fixes a situation in which one or several cluster members would
panic if a Memory Channel cable was removed or faulty.

PROBLEM: (90195) (PATCH ID: TCR520-055)
********
This patch addresses the following Memory Channel issues in a cluster
environment:

- a problem with the Memory Channel power off in a LAN interconnect cluster,
  which causes a cluster-wide panic,
- a user is now allowed to kill a LAN interconnect cluster via Memory Channel,
- Memory Channel usage is now supported in a LAN cluster.

PROBLEM: (84876, 87656) (PATCH ID: TCR520-106)
********
This patch fixes two problems: the master failover node going offline during
a failover, and failovers caused by parity errors increasing beyond the limit.

Some symptoms of the master failover node going offline:

One node in the cluster panics, and the other nodes hang. The reason for the
panic may be anything, but it is important to note that the cluster was in a
failover during the panic. This can be seen in the crash dump:

  rmerror_int: failover: mchan0 error_type = 0xe0000000 error_count = 0xba time = 0x17e573c50 mcerr = 0x12020248 lcsr = 0xc07b mcport = 0x16400000

The nodes that hang may display:

  m_state_change: mchan0 slot 0 offline
  rm slave: mchan0, hubslot = 1, phys_rail 1 removed
  rm slave: mchan0, hubslot = 1, phys_rail 1 (size 512 MB)

depending on the timing and where in the code path the master was when it
failed. If the other nodes are not reset before the panicked node is
rebooted, they may panic. Those panics can be misleading and range from:

  "panic (cpu 1): ics_unable_to_make_progress: input thread stalled"

to a machine check.

Some symptoms of the parity error limit being exceeded:

This is more difficult to diagnose. If any of the following panics are seen,
this may be the cause:

1. On one node:
     "panic (cpu 0): simple_lock: time limit exceeded"
   and on one or more of the other nodes:
     PANIC: "ics_mct: Node arrival with node in bad state"
2. PANIC: "cmn_err: CE_PANIC: ICS MCT Assertion failed: total_fragments == ectx"
3. PANIC: "cmn_err: CE_PANIC: ICS MCT Assertion failed: lf != 0,file: ics_mct_oolencoder.c"
4. panic (cpu 1): kernel memory fault

PROBLEM: (IT_G03453) (PATCH ID: TCR520-132)
********
This fix addresses a problem in which a bad Memory Channel cable causes a
cluster member to panic with a panic string of "rm_eh_init" or
"rm_eh_init_prail". The problem occurs in dual-rail Memory Channel cluster
configurations after a cable problem causes the cluster to mark the first
rail as bad and fail over to the second rail. If the Memory Channel code
later decides to mark the first rail as okay and additional cable problems
occur, a cluster member may panic as described above.

PROBLEM: (92909) (PATCH ID: TCR520-152)
********
This patch contains changes that improve Memory Channel failover handling. It
also handles bad optical cables. The symptoms of bad optical cables are an
impossible number of state change interrupts, or out-of-bounds hubslot
identification numbers being passed to the state change interrupt handler.

PROBLEM: (92318) (PATCH ID: TCR520-134)
********
This patch fixes a problem in which a node booting into a cluster hangs
during Memory Channel initialization. This problem may occur in a heavily
loaded cluster when logical rail threads associated with the Memory Channel
logical rails are blocked while a member is booting. There is no
deterministic console message; the cluster gets stuck during booting. In past
occurrences, the cluster has been seen to get stuck at this point:

  rm slave: mchan0, hubslot = 7, phys_rail 0 (size 512 MB)
  rm slave: mchan1, hubslot = 7, phys_rail 1 (size 512 MB)
  rm slave: log_rail 0 (size 512 MB), phys_rail 1 (mchan1)

PROBLEM: (95004) (PATCH ID: TCR520-218)
********
In a dual-rail Memory Channel cluster, when one node initiates a failover,
another node (typically a TLASER) may crash with a KERNEL MEMORY FAULT panic.
   0 boot
   1 panic
   2 trap
   3 _XentMM
   4 rm_get_lock_master
   5 rm_error_cluster_sync
   6 rm_slave_failover
   7 rm_failover_request_int
   8 rm_prail_int
   9 rm_int
  10 Mchan_isr
  11 intr_dispatch_post
  12 _XentInt

PROBLEM: (93962) (PATCH ID: TCR520-180)
********
When parity errors increase beyond the error rate threshold in a single
physical rail configuration, the Memory Channel driver flags the rail as
'noisy' and attempts to fail over. A single-rail configuration has no
failover rail, so this action causes the entire cluster to panic. With this
fix, only the node whose errors have exceeded the threshold panics.

PROBLEM: (94360) (PATCH ID: TCR520-181)
********
The Memory Channel driver leaves stale data on an offline physical rail. If a
node is rebooted while this physical rail is offline, and the physical rail
then comes back online, the rebooted node will panic with:

  panic (cpu 1): memory channel - cluster still thinks node is member

Allowing this node to join the cluster requires a cluster reboot.
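
As a diagnostic aid for the TCR520-106 symptoms, the following is a minimal
Python sketch (not part of any patch kit) that extracts the rail name and the
hexadecimal fields from an rmerror_int failover message like the one quoted
earlier in these notes. The function name parse_rmerror is hypothetical, and
the sketch assumes the message is available as a single line of text; the
field names come only from the sample message shown above.

```python
import re

# Matches "name = 0x..." pairs as they appear in the sample rmerror_int line.
FIELD_RE = re.compile(r"(\w+)\s*=\s*(0x[0-9a-fA-F]+)")


def parse_rmerror(line):
    """Return (rail, fields) for an 'rmerror_int: failover:' line, else None.

    Hypothetical helper: 'rail' is the device name that follows 'failover:'
    and 'fields' maps each field name to its integer value.
    """
    m = re.search(r"rmerror_int:\s*failover:\s*(\w+)", line)
    if m is None:
        return None
    fields = {name: int(value, 16) for name, value in FIELD_RE.findall(line)}
    return m.group(1), fields


# Sample message copied from the TCR520-106 description.
sample = (
    "rmerror_int: failover: mchan0 error_type = 0xe0000000 "
    "error_count = 0xba time = 0x17e573c50 mcerr = 0x12020248 "
    "lcsr = 0xc07b mcport = 0x16400000"
)

rail, fields = parse_rmerror(sample)
print(rail, fields["error_count"])  # prints: mchan0 186
```

Lines that are not rmerror_int failover messages, such as the "rm slave:"
console output, return None, so the helper can be run over a whole console
log to find out whether a failover was in progress.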