PROBLEM: (89313, GB_G01781) (PATCH ID: TCR520-013)
********
This patch fixes a situation in which one or several cluster members would
panic if a Memory Channel cable was removed or faulty.

PROBLEM: (90195) (PATCH ID: TCR520-055)
********
This patch makes the following changes to Memory Channel in a cluster
environment:

  - Fixes a problem in which a Memory Channel power off in a LAN interconnect
    cluster causes a cluster-wide panic.
  - A user is now allowed to kill a LAN interconnect cluster via Memory
    Channel.
  - Supports Memory Channel usage in a LAN cluster.

PROBLEM: (84876, 87656) (PATCH ID: TCR520-106)
********
This patch fixes problems that occur when the master failover node goes
offline during a failover, and when a failover is triggered by parity errors
increasing beyond the limit.

Some symptoms of the master failover node going offline:

One node in the cluster panics, and the other nodes hang. The reason for the
panic may be anything, but it is important to note that the cluster was in a
failover during the panic. This can be seen in the crash dump:

  rmerror_int: failover: mchan0 error_type = 0xe0000000 error_count = 0xba time = 0x17e573c50 mcerr = 0x12020248 lcsr = 0xc07b mcport = 0x16400000

The nodes that hang may display:

  m_state_change: mchan0 slot 0 offline
  rm slave: mchan0, hubslot = 1, phys_rail 1 removed
  rm slave: mchan0, hubslot = 1, phys_rail 1 (size 512 MB)

depending on the timing and where in the code path the master was when it
failed. If the other nodes are not reset before the panicked node is
rebooted, they may panic. Those panics can be misleading and range from:

  "panic (cpu 1): ics_unable_to_make_progress: input thread stalled"

to a machine check.

Some symptoms of the parity error limit being exceeded:

This is more difficult to diagnose. If any of the following panics are seen,
this may be the cause:

  1. On one node:
       "panic (cpu 0): simple_lock: time limit exceeded"
     and on one or more of the other nodes:
       PANIC: "ics_mct: Node arrival with node in bad state"
  2. PANIC: "cmn_err: CE_PANIC: ICS MCT Assertion failed: total_fragments == ectx"
  3. PANIC: "cmn_err: CE_PANIC: ICS MCT Assertion failed: lf != 0,file: ics_mct_oolencoder.c"
  4. panic (cpu 1): kernel memory fault

PROBLEM: (IT_G03453) (PATCH ID: TCR520-132)
********
This fix addresses a problem in which a bad Memory Channel cable causes a
cluster member to panic with a panic string of "rm_eh_init" or
"rm_eh_init_prail".

The problem occurs in dual-rail Memory Channel cluster configurations after a
cable problem causes the cluster to mark the first rail as bad and fail over
to the second rail. If the Memory Channel code later decides to mark the
first rail as okay and additional cable problems occur, a cluster member may
panic as described above.

PROBLEM: (92909) (PATCH ID: TCR520-152)
********
This patch contains changes intended to make Memory Channel failovers more
reliable. It also handles bad optical cables. The symptoms of bad optical
cables are an impossible number of state change interrupts, or out-of-bounds
hubslot identification numbers being passed to the state change interrupt
handler.

PROBLEM: (92318) (PATCH ID: TCR520-134)
********
This patch fixes a problem in which a node booting into a cluster hangs
during Memory Channel initialization. This problem may occur in a heavily
loaded cluster when logical rail threads associated with the Memory Channel
logical rails are blocked while a member is booting.

There is no deterministic console message. The cluster gets stuck during the
boot. In previously observed cases, the cluster got stuck after these
messages:

  rm slave: mchan0, hubslot = 7, phys_rail 0 (size 512 MB)
  rm slave: mchan1, hubslot = 7, phys_rail 1 (size 512 MB)
  rm slave: log_rail 0 (size 512 MB), phys_rail 1 (mchan1)

PROBLEM: (95004) (PATCH ID: TCR520-218)
********
In a dual-rail Memory Channel cluster, when one node initiates a failover,
another node (typically a TLASER) may crash with a KERNEL MEMORY FAULT panic.

   0 boot
   1 panic
   2 trap
   3 _XentMM
   4 rm_get_lock_master
   5 rm_error_cluster_sync
   6 rm_slave_failover
   7 rm_failover_request_int
   8 rm_prail_int
   9 rm_int
  10 Mchan_isr
  11 intr_dispatch_post
  12 _XentInt

PROBLEM: (93962) (PATCH ID: TCR520-180)
********
When parity errors increase beyond the error rate threshold in a single
physical rail configuration, the Memory Channel driver flags the rail as
'noisy' and attempts to fail over. In the single rail configuration there is
no failover rail, and this action causes the entire cluster to panic. With
this fix, the node whose error rate has exceeded the threshold panics
instead.

PROBLEM: (94360) (PATCH ID: TCR520-181)
********
The Memory Channel driver leaves stale data on an offline physical rail. If a
node is rebooted while this physical rail is offline, and the physical rail
then comes back online, the rebooted node will panic with:

  panic (cpu 1): memory channel - cluster still thinks node is member

Allowing this node to join the cluster then requires a cluster reboot.

PROBLEM: (95052) (PATCH ID: TCR520-233)
********
A debug kernel [built non-optimized] can panic with a kernel memory fault:

   0 panic
   1 trap
   2 _XentMM
   3 rm_notif_request
   4 rm_lrail_int_ctx
   5 rm_int
   6 Mchan_isr
   7 intr_dispatch_no_post
   8 _XentInt

PROBLEM: (95794) (PATCH ID: TCR520-253)
********
An error in event log indexing may result in the appearance of superfluous
"rm_event, index too big" messages on the system console.

PROBLEM: (92102, 94878, 94910, 94988, 95476, 95669) (PATCH ID: TCR520-208)
********
In a Memory Channel cluster, rebooting a node without performing a hardware
reset can crash other members with an RM_AUDIT_ACK_BLOCK panic.

   0 boot
   1 panic
   2 rm_crash_node_mask
   3 rm_panic
   4 rm_audit_ack_block
   5 rm_write_sync
   6 rm_get_errcnt_lock
   7 rm_lock_global_error
   8 rm_eh_init_shared_data_req
   9 rm_prail_int
  10 rm_int
  11 Mchan_isr
  12 _XentInt

PROBLEM: (94391, 89889) (PATCH ID: TCR520-225)
********
This patch fixes issues associated with the initialization of the Memory
Channel driver.
The fix addresses an issue whereby the driver's internal data structures
become inconsistent during boot, and this inconsistency is subsequently
propagated to other nodes. It also adds resiliency during boot. Some of the
symptoms the fix addresses are listed below.

  PANIC: "rm_prail_boot_signal_any_node:Fix configuration and reboot"
  PANIC: "ics_mct: Error from establish_RM_notification_channel"
  PANIC: "ics_mct: Error from register_RM_notification_callback"
  PANIC: "ics_mct: Error from send_RM_notification"
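
When triaging a saved console log or crash summary, the panic strings listed
for TCR520-225 can be checked mechanically. The following is a minimal
sketch, not part of the patch kit: the function name find_known_panics and
the list name TCR520_225_PANICS are hypothetical, and the only data it uses
is the set of panic strings quoted above. A match is a hint that this patch
may be relevant, not a diagnosis.

```python
# Panic strings copied verbatim from the TCR520-225 entry in these notes.
# The names below are illustrative assumptions, not part of any Tru64 tool.
TCR520_225_PANICS = [
    "rm_prail_boot_signal_any_node:Fix configuration and reboot",
    "ics_mct: Error from establish_RM_notification_channel",
    "ics_mct: Error from register_RM_notification_callback",
    "ics_mct: Error from send_RM_notification",
]


def find_known_panics(console_text, known=TCR520_225_PANICS):
    """Return the known panic strings found in console_text, in list order.

    Uses plain substring matching, so surrounding text such as a
    'panic (cpu N):' prefix or quoting does not prevent a match.
    """
    return [panic for panic in known if panic in console_text]
```

For example, passing the text of a console log to find_known_panics returns
the subset of the four strings that appear in it, or an empty list if none
do.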