PROBLEM: (QAR 58405) (Patch ID: TCR150-002)
********

This patch fixes a problem booting a second member into a cluster.
The node may boot and form a cluster from the point of view of
cnxshow. However, attempts to access a DRD that is expected to be
served from the booting node generate the following message on the
console:

    bss_rm_iorequest: recvd I/O from dead host 0

Another problem may appear if a node of this "cluster" is rebooted.
In this case cnxshow will not show a normal cluster.

One can tell whether a cluster has this problem from the boot-time
messages on the node that is already up. Let the booting node be
node 0 and the node that is up be node 1. The messages on node 1
when node 0 boots should look like this:

    memory channel request from node 0
    memory channel update request from node 0
    memory channel - adding node 0

If the last message, "memory channel - adding node 0", is missing,
the problem exists.

One can also tell by examining the kernel with dbx and looking at
bss_server_work.last_bitmap:

    dbx -k /vmunix
    (dbx) p bss_server_work.last_bitmap

This should have the same value on all nodes, and the number of bits
set should equal the number of nodes in the cluster (a sketch of this
check follows this problem description).

It is possible that "memory channel - adding node 0" is present and
the bss_server_work.last_bitmap values look good, but cnxshow still
does not display a good cluster. This happens when the cluster, at
some point in the past, exhibited the missing "memory channel -
adding node 0" behavior. Note that the cluster may exhibit the
missing "memory channel - adding node 0" behavior frequently or
infrequently.
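For readers who want to verify the dbx output by hand, the following
is a minimal sketch of the check described above. It assumes the
value printed for bss_server_work.last_bitmap is an unsigned 64-bit
mask with one bit per cluster member; the example value and node
count are hypothetical and should be replaced with the values from
your own cluster.

    #include <stdio.h>

    /* Count the bits set in a last_bitmap value copied from dbx
     * output.  Assumes the bitmap is an unsigned 64-bit mask with
     * one bit per cluster member (illustration only). */
    static int count_set_bits(unsigned long long bitmap)
    {
        int count = 0;
        while (bitmap != 0) {
            count += (int)(bitmap & 1ULL);
            bitmap >>= 1;
        }
        return count;
    }

    int main(void)
    {
        unsigned long long last_bitmap = 0x3ULL; /* value printed by dbx (hypothetical) */
        int expected_nodes = 2;                  /* number of nodes in the cluster      */

        int set = count_set_bits(last_bitmap);
        printf("bits set: %d, expected nodes: %d -> %s\n",
               set, expected_nodes,
               (set == expected_nodes) ? "looks consistent" : "possible problem");
        return 0;
    }

If the bit count differs from the number of cluster members, or the
value differs between nodes, the cluster has the problem described
above.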
PROBLEM: (QAR 60982) (Patch ID: TCR150-015)
********

In a virtual hub cluster, shutting down one node can cause the other
to crash. Typical panic strings on the node that crashes are:

    rm_failover_self
    rm_failover_all: target rail offline

PROBLEM: (Patch ID: TCR150-021)
********

Various repairs in Memory Channel error handling, including fixes for
virtual hub booting with a cable unplugged. The typical panic string
for a boot with the cable unplugged is:

    rm_delete_context: fatal MC error

Panics removed in the failover code are:

    rm_failover_request_int_common: failed to free error_cnt lock
    rm_failover_request_int_common: failed to get error cnt lock

A panic removed in error handling is:

    rmerror_int: failed to free error count lock

A fix for noticing that a node has gone down during error handling
keeps another node from panicking with:

    rm_delete_context: fatal MC error

PROBLEM: (QAR 58777, QAR 59100, QAR 59466, QAR 59898, QAR 62225) (Patch ID: TCR150-026)
********

This patch corrects various problems with Memory Channel (MC) error
handling discovered in cable-pull-under-load tests. Typical panic
strings are "rm_delete_context: fatal MC error" and "Kernel Memory
Fault". In addition, the nodes may hang.

Pulling cables is still not recommended, even with this patch,
because we are still sorting out some problems that result in memory
corruption when cables are pulled. This patch adds audits to detect
some of the corruption and will crash a node if corruption is
detected. To test error handling in a safe way, power down the
active hub. While cable pull is not trouble free with this patch, it
is felt that error handling in general is more robust in this
implementation.

PROBLEM: (Patch ID: TCR150-029)
********

Hubless MC2 systems hang during boot and/or experience error
interrupts.

PROBLEM: (none) (Patch ID: TCR150-052)
********

Reliable Datagram (RDG) messaging delivers low-latency,
high-bandwidth networking for cluster applications. Cluster
applications wishing to use these features code to the API defined
in the RDG shared library, librdg.so.

PROBLEM: () (Patch ID: TCR150-065)
********

Applications developed to the Reliable Datagram API may see a problem
where RdgIoPoll() indicates that an I/O has completed when one
actually has not (an illustrative polling sketch appears after the
last entry in this section).

PROBLEM: (QAR 75850) (Patch ID: TCR150-078)
********

This patch fixes a kernel memory fault in rm_lock_update_retry().

PROBLEM: (QAR 73648) (Patch ID: TCR150-069)
********

This patch fixes a problem where both nodes in a cluster panic at the
same time with a simple_lock timeout panic.
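The RDG entries above refer to the librdg.so API only by name; the
actual function prototypes are not documented here. The following is
a minimal, hypothetical sketch of a completion-polling loop of the
kind affected by the RdgIoPoll() problem in TCR150-065. The
rdg_handle_t type, the RdgIoPoll() signature, and its return
convention (nonzero meaning the I/O has completed) are assumptions
made purely for illustration; a real application would use the
declarations shipped with librdg.so instead of the stub shown here.

    #include <stdio.h>

    typedef int rdg_handle_t;   /* assumed handle type (illustration only) */

    /* Assumed prototype for illustration; the real librdg.so
     * interface may differ.  Stubbed here so the sketch runs
     * stand-alone. */
    static int RdgIoPoll(rdg_handle_t handle)
    {
        (void)handle;
        return 1;   /* pretend the polled I/O has completed */
    }

    int main(void)
    {
        rdg_handle_t io = 0;    /* handle for a previously issued RDG I/O */

        /* Spin until the library reports completion.  The TCR150-065
         * fix addresses cases where this report could arrive before
         * the I/O had actually finished. */
        while (RdgIoPoll(io) == 0)
            ;   /* a real application would yield or sleep here */

        printf("RDG I/O reported complete\n");
        return 0;
    }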