Compaq Services Division: Patch/ECO - rtrsun314_227


RTR V3.1D RTRSUN314_227 Reliable Transaction Router SUN UNIX ECO Summary

TITLE: RTR V3.1D RTRSUN314_227 Reliable Transaction Router SUN UNIX ECO Summary

Modification Date:  05-APR-1999
Modification Type:  New Kit

Copyright (c) Compaq Computer Corporation 1999. All rights reserved.

PRODUCT:    Reliable Transaction Router (RTR) for UNIX for SUN

OP/SYS:     UNIX for SUN

SOURCE:     Compaq Computer Corporation

ECO INFORMATION:

     ECO Kit Name:  RTRSUN314_227
     ECO Kits Superseded by This ECO Kit:
     ECO Kit Approximate Size:  15756 Blocks (8067072 Bytes)
                                TAR file - 15756 Blocks (8067072 Bytes)
     Kit Applies To:  RTR V3.1D
                      SUN UNIX V2.5, V2.5.1, V7.0
     System/Cluster Reboot Necessary:  No

ECO KIT SUMMARY:

An ECO kit exists for Reliable Transaction Router on UNIX for SUN V2.5
through V7.0. This kit addresses the following problems:

Problems Addressed in RTRSUN314_227:

  o  Remote commands failing intermittently

     Remote commands can sometimes produce the error "%RTR-F-ERRACCNOD,
     error accessing node". Repeating the command one or more times
     usually results in the desired effect.

     After stopping and starting RTR, one or more command server
     processes may report "ACPNOTVIA" in response to a subsequent SHOW
     command. The command DISCONNECT SERVER entered one or more times
     usually clears the condition.

     Also, the output from remote STOP RTR and DISCONNECT SERVER
     commands is abbreviated.

  o  Overlapping partition definitions.

     Under some conditions, simultaneously opening several server
     channels with overlapping partition definitions could cause the
     ACP to crash. This is no longer the case.

  o  Errors at severity "error" and "warning" reported as "fatal"

     RTR defines messages with one of the following severity levels:
     success, informational, warning, error and fatal. When displayed
     in a terminal session or log file entry, the severity of a message
     is indicated in the preamble string as a letter, e.g.
     "%RTR-S-FACCREATED . . ."
     indicates the successful creation of a facility. Previous versions
     of RTR grouped together messages of severity warning, error and
     fatal, and displayed the severity level of these messages as fatal.
     This had the unintended effect of displaying some messages at a
     higher severity level than intended. This defect has been
     eliminated so that messages are now rendered consistently with
     their descriptions in the System Manager's Manual and elsewhere.

  o  ASCII representation of key bounds

     Key range low and high bounds are now displayed properly as a
     comma-separated list of numbers and C-style strings.

  o  Facility name allows some eight bit characters but not all

     Non-English character sets are not supported for identifiers. The
     supported character set for RTR identifiers such as facility names
     is ASCII, with lowercase and uppercase letters equivalent. Eight
     bit characters are not supported because the name might not
     interoperate with RTR processes using a different locale or
     running another RTR version.

  o  Node /isolate and link /suspect implementation faulty

     The 'set node' qualifier /isolate and 'set link' qualifier /suspect
     have been superseded by /autoisolate.

     Any RTR node may disconnect a remote node if it finds that node to
     be unresponsive or congested. The normal behavior following such
     action is automatic network link reconnection and recovery.

     Node autoisolation is a feature that allows a node where the
     feature is enabled to disconnect a congested remote node in such a
     way that when the congested node attempts to reconnect it receives
     an instruction to close all its network links and cease connection
     attempts. Whilst in this state, the node is termed isolated.

     Remote node autoisolation may be enabled at the node level where
     it applies to all links, or for specific links only with the
     'set link/autoisolate' command.
     An isolated node will remain in that state until the system manager
     performs the following actions:

       - enable the link to the isolated node at all nodes that have
         isolated it

           set link /enable

       - exit the isolated state at the isolated node

           set node/noisolate

  o  MONITOR STALLS field widths adjusted

     The numbers should now be legible even when many bytes have been
     sent.

  o  Incorrect open channel count in MONITOR CHANNEL

     The opened channel count now counts every opened channel exactly
     once.

  o  Distributed quorum loop

     Previous versions of RTR could generate unnecessarily large
     amounts of network traffic whilst establishing quorum. In cases
     where a recipient node was slow in processing incoming network
     messages this could result in the loss of the sending RTRACP
     process (rdm__malloc: not enough memory). This has been corrected.

  o  MODIFY JOURNAL command now allowed while ACP is running

     The earlier restriction that a MODIFY JOURNAL command could not be
     issued after RTR was started has now been lifted. It is however
     now required that RTR be started for a MODIFY JOURNAL command to
     be accepted.

  o  Bug in load balance feature

     In earlier versions of RTR it was occasionally possible for a
     router to become permanently incapable of accepting new incoming
     frontend connections if router load-balancing had been enabled by
     specifying the /BALANCE qualifier, and a frontend happened to
     connect to the router during a quorum state transition. This has
     now been corrected.

  o  Rtr applications hang on trying to continue after ACP restarted

     If the application tried to open a channel again after seeing the
     status RTR_STS_ACPNOTVIA it could hang on the subsequent
     rtr_receive_message call. This problem has been corrected for
     threaded Unix platforms. It is no longer necessary to restart any
     Rtr application for Unix after restarting Rtr.
  o  Using RTR commands after ACP has been stopped or dies

     API verbs called from the RTR command line interpreter would fail
     with the status ACPNOTVIA if RTR was stopped and restarted without
     restarting the command server. This has been corrected. The
     problem can be avoided on earlier versions of RTR by issuing the
     command 'disconnect server' after stopping RTR.

  o  Application crash trying to send large messages to looping ACP

     Several changes were made to combat this combination of
     application crash and rapidly expanding ACP heap:

       - Flow control is now granted only to the channel and facility
         that requested it

         A problem was discovered and corrected whereby a grant of flow
         control credit could allow unrelated channels to send too.
         This is believed to be the prime cause of the symptoms
         reported.

       - An application that is unable to send to the ACP due to
         resource shortage, for example if the ACP is alive but no
         longer receiving for whatever reason, now keeps trying
         indefinitely, and will now appear to hang rather than crash.

       - The TCP_NODELAY option which disables the Nagle algorithm is
         no longer enabled on any Rtr platform. This will improve
         throughput under load, although there may be a slight impact
         on response time under certain conditions.

  o  Functions permitted in a rtr_set_wakeup() handler

     In an Rtr wakeup handler in an AST in an unthreaded OpenVMS
     application the use of rtr_reply_to_client(),
     rtr_send_to_server(), rtr_broadcast_event(), or
     rtr_receive_message() with a non-zero timeout is not recommended.
     They may block when they need transaction ids or flow control,
     which will cause the whole application to hang until the wakeup
     completes.

     In an Rtr wakeup handler in a threaded application the same rules
     apply. Note that wakeups are unnecessary in a threaded paradigm,
     but they may be used in common code in applications that also need
     to run on OpenVMS.
     Please note that your mainline code continues to run while your
     wakeup is executing, so extra synchronisation may be required.
     Also note that if the wakeup does block then it does not generally
     hang the whole application.

     In an Rtr wakeup handler in a signal in an unthreaded Unix
     application no RTR API functions and only the very few
     async-signal-safe system and library functions may be called,
     because the wakeup is performed in a signal handler context. It is
     permitted to write to a pipe or access a volatile sig_atomic_t
     variable, but using malloc() and printf() for example will cause
     unexpected failures. Alternatively, on most Unix platforms you can
     compile and link the application as a threaded application with
     the reentrant Rtr shared library '-lrtr_r'.

     For maximum portability the wakeup handler should do no more than
     just the minimum necessary to wake up the mainline event loop. You
     should assume that mainline code and other threads might continue
     to run in parallel with the wakeup, especially on machines with
     more than one CPU.

  o  Superfluous network traffic for nonexistent channels

     Whenever a channel opens or closes, RTR sends an update message to
     the router so that it can modify its broadcast routing
     information, if necessary. In previous versions of RTR such
     messages were sent even if no channel existed for the facility. In
     cases where machines with the Frontend role had a large number of
     facilities defined, this could result in significant network
     traffic that would be quite noticeable over slow links, such as
     asynchronous connections over telephone wire. RTR no longer sends
     these messages unless a channel exists on the facility.

  o  Crash for invalid facility on AIX

     An uninitialized variable could cause a process crash on AIX. This
     has been corrected.
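The rtr_set_wakeup() guidance in this summary (write to a pipe or set a volatile sig_atomic_t, and do nothing else in the handler) can be illustrated with a minimal sketch of the classic self-pipe pattern. This is not RTR code: SIGUSR1 stands in for the RTR wakeup so that the sketch needs only POSIX headers, and the names on_wakeup, wakeup_init, wakeup_wait, wake_pipe and wake_pending are hypothetical. In a real application the handler body would be what gets registered with rtr_set_wakeup(), whose exact prototype should be taken from the RTR documentation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical names; SIGUSR1 is only a stand-in for the RTR wakeup. */
static int wake_pipe[2] = { -1, -1 };      /* [0] = read end, [1] = write end */
static volatile sig_atomic_t wake_pending = 0;

/* Handler: sets a flag and writes one byte. write() is async-signal-safe;
 * malloc() and printf() are not, so they are deliberately avoided here. */
static void on_wakeup(int signo)
{
    int saved_errno = errno;
    ssize_t n;

    (void)signo;
    wake_pending = 1;
    n = write(wake_pipe[1], "w", 1);       /* non-blocking; a full pipe is harmless */
    (void)n;
    errno = saved_errno;
}

/* Create a non-blocking pipe and install the handler. Returns 0 or -1. */
static int wakeup_init(void)
{
    struct sigaction sa;
    int i;

    if (pipe(wake_pipe) != 0)
        return -1;
    for (i = 0; i < 2; i++) {
        int fl = fcntl(wake_pipe[i], F_GETFL);
        if (fl == -1 || fcntl(wake_pipe[i], F_SETFL, fl | O_NONBLOCK) == -1)
            return -1;
    }
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_wakeup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Mainline side: wait up to timeout_ms for a wakeup.
 * Returns 1 if woken, 0 on timeout, -1 on error. */
static int wakeup_wait(int timeout_ms)
{
    struct pollfd pfd;
    char buf[64];
    int rc;

    pfd.fd = wake_pipe[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    do {
        rc = poll(&pfd, 1, timeout_ms);    /* retried if a signal interrupts it */
    } while (rc == -1 && errno == EINTR);
    if (rc <= 0)
        return rc;
    while (read(wake_pipe[0], buf, sizeof buf) > 0)
        ;                                  /* drain every queued wakeup byte */
    wake_pending = 0;
    return 1;
}
```

The mainline event loop would call wakeup_wait() and only then, outside signal context, call whatever RTR API functions it needs. Keeping both pipe ends non-blocking means a burst of wakeups can neither stall the handler nor leave the mainline stuck in the drain loop.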
  o  Large messages can exhaust ACP memory before it closes channel

     Flow control is supposed to close a channel to release memory
     occupied by the queue of messages waiting to be sent to an
     unresponsive application channel after it has been found to be
     full a certain number of times. If the application is sending many
     very large messages, especially if flow control is also not
     working properly for some other reason, then flow control may not
     close the channel soon enough and the ACP may run out of memory.

     As a workaround only for applications using large messages
     containing tens of thousands of bytes, the following environment
     variables are provided in this release to adjust the flow control
     algorithm parameters:

       RTR_MAX_CHANNEL_WAITQ_BYTES   default 100000   increase to several million
       RTR_MAX_CHANNEL_WAITQ_LIMIT   default -1       reduce to tens of millions
       RTR_MAX_CHANNEL_FULL_COUNT    default 2000     reduce to less than LIMIT/64000

     The BYTES should be set somewhat larger than the total amount of
     data that can ever occur in one transaction or one set of
     broadcasts. This might be a few million for a typical large
     message application. Flow control will start to operate and hang
     each application that is sending the data while this number of
     BYTES is exceeded for a channel. The channel will be closed if
     this happens more than COUNT times, or if the LIMIT is ever
     exceeded.

     Either set the LIMIT to an amount of acp waitq memory that you can
     afford for one channel, considerably less than your virtual memory
     and heap quotas, or set the COUNT to the number of messages you
     can afford to store in acp waitq memory for one channel. The
     default for LIMIT is -1, which means that there is no LIMIT and
     the COUNT will apply.

     These environment variables are provisional. Please consult with
     Rtr Engineering before changing any of these values.
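The relationship between the three flow control variables above can be checked with a little arithmetic. The following is a rough sketch, not part of RTR: the struct waitq_params type and the helpers env_long, load_waitq_params and waitq_params_consistent are hypothetical, and the rule that LIMIT should be larger than BYTES is an assumption inferred from the description rather than something the release note states. The defaults and the LIMIT/64000 bound come from the table above.

```c
#include <stdlib.h>

/* Hypothetical container for the three provisional settings. */
struct waitq_params {
    long bytes;       /* RTR_MAX_CHANNEL_WAITQ_BYTES, default 100000 */
    long limit;       /* RTR_MAX_CHANNEL_WAITQ_LIMIT, default -1 (no limit) */
    long full_count;  /* RTR_MAX_CHANNEL_FULL_COUNT,  default 2000 */
};

/* Read a decimal environment variable, falling back to dflt when the
 * variable is unset, empty, or not a clean integer. */
static long env_long(const char *name, long dflt)
{
    const char *s = getenv(name);
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return dflt;
    v = strtol(s, &end, 10);
    return (*end == '\0') ? v : dflt;
}

/* Load the settings using the defaults listed in the release note. */
static struct waitq_params load_waitq_params(void)
{
    struct waitq_params p;

    p.bytes      = env_long("RTR_MAX_CHANNEL_WAITQ_BYTES", 100000L);
    p.limit      = env_long("RTR_MAX_CHANNEL_WAITQ_LIMIT", -1L);
    p.full_count = env_long("RTR_MAX_CHANNEL_FULL_COUNT", 2000L);
    return p;
}

/* Returns 1 if the settings follow the guidance, 0 otherwise. */
static int waitq_params_consistent(const struct waitq_params *p)
{
    if (p->bytes <= 0 || p->full_count <= 0)
        return 0;
    if (p->limit == -1)
        return 1;                      /* no LIMIT: only COUNT applies */
    if (p->limit <= p->bytes)
        return 0;                      /* assumption: LIMIT should exceed BYTES */
    return p->full_count < p->limit / 64000L;   /* COUNT < LIMIT/64000 */
}
```

For example, with LIMIT set to 20000000 the bound LIMIT/64000 is 312 in integer arithmetic, so a COUNT of 300 passes the check while the default COUNT of 2000 does not. The note's own caution still applies: these variables are provisional and should not be changed without consulting Rtr Engineering.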
  o  Servers hanging during failover recovery

     Transaction recovery as a result of server failover could result
     in server applications getting hung in 'local recovery' state if
     it also happened that more than 10 client channels had
     simultaneously caused new transactions to be presented to the
     backend node. This has been fixed both by increasing the limit to
     50 and by adding a check to make sure that recovery is complete
     before enforcing the limit, which is designed to keep a backend
     node from getting overwhelmed when transactions are coming in at a
     rate faster than it can handle.

  o  Null bytes display in SHOW PARTITION output

     The display of null bytes in the upper and lower key bounds has
     been suppressed if the bytes appear at the end of a key of type
     string.

  o  RTRACP core dump while idle

     When a Frontend node is trimmed from a Frontend/Router Facility
     Definition, a core dump may result when RTR attempts to verify the
     facility after a network link loss. This has been corrected.

  o  RTRACP Crashes during trim operation.

     Crash during trim operation is now corrected.

The following problems were corrected in V3.1D ECO2 and are included in
this kit:

  o  A problem introduced in the 3.1D ECO1 (171) release that caused
     errors when using more than one partition has been corrected.

The following problems were corrected in V3.1D ECO1 and are included in
this kit:

  o  A configuration including front end nodes running RTR V2 and
     router nodes running RTR V3.1D could result in an ACP failure with
     an assertion on the router node. This has been corrected.

  o  The RTR ACP process would sometimes crash due to mis-handling of
     data pointers related to key range and facility information. The
     code has been changed to correct these problems.

  o  The delivery of the RTR_EVTNUM_SRPRIMARY event sometimes arrived
     at an application in Shadow Secondary mode before the completion
     of the pending transaction, which potentially allowed the same
     transaction to be played to both shadow sites.
     This problem has been corrected by ensuring that the event is
     delivered only between transactions.

  o  Following facility reconfiguration with the "trim facility"
     command, a situation could arise where a frontend node would be
     unable to connect to its own role as a router in a facility. Once
     in this mode, starting client channels would be unable to proceed
     past the state "wait_keyin".

  o  The RTR Command Server process would sometimes crash while exiting
     because the shared memory segment required to print log file
     information was no longer available when the final flush of the
     log buffer occurred. This problem has been corrected.

INSTALLATION NOTES:

The Reliable Transaction Router Version 3.1D ECO11 installation
procedure is the same as the installation procedure for RTR Version
3.1D. Refer to the Installation Guide for further information.

This patch can be found at any of these sites:

     Colorado Site
     Georgia Site
     European Site

Files on this server are as follows:

     rtrsun314_227.README
     rtrsun314_227.CHKSUM
     rtrsun314_227.CVRLET_TXT
     rtrsun314_227.tar

Updated: June, 1998