RTR V3.2 RTR320_268 Reliable Transaction Router Tru64 UNIX ECO Summary

TITLE: RTR V3.2 RTR320_268 Reliable Transaction Router Tru64 UNIX ECO Summary

Copyright (c) Compaq Computer Corporation 2000. All rights reserved.

Modification Date: 11-OCT-2000
Modification Type: Updated Kit: Supersedes RTR320_266

PRODUCT: Reliable Transaction Router (RTR) for Tru64 UNIX
OP/SYS: Compaq Tru64 UNIX
SOURCE: Compaq Computer Corporation

ECO INFORMATION:

ECO Kit Name: RTR320_268
ECO Kits Superseded by This ECO Kit:
    RTR320_266
    RTR320_255
    RTR320_249
ECO Kit Approximate Size: 10600 Blocks (5427200 Bytes)
Kit Applies To: RTR V3.2 on Compaq Tru64 UNIX V4.0D, V4.0E, V4.0F, V4.0G, V5.0
System/Cluster Reboot Necessary: No
Installation Rating: INSTALL_UNKNOWN

Kit Dependencies:

The following remedial kit(s) must be installed BEFORE installation of this kit:

    None

In order to receive all the corrections listed in this kit, the following remedial kits should also be installed:

    None

ECO KIT SUMMARY:

An ECO kit exists for Reliable Transaction Router V3.2 on Compaq Tru64 UNIX V4.0D through V5.0. This kit addresses the following problems:

Problems Addressed in the RTR320_268 Kit:

o The following changes and corrections have been made for RTR V3.2, ECO8 for all platforms.

+ 14-8-337 V4.0 SET TRANSACTION Functionality

RTR would not allow the SET TRANSACTION command to abort a voted transaction if a coordinating router was still available. Users would get a SETTRANROUTER error. This constraint has now been relaxed. You can use SET TRANSACTION to abort a transaction that is in VOTED state. All backend participants and the client receive a rtr_mt_rejected message along with the reason, indicating that the transaction has been aborted by a SET TRANSACTION operation.

If the coordinating router is running a version of RTR prior to V3.2, ECO8, the SET TRANSACTION command fails with a TRNEEDUPGRADE error, indicating that "The new SET TRAN feature is not supported in current version of RTR".

There are eight valid state changes allowed for the SET TRANSACTION command.
Attempting to change transaction state to a state that is not allowed produces an error message of

    %RTR-E-INVSTATCHANGE, Invalid to change from current state to the specified state.

The following table identifies the valid state changes.

    Table 1  Valid Transaction State Transitions
    ___________________________________________________________
                                 NEW STATE
    ___________________________________________________________
    Current State    COMMIT    ABORT    EXCEPTION    DONE
    ___________________________________________________________
    SENDING                    YES
    VOTED            YES       YES
    COMMIT                              YES          YES
    EXCEPTION        YES                             YES
    PRI_DONE                                         YES
    ___________________________________________________________

All transaction states referenced in the above table are RTR journal states. Use the RTR commands DUMP JOURNAL or SHOW TRANSACTION to determine the journal state for each transaction branch.

Changing from State VOTED to State ABORT

A transaction can be stalled in a VOTED state in one or more of the following situations.

- There is a distributed deadlock (only possible with multi-participant transactions and multiple simultaneously active transactions.)

- One participant has voted, but some other participant has not voted (for example, waiting for the database record to become accessible.)

- The transaction is not currently active, and is a candidate for recovery. The transaction is probably a multi-participant transaction and the final journal state for the local branch is VOTED. The transaction outcome cannot be determined without consulting the journal from the other participants.

Whatever the cause, this transaction ties up the system resources and prevents other transactions from running. Should the system administrator decide to abort the transaction using the SET TRANSACTION command, RTR sends a request-to-abort message to all the participants (transaction branches) to abort each transaction branch.
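
The state-change rule described above can be pictured as a lookup table. The following Python fragment is only an illustrative sketch, not RTR code: the names ALLOWED, check_transition and describe are hypothetical, and the placement of each YES under a NEW STATE column is this sketch's reading of Table 1 (only the VOTED row is confirmed by the surrounding text, which states that VOTED may be changed to COMMIT or ABORT).

```python
# Illustrative model of the SET TRANSACTION state-change check.
# Hypothetical names; the column placement for rows other than VOTED
# is an assumed reading of Table 1, not something RTR exposes.
ALLOWED = {
    "SENDING": {"ABORT"},
    "VOTED": {"COMMIT", "ABORT"},
    "COMMIT": {"EXCEPTION", "DONE"},
    "EXCEPTION": {"COMMIT", "DONE"},
    "PRI_DONE": {"DONE"},
}


def check_transition(current, new):
    """Return True if changing journal state `current` to `new` is allowed."""
    return new.upper() in ALLOWED.get(current.upper(), set())


def describe(current, new):
    """Return a short text describing the outcome of a requested change."""
    if check_transition(current, new):
        return f"{current} -> {new}: allowed"
    # Message text quoted from the release note above.
    return (f"{current} -> {new}: %RTR-E-INVSTATCHANGE, Invalid to change "
            "from current state to the specified state")


if __name__ == "__main__":
    print(describe("VOTED", "ABORT"))
    print(describe("SENDING", "COMMIT"))
```

The table holds exactly eight entries, matching the eight valid state changes mentioned in the text.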
After the abort, RTR presents a rtr_mt_rejected message to each server with a status indicating that "TX was aborted by Set Transaction operation".

If the coordinating router is available, a race condition is possible, in that the transaction coordinator might be trying to commit the transaction at the instant that the operator was attempting to abort the transaction. Under this scenario, RTR may not allow the abort to proceed, if the coordinating router has already decided to commit the transaction. An operator log message on the router will be written to warn the administrator of this situation.

If the journal state of the transaction is VOTED, you can use SET TRANSACTION /NEW_STATE=COMMIT or ABORT, for example:

    RTR> SET TRANSACTION "8fd01f10,0,0,0,0,8fd01f10,7d9f25b0" /PARTITION=PART1 -
    _RTR> /STATE=VOTED /NEW=ABORT
    %RTR-S-SETTRANDONE, 1 transaction(s) updated in partition PART1 of facility RTR$DEFAULT_FACILITY

o The following changes and corrections have been made for RTR V3.2, ECO7 for all platforms.

+ 14-3-252 ACP failed with insufficient memory

Previous versions of RTR could allocate too much memory for replay queues. When insufficient virtual memory was available, this behavior could cause the ACP to fail. RTR V2 compatible functionality has been reintroduced, in that a replay queue that grows too large is now discarded.

+ 14-3-270 ACP crashed with INSVIRMEM while trying to abort a very large transaction

This problem was caused by overlarge transaction replay queues and has been corrected.

+ 14-3-340 Standby partition fails to take over on network partitioning

Previously, it was possible that a standby partition would fail to take over if network partitioning occurred.

o The following changes and corrections have been made for RTR V3.2, ECO6 for all platforms.

+ 14-1-929 Unable to suspend a partition on a secondary site

It was possible to encounter a situation where you could not suspend transaction presentation on a partition on a secondary site.
If you gave the suspend command at a time when the secondary site only had transactions awaiting assignment of a server channel, then the command would not complete, leaving transaction presentation in the 'suspending' state. + 14-1-940 Partition switching between standby and active State instability could result if instances of a partition were assigned an equal priority. This could happen through use of SET PARTITION commands, or as a result of network partitioning. + 14-5-157 Improved diagnostics Improved diagnostics have been added for testing network-related problems. + 14-8-304 Aborted transaction caused the RTR ACP to crash A complex series of interactions could lead to an aborted transaction causing the RTR ACP to crash on another node where a standby server was being started. + 14-8-309 Partition switching between primary and standby Temporary partition state instability could occur during certain network failures causing application servers that subscribe to SRSTANDBY and SRPRIMARY events to receive repeated events in rapid sequence. + 14-8-310 Transaction on the front-end hangs in the sending state It was possible for a front-end to hang in the sending state when a front-end failed over to the secondary router and then back to the preferred router. o The following changes and corrections have been made for RTR V3.2, ECO5 for all platforms. + 14-1-864 MONITOR GROUP shows bad values in "server act" and "vreq" fields MONITOR GROUP previously showed invalid counts in the "act" and "ack" columns if a secondary rejected a transaction. + 14-1-865 Partition creation not adequately synchronised with journal scan Partitions are now created after the journal scans are complete. + 14-1-893 Partitioned Standby allowed to enter lcl_rec followed by "dual active" During periods of network instability, previous versions of RTR could incorrectly allow roles to become temporarily quorate. 
As a consequence, certain unexpected state transitions could occur, for example standby servers could become active.

+ 14-5-156 Logging partition state transitions

Backend partition state transition logging contained a bug which allowed state transitions to be recorded incorrectly. This has been fixed. Note that this fix addresses the accuracy of the log file entries; it does not increase the number of entries in the RTR log to cover those cases where a backend partition state transition is not logged.

o The following changes and corrections have been made for RTR V3.2, ECO4 for all platforms.

+ 14-1-743 Wrong return status RTR_STS_COMSTAUNO

In RTR V3.2 it was sometimes possible for a transaction which had not yet been voted on by a server which exits in mid-transaction to be aborted with incorrect status RTR_STS_COMSTAUNO.

+ 14-1-805 Attempt to create a partition that already exists returns incorrect status

An attempt to create a partition that already existed used to return the error KRINUSE (key range in use). This has been superseded by the more explicit PRTALREXI (partition already exists).

+ 14-1-813 MONITOR SYSTEM shows WARNING on calls due to invalid time call

The MONITOR SYSTEM monitor picture would sometimes incorrectly display a warning state for the CALL row.

+ 14-1-841 Replayed shadow transaction stuck in VOTED

The implementation of RTR's cooperative recovery protocol algorithm has been enhanced so that some situations which would previously hang during permanent network link outages are now recovered correctly using the remaining connections.

+ 14-1-842 Transactions which do not specify a timeout abort

If a frontend failed over to another router, then failed back to the original router, transactions which were in progress could sometimes be rejected with status RTR_STS_TXTIMOUT.
+ 14-1-845 Transactions remain in VOTED state

In earlier versions of RTR it was occasionally possible for transactions to remain hanging in VOTED state on a shadow primary backend and be aborted with status RTR_STS_FELINLOS on the secondary backend after network link failures in a slow WAN environment.

+ 14-1-846 Transactions remain in SENDING state

In previous versions of RTR it was occasionally possible for a transaction to remain hanging in SENDING state on a backend after a network partition had forced the backend to lose quorum.

+ 14-5-111 Certain RTR commands now recorded in the RTR operator log

Operator log files created by previous versions of RTR could sometimes be difficult to interpret. By recording certain RTR commands, such as START RTR and CREATE FACILITY, the RTR log file has become easier to interpret.

+ 14-5-156 Logging partition state transitions insufficient

Previous versions of RTR did not report backend partition state transitions in the operator log file with sufficient detail. Backend partition state transitions are now reported as follows:

- Previously unlogged state transitions are recorded in the operator log with the new PRTSTATRA message.

- The PRTBEGIN message is no longer generated.

- The PRTCREATED and PRTEND message formats have been changed to match that of the PRTSTATRA messages.

+ 14-8-267 V3.1D-to-V3.2 Journal incompatibility corrected

If you upgrade from V3.1D to earlier versions of V3.2, it was possible to encounter situations which caused the RTR ACP to crash.

+ 14-8-287 Named partition state change caused crash

When using the CREATE PARTITION command, it was possible for RTR to crash on the backend node if the last channel using the partition is closed at the same instant that a state-change message from the router is pending.

o The following correction has been made in RTR V3.2, ECO2 for all platforms.
+ 14-8-268 ACP crash after death of concurrent server Under rare circumstances, after the death of a concurrent server, RTR would try to reschedule a transaction in commit state resulting in an RTR ACP crash. This bug was present in RTR V3.2 and V3.2 ECO1. o The following changes and corrections have been made for RTR V3.2, ECO1 for all platforms. + 14-1-50 Looping RTR process for empty node string, e.g., /NODE=dna. Specifying an incomplete node specification, such as one with only the protocol prefix, e.g., "RTR SHOW RTR /NODE=dna." could cause the RTR process to loop, consuming CPU. + 14-1-433 Show transactions not recovered on link break/reconnect If a secondary shadow backend lost its link to the RTR router after the router had sent a vote request, and the server on the primary shadow accepts the transaction, then in unusual circumstances it was possible that the transaction would not be immediately recovered on the secondary shadow after the link to the router was re-established. In such cases it required a cycle of the servers on the secondary site for the remembered transaction to be recovered from the primary shadow journal. + 14-1-582 ACP access violation If a number of concurrent servers died in sequence while processing the same transaction, then under rare circumstances it was possible the ACP could also abort. This was due to a counter being incremented incorrectly. + 14-1-617 Problems with DUMP JOURNAL In previous versions of RTR, qualifiers which required a value did not generate an error if the value was not supplied or was supplied incorrectly. Incorrect or missing values now generate an error message. If a string of fewer than five characters was passed for partition record class, the partition record counter was not updated and the record was not available. These problems have been fixed by comparing each character instead of five characters at a time. 
+ 14-1-760 ACP crashed when modifying journal size

After a journal had been modified, the Flow Control subsystem of RTR was not properly updated with the new size. This could cause a hang or crash situation even though the journal size was increased to accommodate increased traffic.

+ 14-1-763 rtr_close_channel fails for distributed transaction

Calling rtr_close_channel while a distributed transaction was pending caused an incorrect status to be returned.

+ 14-1-772 CALL CLOSE_CHANNEL defaults to IMMEDIATE

The flag RTR_F_CLO_IMMEDIATE is a new flag added in RTR V3.2 that allows the caller to close a server channel without acknowledging the transaction on the channel. By default, the flag is not set when calling the rtr_close_channel API. However, the /IMMEDIATE qualifier is implicitly present in the RTR CLI version of the API (rtr call rtr_close_channel). Because this is incompatible with the behavior of previous versions of RTR, functionality has been restored to the same as before V3.2. When using the CLI version of the API (rtr call rtr_close_channel), /NOIMMEDIATE is now the default.

+ 14-1-774 TOOMANCHA and distributed transaction left open after rtr_open_channel failure

If rtr_open_channel failed after the RTR ACP had been stopped, then that channel remained available for a subsequent open. The application could eventually run out of channels and return RTR_STS_TOOMANCHA. Now if rtr_open_channel fails after a distributed transaction has been opened, the distributed transaction is always closed.

+ 14-1-777 Transaction state is not getting EXCEPTION after issuing rtr_close/imme

SET PARTITION /RECOVERY_RETRY_COUNT is new functionality implemented in RTR V3.2. The scope of this command was not fully documented, and is clarified here. If an application server dies while processing a transaction recovered from RTR journal, then RTR will present the transaction to another (concurrent or standby) server.
The RECOVERY_RETRY_LIMIT indicates the maximum number of times the transaction should be presented to a server for recovery before being written to the journal as an exception. There are two types of recovery operations where transactions are recovered from journal: local recovery and shadow recovery. Shadow recovery is the process of recovering the remembered transactions written to a primary shadow journal while the secondary shadow site is down. The SET PARTITION /RECOVERY_RETRY_COUNT parameter does not have an effect on remembered transactions recovered during shadow recovery. That is, if there is a killer transaction remembered in the journal on a primary shadow node, on this node RTR does not count the number of times the transaction is recovered by a recovering secondary shadow node. To ensure that a remembered transaction will be exceptioned by RTR, you must start a sufficient number of concurrent servers on the recovering secondary shadow node. For this reason, RTR recommends that the number of concurrent secondary shadow servers started be greater than the value set for the RECOVERY_RETRY_LIMIT on a partition. This will ensure that a remembered (killer) transaction being recovered from a primary shadow journal will be exceptioned if the retry limit is exceeded. Only those transactions that have reached voting stage on a server can be exceptioned. If a server always dies before voting on a transaction, then the transaction will be aborted by RTR after the third try. This is a hard-coded limit (the so called "three strikes and you're out" feature). + 14-1-791 Backends incorrectly remain inquorate after routers trimmed In versions V3.1D, ECO14, and V3.2 of RTR it was sometimes possible for nodes to incorrectly remain inquorate following a TRIM FACILITY operation. + 14-1-792 Revised rtrreq.c and rtrsrv.c sample RTR applications The sample client and server used in the IVP have been extensively revised. 
Please pay special attention to the comments which explain how to write a wakeup handler, and comments drawing attention to several common programming mistakes we have seen in RTR applications.

+ 14-3-291 SHOW SERVER truncates shd_rec_icpl to shd_rec_ic

Some of the values previously truncated by the brief SHOW SERVER command are now displayed more fully.

+ 14-3-298 Application may crash if invoked before RTR after a reboot

Normally the RTR executable must have been invoked at least once since reboot before an RTR application can be started. If an RTR application is invoked first, the first RTR API call now always returns RTRNOTSTA, RTR not started.

+ 14-7-420 IOS tid on IP only nodes is not unique

Using previous versions of RTR, if you ran client applications that used the RTR V2 API on systems that had DECnet disabled, then there was a remote possibility that the same transaction identifier could be generated on two such systems, if RTR was started on both systems within milliseconds of each other.

+ 14-8-215 Faster loading of large journals on first CREATE FACILITY

RTR now takes much less time to load journals containing a large number of journaled transactions.

+ 14-8-257 The broadcast message was not delivered from BE to client

If a frontend loses the connection to its original router, and is the first frontend to connect to the router it fails over to, then the frontend may stop receiving broadcasts. Further, backends could also fail to receive broadcasts delivered by routers added to a facility after the server applications have started.

+ 14-8-262 RTR has both backends as primary for some transactions (STR#1885690)

In a partitioned network situation (when each of two routers have access to only half of the backend nodes), RTR will choose the router with the lower network address as the one that remains or becomes active.
In previous versions of RTR, this would sometimes result in both sets of backends becoming active, due to a problem with the network ID comparison algorithm.

The following changes and corrections have been made for RTR V3.2, ECO1 for the Compaq Tru64 UNIX platform.

+ 14-3-190 Signals blocked in unthreaded UNIX applications during RTR API calls

RTR now enables the usual termination signals during RTR API calls. For example, an idle server RTR application waiting in rtr_receive_message with no timeout will now respond to Control-C.

+ 14-3-300 Terminated RTR application process that used fork is still shown by RTR

RTR applications now have FD_CLOEXEC set for the IPC sockets used to communicate with the RTR ACP, so that these do not remain open in a child process after fork and exec even after the parent process has terminated. This means that the RTR ACP now notices when the parent exits, and will not accumulate a wait queue of broadcast messages or delay failover. The terminated process no longer appears in RTR SHOW PROCESS.

+ 14-7-952 BADROWCOL and escape sequences visible on dumb or unknown terminal

The default VT100-style terminal escape sequences can now be completely suppressed with a suitable TERMCAP environment variable setting. It is still necessary to set a non-zero window size to avoid BADROWCOL, for example:

    stty rows 48 cols 120
    TERM=dumb
    TERMCAP="dumb:cm=:do=:le=:nd=:up=:ks=:ke=:cl=:ce=:ho=:mb=:md=:mr=:us=:ue=:me=:cr=:bl=:"

This is particularly useful when running RTR in an Emacs shell window, and gives reasonably clean output for all RTR commands except MONITOR.

Known Problems with Workarounds:

o The following restrictions were described in previous release notes and are still applicable to RTR.

+ 14-8-318 Maximum journal size

The RTR journal file is limited to 524287 blocks on a single disk. If you want to create a larger journal file, you have to distribute the journal across more than one disk.
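
As a planning aid for the journal size limit above, the arithmetic can be sketched in a few lines of Python. This is only an illustration and not part of RTR: the function names are hypothetical, and the 512-byte block size is an assumption inferred from the kit header (10600 blocks = 5427200 bytes).

```python
# Rough sizing helper for the per-disk journal limit in 14-8-318.
# Hypothetical names; not an RTR utility.
MAX_BLOCKS_PER_DISK = 524287   # limit stated in the release note
BYTES_PER_BLOCK = 512          # assumption: 5427200 bytes / 10600 blocks


def disks_needed(total_blocks):
    """Return the minimum number of disks for a journal of `total_blocks`."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_blocks // MAX_BLOCKS_PER_DISK)


def max_bytes_per_disk():
    """Return the per-disk journal limit in bytes under the block-size assumption."""
    return MAX_BLOCKS_PER_DISK * BYTES_PER_BLOCK


if __name__ == "__main__":
    for blocks in (100000, 524287, 524288, 1000000):
        print(blocks, "blocks ->", disks_needed(blocks), "disk(s)")
    print("per-disk limit:", max_bytes_per_disk(), "bytes")
```

Under the 512-byte assumption the per-disk limit is just under 256 MB, so a journal of one million blocks would need to be spread over at least two disks.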
+ 14-1-39 Declaring exit handlers in RTR applications

If an exit handler contains calls to RTR, then the exit handler must be declared after the first call to RTR. Using the RTR V2 or V3 API, if the exit handler is declared before the first call to RTR, then any call to RTR made within the exit handler will return an error. Under the V3 API, the error status returned is RTR_STS_INVCHANNEL. Under the V2 API, the error status returned is RTR$_INVALCH.

+ 14-1-103 Using rtr_set_wakeup() in a threaded program

After calling rtr_set_wakeup() in a threaded program, you should also call rtr_set_wakeup(NULL) wherever your program can exit. This will prevent any wakeups in other threads while the main thread is already running the RTR exit handler, which could lead to a server core dump when trying to stop the server.

+ 14-1-263 Non-English character sets are not supported for identifiers

The supported character set for RTR identifiers such as facility names is ASCII, with lowercase and uppercase letters equivalent. Eight bit characters are not supported because the name might not interoperate with RTR processes using a different locale or running another RTR version.

+ 14-1-419 SPUJOUFIL advice to CREATE JOURNAL/SUPERSEDE is dangerous

If the operator copies journal files or copies disks containing journal files without first remounting the source disk read-only, then these are spurious because RTR sees duplicates that it did not create. RTR then displays the SPUJOUFIL message, which advises the operator to use CREATE JOURNAL/SUPERSEDE to destroy the original and all copies of the journal files, and all the transactions contained in them on that node, and then submit an SPR for something that is not in fact an RTR problem. This is not the correct action in situations like this.
The operator should examine the log file, which shows the duplicate filenames, and then move any unwanted duplicate copies of journal files to anywhere other than a rtrjnl directory at the top level of a writable disk file system visible to RTR, and then try again. Only if SPUJOUFIL is caused by circumstances other than operator intervention should the operator consider making backup copies of the journal files, and only then abandoning the existing journal files and any transactions contained in them by using DELETE JOURNAL and CREATE JOURNAL, or the equivalent CREATE JOURNAL/SUPERSEDE.

+ 14-1-455 Last line of batch procedure sometimes ignored

The last line of a batch procedure or command file must explicitly end with a line terminator, added by pressing the Enter/Return key when creating the procedure. Without the explicit line terminator, RTR ignores the line. The workaround is to add a comment to the end of the file or to explicitly add a line terminator to the end of the last line of the batch procedure.

+ 14-1-462 MODIFY JOURNAL with list of devices does not give individual error messages

MODIFY JOURNAL now lists only the devices that were successfully modified. If some disk devices cannot be modified because they do not contain a journal file at all, then nothing at all is reported for those devices. The workaround is to identify the omitted devices by comparing the command parameters and the messages, or modify them one device at a time. Verify the modification with SHOW JOURNAL /FILES /FULL.

+ 14-1-604 Combined local and remote SET MODE/GROUP

The following combined local and remote command does not perform the remote part of the command:

    RTR> SET MODE/GROUP=newgroup/NODE=(THISNODE,ANOTHERNODE)

After recording a different new group or no group setting in shared memory the comserver has to disconnect itself, and does so immediately.
When the comserver is disconnected the SRVDISCON message is shown, and the remote part of this command, together with any other pending commands such as those issued by the same user in other windows, is aborted immediately. A workaround is to issue the local SET MODE/GROUP command separately first. + 14-1-681 ACP crash in RDM subsystem It has been observed that with a large number of concurrently active transactions, where each transaction sends back a large number of very large replies, it is possible to exhaust the virtual memory requirements of RTR in order to store the replies for possible recovery after a link glitch. This would cause RTR to crash on a backend node. The workaround is to reduce the number of concurrent server channels, so as to limit the number of concurrently active transactions on each backend. Another possibility would be to limit the number of replies per transaction. In our tests we were able to exceed the RTR limits using 10 servers, each sending back 200 replies of 64KB each. + 14-1-769 System time must not be set backwards Correct operation of RTR is not guaranteed if the time is not monotonic. When performing Year 2000 testing, stop all RTR processes and remove any journalled transactions before setting the clock back. When configuring the Network Time Protocol daemon ntp, do not select, or accept by default, any options which may result in the system clock being set backwards. While RTR is running as a service on NT, do not allow the time to be set backwards by ntp or in login scripts. On OpenVMS systems, do not change to a different time zone, or to or from Summer Time or Daylight Savings Time, while using current versions of RTR which are built on OpenVMS V6 to run on both OpenVMS V6 and V7. On most UNIX systems it is in any case not recommended that you change the date and time when running in multi-user mode. Adjustments are normally made by speeding up or slowing down the clock. 
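
To make the monotonic-time requirement concrete, the following Python fragment shows one way an operator script might scan a series of recorded wall-clock readings for a backward step. It is a sketch only: first_backward_step is a hypothetical helper, not an RTR facility, and the readings are assumed to be numeric timestamps (for example, seconds since the epoch) collected by some other means.

```python
# Hypothetical check for a backward clock step in recorded timestamps.
# Not an RTR facility; the input is assumed to be numeric readings
# in the order in which they were taken.
def first_backward_step(samples):
    """Return the index of the first reading that is earlier than the
    reading before it, or None if the readings never go backwards."""
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1]:
            return i
    return None


if __name__ == "__main__":
    readings = [971222400.0, 971222401.5, 971222399.0, 971222405.0]
    step = first_backward_step(readings)
    if step is None:
        print("clock readings are monotonic")
    else:
        print("clock went backwards at reading", step)
```

Equal consecutive readings are accepted here, since the note only warns against the time being set backwards.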
+ 14-1-948 Use of /MAXIMUM_BLOCKS qualifier

Use of the /MAXIMUM_BLOCKS qualifier on the CREATE JOURNAL or MODIFY JOURNAL commands may cause RTR to crash. Please use the /BLOCKS qualifier instead.

+ 14-3-33 Partition limit per facility now 500

The previous releases supported only up to 100 partitions per facility. The current release increases this to 500 but is not extendable beyond that.

+ 14-3-50 Maximum number of application processes limit

An ACP crash that occurred when starting the last of a great many applications has been corrected. When the process open file limit is reached, the application will now generally report ACPNOTVIA, "RTR ACP is no longer a viable entity, restart RTR". In actual fact the ACP continues to operate with all previously connected processes, and only the new rejected process thinks that the RTR ACP is not alive. This message should be interpreted as "ACPINSRES, The RTR ACP has insufficient resources". Please ensure that your system is configured with sufficient default per-process resources, or that the ACP process is started with increased resource limits. Allow at least one open file for each additional application process, and at least one for each link.

+ 14-7-24 Transaction size limits

The number of bytes in any application message (that is, a message sent with the rtr_send_to_server(), rtr_reply_to_client() or rtr_broadcast_event() routines) is currently restricted to 64000.

The number of messages sent (that is, using rtr_send_to_server()) in any single transaction is limited to 65534. There is no fixed limit on the number of replies (that is, sent with rtr_reply_to_client()) in any single transaction.

Problems Addressed in the RTR320_266 Kit:

The following changes and corrections have been made for RTR V3.2, ECO7 for all platforms.

o 14-3-252 ACP failed with insufficient memory

Previous versions of RTR could allocate too much memory for replay queues.
When insufficient virtual memory was available, this behavior could cause the ACP to fail. RTR V2 compatible functionality has been reintroduced, in that a replay queue that grows too large is now discarded.

o 14-3-270 ACP crashed with INSVIRMEM while trying to abort a very large transaction

This problem was caused by overlarge transaction replay queues and has been corrected.

o 14-3-340 Standby partition fails to take over on network partitioning

Previously, it was possible that a standby partition would fail to take over if network partitioning occurred.

The following changes and corrections have been made for RTR V3.2, ECO6 for all platforms.

o 14-1-929 Unable to suspend a partition on a secondary site

It was possible to encounter a situation where you could not suspend transaction presentation on a partition on a secondary site. If you gave the suspend command at a time when the secondary site only had transactions awaiting assignment of a server channel, then the command would not complete, leaving transaction presentation in the 'suspending' state.

o 14-1-940 Partition switching between standby and active

State instability could result if instances of a partition were assigned an equal priority. This could happen through use of SET PARTITION commands, or as a result of network partitioning.

o 14-5-157 Improved diagnostics

Improved diagnostics have been added for testing network-related problems.

o 14-8-304 Aborted transaction caused the RTR ACP to crash

A complex series of interactions could lead to an aborted transaction causing the RTR ACP to crash on another node where a standby server was being started.

o 14-8-309 Partition switching between primary and standby

Temporary partition state instability could occur during certain network failures causing application servers that subscribe to SRSTANDBY and SRPRIMARY events to receive repeated events in rapid sequence.
o 14-8-310 Transaction on the front-end hangs in the sending state It was possible for a front-end to hang in the sending state when a front-end failed over to the secondary router and then back to the preferred router. The following changes and corrections have been made for RTR V3.2, ECO5 for all platforms. o 14-1-864 MONITOR GROUP shows bad values in "server act" and "vreq" fields MONITOR GROUP previously showed invalid counts in the "act" and "ack" columns if a secondary rejected a transaction. o 14-1-865 Partition creation not adequately synchronised with journal scan Partitions are now created after the journal scans are complete. o 14-1-893 Partitioned Standby allowed to enter lcl_rec followed by "dual active" During periods of network instability, previous versions of RTR could incorrectly allow roles to become temporarily quorate. As a consequence, certain unexpected state transitions could occur, for example standby servers could become active. o 14-5-156 Logging partition state transitions Backend partition state transition logging contained a bug which allowed state transitions to be recorded incorrectly. This has been fixed. Note that this fix addresses the accuracy of the log file entries; it does not increase the number of entries in the RTR log to cover those cases where a backend partition state transition is not logged. Problems Addressed in the RTR320_255 Kit: The following changes and corrections have been made for RTR V3.2, ECO4 for all platforms. o 14-1-743 Wrong return status RTR_STS_COMSTAUNO In RTR V3.2 it was sometimes possible for a transaction which had not yet been voted on by a server which exits in mid-transaction to be aborted with incorrect status RTR_STS_COMSTAUNO. o 14-1-805 Attempt to create a partition that already exists returns incorrect status An attempt to create a partition that already existed used to return the error KRINUSE (key range in use). 
This has been superseded by the more explicit PRTALREXI (partition already exists). o 14-1-813 MONITOR SYSTEM shows WARNING on calls due to invalid time call The MONITOR SYSTEM monitor picture would sometimes incorrectly display a warning state for the CALL row. o 14-1-841 Replayed shadow transaction stuck in VOTED The implementation of RTR's cooperative recovery protocol algorithm has been enhanced so that some situations which would previously hang during permanent network link outages are now recovered correctly using the remaining connections. o 14-1-842 Transactions which do not specify a timeout abort If a frontend failed over to another router, then failed back to the original router, transactions which were in progress could sometimes be rejected with status RTR_STS_TXTIMOUT. o 14-1-845 Transactions remain in VOTED state In earlier versions of RTR it was occasionally possible for transactions to remain hanging in VOTED state on a shadow primary backend and be aborted with status RTR_STS_FELINLOS on the secondary backend after network link failures in a slow WAN environment. o 14-1-846 Transactions remain in SENDING state In previous versions of RTR it was occasionally possible for a transaction to remain hanging in SENDING state on a backend after a network partition had forced the backend to lose quorum. o 14-5-111 Certain RTR commands now recorded in the RTR operator log Operator log files created by previous versions of RTR could sometimes be difficult to interpret. By recording certain RTR commands, such as START RTR and CREATE FACILITY, the RTR log file has become easier to interpret. o 14-5-156 Logging partition state transitions insufficient Previous versions of RTR did not report backend partition state transitions in the operator log file with sufficient detail. Backend partition state transitions are now reported as follows: - Previously unlogged state transitions are recorded in the operator log with the new PRTSTATRA message. 
- The PRTBEGIN message is no longer generated. - The PRTCREATED and PRTEND message formats have been changed to match that of the PRTSTATRA messages. o 14-8-267 V3.1D-to-V3.2 Journal incompatibility corrected If you upgrade from V3.1D to earlier versions of V3.2, it was possible to encounter situations which caused the RTR ACP to crash. o 14-8-287 Named partition state change caused crash When using the CREATE PARTITION command, it was possible for RTR to crash on the backend node if the last channel using the partition is closed at the same instant that a state-change message from the router is pending. Problems Addressed in the RTR320_249 Kit: o 14-8-268 ACP crash after death of concurrent server Under rare circumstances, after the death of a concurrent server, RTR would try to reschedule a transaction in commit state resulting in an RTR ACP crash. This bug was present in RTR V3.2 and V3.2 ECO1. o 14-1-50 Looping RTR process for empty node string, e.g., /NODE=dna. Specifying an incomplete node specification, such as one with only the protocol prefix, e.g., "RTR SHOW RTR /NODE=dna." could cause the RTR process to loop, consuming CPU. o 14-1-433 Show transactions not recovered on link break/reconnect If a secondary shadow backend lost its link to the RTR router after the router had sent a vote request, and the server on the primary shadow accepts the transaction, then in unusual circumstances it was possible that the transaction would not be immediately recovered on the secondary shadow after the link to the router was re-established. In such cases it required a cycle of the servers on the secondary site for the remembered transaction to be recovered from the primary shadow journal. o 14-1-582 ACP access violation If a number of concurrent servers died in sequence while processing the same transaction, then under rare circumstances it was possible the ACP could also abort. This was due to a counter being incremented incorrectly and has now been fixed. 
o 14-1-617 Problems with DUMP JOURNAL In previous versions of RTR, qualifiers which required a value did not generate an error if the value was not supplied or was supplied incorrectly. Incorrect or missing values now generate an error message. If a string of less than five characters was passed for partition record class, the partition record counter was not updated and the record was not available. These problems have been fixed by comparing each character instead of five characters at a time. o 14-1-760 ACP crashed when modifying journal size After a journal had been modified, the Flow Control subsystem of RTR was not properly updated with the new size. This could result in a hang or crash situation even though the journal size was increased to accommodate increased traffic. o 14-1-763 RTR_CLOSE_CHANNEL fails for distributed transaction Calling RTR_CLOSE_CHANNEL while a distributed transaction was pending caused an incorrect status to be returned. o 14-1-772 CALL CLOSE_CHANNEL defaults to IMMEDIATE The flag RTR_F_CLO_IMMEDIATE is a new flag added in RTR V3.2 that allows the caller to close a server channel without acknowledging the transaction on the channel. By default, the flag is not set when calling the RTR_CLOSE_CHANNEL API. However, the /IMMEDIATE qualifier is implicitly present in the RTR CLI version of the API (RTR call RTR_CLOSE_CHANNEL). Because this is incompatible with the behavior of previous versions of RTR, functionality has been restored to the same as before V3.2. When using the CLI version of the API (RTR call RTR_CLOSE_CHANNEL), /NOIMMEDIATE is now the default. o 14-1-774 TOOMANCHA and distributed transaction left open after RTR_OPEN_CHANNEL failure If RTR_OPEN_CHANNEL failed after the RTR acp had been stopped, then that channel remained available for a subsequent open. The application could eventually run out of channels and return RTR_STS_TOOMANCHA. 
Now if RTR_OPEN_CHANNEL fails after a distributed transaction has been opened, the distributed transaction is always closed. o 14-1-777 Transaction state is not getting EXCEPTION after issuing RTR_CLOSE/IMME SET PARTITION /RECOVERY_RETRY_COUNT is new functionality implemented in RTR V3.2. The scope of this command was not fully documented, and is clarified here. If an application server dies while processing a transaction recovered from RTR journal, then RTR will present the transaction to another (concurrent or standby) server. The RECOVERY_RETRY_LIMIT indicates the maximum number of times the transaction should be presented to a server for recovery before being written to the journal as an exception. There are two types of recovery operations where transactions are recovered from journal: local recovery and shadow recovery. Shadow recovery is the process of recovering the remembered transactions written to a primary shadow journal while the secondary shadow site is down. The SET PARTITION /RECOVERY_RETRY_COUNT parameter does not have an effect on remembered transactions recovered during shadow recovery. That is, if there is a killer transaction remembered in the journal on a primary shadow node, on this node RTR does not count the number of times the transaction is recovered by a recovering secondary shadow node. The way to ensure that a remembered transaction will be exceptioned by RTR is by starting a sufficient number of concurrent servers on the recovering secondary shadow node. For this reason, RTR recommends that the number of concurrent secondary shadow servers started is greater than the value set for the RECOVERY_RETRY_LIMIT on a partition. This will ensure that a remembered (killer) transaction being recovered from a primary shadow journal will be exceptioned if the retry limit is exceeded. Only those transactions that have reached voting stage on a server can be exceptioned. 
If a server always dies before voting on a transaction, then the transaction will be aborted by RTR after the third try. This is a hard-coded limit (the so called "three strikes and you're out" feature). o 14-1-791 Backends incorrectly remain inquorate after routers trimmed In versions V3.1D - ECO14 and V3.2 of RTR it was sometimes possible for nodes to erroneously remain inquorate following a TRIM FACILITY operation. o 14-1-792 Revised rtrreq.c and rtrsrv.c sample RTR applications The sample client and server used in the IVP have been extensively revised. Please pay special attention to the comments which explain how to write a wakeup handler, and comments drawing attention to several common programming mistakes we have seen in RTR applications. o 14-3-291 SHOW SERVER truncates shd_rec_icpl to shd_rec_ic Some of the values previously truncated by the brief SHOW SERVER command are now displayed more fully. o 14-3-298 Application may crash if invoked before RTR after a reboot Normally the RTR executable must have been invoked at least once since reboot before an RTR application can be started. If an RTR application is invoked first, the first RTR api call now always returns RTRNOTSTA, RTR not started. o 4-7-420 IOS tid on IP only nodes is not unique Using previous versions of RTR, if you ran client applications that used the RTR V2 API on systems that had DECnet disabled, then there was a remote possibility that the same transaction identifier could be generated on two such systems if RTR was started on both systems within milliseconds of each other. o 14-8-215 Faster loading of large journals on first CREATE FACILITY RTR now takes much less time to load journals containing a large number of journaled transactions. o 14-8-257 The broadcast message was not delivered from BE to client If a frontend loses the connection to its original router, and is the first frontend to connect to the router it fails over to, then the frontend may stop receiving broadcasts. 
Further, backends could also fail to receive broadcasts delivered by routers added to a facility after the server applications have started.

o 14-8-262 RTR has both backends as primary for some transactions (STR#1885690)
  In a partitioned network situation (when each of two routers has access to only half of the backend nodes), RTR will choose the router with the lower network address as the one that remains or becomes active. In previous versions of RTR, this would sometimes result in both sets of backends becoming active, due to a problem with the network ID comparison algorithm.

o 14-3-190 Signals blocked in unthreaded UNIX applications during RTR API calls
  RTR now enables the usual termination signals during RTR API calls. For example, an idle server RTR application waiting in rtr_receive_message with no timeout will now respond to Control-C.

o 14-3-300 Terminated RTR application process that used fork is still shown by RTR
  RTR applications now have FD_CLOEXEC set for the IPC sockets used to communicate with the RTR ACP, so that these do not remain open in a child process after fork and exec even after the parent process has terminated. This means that the RTR ACP now notices when the parent exits, and will not accumulate a wait queue of broadcast messages or delay failover. The terminated process no longer appears in RTR SHOW PROCESS.

o 14-7-952 BADROWCOL and escape sequences visible on dumb or unknown terminal
  The default VT100-style terminal escape sequences can now be completely suppressed with a suitable TERMCAP environment variable setting. It is still necessary to set a non-zero window size to avoid BADROWCOL, for example:

        stty rows 48 cols 120
        TERM=dumb
        TERMCAP="dumb:cm=:do=:le=:nd=:up=:ks=:ke=:cl=:ce=:ho=:mb=:md=:mr=:us=:ue=:me=:cr=:bl=:"

  This is particularly useful when running RTR in an Emacs shell window, and gives reasonably clean output for all RTR commands except MONITOR.
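The close-on-exec behavior described under 14-3-300 above can be illustrated with standard POSIX calls. The following is a minimal sketch, not RTR's own code: the helper names set_cloexec, has_cloexec and open_ipc_pair are hypothetical, and a local socket pair merely stands in for the IPC sockets that RTR manages internally.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Mark a descriptor close-on-exec while preserving its other
 * descriptor flags. Returns 0 on success, -1 on failure. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Report whether close-on-exec is set: 1 = yes, 0 = no, -1 = error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Hypothetical stand-in for an IPC connection: create a local socket
 * pair and mark both ends close-on-exec, so that a child created with
 * fork and exec does not keep them open after the parent exits. */
int open_ipc_pair(int sv[2])
{
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (set_cloexec(sv[0]) == -1 || set_cloexec(sv[1]) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    return 0;
}
```

An application that opens its own descriptors before calling fork and exec could apply the same set_cloexec helper to them, so that the child does not hold the parent's connections open.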
Known Problems with Workarounds

The following restrictions were described in previous release notes and are still applicable to RTR.

o 14-1-39 Declaring exit handlers in RTR applications
  If an exit handler contains calls to RTR, then the exit handler must be declared after the first call to RTR. Using the RTR V2 or V3 API, if the exit handler is declared before the first call to RTR, then any call to RTR made within the exit handler will return an error. Under the V3 API, the error status returned is RTR_STS_INVCHANNEL. Under the V2 API, the error status returned is RTR$_INVALCH.

o 14-1-103 Using RTR_SET_WAKEUP() in a threaded program
  After calling RTR_SET_WAKEUP() in a threaded program, you should also call RTR_SET_WAKEUP(NULL) wherever your program can exit. This will prevent any wakeup in other threads while the main thread is already running the RTR exit handler, which could lead to a server core dump when trying to stop the server.

o 14-1-263 Non-English character sets are not supported for identifiers
  The supported character set for RTR identifiers such as facility names is ASCII, with lowercase and uppercase letters equivalent. Eight-bit characters are not supported because the name might not be interpreted correctly by RTR processes using a different locale or running another RTR version.

o 14-1-419 SPUJOUFIL advice to CREATE JOURNAL/SUPERSEDE is dangerous
  If the operator copies journal files or copies disks containing journal files without first remounting the source disk read-only, then the copies are spurious: RTR sees duplicate journal files that it did not create. RTR then displays the SPUJOUFIL message, which advises the operator to use CREATE JOURNAL/SUPERSEDE to destroy the original and all copies of the journal files, and all the transactions contained in them on that node, and then submit an SPR for something that is not in fact an RTR problem. This is not the correct action in situations like this.
The operator should examine the log file, which shows the duplicate filenames, and then move any unwanted duplicate copies of journal files to anywhere other than a RTRJNL directory at the top level of a writable disk file system visible to RTR, and then try again. Only if SPUJOUFIL is caused by circumstances other than operator intervention should the operator consider making backup copies of the journal files, and only then abandoning the existing journal files and any transactions contained in them by using DELETE JOURNAL and CREATE JOURNAL, or the equivalent CREATE JOURNAL/SUPERSEDE.

o 14-1-455 Last line of batch procedure sometimes ignored
  The last line of a batch procedure or command file must explicitly end with a line terminator, added by pressing the Enter/Return key when creating the procedure. Without the explicit line terminator, RTR ignores the line. The workaround is to add a comment to the end of the file or to explicitly add a line terminator to the end of the last line of the batch procedure.

o 14-1-462 MODIFY JOURNAL with list of devices does not give individual error messages
  MODIFY JOURNAL now lists only the devices that were successfully modified. If some disk devices cannot be modified because they do not contain a journal file, nothing is reported for those devices. The workaround is to identify the omitted devices by comparing the command parameters and the messages, or to modify them one device at a time. Verify the modification with SHOW JOURNAL /FILES /FULL.

o 14-1-604 Combined local and remote SET MODE/GROUP
  The following combined local and remote command does not perform the remote part of the command:

        RTR> SET MODE/GROUP=newgroup/NODE=(THISNODE,ANOTHERNODE)

  After recording a different new group or no group setting in shared memory the comserver has to disconnect itself, and does so immediately.
When the comserver is disconnected the SRVDISCON message is shown, and the remote part of this command, together with any other pending commands such as those issued by the same user in other windows, is aborted immediately. A workaround is to issue the local SET MODE/GROUP command separately first. o 14-1-681 ACP crash in RDM subsystem It has been observed that with a large number of concurrently active transactions, where each transaction sends back a large number of very large replies, it is possible to exhaust the virtual memory requirements of RTR in order to store the replies for possible recovery after a link glitch. This would cause RTR to crash on a backend node. The workaround is to reduce the number of concurrent server channels, so as to limit the number of concurrently active transactions on each backend. Another possibility would be to limit the number of replies per transaction. In our tests we were able to exceed the RTR limits using 10 servers, each sending back 200 replies of 64KB each. o 14-1-769 System time must not be set backwards Correct operation of RTR is not guaranteed if the time is not monotonic. When performing Year 2000 testing, stop all RTR processes and remove any journaled transactions before setting the clock back. When configuring the Network Time Protocol daemon ntp, do not select, or accept by default, any options which may result in the system clock being set backwards. While RTR is running as a service on NT, do not allow the time to be set backwards by ntp or in login scripts. On OpenVMS systems, do not change to a different time zone, or to or from Summer Time or Daylight Savings Time, while using current versions of RTR which are built on OpenVMS V6 to run on both OpenVMS V6 and V7. On most UNIX systems it is in any case not recommended that you change the date and time when running in multi-user mode. Adjustments are normally made by speeding up or slowing down the clock. 
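To make the sizing concern in 14-1-681 above concrete, the figures quoted from the test (10 servers, 200 replies per transaction, 64KB per reply) can be multiplied out. This is a rough sketch under stated assumptions: reply_backlog_bytes is a hypothetical helper, it assumes one active transaction per server channel, and it takes each reply to be 64000 bytes (the application message limit given under 14-7-24). It ignores any per-message overhead inside RTR.

```c
#include <stdint.h>

/* Rough upper bound on the reply data that may need to be retained
 * for recovery: one active transaction per server channel is assumed,
 * and RTR's internal per-message overhead is not counted. */
uint64_t reply_backlog_bytes(uint64_t servers,
                             uint64_t replies_per_tx,
                             uint64_t bytes_per_reply)
{
    return servers * replies_per_tx * bytes_per_reply;
}
```

With the test figures, reply_backlog_bytes(10, 200, 64000) gives 128,000,000 bytes. Halving either the number of concurrent server channels or the number of replies per transaction halves this estimate, which is the effect both workarounds rely on.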
o 14-3-33 Partition limit per facility now 500 The previous releases supported only up to 100 partitions per facility. The current release increases this to 500 but is not extendable beyond that. o 14-3-50 Maximum number of application processes limit An ACP crash that occurred when starting the last of a great many applications has been corrected. When the process open file limit is reached, the application will now generally report ACPNOTVIA, "RTR ACP is no longer a viable entity, restart RTR". In actual fact the ACP continues to operate with all previously connected processes, and only the new rejected process thinks that the RTR ACP is not alive. This message should be interpreted as "ACPINSRES, the RTR ACP has insufficient resources". Please ensure that your system is configured with sufficient default per-process resources, or that the ACP process is started with increased resource limits. Allow at least one open file for each additional application process, and at least one for each link. o 14-3-207 Client application in questionable flow control when RTR journal fills There is a known issue with flow control when the journal starts filling up. There is a race condition where, if the client can send more data than can be placed in the journal before flow control kicks in, then the transaction is aborted with the correct error notification. However, if flow control kicks in first, then a deadlock occurs where the journal space never frees up and hence RTR does not allow the client to proceed with the transaction. There are two workarounds: either specify a timeout with the transaction, or increase the size of the journal. o 14-3-253 Restrictions on the RTR wakeup handler The use of RTR_REPLY_TO_CLIENT, RTR_SEND_TO_SERVER, or RTR_BROADCAST_EVENT in an RTR wakeup handler is not recommended. They may block when they need transaction ids or flow control. This will cause undesired behavior. 
Functions permitted in an RTR_SET_WAKEUP() handler:

  - In an RTR wakeup handler in an AST in an unthreaded OpenVMS application, the use of RTR_REPLY_TO_CLIENT(), RTR_SEND_TO_SERVER(), RTR_BROADCAST_EVENT(), or RTR_RECEIVE_MESSAGE() with a non-zero timeout is not recommended. They may block when they need transaction ids or flow control, which will cause the whole application to hang until the wakeup completes.

  - The same rules apply in an RTR wakeup handler in a threaded application. Note that wakeups are unnecessary in a threaded paradigm, but they may be used in common code in applications that also need to run on OpenVMS. Please note that your mainline code continues to run while your wakeup is executing, so extra synchronization may be required. Also note that if the wakeup does block then it does not generally hang the whole application.

  - In an RTR wakeup handler in a signal in an unthreaded UNIX application, no RTR API functions and only the very few async-signal-safe system and library functions may be called, because the wakeup is performed in a signal handler context. An application can write to a pipe or access a volatile sig_atomic_t variable, but using malloc() and printf(), for example, will cause unexpected failures. Alternatively, on most UNIX platforms, you can compile and link the application as a threaded application with the reentrant RTR shared library -lrtr_r.

  For maximum portability, the wakeup handler should do the minimum necessary to wake up the mainline event loop. You should assume that mainline code and other threads might continue to run in parallel with the wakeup, especially on machines with more than one CPU.

o 14-7-24 Transaction size limits
  The number of bytes in any application message (that is, a message sent with the RTR_SEND_TO_SERVER(), RTR_REPLY_TO_CLIENT() or RTR_BROADCAST_EVENT() routines) is currently restricted to 64000.
The number of messages sent (that is, using RTR_SEND_TO_SERVER() ) in any single transaction is limited to 65534. There is no fixed limit on the number of replies (that is, sent with RTR_REPLY_TO_CLIENT() ) in any single transaction. INSTALLATION NOTES: The Reliable Transaction Router V3.2 ECO8 installation procedure is the same as the installation procedure for RTR V3.2. Refer to the Installation Guide for further information.

Files on this server are as follows:

rtr320_268_us.README
rtr320_268_us.CHKSUM
rtr320_268_us.CVRLET_TXT
rtr320_268_us.tar