ECO NUMBER: RTRAVME07032
-----------
PRODUCT: Reliable Transaction Router for OpenVMS
--------
UPDATED PRODUCT: Reliable Transaction Router for OpenVMS 3.2
----------------
APPRX BLCK SIZE: 14496
----------------

COMPAQ Cover Letter

Reliable Transaction Router Version 3.2 ECO7 for OpenVMS Alpha

Problem Description
-------------------

The following changes and corrections have been made for RTR V3.2, ECO7 for all platforms.

o 14-3-252 ACP failed with insufficient memory

  Previous versions of RTR could allocate too much memory for replay queues. When insufficient virtual memory was available, this behavior could cause the ACP to fail. RTR V2 compatible functionality has been reintroduced, in that a replay queue that grows too large is now discarded.

o 14-3-270 ACP crashed with INSVIRMEM while trying to abort a very large transaction

  This problem was caused by overlarge transaction replay queues and has been corrected.

o 14-3-340 Standby partition fails to take over on network partitioning

  Previously, it was possible that a standby partition would fail to take over if network partitioning occurred.

The following changes and corrections have been made for RTR V3.2, ECO6 for all platforms.

o 14-1-929 Unable to suspend a partition on a secondary site

  It was possible to encounter a situation where you could not suspend transaction presentation on a partition on a secondary site. If you gave the suspend command at a time when the secondary site only had transactions awaiting assignment of a server channel, then the command would not complete, leaving transaction presentation in the 'suspending' state.

o 14-1-940 Partition switching between standby and active

  State instability could result if instances of a partition were assigned an equal priority. This could happen through use of SET PARTITION commands, or as a result of network partitioning.

o 14-5-157 Improved diagnostics

  Improved diagnostics have been added for testing network-related problems.

o 14-8-304 Aborted transaction caused the RTR ACP to crash

  A complex series of interactions could lead to an aborted transaction causing the RTR ACP to crash on another node where a standby server was being started.

o 14-8-309 Partition switching between primary and standby

  Temporary partition state instability could occur during certain network failures causing application servers that subscribe to SRSTANDBY and SRPRIMARY events to receive repeated events in rapid sequence.

o 14-8-310 Transaction on the front-end hangs in the sending state

  It was possible for a front-end to hang in the sending state when a front-end failed over to the secondary router and then back to the preferred router.

The following changes and corrections have been made for RTR V3.2, ECO5 for all platforms.

o 14-1-864 MONITOR GROUP shows bad values in "server act" and "vreq" fields

  MONITOR GROUP previously showed invalid counts in the "act" and "ack" columns if a secondary rejected a transaction.

o 14-1-865 Partition creation not adequately synchronised with journal scan

  Partitions are now created after the journal scans are complete.

o 14-1-893 Partitioned Standby allowed to enter lcl_rec followed by "dual active"

  During periods of network instability, previous versions of RTR could incorrectly allow roles to become temporarily quorate. As a consequence, certain unexpected state transitions could occur, for example standby servers could become active.

o 14-5-156 Logging partition state transitions

  Backend partition state transition logging contained a bug which allowed state transitions to be recorded incorrectly. This has been fixed. Note that this fix addresses the accuracy of the log file entries; it does not increase the number of entries in the RTR log to cover those cases where a backend partition state transition is not logged.

The following changes and corrections have been made for RTR V3.2, ECO4 for all platforms.

o 14-1-743 Wrong return status RTR_STS_COMSTAUNO

  In RTR V3.2 it was sometimes possible for a transaction which had not yet been voted on by a server which exits in mid-transaction to be aborted with incorrect status RTR_STS_COMSTAUNO.

o 14-1-805 Attempt to create a partition that already exists returns incorrect status

  An attempt to create a partition that already existed used to return the error KRINUSE (key range in use). This has been superseded by the more explicit PRTALREXI (partition already exists).

o 14-1-813 MONITOR SYSTEM shows WARNING on calls due to invalid time call

  The MONITOR SYSTEM monitor picture would sometimes incorrectly display a warning state for the CALL row.

o 14-1-841 Replayed shadow transaction stuck in VOTED

  The implementation of RTR's cooperative recovery protocol algorithm has been enhanced so that some situations which would previously hang during permanent network link outages are now recovered correctly using the remaining connections.

o 14-1-842 Transactions which do not specify a timeout abort

  If a frontend failed over to another router, then failed back to the original router, transactions which were in progress could sometimes be rejected with status RTR_STS_TXTIMOUT.

o 14-1-845 Transactions remain in VOTED state

  In earlier versions of RTR it was occasionally possible for transactions to remain hanging in VOTED state on a shadow primary backend and be aborted with status RTR_STS_FELINLOS on the secondary backend after network link failures in a slow WAN environment.

o 14-1-846 Transactions remain in SENDING state

  In previous versions of RTR it was occasionally possible for a transaction to remain hanging in SENDING state on a backend after a network partition had forced the backend to lose quorum.

o 14-5-111 Certain RTR commands now recorded in the RTR operator log

  Operator log files created by previous versions of RTR could sometimes be difficult to interpret.
  By recording certain RTR commands, such as START RTR and CREATE FACILITY, the RTR log file has become easier to interpret.

o 14-5-156 Logging partition state transitions insufficient

  Previous versions of RTR did not report backend partition state transitions in the operator log file with sufficient detail. Backend partition state transitions are now reported as follows:

  - Previously unlogged state transitions are recorded in the operator log with the new PRTSTATRA message.
  - The PRTBEGIN message is no longer generated.
  - The PRTCREATED and PRTEND message formats have been changed to match that of the PRTSTATRA messages.

o 14-8-267 V3.1D-to-V3.2 Journal incompatibility corrected

  If you upgraded from V3.1D to earlier versions of V3.2, it was possible to encounter situations which caused the RTR ACP to crash.

o 14-8-287 Named partition state change caused crash

  When using the CREATE PARTITION command, it was possible for RTR to crash on the backend node if the last channel using the partition was closed at the same instant that a state-change message from the router was pending.

The following correction has been made in RTR V3.2, ECO2 for all platforms.

o 14-8-268 ACP crash after death of concurrent server

  Under rare circumstances, after the death of a concurrent server, RTR would try to reschedule a transaction in commit state resulting in an RTR ACP crash. This bug was present in RTR V3.2 and V3.2 ECO1.

The following changes and corrections have been made for RTR V3.2, ECO1 for all platforms.

o 14-1-50 Looping RTR process for empty node string, e.g., /NODE=dna.

  Specifying an incomplete node specification, such as one with only the protocol prefix, e.g., "RTR SHOW RTR /NODE=dna." could cause the RTR process to loop, consuming CPU.

o 14-1-433 Show transactions not recovered on link break/reconnect

  If a secondary shadow backend lost its link to the RTR router after the router had sent a vote request, and the server on the primary shadow accepts the transaction, then in unusual circumstances it was possible that the transaction would not be immediately recovered on the secondary shadow after the link to the router was re-established. In such cases it required a cycle of the servers on the secondary site for the remembered transaction to be recovered from the primary shadow journal.

o 14-1-582 ACP access violation

  If a number of concurrent servers died in sequence while processing the same transaction, then under rare circumstances it was possible the ACP could also abort. This was due to a counter being incremented incorrectly.

o 14-1-617 Problems with DUMP JOURNAL

  In previous versions of RTR, qualifiers which required a value did not generate an error if the value was not supplied or was supplied incorrectly. Incorrect or missing values now generate an error message.

  If a string of fewer than five characters was passed for partition record class, the partition record counter was not updated and the record was not available. These problems have been fixed by comparing each character instead of five characters at a time.

o 14-1-760 ACP crashed when modifying journal size

  After a journal had been modified, the Flow Control subsystem of RTR was not properly updated with the new size. This could cause a hang or crash situation even though the journal size was increased to accommodate increased traffic.

o 14-1-763 rtr_close_channel fails for distributed transaction

  Calling rtr_close_channel while a distributed transaction was pending caused an incorrect status to be returned.

o 14-1-772 CALL CLOSE_CHANNEL defaults to IMMEDIATE

  The flag RTR_F_CLO_IMMEDIATE is a new flag added in RTR V3.2 that allows the caller to close a server channel without acknowledging the transaction on the channel.
  By default, the flag is not set when calling the rtr_close_channel API. However, the /IMMEDIATE qualifier is implicitly present in the RTR CLI version of the API (rtr call rtr_close_channel). Because this is incompatible with the behavior of previous versions of RTR, functionality has been restored to the same as before V3.2. When using the CLI version of the API (rtr call rtr_close_channel), /NOIMMEDIATE is now the default.

o 14-1-774 TOOMANCHA and distributed transaction left open after rtr_open_channel failure

  If rtr_open_channel failed after the RTR ACP had been stopped, then that channel remained available for a subsequent open. The application could eventually run out of channels and return RTR_STS_TOOMANCHA. Now if rtr_open_channel fails after a distributed transaction has been opened, the distributed transaction is always closed.

o 14-1-777 Transaction state is not getting EXCEPTION after issuing rtr_close/imme

  SET PARTITION /RECOVERY_RETRY_COUNT is new functionality implemented in RTR V3.2. The scope of this command was not fully documented, and is clarified here.

  If an application server dies while processing a transaction recovered from RTR journal, then RTR will present the transaction to another (concurrent or standby) server. The RECOVERY_RETRY_LIMIT indicates the maximum number of times the transaction should be presented to a server for recovery before being written to the journal as an exception.

  There are two types of recovery operations where transactions are recovered from journal: local recovery and shadow recovery. Shadow recovery is the process of recovering the remembered transactions written to a primary shadow journal while the secondary shadow site is down. The SET PARTITION /RECOVERY_RETRY_COUNT parameter does not have an effect on remembered transactions recovered during shadow recovery.
  That is, if there is a killer transaction remembered in the journal on a primary shadow node, on this node RTR does not count the number of times the transaction is recovered by a recovering secondary shadow node. To ensure that a remembered transaction will be exceptioned by RTR, you must start a sufficient number of concurrent servers on the recovering secondary shadow node. For this reason, RTR recommends that the number of concurrent secondary shadow servers started be greater than the value set for the RECOVERY_RETRY_LIMIT on a partition. This will ensure that a remembered (killer) transaction being recovered from a primary shadow journal will be exceptioned if the retry limit is exceeded.

  Only those transactions that have reached voting stage on a server can be exceptioned. If a server always dies before voting on a transaction, then the transaction will be aborted by RTR after the third try. This is a hard-coded limit (the so called "three strikes and you're out" feature).

o 14-1-791 Backends incorrectly remain inquorate after routers trimmed

  In versions V3.1D, ECO14, and V3.2 of RTR it was sometimes possible for nodes to incorrectly remain inquorate following a TRIM FACILITY operation.

o 14-1-792 Revised rtrreq.c and rtrsrv.c sample RTR applications

  The sample client and server used in the IVP have been extensively revised. Please pay special attention to the comments which explain how to write a wakeup handler, and comments drawing attention to several common programming mistakes we have seen in RTR applications.

o 14-3-291 SHOW SERVER truncates shd_rec_icpl to shd_rec_ic

  Some of the values previously truncated by the brief SHOW SERVER command are now displayed more fully.

o 14-3-298 Application may crash if invoked before RTR after a reboot

  Normally the RTR executable must have been invoked at least once since reboot before an RTR application can be started.
  If an RTR application is invoked first, the first RTR API call now always returns RTRNOTSTA, RTR not started.

o 14-7-420 IOS tid on IP only nodes is not unique

  Using previous versions of RTR, if you ran client applications that used the RTR V2 API on systems that had DECnet disabled, then there was a remote possibility that the same transaction identifier could be generated on two such systems, if RTR was started on both systems within milliseconds of each other.

o 14-8-215 Faster loading of large journals on first CREATE FACILITY

  RTR now takes much less time to load journals containing a large number of journaled transactions.

o 14-8-257 The broadcast message was not delivered from BE to client

  If a frontend loses the connection to its original router, and is the first frontend to connect to the router it fails over to, then the frontend may stop receiving broadcasts. Further, backends could also fail to receive broadcasts delivered by routers added to a facility after the server applications have started.

o 14-8-262 RTR has both backends as primary for some transactions (STR#1885690)

  In a partitioned network situation (when each of two routers has access to only half of the backend nodes), RTR will choose the router with the lower network address as the one that remains or becomes active. In previous versions of RTR, this would sometimes result in both sets of backends becoming active, due to a problem with the network ID comparison algorithm.

The following changes and corrections have been made in RTR V3.2, ECO7 for the OpenVMS platform.

o 14-3-142 $DCL_TX_PRC(W) service on V3 returns a different status to V2 when RTR has not been started

  RTR V3 now returns a status of RTRNOTSTA (which V2 expects) instead of NOACP when the $DCL_TX_PRC service is called before RTR has been started or after RTR has been stopped. TXSB also returns the same status as V2 instead of zero.

o 14-3-176 RTR does not add the RTR$INFO and RTR$OPERATOR rights identifier

  During installation of RTR the rtr$operator and rtr$info rights identifiers were not added to the UAF file. The RTR installation now adds the rtr$info and rtr$operator rights identifiers during the installation process and removes them during the product remove process.

o 14-3-181 V2 API files missing on V3

  A new option has been added to the V3 kit installation procedure asking if you want to install V2 API files (rtr.ada, rtr.bas, rtr.req, rtr.pli, rtr.mlb, rtr.for, rtr.pas). These V2 files will allow programmers to develop code in a V2 environment using ADA, BASIC, Bliss, macro, Fortran or Pascal languages.

o 14-3-314 Severity level "informational" is not severe enough for status RTR_STS_ACPDIED

  The informational severity of the status RTR_STS_ACPDIED was not consistent with the severity of the problem it described. This made error handling in command procedures unnecessarily complicated. The error status RTR_STS_ACPNOTVIA is now returned in situations where the informational status RTR_STS_ACPDIED would previously have been used.

o 14-3-322 When tearing down a channel RTR might hang for approximately 60 seconds

  RTR might hang if there was an error on a channel while RTR was tearing down a rejected connection request. The hang was due to a DECnet dassgn call aborting a channel that was already closed. RTR now performs the dassgn in an asynchronous manner to eliminate the possible hang.

o 14-3-336 Ast not delivered after channel shutdown

  Emulation of the RTR V2 API on RTR V3 was delivering channel shutdown completion ASTs before any $deq completion ASTs for the channel. This differed from V2 behavior, and has been corrected.

The following changes and corrections have been made in RTR V3.2, ECO4 for the OpenVMS platform.

o 14-1-833 Memory leak using sys$dcl_tx_prc(w) to close a channel

  Previous versions of RTR could fail to free up some memory in calls to sys$dcl_tx_prc(w) when the RTR$M_SHUTDOWN flag was set. An application that made extensive use of channel opening and closure could therefore fail due to lack of memory.

o 14-8-278 Incorrect EXQUOTA BYTLM error message

  In previous versions, when a VMS username was 12 characters or longer, RTR failed with an EXQUOTA BYTLM error message.

The following changes and corrections have been made in RTR V3.2, ECO1 for the OpenVMS platform.

o 14-1-305 IOS RTRV2 command line default key range bounds wrong

  The RTR V2 command DCL_TX_PRC() treated unsigned quadword keys as signed. The RTR V2 interface now handles unsigned quadword keys correctly.

o 14-1-326 VAX cpu spin or spurious leading zeroes for large quadword key

  On VAX systems negative quadword key range values were sometimes shown with trailing garbage. VAX systems now always terminate negative quadword values correctly.

o 14-3-190 Signals blocked in unthreaded UNIX applications during RTR API calls

  RTR now enables the usual termination signals during RTR API calls. For example, an idle server RTR application waiting in rtr_receive_message with no timeout will now respond to Control-C.

o 14-3-295 ACP crash when using DECdtm

  An application which explicitly commits an in-progress Rdb transaction could cause a duplicate accept message to be sent to the RTR ACP. The second message could cause the ACP to crash when it tries to free an already freed server (srb).

o 14-8-260 RTRACP crashed without dump on the RTR V3.2 if UCX was not started

  RTR no longer crashes when UCX is installed but not started. The function call inet_ntoa() was returning a -1 when it should return a string. RTR now checks for this erroneous return status.

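The signed-versus-unsigned key problem noted under 14-1-305 can be illustrated outside RTR. The Python sketch below is not RTR code, and the helper names as_signed64 and in_range are hypothetical. It only shows how the same 64-bit pattern orders differently under the two interpretations, which is enough to place a key outside a range it should belong to.

```python
import struct

def as_signed64(value):
    """Reinterpret an unsigned 64-bit integer as a signed quadword."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]

def in_range(key, low, high, signed=False):
    """Check low <= key <= high under the chosen interpretation."""
    if signed:
        key, low, high = (as_signed64(v) for v in (key, low, high))
    return low <= key <= high

LOW, HIGH = 0, 0xFFFFFFFFFFFFFFFF   # full unsigned quadword range
KEY = 0x8000000000000000            # a key with the top bit set

print(in_range(KEY, LOW, HIGH))               # unsigned comparison
print(in_range(KEY, LOW, HIGH, signed=True))  # signed comparison
```

Under the unsigned interpretation the key lies inside the range. Read as signed, the upper bound becomes -1, which is below the lower bound of 0, so no key can match.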
Known Problems with Workarounds
--------------------------------

The following restrictions were described in previous release notes and are still applicable to RTR.

o 14-8-318 Maximum journal size

  The RTR journal file is limited to 524287 blocks on a single disk. If you want to create a larger journal file, you have to distribute the journal across more than one disk.

o 14-1-39 Declaring exit handlers in RTR applications

  If an exit handler contains calls to RTR, then the exit handler must be declared after the first call to RTR. Using the RTR V2 or V3 API, if the exit handler is declared before the first call to RTR, then any call to RTR made within the exit handler will return an error. Under the V3 API, the error status returned is RTR_STS_INVCHANNEL. Under the V2 API, the error status returned is RTR$_INVALCH.

o 14-1-103 Using rtr_set_wakeup() in a threaded program

  After calling rtr_set_wakeup() in a threaded program, you should also call rtr_set_wakeup(NULL) wherever your program can exit. This will prevent any wakeups in other threads while the main thread is already running the RTR exit handler, which could lead to a server core dump when trying to stop the server.

o 14-1-263 Non-English character sets are not supported for identifiers

  The supported character set for RTR identifiers such as facility names is ASCII, with lowercase and uppercase letters equivalent. Eight bit characters are not supported because the name might not interoperate with RTR processes using a different locale or running another RTR version.

o 14-1-419 SPUJOUFIL advice to CREATE JOURNAL/SUPERSEDE is dangerous

  If the operator copies journal files or copies disks containing journal files without first remounting the source disk read-only, then these are spurious because RTR sees duplicates that it did not create.
  RTR then displays the SPUJOUFIL message, which advises the operator to use CREATE JOURNAL/SUPERSEDE to destroy the original and all copies of the journal files, and all the transactions contained in them on that node, and then submit an SPR for something that is not in fact an RTR problem. This is not the correct action in situations like this.

  The operator should examine the log file, which shows the duplicate filenames, and then move any unwanted duplicate copies of journal files to anywhere other than a rtrjnl directory at the top level of a writable disk file system visible to RTR, and then try again. Only if SPUJOUFIL is caused by circumstances other than operator intervention should the operator consider making backup copies of the journal files, and only then abandoning the existing journal files and any transactions contained in them by using DELETE JOURNAL and CREATE JOURNAL, or the equivalent CREATE JOURNAL/SUPERSEDE.

o 14-1-455 Last line of batch procedure sometimes ignored

  The last line of a batch procedure or command file must explicitly end with a line terminator, added by pressing the Enter/Return key when creating the procedure. Without the explicit line terminator, RTR ignores the line. The workaround is to add a comment to the end of the file or to explicitly add a line terminator to the end of the last line of the batch procedure.

o 14-1-462 MODIFY JOURNAL with list of devices does not give individual error messages

  Although MODIFY JOURNAL now only lists all devices that were successfully modified, if some disk devices cannot be modified because they do not contain a journal file at all, then nothing at all is reported for those devices. The workaround is to identify the omitted devices by comparing the command parameters and the messages, or modify them one device at a time. Verify the modification with SHOW JOURNAL /FILES /FULL.

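The comparison suggested in the 14-1-462 workaround amounts to a set difference between the devices named on the command and the devices named in the messages. The Python sketch below assumes both lists have been gathered by hand; it does not parse RTR output, whose message format is not shown here. The function name omitted_devices and the device names are hypothetical, and the case-insensitive match is an assumption of the sketch.

```python
def omitted_devices(requested, reported):
    """Return requested devices that no message mentioned, in command order."""
    # Assumption: device names are compared without regard to case.
    seen = {name.strip().upper() for name in reported}
    return [name for name in requested if name.strip().upper() not in seen]

requested = ["DISK1:", "DISK2:", "DISK3:"]   # hypothetical names from the command
reported = ["disk1:", "DISK3:"]              # hypothetical names from the messages
print(omitted_devices(requested, reported))
```

Any device the function returns was silently skipped and can then be retried on its own or checked with SHOW JOURNAL /FILES /FULL.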
o 14-1-604 Combined local and remote SET MODE/GROUP

  The following combined local and remote command does not perform the remote part of the command:

    RTR> SET MODE/GROUP=newgroup/NODE=(THISNODE,ANOTHERNODE)

  After recording a different new group or no group setting in shared memory the comserver has to disconnect itself, and does so immediately. When the comserver is disconnected the SRVDISCON message is shown, and the remote part of this command, together with any other pending commands such as those issued by the same user in other windows, is aborted immediately. A workaround is to issue the local SET MODE/GROUP command separately first.

o 14-1-681 ACP crash in RDM subsystem

  It has been observed that with a large number of concurrently active transactions, where each transaction sends back a large number of very large replies, it is possible to exhaust the virtual memory that RTR needs to store the replies for possible recovery after a link glitch. This would cause RTR to crash on a backend node. The workaround is to reduce the number of concurrent server channels, so as to limit the number of concurrently active transactions on each backend. Another possibility would be to limit the number of replies per transaction. In our tests we were able to exceed the RTR limits using 10 servers, each sending back 200 replies of 64KB each.

o 14-1-769 System time must not be set backwards

  Correct operation of RTR is not guaranteed if the time is not monotonic. When performing Year 2000 testing, stop all RTR processes and remove any journalled transactions before setting the clock back. When configuring the Network Time Protocol daemon ntp, do not select, or accept by default, any options which may result in the system clock being set backwards. While RTR is running as a service on NT, do not allow the time to be set backwards by ntp or in login scripts.
  On OpenVMS systems, do not change to a different time zone, or to or from Summer Time or Daylight Savings Time, while using current versions of RTR which are built on OpenVMS V6 to run on both OpenVMS V6 and V7. On most UNIX systems it is in any case not recommended that you change the date and time when running in multi-user mode. Adjustments are normally made by speeding up or slowing down the clock.

o 14-1-948 Use of /MAXIMUM_BLOCKS qualifier

  Use of the /MAXIMUM_BLOCKS qualifier on the CREATE JOURNAL or MODIFY JOURNAL commands may cause RTR to crash. Please use the /BLOCKS qualifier instead.

o 14-3-33 Partition limit per facility now 500

  The previous releases supported only up to 100 partitions per facility. The current release increases this to 500 but is not extendable beyond that.

o 14-3-50 Maximum number of application processes limit

  An ACP crash that occurred when starting the last of a great many applications has been corrected. When the process open file limit is reached, the application will now generally report ACPNOTVIA, "RTR ACP is no longer a viable entity, restart RTR". In actual fact the ACP continues to operate with all previously connected processes, and only the new rejected process thinks that the RTR ACP is not alive. This message should be interpreted as "ACPINSRES, The RTR ACP has insufficient resources". Please ensure that your system is configured with sufficient default per-process resources, or that the ACP process is started with increased resource limits. Allow at least one open file for each additional application process, and at least one for each link.

o 14-7-24 Transaction size limits

  The number of bytes in any application message (that is, a message sent with the rtr_send_to_server(), rtr_reply_to_client() or rtr_broadcast_event() routines) is currently restricted to 64000. The number of messages sent (that is, using rtr_send_to_server()) in any single transaction is limited to 65534.
  There is no fixed limit on the number of replies (that is, sent with rtr_reply_to_client()) in any single transaction.

Installation Overview
---------------------

The Reliable Transaction Router Version 3.2 ECO7 installation procedure is the same as the installation procedure for RTR Version 3.2. Refer to the Installation Guide for further

Copyright (c) Compaq Computer Corporation 2000. All Rights reserved.

This software is proprietary to and embodies the confidential technology of Compaq Computer Corporation. Possession, use, or copying of this software and media is authorized only pursuant to a valid written license from Compaq or an authorized sublicensor.

This ECO has not been through an exhaustive field test process. Due to the experimental stage of this ECO/workaround, Compaq makes no representations regarding its use or performance. The customer shall have the sole responsibility for adequate protection and back-up data used in conjunction with this ECO/workaround.