RTR V3.1D RTRI231 RTR for Windows NT _ 95 3.1D Intel ECO Summary

TITLE: RTR V3.1D RTRI231 RTR for Windows NT _ 95 3.1D Intel ECO Summary

Modification Date:  07-MAY-99
Modification Type:  New Kit

Copyright (c) Compaq Computer Corporation 1999. All rights reserved.

PRODUCT:    Reliable Transaction Router (RTR) V3.1D

OP/SYS:     Windows NT and Windows 95

SOURCE:     Compaq Computer Corporation

ECO INFORMATION:

     ECO Kit Name:  RTRI231
     ECO Kits Superseded by This ECO Kit:  None
     ECO Kit Approximate Size:  3959 Blocks
     Kit Applies To:  RTR V3.1D
     System/Cluster Reboot Necessary:  Unknown
     Rolling Re-boot Supported:  Information Not Available
     Installation Rating:  INSTALL_UNKNOWN

     Kit Dependencies:

       The following remedial kit(s) must be installed BEFORE installation of this kit:

         None

       In order to receive all the corrections listed in this kit, the following remedial kits should also be installed:

         None

ECO KIT SUMMARY:

An ECO kit exists for Reliable Transaction Router (RTR) V3.1D on Windows NT V4.0. This kit addresses the following problems:

Problems Addressed in the RTRI231 Kit (ECO-14):

  o  14-1-239 WSAEventSelect 10055 knl_env line 1339

     RTR was not handling some errors that could be returned when an asynchronous DECnet line would shut down. This has been corrected.

  o  14-3-118 TX on pri_act not being played on sec_act

     If RTR is configured with servers on backends that are running DECnet Phase V, then under certain conditions, local recovery from the remote node's journal would not be performed. For example, local and shadow recovery would appear to work correctly in a shadow server configuration after the primary shadow would go down, but in actual fact any transactions in the remote node's journal would not be recovered.

     This can only occur if the backends are all using DECnet Phase V as the primary RTR transport, and if the DECnet addresses of the nodes concerned match a particular pattern. Note that this is a static DECnet configuration issue.
     If recovery works in your particular configuration, then it will always work so long as the DECnet network configuration is not changed. This has now been fixed.

  o  14-3-124 Bug in V2 load balance

     In earlier versions of RTR it was occasionally possible for a router to become permanently incapable of accepting new incoming frontend connections if router load-balancing had been enabled by specifying the /BALANCE qualifier, and a frontend happened to connect to the router during a quorum state transition. This has been corrected.

  o  14-3-131 $DCL_TX_PRC crash when no privs

     Running a V2 application from an account that does not have RTR info privilege no longer causes the application to crash.

  o  14-3-132 Extend journal failure

     Previous versions of RTR would suffer from a crash of the rtr acp process if the RTR journal had been created with /MAXIMUM_BLOCKS greater than /BLOCKS and RTR attempted to extend the journal beyond the initial size. This has now been corrected.

  o  14-3-134 More Broadcast message counters required

     Various counters connected with the delivery of broadcast events have been added:

     - facility counters fdb_cn_bm_transit_brd_lost and fdb_cn_bm_transit_brd_delivered
     - link counters ndb_cn_bm_transit_lost and ndb_cn_bm_transit_delivered
     - process counters bm_brd_lost and bm_brd_delivered

  o  14-3-140 RTR V3 $DCL_TX_PRC does not complete correctly when no TXSB is supplied

     Using the V2 API verbs on V3 would not always generate the same result as V2 when using undeclared or invalid channels. This has been corrected.

  o  14-3-150 RTR applications hang on trying to continue after ACP restarted

     If the application tries to open a channel again after seeing the status RTR_STS_ACPNOTVIA, it hangs on the subsequent rtr_receive_message call. This affects applications on threaded RTR platforms, especially Win32 and AIX.

     If the application tried to open a channel again after seeing the status RTR_STS_ACPNOTVIA it could hang on the subsequent rtr_receive_message call.
     This problem has been corrected for threaded UNIX platforms. It is no longer necessary to restart any RTR application for UNIX after restarting RTR.

  o  14-3-151 'signed' identifier not in VAX C

     For previous versions of RTR, compiling RTR applications with the VAX C compiler generated compiler errors, since the VAX C compiler does not recognize the 'signed' keyword used in RTR.H. This has now been fixed. The signed keyword is no longer defined in RTR.H if compiling with the VAX C compiler.

  o  14-3-161 MONITOR CALLS/ID=n where 'n' is not a valid id - monitors all ids

     Use of the monitor command with any of the qualifiers /link, /process, /facility or /partition would generate an empty display if the requested entity did not exist. This was unlike V2 behavior, and was considered by some to be misleading. V2 behavior has been restored.

  o  14-3-165 Shadow servers experience deadlock using 2 partitions

     If two shadowed server partitions were set up with primary and secondary roles transposed on the two backends involved, and the servers for these partitions accessed the same database rows, it was occasionally possible for a distributed deadlock to occur in which servers on each site waited forever for locks held by the primary server for the other partition to be released. This has now been corrected.

  o  14-3-167 Corruption in large monitor pictures

     Spurious and missing characters were seen in larger monitor pictures displayed to a terminal window or terminal. The corruption was particularly dramatic if a terminal escape sequence was affected.

     The corruption appeared to occur when more than 8k of data was buffered without a newline or explicit flush, e.g. when monitoring with a recent kit in which RTR terminal output was changed to line-buffered for efficiency. The output always seemed to be correct when using MONITOR /OUTPUT, or when monitoring remotely from an RTR platform other than VAX, or when buffering more than BUFSIZ 64k of output on OpenVMS Alpha.
     Investigation so far indicates that RTR was buffering the output correctly, and that there would seem to be an error in the OpenVMS VAX runtime libraries. RTR now flushes output more frequently so as not to provoke this problem.

  o  14-3-168 How to calculate required quotas?

     There is no easy way to calculate the required UAF and SYSGEN parameters needed by RTR. The following information may be used to estimate virtual memory size requirements of the RTR ACP process.

     The base virtual memory requirement of an unconfigured RTR ACP process is approximately 5.8 Mbytes. To this should be added allowances for the following:

     - for each link, add 202 kBytes
     - for each facility, add 13 kBytes, plus 80 bytes for each link in the facility
     - for each client or server application process, add 190 kBytes for the first channel
     - for each additional application channel, add 1350 bytes

     It is also necessary to make allowance for the number of active transactions in the system. Unless your client applications are programmed to initiate multiple concurrent transactions, this number will not exceed the total number of client channels in the system, but you should verify this with your application providers. It is also necessary to determine the size of the transaction messages in use.

     For each frontend:

     - add 1 kByte per active transaction
     - add 250 bytes per message per transaction
     - plus the size of all the messages

     For transaction routers, allow about 1 kByte for each active transaction.

     For backends, allow:

     - 1 kByte per active transaction
     - 50 bytes for each message of a transaction
     - plus the size of all replies

     The total of all contributions detailed above will yield an estimate of the likely virtual memory requirements of the ACP. Apply a large factor for safety - it is better to grant RTR resource limits exceeding its real requirements than to risk a loss of service in production as a result of insufficient resource allocation.
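The estimate described for 14-3-168 can be sketched as a small calculation. This is only an illustration of the arithmetic in the note, not an RTR tool: the function name, the parameter names, and the choice of 1024 bytes per kByte are assumptions, and the safety factor is left to the caller because the note gives no figure for it.

```python
import math

# Assumption: the note does not say whether a kByte is 1000 or 1024 bytes.
KB = 1024
MB = 1024 * KB

# Figures taken from the 14-3-168 note.
BASE_BYTES = int(5.8 * MB)          # unconfigured ACP process
PER_LINK = 202 * KB
PER_FACILITY = 13 * KB
PER_FACILITY_LINK = 80
PER_PROCESS_FIRST_CHANNEL = 190 * KB
PER_EXTRA_CHANNEL = 1350


def estimate_acp_vm_bytes(links, facility_link_counts, processes,
                          extra_channels, role, active_tx,
                          msgs_per_tx, payload_bytes_per_tx,
                          safety_factor):
    """Estimate ACP virtual memory in bytes for one node.

    facility_link_counts: one entry per facility, giving its number of links.
    role: "frontend", "router" or "backend" (hypothetical labels).
    payload_bytes_per_tx: total message bytes per transaction on a frontend,
        or total reply bytes per transaction on a backend.
    """
    total = BASE_BYTES
    total += links * PER_LINK
    total += sum(PER_FACILITY + PER_FACILITY_LINK * n
                 for n in facility_link_counts)
    total += processes * PER_PROCESS_FIRST_CHANNEL
    total += extra_channels * PER_EXTRA_CHANNEL

    if role == "frontend":
        per_tx = 1 * KB + 250 * msgs_per_tx + payload_bytes_per_tx
    elif role == "router":
        per_tx = 1 * KB
    elif role == "backend":
        per_tx = 1 * KB + 50 * msgs_per_tx + payload_bytes_per_tx
    else:
        raise ValueError("role must be frontend, router or backend")
    total += active_tx * per_tx

    # Round up after applying the caller's safety margin.
    return math.ceil(total * safety_factor)


if __name__ == "__main__":
    # Illustrative numbers only; the factor 2.0 is not from the note.
    estimate = estimate_acp_vm_bytes(
        links=2, facility_link_counts=[2], processes=3, extra_channels=4,
        role="frontend", active_tx=10, msgs_per_tx=2,
        payload_bytes_per_tx=1000, safety_factor=2.0)
    print(estimate, "bytes")
```

A node that plays more than one role would need the per-transaction terms for each role added together; the sketch handles a single role per call to keep the arithmetic visible.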
     Divide the result by VM size in pages to obtain the virtual memory requirement. You should set process memory and page file quotas to accommodate at least this much memory.

     Process quotas for the ACP process are controlled by qualifiers to the 'start rtr' command. See the RTR System Manager's Manual for further information.

     For more control, you may individually set all process quotas for the ACP by using the appropriate qualifier with the 'start rtr' command.

     For a more holistic approach, 'start rtr' accepts '/links' and '/processes' as qualifiers which can be used to specify the expected number of links and application processes in the configuration. The values supplied are used to calculate reasonably safe minimum values for the following ACP process quotas:

     - astlm
     - biolm
     - fillm
     - diolm
     - pgflquota

     The default value for '/links' is 512. This is high, but is chosen to protect RTR routers against a failover scenario where the number of frontends is large and the number of surviving routers becomes small.

     The default value for '/processes' is 64. This is large for frontend and router nodes, but you may need to specify a larger value on a backend hosting a complex application.

     You may use an explicit process quota qualifier to specify a value larger than that calculated through use of '/link' and '/process', but you may not specify a smaller value.

     Use of /link and /process does not consider the memory requirement for transactions. If your application passes a large amount of data from client to server or vice-versa, you should include this in your sizing calculations.

  o  14-3-174 $ENQ ACCVIOs with bad channel message

     Calls to the V2 API specifying an invalid channel identifier would cause an access violation in LIBRTR. This has been corrected.

  o  14-3-195 $START_TX when ACP died

     The emulation of the RTR V2 API on V3 has been improved to correctly reflect V2 behaviour for channels which were idle at the time of ACP failure.
     Subsequent calls on such channels now fail immediately with the status RTR$_NOACP.

  o  14-3-196 $START_TX from AST w/o ACP

     An application calling $START_TX at AST level while the ACP died would crash inside LIBRTR. This has been corrected and SYS$START_TX will simply return to the caller a message indicating that the ACP is not available.

  o  14-3-197 ACPNOTVIA error returned if RTR command $DCL_TX_PRC issued

     On RTR V3.1D (194-SWX01) ECO7-FT1 if the RTR command $DCL_TX_PRC is issued for a non-existent facility, an ACPNOTVIA error is returned. This does not happen the first time - only subsequent times if RTR is stopped in between.

     API verbs called from the RTR command line interpreter would fail with the status ACPNOTVIA if RTR was stopped and restarted without restarting the command server. This has been corrected. The problem can be avoided on earlier versions of RTR by issuing the command 'disconnect server' after stopping RTR.

  o  14-3-203 ACP router crash when other nodes shut

     Configurations where more than 100 frontends were connected to any particular router could experience an ACP failure whilst managing quorum loss. This has been corrected. Automatic router failback has been restored for RTR V2 frontends connecting to RTR V3 routers.

  o  14-3-205 Inconsistent TR TX timeout if no link to FE

     Using previous versions of RTR, if a router lost a connection to a frontend that had a transaction active in enqueuing state, then the router would abort the transaction after a period of about one minute if the frontend link was not re-established. This happened even if the client had specified a transaction timeout much less than this when starting the transaction. This is now fixed, so that a transaction in enqueuing state on the router is aborted after the interval specified by the client (if it is less than one minute) if the router loses its connection to the frontend.
  o  14-3-210 START RTR qualifiers from V2

     Attempts to use obsolete V2 qualifiers to the 'start rtr' command cause a warning to be issued. Qualifiers affected are 'partitions', 'cache_pages', and 'relations'. Warnings are also generated if an OpenVMS qualifier is used on a non-OpenVMS platform.

  o  14-3-211 Long facility names were truncated in SHOW FACILITY output

     Facility names near or at the maximum now push the next column to the right instead of being truncated. Slightly shorter names are still preferred to prevent the SHOW FACILITY columns becoming ragged.

  o  14-3-213 Fac name can be 31 chars?

     Although the %RTR-E-FACNAMLON message states the facility name can only have 30 characters, it can take up to 31.

     The documented maximum length of a facility name string is 30 characters. Prior versions of RTR permitted facility names as long as 31 characters. This has been corrected.

  o  14-3-217 Unthreaded UNIX applications using rtr_set_wakeup can fail, e.g., in malloc

     When an unthreaded UNIX RTR application calls rtr_set_wakeup, the non-reentrant RTR shared library -lrtr with which it is linked installs a signal handler. This signal handler called functions internal to RTR which could occasionally call runtime library functions such as malloc() that are not async-safe, according to the relevant standards. See man (4) signal. In practice this may appear to work most of the time, but break for no apparent reason when the signal happens to occur while background code is also in a runtime library call such as malloc.

     The problem in RTR has been corrected. The small penalty for this is that RTR no longer makes any attempt to ensure that messages available are not just housekeeping. Applications must always be prepared for a timeout return status on calling rtr_receive_message with a zero timeout, even after a wakeup suggests that a message ought to be available.
     Application writers are reminded that their RTR wakeup handlers are subject to the same restrictions: routines like printf, malloc, and the entire RTR API may not be used directly or indirectly from within a signal handler.

     A workaround for applications with unsafe wakeup handlers can be to link with the reentrant version of the library, -lrtr_r, because different rules apply for wakeups in a thread: applications should not call anything that is not thread-safe, or anything that might block indefinitely, such as rtr_send_to_server, rtr_reply_to_client, rtr_broadcast_event, or rtr_receive_message with a non-zero timeout.

  o  14-3-218 Microsoft Visual C compiler options /Gz (stdcall) and /Gr (fastcall) supported

     The RTR API functions are now declared with the __cdecl attribute so they can be used in applications compiled with calling conventions other than the /Gd (cdecl) default.

  o  14-3-250 Flow control has -ve credit

     Applications with multiple channels engaged on more than one facility could experience flow control difficulties causing indefinite delays in transaction completion. This has been corrected.

  o  14-3-253 Restrictions on the RTR wakeup handler

     The use of rtr_reply_to_client, rtr_send_to_server, or rtr_broadcast_event in an RTR wakeup handler is not recommended. They may block when they need transaction ids or flow control. This will cause undesired behavior.

     Functions permitted in an rtr_set_wakeup() handler:

     In an RTR wakeup handler in an AST in an unthreaded OpenVMS application, the use of rtr_reply_to_client(), rtr_send_to_server(), rtr_broadcast_event(), or rtr_receive_message() with a non-zero timeout is not recommended. They may block when they need transaction ids or flow control, which will cause the whole application to hang until the wakeup completes.

     In an RTR wakeup handler in a threaded application the same rules apply.
     Note that wakeups are unnecessary in a threaded paradigm, but they may be used in common code in applications that also need to run on OpenVMS. Please note that your mainline code continues to run while your wakeup is executing, so extra synchronization may be required. Also note that if the wakeup does block then it does not generally hang the whole application.

     In an RTR wakeup handler in a signal in an unthreaded UNIX application, no RTR API functions and only the very few async-safe system and library functions may be called, because the wakeup is performed in a signal handler context. An application can write to a pipe or access a volatile sig_atomic_t variable, but using malloc() and printf(), for example, will cause unexpected failures. Alternatively, on most UNIX platforms, you can compile and link the application as a threaded application with the reentrant RTR shared library -lrtr_r.

     For maximum portability the wakeup handler should do the minimum necessary to wake up the mainline event loop. You should assume that mainline code and other threads might continue to run in parallel with the wakeup, especially on machines with more than one CPU.

  o  14-3-255 Multiple broadcast or data received on wrong channel

     When running W95/NT with Pathworks installed, RTR would not detect that the client had closed its channel when the client application was aborted by closing the window. RTR now detects when the client has aborted the channel and closes the channel.

  o  14-3-258 Stop inquorate standby from going active

     When there is a network segmentation in an active/standby configuration, the segment in the minority would become active. This behavior resulted in two active servers for the same partition. RTR now puts the inquorate or minority server in wt_quorum state and the majority server in active state.
  o  14-8-185
  o  14-3-259 Slow or hanging applications using large messages and discarded broadcasts

     The threshold for flow control has been increased from 100000 to 1000000, and can also now be changed by defining the environment variable RTR_MAX_CHANNEL_WAITQ_BYTES to e.g. 10000000 when starting the ACP.

     When too many data bytes are queued to be sent to a destination process then the flow control feature is activated. The sender application may then be forced to wait for a while in the next API call that sends data, or broadcasts may be discarded, until the queue reduces and flow control credit is freely granted again.

     You should increase this parameter if your application sends large broadcasts or sends or replies with large amounts of data per transaction.

     Because broadcasts are subject to discarding you should not use them to send large amounts of data reliably. You may wish to consider using a sequence number and providing a read-only transaction in your application to detect and request re-transmission of any discarded broadcast data.

     There is also a hard limit parameter with the default 100000000 (10^8). The channel will be closed immediately if the send wait queue exceeds this.

     These tunable flow control parameters are provisional and subject to change.

  o  14-3-265 Successive ACP crashes

     Reception of a corrupt network message could cause a failed assertion and demise of the RTR ACP process. The behavior has been changed to yield a log file entry (BADNETMSG), followed by a reset of the link concerned. If such log file entries persist for a particular pair of nodes, it may mean that a network problem exists, and you should consider checking the network hardware for correct operation. The RTR KNL subsystem log entry has also been improved to better identify the link on which it reports errors.
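The sequence-number suggestion in the 14-3-259 note can be sketched as follows. This is a minimal illustration of gap detection on the receiving side only, assuming the sender numbers its broadcasts consecutively; the class and method names are hypothetical, and the read-only transaction that would request re-transmission is application-specific and not shown.

```python
class BroadcastGapTracker:
    """Tracks broadcast sequence numbers and reports discarded ones.

    Hypothetical helper: it assumes the sender stamps each broadcast with a
    consecutive integer, starting at first_expected.
    """

    def __init__(self, first_expected=1):
        self.next_expected = first_expected
        self.missing = set()    # sequence numbers known to be outstanding

    def on_broadcast(self, seq):
        """Record an arriving sequence number.

        Returns the list of sequence numbers newly found to be missing, which
        the application could then request again from the server.
        """
        if seq in self.missing:
            # A late or re-transmitted broadcast fills a known gap.
            self.missing.discard(seq)
            return []
        if seq < self.next_expected:
            # Duplicate of something already seen; nothing to do.
            return []
        new_gaps = list(range(self.next_expected, seq))
        self.missing.update(new_gaps)
        self.next_expected = seq + 1
        return new_gaps


if __name__ == "__main__":
    tracker = BroadcastGapTracker()
    for seq in (1, 2, 5, 3):
        gaps = tracker.on_broadcast(seq)
        if gaps:
            print("request re-transmission of", gaps)
    print("still missing:", sorted(tracker.missing))
```

Keeping the outstanding numbers in a set lets a re-transmitted broadcast clear its own gap without disturbing the running counter.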
  o  14-3-272 RTR no longer disables the TCP/IP Nagle algorithm with TCP_NODELAY

     The TCP_NODELAY option, which disables the Nagle algorithm, was previously enabled on all RTR platforms except Solaris. This change improves network throughput under load. Response time may be slightly longer under some conditions.

     The option can be activated by defining the environment variable RTR_TCP_NODELAY. This restores the old behavior on most platforms.

  o  14-3-274 Opening channel for non-existent facility causes crash

     An uninitialized variable caused this crash, which was seen in recent Field Test kits (214) and (218) for AIX and Solaris. This has been corrected.

  o  14-3-275 aio not available makes RTR fail with unresolved errors for kaio_rdrw etc.

     RTR for AIX exploits Asynchronous I/O for increased journal performance. By default, aio is only `defined', i.e., disabled, instead of `available'. Aio can be configured with the system management tool:

       # smit aio

     The RTR installation procedure post_i script now makes aio available, and ensures that aio will also be available after a restart.

  o  14-3-276 SHOW TRANSACTION on FE after current TR trimmed and before FE reconnected causes ACP dump

     Executing the command RTR SHOW TRANS on a frontend immediately after trimming the current router from the facility could infrequently cause the ACP to crash. This has been corrected.

  o  14-8-173
  o  14-3-282 Dual ported TCP router not establishing facility links

     Problems can arise if nodes in your configuration have multiple network adapters and the IP name server is not configured to return all the configured IP addresses for such nodes. This results in such nodes replying to connection requests with an ID that is different to that determined by the initiator of the connection. This can result in refused connections, or in only the first connecting facility gaining a current router. This version of RTR has been changed to operate correctly in such a partially configured environment.
  o  14-3-283 Image identification shows previous rtr version

     Some versions of the OpenVMS/VAX kit for RTR V3.1D ECO10 were shipped with a shared library showing an incorrect version ID. This has now been corrected.

  o  14-3-285 OpenVMS process quotas artificially constrained

     Prior versions of RTR would limit the maximum values that could be specified for the ACP process quotas to 64K. This restriction has been removed. Warning messages are generated if the requested quotas conflict with the system wide WSMAX parameter, or the remaining free page file space.

  o  14-3-82 Requester hangs in SYS$COMMIT_TXW with RTR V3.1D - null txn

     If rtr_start_tx() was called by a client followed immediately by rtr_accept_tx(), then the application would hang (unless rtr_start_tx() was called with a timeout). This has been corrected. The status returned in the rtr_mt_accepted data in such cases is RTR_STS_SYNCHCOMM (transaction committed synchronously).

     This also corrects the equivalent problem with the RTR V2 API. Also, the status returned in the TXSB for such transactions using the V2 API is RTR$_SYNCHCOMM.

  o  14-8-128 The RTR ACP now uses an asynchronous method when closing its links

     This version of RTR will defer the deassigning of its network channels during the closesocket routine. This change allows RTR to handle other requests while the channels are being run-down. RTR no longer appears to pause while a network link is being deassigned.

  o  14-8-131 Failure to come up in remember mode

     The non-availability of a remote journal at shadow recovery time will cause a partition that was previously processing in remember mode to resume processing in that mode. Prior versions of RTR would set the partition state to 'shadow-recovery-fail', and transactions could not be processed until the configuration was manually corrected.

  o  14-8-144 When disconnected the ASYNC cable from the client, the RTR dump was generated on the client.
     Disconnecting a cable that was being used by an asynchronous DECnet link to a remote machine could cause an ACP failure when the transport marked the sockets as invalid. RTR has been changed to handle this error by temporarily suspending all network activity on the affected node. Network activity will resume as soon as the network is found to be usable again.

  o  14-8-154 RTR Router Crash

     Router ACPs configured to accept anonymous clients could fail when handling a network link loss event. This has been corrected.

  o  14-8-162 Hanging servers

     Transaction recovery as a result of server failover could result in server applications getting hung in 'local recovery' state if it also happened that more than 10 client channels had simultaneously caused new transactions to be presented to the backend node. This has been fixed both by increasing the limit to 50 and by adding a check to make sure that recovery is complete before enforcing the limit, which is designed to keep a backend node from getting overwhelmed when transactions are coming in at a rate faster than it can handle.

  o  14-8-175 RTRACP core dump while idle

     When a Frontend node is trimmed from a Frontend/Router Facility Definition, the Facility Descriptor Block is not fully deleted. This causes a core dump when RTR attempts to verify the facility after a network link loss. RTR now properly deletes the Facility Descriptor Block associated with the trimmed frontend.

  o  14-8-181 RTRACP Crashes

     It was possible that RTR on a frontend could select a router as its current router immediately after that router had been trimmed from the facility. This could potentially leave the frontend in a 'connecting' state. This has now been corrected.

The following restrictions apply to this kit:

  o  14-1-285: A temporary inconsistency in shadow server state can occur during initial facility startup of a shadowed configuration. A shadow server can erroneously remain in state "sec_act" until the rest of the facility has been started.
  o  14-3-67: An application's wakeup routine may be called more often than necessary.

INSTALLATION NOTES:

Please refer to the Installation Guide supplied with previous versions. Note that this kit runs only on Intel processors. The installation is the same for both Windows NT and Windows 95.

From an empty directory, run the self-extracting RTRI231.EXE. Then run SETUP.EXE to install RTRI231.

All trademarks are the property of their respective owners.

Files on this server are as follows:

rtri231.README
rtri231.CHKSUM
rtri231.CVRLET_TXT
rtri231.exe