TITLE: RTR V3.1D RTRAIX31D_227 Reliable Transaction Router AIX UNIX ECO Summary
Copyright (c) Compaq Computer Corporation 1999. All rights reserved.
PRODUCT: Reliable Transaction Router (RTR) for AIX
OP/SYS: AIX (UNIX on IBM)
SOURCE: Compaq Computer Corporation
ECO INFORMATION:
ECO Kit Name: RTRAIX31D_227
ECO Kits Superseded by This ECO Kit:
ECO Kit Approximate Size: 17920 Blocks (9175040 Bytes)
TAR file - 17920 Blocks (9175040 Bytes)
Kit Applies To: RTR V3.1D
AIX UNIX V4.1, V4.3.2
System/Cluster Reboot Necessary: No
ECO KIT SUMMARY:
An ECO kit exists for Reliable Transaction Router on AIX (UNIX for
IBM) V4.1 through V4.3.2. This kit addresses the following problems:
Problems Addressed in RTRAIX31D_227:
o Remote commands failing intermittently
Remote commands can sometimes produce the error "%RTR-F-ERRACCNOD,
error accessing node". Repeating the command one or more times
usually results in the desired effect.
After stopping and starting RTR, one or more command server processes
may report "ACPNOTVIA" in response to a subsequent SHOW command.
The command DISCONNECT SERVER entered one or more times usually clears
the condition.
Also, the output from remote STOP RTR and DISCONNECT SERVER commands
is abbreviated.
o Overlapping partition definitions
Under some conditions, simultaneously opening several server channels
with overlapping partition definitions could cause the ACP to crash.
This is no longer the case.
o Errors at severity "error" and "warning" reported as "fatal"
RTR defines messages with one of the following severity levels:
success, informational, warning, error and fatal. When displayed in
a terminal session or log file entry, the severity of a message is
indicated in the preamble string as a letter, e.g. "%RTR-S-FACCREATED
. . ." indicates the successful creation of a facility. Previous
versions of RTR grouped together messages of severity warning, error
and fatal, and displayed the severity level of these messages as fatal.
This had the unintended effect of displaying some messages at a higher
severity level than intended. This defect has been eliminated so that
messages are now rendered consistently with their descriptions in the
System Manager's Manual and elsewhere.
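The severity letter can be read mechanically from the preamble string.
The following Python sketch is not part of RTR. It assumes the preamble
has the form %RTR-<letter>-<IDENT>, as in the examples quoted in these
notes, and that the letters S, I, W, E and F correspond to the five
severity levels listed above. The name parse_severity is hypothetical.

```python
import re

# Assumed mapping of preamble letters to the severity levels named above.
SEVERITIES = {
    "S": "success",
    "I": "informational",
    "W": "warning",
    "E": "error",
    "F": "fatal",
}

# Assumed preamble shape: %RTR-<letter>-<IDENT>, as in "%RTR-S-FACCREATED".
_PREAMBLE = re.compile(r"^%RTR-([SIWEF])-([A-Z0-9_]+)")


def parse_severity(line):
    """Return (severity, ident) for an RTR message line, or None if no match."""
    match = _PREAMBLE.match(line.strip())
    if match is None:
        return None
    return SEVERITIES[match.group(1)], match.group(2)
```

A log scanner built on such a function could, for example, count only
the lines whose severity is "fatal" now that warnings and errors are no
longer reported at that level.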
o ASCII representation of key bounds
Key range low and high bounds are now displayed properly
as a comma-separated list of numbers and C-style strings.
o Facility name allows some eight-bit characters but not all
Non-English character sets are not supported for identifiers.
The supported character set for RTR identifiers such as facility
names is ASCII, with lowercase and uppercase letters equivalent.
Eight-bit characters are not supported because the name might not
interoperate with RTR processes using a different locale or running
another RTR version.
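An application that generates facility names can check them before
passing them to RTR. The following Python sketch covers only the two
rules stated above (ASCII only, case-insensitive comparison). It does
not check any other naming rule RTR may apply, and the function names
are hypothetical.

```python
def is_ascii_identifier(name):
    """True if the name is non-empty and every character is 7-bit ASCII."""
    return bool(name) and all(ord(ch) < 128 for ch in name)


def same_identifier(a, b):
    """Compare two ASCII identifiers, treating upper and lower case as equal."""
    if not (is_ascii_identifier(a) and is_ascii_identifier(b)):
        raise ValueError("identifiers must be ASCII")
    return a.upper() == b.upper()
```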
o Node /isolate and link /suspect implementation faulty
The 'set node' qualifier /isolate and 'set link' qualifier /suspect have
been superseded by /autoisolate.
Any RTR node may disconnect a remote node if it finds that node
to be unresponsive or congested. The normal behavior following
such action is automatic network link reconnection and recovery.
Node autoisolation is a feature that allows a node where the
feature is enabled to disconnect a congested remote node in such
a way that when the congested node attempts to reconnect it receives
an instruction to close all its network links and cease connection
attempts. Whilst in this state, the node is termed isolated.
Remote node autoisolation may be enabled at the node level where
it applies to all links, or for specific links only with the
'set link/autoisolate' command.
An isolated node will remain in that state until the system manager
performs the following actions:
- enable the link to the isolated node at all nodes that
have isolated it
set link /enable
- exit the isolated state at the isolated node
set node/noisolate
o MONITOR STALLS field widths adjusted
The numbers should now be legible even when many bytes have been sent.
o Incorrect open channel count in MONITOR CHANNEL
The opened channel count now counts every opened channel exactly once.
o Distributed quorum loop
Previous versions of RTR could generate unnecessarily large amounts
of network traffic whilst establishing quorum. In cases where a
recipient node was slow in processing incoming network messages
this could result in the loss of the sending RTRACP process
(rdm__malloc: not enough memory).
This has been corrected.
o MODIFY JOURNAL command now allowed while ACP is running
The earlier restriction that a MODIFY JOURNAL command could not
be issued after RTR was started has now been lifted. It is however
now required that RTR be started for a MODIFY JOURNAL command to
be accepted.
o Bug in load balance feature
In earlier versions of RTR it was occasionally possible for a router
to become permanently incapable of accepting new incoming frontend
connections if router load-balancing had been enabled by specifying
the /BALANCE qualifier, and a frontend happened to connect to the
router during a quorum state transition.
This has now been corrected.
o Rtr applications hang on trying to continue after ACP restarted
If the application tried to open a channel again after seeing the
status RTR_STS_ACPNOTVIA it could hang on the subsequent
rtr_receive_message call. This problem has been corrected for
threaded Unix platforms. It is no longer necessary to restart any
Rtr application for Unix after restarting Rtr.
o Using RTR commands after ACP has been stopped or dies
API verbs called from the RTR command line interpreter would fail
with the status ACPNOTVIA if RTR was stopped and restarted without
restarting the command server. This has been corrected. The problem
can be avoided on earlier versions of RTR by issuing the command
'disconnect server' after stopping RTR.
o Application crash trying to send large messages to looping ACP
Several changes were made to combat this combination of application
crash and rapidly expanding ACP heap:
Flow control is now granted only to the channel and facility that
requested it. A problem was discovered and corrected whereby a grant of
flow control credit could allow unrelated channels to send too. This
is believed to be the prime cause of the symptoms reported.
An application that is unable to send to the ACP due to resource
shortage, for example if the ACP is alive but no longer receiving
for whatever reason, now keeps trying indefinitely, and will now
appear to hang rather than crash.
The TCP_NODELAY option, which disables the Nagle algorithm, is no
longer enabled on any Rtr platform. This will improve throughput
under load, although there may be a slight impact on response time
under certain conditions.
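For readers unfamiliar with the option, the following Python sketch
shows how TCP_NODELAY is inspected and changed on an ordinary TCP
socket using the standard socket module. It illustrates the socket
option only and makes no claim about how RTR manages its own
connections. The helper names are hypothetical.

```python
import socket


def nagle_disabled(sock):
    """Return True if TCP_NODELAY is set on the socket (Nagle disabled)."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


def set_nodelay(sock, enabled):
    """Turn TCP_NODELAY on or off for a TCP socket."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)
```

With the option off, the Nagle algorithm may coalesce small writes into
fewer segments, which matches the trade-off described above: better
throughput under load at the cost of a possible small delay.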
o Functions permitted in a rtr_set_wakeup() handler
In an Rtr wakeup handler in an AST in an unthreaded OpenVMS
application the use of rtr_reply_to_client(), rtr_send_to_server(),
rtr_broadcast_event(), or rtr_receive_message() with a non-zero
timeout is not recommended. They may block when they need
transaction ids or flow control, which will cause the whole
application to hang until the wakeup completes.
In an Rtr wakeup handler in a threaded application the same rules
apply. Note that wakeups are unnecessary in a threaded paradigm,
but they may be used in common code in applications that also need
to run on OpenVMS. Please note that your mainline code continues
to run while your wakeup is executing, so extra synchronisation may
be required. Also note that if the wakeup does block then it does
not generally hang the whole application.
In an Rtr wakeup handler in a signal in an unthreaded Unix
application, no RTR API functions may be called, and only the few
async-signal-safe system and library functions may be used, because
the wakeup is performed in a signal handler context. It is permitted
to write to a pipe or access a volatile sig_atomic_t variable, but
using malloc() or printf(), for example, will cause unexpected failures.
Alternatively, on most Unix platforms you can compile and link
the application as a threaded application with the reentrant Rtr
shared library '-lrtr_r'.
For maximum portability the wakeup handler should do no more than
just the minimum necessary to wake up the mainline event loop.
You should assume that mainline code and other threads might
continue to run in parallel with the wakeup, especially on machines
with more than one CPU.
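The advice to do no more than wake the mainline event loop is commonly
followed by having the handler write one byte to a pipe. The following
Python sketch illustrates that pattern only. It does not call
rtr_set_wakeup() or any other RTR function, and SIGUSR1 is merely a
stand-in for whatever triggers the wakeup. Python runs its handlers
outside true signal context, so the restrictions described above for
C handlers are not demonstrated here, only the shape of the pattern.
The names _on_wakeup and wait_for_wakeup are hypothetical.

```python
import os
import select
import signal

# Self-pipe: the handler writes one byte, the mainline waits on the read end.
_read_fd, _write_fd = os.pipe()
os.set_blocking(_write_fd, False)  # never block inside the handler


def _on_wakeup(signum, frame):
    """Do the minimum: make the read end of the pipe readable."""
    try:
        os.write(_write_fd, b"\0")
    except BlockingIOError:
        pass  # pipe already full, so the mainline will wake anyway


def wait_for_wakeup(timeout):
    """Block in the mainline until a wakeup arrives; return True if one did."""
    ready, _, _ = select.select([_read_fd], [], [], timeout)
    if not ready:
        return False
    os.read(_read_fd, 4096)  # drain pending wakeup bytes
    return True


# SIGUSR1 is an assumption for this sketch, not something RTR prescribes.
signal.signal(signal.SIGUSR1, _on_wakeup)
```

All real work, such as receiving and processing messages, then happens
in the mainline after wait_for_wakeup() returns, not in the handler.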
o Superfluous network traffic for nonexistent channels
Whenever a channel opens or closes, RTR sends an update message
to the router so that it can modify its broadcast routing
information, if necessary. In previous versions of RTR such
messages were sent even if no channel existed for the facility.
In cases where machines with the Frontend role had a large number
of facilities defined, this could result in significant network
traffic that would be quite noticeable over slow links, such as
asynchronous connections over telephone wire. RTR no longer
sends these messages unless a channel exists on the facility.
o Crash for invalid facility on AIX
An uninitialized variable could cause a process crash on AIX.
This has been corrected.
o Large messages can exhaust ACP memory before it closes channel
Flow control is supposed to close a channel to release memory
occupied by the queue of messages waiting to be sent to an
unresponsive application channel after it has been found to be
full a certain number of times. If the application is sending
many very large messages, especially if flow control is also
not working properly for some other reason, then flow control
may not close the channel soon enough and the ACP may run out
of memory.
As a workaround only for applications using large messages
containing tens of thousands of bytes, the following environment
variables are provided in this release to adjust the flow control
algorithm parameters:
RTR_MAX_CHANNEL_WAITQ_BYTES   default 100000   increase to several million
RTR_MAX_CHANNEL_WAITQ_LIMIT   default -1       reduce to tens of millions
RTR_MAX_CHANNEL_FULL_COUNT    default 2000     reduce to less than LIMIT/64000
The BYTES should be set somewhat larger than the total amount
of data that can ever occur in one transaction or one set of
broadcasts. This might be a few million for a typical large
message application. Flow control will start to operate and
hang each application that is sending the data while this number
of BYTES is exceeded for a channel. The channel will be closed
if this happens more than COUNT times, or if the LIMIT is ever
exceeded. Either set the LIMIT to an amount of ACP waitq memory
that you can afford for one channel, considerably less than your
virtual memory and heap quotas, or set the COUNT to the number
of messages you can afford to store in ACP waitq memory for one
channel. The default for LIMIT is -1, which means that there is
no LIMIT and the COUNT will apply.
These environment variables are provisional.
Please consult with Rtr Engineering before changing any of these
values.
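The arithmetic behind the guidance above can be sketched as follows.
This Python helper is illustrative only and is no substitute for
consulting Rtr Engineering. The function name and the headroom factor
used for "somewhat larger" are assumptions. Only the three variable
names and the LIMIT/64000 ratio come from these notes.

```python
def suggest_waitq_settings(max_txn_bytes, channel_memory_budget, headroom=1.25):
    """Return a dict of environment settings derived from the guidance above.

    max_txn_bytes: largest total data in one transaction or broadcast set.
    channel_memory_budget: ACP waitq memory affordable for one channel.
    headroom: hypothetical factor standing in for "somewhat larger".
    """
    waitq_bytes = int(max_txn_bytes * headroom)
    if channel_memory_budget <= waitq_bytes:
        raise ValueError("memory budget must exceed the BYTES threshold")
    # The notes suggest a COUNT below LIMIT/64000.
    full_count = max(1, channel_memory_budget // 64000 - 1)
    return {
        "RTR_MAX_CHANNEL_WAITQ_BYTES": str(waitq_bytes),
        "RTR_MAX_CHANNEL_WAITQ_LIMIT": str(channel_memory_budget),
        "RTR_MAX_CHANNEL_FULL_COUNT": str(full_count),
    }
```

For example, an application whose largest transaction carries 4,000,000
bytes, with 32,000,000 bytes of waitq memory budgeted per channel, would
get BYTES 5000000, LIMIT 32000000 and COUNT 499 from this sketch.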
o Servers hanging during failover recovery
Transaction recovery as a result of server failover could result
in server applications getting hung in 'local recovery' state if
it also happened that more than 10 client channels had
simultaneously caused new transactions to be presented to the
backend node. This has been fixed both by increasing the limit
to 50 and by adding a check to make sure that recovery is complete
before enforcing the limit, which is designed to keep a backend
node from getting overwhelmed when transactions are coming in at
a rate faster than it can handle.
o Null bytes display in SHOW PARTITION output
The display of null bytes in the upper and lower key bounds
has been suppressed if the bytes appear at the end of a
key of type string.
o RTRACP core dump while idle
When a Frontend node is trimmed from a Frontend/Router Facility
Definition, a core dump may result when RTR attempts to verify
the facility after a network link loss.
This has been corrected.
o RTRACP crashes during trim operation
A crash of the RTRACP during a trim operation has been corrected.
The following problems were corrected in V3.1D ECO2 and are included
in this kit:
o A problem introduced in the 3.1D ECO1 (171) release that caused
errors when using more than one partition has been corrected.
The following problems were corrected in V3.1D ECO1 and are included
in this kit:
o A configuration including front end nodes running RTR V2 and
router nodes running RTR V3.1D could result in an ACP failure
with an assertion on the router node. This has been corrected.
o The RTR ACP process would sometimes crash due to mis-handling
of data pointers related to key range and facility information.
The code has been changed to correct these problems.
o The delivery of the RTR_EVTNUM_SRPRIMARY event sometimes
arrived at an application in Shadow Secondary mode before the
completion of the pending transaction, which potentially allowed
the same transaction to be played to both shadow sites. This
problem has been corrected by ensuring that the event is delivered
only between transactions.
o Following facility reconfiguration with the "trim facility" command,
a situation could arise where a frontend node would be unable to
connect to its own role as a router in a facility. Once in this
mode, starting client channels would be unable to proceed past
the state "wait_keyin".
o The RTR Command Server process would sometimes crash while
exiting because the shared memory segment required to print log
file information was no longer available when the final flush
of the log buffer occurred. This problem has been corrected.
INSTALLATION NOTES:
The Reliable Transaction Router Version 3.1D ECO11 installation
procedure is the same as the installation procedure for RTR Version 3.1D.
Refer to the Installation Guide for further information.
This patch can be found at any of these sites:
Colorado Site
Georgia Site
European Site
Files on this server are as follows:
rtraix31d_227.README
rtraix31d_227.CHKSUM
rtraix31d_227.CVRLET_TXT
rtraix31d_227.tar