RTR V3.2 RTR320_268 Reliable Transaction Router Tru64 UNIX ECO Summary
TITLE: RTR V3.2 RTR320_268 Reliable Transaction Router Tru64 UNIX ECO Summary
Copyright (c) Compaq Computer Corporation 2000. All rights reserved.
Modification Date: 11-OCT-2000
Modification Type: Updated Kit: Supersedes RTR320_266
PRODUCT: Reliable Transaction Router (RTR) for Tru64 UNIX
OP/SYS: Compaq Tru64 UNIX
SOURCE: Compaq Computer Corporation
ECO INFORMATION:
ECO Kit Name: RTR320_268
ECO Kits Superseded by This ECO Kit: RTR320_266
RTR320_255
RTR320_249
ECO Kit Approximate Size: 10600 Blocks (5427200 Bytes)
Kit Applies To: RTR V3.2 on
Compaq Tru64 UNIX V4.0D, V4.0E, V4.0F, V4.0G, V5.0
System/Cluster Reboot Necessary: No
Installation Rating: INSTALL_UNKNOWN
Kit Dependencies:
The following remedial kit(s) must be installed BEFORE
installation of this kit:
None
In order to receive all the corrections listed in this
kit, the following remedial kits should also be installed:
None
ECO KIT SUMMARY:
An ECO kit exists for Reliable Transaction Router V3.2 on Compaq Tru64
UNIX V4.0D through V5.0. This kit addresses the following problems:
Problems Addressed in the RTR320_268 Kit:
o The following changes and corrections have been made for RTR V3.2,
ECO8 for all platforms.
+ 14-8-337 V4.0 SET TRANSACTION Functionality
RTR would not allow the SET TRANSACTION command to abort a voted
transaction if a coordinating router was still available. Users
would get a SETTRANROUTER error. This constraint has now been
relaxed. You can use SET TRANSACTION to abort a transaction that
is in VOTED state. All backend participants and the client receive
an rtr_mt_rejected message along with the reason, indicating that
the transaction has been aborted by a SET TRANSACTION operation.
If the coordinating router is running a version of RTR prior to
V3.2, ECO8, the SET TRANSACTION command fails with a TRNEEDUPGRADE
error, indicating that "The new SET TRAN feature is not supported
in current version of RTR".
There are eight valid state changes allowed for the SET TRANSACTION
command. Attempting to change transaction state to a state that is
not allowed produces an error message of %RTR-E-INVSTATCHANGE,
Invalid to change from current state to the specified state. The
following table identifies the valid state changes.
Table 1  Valid Transaction State Transitions
___________________________________________________________
                           NEW STATE
Current      ______________________________________________
State        COMMIT      ABORT       EXCEPTION   DONE
___________________________________________________________
SENDING                  YES
VOTED        YES         YES
COMMIT                               YES         YES
EXCEPTION    YES                                 YES
PRI_DONE                                         YES
___________________________________________________________
All transaction states referenced in the above table are RTR
journal states. Use the RTR commands DUMP JOURNAL or SHOW
TRANSACTION to determine the journal state for each transaction
branch.
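The table can be read as a simple lookup. The following Python sketch is
illustrative only and is not part of RTR: the names VALID_TRANSITIONS and
check_state_change are hypothetical, and the column placement is our
reading of Table 1 (only the VOTED row is confirmed by the accompanying
text).

```python
# Illustrative lookup of the eight state changes listed in Table 1.
# VALID_TRANSITIONS and check_state_change are hypothetical names,
# not RTR APIs; the mapping is an assumption based on our reading
# of the table.
VALID_TRANSITIONS = {
    "SENDING":   {"ABORT"},
    "VOTED":     {"COMMIT", "ABORT"},
    "COMMIT":    {"EXCEPTION", "DONE"},
    "EXCEPTION": {"COMMIT", "DONE"},
    "PRI_DONE":  {"DONE"},
}


def check_state_change(current, new):
    """Return True if the table allows changing from `current` to `new`."""
    return new in VALID_TRANSITIONS.get(current, set())
```

A check such as check_state_change("VOTED", "ABORT") could be made in an
operator script before issuing SET TRANSACTION, to avoid the
INVSTATCHANGE error.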
Changing from State VOTED to State ABORT
A transaction can be stalled in a VOTED state in one or
more of the following situations.
- There is a distributed deadlock (only possible
with multi-participant transactions and multiple
simultaneously active transactions.)
- One participant has voted, but some other participant
has not voted (for example, waiting for the database
record to become accessible.)
- The transaction is not currently active, and is a
candidate for recovery. The transaction is probably a
multi-participant transaction and the final journal
state for the local branch is VOTED. The transaction
outcome cannot be determined without consulting the
journal from the other participants.
Whatever the cause, such a transaction ties up system resources
and prevents other transactions from running. If the system
administrator decides to abort the transaction using the SET
TRANSACTION command, RTR sends a request-to-abort message to every
participant (transaction branch). After the abort, RTR presents an
rtr_mt_rejected message to each server with a status indicating
that "TX was aborted by Set Transaction operation". If the
coordinating router is available, a race condition is possible: the
transaction coordinator might be trying to commit the transaction
at the instant that the operator is attempting to abort it. In this
case, RTR may not allow the abort to proceed if the coordinating
router has already decided to commit the transaction. An operator
log message is written on the router to warn the administrator of
this situation.
If the journal state of the transaction is VOTED, you can use SET
TRANSACTION /NEW_STATE=COMMIT or ABORT, for example:
RTR> SET TRANSACTION "8fd01f10,0,0,0,0,8fd01f10,7d9f25b0" /PARTITION=PART1 -
_RTR> /STATE=VOTED /NEW=ABORT
%RTR-S-SETTRANDONE, 1 transaction(s) updated in partition PART1 of
facility RTR$DEFAULT_FACILITY
o The following changes and corrections have been made for RTR V3.2,
ECO7 for all platforms.
+ 14-3-252 ACP failed with insufficient memory
Previous versions of RTR could allocate too much memory
for replay queues. When insufficient virtual memory was
available, this behavior could cause the ACP to fail.
RTR V2 compatible functionality has been reintroduced,
in that a replay queue that grows too large is now
discarded.
+ 14-3-270 ACP crashed with INSVIRMEM while trying to
abort a very large transaction
This problem was caused by overlarge transaction replay
queues and has been corrected.
+ 14-3-340 Standby partition fails to take over on network
partitioning
Previously, it was possible that a standby partition
would fail to take over if network partitioning
occurred.
o The following changes and corrections have been made for RTR V3.2,
ECO6 for all platforms.
+ 14-1-929 Unable to suspend a partition on a secondary
site
It was possible to encounter a situation where you could
not suspend transaction presentation on a partition on
a secondary site. If you gave the suspend command at
a time when the secondary site only had transactions
awaiting assignment of a server channel, then the
command would not complete, leaving transaction
presentation in the 'suspending' state.
+ 14-1-940 Partition switching between standby and active
State instability could result if instances of a
partition were assigned an equal priority. This could
happen through use of SET PARTITION commands, or as a
result of network partitioning.
+ 14-5-157 Improved diagnostics
Improved diagnostics have been added for testing
network-related problems.
+ 14-8-304 Aborted transaction caused the RTR ACP to crash
A complex series of interactions could lead to an
aborted transaction causing the RTR ACP to crash on
another node where a standby server was being started.
+ 14-8-309 Partition switching between primary and standby
Temporary partition state instability could occur during
certain network failures causing application servers
that subscribe to SRSTANDBY and SRPRIMARY events to
receive repeated events in rapid sequence.
+ 14-8-310 Transaction on the front-end hangs in the
sending state
It was possible for a front-end to hang in the sending
state when a front-end failed over to the secondary
router and then back to the preferred router.
o The following changes and corrections have been made for
RTR V3.2, ECO5 for all platforms.
+ 14-1-864 MONITOR GROUP shows bad values in "server act"
and "vreq" fields
MONITOR GROUP previously showed invalid counts in
the "act" and "ack" columns if a secondary rejected a
transaction.
+ 14-1-865 Partition creation not adequately synchronised
with journal scan
Partitions are now created after the journal scans are
complete.
+ 14-1-893 Partitioned Standby allowed to enter lcl_rec
followed by "dual active"
During periods of network instability, previous
versions of RTR could incorrectly allow roles to
become temporarily quorate. As a consequence, certain
unexpected state transitions could occur, for example
standby servers could become active.
+ 14-5-156 Logging partition state transitions
Backend partition state transition logging contained
a bug which allowed state transitions to be recorded
incorrectly. This has been fixed. Note that this fix
addresses the accuracy of the log file entries; it
does not increase the number of entries in the RTR log
to cover those cases where a backend partition state
transition is not logged.
o The following changes and corrections have been made for RTR
V3.2, ECO4 for all platforms.
+ 14-1-743 Wrong return status RTR_STS_COMSTAUNO
In RTR V3.2, a transaction that had not yet been voted
on could sometimes be aborted with the incorrect status
RTR_STS_COMSTAUNO when its server exited in
mid-transaction.
+ 14-1-805 Attempt to create a partition that already
exists returns incorrect status
An attempt to create a partition that already existed
used to return the error KRINUSE (key range in use).
This has been superseded by the more explicit PRTALREXI
(partition already exists).
+ 14-1-813 MONITOR SYSTEM shows WARNING on calls due to
invalid time call
The MONITOR SYSTEM monitor picture would sometimes
incorrectly display a warning state for the CALL row.
+ 14-1-841 Replayed shadow transaction stuck in VOTED
The implementation of RTR's cooperative recovery
protocol algorithm has been enhanced so that some
situations which would previously hang during permanent
network link outages are now recovered correctly using
the remaining connections.
+ 14-1-842 Transactions which do not specify a timeout
abort
If a frontend failed over to another router, then failed
back to the original router, transactions which were in
progress could sometimes be rejected with status
RTR_STS_TXTIMOUT.
+ 14-1-845 Transactions remain in VOTED state
In earlier versions of RTR it was occasionally possible
for transactions to remain hanging in VOTED state on a
shadow primary backend and be aborted with status
RTR_STS_FELINLOS on the secondary backend after network link
failures in a slow WAN environment.
+ 14-1-846 Transactions remain in SENDING state
In previous versions of RTR it was occasionally possible
for a transaction to remain hanging in SENDING state
on a backend after a network partition had forced the
backend to lose quorum.
+ 14-5-111 Certain RTR commands now recorded in the RTR
operator log
Operator log files created by previous versions of
RTR could sometimes be difficult to interpret. By
recording certain RTR commands, such as START RTR and
CREATE FACILITY, the RTR log file has become easier to
interpret.
+ 14-5-156 Logging partition state transitions
insufficient
Previous versions of RTR did not report backend
partition state transitions in the operator log file
with sufficient detail.
Backend partition state transitions are now reported as
follows:
- Previously unlogged state transitions are recorded in
the operator log with the new PRTSTATRA message.
- The PRTBEGIN message is no longer generated.
- The PRTCREATED and PRTEND message formats have been
changed to match that of the PRTSTATRA messages.
+ 14-8-267 V3.1D-to-V3.2 Journal incompatibility corrected
If you upgrade from V3.1D to earlier versions of V3.2,
it was possible to encounter situations which caused the
RTR ACP to crash.
+ 14-8-287 Named partition state change caused crash
When using the CREATE PARTITION command, it was possible
for RTR to crash on the backend node if the last channel
using the partition is closed at the same instant that a
state-change message from the router is pending.
o The following correction has been made in RTR V3.2, ECO2
for all platforms.
+ 14-8-268 ACP crash after death of concurrent server
Under rare circumstances, after the death of a
concurrent server, RTR would try to reschedule a
transaction in commit state resulting in an RTR ACP
crash. This bug was present in RTR V3.2 and V3.2 ECO1.
o The following changes and corrections have been made for
RTR V3.2, ECO1 for all platforms.
+ 14-1-50 Looping RTR process for empty node string, e.g.,
/NODE=dna.
Specifying an incomplete node specification, such as
one with only the protocol prefix, e.g., "RTR SHOW
RTR /NODE=dna." could cause the RTR process to loop,
consuming CPU.
+ 14-1-433 Show transactions not recovered on link
break/reconnect
If a secondary shadow backend lost its link to the RTR
router after the router had sent a vote request, and the
server on the primary shadow accepted the transaction,
then in unusual circumstances it was possible that
the transaction would not be immediately recovered on
the secondary shadow after the link to the router was
re-established. In such cases it required a cycle of
the servers on the secondary site for the remembered
transaction to be recovered from the primary shadow
journal.
+ 14-1-582 ACP access violation
If a number of concurrent servers died in sequence
while processing the same transaction, then under rare
circumstances it was possible the ACP could also abort.
This was due to a counter being incremented incorrectly.
+ 14-1-617 Problems with DUMP JOURNAL
In previous versions of RTR, qualifiers which required a
value did not generate an error if the value was not
supplied or was supplied incorrectly. Incorrect or
missing values now generate an error message.
If a string of fewer than five characters was passed
for partition record class, the partition record counter
was not updated and the record was not available. These
problems have been fixed by comparing each character
instead of five characters at a time.
+ 14-1-760 ACP crashed when modifying journal size
After a journal had been modified, the Flow Control
subsystem of RTR was not properly updated with the new
size. This could cause a hang or crash situation even
though the journal size was increased to accommodate
increased traffic.
+ 14-1-763 rtr_close_channel fails for distributed
transaction
Calling rtr_close_channel while a distributed
transaction was pending caused an incorrect status to
be returned.
+ 14-1-772 CALL CLOSE_CHANNEL defaults to IMMEDIATE
The flag RTR_F_CLO_IMMEDIATE is a new flag added in RTR
V3.2 that allows the caller to close a server channel
without acknowledging the transaction on the channel.
By default, the flag is not set when calling the
rtr_close_channel API. However, the /IMMEDIATE qualifier
is implicitly present in the RTR CLI version of the API
(rtr call rtr_close_channel).
Because this is incompatible with the behavior of
previous versions of RTR, functionality has been
restored to the same as before V3.2. When using the
CLI version of the API (rtr call rtr_close_channel),
/NOIMMEDIATE is now the default.
+ 14-1-774 TOOMANCHA and distributed transaction left open
after rtr_open_channel failure
If rtr_open_channel failed after the RTR ACP had been
stopped, then that channel remained available for a
subsequent open. The application could eventually run
out of channels and return RTR_STS_TOOMANCHA.
Now if rtr_open_channel fails after a distributed
transaction has been opened, the distributed transaction
is always closed.
+ 14-1-777 Transaction state is not getting EXCEPTION
after issuing rtr_close/imme
SET PARTITION /RECOVERY_RETRY_COUNT is new functionality
implemented in RTR V3.2. The scope of this command was
not fully documented, and is clarified here.
If an application server dies while processing a
transaction recovered from RTR journal, then RTR will
present the transaction to another (concurrent or
standby) server. The RECOVERY_RETRY_COUNT value indicates
the maximum number of times the transaction should be
presented to a server for recovery before being written
to the journal as an exception.
There are two types of recovery operations where
transactions are recovered from journal: local recovery
and shadow recovery. Shadow recovery is the process
of recovering the remembered transactions written to a
primary shadow journal while the secondary shadow site
is down.
The SET PARTITION /RECOVERY_RETRY_COUNT parameter
does not have an effect on remembered transactions
recovered during shadow recovery. That is, if there
is a killer transaction remembered in the journal on
a primary shadow node, on this node RTR does not count
the number of times the transaction is recovered by
a recovering secondary shadow node. To ensure that a
remembered transaction will be exceptioned by RTR, you
must start a sufficient number of concurrent servers on
the recovering secondary shadow node.
For this reason, RTR recommends that the number of
concurrent secondary shadow servers started be greater
than the value set for RECOVERY_RETRY_COUNT on a
partition. This will ensure that a remembered (killer)
transaction being recovered from a primary shadow
journal will be exceptioned if the retry limit is
exceeded.
Only those transactions that have reached the voting
stage on a server can be exceptioned. If a server always
dies before voting on a transaction, then the transaction
will be aborted by RTR after the third try. This is
a hard-coded limit (the so-called "three strikes and
you're out" feature).
+ 14-1-791 Backends incorrectly remain inquorate after
routers trimmed
In versions V3.1D, ECO14, and V3.2 of RTR it was
sometimes possible for nodes to incorrectly remain
inquorate following a TRIM FACILITY operation.
+ 14-1-792 Revised rtrreq.c and rtrsrv.c sample RTR
applications
The sample client and server used in the IVP have been
extensively revised. Please pay special attention to the
comments which explain how to write a wakeup handler,
and comments drawing attention to several common
programming mistakes we have seen in RTR applications.
+ 14-3-291 SHOW SERVER truncates shd_rec_icpl to
shd_rec_ic
Some of the values previously truncated by the brief
SHOW SERVER command are now displayed more fully.
+ 14-3-298 Application may crash if invoked before RTR
after a reboot
Normally the RTR executable must have been invoked at
least once since reboot before an RTR application can
be started. If an RTR application is invoked first, the
first RTR API call now always returns RTRNOTSTA, RTR not
started.
+ 14-7-420 IOS tid on IP only nodes is not unique
Using previous versions of RTR, if you ran client
applications that used the RTR V2 API on systems that
had DECnet disabled, then there was a remote possibility
that the same transaction identifier could be generated
on two such systems, if RTR was started on both systems
within milliseconds of each other.
+ 14-8-215 Faster loading of large journals on first
CREATE FACILITY
RTR now takes much less time to load journals containing
a large number of journaled transactions.
+ 14-8-257 The broadcast message was not delivered from BE
to client
If a frontend lost the connection to its original
router, and was the first frontend to connect to the
router it failed over to, then the frontend could stop
receiving broadcasts. Further, backends could also fail
to receive broadcasts delivered by routers added to a
facility after the server applications had started.
+ 14-8-262 RTR has both backends as primary for some
transactions (STR#1885690)
In a partitioned network situation (when each of two
routers has access to only half of the backend nodes),
RTR will choose the router with the lower network
address as the one that remains or becomes active. In
previous versions of RTR, this would sometimes result in
both sets of backends becoming active, due to a problem
with the network ID comparison algorithm.
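The selection rule described above (the router with the lower network
address remains or becomes active) can be sketched as follows. This is an
illustration only, assuming IPv4 addresses compared numerically; it is not
RTR's actual comparison algorithm, and pick_active_router is a
hypothetical name.

```python
import ipaddress


def pick_active_router(addresses):
    """Return the address that is lowest when compared numerically.

    Assumption: all addresses are IPv4 dotted-quad strings.
    """
    return min(addresses, key=lambda a: int(ipaddress.ip_address(a)))
```

Comparing the numeric value rather than the text matters in a sketch like
this: as plain strings, "10.0.0.10" sorts before "10.0.0.9", whereas
numerically "10.0.0.9" is the lower address. Whatever rule is used, every
node must apply it identically so that both halves of the network reach
the same answer.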
o The following changes and corrections have been made for
RTR V3.2, ECO1 for the Compaq Tru64 UNIX platform.
+ 14-3-190 Signals blocked in unthreaded UNIX applications
during RTR API calls
RTR now enables the usual termination signals during RTR
API calls. For example, an idle server RTR application
waiting in rtr_receive_message with no timeout will now
respond to Control-C.
+ 14-3-300 Terminated RTR application process that used
fork is still shown by RTR
RTR applications now have FD_CLOEXEC set for the IPC
sockets used to communicate with the RTR ACP, so that
these do not remain open in a child process after fork
and exec even after the parent process has terminated.
This means that the RTR ACP now notices when the parent
exits, and will not accumulate a wait queue of broadcast
messages or delay failover. The terminated process no
longer appears in RTR SHOW PROCESS.
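For readers unfamiliar with FD_CLOEXEC, the following Python sketch shows
the general technique of marking a descriptor close-on-exec with fcntl.
It is not RTR code: the socket pair merely stands in for an IPC socket,
and the helper names are hypothetical.

```python
import fcntl
import os
import socket


def set_cloexec(fd):
    """Set FD_CLOEXEC so the descriptor is closed when exec() is called."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def has_cloexec(fd):
    """Return True if FD_CLOEXEC is currently set on the descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


# Stand-in for an IPC socket (assumption: any socket will do here).
left, right = socket.socketpair()
# Python creates descriptors non-inheritable by default, so clear the
# flag first to mimic a descriptor that would leak into a child.
os.set_inheritable(left.fileno(), True)
leaked_before = not has_cloexec(left.fileno())
set_cloexec(left.fileno())
protected_after = has_cloexec(left.fileno())
left.close()
right.close()
```

With the flag set, a child created by fork and exec no longer holds the
socket open, so the peer sees the connection close as soon as the parent
exits.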
+ 14-7-952 BADROWCOL and escape sequences visible on dumb
or unknown terminal
The default VT100-style terminal escape sequences can
now be completely suppressed with a suitable TERMCAP
environment variable setting. It is still necessary
to set a non-zero window size to avoid BADROWCOL, for
example:
stty rows 48 cols 120
TERM=dumb
TERMCAP="dumb:cm=:do=:le=:nd=:up=:ks=:ke=:cl=:ce=:\
ho=:mb=:md=:mr=:us=:ue=:me=:cr=:bl=:"
export TERM TERMCAP
This is particularly useful when running RTR in an Emacs
shell window, and gives reasonably clean output for all
RTR commands except MONITOR.
Known Problems with Workarounds:
o The following restrictions were described in previous release notes
and are still applicable to RTR.
+ 14-8-318 Maximum journal size
The RTR journal file is limited to 524287 blocks on a
single disk. If you want to create a larger journal
file, you have to distribute the journal across more
than one disk.
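As a rough illustration of this limit, the sketch below computes how many
disks a journal of a given size would need. It assumes 512-byte blocks
(the ratio implied by the kit size quoted at the top of this summary);
the function and constant names are hypothetical.

```python
MAX_JOURNAL_BLOCKS_PER_DISK = 524287  # limit stated in this note
BLOCK_SIZE_BYTES = 512                # assumption: 512-byte blocks


def disks_needed(total_blocks):
    """Minimum number of disks for a journal of `total_blocks` blocks."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    # Integer ceiling division avoids floating-point rounding.
    per_disk = MAX_JOURNAL_BLOCKS_PER_DISK
    return (total_blocks + per_disk - 1) // per_disk


max_bytes_per_disk = MAX_JOURNAL_BLOCKS_PER_DISK * BLOCK_SIZE_BYTES
```

Under the 512-byte assumption, the per-disk limit works out to just under
256 MB.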
+ 14-1-39 Declaring exit handlers in RTR applications
If an exit handler contains calls to RTR, then the exit
handler must be declared after the first call to RTR.
Using the RTR V2 or V3 API, if the exit handler is
declared before the first call to RTR, then any call to
RTR made within the exit handler will return an error.
Under the V3 API, the error status returned is
RTR_STS_INVCHANNEL. Under the V2 API, the error status
returned is RTR$_INVALCH.
+ 14-1-103 Using rtr_set_wakeup() in a threaded program
After calling rtr_set_wakeup() in a threaded program,
you should also call rtr_set_wakeup(NULL) wherever your
program can exit. This will prevent any wakeups in other
threads while the main thread is already running the
RTR exit handler, which could lead to a server core dump
when trying to stop the server.
+ 14-1-263 Non-English character sets are not supported
for identifiers
The supported character set for RTR identifiers such as
facility names is ASCII, with lowercase and uppercase
letters equivalent.
Eight bit characters are not supported because the
name might not interoperate with RTR processes using
a different locale or running another RTR version.
+ 14-1-419 SPUJOUFIL advice to CREATE JOURNAL/SUPERSEDE is
dangerous
If the operator copies journal files, or copies disks
containing journal files, without first remounting the
source disk read-only, then the copies are spurious: RTR
sees duplicate journal files that it did not create. RTR
then displays the SPUJOUFIL message, which advises the
operator to use CREATE JOURNAL/SUPERSEDE (destroying the
original and all copies of the journal files, and all
the transactions contained in them on that node) and
then to submit an SPR for something that is not in fact
an RTR problem.
This is not the correct action in situations like this.
The operator should examine the log file, which shows
the duplicate filenames, and then move any unwanted
duplicate copies of journal files to anywhere other than
a rtrjnl directory at the top level of a writable disk
file system visible to RTR, and then try again.
Only if SPUJOUFIL is caused by circumstances other
than operator intervention should the operator
consider making backup copies of the journal files,
and only then abandoning the existing journal files
and any transactions contained in them by using DELETE
JOURNAL and CREATE JOURNAL, or the equivalent CREATE
JOURNAL/SUPERSEDE.
+ 14-1-455 Last line of batch procedure sometimes ignored
The last line of a batch procedure or command file
must explicitly end with a line terminator, added by
pressing the Enter/Return key when creating the
procedure. Without the explicit line terminator, RTR
ignores the line. The workaround is to add a comment to
the end of the file or to explicitly add a line
terminator to the end of the last line of the batch
procedure.
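A command file can be checked for this condition before it is run. The
Python sketch below is a hypothetical helper, not an RTR tool; it assumes
UNIX line endings, where the line terminator is a single newline
character.

```python
import os
import tempfile


def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    with open(path, "rb") as f:
        data = f.read()
    return data == b"" or data.endswith(b"\n")


def ensure_trailing_newline(path):
    """Append a newline if the last line is not terminated."""
    if not ends_with_newline(path):
        with open(path, "ab") as f:
            f.write(b"\n")


# Demonstration on a temporary command file whose last line is unterminated.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"SHOW RTR")
demo_path = tmp.name
was_terminated = ends_with_newline(demo_path)
ensure_trailing_newline(demo_path)
now_terminated = ends_with_newline(demo_path)
os.remove(demo_path)
```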
+ 14-1-462 MODIFY JOURNAL with list of devices does not
give individual error messages
MODIFY JOURNAL now lists only the devices that were
successfully modified. If some disk devices cannot be
modified because they do not contain a journal file,
nothing is reported for those devices.
The workaround is to identify the omitted devices by
comparing the command parameters and the messages,
or modify them one device at a time. Verify the
modification with SHOW JOURNAL /FILES /FULL.
+ 14-1-604 Combined local and remote SET MODE/GROUP
The following combined local and remote command does not
perform the remote part of the command:
RTR> SET MODE/GROUP=newgroup/NODE=(THISNODE,ANOTHERNODE)
After recording a different new group or no group
setting in shared memory the comserver has to disconnect
itself, and does so immediately. When the comserver is
disconnected the SRVDISCON message is shown, and the
remote part of this command, together with any other
pending commands such as those issued by the same user
in other windows, is aborted immediately.
A workaround is to issue the local SET MODE/GROUP
command separately first.
+ 14-1-681 ACP crash in RDM subsystem
It has been observed that with a large number of
concurrently active transactions, where each transaction
sends back a large number of very large replies, it is
possible to exhaust the virtual memory that RTR needs
to store the replies for possible recovery after a link
glitch. This would cause RTR to crash on a backend node.
The workaround is to reduce the number of
concurrent server channels, so as to limit the number
of concurrently active transactions on each backend.
Another possibility would be to limit the number of
replies per transaction. In our tests we were able to
exceed the RTR limits using 10 servers, each sending
back 200 replies of 64KB each.
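The figures quoted above can be turned into a rough estimate of the reply
data involved. The sketch below is back-of-envelope arithmetic only, not
RTR's memory accounting: it assumes one active transaction per server,
that every reply is retained, and that "64KB" means 64 * 1024 bytes. The
function name is hypothetical.

```python
KB = 1024  # assumption: "64KB" means 64 * 1024 bytes


def reply_bytes(servers, replies_per_tx, reply_size_bytes):
    """Bytes of reply data held if each server has one active transaction."""
    return servers * replies_per_tx * reply_size_bytes


# The test configuration described in this note.
total = reply_bytes(servers=10, replies_per_tx=200, reply_size_bytes=64 * KB)
total_mb = total / (KB * KB)
```

Reducing either the number of concurrent server channels or the number
of replies per transaction lowers this product proportionally.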
+ 14-1-769 System time must not be set backwards
Correct operation of RTR is not guaranteed if the time
is not monotonic.
When performing Year 2000 testing, stop all RTR
processes and remove any journalled transactions before
setting the clock back.
When configuring the Network Time Protocol daemon ntp,
do not select, or accept by default, any options which
may result in the system clock being set backwards.
While RTR is running as a service on NT, do not allow
the time to be set backwards by ntp or in login scripts.
On OpenVMS systems, do not change to a different time
zone, or to or from Summer Time or Daylight Saving
Time, while using current versions of RTR which are
built on OpenVMS V6 to run on both OpenVMS V6 and V7. On
most UNIX systems it is in any case not recommended that
you change the date and time when running in multi-user
mode. Adjustments are normally made by speeding up or
slowing down the clock.
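A monitoring script could watch for backward clock steps by sampling the
system time periodically and checking that the samples never decrease.
The helper below is a hypothetical Python sketch of that check; it is not
provided by RTR, and how the samples are collected is left to the caller.

```python
def first_backward_step(timestamps):
    """Return the index of the first sample that is earlier than its
    predecessor, or None if the sequence never goes backwards."""
    for i in range(1, len(timestamps)):
        if timestamps[i] < timestamps[i - 1]:
            return i
    return None
```

For example, samples taken with time.time() at regular intervals could be
appended to a list and passed to first_backward_step() after each reading.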
+ 14-1-948 Use of /MAXIMUM_BLOCKS qualifier
Use of the /MAXIMUM_BLOCKS qualifier on the CREATE
JOURNAL or MODIFY JOURNAL commands may cause RTR to
crash. Please use the /BLOCKS qualifier instead.
+ 14-3-33 Partition limit per facility now 500
Previous releases supported a maximum of 100
partitions per facility. The current release increases
this limit to 500, which cannot be extended further.
+ 14-3-50 Maximum number of application processes limit
An ACP crash that occurred when starting the last of a
great many applications has been corrected.
When the process open file limit is reached, the
application will now generally report ACPNOTVIA, "RTR
ACP is no longer a viable entity, restart RTR". In
actual fact the ACP continues to operate with all
previously connected processes, and only the new
rejected process thinks that the RTR ACP is not alive.
This message should be interpreted as "ACPINSRES, The
RTR ACP has insufficient resources".
Please ensure that your system is configured with
sufficient default per-process resources, or that
the ACP process is started with increased resource
limits. Allow at least one open file for each additional
application process, and at least one for each link.
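The sizing guideline above can be checked against the open file limit of
a process. The Python sketch below is illustrative only: it reads the
limit of the process running the script (not of the RTR ACP itself), and
the function names and the `reserved` parameter are hypothetical.

```python
import resource


def open_files_needed(app_processes, links):
    """One open file per application process plus one per link."""
    return app_processes + links


def fits_in_soft_limit(app_processes, links, reserved=0):
    """Compare the estimate with this process's soft open-file limit.

    `reserved` is an assumed allowance for descriptors used for other
    purposes (log files, journals, and so on).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return open_files_needed(app_processes, links) + reserved <= soft
```

The estimate is a lower bound, since the guideline says "at least" one
open file per process and per link.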
+ 14-7-24 Transaction size limits
The number of bytes in any application message (that
is, a message sent with the rtr_send_to_server(),
rtr_reply_to_client() or rtr_broadcast_event() routines)
is currently restricted to 64000.
The number of messages sent (that is, using
rtr_send_to_server()) in any single transaction is
limited to 65534.
There is no fixed limit on the number of replies (that
is, sent with rtr_reply_to_client()) in any single
transaction.
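An application could validate a transaction against these limits before
sending it. The Python sketch below models only the client-side sends
(the messages that would go through rtr_send_to_server()); it does not
call RTR, and the function and constant names are hypothetical.

```python
MAX_MESSAGE_BYTES = 64000             # per-message limit stated above
MAX_MESSAGES_PER_TRANSACTION = 65534  # per-transaction send limit stated above


def check_transaction(message_sizes):
    """Return a list of problems; the list is empty if all limits are met.

    `message_sizes` holds the size in bytes of each message the client
    intends to send in one transaction.
    """
    problems = []
    if len(message_sizes) > MAX_MESSAGES_PER_TRANSACTION:
        problems.append("too many messages: %d" % len(message_sizes))
    for i, size in enumerate(message_sizes):
        if size > MAX_MESSAGE_BYTES:
            problems.append("message %d is %d bytes" % (i, size))
    return problems
```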
Problems Addressed in the RTR320_266 Kit:
The following changes and corrections have been made for RTR V3.2, ECO7
for all platforms.
o 14-3-252 ACP failed with insufficient memory
Previous versions of RTR could allocate too much memory
for replay queues. When insufficient virtual memory was
available, this behavior could cause the ACP to fail.
RTR V2 compatible functionality has been reintroduced, in
that a replay queue that grows too large is now discarded.
o 14-3-270 ACP crashed with INSVIRMEM while trying to
abort a very large transaction
This problem was caused by overlarge transaction replay
queues and has been corrected.
o 14-3-340 Standby partition fails to take over on network
partitioning
Previously, it was possible that a standby partition
would fail to take over if network partitioning
occurred.
The following changes and corrections have been made for RTR V3.2, ECO6
for all platforms.
o 14-1-929 Unable to suspend a partition on a secondary site
It was possible to encounter a situation where you could not
suspend transaction presentation on a partition on a secondary
site. If you gave the suspend command at a time when the
secondary site only had transactions awaiting assignment of
a server channel, then the command would not complete,
leaving transaction presentation in the 'suspending' state.
o 14-1-940 Partition switching between standby and active
State instability could result if instances of a partition
were assigned an equal priority. This could happen through
use of SET PARTITION commands, or as a result of network
partitioning.
o 14-5-157 Improved diagnostics
Improved diagnostics have been added for testing
network-related problems.
o 14-8-304 Aborted transaction caused the RTR ACP to crash
A complex series of interactions could lead to an
aborted transaction causing the RTR ACP to crash on
another node where a standby server was being started.
o 14-8-309 Partition switching between primary and standby
Temporary partition state instability could occur during
certain network failures causing application servers
that subscribe to SRSTANDBY and SRPRIMARY events to
receive repeated events in rapid sequence.
o 14-8-310 Transaction on the front-end hangs in the sending
state
It was possible for a front-end to hang in the sending
state when a front-end failed over to the secondary
router and then back to the preferred router.
The following changes and corrections have been made for RTR V3.2, ECO5
for all platforms.
o 14-1-864 MONITOR GROUP shows bad values in "server act"
and "vreq" fields
MONITOR GROUP previously showed invalid counts in the "act"
and "ack" columns if a secondary rejected a transaction.
o 14-1-865 Partition creation not adequately synchronised
with journal scan
Partitions are now created after the journal scans are
complete.
o 14-1-893 Partitioned Standby allowed to enter lcl_rec
followed by "dual active"
During periods of network instability, previous
versions of RTR could incorrectly allow roles to
become temporarily quorate. As a consequence, certain
unexpected state transitions could occur, for example
standby servers could become active.
o 14-5-156 Logging partition state transitions
Backend partition state transition logging contained
a bug which allowed state transitions to be recorded
incorrectly. This has been fixed. Note that this fix
addresses the accuracy of the log file entries; it
does not increase the number of entries in the RTR log
to cover those cases where a backend partition state
transition is not logged.
Problems Addressed in the RTR320_255 Kit:
The following changes and corrections have been made for RTR V3.2, ECO4
for all platforms.
o 14-1-743 Wrong return status RTR_STS_COMSTAUNO
In RTR V3.2 it was sometimes possible for a transaction to be aborted
with the incorrect status RTR_STS_COMSTAUNO when a server exited in
mid-transaction before voting on it.
o 14-1-805 Attempt to create a partition that already exists returns
incorrect status
An attempt to create a partition that already existed used to return
the error KRINUSE (key range in use). This has been superseded by
the more explicit PRTALREXI (partition already exists).
o 14-1-813 MONITOR SYSTEM shows WARNING on calls due to invalid time call
The MONITOR SYSTEM monitor picture would sometimes incorrectly
display a warning state for the CALL row.
o 14-1-841 Replayed shadow transaction stuck in VOTED
The implementation of RTR's cooperative recovery protocol algorithm
has been enhanced so that some situations which would previously
hang during permanent network link outages are now recovered
correctly using the remaining connections.
o 14-1-842 Transactions which do not specify a timeout abort
If a frontend failed over to another router, then failed back to the
original router, transactions which were in progress could sometimes
be rejected with status RTR_STS_TXTIMOUT.
o 14-1-845 Transactions remain in VOTED state
In earlier versions of RTR it was occasionally possible for
transactions to remain hanging in VOTED state on a shadow primary
backend and be aborted with status RTR_STS_FELINLOS on the secondary
backend after network link failures in a slow WAN environment.
o 14-1-846 Transactions remain in SENDING state
In previous versions of RTR it was occasionally possible for a
transaction to remain hanging in SENDING state on a backend after a
network partition had forced the backend to lose quorum.
o 14-5-111 Certain RTR commands now recorded in the RTR operator log
Operator log files created by previous versions of RTR could
sometimes be difficult to interpret. By recording certain RTR
commands, such as START RTR and CREATE FACILITY, the RTR log file
has become easier to interpret.
o 14-5-156 Logging partition state transitions insufficient
Previous versions of RTR did not report backend partition state
transitions in the operator log file with sufficient detail.
Backend partition state transitions are now reported as follows:
- Previously unlogged state transitions are recorded in the
operator log with the new PRTSTATRA message.
- The PRTBEGIN message is no longer generated.
- The PRTCREATED and PRTEND message formats have been changed to
match that of the PRTSTATRA messages.
o 14-8-267 V3.1D-to-V3.2 Journal incompatibility corrected
If you upgraded from V3.1D to earlier versions of V3.2, it was
possible to encounter situations that caused the RTR ACP to crash.
o 14-8-287 Named partition state change caused crash
When using the CREATE PARTITION command, it was possible for RTR to
crash on the backend node if the last channel using the partition was
closed at the same instant that a state-change message from the
router was pending.
Problems Addressed in the RTR320_249 Kit:
o 14-8-268 ACP crash after death of concurrent server
Under rare circumstances, after the death of a concurrent server,
RTR would try to reschedule a transaction in commit state resulting
in an RTR ACP crash. This bug was present in RTR V3.2 and V3.2 ECO1.
o 14-1-50 Looping RTR process for empty node string, e.g., /NODE=dna.
Specifying an incomplete node specification, such as one with only
the protocol prefix, e.g., "RTR SHOW RTR /NODE=dna." could cause the
RTR process to loop, consuming CPU.
o 14-1-433 Show transactions not recovered on link break/reconnect
If a secondary shadow backend lost its link to the RTR router after
the router had sent a vote request, and the server on the primary
shadow accepted the transaction, then in unusual circumstances it was
possible that the transaction would not be immediately recovered on
the secondary shadow after the link to the router was re-established.
In such cases it required a cycle of the servers on the secondary
site for the remembered transaction to be recovered from the primary
shadow journal.
o 14-1-582 ACP access violation
If a number of concurrent servers died in sequence while processing
the same transaction, then under rare circumstances it was possible
the ACP could also abort. This was due to a counter being
incremented incorrectly and has now been fixed.
o 14-1-617 Problems with DUMP JOURNAL
In previous versions of RTR, qualifiers which required a value did
not generate an error if the value was not supplied or was supplied
incorrectly. Incorrect or missing values now generate an error
message. If a string of less than five characters was passed for
partition record class, the partition record counter was not updated
and the record was not available. These problems have been fixed by
comparing each character instead of five characters at a time.
o 14-1-760 ACP crashed when modifying journal size
After a journal had been modified, the Flow Control subsystem of RTR
was not properly updated with the new size. This could result in a
hang or crash situation even though the journal size was increased
to accommodate increased traffic.
o 14-1-763 RTR_CLOSE_CHANNEL fails for distributed transaction
Calling RTR_CLOSE_CHANNEL while a distributed transaction was
pending caused an incorrect status to be returned.
o 14-1-772 CALL CLOSE_CHANNEL defaults to IMMEDIATE
The flag RTR_F_CLO_IMMEDIATE is a new flag added in RTR V3.2 that
allows the caller to close a server channel without acknowledging
the transaction on the channel. By default, the flag is not set
when calling the RTR_CLOSE_CHANNEL API. However, the /IMMEDIATE
qualifier is implicitly present in the RTR CLI version of the API
(RTR call RTR_CLOSE_CHANNEL).
Because this is incompatible with the behavior of previous versions
of RTR, functionality has been restored to the same as before V3.2.
When using the CLI version of the API (RTR call RTR_CLOSE_CHANNEL),
/NOIMMEDIATE is now the default.
o 14-1-774 TOOMANCHA and distributed transaction left open after
RTR_OPEN_CHANNEL failure
If RTR_OPEN_CHANNEL failed after the RTR ACP had been stopped, then
that channel remained unavailable for a subsequent open. The
application could eventually run out of channels and return
RTR_STS_TOOMANCHA.
Now if RTR_OPEN_CHANNEL fails after a distributed transaction has
been opened, the distributed transaction is always closed.
o 14-1-777 Transaction state is not getting EXCEPTION after issuing
RTR_CLOSE/IMME
SET PARTITION /RECOVERY_RETRY_COUNT is new functionality implemented
in RTR V3.2. The scope of this command was not fully documented, and
is clarified here.
If an application server dies while processing a transaction
recovered from the RTR journal, then RTR will present the transaction to
another (concurrent or standby) server. The RECOVERY_RETRY_LIMIT
indicates the maximum number of times the transaction should be
presented to a server for recovery before being written to the
journal as an exception.
There are two types of recovery operations where transactions are
recovered from journal: local recovery and shadow recovery. Shadow
recovery is the process of recovering the remembered transactions
written to a primary shadow journal while the secondary shadow site
is down.
The SET PARTITION /RECOVERY_RETRY_COUNT parameter does not have an
effect on remembered transactions recovered during shadow recovery.
That is, if there is a killer transaction remembered in the journal
on a primary shadow node, on this node RTR does not count the number
of times the transaction is recovered by a recovering secondary
shadow node. The way to ensure that a remembered transaction will
be exceptioned by RTR is by starting a sufficient number of
concurrent servers on the recovering secondary shadow node.
For this reason, RTR recommends that the number of concurrent
secondary shadow servers started is greater than the value set for
the RECOVERY_RETRY_LIMIT on a partition. This will ensure that a
remembered (killer) transaction being recovered from a primary
shadow journal will be exceptioned if the retry limit is exceeded.
Only those transactions that have reached voting stage on a server
can be exceptioned. If a server always dies before voting on a
transaction, then the transaction will be aborted by RTR after the
third try. This is a hard-coded limit (the so-called "three strikes
and you're out" feature).
o 14-1-791 Backends incorrectly remain inquorate after routers trimmed
In versions V3.1D - ECO14 and V3.2 of RTR it was sometimes possible
for nodes to erroneously remain inquorate following a TRIM FACILITY
operation.
o 14-1-792 Revised rtrreq.c and rtrsrv.c sample RTR applications
The sample client and server used in the IVP have been extensively
revised. Please pay special attention to the comments which explain
how to write a wakeup handler, and comments drawing attention to
several common programming mistakes we have seen in RTR applications.
o 14-3-291 SHOW SERVER truncates shd_rec_icpl to shd_rec_ic
Some of the values previously truncated by the brief SHOW SERVER
command are now displayed more fully.
o 14-3-298 Application may crash if invoked before RTR after a reboot
Normally the RTR executable must have been invoked at least once
since reboot before an RTR application can be started. If an RTR
application is invoked first, the first RTR API call now always
returns RTRNOTSTA, RTR not started.
o 4-7-420 IOS tid on IP only nodes is not unique
Using previous versions of RTR, if you ran client applications that
used the RTR V2 API on systems that had DECnet disabled, then there
was a remote possibility that the same transaction identifier could
be generated on two such systems if RTR was started on both systems
within milliseconds of each other.
o 14-8-215 Faster loading of large journals on first CREATE FACILITY
RTR now takes much less time to load journals containing a large
number of journaled transactions.
o 14-8-257 The broadcast message was not delivered from BE to client
If a frontend loses the connection to its original router, and is
the first frontend to connect to the router it fails over to, then
the frontend may stop receiving broadcasts. Further, backends could
also fail to receive broadcasts delivered by routers added to a
facility after the server applications have started.
o 14-8-262 RTR has both backends as primary for some transactions
(STR#1885690)
In a partitioned network situation (when each of two routers has
access to only half of the backend nodes), RTR will choose the
router with the lower network address as the one that remains or
becomes active. In previous versions of RTR, this would sometimes
result in both sets of backends becoming active, due to a problem
with the network ID comparison algorithm.
o 14-3-190 Signals blocked in unthreaded UNIX applications during RTR
API calls
RTR now enables the usual termination signals during RTR API calls.
For example, an idle server RTR application waiting in
rtr_receive_message with no timeout will now respond to Control-C.
o 14-3-300 Terminated RTR application process that used fork is still
shown by RTR
RTR applications now have FD_CLOEXEC set for the IPC sockets used to
communicate with the RTR ACP, so that these do not remain open in a
child process after fork and exec even after the parent process has
terminated.
This means that the RTR ACP now notices when the parent exits, and
will not accumulate a wait queue of broadcast messages or delay
failover. The terminated process no longer appears in RTR SHOW
PROCESS.
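The close-on-exec behavior described above can be illustrated with standard POSIX calls. The following is a minimal sketch, not RTR's own code: the helper names set_cloexec and is_cloexec are hypothetical, and the sketch only shows how a descriptor is marked so that it is closed automatically when a child process calls exec.

```c
#include <fcntl.h>

/* Hypothetical helper (not part of RTR): mark fd close-on-exec.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    /* Preserve any other descriptor flags that are already set. */
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Hypothetical helper: returns 1 if fd is close-on-exec, 0 if it is
 * not, and -1 on error. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

An application that opens its own descriptors before calling fork and exec could apply the same flag to them, so that the child does not keep them open.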
o 14-7-952 BADROWCOL and escape sequences visible on dumb or unknown
terminal
The default VT100-style terminal escape sequences can now be
completely suppressed with a suitable TERMCAP environment variable
setting. It is still necessary to set a non-zero window size to
avoid BADROWCOL, for example:
stty rows 48 cols 120
TERM=dumb
TERMCAP="dumb:cm=:do=:le=:nd=:up=:ks=:ke=:cl=:ce=:\
ho=:mb=:md=:mr=:us=:ue=:me=:cr=:bl=:"
export TERM TERMCAP
This is particularly useful when running RTR in an Emacs shell
window, and gives reasonably clean output for all RTR commands
except MONITOR.
Known Problems with Workarounds
The following restrictions were described in previous release notes and
are still applicable to RTR.
o 14-1-39 Declaring exit handlers in RTR applications
If an exit handler contains calls to RTR, then the exit handler must
be declared after the first call to RTR.
Using the RTR V2 or V3 API, if the exit handler is declared before
the first call to RTR, then any call to RTR made within the exit
handler will return an error. Under the V3 API, the error status
returned is RTR_STS_INVCHANNEL. Under the V2 API, the error status
returned is RTR$_INVALCH.
o 14-1-103 Using RTR_SET_WAKEUP() in a threaded program
After calling RTR_SET_WAKEUP() in a threaded program, you should
also call RTR_SET_WAKEUP(NULL) wherever your program can exit. This
will prevent any wakeup in other threads while the main thread is
already running the RTR exit handler, which could lead to a server
core dump when trying to stop the server.
o 14-1-263 Non-English character sets are not supported for identifiers
The supported character set for RTR identifiers such as facility
names is ASCII, with lowercase and uppercase letters equivalent.
Eight-bit characters are not supported because the name might not
be interpreted correctly by RTR processes using a different locale or
running another RTR version.
o 14-1-419 SPUJOUFIL advice to CREATE JOURNAL/SUPERSEDE is dangerous
If the operator copies journal files or copies disks containing
journal files without first remounting the source disk read-only,
then the copies are spurious: RTR sees duplicates that it did not
create. RTR then displays the SPUJOUFIL message, which advises the
operator to use CREATE JOURNAL/SUPERSEDE to destroy the original and
all copies of the journal files, and all the transactions contained
in them on that node, and then submit an SPR for something that is
not in fact an RTR problem.
This is not the correct action in situations like this.
The operator should examine the log file, which shows the duplicate
filenames, and then move any unwanted duplicate copies of journal
files to anywhere other than an RTRJNL directory at the top level of
a writable disk file system visible to RTR, and then try again.
Only if SPUJOUFIL is caused by circumstances other than operator
intervention should the operator consider making backup copies of
the journal files, and only then abandoning the existing journal
files and any transactions contained in them by using DELETE JOURNAL
and CREATE JOURNAL, or the equivalent CREATE JOURNAL/SUPERSEDE.
o 14-1-455 Last line of batch procedure sometimes ignored
The last line of a batch procedure or command file must explicitly
end with a line terminator, added by pressing the Enter/Return key
when creating the procedure. Without the explicit line terminator,
RTR ignores the line.
The workaround is to add a comment to the end of the file or to
explicitly add a line terminator to the end of the last line of the
batch procedure.
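A command file can be checked for this condition before it is handed to RTR. The following is a minimal sketch under stated assumptions: the function name ends_with_newline is hypothetical, and it assumes that the required line terminator is the newline character, as on UNIX systems.

```c
#include <stdio.h>

/* Hypothetical check (not part of RTR): returns 1 if the file at path
 * ends with a newline, 0 if it does not (an empty file also gives 0),
 * and -1 if the file cannot be opened or read. */
int ends_with_newline(const char *path)
{
    FILE *fp = fopen(path, "rb");
    int last = EOF;   /* last character seen so far */
    int c;

    if (fp == NULL)
        return -1;
    while ((c = fgetc(fp)) != EOF)
        last = c;
    if (ferror(fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return last == '\n' ? 1 : 0;
}
```

A return value of 0 indicates that the last line of the file would be ignored.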
o 14-1-462 MODIFY JOURNAL with list of devices does not give
individual error messages
MODIFY JOURNAL now lists only the devices that were successfully
modified. If some disk devices cannot be modified because they do
not contain a journal file, then nothing at all is reported for
those devices.
The workaround is to identify the omitted devices by comparing the
command parameters and the messages, or modify them one device at a
time. Verify the modification with SHOW JOURNAL /FILES /FULL.
o 14-1-604 Combined local and remote SET MODE/GROUP
The following combined local and remote command does not perform the
remote part of the command:
RTR> SET MODE/GROUP=newgroup/NODE=(THISNODE,ANOTHERNODE)
After recording a different new group or no group setting in shared
memory the comserver has to disconnect itself, and does so
immediately. When the comserver is disconnected the SRVDISCON
message is shown, and the remote part of this command, together with
any other pending commands such as those issued by the same user in
other windows, is aborted immediately.
A workaround is to issue the local SET MODE/GROUP command separately
first.
o 14-1-681 ACP crash in RDM subsystem
It has been observed that with a large number of concurrently active
transactions, where each transaction sends back a large number of
very large replies, it is possible to exhaust the virtual memory
that RTR needs in order to store the replies for possible
recovery after a link glitch. This would cause RTR to crash on a
backend node. The workaround is to reduce the number of concurrent
server channels, so as to limit the number of concurrently active
transactions on each backend. Another possibility would be to limit
the number of replies per transaction. In our tests we were able to
exceed the RTR limits using 10 servers, each sending back 200
replies of 64KB each.
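The volume of reply data in the test above can be estimated with simple arithmetic. The following is a rough sketch only: the function name reply_bytes is hypothetical, it assumes one active transaction per server, and it ignores any per-reply overhead inside RTR.

```c
/* Hypothetical estimate (not an RTR routine): bytes of reply data that
 * may have to be retained if every server has one active transaction
 * and every reply is kept for possible recovery. */
unsigned long reply_bytes(unsigned long servers,
                          unsigned long replies_per_transaction,
                          unsigned long bytes_per_reply)
{
    return servers * replies_per_transaction * bytes_per_reply;
}
```

For example, reply_bytes(10, 200, 65536) gives 131072000 bytes, or 125 MB, for the configuration quoted above if 64KB is taken as 65536 bytes. Halving either the number of servers or the number of replies halves the estimate.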
o 14-1-769 System time must not be set backwards
Correct operation of RTR is not guaranteed if the time is not
monotonic.
When performing Year 2000 testing, stop all RTR processes and remove
any journaled transactions before setting the clock back.
When configuring the Network Time Protocol daemon ntp, do not
select, or accept by default, any options which may result in the
system clock being set backwards.
While RTR is running as a service on NT, do not allow the time to be
set backwards by ntp or in login scripts.
On OpenVMS systems, do not change to a different time zone, or to or
from Summer Time or Daylight Savings Time, while using current
versions of RTR which are built on OpenVMS V6 to run on both OpenVMS
V6 and V7. On most UNIX systems it is in any case not recommended
that you change the date and time when running in multi-user mode.
Adjustments are normally made by speeding up or slowing down the
clock.
o 14-3-33 Partition limit per facility now 500
The previous releases supported only up to 100 partitions per
facility. The current release increases this to 500 but is not
extendable beyond that.
o 14-3-50 Maximum number of application processes limit
An ACP crash that occurred when starting the last of a great many
applications has been corrected.
When the process open file limit is reached, the application will
now generally report ACPNOTVIA, "RTR ACP is no longer a viable
entity, restart RTR". In actual fact the ACP continues to operate
with all previously connected processes, and only the new rejected
process thinks that the RTR ACP is not alive. This message should
be interpreted as "ACPINSRES, the RTR ACP has insufficient resources".
Please ensure that your system is configured with sufficient default
per-process resources, or that the ACP process is started with
increased resource limits. Allow at least one open file for each
additional application process, and at least one for each link.
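The sizing rule above can be checked against the current per-process limit with the standard getrlimit call. The following is a minimal sketch: the function names and the reserve parameter (descriptors set aside for other uses, such as log files) are hypothetical and are not part of RTR.

```c
#include <sys/resource.h>

/* Hypothetical sizing rule taken from the text above: at least one
 * open file per application process and one per link. */
unsigned long fds_needed(unsigned long app_processes, unsigned long links)
{
    return app_processes + links;
}

/* Hypothetical check: returns 1 if the soft open-file limit of the
 * calling process covers needed + reserve descriptors, 0 if it does
 * not, and -1 if the limit cannot be read. */
int nofile_limit_sufficient(unsigned long needed, unsigned long reserve)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return rl.rlim_cur >= (rlim_t)(needed + reserve) ? 1 : 0;
}
```

Because the limit is inherited, the check reflects what the ACP will see only when it is run in the environment from which RTR is started.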
o 14-3-207 Client application in questionable flow control when RTR
journal fills
There is a known issue with flow control when the journal starts
filling up. There is a race condition where, if the client can send
more data than can be placed in the journal before flow control
kicks in, then the transaction is aborted with the correct error
notification. However, if flow control kicks in first, then a
deadlock occurs where the journal space never frees up and hence RTR
does not allow the client to proceed with the transaction. There
are two workarounds: either specify a timeout with the transaction,
or increase the size of the journal.
o 14-3-253 Restrictions on the RTR wakeup handler
The use of RTR_REPLY_TO_CLIENT, RTR_SEND_TO_SERVER, or
RTR_BROADCAST_EVENT in an RTR wakeup handler is not recommended.
They may block when they need transaction ids or flow control. This
will cause undesired behavior.
Functions permitted in an RTR_SET_WAKEUP() handler:
- In an RTR wakeup handler in an AST in an unthreaded OpenVMS
application, the use of RTR_REPLY_TO_CLIENT(), RTR_SEND_TO_SERVER(),
RTR_BROADCAST_EVENT(), or RTR_RECEIVE_MESSAGE() with a non-zero
timeout is not recommended. They may block when they need
transaction ids or flow control, which will cause the whole
application to hang until the wakeup completes.
- The same rules apply in an RTR wakeup handler in a threaded
application. Note that wakeups are unnecessary in a threaded
paradigm, but they may be used in common code in applications
that also need to run on OpenVMS. Please note that your
mainline code continues to run while your wakeup is executing,
so extra synchronization may be required. Also note that if
the wakeup does block then it does not generally hang the whole
application.
- In an RTR wakeup handler in a signal in an unthreaded UNIX
application, no RTR API functions and only the very few
async-signal-safe system and library functions may be called, because
the wakeup is performed in a signal handler context. An
application can write to a pipe or access a volatile
sig_atomic_t variable, but using malloc() and printf(), for
example, will cause unexpected failures. Alternatively, on
most UNIX platforms, you can compile and link the application
as a threaded application with the reentrant RTR shared library
-lrtr_r.
For maximum portability, the wakeup handler should do the minimum
necessary to wake up the mainline event loop. You should assume
that mainline code and other threads might continue to run in
parallel with the wakeup, especially on machines with more than one
CPU.
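The advice to write to a pipe from the wakeup handler can be sketched as follows. This is an illustration under stated assumptions, not code from the RTR sample applications: all function names are hypothetical, registration of wake_handler with RTR_SET_WAKEUP() is not shown, and the mainline is assumed to wait on the read end of the pipe with select() or poll().

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int wake_pipe[2] = { -1, -1 };

/* Create the pipe and make both ends non-blocking, so that neither
 * the handler nor the mainline can stall. Returns 0 or -1. */
int wake_init(void)
{
    int i, flags;

    if (pipe(wake_pipe) != 0)
        return -1;
    for (i = 0; i < 2; i++) {
        flags = fcntl(wake_pipe[i], F_GETFL);
        if (flags == -1 ||
            fcntl(wake_pipe[i], F_SETFL, flags | O_NONBLOCK) == -1)
            return -1;
    }
    return 0;
}

/* Descriptor for the mainline to include in its select() or poll(). */
int wake_fd(void)
{
    return wake_pipe[0];
}

/* Wakeup handler: makes no RTR calls and uses only write(), which is
 * async-signal-safe. errno is saved and restored for the mainline. */
void wake_handler(void)
{
    int saved_errno = errno;
    char byte = 1;

    if (write(wake_pipe[1], &byte, 1) == -1) {
        /* Pipe full: a wakeup is already pending, nothing to do. */
    }
    errno = saved_errno;
}

/* Mainline: drain the pipe. Returns 1 if at least one wakeup had
 * been posted since the last call, otherwise 0. */
int wake_consume(void)
{
    char buf[64];
    int woke = 0;

    while (read(wake_pipe[0], buf, sizeof buf) > 0)
        woke = 1;
    return woke;
}
```

After wake_consume() returns 1, the mainline would make its RTR calls in its normal context. Several wakeups may collapse into one, so the mainline should process everything that is ready.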
o 14-7-24 Transaction size limits
The number of bytes in any application message (that is, a message
sent with the RTR_SEND_TO_SERVER(), RTR_REPLY_TO_CLIENT() or
RTR_BROADCAST_EVENT() routines) is currently restricted to 64000.
The number of messages sent (that is, using RTR_SEND_TO_SERVER() )
in any single transaction is limited to 65534.
There is no fixed limit on the number of replies (that is, sent with
RTR_REPLY_TO_CLIENT() ) in any single transaction.
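An application that sends large payloads has to split them to stay within these limits. The following is a minimal sketch of the arithmetic only: the macro and function names are hypothetical, the two constants are the limits quoted above, and no RTR calls are made.

```c
#define MAX_MESSAGE_BYTES   64000UL  /* per-message limit quoted above */
#define MAX_MESSAGES_PER_TX 65534UL  /* sends per transaction quoted above */

/* Hypothetical helper: number of messages needed to carry
 * payload_bytes when each message holds at most MAX_MESSAGE_BYTES. */
unsigned long messages_needed(unsigned long payload_bytes)
{
    return payload_bytes / MAX_MESSAGE_BYTES
         + (payload_bytes % MAX_MESSAGE_BYTES != 0);
}

/* Hypothetical helper: returns 1 if the payload can be sent within a
 * single transaction, otherwise 0. */
int fits_in_one_transaction(unsigned long payload_bytes)
{
    return messages_needed(payload_bytes) <= MAX_MESSAGES_PER_TX;
}
```

For example, a payload of 130000 bytes needs three messages: two of 64000 bytes and one of 2000 bytes.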
INSTALLATION NOTES:
The Reliable Transaction Router V3.2 ECO8 installation procedure is the
same as the installation procedure for RTR V3.2. Refer to the
Installation Guide for further information.
This patch can be found at any of these sites:
Colorado Site
Georgia Site
Files on this server are as follows:
rtr320_268_us.README
rtr320_268_us.CHKSUM
rtr320_268_us.CVRLET_TXT
rtr320_268_us.tar