TITLE: RTR V3.2 RTRSUN320_255 Reliable Transaction Router SUN UNIX ECO Summary
Copyright (c) Compaq Computer Corporation 1999. All rights reserved.
Modification Date: 04-Jan-2000
Modification Type: New Kit
PRODUCT: RTR V3.2 Sun Solaris (Reliable Transaction Router)
OP/SYS: UNIX for SUN
SOURCE: Compaq Computer Corporation
ECO INFORMATION:
ECO Kit Name: RTRSUN320_255
ECO Kits Superseded by This ECO Kit: None
ECO Kit Approximate Size: 16969 Blocks (8688128 Bytes)
Kit Applies To: RTR V3.2
SUN UNIX V2.5, V2.5.1, V7.0
System/Cluster Reboot Necessary: No
ECO KIT SUMMARY:
An ECO kit exists for Reliable Transaction Router on UNIX for SUN
V2.5 through V7.0. This kit addresses the following problems:
Problems Addressed in the RTRSUN320_255 Kit:
The following changes and corrections have been made for RTR V3.2, ECO4
for all platforms.
o 14-1-743 Wrong return status RTR_STS_COMSTAUNO
In RTR V3.2 it was sometimes possible for a transaction which had
not yet been voted on by a server which exits in mid-transaction to
be aborted with incorrect status RTR_STS_COMSTAUNO.
o 14-1-805 Attempt to create a partition that already exists returns
incorrect status
An attempt to create a partition that already existed used to return
the error KRINUSE (key range in use). This has been superseded by
the more explicit PRTALREXI (partition already exists).
o 14-1-813 MONITOR SYSTEM shows WARNING on calls due to invalid time call
The MONITOR SYSTEM monitor picture would sometimes incorrectly
display a warning state for the CALL row.
o 14-1-841 Replayed shadow transaction stuck in VOTED
The implementation of RTR's cooperative recovery protocol algorithm
has been enhanced so that some situations which would previously
hang during permanent network link outages are now recovered
correctly using the remaining connections.
o 14-1-842 Transactions which do not specify a timeout abort
If a frontend failed over to another router, then failed back to the
original router, transactions which were in progress could sometimes
be rejected with status RTR_STS_TXTIMOUT.
o 14-1-845 Transactions remain in VOTED state
In earlier versions of RTR it was occasionally possible for
transactions to remain hanging in VOTED state on a shadow primary
backend and be aborted with status RTR_STS_FELINLOS on the secondary
backend after network link failures in a slow WAN environment.
o 14-1-846 Transactions remain in SENDING state
In previous versions of RTR it was occasionally possible for a
transaction to remain hanging in SENDING state on a backend after a
network partition had forced the backend to lose quorum.
o 14-5-111 Certain RTR commands now recorded in the RTR operator log
Operator log files created by previous versions of RTR could
sometimes be difficult to interpret. By recording certain RTR
commands, such as START RTR and CREATE FACILITY, the RTR log file
has become easier to interpret.
o 14-5-156 Logging partition state transitions insufficient
Previous versions of RTR did not report backend partition state
transitions in the operator log file with sufficient detail.
Backend partition state transitions are now reported as follows:
- Previously unlogged state transitions are recorded in the
operator log with the new PRTSTATRA message.
- The PRTBEGIN message is no longer generated.
- The PRTCREATED and PRTEND message formats have been changed to
match that of the PRTSTATRA messages.
o 14-8-267 V3.1D-to-V3.2 Journal incompatibility corrected
If you upgraded from V3.1D to earlier versions of V3.2, it was
possible to encounter situations which caused the RTR ACP to crash.
o 14-8-287 Named partition state change caused crash
When using the CREATE PARTITION command, it was possible for RTR to
crash on the backend node if the last channel using the partition was
closed at the same instant that a state-change message from the
router was pending.
The ECO1 kit fixed the following problems:
o 14-8-268 ACP crash after death of concurrent server
Under rare circumstances, after the death of a concurrent server,
RTR would try to reschedule a transaction in commit state resulting
in an RTR ACP crash. This bug was present in RTR V3.2 and V3.2 ECO1.
o 14-1-50 Looping RTR process for empty node string, e.g., /NODE=dna.
Specifying an incomplete node specification, such as one with only
the protocol prefix, e.g., "RTR SHOW RTR /NODE=dna." could cause the
RTR process to loop, consuming CPU.
o 14-1-433 Show transactions not recovered on link break/reconnect
If a secondary shadow backend lost its link to the RTR router after
the router had sent a vote request, and the server on the primary
shadow accepted the transaction, then in unusual circumstances it was
possible that the transaction would not be immediately recovered on
the secondary shadow after the link to the router was re-established.
In such cases it required a cycle of the servers on the secondary
site for the remembered transaction to be recovered from the primary
shadow journal.
o 14-1-582 ACP access violation
If a number of concurrent servers died in sequence while processing
the same transaction, then under rare circumstances it was possible
the ACP could also abort. This was due to a counter being
incremented incorrectly and has now been fixed.
o 14-1-617 Problems with DUMP JOURNAL
In previous versions of RTR, qualifiers which required a value did
not generate an error if the value was not supplied or was supplied
incorrectly. Incorrect or missing values now generate an error
message. If a string of fewer than five characters was passed for
partition record class, the partition record counter was not updated
and the record was not available. These problems have been fixed by
comparing each character instead of five characters at a time.
o 14-1-760 ACP crashed when modifying journal size
After a journal had been modified, the Flow Control subsystem of RTR
was not properly updated with the new size. This could result in a
hang or crash situation even though the journal size was increased
to accommodate increased traffic.
o 14-1-763 RTR_CLOSE_CHANNEL fails for distributed transaction
Calling RTR_CLOSE_CHANNEL while a distributed transaction was
pending caused an incorrect status to be returned.
o 14-1-772 CALL CLOSE_CHANNEL defaults to IMMEDIATE
The flag RTR_F_CLO_IMMEDIATE is a new flag added in RTR V3.2 that
allows the caller to close a server channel without acknowledging
the transaction on the channel. By default, the flag is not set
when calling the RTR_CLOSE_CHANNEL API. However, the /IMMEDIATE
qualifier is implicitly present in the RTR CLI version of the API
(RTR call RTR_CLOSE_CHANNEL).
Because this is incompatible with the behavior of previous versions
of RTR, functionality has been restored to the same as before V3.2.
When using the CLI version of the API (RTR call RTR_CLOSE_CHANNEL),
/NOIMMEDIATE is now the default.
o 14-1-774 TOOMANCHA and distributed transaction left open after
RTR_OPEN_CHANNEL failure
If RTR_OPEN_CHANNEL failed after the RTR ACP had been stopped, then
that channel remained unavailable for a subsequent open. The
application could eventually run out of channels and return
RTR_STS_TOOMANCHA.
Now if RTR_OPEN_CHANNEL fails after a distributed transaction has
been opened, the distributed transaction is always closed.
o 14-1-777 Transaction state is not getting EXCEPTION after issuing
RTR_CLOSE/IMME
SET PARTITION /RECOVERY_RETRY_COUNT is new functionality implemented
in RTR V3.2. The scope of this command was not fully documented, and
is clarified here.
If an application server dies while processing a transaction
recovered from the RTR journal, then RTR will present the transaction
to another (concurrent or standby) server. The recovery retry count
set with /RECOVERY_RETRY_COUNT indicates the maximum number of times
the transaction should be presented to a server for recovery before
being written to the journal as an exception.
There are two types of recovery operations where transactions are
recovered from journal: local recovery and shadow recovery. Shadow
recovery is the process of recovering the remembered transactions
written to a primary shadow journal while the secondary shadow site
is down.
The SET PARTITION /RECOVERY_RETRY_COUNT parameter does not have an
effect on remembered transactions recovered during shadow recovery.
That is, if there is a killer transaction remembered in the journal
on a primary shadow node, on this node RTR does not count the number
of times the transaction is recovered by a recovering secondary
shadow node. The way to ensure that a remembered transaction will
be exceptioned by RTR is by starting a sufficient number of
concurrent servers on the recovering secondary shadow node.
For this reason, it is recommended that the number of concurrent
secondary shadow servers started be greater than the recovery retry
count set on the partition. This will ensure that a remembered
(killer) transaction being recovered from a primary shadow journal
will be exceptioned if the retry count is exceeded.
Only those transactions that have reached voting stage on a server
can be exceptioned. If a server always dies before voting on a
transaction, then the transaction will be aborted by RTR after the
third try. This is a hard-coded limit (the so-called "three strikes
and you're out" feature).
o 14-1-791 Backends incorrectly remain inquorate after routers trimmed
In versions V3.1D - ECO14 and V3.2 of RTR it was sometimes possible
for nodes to erroneously remain inquorate following a TRIM FACILITY
operation.
o 14-1-792 Revised rtrreq.c and rtrsrv.c sample RTR applications
The sample client and server used in the IVP have been extensively
revised. Please pay special attention to the comments which explain
how to write a wakeup handler, and comments drawing attention to
several common programming mistakes we have seen in RTR applications.
o 14-3-291 SHOW SERVER truncates shd_rec_icpl to shd_rec_ic
Some of the values previously truncated by the brief SHOW SERVER
command are now displayed more fully.
o 14-3-298 Application may crash if invoked before RTR after a reboot
Normally the RTR executable must have been invoked at least once
since reboot before an RTR application can be started. If an RTR
application is invoked first, the first RTR API call now always
returns RTRNOTSTA, RTR not started.
o 4-7-420 IOS tid on IP only nodes is not unique
Using previous versions of RTR, if you ran client applications that
used the RTR V2 API on systems that had DECnet disabled, then there
was a remote possibility that the same transaction identifier could
be generated on two such systems if RTR was started on both systems
within milliseconds of each other.
o 14-8-215 Faster loading of large journals on first CREATE FACILITY
RTR now takes much less time to load journals containing a large
number of journaled transactions.
o 14-8-257 The broadcast message was not delivered from BE to client
If a frontend lost the connection to its original router, and was
the first frontend to connect to the router it failed over to, then
the frontend could stop receiving broadcasts. Further, backends could
also fail to receive broadcasts delivered by routers added to a
facility after the server applications had started.
o 14-8-262 RTR has both backends as primary for some transactions
(STR#1885690)
In a partitioned network situation (when each of two routers has
access to only half of the backend nodes), RTR will choose the
router with the lower network address as the one that remains or
becomes active. In previous versions of RTR, this would sometimes
result in both sets of backends becoming active, due to a problem
with the network ID comparison algorithm.
o 14-3-190 Signals blocked in unthreaded UNIX applications during RTR
API calls
RTR now enables the usual termination signals during RTR API
calls. For example, an idle server RTR application waiting in
RTR_RECEIVE_MESSAGE with no timeout will now respond to Control-C.
o 14-3-300 Terminated RTR application process that used fork is still
shown by RTR
RTR applications now have FD_CLOEXEC set for the IPC sockets used to
communicate with the RTR ACP, so that these do not remain open in a
child process after fork and exec even after the parent process has
terminated.
This means that the RTR ACP now notices when the parent exits, and
will not accumulate a wait queue of broadcast messages or delay
failover. The terminated process no longer appears in RTR SHOW
PROCESS.
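As an illustration of the close-on-exec mechanism described above, the following is a minimal sketch of how an application might set FD_CLOEXEC on one of its own descriptors using the POSIX fcntl() call. The helper names set_cloexec() and is_cloexec() are hypothetical and are not part of RTR. RTR now does the equivalent internally for its IPC sockets, so this is only relevant to descriptors the application opens itself.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: mark fd close-on-exec so that it is not
   inherited across exec. Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1 ? -1 : 0;
}

/* Hypothetical helper: returns 1 if FD_CLOEXEC is set on fd,
   0 if it is not, and -1 if fd cannot be queried. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

The existing flags are read first and OR-ed with FD_CLOEXEC so that any other descriptor flags are preserved.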
o 14-7-952 BADROWCOL and escape sequences visible on dumb or unknown
terminal
The default VT100-style terminal escape sequences can now be
completely suppressed with a suitable TERMCAP environment variable
setting. It is still necessary to set a non-zero window size to
avoid BADROWCOL, for example:
stty rows 48 cols 120
TERM=dumb
TERMCAP="dumb:cm=:do=:le=:nd=:up=:ks=:ke=:cl=:ce=:ho=:mb=:md=:mr=:us=:ue=:me=:cr=:bl=:"
This is particularly useful when running RTR in an Emacs shell
window, and gives reasonably clean output for all RTR commands
except MONITOR.
Known Problems with Workarounds
-------------------------------
The following restrictions were described in previous release notes and
are still applicable to RTR.
o 14-1-39 Declaring exit handlers in RTR applications
If an exit handler contains calls to RTR, then the exit handler must
be declared after the first call to RTR.
Using the RTR V2 or V3 API, if the exit handler is declared before
the first call to RTR, then any call to RTR made within the exit
handler will return an error. Under the V3 API, the error status
returned is RTR_STS_INVCHANNEL. Under the V2 API, the error status
returned is RTR$_INVALCH.
o 14-1-103 Using RTR_SET_WAKEUP() in a threaded program
After calling RTR_SET_WAKEUP() in a threaded program, you should
also call RTR_SET_WAKEUP(NULL) wherever your program can exit. This
will prevent any wakeup in other threads while the main thread is
already running the RTR exit handler, which could lead to a server
core dump when trying to stop the server.
o 14-1-263 Non-English character sets are not supported for identifiers
The supported character set for RTR identifiers such as facility
names is ASCII, with lowercase and uppercase letters equivalent.
Eight-bit characters are not supported because the name might not
interoperate with RTR processes using a different locale or running
another RTR version.
o 14-1-419 SPUJOUFIL advice to CREATE JOURNAL/SUPERSEDE is dangerous
If the operator copies journal files or copies disks containing
journal files without first remounting the source disk read-only,
then these are spurious because RTR sees duplicates that it did not
create. RTR then displays the SPUJOUFIL message, which advises the
operator to use CREATE JOURNAL/SUPERSEDE to destroy the original and
all copies of the journal files, and all the transactions contained
in them on that node, and then submit an SPR for something that is
not in fact an RTR problem.
This is not the correct action in situations like this.
The operator should examine the log file, which shows the duplicate
filenames, and then move any unwanted duplicate copies of journal
files to anywhere other than a RTRJNL directory at the top level of
a writable disk file system visible to RTR, and then try again.
Only if SPUJOUFIL is caused by circumstances other than operator
intervention should the operator consider making backup copies of
the journal files, and only then abandoning the existing journal
files and any transactions contained in them by using DELETE JOURNAL
and CREATE JOURNAL, or the equivalent CREATE JOURNAL/SUPERSEDE.
o 14-1-455 Last line of batch procedure sometimes ignored
The last line of a batch procedure or command file must explicitly
end with a line terminator, added by pressing the Enter/Return key
when creating the procedure. Without the explicit line terminator,
RTR ignores the line.
The workaround is to add a comment to the end of the file or to
explicitly add a line terminator to the end of the last line of the
batch procedure.
o 14-1-462 MODIFY JOURNAL with list of devices does not give
individual error messages
MODIFY JOURNAL now lists all devices that were successfully
modified. However, if some disk devices cannot be modified because
they do not contain a journal file, nothing is reported for those
devices.
The workaround is to identify the omitted devices by comparing the
command parameters and the messages, or modify them one device at a
time. Verify the modification with SHOW JOURNAL /FILES /FULL.
o 14-1-604 Combined local and remote SET MODE/GROUP
The following combined local and remote command does not perform the
remote part of the command:
RTR> SET MODE/GROUP=newgroup/NODE=(THISNODE,ANOTHERNODE)
After recording a different new group or no group setting in shared
memory the comserver has to disconnect itself, and does so
immediately. When the comserver is disconnected the SRVDISCON
message is shown, and the remote part of this command, together with
any other pending commands such as those issued by the same user in
other windows, is aborted immediately.
A workaround is to issue the local SET MODE/GROUP command separately
first.
o 14-1-681 ACP crash in RDM subsystem
It has been observed that with a large number of concurrently active
transactions, where each transaction sends back a large number of
very large replies, it is possible to exhaust the virtual memory
that RTR needs to store the replies for possible recovery after a
link glitch. This would cause RTR to crash on a backend node. The
workaround is to reduce the number of concurrent
server channels, so as to limit the number of concurrently active
transactions on each backend. Another possibility would be to limit
the number of replies per transaction. In our tests we were able to
exceed the RTR limits using 10 servers, each sending back 200
replies of 64KB each.
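The arithmetic behind the test figures above can be sketched as follows. This is a rough estimate of reply payload only, assuming one active transaction per server channel and taking 64KB as 65536 bytes. The function name is hypothetical, and any internal RTR overhead per reply is not known from these notes and is not included.

```c
#include <stdint.h>

/* Hypothetical estimate: bytes of reply payload retained if every
   active transaction keeps all of its replies for possible recovery. */
uint64_t reply_memory_bytes(uint64_t active_transactions,
                            uint64_t replies_per_transaction,
                            uint64_t bytes_per_reply)
{
    return active_transactions * replies_per_transaction * bytes_per_reply;
}
```

For the case quoted above, reply_memory_bytes(10, 200, 65536) gives 131072000 bytes, or 125 MB of reply data. Halving either the number of servers or the replies per transaction halves the estimate, which is why both workarounds help.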
o 14-1-769 System time must not be set backwards
Correct operation of RTR is not guaranteed if the time is not
monotonic.
When performing Year 2000 testing, stop all RTR processes and remove
any journaled transactions before setting the clock back.
When configuring the Network Time Protocol daemon ntp, do not
select, or accept by default, any options which may result in the
system clock being set backwards.
While RTR is running as a service on NT, do not allow the time to be
set backwards by ntp or in login scripts.
On OpenVMS systems, do not change to a different time zone, or to or
from Summer Time or Daylight Saving Time, while using current
versions of RTR which are built on OpenVMS V6 to run on both OpenVMS
V6 and V7. On most UNIX systems it is in any case not recommended
that you change the date and time when running in multi-user mode.
Adjustments are normally made by speeding up or slowing down the
clock.
o 14-3-33 Partition limit per facility now 500
The previous releases supported only up to 100 partitions per
facility. The current release increases this to 500 but is not
extendable beyond that.
o 14-3-50 Maximum number of application processes limit
An ACP crash that occurred when starting the last of a great many
applications has been corrected.
When the process open file limit is reached, the application will
now generally report ACPNOTVIA, "RTR ACP is no longer a viable
entity, restart RTR". In actual fact the ACP continues to operate
with all previously connected processes, and only the new rejected
process thinks that the RTR ACP is not alive. This message should
be interpreted as "ACPINSRES, the RTR ACP has insufficient resources".
Please ensure that your system is configured with sufficient default
per-process resources, or that the ACP process is started with
increased resource limits. Allow at least one open file for each
additional application process, and at least one for each link.
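The sizing rule above can be checked with a short sketch such as the following, which compares a lower bound on the descriptors needed with the soft open-file limit reported by the POSIX getrlimit() call. The function names are hypothetical. The check reads the limit of the calling process, so it is only meaningful when run in the same environment that starts the RTR ACP, and the ACP is assumed to need further descriptors of its own beyond this lower bound.

```c
#include <sys/resource.h>

/* Hypothetical lower bound from the guidance above: one open file
   per application process plus one per link. */
long min_descriptors(long app_processes, long links)
{
    return app_processes + links;
}

/* Returns 1 if the current soft RLIMIT_NOFILE covers `needed`
   descriptors, 0 if it does not, -1 if the limit cannot be read. */
int nofile_limit_covers(long needed)
{
    struct rlimit rl;

    if (needed <= 0)
        return 1;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;
    return (rlim_t)needed <= rl.rlim_cur ? 1 : 0;
}
```

For example, 100 application processes and 8 links give a lower bound of 108 descriptors. If nofile_limit_covers(108) returns 0, the limit should be raised before RTR is started.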
o 14-3-207 Client application in questionable flow control when RTR
journal fills
There is a known issue with flow control when the journal starts
filling up. There is a race condition where, if the client can send
more data than can be placed in the journal before flow control
kicks in, then the transaction is aborted with the correct error
notification. However, if flow control kicks in first, then a
deadlock occurs where the journal space never frees up and hence RTR
does not allow the client to proceed with the transaction. There
are two workarounds: either specify a timeout with the transaction,
or increase the size of the journal.
o 14-3-253 Restrictions on the RTR wakeup handler
The use of RTR_REPLY_TO_CLIENT, RTR_SEND_TO_SERVER, or
RTR_BROADCAST_EVENT in an RTR wakeup handler is not recommended.
They may block when they need transaction ids or flow control. This
will cause undesired behavior.
Functions permitted in an RTR_SET_WAKEUP() handler:
- In an RTR wakeup handler in an AST in an unthreaded OpenVMS
application, the use of RTR_REPLY_TO_CLIENT(), RTR_SEND_TO_SERVER(),
RTR_BROADCAST_EVENT(), or RTR_RECEIVE_MESSAGE() with a non-zero
timeout is not recommended. They may block when they need
transaction ids or flow control, which will cause the whole
application to hang until the wakeup completes.
- The same rules apply in an RTR wakeup handler in a threaded
application. Note that wakeups are unnecessary in a threaded
paradigm, but they may be used in common code in applications
that also need to run on OpenVMS. Please note that your
mainline code continues to run while your wakeup is executing,
so extra synchronization may be required. Also note that if
the wakeup does block then it does not generally hang the whole
application.
- In an RTR wakeup handler in a signal in an unthreaded UNIX
application, no RTR API functions and only the very few
async-signal-safe system and library functions may be called, because
the wakeup is performed in a signal handler context. An
application can write to a pipe or access a volatile
sig_atomic_t variable, but using malloc() and printf(), for
example, will cause unexpected failures. Alternatively, on
most UNIX platforms, you can compile and link the application
as a threaded application with the reentrant RTR shared library
-lrtr_r.
For maximum portability, the wakeup handler should do the minimum
necessary to wake up the mainline event loop. You should assume
that mainline code and other threads might continue to run in
parallel with the wakeup, especially on machines with more than one
CPU.
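One common way to keep a wakeup handler minimal in an unthreaded UNIX application is the self-pipe technique, sketched below. The handler writes a single byte to a non-blocking pipe, and the mainline event loop waits on the read end with select() or poll() and makes its RTR calls outside signal context. All function names here are hypothetical, and registering on_wakeup() with RTR_SET_WAKEUP() is not shown. This is an illustration of the guidance above, not code taken from the RTR sample applications.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int wake_pipe[2] = { -1, -1 };

/* Make one descriptor non-blocking; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Create the pipe once, before the wakeup handler is registered. */
int wakeup_init(void)
{
    if (pipe(wake_pipe) != 0)
        return -1;
    if (set_nonblocking(wake_pipe[0]) != 0 ||
        set_nonblocking(wake_pipe[1]) != 0)
        return -1;
    return 0;
}

/* Wakeup handler body: uses only write(), which is async-signal-safe,
   and preserves errno for the interrupted mainline code. */
void on_wakeup(void)
{
    int saved_errno = errno;
    char byte = 1;
    ssize_t n = write(wake_pipe[1], &byte, 1);
    (void)n;  /* if the pipe is full, a wakeup is already queued */
    errno = saved_errno;
}

/* Descriptor for the mainline to pass to select() or poll(). */
int wakeup_fd(void)
{
    return wake_pipe[0];
}

/* Mainline: drain queued bytes; returns 1 if any wakeup was pending. */
int wakeup_consume(void)
{
    char buf[64];
    int got = 0;

    while (read(wake_pipe[0], buf, sizeof buf) > 0)
        got = 1;
    return got;
}
```

Several wakeups that arrive before the mainline runs are coalesced into one, so the mainline should poll for all outstanding work each time wakeup_consume() returns 1.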
o 14-7-24 Transaction size limits
The number of bytes in any application message (that is, a message
sent with the RTR_SEND_TO_SERVER(), RTR_REPLY_TO_CLIENT() or
RTR_BROADCAST_EVENT() routines) is currently restricted to 64000.
The number of messages sent (that is, using RTR_SEND_TO_SERVER() )
in any single transaction is limited to 65534.
There is no fixed limit on the number of replies (that is, sent with
RTR_REPLY_TO_CLIENT() ) in any single transaction.
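An application that must send a large payload within one transaction can check it against these limits before starting. The following sketch uses the two figures quoted above. The macro and function names are hypothetical, and the code assumes the application splits its payload into messages of at most 64000 bytes, each sent with one RTR_SEND_TO_SERVER() call.

```c
#define MAX_MESSAGE_BYTES 64000ULL  /* per-message limit quoted above */
#define MAX_SENDS_PER_TX  65534ULL  /* per-transaction send limit quoted above */

/* Hypothetical helper: number of sends needed to carry payload_bytes
   in one transaction, or -1 if the payload exceeds the limits.
   An empty payload needs no sends. */
long sends_needed(unsigned long long payload_bytes)
{
    unsigned long long n = payload_bytes / MAX_MESSAGE_BYTES;

    if (payload_bytes % MAX_MESSAGE_BYTES != 0)
        n++;  /* final partial message */
    return n > MAX_SENDS_PER_TX ? -1L : (long)n;
}
```

Under these assumptions the largest payload a client can send in a single transaction is 64000 x 65534 = 4194176000 bytes. A result of -1 means the work has to be split across more than one transaction.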
INSTALLATION NOTES:
The Reliable Transaction Router V3.2 ECO4 installation procedure is the
same as the installation procedure for RTR V3.2. Refer to the
Installation Guide for further information.
This patch can be found at any of these sites:
Colorado Site
Georgia Site
Files on this server are as follows:
rtrsun320_255.README
rtrsun320_255.CHKSUM
rtrsun320_255.CVRLET_TXT
rtrsun320_255.tar