AlphaServer SC patch kit:
=========================

AlphaServer SC 2.5 UK1

Kit Name:     SCV25UK1_RMS_CUMULATIVE_PATCH
Release Date: 012304

Abstract:

This is a cumulative patch of recent RMS fixes.

Description of Patch:
=====================

The kit contains the following fixes:

 1. Fix for resource suspend/resume problems when a partition is blocked.

    Problem Description:
    If the partition blocks while an rcontrol suspend/resume command is
    issued, rinfo does not report that the resource is suspended when the
    partition starts running again. The partition must be stopped and
    started before rinfo reports the correct status.

 2. Fix for rinfo reporting error.

    Problem Description:
    rinfo incorrectly reports a resource as still allocated after a node
    on which its job has been running is rebooted. However, the prun
    command reports that the resource has been deallocated. The jobs
    table entry is updated with a status of 'aborted', but the resource
    table is not updated until after the partition is restarted.

 3. Fix for a bug which can result in acctstats.utime being greater than
    acctstats.atime after a job is killed.

    Problem Description:
    RMS was incrementing acctstats.utime more than once as the job was
    killed.

 4. Fix for RMS resource cleanup bug.

    Problem Description:
    Failure by the RMS resource destructor to handle an exception that is
    generated if the core file directory does not exist.

 5. Fix for problem with queued jobs when a node is configured into a
    partition.

    Problem Description:
    Once the node was configured in, the queued job would move to the
    allocated state but would never reach the running state.

 6. Fix for SIGUSR2 signal not killing RMS jobs.

    Problem Description:
    The signal was being delivered, but it was being ignored by rmsloader
    and not restored to its default behaviour in its children prior to
    exec.

 7. Fix for incorrect value in partitions.cpus when a node is booted
    using 3 cpus.

    Problem Description:
    When a node is booted using 3 instead of 4 cpus, the partitions.cpus
    field incorrectly counts all cpus after the node is configured in.
    The partitions.free_cpus field holds the correct value. The
    partitions.cpus value is corrected when the node is configured out
    and in again.

 8. Fix to ensure that the RMS API function rms_resourceId() now returns
    the correct id when multiple resource requests are made with the same
    batchid.

    Problem Description:
    rms_resourceId() would incorrectly return the first of a cached list
    of resource ids for the same batchid.

 9. Fix for a bug relating to the calculation of acctstats.cpus and
    acctstats.utime during the deallocation of a resource after one of
    the nodes running a job is configured out.

    Problem Description:
    If reconnect is still set when deallocating the resource, or if the
    resource is being deallocated because a node has been configured out,
    then acctstats should be updated with the last good set of stats.

10. Fix for a bug in the calculation of acctstats.cpus.

    Problem Description:
    Starting a number of jobs in an allocated resource could result in an
    incorrect value for the acctstats.cpus field.

    Related Documentation Issue:
    On a related issue, the description of the acctstats.cpus field is
    incorrect in the V2.5 SC Admin Guide and the V2.5 RMS Reference
    Manual. The correct description for acctstats.cpus is: 'The number of
    CPUs used by the resource'. The number of CPUs allocated to a
    resource is stored in the resources.ncpus field.

11. Support for new RMS attribute: memsplit-enabled.

    Problem Description:
    RMS was failing to schedule jobs despite the fact that nodes with
    sufficient free memory were available. The pmanager was not
    restricting its check for the requested amount of memory to the
    requested set of nodes.

    Note: this functionality is not enabled by the installation. For
    further details see the README located in
    ftp://ftp.ilo.cpqcorp.net/pub/sierra/patches/V2.5/UK1/2-1153/

12.
    Fix for a bug in the check of whether envmon is supported.

    Problem Description:
    This could result in the rmsd repeatedly crashing if running on a
    node that does not support envmon.

This kit also contains earlier fixes which have already been listed as
Recommended Patches for V2.5 UK1 in SC Customer Bulletin #10:

 1. Fix for a bug in RMS support for CAA failover on a clustered
    management server.

    Problem Description:
    RMS jobs failed to reconnect after CAA failover. rmsd daemons failed
    to identify the new rmshost after CAA failover.

 2. Fix to prevent an rmsd coredump on invalid hostname information from
    clu_get_info.

    Problem Description:
    The rmsd on the last node of a cluster repeatedly crashed and the
    node was eventually configured out of the partition. This repeated
    for all subsequent end-of-cluster nodes.

 3. Correction to CPU usage counts after frequent resource
    suspend/resume.

 4. Support for 'Partition Blocked Timeout'.

 5. More frequent polling of the housekeeper when the partition is
    blocked.

For further information on these last 3 fixes see the README in
ftp://ftp.ilo.cpqcorp.net/pub/sierra/patches/V2.5/UK1/99935/

Kit location:
=============

The patch kit is SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz and it is
available from http://www.irtc.hp.com

Kit checksum:
=============

bash-2.02$ cksum SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz
3240514823 3437207 SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz

Updated files:
==============

/usr/bin/pmanager
/usr/lib/librmscall.a
/usr/opt/rms/bin/pmanager
/usr/opt/rms/lib/librms.so
/usr/opt/rms/lib/librmsapi.so
/usr/opt/rms/lib/librmscall.a
/usr/opt/rms/lib/librmscall.so
/usr/opt/rms/sbin/rmsd
/usr/opt/rms/sbin/rmsloader
/usr/sbin/rmsd
/usr/sbin/rmsloader
/usr/shlib/librms.so
/usr/shlib/librmsapi.so
/usr/shlib/librmscall.so

Dependencies:
=============

This patch should be installed over the RMS kit shipped with UK1.

Instructions:
=============

This patch is provided as a setld installable kit. Unpack it into a
directory that is NFS mounted on all domains, e.g. /usr/kits/, and
install it as follows:

 1. Stop partitions, e.g.:

    # rcontrol stop partition=parallel

 2. Stop RMS on all nodes, e.g.:

    # sra command -domains all -m 1 -command "CluCmd /sbin/init.d/rms stop"

 3. Stop RMS and msql on the management server:

    # /sbin/init.d/rms stop
    # /sbin/init.d/msqld stop

 4. Install on the management server:

    # /usr/sbin/setld -l SCV25UK1_RMS_CUMULATIVE_PATCH

 5. Start RMS and msql on the management server:

    # /sbin/init.d/msqld start
    # /sbin/init.d/rms start

 6. Install across all domains, e.g.:

    # sra command -domains all -m 1 -command "/usr/sbin/setld -l SCV25UK1_RMS_CUMULATIVE_PATCH"

 7. Start RMS on all nodes, e.g.:

    # sra command -domains all -m 1 -command "CluCmd /sbin/init.d/rms start"

 8. Restart the parallel partition:

    # rcontrol start partition=parallel

--------

To remove the patch use the following steps:

 1. Stop partitions, e.g.:

    # rcontrol stop partition=parallel

 2. Stop RMS on all nodes, e.g.:

    # sra command -domains all -m 1 -command "CluCmd /sbin/init.d/rms stop"

 3. Delete across all domains, e.g.:

    # sra command -domains all -m 1 -command "/usr/sbin/setld -d SCV25UK1_RMS_CUMULATIVE_PATCH"

 4. Stop RMS and msql on the management server:

    # /sbin/init.d/rms stop
    # /sbin/init.d/msqld stop

 5. Delete from the management server:

    # /usr/sbin/setld -d SCV25UK1_RMS_CUMULATIVE_PATCH

 6. Start RMS and msql on the management server:

    # /sbin/init.d/msqld start
    # /sbin/init.d/rms start

 7. Start RMS on all nodes, e.g.:

    # sra command -domains all -m 1 -command "CluCmd /sbin/init.d/rms start"

 8. Restart the parallel partition:

    # rcontrol start partition=parallel
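
The checksum check and unpack steps above can be sketched as a small
POSIX shell helper. This is an illustrative convenience only, not part
of the official instructions: the function names verify_kit and
unpack_kit are invented here, and the expected values are the published
cksum output from the "Kit checksum" section. Tru64 UNIX tar has no -z
option, hence the gunzip pipe.

```shell
# verify_kit FILE SUM BYTES
# Succeeds only when `cksum FILE` matches the published checksum and size.
verify_kit() {
    # cksum prints: <sum> <bytes> <file>; append the expected values so the
    # positional parameters become: sum bytes file expected_sum expected_bytes
    set -- $(cksum "$1") "$2" "$3"
    [ "$1" = "$4" ] && [ "$2" = "$5" ]
}

# unpack_kit FILE DESTDIR
# Unpack a .tar.gz kit into DESTDIR (which should be NFS mounted on all
# domains) by piping through gunzip, since Tru64 tar lacks a -z option.
unpack_kit() {
    gunzip -c "$1" | ( cd "$2" && tar xf - )
}
```

With the published values for this kit, a run might look like:

    verify_kit SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz 3240514823 3437207 \
        && unpack_kit SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz /usr/kits \
        || echo "checksum mismatch - download the kit again" >&2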