AlphaServer SC patch kit:
=========================

AlphaServer SC 2.5 UK1

Kit Name:     SCV25UK1_RMS_CUMULATIVE_PATCH
Release Date: 012304

Abstract:

This is a cumulative patch of recent RMS fixes.

Description of Patch:
=====================

The kit contains the following fixes:

 1. Fix for resource suspend/resume problems when a partition is blocked.

    Problem Description:
    If the partition blocks while an rcontrol suspend/resume command is
    issued, rinfo does not report that the resource is suspended when the
    partition starts running again. The partition must be stopped and
    started before rinfo reports the correct status.

 2. Fix for rinfo reporting error.

    Problem Description:
    rinfo incorrectly reports a resource as still allocated after a node
    on which its job has been running is rebooted. However, the prun
    command reports that the resource has been deallocated. The jobs
    table entry is updated with a status of 'aborted', but the resource
    table is not updated until after the partition is restarted.

 3. Fix for a bug which can result in acctstats.utime being greater than
    acctstats.atime after a job is killed.

    Problem Description:
    RMS was incrementing acctstats.utime more than once as the job was
    killed.

 4. Fix for RMS resource cleanup bug.

    Problem Description:
    Failure by the RMS resource destructor to handle an exception that is
    generated if the core file directory does not exist.

 5. Fix for problem with queued jobs when a node is configured into a
    partition.

    Problem Description:
    Once the node was configured in, the queued job would move to the
    allocated state but would never reach the running state.

 6. Fix for SIGUSR2 signal not killing RMS jobs.

    Problem Description:
    The signal was being delivered, but it was being ignored by rmsloader
    and not restored to its default behaviour in its children prior to
    exec.

 7. Fix for incorrect value in partitions.cpus when a node is booted
    using 3 cpus.

    Problem Description:
    When a node is booted using 3 instead of 4 cpus, the partitions.cpus
    field incorrectly counts all cpus after the node is configured in.
    The partitions.free_cpus field holds the correct value. The
    partitions.cpus value is corrected when the node is configured out
    and in again.

 8. Fix to ensure that the RMS API function rms_resourceId() now returns
    the correct id when multiple resource requests are made with the same
    batchid.

    Problem Description:
    rms_resourceId() would incorrectly return the first of a cached list
    of resource ids for the same batchid.

 9. Fix for a bug relating to the calculation of acctstats.cpus and
    acctstats.utime during the deallocation of a resource after one of
    the nodes running a job is configured out.

    Problem Description:
    If reconnect is still set when deallocating the resource, or if the
    resource is being deallocated because a node has been configured out,
    then acctstats should be updated with the last good set of stats.

10. Fix for a bug in the calculation of acctstats.cpus.

    Problem Description:
    Starting a number of jobs in an allocated resource could result in an
    incorrect value for the acctstats.cpus field.

    Related Documentation Issue:
    On a related issue, the description of the acctstats.cpus field is
    incorrect in the V2.5 SC Admin Guide and the V2.5 RMS Reference
    Manual. The correct description for acctstats.cpus is: 'The number of
    CPUs used by the resource'. The number of CPUs allocated to a
    resource is stored in the resources.ncpus field.

11. Support for new RMS attribute: memsplit-enabled.

    Problem Description:
    RMS was failing to schedule jobs despite the fact that nodes with
    sufficient free memory were available. The pmanager was not
    restricting its check for the requested amount of memory to the
    requested set of nodes.

    Note: this functionality is not enabled by the installation. For
    further details see the README located in
    ftp://ftp.ilo.cpqcorp.net/pub/sierra/patches/V2.5/UK1/2-1153/

12.
    Fix for a bug in the check of whether envmon is supported.

    Problem Description:
    This could result in the rmsd repeatedly crashing if running on a
    node that does not support envmon.

This kit also contains earlier fixes which have already been listed as
Recommended Patches for V2.5 UK1 in SC Customer Bulletin #10:

 1. Fix for a bug in RMS support for CAA failover on a clustered
    management server.

    Problem Description:
    RMS jobs failed to reconnect after CAA failover. rmsd daemons failed
    to identify the new rmshost after CAA failover.

 2. Fix to prevent an rmsd coredump on invalid hostname information from
    clu_get_info.

    Problem Description:
    The rmsd on the last node of a cluster repeatedly crashed and the
    node was eventually configured out of the partition. This repeated
    for all subsequent end-of-cluster nodes.

 3. Correction to CPU usage counts after frequent resource
    suspend/resume.

 4. Support for 'Partition Blocked Timeout'.

 5. More frequent polling of the housekeeper when the partition is
    blocked.

For further information on these last 3 fixes see the README in
ftp://ftp.ilo.cpqcorp.net/pub/sierra/patches/V2.5/UK1/99935/

Kit location:
=============

The patch kit is SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz and it is
available from http://www.irtc.hp.com

Kit checksum:
=============

bash-2.02$ cksum SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz
3240514823 3437207 SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz

Updated files:
==============

/usr/bin/pmanager
/usr/lib/librmscall.a
/usr/opt/rms/bin/pmanager
/usr/opt/rms/lib/librms.so
/usr/opt/rms/lib/librmsapi.so
/usr/opt/rms/lib/librmscall.a
/usr/opt/rms/lib/librmscall.so
/usr/opt/rms/sbin/rmsd
/usr/opt/rms/sbin/rmsloader
/usr/sbin/rmsd
/usr/sbin/rmsloader
/usr/shlib/librms.so
/usr/shlib/librmsapi.so
/usr/shlib/librmscall.so

Dependencies:
=============

This patch should be installed over the RMS kit shipped with UK1.

Instructions:
=============

This patch is provided as a setld installable kit. Unpack it into a
directory that is NFS mounted on all domains, e.g. /usr/kits/, and
install it as follows:

 1. Stop partitions, e.g.:

    # rcontrol stop partition=parallel

 2. Stop RMS on all nodes, e.g.:

    # sra command -domains all -m 1 -command "CluCmd /sbin/init.d/rms stop"

 3. Stop RMS and msql on the management server:

    # /sbin/init.d/rms stop
    # /sbin/init.d/msqld stop

 4. Install on the management server:

    # /usr/sbin/setld -l SCV25UK1_RMS_CUMULATIVE_PATCH

 5. Start RMS and msql on the management server:

    # /sbin/init.d/msqld start
    # /sbin/init.d/rms start

 6. Install across all domains, e.g.:

    # sra command -domains all -m 1 -command "/usr/sbin/setld -l SCV25UK1_RMS_CUMULATIVE_PATCH"

 7. Start RMS on all nodes, e.g.:

    # sra command -domains all -m 1 -command "CluCmd /sbin/init.d/rms start"

 8. Restart the parallel partition:

    # rcontrol start partition=parallel

--------

To remove the patch use the following steps:

 1. Stop partitions, e.g.:

    # rcontrol stop partition=parallel

 2. Stop RMS on all nodes, e.g.:

    # sra command -domains all -m 1 -command "CluCmd /sbin/init.d/rms stop"

 3. Delete across all domains, e.g.:

    # sra command -domains all -m 1 -command "/usr/sbin/setld -d SCV25UK1_RMS_CUMULATIVE_PATCH"

 4. Stop RMS and msql on the management server:

    # /sbin/init.d/rms stop
    # /sbin/init.d/msqld stop

 5. Delete from the management server:

    # /usr/sbin/setld -d SCV25UK1_RMS_CUMULATIVE_PATCH

 6. Start RMS and msql on the management server:

    # /sbin/init.d/msqld start
    # /sbin/init.d/rms start

 7. Start RMS on all nodes, e.g.:

    # sra command -domains all -m 1 -command "CluCmd /sbin/init.d/rms start"

 8. Restart the parallel partition:

    # rcontrol start partition=parallel
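
The checksum check and unpack steps above can be sketched as a small
POSIX shell helper. This is an illustrative convenience only, not part
of the official instructions: the function names verify_kit and
unpack_kit are invented here, and the expected values are the published
cksum output from the "Kit checksum" section. Tru64 UNIX tar has no -z
option, hence the gunzip pipe.

```shell
# verify_kit FILE SUM BYTES
# Succeeds only when `cksum FILE` matches the published checksum and size.
verify_kit() {
    # cksum prints: <sum> <bytes> <file>; append the expected values so the
    # positional parameters become: sum bytes file expected_sum expected_bytes
    set -- $(cksum "$1") "$2" "$3"
    [ "$1" = "$4" ] && [ "$2" = "$5" ]
}

# unpack_kit FILE DESTDIR
# Unpack a .tar.gz kit into DESTDIR (which should be NFS mounted on all
# domains) by piping through gunzip, since Tru64 tar lacks a -z option.
unpack_kit() {
    gunzip -c "$1" | ( cd "$2" && tar xf - )
}
```

With the published values for this kit, a run might look like:

    verify_kit SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz 3240514823 3437207 \
        && unpack_kit SCV25UK1_RMS_CUMULATIVE_PATCH.tar.gz /usr/kits \
        || echo "checksum mismatch - download the kit again" >&2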