9    Managing File Systems and Devices

This chapter contains information specific to managing storage devices in a TruCluster Server system. The chapter discusses the following subjects:

You can find other information on device management in the Tru64 UNIX Version 5.1A documentation that is listed in Table 9-1.

Table 9-1:  Sources of Information on Storage Device Management

Topic                                    Tru64 UNIX Manual
Administering devices                    System Administration manual
Administering file systems               System Administration manual
Administering the archiving services     System Administration manual
Managing AdvFS                           AdvFS Administration manual

For information about Logical Storage Manager (LSM) and clusters, see Chapter 10.

9.1    Working with CDSLs

A context-dependent symbolic link (CDSL) is a link that contains a variable that identifies a cluster member. This variable is resolved at run time into a target.

A CDSL is structured as follows:

/etc/rc.config -> ../cluster/members/{memb}/etc/rc.config

When resolving a CDSL pathname, the kernel replaces the string {memb} with the string membern, where n is the member ID of the current member. For example, on a cluster member whose member ID is 2, the pathname /cluster/members/{memb}/etc/rc.config resolves to /cluster/members/member2/etc/rc.config.

CDSLs provide a way for a single file name to point to one of several files. Clusters use this to allow member-specific files that can be addressed throughout the cluster by a single file name. System data and configuration files tend to be CDSLs. They are found in the root (/), /usr, and /var directories.

9.1.1    Making CDSLs

The mkcdsl command provides a simple tool for creating and populating CDSLs. For example, to make a new CDSL for the file /usr/accounts/usage-history, enter the following command:

# mkcdsl /usr/accounts/usage-history
 

When you list the results, you see the following output:

# ls -l /usr/accounts/usage-history
 
... /usr/accounts/usage-history -> cluster/members/{memb}/accounts/usage-history

The CDSL usage-history is created in /usr/accounts. No files are created in any member's /usr/cluster/members/{memb} directory.

To move a file into a CDSL, enter the following command:

# mkcdsl -c targetname
 

To replace an existing file when using the copy (-c) option, you must also use the force (-f) option.

The -c option copies the source file to the member-specific area on the cluster member where the mkcdsl command executes and then replaces the source file with a CDSL. To copy a source file to the member-specific area on all cluster members and then replace the source file with a CDSL, use the -a option to the command as follows:

# mkcdsl -a filename
 

Remove a CDSL with the rm command, as you would any symbolic link.

The file /var/adm/cdsl_admin.inv stores a record of the cluster's CDSLs. When you use mkcdsl to add CDSLs, the command updates /var/adm/cdsl_admin.inv. If you use the ln -s command to create CDSLs, /var/adm/cdsl_admin.inv is not updated.

To update /var/adm/cdsl_admin.inv, enter the following:

# mkcdsl -i targetname

Update the inventory when you remove a CDSL, or if you use the ln -s command to create a CDSL.
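
For example, the following hypothetical sequence creates a CDSL for a file named /etc/myapp.conf with the ln -s command and then records it in the inventory (the file name is an illustration only; it follows the same relative-target form as the /etc/rc.config example above):

# ln -s "../cluster/members/{memb}/etc/myapp.conf" /etc/myapp.conf
# mkcdsl -i /etc/myapp.conf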

For more information, see mkcdsl(8).

9.1.2    Maintaining CDSLs

The following tools can help you maintain CDSLs:

The following example shows the output (and the pointer to a log file containing the errors) when clu_check_config finds a bad or missing CDSL:

# clu_check_config -s check_cdsl_config
Starting Cluster Configuration Check...
check_cdsl_config : Checking installed CDSLs
check_cdsl_config : CDSLs configuration errors : See /var/adm/cdsl_check_list
clu_check_config : detected one or more configuration errors
 

As a general rule, before you move a file, make sure that the destination is not a CDSL. If you do inadvertently overwrite a CDSL, use the mkcdsl -c filename command on the appropriate cluster member to copy the file and re-create the CDSL.
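
For example, if the usage-history CDSL from Section 9.1.1 were accidentally overwritten with a regular file, the recovery might look like the following (add the -f option if a copy already exists in the member-specific area):

# mkcdsl -c /usr/accounts/usage-history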

9.1.3    Kernel Builds and CDSLs

When you build a kernel in a cluster, use the mv command to move the new kernel from /sys/HOSTNAME/vmunix to /cluster/members/membern/boot_partition/vmunix. If you move the kernel to /vmunix, you will overwrite the /vmunix CDSL. The result will be that the next time that cluster member boots, it will use the old vmunix in /sys/HOSTNAME/vmunix.
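
For example, on a hypothetical member whose host name is PEPICELLI and whose member ID is 1, the move might look like the following:

# mv /sys/PEPICELLI/vmunix /cluster/members/member1/boot_partition/vmunix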

9.1.4    Exporting and Mounting CDSLs

CDSLs are intended for use when files of the same name must necessarily have different contents on different cluster members. Because of this, CDSLs are not intended for export.

Mounting CDSLs through the cluster alias is problematic, because the file contents differ depending on which cluster system gets the mount request. However, nothing prevents CDSLs from being exported. If the entire directory is a CDSL, then the node that gets the mount request provides a file handle corresponding to the directory for that node. If a CDSL is contained within an exported clusterwide directory, then the Network File System (NFS) server that gets the request will do the expansion. As with normal symbolic links, the client cannot read the file or directory unless that area is also mounted on the client.

9.2    Managing Devices

Device management in a cluster is similar to that in a standalone system, with the following exceptions:

The rest of this section describes these differences.

9.2.1    Managing the Device Special File

When using dsfmgr, the device special file management utility, in a cluster, keep the following in mind:

For more information, see dsfmgr(8). For information on devices, device naming, and device management, see the chapter on hardware management in the Tru64 UNIX System Administration manual.

9.2.2    Determining Device Locations

The Tru64 UNIX hwmgr command can list all hardware devices in the cluster, including those on private buses, and correlate bus-target-LUN names with /dev/disks/dsk* names. For example:

# hwmgr -view devices -cluster
HWID: Device Name         Mfg     Model            Hostname   Location       
-------------------------------------------------------------------------------
  3: kevm                                         pepicelli
 28: /dev/disk/floppy0c          3.5in floppy     pepicelli  fdi0-unit-0
 40: /dev/disk/dsk0c     DEC     RZ28M    (C) DEC pepicelli  bus-0-targ-0-lun-0
 41: /dev/disk/dsk1c     DEC     RZ28L-AS (C) DEC pepicelli  bus-0-targ-1-lun-0
 42: /dev/disk/dsk2c     DEC     RZ28     (C) DEC pepicelli  bus-0-targ-2-lun-0
 43: /dev/disk/cdrom0c   DEC     RRD46    (C) DEC  pepicelli bus-0-targ-6-lun-0
 44: /dev/disk/dsk3c     DEC     RZ28M    (C) DEC pepicelli  bus-1-targ-1-lun-0
 44: /dev/disk/dsk3c     DEC     RZ28M    (C) DEC polishham  bus-1-targ-1-lun-0
 44: /dev/disk/dsk3c     DEC     RZ28M    (C) DEC provolone  bus-1-targ-1-lun-0
 45: /dev/disk/dsk4c     DEC     RZ28L-AS (C) DEC pepicelli  bus-1-targ-2-lun-0
 45: /dev/disk/dsk4c     DEC     RZ28L-AS (C) DEC polishham  bus-1-targ-2-lun-0
 45: /dev/disk/dsk4c     DEC     RZ28L-AS (C) DEC provolone  bus-1-targ-2-lun-0
 46: /dev/disk/dsk5c     DEC     RZ29B    (C) DEC pepicelli  bus-1-targ-3-lun-0
 46: /dev/disk/dsk5c     DEC     RZ29B    (C) DEC polishham  bus-1-targ-3-lun-0
 46: /dev/disk/dsk5c     DEC     RZ29B    (C) DEC provolone  bus-1-targ-3-lun-0
 47: /dev/disk/dsk6c     DEC     RZ28D    (C) DEC pepicelli  bus-1-targ-4-lun-0
 47: /dev/disk/dsk6c     DEC     RZ28D    (C) DEC polishham  bus-1-targ-4-lun-0
 47: /dev/disk/dsk6c     DEC     RZ28D    (C) DEC provolone  bus-1-targ-4-lun-0
 48: /dev/disk/dsk7c     DEC     RZ28L-AS (C) DEC pepicelli  bus-1-targ-5-lun-0
 48: /dev/disk/dsk7c     DEC     RZ28L-AS (C) DEC polishham  bus-1-targ-5-lun-0
 48: /dev/disk/dsk7c     DEC     RZ28L-AS (C) DEC provolone  bus-1-targ-5-lun-0
 49: /dev/disk/dsk8c     DEC     RZ1CF-CF (C) DEC pepicelli  bus-1-targ-8-lun-0
 49: /dev/disk/dsk8c     DEC     RZ1CF-CF (C) DEC polishham  bus-1-targ-8-lun-0
 49: /dev/disk/dsk8c     DEC     RZ1CF-CF (C) DEC provolone  bus-1-targ-8-lun-0
 50: /dev/disk/dsk9c     DEC     RZ1CB-CS (C) DEC pepicelli  bus-1-targ-9-lun-0
 50: /dev/disk/dsk9c     DEC     RZ1CB-CS (C) DEC polishham  bus-1-targ-9-lun-0
 50: /dev/disk/dsk9c     DEC     RZ1CB-CS (C) DEC provolone  bus-1-targ-9-lun-0
 51: /dev/disk/dsk10c    DEC     RZ1CF-CF (C) DEC pepicelli  bus-1-targ-10-lun-0
 51: /dev/disk/dsk10c    DEC     RZ1CF-CF (C) DEC polishham  bus-1-targ-10-lun-0
 51: /dev/disk/dsk10c    DEC     RZ1CF-CF (C) DEC provolone  bus-1-targ-10-lun-0
 52: /dev/disk/dsk11c    DEC     RZ1CF-CF (C) DEC pepicelli  bus-1-targ-11-lun-0
 52: /dev/disk/dsk11c    DEC     RZ1CF-CF (C) DEC polishham  bus-1-targ-11-lun-0
 52: /dev/disk/dsk11c    DEC     RZ1CF-CF (C) DEC provolone  bus-1-targ-11-lun-0
 53: /dev/disk/dsk12c    DEC     RZ1CF-CF (C) DEC pepicelli  bus-1-targ-12-lun-0
 53: /dev/disk/dsk12c    DEC     RZ1CF-CF (C) DEC polishham  bus-1-targ-12-lun-0
 53: /dev/disk/dsk12c    DEC     RZ1CF-CF (C) DEC provolone  bus-1-targ-12-lun-0
 54: /dev/disk/dsk13c    DEC     RZ1CF-CF (C) DEC pepicelli  bus-1-targ-13-lun-0
 54: /dev/disk/dsk13c    DEC     RZ1CF-CF (C) DEC polishham  bus-1-targ-13-lun-0
 54: /dev/disk/dsk13c    DEC     RZ1CF-CF (C) DEC provolone  bus-1-targ-13-lun-0
 59: kevm                                         polishham  
 88: /dev/disk/floppy1c          3.5in floppy     polishham  fdi0-unit-0
 94: /dev/disk/dsk14c    DEC     RZ26L    (C) DEC polishham  bus-0-targ-0-lun-0
 95: /dev/disk/cdrom1c   DEC     RRD46   (C) DEC  polishham  bus-0-targ-4-lun-0
 96: /dev/disk/dsk15c    DEC     RZ1DF-CB (C) DEC polishham  bus-0-targ-8-lun-0
 99: /dev/kevm                                    provolone    
127: /dev/disk/floppy2c          3.5in floppy     provolone  fdi0-unit-0
134: /dev/disk/dsk16c    DEC     RZ1DF-CB (C) DEC provolone  bus-0-targ-0-lun-0
135: /dev/disk/dsk17c    DEC     RZ1DF-CB (C) DEC provolone  bus-0-targ-1-lun-0
136: /dev/disk/cdrom2c   DEC     RRD47   (C) DEC  provolone  bus-0-targ-4-lun-0

The drdmgr devicename command reports which members serve the device. Disks with multiple servers are on a shared SCSI bus. With very few exceptions, disks that have only one server are private to that server. For details on the exceptions, see Section 9.4.1.

To learn the hardware configuration of a cluster member, enter the following command:

# hwmgr -view hierarchy -member membername
 

If the member is on a shared bus, the command reports devices on the shared bus. The command does not report on devices private to other members.

To get a graphical display of the cluster hardware configuration, including active members, buses, both shared and private storage devices, and their connections, use the sms command to invoke the graphical interface for the SysMan Station, and then select Hardware from the View menu.

Figure 9-1 shows the SysMan Station representation of a two-member cluster.

Figure 9-1:  SysMan Station Display of Hardware Configuration

9.2.3    Adding a Disk to the Cluster

For information on physically installing SCSI hardware devices, see the TruCluster Server Cluster Hardware Configuration manual. After the new disk has been installed, follow these steps:

  1. So that all members recognize the new disk, run the following command on each member:

    # hwmgr -scan comp -cat scsi_bus
    

    Note

    You must run the hwmgr -scan comp -cat scsi_bus command on every cluster member that needs access to the disk.

    Wait a minute or so for all members to register the presence of the new disk.

  2. If the disk that you are adding is an RZ26, RZ28, RZ29, or RZ1CB-CA model, run the following command on each cluster member:

    # /usr/sbin/clu_disk_install
    

    If the cluster has a large number of storage devices, this command can take several minutes to complete.

  3. To learn the name of the new disk, enter the following command:

    # hwmgr -view devices -cluster
    

    You can also run the SysMan Station command and select Hardware from the Views menu to learn the new disk name.

For information about creating file systems on the disk, see Section 9.6.

9.2.4    Managing Third-party Storage

When a cluster member loses quorum, all of its I/O is suspended, and the remaining members erect I/O barriers against nodes that have been removed from the cluster. This I/O barrier operation inhibits non-cluster members from performing I/O with shared storage devices.

The method that is used to create the I/O barrier depends on the types of storage devices that the cluster members share. In certain cases, a Task Management function called a Target_Reset is sent to stop all I/O to and from the former member. This Task Management function is used in either of the following situations:

In either of these situations, there is a delay between the Target_Reset and the clearing of all I/O pending between the device and the former member. The length of this interval depends on the device and the cluster configuration. During this interval, some I/O with the former member might still occur. This I/O, sent after the Target_Reset, completes in a normal way without interference from other nodes.

During an interval that is configurable with the drd_target_reset_wait kernel attribute, the device request dispatcher suspends all new I/O to the shared device. This period allows time to clear the device of any pending I/O that originated with the former member and was sent to the device after it received the Target_Reset. After this interval passes, the I/O barrier is complete.

The default value for drd_target_reset_wait is 30 seconds, which should be sufficient. However, if you have doubts because of third-party devices in your cluster, contact the device manufacturer and ask for the specifications on how long it takes their device to clear I/O after the receipt of a Target_Reset.

You can set drd_target_reset_wait at boot time and run time.
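
For example, the following sketch queries the attribute and raises it to 45 seconds at run time. It assumes that drd_target_reset_wait belongs to the drd kernel subsystem; verify the subsystem name on your system before relying on it. To make the change persist across boots, add a corresponding stanza to /etc/sysconfigtab with the sysconfigdb command.

# sysconfig -q drd drd_target_reset_wait
# sysconfig -r drd drd_target_reset_wait=45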

For more information about quorum loss and system partitioning, see the chapter on the connection manager in the TruCluster Server Cluster Technical Overview.

9.2.5    Tape Devices

You can access a tape device in the cluster from any member, regardless of whether it is located on that member's private bus, on a shared bus, or on another member's private bus.

Certain operations, such as those performed by the mcutil media changer utility, can be performed only on a device that is directly connected to the member where the operation is performed. For this reason, it is advantageous to place a tape device on a shared bus, where multiple members have direct access to the device.

Performance considerations also argue for placing a tape device on a shared bus. A backup is faster when the data does not have to travel over the cluster interconnect to reach the tape drive. For example, in Figure 9-2, the backup of dsk9 and dsk10 to the tape drive requires the data to go over the cluster interconnect. For the backup of any other disk, including the semi-private disks dsk11, dsk12, dsk13, and dsk14, the data transfer rate will be faster.

Figure 9-2:  Cluster with Semi-private Storage

If the tape device is located on the shared bus, applications that access the device must be written to react appropriately to certain events on the shared SCSI bus, such as bus and device resets. Bus and device resets (such as those that result from cluster membership transitions) cause any tape device on the shared SCSI bus to rewind.

When such a reset occurs, a read() or write() call made by a tape server application fails and returns an error in errno. You must explicitly set up the tape server application to retrieve the error information that is returned from its I/O call so that it can reposition the tape. When a read() or write() operation fails, use ioctl() with the MTIOCGET command option to return a structure that contains the error information that the application needs to reposition the tape. For a description of the structure, see /usr/include/sys/mtio.h.

The commonly used utilities tar, cpio, dump, and vdump are not designed in this way, so they may unexpectedly terminate when used on a tape device that resides on a shared bus in a cluster. Currently, the only advantage to situating a tape device on a shared bus in this release is that multiple systems are physically connected to it, and any one of those systems can access it.

9.2.6    Formatting Floppy Disks in a Cluster

TruCluster Server Version 5.1A includes support for read/write UNIX File System (UFS) file systems, as described in Section 9.3.4, so you can format a floppy disk directly in a Version 5.1A cluster.

Versions of TruCluster Server prior to Version 5.1A do not support read/write UFS file systems. Because of this, and because AdvFS metadata overwhelms the capacity of a floppy disk, the typical methods of formatting a floppy cannot be used in a cluster running those versions.

If you must format a floppy disk in a cluster with a version of TruCluster Server prior to Version 5.1A, use the mtools or dxmtools tool sets. For more information, see mtools(1) and dxmtools(1).

9.2.7    CD-ROM and DVD-ROM

CD-ROM drives and DVD-ROM drives are always served devices. This type of drive must be connected to a local bus; it cannot be connected to a shared bus.

For information about managing a CD-ROM File System (CDFS) in a cluster, see Section 9.7.

9.3    Managing the Cluster File System

The Cluster File System (CFS) provides transparent access to files that are located anywhere on the cluster. Users and applications enjoy a single-system image for file access. Access is the same regardless of the cluster member where the access request originates, and where in the cluster the disk containing the file is connected. CFS follows a server/client model, with each file system served by a cluster member. Any cluster member can serve file systems on devices anywhere in the cluster. If the member serving a file system becomes unavailable, the CFS server automatically fails over to an available cluster member.

The primary tool for managing the cluster file system is the cfsmgr command. A number of examples of using the command appear in this section. For more information about the cfsmgr command, see cfsmgr(8).

To gather statistics about the CFS file system, use the cfsstat command or the cfsmgr -statistics command. An example of using cfsstat to get information about direct I/O appears in Section 9.3.3.5. For more information on the command, see cfsstat(8).

For file systems on devices on the shared bus, I/O performance depends on the load on the bus and the load on the member serving the file system. To simplify load balancing, CFS allows you to easily relocate the server to a different member. Access to file systems on devices that are private to a member is faster when the file systems are served by that member.

Use the cfsmgr command to learn which file systems are served by which member. For example, to learn the server of the clusterwide root file system (/), enter the following command:

# cfsmgr /
 
 Domain or filesystem name = /
 Server Name = systemb
 Server Status : OK
 

To move the CFS server to a different member, enter the following cfsmgr command to change the value of the SERVER attribute:

# cfsmgr -a server=systema /
# cfsmgr /
 
 Domain or filesystem name = /
 Server Name = systema
 Server Status : OK

Although you can relocate the CFS server of the clusterwide root, you cannot relocate the member root domain to a different member. A member always serves its own member root domain, rootmemberID_domain#root.

When a cluster member boots, that member serves any file systems on the devices that are on buses that are private to the member. However, when you manually mount a file system or mount it via the fstab file, the server is chosen based on connectivity to the device from available servers. This can result in a file system being served by a member that is not local to it. In this case, you might see a performance improvement if you manually relocate the CFS server to the local member.

9.3.1    When File Systems Cannot Fail Over

In most instances, CFS provides seamless failover for the file systems in the cluster. If the cluster member serving a file system becomes unavailable, CFS fails over the server to an available member. However, in the following situations, no path to the file system exists and the file system cannot fail over:

In either case, the cfsmgr command returns the following status for the file system (or domain):

Server Status : Not Served

Attempts to access the file system return the following message:

filename I/O error
 

When a cluster member that is connected to the storage becomes available, the file system becomes served again and accesses to the file system begin to work. Other than making the member available, you do not need to take any action.

9.3.2    Direct Access Cached Reads

TruCluster Server implements direct access cached reads, which is a performance enhancement for AdvFS file systems. Direct access cached reads allow CFS to read directly from storage simultaneously on behalf of multiple cluster members.

If the cluster member that issues the read is directly connected to the storage that makes up the file system, direct access cached reads access the storage directly and do not go through the cluster interconnect to the CFS server.

If a CFS client is not directly connected to the storage that makes up a file system (for example, if the storage is private to a cluster member), that client will still issue read requests directly to the devices, but the device request dispatcher layer sends the read request across the cluster interconnect to the device.

Direct access cached reads are consistent with the existing CFS served file-system model, and the CFS server continues to perform metadata and log updates for the read operation.

Direct access cached reads are implemented only for AdvFS file systems. In addition, direct access cached reads are performed only for files that are at least 64K in size. The served I/O method is more efficient when processing smaller files.

Direct access cached reads are enabled by default and are not user-settable or tunable. However, if an application uses direct I/O, as described in Section 9.3.3.5, that choice is given priority and direct access cached reads are not performed for that application.

Use the cfsstat directio command to display direct I/O statistics. The direct i/o reads field includes direct access cached read statistics. See Section 9.3.3.5.3 for a description of these fields.

# cfsstat directio
Concurrent Directio Stats:
     941 direct i/o reads
       0 direct i/o writes
       0 aio raw reads
       0 aio raw writes
       0 unaligned block reads
      29 fragment reads
      73 zero-fill (hole) reads
       0 file-extending writes
       0 unaligned block  writes
       0 hole writes
       0 fragment writes
       0 truncates
 

9.3.3    Optimizing CFS Performance

You can tune CFS performance by doing the following:

9.3.3.1    CFS Load Balancing

When a cluster boots, the TruCluster Server software ensures that each file system is directly connected to the member that serves it. This means that file systems on a device connected to a member's local bus are served by that member. A file system on a device on a shared SCSI bus is served by one of the members that is directly connected to that SCSI bus.

In the case of AdvFS, the cluster member that serves the first fileset in a domain becomes the CFS server for all other filesets in that domain.

When a cluster boots, typically the first member up that is connected to a shared SCSI bus is the first member to see devices on the shared bus. This member then becomes the CFS server for all the file systems on all the devices on that shared bus. Because of this, most file systems are probably served by a single member. This situation can have negative consequences for performance. It is important to monitor file system activity on the cluster and load balance the CFS servers as necessary.

Use the cfsmgr command to determine good candidates for relocating the CFS servers. The cfsmgr command displays statistics on file system usage on a per-member basis. For example, suppose you want to determine whether to relocate the server for /accounts to improve performance. First, confirm the current CFS server of /accounts as follows:

# cfsmgr /accounts
 
 Domain or filesystem name = /accounts
 Server Name = systemb
 Server Status : OK
 

Then, get the CFS statistics for the current server and the candidate servers by entering the following commands:

# cfsmgr -h systemb -a statistics /accounts
 
 Counters for the filesystem /accounts:
        read_ops = 4149
        write_ops = 7572
        lookup_ops = 82563
        getattr_ops = 408165
        readlink_ops = 18221
        access_ops = 62178
        other_ops = 123112
 
 Server Status : OK
# cfsmgr -h systema -a statistics /accounts
 
 Counters for the filesystem /accounts:
        read_ops = 26836
        write_ops = 3773
        lookup_ops = 701764
        getattr_ops = 561806
        readlink_ops = 28712
        access_ops = 81173
        other_ops = 146263
 
 Server Status : OK
# cfsmgr -h systemc -a statistics /accounts
 
 Counters for the filesystem /accounts:
        read_ops = 18746
        write_ops = 13553
        lookup_ops = 475015
        getattr_ops = 280905
        readlink_ops = 24306
        access_ops = 84283
        other_ops =  103671
 
 Server Status : OK
# cfsmgr -h systemd -a statistics /accounts
 
 Counters for the filesystem /accounts:
        read_ops = 98468
        write_ops = 63773
        lookup_ops = 994437
        getattr_ops = 785618
        readlink_ops = 44324
        access_ops = 101821
        other_ops = 212331
 
 Server Status : OK
 

In this example, most of the read and write activity for /accounts is from member systemd, not from the member that is currently serving it, systemb. Assuming that systemd is physically connected to the storage for /accounts, systemd is a good choice as the CFS server for /accounts.

Determine whether systemd and the storage for /accounts are physically connected as follows:

  1. Find out where /accounts is mounted. You can either look in /etc/fstab or use the mount command. If there are a large number of mounted file systems, you might want to use grep as follows:

    # mount | grep accounts
    accounts_dmn#accounts on /accounts type advfs (rw)
     
    

  2. Look at the directory /etc/fdmns/accounts_dmn to learn the device where the AdvFS domain accounts_dmn is mounted as follows:

    # ls /etc/fdmns/accounts_dmn
    dsk6c
     
    

  3. Enter the drdmgr command to learn the servers of dsk6 as follows:

    # drdmgr -a server dsk6
                       Device Name: dsk6
                       Device Type: Direct Access IO Disk
                     Device Status: OK
                 Number of Servers: 4
                       Server Name: membera
                      Server State: Server
                       Server Name: memberb
                      Server State: Server
                       Server Name: memberc
                      Server State: Server
                       Server Name: memberd
                      Server State: Server
     
    

    Because dsk6 has multiple servers, it is on a shared bus. Because systemd is one of the servers, there is a physical connection.

  4. Relocate the CFS server of /accounts to systemd as follows:

    # cfsmgr -a server=systemd /accounts
    

Even in cases where the CFS statistics do not show an inordinate load imbalance, we recommend that you distribute the CFS servers among the available members that are connected to the shared bus. Doing so can improve overall cluster performance.

9.3.3.2    Automatically Distributing CFS Server Load

To automatically have a particular cluster member act as the CFS server for a file system or domain, you can place a script in /sbin/init.d that calls the cfsmgr command to relocate the server for the file system or domain to the desired cluster member.

For example, if you want cluster member alpha to serve the domain accounting, place the following cfsmgr command in a startup script:

# cfsmgr -a server=alpha -d accounting
 

Have the script look for successful relocation and retry the operation if it fails. The cfsmgr command returns a nonzero value on failure; however, it is not sufficient for the script to keep trying on a bad exit value. The relocation might have failed because a failover or relocation is already in progress.

On failure of the relocation, have the script search for one of the following messages:

	Server Status : Failover/Relocation in Progress
 
	Server Status : Cluster is busy, try later
 

If either of these messages occurs, have the script retry the relocation. On any other error, have the script print an appropriate message and exit.
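
The following is a minimal sketch of such a startup script, using the accounting domain and member alpha from the previous example. The retry count, sleep interval, and temporary file name are arbitrary choices for illustration; adapt them to your site.

#!/sbin/sh
# Hypothetical /sbin/init.d script: relocate the CFS server of the
# AdvFS domain "accounting" to member "alpha", retrying while a
# failover or relocation is already in progress.

retries=10
while [ $retries -gt 0 ]
do
    if cfsmgr -a server=alpha -d accounting > /tmp/cfsmgr.$$ 2>&1
    then
        rm -f /tmp/cfsmgr.$$
        exit 0                              # relocation succeeded
    fi
    if grep "Failover/Relocation in Progress" /tmp/cfsmgr.$$ > /dev/null ||
       grep "Cluster is busy, try later" /tmp/cfsmgr.$$ > /dev/null
    then
        sleep 30                            # transient condition; try again
        retries=`expr $retries - 1`
    else
        echo "cfsmgr relocation of accounting failed:" >&2
        cat /tmp/cfsmgr.$$ >&2
        rm -f /tmp/cfsmgr.$$
        exit 1
    fi
done
echo "cfsmgr relocation of accounting did not complete after retries" >&2
rm -f /tmp/cfsmgr.$$
exit 1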

9.3.3.3    Tuning the Block Transfer Size

During client-side reads and writes, CFS passes data in a predetermined block size. Generally, the larger the block size, the better the I/O performance.

There are two ways to control the CFS I/O blocksize:

Although a large block size generally yields better performance, there are special cases where doing CFS I/O in smaller block sizes can be advantageous. If reads and writes for a file system are small and random, then a large CFS I/O block size does not improve performance and the extra processing is wasted.

For example, if the I/O for a file system is 8K or less and totally random, then a value of 8 for FSBSIZE is appropriate for that file system.

The default value for FSBSIZE is determined by the value of the cfsiosize kernel attribute. To learn the current value of cfsiosize, use the sysconfig command. For example:

# sysconfig -q cfs cfsiosize
cfs:
cfsiosize = 65536
 

A file system where all the I/O is small in size but multiple threads are reading or writing the file system sequentially is not a candidate for a small value for FSBSIZE. Only when the I/O to a file system is both small and random does it make sense to set FSBSIZE for that file system to a small value.
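
For example, the following hypothetical run-time change doubles the default CFS I/O block size (the value is in bytes) on the member where the command is run. If the attribute cannot be changed at run time on your system, set it instead in /etc/sysconfigtab with the sysconfigdb command and reboot the member.

# sysconfig -r cfs cfsiosize=131072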

9.3.3.4    Changing the Number of Read-Ahead and Write-Behind Threads

When CFS detects sequential accesses to a file, it employs read-ahead threads to read the next I/O block size worth of data. CFS also employs write-behind threads to buffer the next block of data in anticipation that it too will be written to disk. Use the cfs_async_biod_threads kernel attribute to set the number of I/O threads that perform asynchronous read ahead and write behind. Read-ahead and write-behind threads apply only to reads and writes originating on CFS clients.

The default size for cfs_async_biod_threads is 32. In an environment where more than 32 large files are being read or written sequentially at the same time, increasing cfs_async_biod_threads can improve CFS performance, particularly if the applications using the files can benefit from lower latencies.

The number of read-ahead and write-behind threads is tunable from 0 through 128. When not in use, the threads consume few system resources.
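
For example, the following sketch queries the current value and raises it to 64 threads at run time; it assumes that the attribute belongs to the cfs kernel subsystem (like cfsiosize) and can be changed while the member is running.

# sysconfig -q cfs cfs_async_biod_threads
# sysconfig -r cfs cfs_async_biod_threads=64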

9.3.3.5    Taking Advantage of Direct I/O

When an application opens an AdvFS file with the O_DIRECTIO flag in the open system call, data I/O is direct to the storage; the system software does no data caching for the file at the file-system level. In a cluster, this arrangement supports concurrent direct I/O on the file from any member in the cluster. That is, regardless of which member originates the I/O request, I/O to a file does not go through the cluster interconnect to the CFS server. Database applications frequently use direct I/O in conjunction with raw asynchronous I/O (which is also supported in a cluster) to improve I/O performance.

The best performance on a file that is opened for direct I/O is achieved under the following conditions:

The following conditions can result in less than optimal direct I/O performance:

An application that uses direct I/O is responsible for managing its own caching. When performing multithreaded direct I/O on a single cluster member or multiple members, the application must also provide synchronization to ensure that, at any instant, only one thread is writing a sector while others are reading or writing.

For a discussion of direct I/O programming issues, see the chapter on optimizing techniques in the Tru64 UNIX Programmer's Guide.

9.3.3.5.1    Differences Between Cluster and Standalone AdvFS Direct I/O

The following list presents direct I/O behavior in a cluster that differs from that in a standalone system:

9.3.3.5.2    Cloning a Fileset With Files Open in Direct I/O Mode

As described in Section 9.3.3.5, when an application opens a file with the O_DIRECTIO flag in the open system call, I/O to the file does not go through the cluster interconnect to the CFS server. However, if you clone a fileset that has files open in Direct I/O mode, the I/O does not follow this model and might cause considerable performance degradation. (Read performance is not impacted by the cloning.)

The clonefset utility, which is described in the clonefset(8) reference page, creates a read-only copy, called a clone fileset, of an AdvFS fileset. A clone fileset is a read-only snapshot of fileset data structures (metadata). That is, when you clone a fileset, the utility copies only the structure of the original fileset, not its data. If you then modify files in the original fileset, every write to the fileset causes a synchronous copy-on-write of the original data to the clone if the original data has not already been copied. In this way, the clone fileset contents remain the same as when you first created it.

If the fileset has files open in Direct I/O mode, when you modify a file AdvFS copies the original data to the clone storage. AdvFS does not send this copy operation over the cluster interconnect. However, CFS does send the write operation for the changed data in the fileset over the interconnect to the CFS server unless the application using Direct I/O mode happens to be running on the CFS server. Sending the write operation over the cluster interconnect negates the advantages of opening the file in Direct I/O mode.

To retain the benefits of Direct I/O mode, remove the clone as soon as the backup operation is complete so that writes are again written directly to storage and are not sent over the cluster interconnect.
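
The following sketch shows the general backup sequence with hypothetical names (domain accounts_dmn, fileset accounts, clone accounts_clone, tape drive tape0, and mount point /backup_clone); see clonefset(8) and rmfset(8) for the exact syntax on your system:

# clonefset accounts_dmn accounts accounts_clone
# mkdir -p /backup_clone
# mount -t advfs -r accounts_dmn#accounts_clone /backup_clone
# vdump -0 -f /dev/tape/tape0 /backup_clone
# umount /backup_clone
# rmfset accounts_dmn accounts_clone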

9.3.3.5.3    Gathering Statistics on Direct I/O

If the performance gain for an application that uses direct I/O is less than you expected, you can use the cfsstat command to examine per-node global direct I/O statistics.

Use cfsstat to look at the global direct I/O statistics without the application running. Then execute the application and examine the statistics again to determine whether the paths that do not optimize direct I/O behavior were being executed.

The following example shows how to use the cfsstat command to get direct I/O statistics:

# cfsstat directio
Concurrent Directio Stats:
     160 direct i/o reads
     160 direct i/o writes
       0 aio raw reads
       0 aio raw writes
       0 unaligned block reads
       0 fragment reads
       0 zero-fill (hole) reads
     160 file-extending writes
       0 unaligned block  writes
       0 hole writes
       0 fragment writes
       0 truncates
 

The individual statistics have the following meanings:

9.3.3.6    Adjusting CFS Memory Usage

In situations where one cluster member is the CFS server for a large number of file systems, the client members may cache a great many vnodes from the served file systems. For each cached vnode on a client, even vnodes that are not actively used, the CFS server must allocate 800 bytes of system memory for the CFS token structure that is needed to track the file at the CFS layer. In addition to this, the CFS token structures typically require corresponding AdvFS access structures and vnodes, resulting in a near-doubling of the amount of memory that is used.

By default, each client can use up to 4 percent of memory to cache vnodes. When multiple clients fill up their caches with vnodes from a CFS server, system memory on the server can become overtaxed, causing it to hang.

The svrcfstok_max_percent kernel attribute is designed to prevent such system hangs. The attribute sets an upper limit on the amount of memory that is allocated by the CFS server to track vnode caching on clients. The default value is 25 percent. The memory is used only if the server load requires it. It is not allocated up front.

After the svrcfstok_max_percent limit is reached on the server, an application that accesses files served by that member gets an EMFILE error. Applications that use perror() to report errno write the message "too many open files" to the standard error stream (stderr), the controlling tty, or the log file that the application uses. Although you see EMFILE error messages, no cached data is lost.

If applications start getting EMFILE errors, follow these steps:

  1. Determine whether the CFS client is out of vnodes, as follows:

    1. Get the current value of the max_vnodes kernel attribute:

      # sysconfig -q vfs max_vnodes
      

    2. Use dbx to get the values of total_vnodes and free_vnodes:

      # dbx -k /vmunix /dev/mem
      dbx version 5.0
      Type 'help' for help.
      (dbx)pd total_vnodes
      total_vnodes_value
       
      

      Get the value for free_vnodes:

      (dbx)pd free_vnodes
      free_vnodes_value
       
      

      If total_vnodes equals max_vnodes and free_vnodes equals 0, then that member is out of vnodes. In this case, you can increase the value of the max_vnodes kernel attribute. You can use the sysconfig command to change max_vnodes on a running member. For example, to set the maximum number of vnodes to 20000, enter the following:

      # sysconfig -r vfs max_vnodes=20000
      

  2. If the CFS client is not out of vnodes, then determine whether the CFS server has used all the memory that is available for token structures (svrcfstok_max_percent), as follows:

    1. Log on to the CFS server.

    2. Start the dbx debugger and get the current value for svrtok_active_svrcfstok:

      # dbx -k /vmunix /dev/mem
      dbx version 5.0
      Type 'help' for help.
      (dbx)pd svrtok_active_svrcfstok
      active_svrcfstok_value
       
      

    3. Get the value for cfs_max_svrcfstok:

      (dbx)pd cfs_max_svrcfstok
      max_svrcfstok_value
       
      

    If svrtok_active_svrcfstok is equal to or greater than cfs_max_svrcfstok, then the CFS server has used all the memory that is available for token structures.

    In this case, the best solution to make the file systems usable again is to relocate some of the file systems to other cluster members. If that is not possible, then the following solutions are acceptable:

When a CFS server reaches the svrcfstok_max_percent limit, the typical remedy is to relocate some of the CFS file systems so that the burden of serving them is shared among cluster members. You can use startup scripts that run the cfsmgr command to relocate file systems automatically around the cluster at member startup.

Setting svrcfstok_max_percent below the default is recommended only on smaller memory systems that run out of memory because the 25 percent default value is too high for them.

9.3.3.7    Using Memory Mapped Files

Using memory mapping to share a file across the cluster for anything other than read-only access can negatively affect performance. CFS I/O to a file does not perform well if multiple members are simultaneously modifying the data. This situation forces premature cache flushes to ensure that all nodes have the same view of the data at all times.

9.3.3.8    Avoid Full File Systems

If free space in a file system is less than 50 MB or less than 10 percent of the file system's size, whichever is smaller, then write performance to the file system from CFS clients suffers. This is because all writes to nearly full file systems are sent immediately to the server to guarantee correct ENOSPC semantics.

9.3.3.9    Other Strategies

The following measures can improve CFS performance:

9.3.4    MFS and UFS File Systems Supported

TruCluster Server Version 5.1A includes read/write support for Memory File System (MFS) and UNIX File System (UFS) file systems.

When you mount a UFS file system in a cluster for read/write access, or when you mount an MFS file system in a cluster for read-only or read/write access, the mount command server_only argument is used by default. These file systems are treated as partitioned file systems, as described in Section 9.3.5. That is, the file system is accessible for both read-only and read/write access only by the member that mounts it. Other cluster members cannot read from, or write to, the MFS or UFS file system. There is no remote access; there is no failover.

If you want to mount a UFS file system for read-only access by all cluster members, you must explicitly mount it read-only.
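
For example, the following hypothetical command mounts a UFS file system read-only so that every member can read it through CFS; the disk partition (dsk20c) and mount point are illustrations only:

# mount -t ufs -r /dev/disk/dsk20c /archive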

9.3.5    Partitioning File Systems

CFS makes all files accessible to all cluster members. Each cluster member has the same access to a file, whether the file is stored on a device that is connected to all cluster members or on a device that is private to a single member. However, CFS does make it possible to mount an AdvFS file system so that it is accessible to only a single cluster member. This is referred to as file system partitioning.

The Available Server Environment (ASE), an earlier version of the TruCluster Server product, offered functionality like that of file system partitioning. File system partitioning is provided in TruCluster Server as of Version 5.1 to ease migration from ASE; it is not intended as a general-purpose method for restricting file system access to a single member.

To mount a partitioned file system, log on to the member that you want to give exclusive access to the file system. Run the mount command with the server_only option. This mounts the file system on the member where you execute the mount command and gives that member exclusive access to the file system. Although only the mounting member has access to the file system, all members, cluster-wide, can see the file system mount.

The server_only option can be applied only to AdvFS, MFS, and UFS file systems.
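
For example, the following hypothetical command gives the member where it is run exclusive access to an AdvFS fileset (the domain scratch_dmn and fileset scratch are illustrations only):

# mount -t advfs -o server_only scratch_dmn#scratch /scratch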

Partitioned file systems are subject to the following limitations:

9.3.6    Block Devices and Cache Coherency

A single block device can have multiple aliases. In this situation, multiple block device special files in the file system namespace will contain the same dev_t. These aliases can potentially be located across multiple domains or file systems in the namespace.

On a standalone system, cache coherency is guaranteed among all opens of the common underlying block device regardless of which alias was used on the open() call for the device. In a cluster, however, cache coherency can be obtained only among all block device file aliases that reside on the same domain or file system.

For example, if cluster member mutt serves a domain with a block device file and member jeff serves a domain with another block device file with the same dev_t, then cache coherency is not provided if I/O is performed simultaneously through these two aliases.

9.4    Managing the Device Request Dispatcher

The device request dispatcher subsystem makes physical disk and tape storage transparently available to all cluster members, regardless of where the storage is physically located in the cluster. When an application requests access to a file, CFS passes the request to AdvFS, which then passes it to the device request dispatcher. In the file system hierarchy, the device request dispatcher sits right above the device drivers.

The primary tool for managing the device request dispatcher is the drdmgr command. A number of examples of using the command appear in this section. For more information, see drdmgr(8).

9.4.1    Direct-Access I/O and Single-Server Devices

The device request dispatcher follows a client/server model; members serve devices, such as disks, tapes, and CD-ROM drives.

Devices in a cluster are either direct-access I/O devices or single-server devices. A direct-access I/O device supports simultaneous access from multiple cluster members. A single-server device supports access from only a single member.

Direct-access I/O devices on a shared bus are served by all cluster members on that bus. A single-server device, whether on a shared bus or directly connected to a cluster member, is served by a single member. All other members access the served device through the serving member. Note that direct-access I/O devices are part of the device request dispatcher subsystem, and have nothing to do with direct I/O (opening a file with the O_DIRECTIO flag to the open system call), which is handled by CFS. See Section 9.3.3.5 for information about direct I/O and CFS.

Typically, disks on a shared bus are direct-access I/O devices, but in certain circumstances, some disks on a shared bus can be single-server. The exceptions occur when you add an RZ26, RZ28, RZ29, or RZ1CB-CA disk to an established cluster. Initially, such devices are single-server devices. See Section 9.4.1.1 for more information. Tape devices are always single-server devices.

Although single-server disks on a shared bus are supported, they are significantly slower when used as member boot disks or swap files, or for the retrieval of core dumps. We recommend that you use direct-access I/O disks in these situations.

Figure 9-3 shows a four-node cluster with five disks and a tape drive on the shared bus. Note that SystemD is not on the shared bus. Its access to cluster storage is routed through the Memory Channel cluster interconnect.

Figure 9-3:  Four Node Cluster

Disks on the shared bus are served by all the cluster members on the bus. You can confirm this by looking for the device request dispatcher server of dsk3 as follows:

# drdmgr -a server dsk3
                   Device Name: dsk3
                   Device Type: Direct Access IO Disk
                 Device Status: OK
             Number of Servers: 3
                   Server Name: systema
                  Server State: Server
                   Server Name: systemb
                  Server State: Server
                   Server Name: systemc
                  Server State: Server
 

The View line that appears at the top of drdmgr output identifies the member on which the command was executed; in this example, the drdmgr command was run on systemc.

Because dsk3 is a direct-access I/O device on the shared bus, all three systems on the bus serve it. This means that, when any member on the shared bus accesses the disk, the access is directly from the member to the device.

Disks on private buses are served by the system that they are local to. For example, the server of dsk7 is systemb:

# drdmgr -a server dsk7
                   Device Name: dsk7
                   Device Type: Direct Access IO Disk
                 Device Status: OK
             Number of Servers: 1
                   Server Name: systemb
                  Server State: Server
 

Tape drives are always single-server. Because tape0 is on a shared bus, any member on that bus can act as its server. When the cluster is started, the first member up that has access to the tape drive becomes the server for the tape drive.

The numbering of disks indicates that when the cluster booted, systema came up first. It detected its private disks first and labeled them, then it detected the disks on the shared bus and labeled them. Because systema came up first, it is also the server for tape0. To confirm this, enter the following command:

# drdmgr -a server tape0
                   Device Name: tape0
                   Device Type: Served Tape
                 Device Status: OK
             Number of Servers: 1
                   Server Name: systema
                  Server State: Server
 

To change tape0's server to systemc, enter the drdmgr command as follows:

# drdmgr -a server=systemc /dev/tape/tape0
 

For any single-server device, the serving member is also the access node. The following command confirms this:

# drdmgr -a accessnode tape0
                   Device Name: tape0
              Access Node Name: systemc
 

Unlike the device request dispatcher SERVER attribute, which for a given device is the same on all cluster members, the value of the ACCESSNODE attribute is specific to a cluster member.

Any system on a shared bus is always its own access node for the direct-access I/O devices on the same shared bus.

Because systemd is not on the shared bus, for each direct-access I/O device on the shared bus you can specify the access node to be used by systemd when it accesses the device. The access node must be one of the members on the shared bus.

The result of the following command is that systemc handles all device request dispatcher activity between systemd and dsk3:

# drdmgr -h systemd -a accessnode=systemc dsk3
 

9.4.1.1    Devices Supporting Direct-Access I/O

RAID-fronted disks are direct-access I/O capable. The following are Redundant Array of Independent Disks (RAID) controllers:

Any RZ26, RZ28, RZ29, or RZ1CB-CA disks that are already installed in a system at the time the system becomes a cluster member, either through the clu_create or clu_add_member command, are automatically enabled as direct-access I/O disks. To later add one of these disks as a direct-access I/O disk, you must use the procedure in Section 9.2.3.

9.4.1.2    Replacing RZ26, RZ28, RZ29, or RZ1CB-CA as Direct-Access I/O Disks

If you replace an RZ26, RZ28, RZ29, or RZ1CB-CA direct-access I/O disk with a disk of the same type (for example, replace an RZ28-VA with another RZ28-VA), follow these steps to make the new disk a direct-access I/O disk:

  1. Physically install the disk in the bus.

  2. On each cluster member, enter the hwmgr command to scan for the new disk as follows:

    # hwmgr -scan comp -cat scsi_bus
    

    Allow a minute or two for the scans to complete.

  3. If you want the new disk to have the same device name as the disk it replaced, use the hwmgr -redirect scsi command. For details, see hwmgr(8) and the section on replacing a failed SCSI device in the Tru64 UNIX System Administration manual.

  4. On each cluster member, enter the clu_disk_install command:

    # clu_disk_install
    

Note

If the cluster has a large number of storage devices, the clu_disk_install command can take several minutes to complete.

9.4.1.3    HSZ Hardware Supported on Shared Buses

For a list of hardware that is supported on shared buses, see the TruCluster Server Version 5.1A Software Product Description.

If you try to use an HSZ40A or an HSZ that does not have the proper firmware revision on a shared bus, the cluster might hang when there are multiple simultaneous attempts to access the HSZ.

9.5    Managing AdvFS in a Cluster

For the most part, the Advanced File System (AdvFS) on a cluster is like that on a standalone system. However, there are some cluster-specific considerations, which are described in this section:

9.5.1    Integrating AdvFS Files from a Newly Added Member

Suppose that you add a new member to the cluster and that new member has AdvFS volumes and filesets from when it ran as a standalone system. To integrate these volumes and filesets into the cluster, you need to do the following:

  1. Modify the /etc/fstab file listing the domains#filesets that you want to integrate into the cluster.

  2. Make the new domains known to the cluster, either by manually entering the domain information into /etc/fdmns or by running the advscan command.
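
For example, to make a hypothetical domain named projects_dmn, whose single volume is dsk21c, known to the cluster by hand rather than with advscan, you could re-create its /etc/fdmns entry as follows and then add the corresponding domain#fileset entries to /etc/fstab:

# mkdir /etc/fdmns/projects_dmn
# cd /etc/fdmns/projects_dmn
# ln -s /dev/disk/dsk21c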

For information on the advscan command, see advscan(8). For examples of reconstructing /etc/fdmns, see the section on restoring an AdvFS file system in the Tru64 UNIX AdvFS Administration manual.

9.5.2    Create Only One Fileset in Cluster Root Domain

The root domain, cluster_root, must contain only a single fileset. If you create more than one fileset in cluster_root (you are not prevented from doing so), it can lead to a panic if the cluster_root domain needs to fail over.

As an example of when this situation might occur, consider cloned filesets. As described in advfs(4), a clone fileset is a read-only copy of an existing fileset, which you can mount as you do other filesets. If you create a clone of the clusterwide root (/) and mount it, the cloned fileset is added to the cluster_root domain. If the cluster_root domain has to fail over while the cloned fileset is mounted, the cluster will panic.

Note

If you make backups of the clusterwide root from a cloned fileset, minimize the amount of time during which the clone is mounted. Mount the cloned fileset, perform the backup, and unmount the clone as quickly as possible.

9.5.3    Do Not Add a Volume to a Member's Root Domain

You cannot use the addvol command to add volumes to a member's root domain (rootmemberID_domain#root). Instead, you must delete the member from the cluster, use diskconfig or SysMan to configure the disk appropriately, and then add the member back into the cluster. For the configuration requirements for a member boot disk, see the Cluster Installation manual.

9.5.4    Using the addvol and rmvol Commands in a Cluster

You can manage AdvFS domains from any cluster member, regardless of whether the domains are mounted on the local member or a remote member. However, when you use the addvol or rmvol command from a member that is not the CFS server for the domain you are managing, the commands use rsh to execute remotely on the member that is the CFS server for the domain. This has the following consequences:

The rmvol and addvol commands use rsh when the member where the commands are executed is not the server of the domain. For rsh to function, the default cluster alias must appear in the /.rhosts file. The entry for the cluster alias in /.rhosts can take the form of the fully-qualified hostname or the unqualified hostname. Although the plus sign (+) can appear in place of the hostname, allowing all hosts access, this is not recommended for security reasons.

The clu_create command automatically places the cluster alias in /.rhosts, so rsh should work without your intervention. If the rmvol or addvol command fails because of rsh failure, the following message is returned:

rsh failure, check that the /.rhosts file allows cluster alias access.
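
For example, if the default cluster alias were deli (a hypothetical name), one of the following /.rhosts entries, either unqualified or fully qualified, would allow the rsh access that addvol and rmvol need:

deli
deli.example.com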

9.5.5    User and Group File System Quotas Are Supported

TruCluster Server Version 5.1A includes quota support that allows you to limit both the number of files and the total amount of disk space that are allocated in an AdvFS filesystem on behalf of a given user or group.

Quota support in a TruCluster Server environment is similar to quota support in the Tru64 UNIX base system, with the following exceptions:

This section describes information that is unique to managing disk quotas in a TruCluster Server environment. For general information about managing quotas, see the Tru64 UNIX System Administration guide.

9.5.5.1    Quota Hard Limits

In a Tru64 UNIX system, a hard limit places an absolute upper boundary on the number of files or amount of disk space that a given user or group can allocate on a given filesystem. When a hard limit is reached, disk space allocations or file creations are not allowed. System calls that would cause the hard limit to be exceeded fail with a quota violation.

In a TruCluster Server environment, hard limits for the number of files are enforced as they are in a standalone Tru64 UNIX system.

However, hard limits on the total amount of disk space are not as rigidly enforced. For performance reasons, CFS allows client nodes to cache a configurable amount of data for a given user or group without any communication with the member serving that data. After the data is cached on behalf of a given write operation and the write operation returns to the caller, CFS guarantees that, barring a failure of the client node, the cached data will eventually be written to disk at the server.

Writing the cached data takes precedence over strictly enforcing the disk quota. If and when a quota violation occurs, the data in the cache is written to disk regardless of the violation. Subsequent writes by this group or user are not cached until the quota violation is corrected.

Because additional data is not written to the cache while quota violations are being generated, the hard limit is never exceeded by more than the sum of quota_excess_blocks on all cluster members. The actual disk space limit for a user or group is therefore the hard limit plus the sum of quota_excess_blocks on all cluster members.

The amount of data that a given user or group is allowed to cache is determined by the quota_excess_blocks value, which is located in the member-specific /etc/sysconfigtab file. The quota_excess_blocks value is expressed in units of 1024-byte blocks, and the default value of 1024 represents 1 MB of disk space. The value of quota_excess_blocks does not have to be the same on all cluster members. You might use a larger quota_excess_blocks value on cluster members on which you expect most of the data to be generated, and accept the default value for quota_excess_blocks on other cluster members.

9.5.5.2    Setting the quota_excess_blocks Value

The value for quota_excess_blocks is maintained in the /etc/sysconfigtab file in the cfs stanza.

Avoid making manual changes to this file. Instead, use the sysconfigdb command to make changes. This utility automatically makes any changes available to the kernel and preserves the structure of the file so that future upgrades merge in correctly.
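For example, to change quota_excess_blocks on one member, you might create a stanza file (the file name /tmp/cfs_quota.stanza and the value 2048 are illustrative) that contains the following:

cfs:
        quota_excess_blocks = 2048

Then merge it into that member's sysconfigtab; this is a sketch of the typical merge usage described in sysconfigdb(8):

# sysconfigdb -m -f /tmp/cfs_quota.stanza cfs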

Performance for a given user or group can be affected by quota_excess_blocks. If this value is set too low, CFS cannot use the cache efficiently; setting quota_excess_blocks to less than 64 KB (a value of 64) has a severe performance impact. Conversely, setting quota_excess_blocks too high increases the actual amount of disk space that a user or group can consume.

We recommend accepting the quota_excess_blocks default of 1 MB, or increasing it only as much as is practical, given that a larger value raises the potential upper limit on disk block usage. When determining how to set this value, consider that the worst-case upper boundary is determined as follows:

(administrator-specified hard limit) +
  (sum of quota_excess_blocks on each client node)

CFS makes a significant effort to minimize the amount by which the hard quota limit is exceeded, and it is very unlikely that you would reach the worst-case upper boundary.
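For example, assuming a hard limit of 100 MB for a user and three members acting as CFS clients of the domain, each with the default quota_excess_blocks value of 1024 blocks (1 MB), the worst-case upper boundary for that user is 100 MB + (3 x 1 MB) = 103 MB.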

9.5.6    Storage Connectivity and AdvFS Volumes

All volumes in an AdvFS domain must have the same connectivity if you want failover capability. Volumes have the same connectivity when either of the following conditions is true:

The drdmgr and hwmgr commands can give you information about which systems serve which disks. To get a graphical display of the cluster hardware configuration, including active members, buses, storage devices, and their connections, use the sms command to invoke the graphical interface for the SysMan Station, and then select Hardware from the Views menu.
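For example, a query such as the following reports, among other attributes, which member currently serves a given disk; dsk13 is one of the shared disks in the hwmgr example in Section 9.6.1, and drdmgr(8) describes the attributes that are displayed:

# drdmgr dsk13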

9.6    Considerations When Creating New File Systems

Most aspects of creating new file systems are the same in a cluster and a standalone environment. The Tru64 UNIX AdvFS Administration manual presents an extensive description of how to create AdvFS file systems in a standalone environment.

For information about adding disks to the cluster, see Section 9.2.3.

The following are important cluster-specific considerations for creating new file systems:

9.6.1    Verifying Disk Connectivity

To ensure the highest availability, make sure that all disks that are used for volumes in an AdvFS domain have the same connectivity.

Disks have the same connectivity when either of the following conditions is true:

The easiest way to verify disk connectivity is to use the sms command to invoke the graphical interface for the SysMan Station, and then select Hardware from the Views menu.

For example, in Figure 9-1, the SCSI bus that is connected to the pza0 adapters is shared by all three cluster members. All disks on that bus have the same connectivity.

You can also use the hwmgr command to view all the devices on the cluster and then pick out those disks that show up multiple times because they are connected to several members. For example:

# hwmgr -view devices -cluster
 
HWID: Device Name         Mfg     Model            Hostname   Location
-------------------------------------------------------------------------------
  3: kevm                                         pepicelli
 28: /dev/disk/floppy0c          3.5in floppy     pepicelli  fdi0-unit-0
 40: /dev/disk/dsk0c     DEC     RZ28M    (C) DEC pepicelli  bus-0-targ-0-lun-0
 41: /dev/disk/dsk1c     DEC     RZ28L-AS (C) DEC pepicelli  bus-0-targ-1-lun-0
 42: /dev/disk/dsk2c     DEC     RZ28     (C) DEC pepicelli  bus-0-targ-2-lun-0
 43: /dev/disk/cdrom0c   DEC     RRD46    (C) DEC pepicelli  bus-0-targ-6-lun-0
 44: /dev/disk/dsk13c    DEC     RZ28M    (C) DEC pepicelli  bus-1-targ-1-lun-0
 44: /dev/disk/dsk13c    DEC     RZ28M    (C) DEC polishham  bus-1-targ-1-lun-0
 44: /dev/disk/dsk13c    DEC     RZ28M    (C) DEC provolone  bus-1-targ-1-lun-0
 45: /dev/disk/dsk14c    DEC     RZ28L-AS (C) DEC pepicelli  bus-1-targ-2-lun-0
 45: /dev/disk/dsk14c    DEC     RZ28L-AS (C) DEC polishham  bus-1-targ-2-lun-0
 45: /dev/disk/dsk14c    DEC     RZ28L-AS (C) DEC provolone  bus-1-targ-2-lun-0
 46: /dev/disk/dsk15c    DEC     RZ29B    (C) DEC pepicelli  bus-1-targ-3-lun-0
 46: /dev/disk/dsk15c    DEC     RZ29B    (C) DEC polishham  bus-1-targ-3-lun-0
 46: /dev/disk/dsk15c    DEC     RZ29B    (C) DEC provolone  bus-1-targ-3-lun-0
        .
        .
        .

In this partial output, dsk0, dsk1, and dsk2 are private disks that are connected to pepicelli's local bus. None of these are appropriate for a file system that needs failover capability, and they are not good choices for Logical Storage Manager (LSM) volumes.

dsk13 (HWID 44), dsk14 (HWID 45), and dsk15 (HWID 46) are connected to pepicelli, polishham, and provolone. These three disks all have the same connectivity.
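For example, a failover-capable domain could be built from two of these disks. The following is only a sketch; the domain name shared_dom and the fileset name data are illustrative, and the commands assume that the disks are otherwise unused:

# mkfdmn /dev/disk/dsk13c shared_dom
# addvol /dev/disk/dsk14c shared_dom
# mkfset shared_dom data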

9.6.2    Looking for Available Disks

When you want to determine whether disks are already in use, look for the quorum disk, disks containing the clusterwide file systems, and member boot disks and swap areas.

9.6.2.1    Looking for the Location of the Quorum Disk

You can learn the location of the quorum disk by using the clu_quorum command. In the following example, the partial output for the command shows that dsk10 is the cluster quorum disk:

# clu_quorum
 Cluster Quorum Data for: deli as of Wed Apr 25 09:27:36 EDT 2001
 
Cluster Common Quorum Data
Quorum disk:   dsk10h
        .
        .
        .

You can also use the disklabel command to look for a quorum disk. All partitions in a quorum disk should be unused, except for the h partition, which has fstype cnx.
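For example, the following command displays the label of the quorum disk identified in the previous example so that you can check the fstype of each partition:

# disklabel -r dsk10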

9.6.2.2    Looking for the Location of Member Boot Disks and Clusterwide AdvFS File Systems

To learn the locations of member boot disks and clusterwide AdvFS file systems, look for the file domain entries in the /etc/fdmns directory. You can use the ls command for this. For example:

# ls /etc/fdmns/*
 
/etc/fdmns/cluster_root:
dsk3c
 
/etc/fdmns/cluster_usr:
dsk5c
 
/etc/fdmns/cluster_var:
dsk6c
 
/etc/fdmns/projects1_data:
dsk9c
 
/etc/fdmns/projects2_data:
dsk11c
 
/etc/fdmns/projects_tools:
dsk12c
 
/etc/fdmns/root1_domain:
dsk4a
 
/etc/fdmns/root2_domain:
dsk8a
 
/etc/fdmns/root3_domain:
dsk2a
 
/etc/fdmns/root_domain:
dsk0a
 
/etc/fdmns/usr_domain:
dsk0g

This output from the ls command indicates the following:

9.6.2.3    Looking for Member Swap Areas

A member's primary swap area is always the b partition of the member boot disk. (For information about member boot disks, see Section 11.1.4.) However, a member might have additional swap areas. If a member is down, be careful not to use the member's swap area. To learn whether a disk has swap areas on it, use the disklabel -r command. Look in the fstype column in the output for partitions with fstype swap.

In the following example, partition b on dsk11 is a swap partition:

# disklabel -r dsk11
        .
        .
        .
8 partitions:
#         size     offset    fstype   [fsize bsize cpg] # NOTE: values not exact
 a:     262144          0     AdvFS                     # (Cyl.    0 - 165*)
 b:     401408     262144      swap                     # (Cyl.  165*- 418*)
 c:    4110480          0    unused        0     0      # (Cyl.    0 - 2594)
 d:    1148976     663552    unused        0     0      # (Cyl.  418*- 1144*)
 e:    1148976    1812528    unused        0     0      # (Cyl. 1144*- 1869*)
 f:    1148976    2961504    unused        0     0      # (Cyl. 1869*- 2594)
 g:    1433600     663552     AdvFS                     # (Cyl.  418*- 1323*)
 h:    2013328    2097152     AdvFS                     # (Cyl. 1323*- 2594)
 

9.6.3    Editing /etc/fstab

You can use the SysMan Station graphical user interface (GUI) to create and configure an AdvFS volume. However, if you use the command line, you need to edit /etc/fstab only once, and you can do so on any cluster member. The /etc/fstab file is not a CDSL; a single file is used by all cluster members.
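For example, an /etc/fstab entry for one of the project domains listed in Section 9.6.2.2 might look like the following; the fileset name data and the mount point /projects1 are illustrative assumptions:

projects1_data#data     /projects1      advfs rw 0 0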

9.7    Managing CDFS File Systems

In a cluster, a CD-ROM drive is always a served device. The drive must be connected to a local bus; it cannot be connected to a shared bus. The following are restrictions on managing a CD-ROM File System (CDFS) in a cluster:

To manage a CDFS file system, follow these steps:

  1. Enter the cfsmgr command to learn which member currently serves the CDFS:

    # cfsmgr
     
    

  2. Log in on the serving member.

  3. Use the appropriate commands to perform the management tasks, as in the sketch following these steps.
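For example, a typical management task on the serving member is mounting the CD-ROM. The following is a sketch only; the device name cdrom0c comes from the hwmgr output in Section 9.6.1, the mount point is an assumption, and the options that you need depend on the disc format (see mount(8)):

# mount -r -t cdfs /dev/disk/cdrom0c /mnt/cdrom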

For information about using library functions that manipulate the CDFS, see the TruCluster Server Cluster Highly Available Applications manual.

9.8    Backing Up and Restoring Files

Backing up and restoring user data in a cluster is similar to backing up and restoring data on a standalone system. You back up and restore CDSLs like any other symbolic links. To back up all the targets of CDSLs, back up the /cluster/members area.
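For example, a full backup of the clusterwide root, which contains the /cluster/members area, might look like the following sketch; the tape device name is an assumption, and vdump(8) describes the available options:

# vdump -0 -u -f /dev/tape/tape0 /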

Make sure that all restore software that you plan to use is available on the Tru64 UNIX disk of the system that was the initial cluster member. Treat this disk as the emergency repair disk for the cluster. If the cluster loses the root domain, cluster_root, you can boot the initial cluster member from the Tru64 UNIX disk and restore cluster_root.

The bttape utility is not supported in clusters.

9.8.1    Suggestions for Files to Back Up

You should regularly back up data files and the following file systems:

9.9    Managing Swap Space

Do not put swap entries in /etc/fstab. In Tru64 UNIX Version 5.0, the list of swap devices was moved from the /etc/fstab file to the /etc/sysconfigtab file. Additionally, you no longer use the /sbin/swapdefault file to indicate the swap allocation; use the /etc/sysconfigtab file for this purpose as well. The swap devices and swap allocation mode are automatically placed in the /etc/sysconfigtab file during installation of the base operating system. For more information, see the Tru64 UNIX System Administration manual and swapon(8).

Put each member's swap information in that member's sysconfigtab file. Do not put any swap information in the clusterwide /etc/fstab file.

Swap information in sysconfigtab is identified by the swapdevice attribute. The format for swap information is as follows:

swapdevice=disk_partition,disk_partition,...

For example:

swapdevice=/dev/disk/dsk1b,/dev/disk/dsk3b

Specifying swap entries in /etc/fstab does not work in a cluster because /etc/fstab is not member-specific; it is a clusterwide file. If swap were specified in /etc/fstab, the first member to boot and form a cluster would read and mount all the file systems in /etc/fstab. The other members would never see that swap space.

The file /etc/sysconfigtab is a context-dependent symbolic link (CDSL), so that each member can find and mount its specific swap partitions. The installation script automatically configures one swap device for each member, and puts a swapdevice= entry in that member's sysconfigtab file.

To add swap space, specify the new partition with swapon, and then put an entry in sysconfigtab so that the partition is available after a reboot. For example, to configure dsk3b as a secondary swap device for a member that already uses dsk1b for swap, enter the following command:

# swapon -s /dev/disk/dsk3b

Then, edit that member's /etc/sysconfigtab and add /dev/disk/dsk3b. The final entry in /etc/sysconfigtab will look like the following:

swapdevice=/dev/disk/dsk1b,/dev/disk/dsk3b

9.9.1    Locating Swap Device for Improved Performance

Locating a member's swap space on a device on a shared bus results in additional I/O traffic on the bus. To avoid this, you can place swap on a disk on the member's local bus.

The only downside to locating swap local to the member is the unlikely case where the member loses its path to the swap disk, as can happen when an adapter fails. In this situation, the member will fail. When the swap disk is on a shared bus, the member can still use its swap partition as long as at least one member still has a path to the disk.

9.10    Fixing Problems with Boot Parameters

If a cluster member fails to boot due to parameter problems in the member's root domain (rootN_domain), you can mount that domain on a running member and make the needed changes to the parameters. However, before booting the down member, you must unmount the newly updated member root domain from the running cluster member.
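For example, the following sketch repairs member 2's boot parameters from a running member. The domain name root2_domain comes from the /etc/fdmns listing in Section 9.6.2.2; the mount point is arbitrary, and the fileset name root is an assumption, so confirm it with showfsets before mounting:

# showfsets root2_domain
# mkdir /mnt_fixroot
# mount -t advfs root2_domain#root /mnt_fixroot

After making the needed changes to the parameters, unmount the domain before booting the down member:

# umount /mnt_fixroot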

Failure to do so can cause a crash and result in the display of the following message:

cfs_mountroot: CFS server already exists for node boot partition.

For more information, see Section 11.1.9.

9.11    Using the verify Utility in a Cluster

The verify utility examines the on-disk metadata structures of AdvFS file systems. Before using the utility, you must unmount all filesets in the file domain to be verified.

If the cluster member on which the verify utility is running fails, extraneous mounts may be left behind. This can happen because the verify utility creates temporary mounts of the filesets in the domain that is being verified. On a single system, these mounts go away if the system fails while running the utility; in a cluster, however, the mounts fail over to another cluster member. These failed-over mounts also prevent you from mounting the filesets until you remove them.

When verify runs, it creates a directory for each fileset in the domain and then mounts each fileset on the corresponding directory. Each directory is named as follows: /etc/fdmns/domain/set_verify_XXXXXX, where set is the fileset name and XXXXXX is a unique ID.

For example, if the domain name is dom2 and the filesets in dom2 are fset1, fset2, and fset3, listing the domain directory shows the following:

# ls -l /etc/fdmns/dom2
total 24
lrwxr-xr-x   1 root     system        15 Dec 31 13:55 dsk3a -> /dev/disk/dsk3a
lrwxr-x---   1 root     system        15 Dec 31 13:55 dsk3d -> /dev/disk/dsk3d
drwxr-xr-x   3 root     system      8192 Jan  7 10:36 fset1_verify_aacTxa
drwxr-xr-x   4 root     system      8192 Jan  7 10:36 fset2_verify_aacTxa
drwxr-xr-x   3 root     system      8192 Jan  7 10:36 fset3_verify_aacTxa
 

To clean up the failed-over mounts, follow these steps:

  1. Unmount the failed-over verify mounts:

    # umount /etc/fdmns/*/*_verify_*
    

  2. Delete the temporary verify mount-point directories with the following command:

    # rm -rf /etc/fdmns/*/*_verify_*
    

  3. Remount the filesets as you would after a normal completion of the verify utility.

For more information about verify, see verify(8).

9.11.1    Using the verify Utility on Cluster Root

The verify utility has been modified to allow it to run on active domains. Use the -a option to examine the cluster root file system, cluster_root.

You must execute the verify -a utility on the member that is serving the domain that you are examining. Use the cfsmgr command to determine which member serves the domain.

When verify runs with the -a option, it only examines the domain. No fixes can be done on the active domain. The -f and -d options cannot be used with the -a option.
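For example, the following sketch examines the clusterwide root; cluster_root is the domain name from the /etc/fdmns listing in Section 9.6.2.2, and you must first confirm with cfsmgr that you are logged in on the member that serves it:

# cfsmgr
# verify -a cluster_root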