From a configuration and administration point of view, perhaps the most important feature of TruCluster Server is the creation of a single, clusterwide namespace for files and directories. This namespace provides each cluster member with the same view of all file systems. In addition, there is a single copy of most configuration files. With few exceptions, the directory structure of a cluster is identical to that of a standalone system.
The clusterwide namespace is implemented by several new TruCluster Server technologies, including the Cluster File System (CFS) and the device request dispatcher, both of which are described in this chapter.
This chapter discusses the following topics:
Supported file systems (Section 2.1)
Cluster File System (CFS) (Section 2.2)
Device request dispatcher (Section 2.3)
CFS and device request dispatcher FAQ (Section 2.4)
Context-dependent symbolic links (CDSLs) (Section 2.5)
Device names (Section 2.6)
Worldwide ID (Section 2.7)
Clusters and the Logical Storage Manager (LSM) (Section 2.8)
2.1 Supported File Systems
To begin to understand how storage software works in a cluster, examine Figure 2-1, which shows a high-level view of storage software layering in a cluster. Note that the device request dispatcher controls all I/O to physical devices; all cluster I/O passes through this subsystem. Also note that CFS layers on top of existing file systems such as the Advanced File System (AdvFS).
Figure 2-1: Storage Software Layering in a Cluster
Table 2-1
summarizes supported file systems.
Table 2-1: File Systems Supported in a Cluster
| Type | How Supported | Failure Characteristics |
| Advanced File System (AdvFS) | Read/write | A file domain is served by a member selected on the basis of its connectivity to the storage containing the file system. Upon member failure, CFS selects a new server for the domain. Upon path failure, CFS uses an alternate device request dispatcher path to the storage. |
| CD-ROM File System (CDFS) | Read-only | A CD-ROM device is served for read-only access by the member that is directly connected to the device. Because TruCluster Server does not support CD-ROM devices on a shared bus, a CD-ROM device becomes inaccessible to the cluster when the member to which it is locally connected fails, even if it is being served by another member. The device becomes accessible again when the member that failed rejoins the cluster. |
| DVD-ROM File System (DVDFS) | Read-only | A DVD-ROM device is served for read-only access by the member that is directly connected to the device. Because TruCluster Server does not support DVD-ROM devices on a shared bus, a DVD-ROM device becomes inaccessible to the cluster when the member to which it is locally connected fails, even if it is being served by another member. The device becomes accessible again when the member that failed rejoins the cluster. |
| File-on-File Mounting (FFM) file system | Read/write (local use) | Can be mounted and accessed only on the local member. |
| Memory File System (MFS) | Read/write (local use) | A cluster member can mount an MFS file system read-only or read/write. The file system is accessible only by that member. There is no remote access; there is no failover. |
| Named pipes | Read/write (local use) | Reader and writer must be on the same member. |
| Network File System (NFS) server | Read/write | External clients use the default cluster alias, or an alias listed in /etc/exports.aliases, as the host name when mounting file systems NFS-exported by the cluster. File system failover and recovery is transparent to external NFS clients. |
| NFS client | Read/write | A cluster member can mount an NFS file system whose server is outside the cluster. If the cluster member fails, the file system is automatically unmounted. If the cluster uses automount or autofs, the file system is remounted automatically; otherwise, the file system must be remounted manually. |
| PC-NFS server | Read/write | PC clients use the default cluster alias, or an alias listed in /etc/exports.aliases, as the host name when mounting file systems NFS-exported by the cluster. File system failover and recovery is transparent to external NFS clients. |
| /proc file system | Read/write (local use) | Each cluster member has its own /proc file system, which is accessible only by that member. |
| UNIX File System (UFS) | Read-only (clusterwide); Read/write (local use) | A UFS file system explicitly mounted read-only is served for clusterwide read-only access by a member selected for its connectivity to the storage containing the file system. Upon member failure, CFS selects a new server for the file system. Upon path failure, CFS uses an alternate device request dispatcher path to the storage. Read/write support is identical to MFS read/write support: a cluster member can mount a UFS file system read/write, but the file system is then accessible only by that member. There is no remote access; there is no failover. |
If you know how to manage a Tru64 UNIX system, you already know how
to manage a TruCluster Server cluster because TruCluster Server extends
single-system management capabilities to clusters.
It provides a
clusterwide namespace for files and directories, including a single
root (/) file system that
all cluster members share.
In a like manner, it provides a clusterwide
namespace for storage devices; each storage device has the same unique
device name throughout the cluster.
The SysMan suite of graphical management utilities
provides an integrated view of the cluster environment, letting you
manage a single member or the entire cluster.
Figure 2-2 shows the SysMan Station hardware view for a cluster named deli with two members: provolone and polishham.
Figure 2-2: A Cluster's View of Hardware
TruCluster Server preserves the following availability and performance features of the TruCluster products provided for the Tru64 UNIX Version 4.0 series operating system:
Like the TruCluster Available Server Software and TruCluster Production Server products, TruCluster Server lets you deploy highly available services that can access their disk data from any member in the cluster.
Any application that can run on Tru64 UNIX can run as a highly available single-instance application in a cluster. The application is automatically relocated (failed over) to another cluster member in the event that a required resource, or the current member itself, becomes unavailable.
Like the TruCluster Production Server Software product, TruCluster Server lets you run components of distributed applications in parallel, providing high availability while taking advantage of cluster-specific synchronization mechanisms and performance optimizations.
2.2 Cluster File System (CFS)
The Cluster File System (CFS) makes all files visible to and accessible by all cluster members. Each cluster member has the same view; it does not matter whether a file is stored on a device that is connected to all cluster members or on one that is private to a single member. By maintaining cache coherency across cluster members, CFS guarantees that all members at all times have the same view of file systems mounted in the cluster.
From the perspective of the CFS, each file system or AdvFS domain is served to the entire cluster by a single cluster member. Any cluster member can serve file systems on devices anywhere in the cluster. File systems mounted at cluster boot time are served by the first cluster member to have access to them. This means that file systems on devices on a bus private to one cluster member are served by that member.
This client/server model means that a cluster member can be a client for some domains and a server for others. In addition, you can transition a member between the client/server roles. For example, if you enter the /usr/sbin/cfsmgr command without options, it returns the names of domains and file systems, where each is mounted, the name of the server of each, and the server status. You can use this information to relocate file systems to other CFS servers, which balances the load across the cluster.
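For example, cfsmgr output similar to the following (illustrative only; the exact fields are described in cfsmgr(8)) shows one served file system:
# /usr/sbin/cfsmgr
Domain or filesystem name = cluster_usr#usr
Mounted On = /usr
Server Name = provolone
Server Status : OK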
Because CFS preserves full X/Open and POSIX semantics for file-system access, file management interfaces and utilities work in the same way they do on a standalone system.
Figure 2-3 shows the relationship between file systems contained by disks on a shared bus and the resulting cluster directory structure. Each member boots from its own boot partition, but then mounts that file system at its boot_partition mount point in the clusterwide namespace. This figure is only an example to show how each cluster member has the same view of file systems in a cluster. Many physical configurations are possible, and a real cluster provides additional storage to mirror the critical root (/), /usr, and /var file systems.
Figure 2-3: CFS Makes File Systems Available to All Cluster Members
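For example, on a running cluster you can see each member's boot partition mounted under the clusterwide namespace with the mount command; the domain names and member numbers shown are illustrative assumptions:
# mount | grep boot_partition
root1_domain#root on /cluster/members/member1/boot_partition type advfs (rw)
root2_domain#root on /cluster/members/member2/boot_partition type advfs (rw)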
CFS provides several performance enhancements:
Direct I/O: When direct I/O is enabled for a file by opening the file with the O_DIRECTIO flag, read and write requests on it are executed to and from disk storage through direct memory access, bypassing AdvFS and CFS caching. This may improve I/O performance for database applications that do their own caching and file region synchronization. Remote CFS clients, as well as applications that are local to the CFS server, can read and write directly to files that are opened for direct I/O. Regardless of which member originates the I/O request, direct I/O to a file does not go through the cluster interconnect to the CFS server.
Direct-access cached reads: A performance enhancement for AdvFS file systems when reading files 64 KB or larger in size. Direct-access cached reads allow CFS to read directly from storage simultaneously from multiple cluster members. If the cluster member that issues the read request is directly connected to the storage containing the file system, direct-access cached reads access the storage directly and do not go through the cluster interconnect to the CFS server. This enhancement maintains the served file system model by having the server perform metadata and log updates, but offloads the cluster interconnect and the CFS server by performing file I/O directly to storage from CFS clients.
Any application that performs reads and writes to a file 64 KB or larger in size uses direct-access cached reads when reading from that file. For example, the following types of applications benefit from direct-access cached reads:
Multi-instance, read-mostly applications, such as Web servers and proxy servers, because they can perform simultaneous direct-access reads from multiple cluster nodes.
Backup applications because, regardless of which node the application runs on, the file system contents do not pass through the cluster interconnect.
Mounting file systems: If the member on which a mount command is issued does not have connectivity to the underlying storage, a member that does have connectivity is chosen as the CFS server for the file system.
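As a hedged illustration, suppose you mount a hypothetical AdvFS fileset from a member that has no direct path to the underlying disks; checking the mount point with cfsmgr should then show that a directly connected member was chosen as the server (output abbreviated and illustrative):
# mount apps_dmn#apps /apps
# /usr/sbin/cfsmgr /apps
Server Name = provolone
Server Status : OK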
For more information, see the
Cluster Administration
manual.
2.3 Device Request Dispatcher
In a TruCluster Server cluster, the device request dispatcher subsystem controls all I/O to physical devices. All cluster I/O passes through this subsystem, which enforces single-system open semantics so only one program can open a device at any one time. The device request dispatcher makes physical disk and tape storage available to all cluster members, regardless of where the storage is physically located in the cluster. It uses the new device-naming model to make device names consistent throughout the cluster. This provides great flexibility when configuring hardware. A member does not need to be directly attached to the bus on which a disk resides to access storage on that disk.
When necessary, the device request dispatcher uses a client/server model. While CFS serves file systems and AdvFS serves domains, the device request dispatcher serves devices, such as disks, tapes, and CD-ROM drives. However, unlike the client/server model of CFS in which each file system or AdvFS domain is served to the entire cluster by a single cluster member, the device request dispatcher supports the use of many simultaneous servers.
In the device request dispatcher model, devices in a cluster are either single-served or direct-access I/O devices. A single-served device, such as a tape device, supports access from only a single member: the server of that device. A direct-access I/O device supports simultaneous access from multiple cluster members. Direct-access I/O devices on a shared bus are served by all cluster members on that bus.
You can use the drdmgr command to look at the device request dispatcher's view of a device. In the following example, device dsk6 is on a shared bus and is served by three cluster members.
# drdmgr dsk6
View of Data from member polishham as of 2000-07-26:10:52:40
Device Name: dsk6
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 3
Server Name: polishham
Server State: Server
Server Name: pepicelli
Server State: Server
Server Name: provolone
Server State: Server
Access Member Name: polishham
Open Partition Mask: 0x4 < c >
Statistics for Client Member: polishham
Number of Read Operations: 737
Number of Write Operations: 643
Number of Bytes Read: 21176320
Number of Bytes Written: 6184960
The device request dispatcher supports clusterwide access to both character and block disk devices. You access a raw disk device partition in a TruCluster Server configuration in the same way you do on a Tru64 UNIX standalone system; that is, by using the device's special file name in the /dev/rdisk directory.
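For example, using the dsk6 device from the earlier drdmgr example, you can work with its raw partitions directly; the partition letter and transfer size here are arbitrary:
# ls -l /dev/rdisk/dsk6c
# dd if=/dev/rdisk/dsk6c of=/dev/null bs=64k count=16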
Note
Before TruCluster Server Version 5.0, cluster administrators had to define special Distributed Raw Disk (DRD) services to provide this level of physical access to storage. Starting with TruCluster Server Version 5.0, this access is built into the cluster architecture and is automatically available to all cluster members.
2.4 CFS and Device Request Dispatcher FAQ
This section answers frequently asked questions about CFS and the device request dispatcher in the following areas:
CFS, I/O, and the cluster interconnect (Section 2.4.1)
AdvFS requested block caching (Section 2.4.2)
The device request dispatcher and file opens (Section 2.4.3)
Relocating the CFS server (Section 2.4.4)
2.4.1 CFS, I/O, and the Cluster Interconnect
Question: On a shared bus with direct-access I/O disks, does I/O have to pass through the cluster interconnect?
Answer:
For raw I/O, any node that is directly connected to a device has direct access via the device request dispatcher to a raw partition on that device. (The drdmgr command lists nodes that are servers for a device.)
Block I/O to directly connected storage (that is, I/O to the block device special file rather than to a file system) goes through the CFS server for that device special file.
For generic file I/O writes, and reads of files less than 64 KB in size, the I/O passes through the CFS server for the file system. If the CFS client node is not the CFS server for the file system, the request is passed across the cluster interconnect to the node that is the CFS server for the file system, and then to the device request dispatcher on the CFS server node. The request never has to go to one node for the CFS server and then to another node for the device request dispatcher. (Asynchronous writes are written into memory and flushed to the server via write-behinds.)
For reads of files 64 KB or larger in size, CFS clients can read the files directly from storage using direct-access cached reads.
In addition, when a program opens a file with O_DIRECTIO, read and write requests are executed to and from disk storage through direct memory access, bypassing both AdvFS and CFS caching. Regardless of which member originates the I/O request, direct I/O to a file does not go across the cluster interconnect. Section 2.2 has more detail on direct-access cached reads and direct I/O. Also see open(2).
To summarize, I/O goes directly to storage in the following cases:
Raw I/O to directly connected storage
The CFS client is also the CFS server
File I/O to a file opened with O_DIRECTIO
Reads of files 64 KB or larger in size
2.4.2 AdvFS Requested Block Caching
Question: Are requested blocks of an AdvFS file system cached on the CFS client node?
Answer:
Yes.
CFS clients
cache data and do write-behinds.
2.4.3 The Device Request Dispatcher and File Opens
Question: When a program opens a file, at what point does the device request dispatcher become involved?
Answer:
The open() is CFS only; read() and write() involve CFS and the device request dispatcher. The device request dispatcher becomes involved on a read() when the cache that CFS is reading from needs filling, and on a write() when the cache that CFS is writing to needs emptying.
2.4.4 Relocating the CFS Server
Question: When does it make sense to relocate the CFS server?
Answer:
Look at output from the cfsmgr command to determine which members handle the most I/O. In general, the goal is to avoid having one node serve all file systems. (CFS uses a lot of memory; you can see a slowdown when all file systems are served by the same member.) The simplest approach is to monitor I/O for a while, decide which members should be CFS servers for which file systems, and then write some simple boot scripts (for example, in /sbin/init.d/) that automatically relocate file systems to the correct member.
For example, consider a two-member cluster (M1 and M2) and six file systems (A, B, C, D, E, F). After watching I/O, you decide that M1 should serve A, D, and E; and M2 should serve B, C, and F. You write a boot-time script that has M1 relocate A, D, and E to itself, and has M2 relocate B, C, and F to itself.
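A minimal sketch of such a boot-time script for member M1 follows. The cfsmgr -a server= relocation syntax and the mount points /A, /D, and /E are assumptions for illustration; see cfsmgr(8) for the exact attribute names, and substitute your own members and file systems.
#!/sbin/sh
# Hypothetical /sbin/init.d/cfs_balance script run at boot on member M1
case "$1" in
start)
    # Relocate the file systems this member should serve (example names)
    /usr/sbin/cfsmgr -a server=M1 /A
    /usr/sbin/cfsmgr -a server=M1 /D
    /usr/sbin/cfsmgr -a server=M1 /E
    ;;
esac
Member M2 would run a similar script that relocates /B, /C, and /F to itself.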
When balancing I/O among cluster members, balance at the CFS level rather than at the device request dispatcher level. In other words, use cfsmgr rather than drdmgr to balance I/O among cluster members.
2.5 Context-Dependent Symbolic Links
Although the single namespace greatly simplifies system management, some configuration files and directories should not be shared by all cluster members. For example, a member's /etc/sysconfigtab file contains information about that system's kernel component configuration, and only that system should use that configuration. Consequently, the cluster must employ a mechanism that lets each member read and write the file named /etc/sysconfigtab, while actually reading and writing its own member-specific sysconfigtab file.
Tru64 UNIX Version 5.0 introduced a special form of symbolic link called a context-dependent symbolic link (CDSL), which TruCluster Server uses to create the clusterwide namespace. CDSLs allow a file or directory to be accessed by a single name, regardless of whether the name represents a clusterwide file or directory or a member-specific file or directory. CDSLs keep traditional naming conventions while providing a behind-the-scenes mechanism that makes sure each member reads and writes its own copy of member-specific system configuration files.
CDSLs contain a variable whose value is determined only during pathname resolution. The {memb} variable is used to access member-specific files in a cluster. The following example shows the CDSL for /etc/rc.config:
/etc/rc.config -> ../cluster/members/{memb}/etc/rc.config
When resolving a CDSL pathname, the kernel replaces the {memb} variable with the string membern, where n is the member ID of the current member. Therefore, on a cluster member whose member ID is 2, the pathname /cluster/members/{memb}/etc/rc.config resolves to /cluster/members/member2/etc/rc.config.
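You can display a CDSL and its literal target with ls -l; the ownership and size fields below are placeholders:
# ls -l /etc/rc.config
lrwxrwxrwx   1 root   system   ...   /etc/rc.config -> ../cluster/members/{memb}/etc/rc.config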
Figure 2-4 shows the relationship between {memb} and CDSL pathname resolution. CDSLs are useful when running multiple instances of an application on different cluster members when each member operates on a different set of data. The Cluster Highly Available Applications manual describes how applications can use CDSLs to maintain member-specific data sets and log files.
Figure 2-4: CDSL Pathname Resolution
As a general rule, before you move a file or directory, make sure that the destination is not a CDSL. Moving files to CDSLs requires special care on your part to ensure that the member-specific files are maintained. For example, consider the file /etc/rc.config, as shown in the following example:
/etc/rc.config -> ../cluster/members/{memb}/etc/rc.config
If you move a file to /etc/rc.config, you replace the symbolic link with the actual file; /etc/rc.config will no longer be a symbolic link to /cluster/members/{memb}/etc/rc.config.
The mkcdsl command lets system administrators create CDSLs and update a CDSL inventory file. The cdslinvchk command verifies the current CDSL inventory.
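A hedged sketch of their use follows; /etc/myapp.conf is a hypothetical file, and the -c option is assumed here (see mkcdsl(8)) to copy the existing file into the member-specific area before replacing it with a CDSL:
# mkcdsl -c /etc/myapp.conf
# ls -l /etc/myapp.conf
... /etc/myapp.conf -> ../cluster/members/{memb}/etc/myapp.conf
# cdslinvchk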
For more information on these commands, see mkcdsl(8) and cdslinvchk(8). For more information about CDSLs, see the Tru64 UNIX System Administration manual, hier(5), ln(1), and symlink(2).
2.6 Device Names
This section provides an introduction to the device-naming model introduced in Tru64 UNIX Version 5.0. For a detailed discussion of this device-naming model, see the Tru64 UNIX System Administration manual.
Device names are consistent clusterwide:
They are persistent beyond boot.
A device name stays with the device even when you move a disk or tape to a new location in the cluster.
Prior to the release of Tru64 UNIX Version 5.0, disk device names encoded the I/O path for the disk. This path incorporated many pieces of data, and minimally included the following pieces of information: the device driver used to access the controller to which the disk is connected, the instance of the controller within the system that the driver manages, and a per-controller device unit ID.
For example, the rz device driver was used to access both SCSI and ATAPI/IDE device controllers. Disks connected to these controllers had names of the form rzn, where n identified both the controller to which the disk was connected and the unit ID. For example, a disk with SCSI ID=3 on the second SCSI/ATAPI/IDE controller was known as rz11. If that disk was moved to the third controller, it was accessed as rz19.
Tru64 UNIX Version 5.0 introduced a new device-naming model in which the device name simply consists of a descriptive name for the device and an instance number. These two elements form the base name of the device, such as dsk0. Note that the instance number in a device's new name does not correlate to the unit number in its old name: the operating system assigns the instance numbers in sequential order, beginning with 0 (zero), as it discovers devices.
Additionally, most modern disks have IDs that can be used to uniquely identify the disk. For disks that support this feature, Tru64 UNIX Version 5.0 keeps track of this ID and uses it to build and maintain a table that maps disks to device names. As a result, moving one of these disks from one physical connection to another does not change the device name for the disk. This gives the system administrator greater flexibility when configuring disks in the system.
In a TruCluster environment, the flexibility provided by the new device naming model is particularly useful because each disk within the cluster has a unique name.
Note
Although Tru64 UNIX supports old-style device names as a compatibility option, TruCluster Server supports only new-style device names. Applications that depend on old-style device names (or the structure of /dev) must be modified to use the new device-naming model.
Table 2-2 lists some examples of new device names.
Table 2-2: Examples of New Device Names
| Old Name | New Name | Description |
| /dev/rz4c | /dev/disk/dsk4c | The c partition of the fifth disk recognized by the operating system. |
| /dev/rz19c | /dev/disk/dsk5c | The c partition of the sixth disk recognized by the operating system. |
The suffix assigned to the device name special files differs depending on the type of device, as follows:
Disks: In general, disk device file names consist of the base name and a one-letter suffix from a through z; for example, /dev/disk/dsk0a. Disks use a through h to identify partitions. By default, floppy disk and CD-ROM devices use only the letters a and c; for example, floppy0a and cdrom1c. For raw device names, the same device names are in the directory /dev/rdisk. (An example listing of disk and tape device special files follows the description of tape device names.)
Tapes: These device file names have the base name and a suffix composed of the characters _d followed by a single digit; for example, tape0_d0. This suffix indicates the density of the tape device, according to the entry for the device in the /etc/ddr.dbase file; for example:
| Device | Density |
| tape0 | default density |
| tape0c | default density with compression |
| tape0_d0 | density associated with entry 0 in /etc/ddr.dbase |
| tape0_d1 | density associated with entry 1 in /etc/ddr.dbase |
With the new device special file naming for tapes, the old name suffix maps directly to the new name suffix, as follows:
| Old Suffix | New Suffix |
| l (low) | _d0 |
| m (medium) | _d2 |
| h (high) | _d1 |
| a (alternative) | _d3 |
There are two sets of device names for tapes; both conform to the new naming convention: the /dev/tape directory for rewind devices and the /dev/ntape directory for no-rewind devices. To determine which device special file to use, look in the /etc/ddr.dbase file.
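The following illustrative listing shows the kinds of names these conventions produce; the devices present, and their partitions and densities, depend on your configuration:
# ls /dev/disk/dsk0* /dev/rdisk/dsk0a
/dev/disk/dsk0a  /dev/disk/dsk0b  /dev/disk/dsk0c  /dev/disk/dsk0g  /dev/rdisk/dsk0a
# ls /dev/tape /dev/ntape
/dev/tape:
tape0  tape0c  tape0_d0  tape0_d1
/dev/ntape:
tape0  tape0c  tape0_d0  tape0_d1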
Tru64 UNIX provides utilities to identify device names. For example, the following hwmgr commands display device and device hierarchy information in a cluster:
# hwmgr -view devices -cluster
# hwmgr -view hierarchy -cluster
You can use hwmgr to list a member's hardware configuration and correlate bus-target-LUN names with /dev/disk/dskn names. For more information on the hwmgr command, see hwmgr(8).
Note
The Logical Storage Manager (LSM) naming conventions have not changed.
2.7 Worldwide ID
Tru64 UNIX associates the new device name with the worldwide ID (WWID) of a disk. A disk's WWID is unique; it is set by the manufacturer for devices that support WWIDs. No two disks can have the same WWID. Using the WWID to identify a disk has two implications. After a disk is recognized by the operating system, the disk's /dev/disk/dsk name stays the same even if its SCSI address changes.
This ability to recognize a disk lets Tru64 UNIX support multipathing to a disk where the disk is accessible through different SCSI adapters. If disks are moved within a TruCluster Server environment, their device names and how users access them remain the same.
Note
The names of disks behind RAID array controllers are associated with both the WWID of their controller module and their own bus, target, and LUN position. In this case, moving a disk changes its device name. However, you can use the hwmgr utility to reassociate such a disk with its previous device name.
The following hwmgr command displays the WWIDs for a cluster:
# hwmgr -get attr -a name -cluster
2.8 Clusters and the Logical Storage Manager
The Logical Storage Manager (LSM) provides shared access to all LSM volumes from any cluster member. LSM consists of physical disk devices, logical entities, and the mappings that connect both. LSM builds virtual disks, called volumes, on top of UNIX physical disks. LSM transparently places a volume between a physical disk and an application, which then operates on the volume rather than on the physical disk. For example, you can create a file system on an LSM volume rather than on a physical disk.
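For example, the following hedged sketch creates a 2 GB LSM volume in the rootdg disk group and then builds an AdvFS domain and fileset on it; the volume, domain, fileset, and mount-point names are hypothetical, and the exact options are described in volassist(8), mkfdmn(8), and mkfset(8):
# volassist -g rootdg make datavol 2g
# mkfdmn /dev/vol/rootdg/datavol data_dmn
# mkfset data_dmn data_fset
# mount data_dmn#data_fset /data
Because the storage is an LSM volume, the resulting fileset is available clusterwide like any other CFS-served file system.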
As previously shown in Figure 2-1, LSM is layered on top of the device request dispatcher. Using LSM in a cluster is like using LSM in a single system. The same LSM software subsets are used for both clusters and noncluster configurations, and you can make configuration changes from any cluster member. LSM keeps the configuration state consistent clusterwide.
The following list outlines LSM support for basic clusterwide file systems:
Supported: root (/), /usr, and /var file systems; and member swap partitions.
Not supported: quorum disk and member boot disks.
See the Cluster Administration manual for configuration and usage issues that are specific to LSM in a TruCluster Server environment.