4 File System

4.1 Overview

Digital UNIX Version 4.0 supports the following file systems which are accessed through the OSF/1 Version 1.0 Virtual File System (VFS):

UNIX File System (UFS)
Network File System (NFS)
CD-ROM File System (CDFS)
Memory File System (MFS)
/proc File system (PROCFS)
File-on-File Mounting File System (FFM)
File Descriptor File System (FDFS)
POLYCENTER Advanced File System (AdvFS)

Note that all of the file systems are integrated with the Virtual Memory Unified Buffer Cache (UBC).

In addition, Digital UNIX Version 4.0 supports the Logical Storage Manager (LSM) and the Prestoserve file system accelerator.

Note that the Logical Volume Manager is being retired in this release.

The following sections briefly discuss VFS, the file systems supported in Digital UNIX Version 4.0, the Logical Storage Manager, and the Prestoserve file system accelerator.

4.2 Virtual File System

The Virtual File System (VFS), which is based on the Berkeley 4.3 Reno Virtual File System, provides a uniform interface abstracted from the file system layer which allows common access to files, regardless of the file system on which the files reside. A structure known as a vnode (analogous to an inode) contains information about each file in a mounted file system and is more or less a wrapper around file system-specific nodes. If, for example, a read or write is requested on a file, the vnode points the system call to the system call appropriate for that file system (a read request is pointed to a ufs_read if the request is made on a file in a UFS file system or to an nfs_read if the request is made on a file in an NFS-mounted file system). As a result, file access across different file systems is transparent to the user.

Digital's VFS implementation also supports Extended File Attributes (XFAs). Although originally intended to provide support for system security (Access Control Lists) and the Pathworks PC server (so that a Pathworks PC server could assign PC-specific attributes to a file, such as icon color, the startup size of the application, its backup date, and so forth), the XFA implementation was expanded to provide support for any application that wants to assign an XFA to a file. Currently, both UFS and AdvFS support XFAs, as well as the pax backup utility which has a tar and cpio front-end. XFAs are also supported for remote UFS file systems, to a server which supports a special protocol which currently only Digital supports. For more information on XFAs, see setproplist(2). For more information on pax, see pax(1).

Information for File System Developers

In Digital UNIX Version 4.0, the VOP_READDIR kernel vnode operation interface has been changed to accommodate a new structure, kdirent, in addition to the existing dirent structure.

The new kdirent structure was developed to make file systems other than UFS work properly over NFS.

Note, however, that if you implement a file system under Digital UNIX, you do not need to make any changes to your VOP_READDIR interface routine for Digital UNIX Version 4.0, and applications see the same interface as before the addition of the new kdirent structure.

Unlike the dirent structure, the kdirent structure has a kd_off field that subordinate file systems can set to point to the on-disk offset of the next directory entry. Arrays of struct kdirent must be padded to 8-byte boundaries, using the KDIRSIZE macro, so that the off_t is properly aligned; arrays of struct dirent are only padded to 4 bytes.

Each mounted file system has the option of setting the M_NEWRDDIR flag in the mount structure m_flag field. If the M_NEWRDDIR flag is set, then the routine calling VOP_READDER expects the readdir on that vnode to return an array of struct kdirent; if the M_NEWRDDIR flag is clear (the default), then the the readdir on that vnode returns an array of struct dirent.

In terms of NFS, if the M_NEWRDDIR flag is not set, then the NFS server uses the dirent structures and then calculates the necessary offset to pass back to the server. Thus, to ensure proper operation over NFS, any file system that does not have the M_NEWRDDIR flag set must be prepared to have VOP_READDIR called with offsets based on a packed array of struct dirent, which may be in conflict with the offsets on the on-disk directory structure. However, if the M_NEWRDDIR flag is set, then the NFS server uses the kd_off fields of the kdirent structures to generate the necessary offsets to pass back to the server.

A new vnode operation VOP_PATHCONF was added to the kernel in order to return filesystem-specific information for the fpathconf() and pathconf() system calls. This vnode operation takes as arguments the pointer to struct vnode, the pathconf name int, return value pointer to long and error int. It also sets the return value and ERRNO. Note that each filesystem must implement the vnode operation by providing a function in the vnodeops structure after the vn_delproplist component (at the end of the structure). This function takes as arguments the pointer to vnode, the pathconf name, and the return value pointer to long. The function sets the return value and returns zero for succes or an error number.

4.3 UNIX File System

The UNIX File System (UFS) is compatible with the Berkeley 4.3 Tahoe release. UFS allows a pathname component to be 255 bytes, with the fully qualified pathname length restriction of 1023 bytes. The Digital UNIX Version 4.0 implementation of UFS supports file sizes which exceed 2 GBs.

Digital added support for file block clustering which provides sequential read and write access that is equivalent to the raw device speed of the disk and up to a 300% performance increase over previous releases of the operating system; file-on-file mounting (FFM) for STREAMS; and integrated UFS with the Unified Buffer Cache. UFS also supports Extended File Attributes (XFAs). For more information on XFAs, see Section 4.2.

4.4 Network File System

The Network File System (NFS) is a facility for sharing files in a heterogeneous environment of processors, operating systems, and networks, by mounting a remote file system or directory on a local system and then reading or writing the files as though they were local.

Digital UNIX Version 4.0 supports NFS Version 3, in addition to NFS Version 2. NFS Version 2 code is based on ONC Version 4.2, which Digital licensed from Sun Microsystems. The NFS Version 3 code supersedes ONC Version 4.2, although at the time that NFS Version 3 was ported to Digital UNIX, Sun Microsystems had not yet released a newer, public version of ONC with NFS Version 3 support.

4.4.1 NFS Version 3 Functionality

NFS Version 3 supports all the features of NFS Version 2 as well as the following:

64-bit remote access
Allows users to access files larger than 2 GBs over NFS
Improved performance
- Support for reliable asynchronous writes which improves write performance over NFS Version 2 by a factor of seven, thereby reducing client response latency and server I/O loading
- Support for a READDIRPLUS procedure that returns file handles and attributes with directory names to eliminate LOOKUP calls when scanning a directory
- Support for servers to return metadata on all operations to reduce the number of subsequent GETATTR procedure calls
- Support for weak cache consistency data to allow a client to manage its caches more effectively
Improved security
- Provides an ACCESS procedure that fixes the problems in NFS Version 2 with superuser permission mapping, and allows access checks at file-open time, so that the server can better support Access Control Lists (ACLs)
- File names and pathnames specified as strings of variable length, with the maximum length negotiated between the client and server using the PATHCONF procedure
Guaranteed exclusive creation of files

Since Digital UNIX supports both NFS Version 3 and Version 2, the NFS client and server bind at mount time using the highest NFS version number they both support. For example, a Digital UNIX Version 4.0 client will use NFS Version 3 when it is served by a Digital UNIX Version 4.0 NFS server; however, when it is served by an NFS server running an earlier version of Digital UNIX, the Digital UNIX Version 4.0 NFS client will use NFS Version 2.

For more detailed information on NFS Version 3, see the paper NFS Version 3: Design and Implementation (USENIX, 1994).

4.4.2 Digital Enhancements to NFS

In addition to the NFS Version 3.0 functionality, Digital UNIX supports the following Digital enhancements to NFS:

NFS over TCP
NFS has been traditionally run over the UDP protocol. Digital Unix V4.0 now supports NFS over the TCP protocol. See mount(8) for additional details.
Write-gathering
On an NFS server, multiple write requests to the same file are combined to reduce the number of actual writes as much as possible. The data portions of successive writes are cached and a single metadata update is done that applies to all the writes. Replies are not sent to the client until all data and associated metadata are written to disk to ensure that write-gathering does not violate the NFS crash recovery design.
As a result, write-gathering increases write throughput by up to 100 % and the CPU overhead associated with writes is substantially reduced, thereby further increasing server capacity.
NFS-locking
Using the fcntl system call to control access to file regions, NFS-locking allows you to place locks on file records over NFS, thereby protecting, among other things, segments of a shared, NFS-served database. The status daemon, rpc.statd, monitors the NFS-servers and maintains the NFS lock if the server goes down. When the NFS server comes back up, a reclaiming process allows the lock to be reattached.
Automounting
The automount daemon automatically and transparently mounts and unmounts NFS file systems on an as-needed basis. It provides an alternative to using the /etc/fstab file for NFS mounting file systems on client machines.
The automount daemon can be started from the /etc/rc.config file or from the command line. Once started, it sleeps until a user attempts to access a directory that is associated with an automount map or any directory or file in the directory structure. The daemon awakes and consults the appropriate map and mounts the NFS file system. After a specified period of inactivity on a file system, 5 minutes by default, the automount daemon unmounts that file system.
The maps indicate where to find the file system to be mounted and the mount options to use. An individual automount map is either local or served by NIS. A system, however, can use both local and NIS automount maps.
Automounting NFS-mounted file systems provides the following advantages over static mounts:
- If NIS maps are used and file systems are moved to other servers, users do not need to do anything to access the moved files. Every time the file systems need to be mounted, the daemon will mount them from the correct locations.
- In the case of read-only files, if more than one NFS-server is serving a given file system, automount will connect you to the fastest server that responds. If at least one of the servers is available, the mount will not hang.
- By unmounting NFS-mounted file systems that have not been accessed for more than a certain interval (5 minutes by default), the automount daemon conserves system resources, particularly memory.
PC-NFS
PC-NFS, a product for PC clients available from Sun Microsystems, allows personal computers running DOS to access NFS servers as well as providing a variety of other functionality.
Digital supports the PC-NFS server daemon, pcnfsd, which allows PC clients with PC-NFS configured to do the following:
- Mount NFS file systems
  The PC-NFS pcnfsd daemon, in compliance with Versions 1.0 and 2.0 of the pcnfsd protocol, assigns UIDs and GIDs to PC clients so that they can talk to NFS.
  The pcnfsd daemon performs UNIX login-like password and username verification on the server for the PC client. If the authentication succeeds, the pcnfsd daemon then grants the PC client the same permissions accorded to that username. The PC client can mount NFS file systems by talking to the mountd daemon as long as the NFS file systems are exported to the PC client in the /etc/exports file on the server. Since there is no mechanism in DOS to perform file permission checking, the PC client calls the authentication server to perform checking of the user's credentials against the file's attributes. This happens when the PC client makes NFS requests to the server for file-access that requires permission checking, such as opening of a file.
- Access network printers
  The pcnfsd daemon authenticates the PC client and then spools and prints the file on behalf of the client.

4.5 CD-ROM File System

Digital UNIX Version 4.0 supports the ISO-9660 CDFS standard for data interchange between multiple vendors; High Sierra Group standard for backward compatibility with earlier CD-ROM formats; and an implementation of the Rock Ridge Interchange Protocol (RRIP), Version 1.0, Revision 1.09. The RRIP extends ISO-9660 using the system use areas defined by ISO-9660 to provide mixed-case and long filenames; symbolic links; device nodes; deep directory structures (deeper than ISO-9660 allows); UIDs, GIDs, and permissions on files; and POSIX time stamps.

This code was taken from the public domain and enhanced by Digital.

In addition, Digital UNIX Version 4.0 also supports X/Open Preliminary Specification (1991) CD-ROM Support Component (XCDR). XCDR allows users to examine selected ISO-9660 attributes through defined utilities and shared libraries, and allows system administrators to substitute different file protections, owners, and file names for the default CD-ROM files.

4.6 Memory File System

Digital UNIX Version 4.0 supports a Memory File System (MFS) which is essentially a UNIX File System that resides in memory. No permanent file structures or data are written to disk, so the contents of an MFS file system are lost on reboots, unmounts, or power failures. Since it does not write data to disk, the MFS is a very fast file system and is quite useful for storing temporary files or read-only files that are loaded into it after it is created.

For example, if you are performing a software build which would have to be restarted if it failed, the MFS is a very appropriate choice to use for storing the temporary files that are created during the build, since by virtue of its speed it would reduce the build time. For more information, see the newfs(8) reference page.

4.7 /proc File System

The /proc file system enables running processes to be accessed and manipulated as files by the system calls open, close, read, write, lseek, and ioctl. While the /proc file system is most useful for debuggers, it enables any process with the correct permissions to control another running process. Thus, a parent/child relationship does not have to exist between a debugger and the process being debugged. The dbx debugger that ships in Digital UNIX Version 4.0 supports attaching to running processes through /proc. For more information, see the proc(4) and dbx(1) reference pages.

4.8 File-on-File Mounting File System

The File-on-File Mounting (FFM) file system allows regular, character, or block-special files to be mounted over regular files, and, for the most part, is only used by the SVR4-compatible system calls fattach and fdetach of a STREAMS-based pipe (or FIFO). With FFM, a FIFO, which normally has no file system object associated with it, is given a name in the file system space. As a result, a process that is unrelated to the process that created the FIFO can then access the FIFO.

In addition to programs using FFM through the fattach system call, users can mount one regular file on top of another using the mount command. Mounting a file on top of another file does not destroy the contents of the covered file; it simply associates the name of the covered file with the mounted file, making the contents of the covered file temporarily unavailable. The covered file can be accessed after the file mounted on top of it is unmounted, either by a reboot or by a call to fdetach, or by entering the umount command. Note that the contents of the covered file are still available to any process which had the file open at the time of the call to fattach or when a user issued a mount command that covered the file.

4.9 File Descriptor File System

The File Descriptor File System (FDFS) allows applications to reference a process's open file descriptors (0, 1, 2, 3, and so forth) as if they were files in the UNIX File System (for example, /dev/fd/0, /dev/fd/1, /dev/fd/2) by aliasing a process's open file descriptors to file objects. When the FDFS is mounted, opening or creating a file descriptor file has the same effect as calling the dup(2) system call.

The FDFS allows applications that were not written with support for UNIX I/O to avail themselves of pipes, named pipes, and I/O redirection.

The FDFS is not mounted by default and must either be mounted by hand or by an entry placed in the /etc/fstab file.

For more information on the FDFS, see the fd(4) reference page.

4.10 POLYCENTER Advanced File System

The POLYCENTER Advanced File System (AdvFS), which consists of a file system that ships with the base system and a set of file system utilities that are available as a separate, layered product, is a log-based (journaled) file system that is especially valuable on systems with large amounts of storage. Because it maintains a log of active file-system transactions, AdvFS avoids lengthy file system checks on reboot and can therefore recover from a system failure in seconds. AdvFS ensures that log records are written to disk before data records, ensuring that file domains (file systems) are recovered to a consistent state. AdvFS uses extent-based allocation for optimal performance.

To users and applications, AdvFS looks like any other UNIX file system. It is compliant with POSIX and SPEC 1170 file-system specifications. AdvFS file domains and other Digital UNIX file systems, like UFS, can exist on the same system and are integrated with the Virtual File System (VFS) and the Unified Buffer Cache (UBC). AdvFS file domains can also be remote-mounted with NFS and support extended file attributes (XFAs). For more information on XFAs, see Section 4.2.

In addition to providing rapid restart and increased file-system integrity, AdvFS supports files and file systems much larger than 2 GBs and, by separating the file system directory layer from the logical storage layer, provides increased file-system flexibility and manageability.

In addition to the Advanced File System that ships as part of the base operating system, the POLYCENTER Advanced File System Utilities are available as a layered product. The AdvFS Utilities enable a system administrator to create multivolume file domains, add and remove volumes online, clone filesets for online backup, unfragment and balance file domains online, stripe individual files, and establish trashcans so that users can restore their deleted files. The AdvFS Utilities also provide a Graphical User Interface for configuring and managing AdvFS file domains. The AdvFS Utilities require a separate license Product Authorization Key (PAK). Contact your Digital representative for additional information on the AdvFS Utilities product. For more information on AdvFS, see the System Administration guide and the POLYCENTER Advanced File System Utilities Technical Summary.

4.11 Logical Storage Manager

Digital UNIX Version 4.0 supports the Logical Storage Manager (LSM), a more robust logical storage manager than Logical Volume Manager (LVM), which it has replaced. LSM supports all of the following:

Disk spanning
Disk spanning allows you to concatenate entire disks or parts (regions) of multiple disks together to use as one, logical volume. So, for example, you could "combine" two RZ26s and have them contain the /usr file system.
Mirroring
Mirroring allows you to write simultaneously to two or more disk drives to protect against data loss in the event of disk failure.
Striping
Striping improves performance by breaking data into segments that are written to several different physical disks in a "stripe set."
Comprehensive disk management capabilities
LSM supports disk management utilities that, among other things, change the disk configuration without disrupting users while the system is up and running.

Mirroring, striping and the graphical interface require a separate license PAK. The LSM code came from VERITAS (the VERITAS Volume Manager) and was enhanced by Digital.

For each logical volume defined in the system, the LSM volume device driver maps logical volume I/O to physical disk I/O. In addition, LSM uses a user-level volume configuration daemon (vold) that controls changes to the configuration of logical volumes. Users can administer LSM either through a series of command-line utilities or by availing themselves of an intuitive Motif-based graphical interface.

To ensure a smooth migration from LVM to LSM, Digital has developed a migration utility that maps existing LVM volumes into nonstriped, nonmirrored LSM volumes that preserves all of the LVM data. After the migration is complete, administrators can mirror the volumes if they so desire.

Similarly, to help users transform their existing UFS or AdvFS partitions into LSM logical volumes, Digital has developed a utility that will transform each partition in use by UFS or AdvFS into a nonstriped, nonmirrored LSM volume. After the transformation is complete, administrators can mirror the volumes if they so desire.

Note that LSM volumes can be used in conjunction with AdvFS, as part of an AdvFS domain; with RAID disks; and with the Available Server Environment (ASE), since LSM supports logical volume failover. For more information on LSM, see the Logical Storage Manager.

4.12 Overlap Partition Checking

The enhancements related to Overlap Partition Checking are described next.

4.12.1 Partition Overlap Checks Added to Utilities

Partion overlap checks were added to a number of commands in Digital UNIX Version 4.0. Some of the commands which use these checks are: newfs, fsck, mount, mkfdnm, swapon, voldisksetup, and voldisk. The enhanced checks require a disk label to be installed on the disk. Refer to the disklabel(8) reference page for further information.

The checks ensure that if a partition or an overlapping partition is already in use (for example, mounted or used as a swap device), the partition will not be overwritten. Additionally, the checks ensure that partitions will not be overwritten if the specific partition or an overlapping partition is marked in use in the fstype field on the disk label.

If a partition or an overlapping partition has an in-use fstype field in the disklabel, some commands inquire interactively if a partition can be overwritten.

4.12.2 Library Functions for Partition Overlap Checking

Two new functions, check_usage(3) and set_usage(3) are available for use by applications. These functions check whether a disk partition is marked for use and set the fstype of the partition in the disk label. See the appropriate reference pages for these functions for more information.

4.13 Prestoserve File System Accelerator

The Prestoserve file system accelerator is a hardware option that speeds up synchronous disk writes, including NFS server access, by reducing the amount of disk I/O. Frequently-written data blocks are cached in nonvolatile memory and then written to disk asynchronously.

The software required to drive the board ships as an optional subset in Digital UNIX Version 4.0 and once it is installed can be accessed with a PAK that comes with the board.

Prestoserve uses a write cache for synchronous disk I/O. Prestoserve works in a way that is similar to the way the system buffer cache speeds up asynchronous disk I/O requests. Prestoserve is interposed between the operating system and the device drivers for the disks on a server. Mounted file systems and unmounted block devices selected by the administrator are accelerated.

When a synchronous write request is issued to a disk with accelerated file systems or block devices, it is intercepted by the Prestoserve pseudodevice driver, which stores the data in nonvolatile memory instead of on the disk. Thus, synchronous writes occur at memory speeds, not at disk speeds.

As the nonvolatile memory in the Prestoserve cache fills up, it asynchronously flushes the cached data to the disk in portions that are large enough to allow the disk drivers to optimize the order of the writes. A modified form of Least Recently Used (LRU) replacement is used to determine the order. Reads that hit (match blocks) in the Prestoserve cache also benefit.

Nonvolatile memory is required because data must not be lost if the power fails or if the system crashes. As a result, the hardware board contains a battery that protects data in case the system crashes. From the point of view of the operating system, Prestoserve appears to be a very fast disk.

Note that there is a substantial performance gain when Prestoserve is used on an NFSV2 server.

The dxpresto command allows you to monitor Prestoserve activity and to enable or disable Prestoserve on machines that allow that operation. For more information on Prestoserve see the Guide to Prestoserve and the dxpresto(8X) reference page.