13 Administering Events and Errors

This chapter provides information on the following topics:

Event logging is a way to record informational and error messages that the system generates. You use the event logs to solve system problems or verify system operations, and you can configure event logging to select the events that interest you. Understanding and configuring the event-logging facilities is described in Section 13.1 and Section 13.2.

Recovering the event logs after a system crash is described in Section 13.3.

System log files require disk space and may periodically require removal and archiving to save space. Maintenance of log files is described in Section 13.4.

When a system or program halts abnormally, a crash dump file or a core file may be created. Options for storing and naming core files are described in Section 13.5, and options for configuring the crash dump facility are described in Section 13.6.

Certain systems allow you to monitor the status of the system hardware, such as temperature and power status. Environmental Monitoring is described in Section 13.7.

A related topic is use of the system exerciser tools, which are described in Appendix F.

13.1 Understanding the Event-Logging Facilities

The Tru64 UNIX operating system uses two mechanisms to log system events:

The system event-logging facility

The binary event-logging facility

The log files that the system and binary event-logging facilities create have the default protection of 640, are owned by root, and belong to the system group. You must have the proper authority to examine the files.

The following sections describe the event-logging facilities.

13.1.1 System Event Logging

The primary systemwide event-logging facility uses the syslog function to log events in ASCII format. The syslog function uses the syslogd daemon to collect the messages that are logged by the various kernel, command, utility, and application programs. The syslogd daemon logs the messages to a local file or forwards the messages to a remote system, as specified in the /etc/syslog.conf file.

When you install your Tru64 UNIX operating system, the /etc/syslog.conf file is created and specifies the default event-logging configuration. The /etc/syslog.conf file specifies the file names that are the destination for the event messages, which are in ASCII format. Section 13.2.1.1 discusses the /etc/syslog.conf file.

13.1.2 Binary Event Logging

The binary event-logging facility detects hardware and software events in the kernel and logs the detailed information in binary format records. Events that are logged by the binary event-logging facility are also logged by the syslog function in a less detailed, but still informative, summary message.

The binary event-logging facility uses the binlogd daemon to collect various event-log records. The binlogd daemon logs these records to a local file or forwards the records to a remote system, as specified in the /etc/binlog.conf default configuration file, which is created when you install your Tru64 UNIX system.

In this release, the event management utility of choice is the DECevent component, which replaces the uerf error logging facility. You can examine the binary event-log files by using the dia command (preferred) or by using the uerf command. Both commands translate the records from binary format to ASCII format.

Note

The uerf facility remains as a component of Tru64 UNIX, but will be retired in a future release of the operating system. See Appendix D or uerf(8) for more information about using uerf.

The DECevent utility is an event management utility that you can use to produce ASCII reports from entries in the system's event log files. You can run the DECevent utility from the command line or select it from the Common Desktop Environment (CDE) Application Manager.

For information about administering the DECevent utility, see the following Tru64 UNIX documentation:

DECevent Translation and Reporting Guide

dia(8)

A new analysis utility, Compaq Analyze, is provided in Tru64 UNIX and supports recent processors only. Compaq Analyze is designated as the replacement for uerf on EV6-series processors. See the Compaq Analyze Installation Guide for Compaq Tru64 UNIX, which can be found on the Associated Products CD-ROM. The Tru64 UNIX Installation Guide contains information on installing associated products.

Note that the sys_check utility uses DECevent translation and reporting tools to read system error files such as binary.errlog.saved. Refer to the sys_check(8) reference page for more information.

13.2 Configuring Event Logging

When you install your system, the default system and binary event-logging configuration is used. You can change the default configuration by modifying the configuration files. You can also modify the kernel binary event-logger configuration, if necessary, as described in Section 13.2.4.

To enable system and binary event-logging, the special files must exist and the event-logging daemons must be running. Refer to Section 13.2.2 and Section 13.2.3 for more information.

13.2.1 Editing the Configuration Files

If you do not want to use the default system or binary event-logging configuration, edit the /etc/syslog.conf or /etc/binlog.conf configuration file to specify how the system should log events. In the files, you specify the facility, which is the source of a message or the part of the system that generates a message; the priority, which is the message's level of severity; and the destination for messages.

The following sections describe how to edit the configuration files.

13.2.1.1 The syslog.conf File

If you want the syslogd daemon to use a configuration file other than the default, you must specify the file name with the syslogd -f config_file command.

The following is an example of the default /etc/syslog.conf file:

#
# syslogd config file
#
# facilities: kern user mail daemon auth syslog lpr binary
# priorities: emerg alert crit err warning notice info debug
#
# [1]    [2]                              [3]
kern.debug               /var/adm/syslog.dated/kern.log
user.debug               /var/adm/syslog.dated/user.log
daemon.debug             /var/adm/syslog.dated/daemon.log
auth.crit;syslog.debug   /var/adm/syslog.dated/syslog.log
mail,lpr.debug           /var/adm/syslog.dated/misc.log
msgbuf.err               /var/adm/crash/msgbuf.savecore
kern.debug               /var/adm/messages
kern.debug               /dev/console
*.emerg                  *

Each /etc/syslog.conf file entry contains the following fields:

[1] Specifies the facility, which is the part of the system generating the message.

[2] Specifies the severity level. The syslogd daemon logs all messages of the specified severity level plus all messages of greater severity. For example, if you specify level err, all messages of levels err, crit, alert, and emerg or panic are logged.

[3] Specifies the destination where the messages are logged.

The syslogd daemon ignores blank lines and lines that begin with a number sign (#). You can specify a number sign (#) as the first character in a line to include comments in the /etc/syslog.conf file or to disable an entry.

The facility and severity level are separated from the destination by one or more tabs.

You can specify more than one facility and its severity level by separating them with semicolons. In the preceding example, messages from the auth facility of crit severity level and higher and messages from the syslog facility of debug severity level and higher are logged to the /var/adm/syslog.dated/syslog.log file.

You can specify more than one facility by separating them with commas. In the preceding example, messages from the mail and lpr facilities of debug severity level and higher are logged to the /var/adm/syslog.dated/misc.log file.
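
The severity rule described above (an entry logs messages of the specified level and of any greater severity) can be illustrated with a short shell sketch. The function names severity_rank and would_log are hypothetical and are not part of the system, and the numeric ranks are simply an ordering chosen for this example. The sketch only models the rule; it does not read the /etc/syslog.conf file.

```shell
# Rank the syslog severity levels listed in this section.
# 0 is the most severe (emerg), 7 is the least severe (debug).
# Hypothetical helper for illustration only.
severity_rank() {
    case "$1" in
        emerg|panic)  echo 0 ;;
        alert)        echo 1 ;;
        crit)         echo 2 ;;
        err)          echo 3 ;;
        warning|warn) echo 4 ;;
        notice)       echo 5 ;;
        info)         echo 6 ;;
        debug)        echo 7 ;;
        *)            return 1 ;;   # unknown level name
    esac
}

# Succeed if a message of severity $1 would be logged by an entry
# whose configured level is $2 (same or greater severity).
# Returns 2 if either level name is not recognized.
would_log() {
    msg=$(severity_rank "$1") || return 2
    cfg=$(severity_rank "$2") || return 2
    [ "$msg" -le "$cfg" ]
}
```

For example, would_log crit err succeeds because crit is more severe than err, while would_log warning err fails.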

You can specify the following facilities:

Facility	Description
`kern`	Messages generated by the kernel. These messages cannot be generated by any user process.
`user`	Messages generated by user processes. This is the default facility.
`mail`	Messages generated by the mail system.
`daemon`	Messages generated by the system daemons.
`auth`	Messages generated by the authorization system (for example: `login`, `su`, and `getty`).
`lpr`	Messages generated by the line printer spooling system (for example: `lpr`, `lpc`, and `lpd`).
`local0`	Reserved for local use, along with local1 to local7.
`mark`	Receives a message of priority `info` every 20 minutes, unless a different interval is specified with the `syslogd` `-m` option.
`msgbuf`	Kernel syslog message buffer recovered from a system crash. The `savecore` command and the `syslogd` daemon use the `msgbuf` facility to recover system event messages from a crash.
`*`	Messages generated by all parts of the system.

You can specify the following severity levels, which are listed in order of highest to lowest severity:

Severity Level	Description
`emerg` or `panic`	A panic condition. You can broadcast these messages to all users.
`alert`	A condition that you should immediately correct, such as a corrupted system database.
`crit`	A critical condition, such as a hard device error.
`err`	Error messages.
`warning` or `warn`	Warning messages.
`notice`	Conditions that are not error conditions, but are handled as special cases.
`info`	Informational messages.
`debug`	Messages containing information that is used to debug a program.
`none`	Disables a specific facility's messages.

You can specify the following message destinations:

Destination	Description
Full pathname	Appends messages to the specified file. You should direct each facility's messages to separate files (for example: `kern.log`, `mail.log`, or `lpr.log`).
Host name preceded by an at sign (@)	Forwards messages to the `syslogd` daemon on the specified host.
List of users separated by commas	Writes messages to the specified users if they are logged in.
`*`	Writes messages to all the users who are logged in.

You can specify in the /etc/syslog.conf file that the syslogd daemon create daily log files. To create daily log files, use the following syntax to specify the pathname of the message destination:

/var/adm/syslog.dated/file

The file variable specifies the name of the log file, for example, mail.log or kern.log.

If you specify a /var/adm/syslog.dated/file pathname destination, each day the syslogd daemon creates a subdirectory under the /var/adm/syslog.dated directory and a log file in the subdirectory by using the following syntax:

/var/adm/syslog.dated/date/file

The date variable specifies the day, month, and time that the log file was created.

The file variable specifies the name of the log file you previously specified in the /etc/syslog.conf file.

The syslogd daemon automatically creates a new date directory every 24 hours and also when you boot the system.

For example, to create a daily log file of all mail messages of level info or higher, edit the /etc/syslog.conf file and specify an entry similar to the following:

mail.info		/var/adm/syslog.dated/mail.log

If you specify the previous entry, the syslogd daemon could create the following daily directory and file:

/var/adm/syslog.dated/11-Jan-12:10/mail.log
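
Because a new date directory is created every day, the current copy of a log file moves from one directory to the next. The following is a minimal sketch of one way to locate the newest copy, assuming that the most recently modified date directory is the current one. The function name latest_dated_log is hypothetical, and the sketch assumes that directory names contain no white space, which holds for names such as 11-Jan-12:10.

```shell
# Print the path of the newest copy of a daily log file.
# $1 = dated log directory (for example /var/adm/syslog.dated)
# $2 = log file name as given in /etc/syslog.conf (for example mail.log)
# Hypothetical helper for illustration only.
latest_dated_log() {
    # ls -t sorts by modification time, newest first; -d lists the
    # directories themselves rather than their contents.
    for dir in $(ls -td "$1"/*/ 2>/dev/null); do
        if [ -f "$dir$2" ]; then
            echo "$dir$2"
            return 0
        fi
    done
    return 1    # no date directory contains the file
}
```

For example, latest_dated_log /var/adm/syslog.dated mail.log prints the path of the most recent mail.log file, if one exists.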

13.2.1.2 The binlog.conf File

If you want the binlogd daemon to use a configuration file other than the default, specify the file name with the binlogd -f config_file command.

The following is an example of a /etc/binlog.conf file:

#
# binlogd configuration file
#
# format of a line:   event_code.priority         destination
#
# where:
# event_code - see codes in binlog.h and man page, * = all events
# priority   - severe, high, low, * = all priorities
# destination - local file pathname or remote system hostname
#
#
*.*			/usr/adm/binary.errlog
dumpfile		/usr/adm/crash/binlogdumpfile
102.high		/usr/adm/disk.errlog
[1]    [2]                     [3]

Each entry in the /etc/binlog.conf file, except the dumpfile event class entry, contains three fields:

[1] Specifies the event class code that indicates the part of the system generating the event.

[2] Specifies the severity level of the event. Do not specify a severity level if you specify dumpfile for an event class.

[3] Specifies the destination where the binary event records are logged.

The binlogd daemon ignores blank lines and lines that begin with a number sign (#). You can specify a number sign (#) as the first character in a line to include comments in the file or to disable an entry.

The event class and severity level are separated from the destination by one or more tabs.

You can specify the following event class codes:

Class Code	General
*	All event classes.
`dumpfile`	Specifies the recovery of the kernel binary event log buffer from a crash dump. A severity level cannot be specified.

Class Code	Hardware-Detected Events
100	CPU machine checks and exceptions
101	Memory
102	Disks
103	Tapes
104	Device controller
105	Adapters
106	Buses
107	Stray interrupts
108	Console events
109	Stack dumps
199	SCSI CAM events

Class Code	Software-Detected Events
201	CI port-to-port-driver events
202	System communications services events

Class Code	Informational ASCII Messages
250	Generic

Class Code	Operational Events
300	Startup ASCII messages
301	Shutdown ASCII messages
302	Panic messages
310	Time stamp
350	Diagnostic status messages
351	Repair and maintenance messages

You can specify the following severity levels:

Severity Level	Description
*	All severity levels
`severe`	Unrecoverable events that are usually fatal to system operation
`high`	Recoverable events or unrecoverable events that are not fatal to system operation
`low`	Informational events

You can specify the following destinations:

Destination	Description
Full pathname	Specifies the file name to which the `binlogd` daemon appends the binary event records.
`@hostname`	Specifies the name of the host (preceded by an @) to which the `binlogd` daemon forwards the binary event records. If you specify `dumpfile` for an event class, you cannot forward records to a host.
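
As an illustration of how the class codes, severity levels, and destinations combine, the following entries are a sketch only and are not part of the default configuration. The host name loghost and the file name /usr/adm/memory.errlog are hypothetical. The first entry would forward events of every class with severity severe to the binlogd daemon on loghost, and the second would append memory events (class code 101) of all severity levels to a separate file. The fields are separated by tabs.

```text
# Hypothetical additions to /etc/binlog.conf (illustration only)
*.severe		@loghost
101.*			/usr/adm/memory.errlog
```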

13.2.2 Creating the Special Files

The syslogd daemon cannot log kernel messages unless the /dev/klog character special file exists. If the /dev/klog file does not exist, create it by using the following command syntax:

/dev/MAKEDEV /dev/klog

Also, the binlogd daemon cannot log local system events unless the /dev/kbinlog character special file exists. If the /dev/kbinlog file does not exist, create it by using the following command syntax:

/dev/MAKEDEV /dev/kbinlog

Refer to the MAKEDEV(8) reference page for more information.
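
Before you create the special files, you may want to check which of them are missing. The following is a minimal sketch that uses the standard test for a character special file. The function name missing_char_devices is hypothetical.

```shell
# Print each named path that is not an existing character special file.
# Prints nothing when every path is present.
# Hypothetical helper for illustration only.
missing_char_devices() {
    for dev in "$@"; do
        if [ ! -c "$dev" ]; then
            echo "$dev"
        fi
    done
}
```

For example, missing_char_devices /dev/klog /dev/kbinlog lists only the special files that you still need to create with the /dev/MAKEDEV command.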

13.2.3 Starting and Stopping Event-Logging Daemons

The syslogd and binlogd daemons are automatically started by the init program during system startup. However, you must ensure that the daemons are started. You can also specify options with the command that starts the daemons. Refer to the init(8) reference page for more information.

13.2.3.1 The syslogd Daemon

You must ensure that the syslogd daemon is started by the init program. If the syslogd daemon is not started or if you want to specify options with the command that starts the syslogd daemon, you must edit the /sbin/init.d/syslog file and either include or modify the syslogd command line. Note that you can also invoke the command manually.

The command that starts the syslogd daemon has the following syntax:

/usr/sbin/syslogd [-d] [-f config_file] [-m mark_interval]

Refer to the syslogd(8) reference page for information about command options.

Note

You must ensure that the /var/adm directory is mounted, or the syslogd daemon will not work correctly.

The syslogd daemon reads messages from the following:

The UNIX domain socket /dev/log, which is automatically created by the syslogd daemon

An Internet domain socket, which is specified in the /etc/services file

The special file /dev/klog, which logs only kernel messages

Messages from other programs use the openlog, syslog, and closelog calls.

When the syslogd daemon is started, it creates the /var/run/syslog.pid file, where the syslogd daemon stores its process identification number. Use the process identification number to stop the syslogd daemon before you shut down the system.

During normal system operation, the syslogd daemon is called if data is put in the kernel syslog message buffer, located in physical memory. The syslogd daemon reads the /dev/klog file and gets a copy of the kernel syslog message buffer. The syslogd daemon starts at the beginning of the buffer and sequentially processes each message that it finds. Each message is prefixed by facility and priority codes, which are the same as those specified in the /etc/syslog.conf file. The syslogd daemon then sends the messages to the destinations specified in the file.

To stop the syslogd event-logging daemon, use the following command:

# kill `cat /var/run/syslog.pid`

You can apply changes that you make to the /etc/syslog.conf configuration file without shutting down the system by using the following command:

# kill -HUP `cat /var/run/syslog.pid`

13.2.3.2 The binlogd Daemon

You must ensure that the init program starts the binlogd daemon. If the binlogd daemon does not start, or if you want to specify options with the command that starts the binlogd daemon, you must edit the /sbin/init.d/syslog file and either include or modify the binlogd command line. Note that you can also invoke the command manually.

The command that starts the binlogd daemon has the following syntax:

/usr/sbin/binlogd [-d] [-f config_file]

Refer to the binlogd(8) reference page for information on command options.

The binlogd daemon reads binary event records from the following:

An Internet domain socket (binlogd, 706/udp), which is specified in the /etc/services file

The /dev/kbinlog special file

When the binlogd daemon starts, it creates the /var/run/binlogd.pid file, where the binlogd daemon stores its process identification number. Use the process identification number to stop or reconfigure the binlogd daemon.

During normal system operation, the binlogd daemon is called if data is put into the kernel's binary event-log buffer or if data is received on the Internet domain socket. The binlogd daemon then reads the data from the /dev/kbinlog special file or from the socket. Each record contains an event class code and a severity level code. The binlogd daemon processes each binary event record and logs it to the destination specified in the /etc/binlog.conf file.

To stop the binlogd daemon, use the following command:

# kill `cat /var/run/binlogd.pid`

You can apply changes that you make to the /etc/binlog.conf configuration file without shutting down the system by using the following command:

# kill -HUP `cat /var/run/binlogd.pid`
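
The kill commands shown for the syslogd and binlogd daemons assume that the PID file exists and that it names a running process. The following is a minimal sketch of a wrapper that checks both conditions first. The function name signal_from_pidfile is hypothetical and is not a system command.

```shell
# Send a signal to the daemon whose process ID is stored in a PID file.
# $1 = signal name without the SIG prefix (for example HUP or TERM)
# $2 = PID file (for example /var/run/syslog.pid)
# Hypothetical helper for illustration only.
signal_from_pidfile() {
    if [ ! -r "$2" ]; then
        echo "PID file $2 not found; is the daemon running?" >&2
        return 1
    fi
    pid=$(cat "$2")
    # kill -0 delivers no signal; it only tests that the process exists
    # and that you are permitted to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "cannot signal process $pid (stale PID file $2?)" >&2
        return 1
    fi
    kill -s "$1" "$pid"
}
```

For example, signal_from_pidfile HUP /var/run/binlogd.pid asks the binlogd daemon to reread its configuration file, and reports an error instead of failing silently if the daemon is not running.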

13.2.4 Configuring the Kernel Binary Event Logger

You can configure the kernel binary event logger by modifying the default keywords and rebuilding the kernel. You can scale the size of the kernel binary event-log buffer to meet your system's needs. You can also enable and disable the binary event logger and the logging of kernel ASCII messages into the binary event log.

The /sys/data/binlog_data.c file defines the binary event-logger configuration. The default configuration specifies a buffer size of 24K bytes, enables binary event logging, and disables the logging of kernel ASCII messages. You can modify the configuration by changing the values of the binlog_bufsize and binlog_status keywords in the file.

The binlog_bufsize keyword specifies the size of the kernel buffer that the binary event logger uses. The size of the buffer can be between 8 kilobytes (8192 bytes) and 48 kilobytes (49152 bytes). Small system configurations, such as workstations, can use a small buffer. Large server systems that use many disks may need a large buffer.

The binlog_status keyword specifies the behavior of the binary event logger. You can specify the following values for the binlog_status keyword:

0 (zero): Disables the binary event logger.
BINLOG_ON: Enables the binary event logger.
BINLOG_ASCIION: Enables the logging of kernel ASCII messages into the binary event log if the binary event logger is enabled. This value must be specified with the BINLOG_ON value as follows: int binlog_status = BINLOG_ON | BINLOG_ASCIION;

After you modify the /sys/data/binlog_data.c file, you must rebuild and boot the new kernel.

13.3 Recovering Event Logs After a System Crash

You can recover unprocessed messages and binary event-log records from a system crash when you reboot the system.

The msgbuf.err entry in the /etc/syslog.conf file specifies the destination of the kernel syslog message buffer msgbuf that is recovered from the dump file. The default /etc/syslog.conf file entry for the kernel syslog message buffer file is as follows:

msgbuf.err            /var/adm/crash/msgbuf.savecore

The dumpfile entry in the /etc/binlog.conf file specifies the file name destination for the kernel binary event-log buffer that is recovered from the dump file. The default /etc/binlog.conf file entry for the kernel binary event-log buffer file is as follows:

dumpfile              /usr/adm/crash/binlogdumpfile

If a crash occurs, the syslogd and binlogd daemons cannot read the /dev/klog and /dev/kbinlog special files and process the messages and binary event records. When you reboot the system, the savecore command runs and, if a dump file exists, recovers the kernel syslog message and binary event-log buffers from the dump file. After savecore runs, the syslogd and binlogd daemons are started.

The syslogd daemon reads the syslog message buffer file, checks that its data is valid, and then processes it in the same way that it normally processes data from the /dev/klog file, using the information in the /etc/syslog.conf file.

The binlogd daemon reads the binary event-log buffer file, checks that its data is valid, and then processes the file in the same way that it processes data from the /dev/kbinlog special file, using the information in the /etc/binlog.conf file.

After the syslogd and binlogd daemons are finished with the buffer files, the files are deleted.

13.4 Maintaining Log Files

If you specify full pathnames for the message destinations in the /etc/syslog.conf and /etc/binlog.conf files, the log files will grow in size. Also, if you configure the syslogd daemon to create daily directories and log files, eventually there will be many directories and files, although the files themselves will be small. Therefore, you must keep track of the size and the number of log files and daily directories and delete files and directories if they become unwieldy.

You can also use the cron daemon to specify that log files be deleted. The following is an example of a crontab file entry:

5 1 * * * find /var/adm/syslog.dated -type d -mtime +5 -exec rm -rf '{}' \;

This crontab entry runs every day at 1:05 a.m. and deletes all directories under /var/adm/syslog.dated that were last modified more than five days ago, along with their contents. Refer to the crontab(1) reference page for more information.
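
Before you install a crontab entry that deletes directories, you may want to see what it would remove. The following is a minimal sketch that runs the same find selection with -print in place of the rm command. The function name list_expired_log_dirs is hypothetical.

```shell
# List the daily log directories that a cleanup job would delete.
# $1 = dated log directory (for example /var/adm/syslog.dated)
# $2 = age threshold in days (directories modified more than $2 days ago)
# Hypothetical helper for illustration only; nothing is deleted.
list_expired_log_dirs() {
    find "$1" -type d -mtime +"$2" -print
}
```

For example, list_expired_log_dirs /var/adm/syslog.dated 5 prints the directories that the crontab entry shown above would select.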

13.5 Enhanced Core File Naming

By default, when a core file is written to disk, the system saves the file under the name core. Each subsequent core file overwrites its predecessor because the file name is identical. When you enable enhanced core file naming, the system attempts to create unique names for core files in the form core.program_name.host_name.numeric_tag. The uniquely named files are not overwritten by subsequent core files, which prevents the loss of valuable debugging information when the same program or multiple programs fail multiple times (and perhaps for different reasons).

The enhanced name provides the following identification data:

core - The literal string core

program_name - Up to 16 characters taken from the program name as shown by the ps command.

host_name - The first portion of the system's network host name, that is, up to 16 characters taken from the part of the host name that precedes the first dot. For example, the fourth core file generated on host buggy.net.ooze.com by the program dropsy would be core.dropsy.buggy.3.

numeric_tag - The tag assigned to the core file to make it unique among all the core files generated by a program on a host. The maximum value for this tag, and thus the maximum number of core files for this program and host, is set by a system configuration parameter.
Note that the tag is not a literal version number. The system selects the first available unique tag for the core file. For example, if a program's core files have tags .0, .1, and .3, the system uses tag .2 for the next core file it creates for that program. By default, the system can create up to 16 versions of a core file. If the system-configured limit for core file instances is reached, the system will not create any more core files for that program and host combination.
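
The tag selection described above can be modeled with a short shell sketch. This is an illustration of the rule only, not the system's implementation. The function name next_core_tag is hypothetical, and the sketch assumes that tags run from 0 to one less than the configured maximum number of versions.

```shell
# Print the lowest unused tag, or fail when the limit is reached.
# $1 = maximum number of versions (assumed tag range: 0 to $1 - 1)
# remaining arguments = tags already in use
# Hypothetical helper for illustration only.
next_core_tag() {
    max=$1
    shift
    tag=0
    while [ "$tag" -lt "$max" ]; do
        in_use=no
        for used in "$@"; do
            if [ "$used" -eq "$tag" ]; then
                in_use=yes
            fi
        done
        if [ "$in_use" = no ]; then
            echo "$tag"
            return 0
        fi
        tag=$((tag + 1))
    done
    return 1    # limit reached: no further core file would be created
}
```

For example, next_core_tag 16 0 1 3 prints 2, which matches the case in which core files with tags .0, .1, and .3 already exist.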

If you plan to save a number of uniquely named core files, be aware that core files can quickly consume available disk space. Allowing core files to be saved under different names in a file system with minimal free space can potentially fill your disk because the files are not overwritten when new core files are created. If you enable this feature, make sure you remove old core files when you have finished examining them.

You can enable this feature at the system level by setting the enhanced-core-name system configuration variable to 1 in the proc subsystem, as in the following example:

proc:
            enhanced-core-name = 1

The system manager can limit the number of unique core file versions that a program can create on a specific host system by setting the system configuration variable enhanced-core-max-versions to the desired value, as in the following example:

proc:
            enhanced-core-name = 1
            enhanced-core-max-versions = 8

The minimum value is 1, the maximum value is 99,999, and the default is 16. Refer to Chapter 5 and in particular Section 5.2.1.2 for information on setting the attributes.

You can enable enhanced core file naming at the program level by calling the uswitch system call with the USW_CORE flag set, as in the following example:

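    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/uswitch.h>

    /*
     * Request enhanced core file naming for
     * this process then create a core file.
     */
    int main(void)
    {
            /* Read the current switch values, then set USW_CORE as well. */
            long uval = uswitch(USC_GET, 0);
            uval = uswitch(USC_SET, uval | USW_CORE);
            if (uval < 0) {
                    perror("uswitch");
                    exit(1);
            }
            /* SIGQUIT terminates the process and produces a core file. */
            raise(SIGQUIT);
            return 0;
    }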

13.6 Administering Crash Dumps

When a Tru64 UNIX system crashes, it writes all or part of physical memory to swap space on disk. This information is called a crash dump. During the reboot process, the system moves the crash dump into a file and copies the kernel executable image to another file. Together, these files are the crash dump files. You can use the information in the crash dump files to help you to determine the cause of the system crash.

Crash dump files are required for analysis when a system crashes, or during the development of custom kernels (debugging). You may also have to supply a crash dump file to Technical Support to analyze system problems. To do this, you must understand how crash dump files are created. You must reserve space on disks for the crash dump and crash dump files. The amount of space you reserve depends on your system configuration and the type of crash dump you want the system to perform.

The sections that follow provide information to help you manage crash dumps and crash dump files. For information on analyzing crash dump log files, refer to the Kernel Debugging guide.

13.6.1 Related Documentation and Utilities

The following documentation contains information on crash dumps and related topics, such as swap space requirements:

Installation Guide - Provides information on the initial swap space and dump settings configured during installation

Kernel Debugging - Provides information on analyzing crash dumps. Note that you may need to install Development subsets and appropriate licenses in order to use the debugger. The guide contains information on:
- Crash Dump creation and content
- Planning and estimating dump sizes and space requirements
- Logging and log files
- Forcing crash dumps
- Archiving dumps

savecore(8) - Describes the program that copies a core dump from swap partitions to a file.

expand_dump(8) - Describes the program that produces a non-compressed kernel crash dump file.

sysconfig(8) and sysconfigdb(8) - Describe the programs that maintain the kernel subsystem configuration and are used to set crash dump attributes in the kernel to control crash behavior. You can also use the graphical interface /usr/bin/X11/dxkerneltuner to modify kernel attributes. See the dxkerneltuner(8) reference page for information. Online help is also available for this interface. You can also launch the dxkerneltuner interface from the CDE Desktop by invoking the Application Manager and selecting System Admin.

swapon(8) - Describes the program that creates additional file(s) for paging and swapping. Use swapon if you need to add additional temporary or permanent swap space to produce full dumps.

dbx(1) - Describes the source level debugger.

13.6.2 Files Created and Used During Crash Dumps

By default, the savecore command copies crash dump files into /var/adm/crash, although you can redirect crash dumps to any file system that you designate. The following files are created or used during a crash:

/var/adm/crash/vmzcore.n - The crash dump file. The file is named vmcore.n if it is not compressed (no z).

/var/adm/crash/bounds - A text file that specifies the incremental number of the next dump (the n in vmzcore.n).

/var/adm/crash/minfree - The file that specifies the minimum number of kilobytes to be left free after crash dump files are written.

/var/adm/crash/vmunix.n - A copy of the kernel that was running at the time of the crash, typically of /vmunix.

/etc/syslog.conf and /etc/binlog.conf - The logging configuration files

13.6.3 Crash Dump Creation

After a system crash, you normally reboot your system by issuing the boot command at the console prompt. During a system reboot, the /sbin/savecore script invokes the savecore command. This command moves crash dump information from the swap partitions into a file and copies the kernel that was running at the time of the crash into another file. You can analyze these files to help you determine the cause of a crash. The savecore command also logs the crash in system log files.

You can invoke the savecore command from the command line. For information about the command syntax, see the savecore(8) reference page.

13.6.3.1 Crash Dump File Creation

When the savecore command begins running during the reboot process, it determines whether a crash dump occurred and whether the file system contains enough space to save it. (The system saves no crash dump if you shut it down and reboot it; that is, the system saves a crash dump only when it crashes.)

If a crash dump exists and the file system contains enough space to save the crash dump files, the savecore command moves the crash dump and a copy of the kernel into files in the default crash directory, /var/adm/crash. (You can modify the location of the crash directory.) The savecore command stores the kernel image in a file named vmunix.n, and by default it stores the (compressed) contents of physical memory in a file named vmzcore.n.

The n variable specifies the number of the crash. The number of the crash is recorded in the bounds file in the crash directory. After the first crash, the savecore command creates the bounds file and stores the number 1 in it. The command increments that value for each succeeding crash.
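
The following sketch shows how you might use the bounds file to predict the names of the next set of crash dump files. It assumes that the bounds file contains only the number of the next dump and that numbering starts at 0 when no bounds file exists yet; both points are inferred from the description above rather than taken from the savecore(8) reference page. The function name next_dump_names is hypothetical.

```shell
# Print the file names that the next crash dump would use.
# $1 = crash directory (for example /var/adm/crash)
# Hypothetical helper for illustration only.
next_dump_names() {
    n=0    # assumed starting number when no bounds file exists
    if [ -r "$1/bounds" ]; then
        n=$(cat "$1/bounds")
    fi
    echo "$1/vmunix.$n"     # copy of the kernel image
    echo "$1/vmzcore.$n"    # compressed contents of physical memory
}
```

For example, if /var/adm/crash/bounds contains 3, next_dump_names /var/adm/crash prints /var/adm/crash/vmunix.3 and /var/adm/crash/vmzcore.3.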

The savecore command runs early in the reboot process so that little or no system swapping occurs before the command runs. This practice helps ensure that crash dumps are not corrupted by swapping.

13.6.3.2 Crash Dump Logging

Once the savecore command writes the crash dump files, it performs the following steps to log the crash in system log files:

Writes a reboot message to the /var/adm/syslog/auth.log file. If the system crashed due to a panic condition, the panic string is included in the log entry.
You can cause the savecore command to write the reboot message to another file by modifying the auth facility entry in the syslog.conf file. If you remove the auth entry from the syslog.conf file, the savecore command does not save the reboot message.

Attempts to save the kernel message buffer from the crash dump. The kernel message buffer contains messages created by the kernel that crashed. These messages might help you determine the cause of the crash.
The savecore command saves the kernel message buffer in the /var/adm/crash/msgbuf.savecore file, by default. You can change the location to which savecore writes the kernel message buffer by modifying the msgbuf.err entry in the /etc/syslog.conf file. If you remove the msgbuf.err entry from the /etc/syslog.conf file, savecore does not save the kernel message buffer.
Later in the reboot process, the syslogd daemon starts up, reads the contents of the file named by the msgbuf.err entry, and moves those contents into the /var/adm/syslog/kern.log file, as specified in the /etc/syslog.conf file. The syslogd daemon then deletes the saved message buffer file. For more information about how system logging is performed, see the syslogd(8) reference page.

Attempts to save the binary event buffer from the crash dump. The binary event buffer contains messages that can help you identify the problem that caused the crash, particularly if the crash was due to a hardware error.
The savecore command saves the binary event buffer in the /usr/adm/crash/binlogdumpfile file by default. You can change the location to which savecore writes the binary event buffer by modifying the dumpfile entry in the /etc/binlog.conf file. If you remove the dumpfile entry from the /etc/binlog.conf file, savecore does not save the binary event buffer.
Later in the reboot process the binlogd daemon starts up, reads the contents of the /usr/adm/crash/binlogdumpfile file, and moves those contents into the /usr/adm/binary.errlog file, as specified in the /etc/binlog.conf file. The binlogd daemon then deletes the binlogdumpfile file. For more information about how binary error logging is performed, see the binlogd(8) reference page.
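Because savecore skips a buffer whose configuration entry has been removed, it can be useful to confirm that both entries are still present. The following is a minimal sketch under stated assumptions: the function name check_dump_log_entries is hypothetical, and an active entry is assumed to begin at the start of a line.

```shell
#!/bin/sh
# Hypothetical helper: report whether the entries that savecore relies on
# are present. Arguments default to the standard configuration files.
check_dump_log_entries() {
    syslog_conf=${1:-/etc/syslog.conf}
    binlog_conf=${2:-/etc/binlog.conf}
    status=0
    # Assumption: an active entry starts at the beginning of a line, so
    # commented-out entries (leading #) do not match.
    if grep '^msgbuf\.err' "$syslog_conf" >/dev/null 2>&1; then
        echo "msgbuf.err: present"
    else
        echo "msgbuf.err: missing (kernel message buffer is not saved)"
        status=1
    fi
    if grep '^dumpfile' "$binlog_conf" >/dev/null 2>&1; then
        echo "dumpfile: present"
    else
        echo "dumpfile: missing (binary event buffer is not saved)"
        status=1
    fi
    return $status
}
```

The function returns a nonzero status if either entry is missing, so it can be used in a periodic check script.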

13.6.3.3 Writing the Dump to Swap Space

When the system creates a crash dump, it writes the dump to the swap partitions. The system uses the swap partitions because the information stored in those partitions has meaning only for a running system. Once the system crashes, the information is useless and can be safely overwritten.

Before the system writes a crash dump, it determines how the dump fits into the swap partitions, which are defined in the /etc/fstab file. For example, the following fragment of the /etc/fstab file shows three swap partitions available:

/dev/rz1b  swap1  ufs sw 0 2
/dev/rz3h  swap2  ufs sw 0 2
/dev/rz4b  swap3  ufs sw 0 2

You use the swapon command to modify available swap space.

The following list describes how the system determines where to write the crash dump:

If the crash dump fits in the primary swap partition, the system writes it to that partition (/dev/rz1b in this example). The system writes the dump as far toward the end of the partition as possible, leaving the beginning of the partition available for boot-time swapping.

If the crash dump is too large for the primary swap partition but fits in the secondary or tertiary swap space, the system writes the crash dump to the other swap partitions (/dev/rz3h and /dev/rz4b in this example).

If the crash dump is too large for all the available swap partitions, the system writes the crash dump to the swap partitions until those partitions are full. It then writes the remaining crash dump information to the end of the primary swap partition, possibly filling that partition.

Note

If the aggregate size of all the swap partitions is too small to contain the crash dump, the system creates no crash dump.

Each crash dump contains a header, which the system always writes to the end of the primary swap partition. The header contains information about the size of the dump and where the dump is stored. This information allows savecore to find and save the dump at system reboot time.

The way that a crash dump is taken can be controlled by the dump_sp_threshold kernel attribute, which controls the partitions to which the crash dump is written. The default value of 4096 causes the primary swap partition to be used exclusively for crash dumps that are small enough to fit the partition. In most cases, compressed dumps will fit on the primary swap partition and you will not find it necessary to modify this. If required, you can configure the system so that it fills the secondary swap partitions with dump information before writing any information (except the dump header) to the primary swap partition.

The value in the dump_sp_threshold attribute indicates the amount of space you normally want available for swapping as the system reboots. By default, this attribute is set to 4096 blocks, meaning that the system attempts to leave 2 MB of disk space open in the primary swap partition after the dump is written. Refer to the Kernel Debugging guide for additional information on this setting.
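The placement rules can be summarized in a few comparisons. The following sketch is a simplified model, not the kernel's algorithm: the function name dump_target is hypothetical, the dump header is ignored, all secondary partitions are treated as one pool, and a dump is assumed to fit the primary partition only when the threshold remains free afterward.

```shell
#!/bin/sh
# Hypothetical model of the placement rules; all sizes are in 512-byte
# blocks. usage: dump_target DUMP PRIMARY SECONDARY_TOTAL [THRESHOLD]
dump_target() {
    dump=$1 primary=$2 secondary=$3 threshold=${4:-4096}
    if [ $((dump + threshold)) -le "$primary" ]; then
        echo primary      # fits and leaves the threshold free for swapping
    elif [ "$dump" -le "$secondary" ]; then
        echo secondary    # fits in the secondary partitions alone
    elif [ "$dump" -le $((primary + secondary)) ]; then
        echo split        # fill the secondaries, remainder to the primary
    else
        echo none         # aggregate swap space is too small
    fi
}
```

In this model, a threshold equal to the size of the primary partition means that no dump fits there, which is the effect of raising dump_sp_threshold.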

To allow space for crash dumps, adjust the size of the swap partitions to create temporary or permanent swap space. For information about modifying the size of swap partitions, see the swapon(8) reference page.

Note

Be sure all permanent swap partitions are listed in the /etc/fstab file. The savecore command, which copies the crash dump from swap partitions to a file, uses the information in the /etc/fstab file to find the swap partitions. If you omit a swap partition, the savecore command might be unable to find the omitted partition.

You can control the default location of the crash directory with the rcmgr command. For example, to save crash dump files in the /usr/adm/crash2 directory by default (at each system startup), issue the following command:

# /usr/sbin/rcmgr set SAVECORE_DIR /usr/adm/crash2

If you want the system to return to multiuser mode, regardless of whether it saved a crash dump, issue the following command:

# /usr/sbin/rcmgr set SAVECORE_FLAGS M

13.6.4 Choosing the Content and Method of Crash Dumps

Crash dumps are compressed and partial by default, but can be full, noncompressed, or both if required. Normally, partial crash dumps provide the information that you need to determine the cause of a crash. However, you might want the system to generate full crash dumps if you have a recurring crash problem and partial crash dumps have not been helpful in finding the cause of the crash.

A partial crash dump contains the following:

The crash dump header

A copy of part of physical memory
The system writes the part of physical memory believed to contain significant information at the time of the system crash, basically kernel node code and data. By default, the system omits user page table entries.

A full crash dump contains the following:

The crash dump header

A copy of the entire contents of physical memory at the time of the crash

You can modify how crash dumps are taken by adjusting the crash dump threshold as described in the following section.

13.6.4.1 Adjusting the Primary Swap Partition's Crash Dump Threshold

To configure your system so that it writes even small crash dumps to secondary swap partitions before the primary swap partition, use a large value for the dump_sp_threshold attribute. As described in Section 13.6.3, the value you assign to this attribute indicates the amount of space that you normally want available for system swapping after a system crash.

To adjust the dump_sp_threshold attribute, issue the sysconfig command. For example, suppose your primary swap partition is 10 MB. To raise the value so that the system writes crash dumps to secondary partitions, issue the following command:

# sysconfig -r generic dump_sp_threshold=20480

In this example, the dump_sp_threshold attribute, which is in the generic subsystem, is set to 20,480 512-byte blocks (10 MB), the full size of the primary swap partition. The system therefore attempts to leave the entire primary swap partition open for system swapping: it writes the crash dump to secondary swap partitions and only the crash dump header to the end of the primary swap partition.
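The dump_sp_threshold value is a count of 512-byte blocks, so sizing it against a partition takes one conversion. The helper names below are hypothetical; the arithmetic relies only on 1 MB being 2048 blocks of 512 bytes.

```shell
#!/bin/sh
# Convert between megabytes and 512-byte blocks (1 MB = 2048 blocks).
mb_to_blocks() { echo $(( $1 * 2048 )); }
blocks_to_mb() { echo $(( $1 / 2048 )); }   # integer division, rounds down
```

For example, mb_to_blocks 2 prints 4096, the default threshold, and blocks_to_mb 20480 prints 10.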

The sysconfig command changes the value of system attributes for the currently running kernel. To store the new value of the dump_sp_threshold attribute in the sysconfigtab database, modify that database using the sysconfigdb command. For information about the sysconfigtab database and the sysconfigdb command, see the sysconfigdb(8) reference page.
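The sysconfigdb command takes its input from a stanza-format file. The following sketch only generates such a stanza; it assumes the usual layout of a subsystem name followed by a colon and then indented attribute = value lines, and the function name make_stanza is hypothetical. Check the sysconfigtab(4) and sysconfigdb(8) reference pages for the exact format and for the options used to merge the file into the database.

```shell
#!/bin/sh
# Hypothetical helper: print a stanza for one attribute.
# usage: make_stanza SUBSYSTEM ATTRIBUTE VALUE
make_stanza() {
    # Assumed layout: "subsystem:" on one line, then a tab-indented
    # "attribute = value" line.
    printf '%s:\n\t%s = %s\n' "$1" "$2" "$3"
}

# Example: write a stanza for the threshold to a scratch file.
# make_stanza generic dump_sp_threshold 20480 > /tmp/dump_threshold.stanza
```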

Note

Once the savecore program has copied the crash dump to a file, all swap devices are immediately available for mounting and swapping. The sharing of swap space only occurs for a short time during boot, and usually on systems with a small amount of physical memory.

13.6.4.2 Selecting and Using Noncompressed Crash Dumps

By default, crash dumps are compressed to save disk space, allowing you to dump a larger crash dump file to a smaller partition. This can offer significant advantages on systems with a large amount of physical memory, particularly if you want to tune the system to discourage swapping for realtime operations. On reboot after a crash, the crash dump utility, savecore, automatically detects that the dump is compressed, using information in the crash dump header in swap. It then copies the crash dump file from swap to the /var/adm/crash directory. The compressed crash dump files are identified by the letter z in the file name, to distinguish them from noncompressed crash dump files. For example: vmzcore.1.

Refer to the reference pages savecore(8), expand_dump(8), and sysconfig(8) for information on crash dump compression and how to produce a noncompressed crash dump file.

13.6.5 Generating a Crash Dump Manually

You can manually create a crash dump file by forcing a dump using the console command, crash, which causes a crash dump file to be created on a system that is not responding (hung). It is assumed that you have planned adequate space for the crash dump file and set any kernel parameters as described in the preceding sections.

On most hardware platforms, you force a crash dump by performing the following steps:

If your system has a switch for enabling and disabling the Halt button, set that switch to the Enable position.

Press the Halt button.

At the console prompt, enter the crash command.

Some systems have no Halt button. In this case, perform the following steps to force a crash dump on a hung system:

Press Ctrl/p at the console.

At the console prompt, enter the crash command.

If your system hangs and you force a crash dump, the panic string recorded in the crash dump is the following:

hardware restart

This panic string is always the one recorded when system operation is interrupted by pressing the Halt button or Ctrl/p.

13.6.6 Compressing Crash Dump Files for Archiving

If you are working entirely with compressed (vmzcore.n) crash dump files, they should already be sufficiently compressed for efficient archiving. However, if you are short of storage space, the following sections discuss options for further compression of dump files for storage or transmission if:

You are working with uncompressed (vmcore.n) crash dump files.

You need the maximum amount of compression possible - for example, if you need to transmit a crash dump file over a slow transmission line.

13.6.6.1 Compressing a Crash Dump File

This section describes how you minimize the size of crash dump files, depending on the type of file.

To compress a vmcore.n crash dump file, use a utility such as gzip, compress, or dxarchiver. For example, the following command creates a compressed file named vmcore.3.gz:

% gzip vmcore.3

A vmzcore.n crash dump file uses a special compression method that makes it readable by the current Tru64 UNIX debuggers and crash analysis tools without requiring decompression. A vmzcore.n file is substantially compressed compared to the equivalent vmcore.n file, but not as much as if the latter had been compressed using a standard UNIX compression utility such as gzip. Standard compression applied to a vmcore.n file produces a file about 40 percent smaller than the equivalent vmzcore.n file.

If you need to apply the maximum compression possible to a vmzcore.n file, perform the following steps:

Uncompress the vmzcore.n file using the expand_dump command (see expand_dump(8)). The following example creates an uncompressed file named vmcore.3 from the file vmzcore.3:
```
% expand_dump vmzcore.3
```

Compress the resulting vmcore.n file using a standard UNIX utility. The following example uses the gzip command to create a compressed file named vmcore.3.gz:
```
% gzip vmcore.3
```

Note

You can uncompress a vmzcore.n file only with the expand_dump command. Do not use gunzip, uncompress, or any other utility.

After a vmzcore.n file has been uncompressed into a vmcore.n file with expand_dump, you cannot compress it back into a vmzcore.n file.

13.6.6.2 Uncompressing a Partial Crash Dump File

Use care when uncompressing a partial crash dump file that was compressed from a vmcore.n file. Using the gunzip or uncompress command with no flags results in a vmcore.n file that requires storage space equal to the size of memory. In other words, the uncompressed file requires the same amount of disk space as a vmcore.n file from a full crash dump.

This situation occurs because the original vmcore.n file contains UNIX File System (UFS) file holes, which are regions that have no associated data blocks. When a process, such as the gunzip or uncompress command, reads from a hole in a file, the file system returns zero-valued data. Thus, memory omitted from the partial dump is added back into the uncompressed vmcore.n file as disk blocks containing all zeros.

To ensure that the uncompressed core file remains at its partial dump size, you must pipe the output from the gunzip or uncompress command with the -c flag to the dd command with the conv=sparse option. For example, to uncompress a file named vmcore.0.Z, issue the following command:

# uncompress -c vmcore.0.Z | dd of=vmcore.0 conv=sparse
262144+0 records in
262144+0 records out
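The same technique applies to a dump that was compressed with gzip. The following sketch wraps the pipeline in a function and adds a way to compare a file's length with the space it actually occupies. The function names sparse_gunzip and show_usage are hypothetical, and the sketch assumes a dd command that supports conv=sparse.

```shell
#!/bin/sh
# Hypothetical helper: restore a gzip-compressed dump as a sparse file.
# usage: sparse_gunzip vmcore.N.gz vmcore.N
sparse_gunzip() {
    # conv=sparse makes dd seek over all-zero blocks instead of writing
    # them; the record counts that dd prints on stderr are discarded.
    gunzip -c "$1" | dd of="$2" conv=sparse 2>/dev/null
}

# Hypothetical helper: show the file length next to the disk space
# actually allocated, to confirm that the holes were preserved.
show_usage() {
    echo "bytes: $(wc -c < "$1")"
    echo "allocated KB: $(du -k "$1" | awk '{ print $1 }')"
}
```

If the holes were preserved, the allocated size reported by show_usage is smaller than the file length, provided that the file system supports sparse files.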

13.7 Environmental Monitoring

On any system, thermal levels can increase because of poor ventilation, overheating conditions, or fan failure. If these conditions go undetected, an unscheduled shutdown can follow, causing loss of data or damage to the system itself. Environmental Monitoring detects the thermal state of AlphaServer systems and alerts users in time to recover or to perform an orderly shutdown of the system.

This section discusses how Environmental Monitoring is implemented on AlphaServer systems.

13.7.1 Environmental Monitoring Framework

The Environmental Monitoring framework consists of four components: a loadable kernel module and its associated APIs, the Server System MIB subagent daemon, the envmond daemon, and the envconfig utility.

13.7.1.1 Loadable Kernel Module

The loadable kernel module and its associated APIs contain the parameters needed to monitor and return status on your system's threshold levels. The kernel module exports server management attributes as described in Section 13.7.1.1.1 through the kernel configuration manager (CFG) interface only. It works across all platforms that support server management, and provides compatibility for other server management systems under development. The kernel module is supported on all Alpha systems running Version 4.0A or higher of the Tru64 UNIX operating system.

The loadable kernel module does not include platform specific code (such as the location of status registers). It is transparent to the kernel module which options are supported by a platform. That is, the kernel module and platform are designed to return valid data if an option is supported, a fixed constant for unsupported options, or null.

13.7.1.1.1 Specifying Loadable Kernel Attributes

The loadable kernel module exports the parameters listed in Table 13-1 to the kernel configuration manager (CFG).

Table 13-1: Parameters Defined in the Kernel Module

Parameter	Purpose
`env_current_temp`	Specifies the current temperature of the system. If a system is configured with the KCRCM module, the temperature returned is in Celsius. If a system does not support temperature readings and a temperature threshold has not been exceeded, a value of -1 is returned. If a system does not support temperature readings and a temperature threshold is exceeded, a value of -2 is returned.
`env_high_temp_thresh`	Provides a system specific operating temperature threshold. The value returned is a hardcoded, platform specific temperature in Celsius.
`env_fan_status`	Specifies a noncritical fan status. The value returned is a bit value of zero (0). This value will differ when the hardware support is provided for this feature.
`env_ps_status`	Provides the status of the redundant power supply. On platforms that provide interrupts for redundant power supply failures, the corresponding error status bits are read to determine the return value. A value of 1 is returned on error; otherwise, a value of zero (0) is returned.
`env_supported`	Indicates whether or not the platform supports server management and environmental monitoring.
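The return values of env_current_temp need some interpretation because -1 and -2 are status codes rather than temperatures. The following sketch encodes the meanings from Table 13-1. The function name interpret_temp is hypothetical, the two values are assumed to have been read already from the kernel module, and "exceeded" is assumed to mean strictly greater than the threshold.

```shell
#!/bin/sh
# Hypothetical helper: interpret env_current_temp against
# env_high_temp_thresh, following the meanings in Table 13-1.
# usage: interpret_temp CURRENT THRESHOLD
interpret_temp() {
    current=$1 thresh=$2
    case "$current" in
        -1) echo "no temperature reading; threshold not exceeded" ;;
        -2) echo "no temperature reading; threshold exceeded" ;;
        *)
            # Assumption: a reading equal to the threshold is still normal.
            if [ "$current" -gt "$thresh" ]; then
                echo "$current C: above threshold of $thresh C"
            else
                echo "$current C: within threshold of $thresh C"
            fi
            ;;
    esac
}
```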

13.7.1.1.2 Obtaining Platform Specific Functions

The loadable kernel module must return environmental status based on the platform being queried. This section describes the kernel interfaces used. To obtain environmental status, the get_info() function is used. Calls to the get_info() function are filtered through the platform_callsw[] table.

The get_info() function obtains dynamic environmental data using the function types described in Table 13-2.

Table 13-2: `get_info()` Function Types

Function Type	Use of Function
`GET_SYS_TEMP`	Reads the system's internal temperature on platforms that have a KCRCM module configured.
`GET_FAN_STATUS`	Reads fan status from error registers.
`GET_PS_STATUS`	Reads redundant power supply status from error registers.

The get_info() function obtains static data using the HIGH_TEMP_THRESH function type, which reads the platform specific upper threshold operational temperature.

13.7.1.1.3 Server System MIB Subagent

The Server System MIB agent (which is an eSNMP subagent) is used to export a subset of the Environmental Monitoring parameters specified in the Server System MIB. The Compaq Server System MIB exports a common set of hardware specific parameters across all server platforms on all operating systems offered by Compaq. Table 13-3 maps the subset of Server System MIB variables that support Environmental Monitoring to the kernel parameters described in Section 13.7.1.1.1.

Table 13-3: Mapping of Server Subsystem Variables

Server System MIB Variable Name	Kernel Module Parameter
`svrThSensorReading`	`env_current_temp`
`svrThSensorStatus`	`env_current_temp`
`svrThSensorHighThresh`	`env_high_temp_thresh`
`svrPowerSupplyStatus`	`env_ps_status`
`svrFanStatus`	`env_fan_status`

An SNMP MIB compiler and other tools are used to compile the MIB description into code for a skeletal subagent daemon. Communication between the subagent daemon and the eSNMP daemon is handled by interfaces in the eSnmp shared library (libesnmp.so). The subagent daemon must be started when the system boots and after the eSNMP daemon has started.

For each Server System MIB variable listed in Table 13-3, code is provided in the subagent daemon, which accesses the appropriate parameter from the kernel module through the CFG interface.

13.7.1.2 Monitoring Environmental Thresholds

To monitor the system environment, the envmond daemon is used. You can customize the daemon by using the envconfig utility or customize the messages that are broadcast. The following sections discuss the daemon and utility. For more information, see the envmond and envconfig reference pages.

13.7.1.2.1 Environmental Monitoring Daemon

By using the Environmental Monitoring daemon, envmond, threshold levels can be checked and corrective action can be taken before damage occurs to your system. The envmond daemon performs the following:

Queries the system for threshold levels.

Broadcasts a message to users and provides corrective action when a high threshold level or redundant power supply failure has been encountered. When the cooling fan on an AlphaServer 1000A fails, the kernel logs the error, synchronizes the disks, then powers the system down. On all other fan failures, a hard shutdown ensues. (Note that messages can be customized.)

Notifies users when a high temperature threshold condition has been resolved.

Notifies all users that an orderly shutdown is in progress if recovery is not possible.

To query the system, the envmond daemon uses the base operating system command /usr/sbin/snmp_request to obtain the current values of the environment variables specified in the Server System MIB.

To enable Environmental Monitoring, the envmond daemon must be started during the system boot, but after the eSNMP and Server System MIB agents have been started. You can customize the envmond daemon using the envconfig utility.

13.7.1.2.2 Customizing the envmond Daemon

You can use the envconfig utility to customize how the environment is queried by the envmond daemon. These customizations are stored in the /etc/rc.config file, which is read by the envmond daemon during startup. Use the envconfig utility to perform the following:

Turn environmental monitoring on or off during the system boot.

Start or stop the envmond daemon after the system boot.

Specify the interval between queries of the system by the envmond daemon.

Set the highest threshold level that can be encountered before a temperature event is signaled by the envmond daemon.

Specify the path of a user defined script that you want the envmond daemon to execute when a high threshold level is encountered.

Specify the grace period allotted to save data if a shutdown message has been broadcast.

Display the values of the Environmental Monitoring variables.

13.7.1.3 Customizing Environmental Monitoring Messages

You can modify any messages broadcast or logged by the Environmental Monitoring utility. The messages are located in the file: /usr/share/sysman/envmon/EnvMon_UserDefinable_Msg.tcl. You must be root to edit this file and you can edit any message included in braces ({}). The instructions for editing this file are included in the comment (#) fields and you should avoid altering any other data in this file.

For example, you can change the messages to specify the system name (host name) and location as shown in the following example:

Save the file /usr/share/sysman/envmon/EnvMon_UserDefinable_Msg.tcl to a holding file in case of editing errors.

Edit the file using the editor of your choice (the /usr/bin/dt/dtpad editor in CDE, for example).

Search for instances of EnvmMon_Ovrstr and locate the associated text string that is contained in braces ({}).

Modify the string as required. For example, prefix messages with the host name and location of the system by changing message strings as shown in the following samples:
Current message:
```
set EnvmMon_Ovrstr(ENVMON_EVENT_SAFE_MSG) {System temperature is normal
```
Edited message:
```
set EnvmMon_Ovrstr(ENVMON_EVENT_SAFE_MSG) {System ntcstr5 in room 1 aisle 4 - temperature is normal
```

Save the file and exit. Consider running the diff command on the edited file and the holding file to confirm that no other changes were made. An error in this file may prevent the correct transmission of warning messages if a system problem occurs.
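The save and compare steps above can be sketched as two small shell functions. The function names and the .orig suffix for the holding file are hypothetical choices; only the cp, diff, and grep commands are assumed.

```shell
#!/bin/sh
# Hypothetical helper: save a copy of the file before editing it.
save_copy() {
    cp -p "$1" "$1.orig"
}

# Hypothetical helper: after editing, print the number of lines in the
# edited file that differ from the saved copy. Each message changed on
# a single line should account for exactly one line.
changed_lines() {
    diff "$1.orig" "$1" | grep -c '^>'
}
```

If changed_lines reports more lines than the number of messages you edited, inspect the full diff output before putting the file into service.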
