This chapter provides information on the following topics:
Event logging, which is a way to record informational and error messages that are generated by the system. You use the event logs to solve system problems or verify system operations and you can configure event logging to select events in which you have a particular interest. Understanding and configuring the event-logging facilities is described in Section 13.1 and Section 13.2.
Recovering the event logs after a system crash is described in Section 13.3.
System log files require disk space and may periodically require removal and archiving to save space. Maintenance of log files is described in Section 13.4.
When a system or program halts abnormally, a crash dump file or a core file may be created. Options for configuring the crash dump facility and for storing and naming core files are described in Section 13.5 and Section 13.6.
Certain systems allow you to monitor the status of the system hardware, such as temperature and power status. Environmental monitoring is described in Section 13.7.
A related topic is use of the system exerciser tools, which are described in Appendix F.
The Tru64 UNIX operating system uses two mechanisms to log system events:
The system event-logging facility
The binary event-logging facility
The log files that the system
and binary event-logging facilities create have the default protection of
640, are owned by
root, and belong to the
system
group.
You must have the proper authority to examine the files.
The following sections describe the event-logging facilities.
The primary systemwide event-logging facility uses the
syslog
function to log events in ASCII format.
The
syslog
function uses the
syslogd
daemon to collect the messages that are logged by the various
kernel, command, utility, and application programs.
The
syslogd
daemon logs the messages to a local file or forwards the messages
to a remote system, as specified in the
/etc/syslog.conf
file.
When you install your Tru64 UNIX operating system, the
/etc/syslog.conf
file is created and specifies the default event-logging configuration.
The
/etc/syslog.conf
file specifies the file names that
are the destination for the event messages, which are in ASCII format.
Section 13.2.1.1
discusses the
/etc/syslog.conf
file.
The binary event-logging
facility detects hardware and software events in the kernel and logs the detailed
information in binary format records.
Events that are logged by the binary
event-logging facility are also logged by the
syslog
function
in a less detailed, but still informative, summary message.
The binary event-logging facility
uses the
binlogd
daemon to collect various event-log records.
The
binlogd
daemon logs these records to a local file
or forwards the records to a remote system, as specified in the
/etc/binlog.conf
default configuration file, which is created when
you install your Tru64 UNIX system.
In this release, the event management utility of choice is the DECevent
component, in place of the
uerf
error logging facility.
You can examine the binary event-log files by using the
dia
command (preferred) or by using the
uerf
command.
Both
commands translate the records from binary format to ASCII format.
Note
The
uerf facility remains as a component of Tru64 UNIX, but will be retired in a future release of the operating system. See Appendix D or uerf(8) for more information about using uerf.
The DECevent utility is an event management utility that you can use to produce ASCII reports from entries in the system's event log files. You can run the DECevent utility from the command line or select it from the Common Desktop Environment (CDE) Application Manager.
For information about administering the
DECevent
utility, see the following Tru64 UNIX documentation:
DECevent Translation and Reporting Guide
dia(8)
A
new analysis utility that supports recent processors only is provided in Tru64 UNIX.
Compaq
Analyze is designated to be the replacement for
uerf
on
EV6-series processors.
See the
Compaq Analyze Installation
Guide for Compaq Tru64 UNIX,
which can be found on the
Associated Products
CD-ROM.
The Tru64 UNIX
Installation Guide
contains information on installing associated products.
Note that the
sys_check
utility uses DECevent translation
and reporting tools to read system error files such as binary.errlog.saved.
Refer to the
sys_check(8)
reference page for more information.
When you install your system, the default system and binary event-logging configuration is used. You can change the default configuration by modifying the configuration files. You can also modify the binary event-logging configuration, if necessary.
To enable system and binary event-logging, the special files must exist and the event-logging daemons must be running. Refer to Section 13.2.2 and Section 13.2.3 for more information.
If you do not want to use the default system or binary event-logging
configuration, edit the
/etc/syslog.conf
or
/etc/binlog.conf
configuration file to specify how the system should log events.
In the files, you specify the facility, which is the source of a message or
the part of the system that generates a message; the priority, which is the
message's level of severity; and the destination for messages.
The following sections describe how to edit the configuration files.
If you want the
syslogd
daemon to use a configuration file other than the default,
you must specify the file name with the
syslogd
-f
config_file
command.
The following is an example of the default
/etc/syslog.conf
file:
#
# syslogd config file
#
# facilities: kern user mail daemon auth syslog lpr binary
# priorities: emerg alert crit err warning notice info debug
#
# [1] [2]	[3]
kern.debug	/var/adm/syslog.dated/kern.log
user.debug	/var/adm/syslog.dated/user.log
daemon.debug	/var/adm/syslog.dated/daemon.log
auth.crit;syslog.debug	/var/adm/syslog.dated/syslog.log
mail,lpr.debug	/var/adm/syslog.dated/misc.log
msgbuf.err	/var/adm/crash.dated/msgbuf.savecore
kern.debug	/var/adm/messages
kern.debug	/dev/console
*.emerg	*
Each
/etc/syslog.conf
file entry has the following fields, which correspond to the callouts in the example:
[1] Specifies the facility, which is the part of the system generating the message.
[2] Specifies the severity level. The syslogd daemon logs all messages of the specified severity level plus all messages of greater severity. For example, if you specify level err, all messages of levels err, crit, alert, and emerg (or panic) are logged.
[3] Specifies the destination where the messages are logged.
The
syslogd
daemon ignores blank lines and lines that begin with a number
sign (#).
You can specify a number sign (#) as the first character in a line
to include comments in the
/etc/syslog.conf
file or to
disable an entry.
The facility and severity level are separated from the destination by one or more tabs.
You can specify more than one facility and its severity level by separating
them with semicolons.
In the preceding example, messages from the
auth
facility of
crit
severity level and higher
and messages from the
syslog
facility of
debug
severity level and higher are logged to the
/var/adm/syslog.dated/syslog.log
file.
You can specify more than one facility by separating them with commas.
In the preceding example, messages from the
mail
and
lpr
facilities of
debug
severity level and higher
are logged to the
/var/adm/syslog.dated/misc.log
file.
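The comma and semicolon rules above can be illustrated with a small shell sketch. The function name expand_selector is hypothetical and is not part of Tru64 UNIX; it only models how a selector field such as auth.crit;syslog.debug or mail,lpr.debug breaks down into facility and severity pairs, assuming each semicolon-separated part ends in a single severity level.

```sh
#!/bin/sh
# Hypothetical helper (not a system utility): expand a syslog.conf
# selector field into one "facility level" pair per line.
expand_selector() (
    set -f                      # keep "*" literal instead of globbing it
    IFS=';'                     # semicolons separate facility.level groups
    for part in $1; do
        level=${part##*.}       # text after the last dot
        facilities=${part%.*}   # text before the last dot
        IFS=','                 # commas separate facilities sharing a level
        for facility in $facilities; do
            printf '%s %s\n' "$facility" "$level"
        done
    done
)

# Example: expand_selector 'mail,lpr.debug'
```

The function body runs in a subshell, so the changes to IFS and to globbing do not leak into the calling shell.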
You can specify the following facilities:
| Facility | Description |
| kern | Messages generated by the kernel. These messages cannot be generated by any user process. |
| user | Messages generated by user processes. This is the default facility. |
| mail | Messages generated by the mail system. |
| daemon | Messages generated by the system daemons. |
| auth | Messages generated by the authorization system (for example: login, su, and getty). |
| lpr | Messages generated by the line printer spooling system (for example: lpr, lpc, and lpd). |
| local0 | Reserved for local use, along with local1 to local7. |
| mark | Receives a message of priority info every 20 minutes, unless a different interval is specified with the syslogd -m option. |
| msgbuf | Kernel syslog message buffer recovered from a system crash. The savecore command and the syslogd daemon use the msgbuf facility to recover system event messages from a crash. |
| * | Messages generated by all parts of the system. |
You can specify the following severity levels, which are listed in order of highest to lowest severity:
| Severity Level | Description |
| emerg or panic | A panic condition. You can broadcast these messages to all users. |
| alert | A condition that you should immediately correct, such as a corrupted system database. |
| crit | A critical condition, such as a hard device error. |
| err | Error messages. |
| warning or warn | Warning messages. |
| notice | Conditions that are not error conditions, but are handled as special cases. |
| info | Informational messages. |
| debug | Messages containing information that is used to debug a program. |
| none | Disables a specific facility's messages. |
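The threshold rule (an entry selects messages at its severity level and at every higher level) can be sketched as a comparison of ranks taken from the order of the table above. The function names severity_rank and is_selected are hypothetical; the sketch models the rule only and does not reproduce syslogd itself. The none keyword is deliberately left out because it disables a facility rather than naming a level.

```sh
#!/bin/sh
# Hypothetical helpers: model the severity threshold described above.
# Lower rank means higher severity, following the table order.
severity_rank() {
    case "$1" in
        emerg|panic)  echo 0 ;;
        alert)        echo 1 ;;
        crit)         echo 2 ;;
        err)          echo 3 ;;
        warning|warn) echo 4 ;;
        notice)       echo 5 ;;
        info)         echo 6 ;;
        debug)        echo 7 ;;
        *)            return 1 ;;   # unknown level name
    esac
}

# Succeeds when a message at level $2 would be selected by an entry
# whose configured level is $1. Returns 2 for an unknown level name.
is_selected() {
    threshold=$(severity_rank "$1") || return 2
    message=$(severity_rank "$2") || return 2
    [ "$message" -le "$threshold" ]
}

# Example: is_selected err crit && echo "crit is logged by an err entry"
```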
You can specify the following message destinations:
| Destination | Description |
| Full pathname | Appends messages to the specified file. You should direct each facility's messages to separate files (for example: kern.log, mail.log, or lpr.log). |
| Host name preceded by an at sign (@) | Forwards messages to the syslogd daemon on the specified host. |
| List of users separated by commas | Writes messages to the specified users if they are logged in. |
| * | Writes messages to all the users who are logged in. |
You can specify in the
/etc/syslog.conf
file that the
syslogd
daemon
create daily log files.
To create daily log files, use the following syntax
to specify the pathname of the message destination:
/var/adm/syslog.dated/file
The
file
variable specifies the name of the
log file, for example,
mail.log
or
kern.log.
If you specify a
/var/adm/syslog.dated/file
pathname destination, each day the
syslogd
daemon creates a subdirectory under the
/var/adm/syslog.dated
directory and a log file in the subdirectory by using the following
syntax:
/var/adm/syslog.dated/date/file
The date variable specifies the day, month, and time that the log file was created.
The
file
variable specifies the name of the
log file you previously specified in the
/etc/syslog.conf
file.
The
syslogd
daemon automatically creates a new
date
directory every 24 hours and also when you boot the system.
For example, to create a daily log file of all mail messages of level
info
or higher, edit the
/etc/syslog.conf
file
and specify an entry similar to the following:
mail.info /var/adm/syslog.dated/mail.log
If you specify the previous entry, the
syslogd
daemon could create the following daily directory and file:
/var/adm/syslog.dated/11-Jan-12:10/mail.log
If you want the
binlogd
daemon
to use a configuration file other than the default, specify the file name
with the
binlogd -f
config_file
command.
The following is an example of a
/etc/binlog.conf
file:
#
# binlogd configuration file
#
# format of a line: event_code.priority destination
#
# where:
#   event_code - see codes in binlog.h and man page, * = all events
#   priority - severe, high, low, * = all priorities
#   destination - local file pathname or remote system hostname
#
#
*.*	/usr/adm/binary.errlog
dumpfile	/usr/adm/crash/binlogdumpfile
102.high	/usr/adm/disk.errlog
[1] [2]	[3]
Each entry in the
/etc/binlog.conf
file, except the
dumpfile
event class entry, contains three fields:
[1] Specifies the event class code that indicates the part of the system generating the event.
[2] Specifies the severity level of the event. Do not specify a severity level if you specify dumpfile for an event class.
[3] Specifies the destination where the binary event records are logged.
The
binlogd
daemon ignores blank lines and lines
that begin with a number sign (#).
You can specify a number sign (#) as the
first character in a line to include comments in the file or to disable an
entry.
The event class and severity level are separated from the destination by one or more tabs.
You can specify the following event class codes:
| Class Code | General |
| * | All event classes. |
| dumpfile | Specifies the recovery of the kernel binary event log buffer from a crash dump. A severity level cannot be specified. |
| Class Code | Hardware-Detected Events |
| 100 | CPU machine checks and exceptions |
| 101 | Memory |
| 102 | Disks |
| 103 | Tapes |
| 104 | Device controller |
| 105 | Adapters |
| 106 | Buses |
| 107 | Stray interrupts |
| 108 | Console events |
| 109 | Stack dumps |
| 199 | SCSI CAM events |
| Class Code | Software-Detected Events |
| 201 | CI port-to-port-driver events |
| 202 | System communications services events |
| Class Code | Informational ASCII Messages |
| 250 | Generic |
| Class Code | Operational Events |
| 300 | Startup ASCII messages |
| 301 | Shutdown ASCII messages |
| 302 | Panic messages |
| 310 | Time stamp |
| 350 | Diagnostic status messages |
| 351 | Repair and maintenance messages |
You can specify the following severity levels:
| Severity Level | Description |
| * | All severity levels |
| severe | Unrecoverable events that are usually fatal to system operation |
| high | Recoverable events or unrecoverable events that are not fatal to system operation |
| low | Informational events |
You can specify the following destinations:
| Destination | Description |
| Full pathname | Specifies the file name to which the binlogd daemon appends the binary event records. |
| @hostname | Specifies the name of the host (preceded by an @) to which the binlogd daemon forwards the binary event records. If you specify dumpfile for an event class, you cannot forward records to a host. |
The
syslogd
daemon cannot log kernel messages unless the
/dev/klog
character special file exists.
If the
/dev/klog
file does
not exist, create it by using the following command syntax:
/dev/MAKEDEV /dev/klog
Also, the
binlogd
daemon cannot log local system
events unless the
/dev/kbinlog
character special file exists.
If the
/dev/kbinlog
file does not exist, create it by
using the following command syntax:
/dev/MAKEDEV /dev/kbinlog
Refer to the
MAKEDEV(8)
reference page for more information.
The
syslogd
and
binlogd
daemons are automatically started by the
init
program during system startup.
However, you must ensure that the daemons
are started.
You can also specify options with the command that starts the
daemons.
Refer to the
init(8)
reference page for more information.
You
must ensure that the
syslogd
daemon is started by the
init
program.
If the
syslogd
daemon is not started
or if you want to specify options with the command that starts the
syslogd
daemon, you must edit the
/sbin/init.d/syslog
file and either include or modify the
syslogd
command line.
Note that you can also invoke the command manually.
The command
that starts the
syslogd
daemon has the following syntax:
/usr/sbin/syslogd [-d] [-f config_file] [-m mark_interval]
Refer to the
syslogd(8)
reference page for information about command
options.
Note
You must ensure that the
/var/adm directory is mounted, or the syslogd daemon will not work correctly.
The
syslogd
daemon reads messages from the following:
The UNIX domain socket
/dev/log
file, which is automatically created by the
syslogd
daemon
An Internet domain socket, which is specified in the
/etc/services
file
The special file
/dev/klog, which logs
only kernel messages
Other programs log their messages by using the
openlog,
syslog, and
closelog
calls.
When the
syslogd
daemon is started, it creates the
/var/run/syslog.pid
file, where the
syslogd
daemon
stores its process identification number.
Use the process identification
number to stop the
syslogd
daemon before you shut down
the system.
During normal system operation, the
syslogd
daemon
is called if data is put in the kernel syslog message buffer, located in physical
memory.
The
syslogd
daemon reads the
/dev/klog
file and gets a copy of the kernel syslog message buffer.
The
syslogd
daemon starts at the beginning of the buffer and sequentially
processes each message that it finds.
Each message is prefixed by facility
and priority codes, which are the same as those specified in the
/etc/syslog.conf
file.
The
syslogd
daemon then
sends the messages to the destinations specified in the file.
To stop the
syslogd
event-logging daemon, use
the following command:
# kill `cat /var/run/syslog.pid`
You can apply changes that you make to the
/etc/syslog.conf
configuration file without shutting down the system by using the
following command:
# kill -HUP `cat /var/run/syslog.pid`
You
must ensure that the
init
program starts the
binlogd
daemon.
If the
binlogd
daemon does not
start, or if you want to specify options with the command that starts the
binlogd
daemon, you must edit the
/sbin/init.d/syslog
file and either include or modify the
binlogd
command line.
Note that you can also invoke the command manually.
The command that starts the
binlogd
daemon
has the following syntax:
/usr/sbin/binlogd [-d] [-f config_file]
Refer to the
binlogd(8)
reference page for information on command options.
The
binlogd
daemon reads binary event records from
the following:
An Internet domain socket (binlogd,
706/udp), which is specified in the
/etc/services
file
The
/dev/kbinlog
special file
When the
binlogd
daemon starts, it creates the
/var/run/binlogd.pid
file, where the
binlogd
daemon stores its process identification number.
Use the process identification
number to stop or reconfigure the
binlogd
daemon.
During normal system operation, the
binlogd
daemon
is called if data is put into the kernel's binary event-log buffer or if data
is received on the Internet domain socket.
The
binlogd
daemon then reads the data from the
/dev/kbinlog
special
file or from the socket.
Each record contains an event class code and a severity
level code.
The
binlogd
daemon processes each binary event
record and logs it to the destination specified in the
/etc/binlog.conf
file.
To stop the
binlogd
daemon,
use the following command:
# kill `cat /var/run/binlogd.pid`
You can apply changes that you make to the
/etc/binlog.conf
configuration file without shutting down the system by using the
following command:
# kill -HUP `cat /var/run/binlogd.pid`
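The kill commands for both daemons follow the same pattern: read the process identification number from the daemon's file in /var/run, then send a signal. The following is a minimal sketch of a wrapper that checks the file before signaling. The function name signal_logger is hypothetical; only the two PID file paths come from this chapter.

```sh
#!/bin/sh
# Hypothetical wrapper: send a signal (HUP by default) to the daemon
# whose PID is stored in the given file, after basic sanity checks.
signal_logger() {
    pidfile=$1
    signal=${2:-HUP}
    if [ ! -r "$pidfile" ]; then
        echo "cannot read $pidfile; is the daemon running?" >&2
        return 1
    fi
    pid=$(cat "$pidfile")
    case "$pid" in
        ''|*[!0-9]*)
            # Refuse to signal anything that is not a plain number.
            echo "unexpected contents in $pidfile" >&2
            return 1
            ;;
    esac
    kill -"$signal" "$pid"
}

# Examples (run as root):
#   signal_logger /var/run/syslog.pid        # reread /etc/syslog.conf
#   signal_logger /var/run/binlogd.pid       # reread /etc/binlog.conf
#   signal_logger /var/run/binlogd.pid TERM  # stop binlogd
```

Checking the file first avoids running kill with an empty argument when a daemon has not been started.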
You can configure the kernel binary event logger by modifying the default keywords and rebuilding the kernel. You can scale the size of the kernel binary event-log buffer to meet your system's needs. You can enable and disable the binary event logger and the logging of kernel ASCII messages into the binary event log.
The
/sys/data/binlog_data.c
file defines the
binary event-logger configuration.
The default configuration specifies a buffer
size of 24K bytes, enables binary event logging, and disables the logging
of kernel ASCII messages.
You can modify the configuration by changing the
values of the
binlog_bufsize
and
binlog_status
keywords in the file.
The
binlog_bufsize
keyword specifies the size of
the kernel buffer that the binary event logger uses.
The size of the buffer
can be between 8 kilobytes (8192 bytes) and 48 kilobytes (49152 bytes).
Small
system configurations, such as workstations, can use a small buffer.
Large
server systems that use many disks may need a large buffer.
The
binlog_status
keyword specifies the behavior
of the binary event logger.
You can specify the following values for the
binlog_status
keyword:
0 (zero) - Disables the binary event logger.
BINLOG_ON - Enables the binary event logger.
BINLOG_ASCIION - Enables the logging of kernel ASCII messages into the binary
event log if the binary event logger is enabled.
This value must be specified
with the
BINLOG_ON
value as follows:
int binlog_status = BINLOG_ON | BINLOG_ASCII;
After you modify the
/sys/data/binlog_data.c
file,
you must rebuild and boot the new kernel.
You can recover unprocessed messages and binary event-log records from a system crash when you reboot the system.
The
msgbuf.err
entry in the
/etc/syslog.conf
file specifies the destination of the kernel syslog message buffer
msgbuf
that is recovered from the dump file.
The default
/etc/syslog.conf
file entry for the kernel syslog message buffer
file is as follows:
msgbuf.err /var/adm/crash/msgbuf.savecore
The
dumpfile
entry in the
/etc/binlog.conf
file specifies the file name destination for the kernel binary
event-log buffer that is recovered from the dump file.
The default
/etc/binlog.conf
file entry for the kernel binary event-log buffer
file is as follows:
dumpfile /usr/adm/crash/binlogdumpfile
If a crash occurs, the
syslogd
and
binlogd
daemons cannot read the
/dev/klog
and
/dev/kbinlog
special files and
process the messages and binary event records.
When you reboot the system,
the
savecore
command runs and, if a dump file exists, recovers
the kernel syslog message and binary event-log buffers from the dump file.
After
savecore
runs, the
syslogd
and
binlogd
daemons are started.
The
syslogd
daemon reads the syslog message buffer
file, checks that its data is valid, and then processes it in the same way
that it normally processes data from the
/dev/klog
file,
using the information in the
/etc/syslog.conf
file.
The
binlogd
daemon reads the binary event-log buffer
file, checks that its data is valid, and then processes the file in the same
way that it processes data from the
/dev/kbinlog
special
file, using the information in the
/etc/binlog.conf
file.
After the
syslogd
and
binlogd
daemons are finished with the buffer files, the files are deleted.
If you
specify full pathnames for the message destinations in the
/etc/syslog.conf
and
/etc/binlog.conf
files, the log files will
grow in size.
Also, if you configure the
syslogd
daemon
to create daily directories and log files, eventually there will be many directories
and files, although the files themselves will be small.
Therefore, you must
keep track of the size and the number of log files and daily directories and
delete files and directories if they become unwieldy.
You can also use the
cron
daemon to specify that
log files be deleted.
The following is an example of a
crontab
file entry:
5 1 * * * find /var/adm/syslog.dated -type d -mtime +5 -exec rm -rf '{}' \;
This command line causes all directories under
/var/adm/syslog.dated
that were modified more than five days ago to be deleted, along
with their contents, every day at 1:05.
Refer to the
crontab(1)
reference page
for more information.
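Before you schedule a deletion like the one in the crontab entry, it can help to see which directories the same find expression would select. The following sketch uses the same -type d and -mtime tests but prints the matches instead of removing them. The function name list_old_log_dirs and its arguments are hypothetical.

```sh
#!/bin/sh
# Hypothetical dry run: print the directories that the crontab entry
# above would delete, without removing anything.
#   $1 - top-level log directory (default: /var/adm/syslog.dated)
#   $2 - age threshold in days   (default: 5)
list_old_log_dirs() {
    dir=${1:-/var/adm/syslog.dated}
    days=${2:-5}
    find "$dir" -type d -mtime +"$days" -print
}

# Example: list_old_log_dirs /var/adm/syslog.dated 5
```

Note that find also tests the top-level directory itself, so it appears in the output if it has not been modified within the threshold.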
By default, when a core file is written to disk, the system saves
the file under the name
core.
Each subsequent core file
overwrites its predecessor because the file name is identical.
When you enable
enhanced core file naming, the system attempts to create unique names
for core files in the form
core.prog-name.host-name.tag.
The uniquely named files are not overwritten by subsequent
core files, which prevents the loss of valuable debugging information
when the same program or multiple programs fail multiple times (and perhaps
for different reasons).
The enhanced name provides the following identification data:
core - The literal string
core
program_name - Up to sixteen characters taken from the
program name as shown by the
ps
command.
host_name - Up to 16 characters of the system's network
host name, taken from the part of
the host name that precedes the first dot.
For example, the fourth core file
generated on host
buggy.net.ooze.com
by the program
dropsy
would be
core.dropsy.buggy.3.
numeric_tag - The tag assigned to the core file to make it unique among all the core files generated by a program on a host. The maximum value for this tag, and thus the maximum number of core files for this program and host, is set by a system configuration parameter.
Note that the tag is not a literal version number. The system selects the first available unique tag for the core file. For example, if a program's core files have tags .0, .1, and .3, the system uses tag .2 for the next core file it creates for that program. By default, the system can create up to 16 versions of a core file. If the system-configured limit for core file instances is reached, the system will not create any more core files for that program and host combination.
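The tag selection rule described above (use the first unused tag, and stop once the configured limit is reached) can be modeled with a short shell sketch. The function name next_core_tag is hypothetical, and the sketch assumes that tags start at 0 and that a limit of N allows tags 0 through N-1. It illustrates the rule only and is not the kernel's implementation.

```sh
#!/bin/sh
# Hypothetical model of the tag selection rule for enhanced core names.
#   $1 - directory holding the core files
#   $2 - program name      $3 - short host name
#   $4 - maximum number of versions (default: 16)
# Prints the first unused tag, or fails if every tag is taken.
next_core_tag() {
    dir=$1 prog=$2 host=$3 max=${4:-16}
    tag=0
    while [ "$tag" -lt "$max" ]; do
        if [ ! -e "$dir/core.$prog.$host.$tag" ]; then
            echo "$tag"
            return 0
        fi
        tag=$((tag + 1))
    done
    return 1    # limit reached: no further core file would be created
}

# Example: next_core_tag . dropsy buggy
```

With existing files tagged .0, .1, and .3, the sketch prints 2, which matches the example in the text.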
If you plan to save a number of uniquely named core files, be aware that core files can quickly consume available disk space. Allowing core files to be saved under different names in a file system with minimal free space can potentially fill your disk because the files are not overwritten when new core files are created. If you enable this feature, make sure you remove old core files when you have finished examining them.
You can enable this feature at the system level by setting the
enhanced-core-name
system configuration variable to 1 in the
proc
subsystem, as in the following example:
proc:
enhanced-core-name = 1
The system manager can limit the
number of unique core file versions that a program can create on a specific
host system by setting the system configuration variable
enhanced-core-max-versions
to the desired value, as in the following example:
proc:
enhanced-core-name = 1
enhanced-core-max-versions = 8
The minimum value is 1, the maximum value is 99,999, and the default is 16. Refer to Chapter 5 and in particular Section 5.2.1.2 for information on setting the attributes.
You can enable enhanced core file naming at the program level by calling
the
uswitch
system call with the
USW_CORE
flag set, as in the following example:
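#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uswitch.h>

/*
 * Request enhanced core file naming for
 * this process, then create a core file.
 */
int main(void)
{
    /* Read the current switch values, then add USW_CORE to them. */
    long uval = uswitch(USC_GET, 0);

    uval = uswitch(USC_SET, uval | USW_CORE);
    if (uval < 0) {
        perror("uswitch");
        exit(1);
    }
    raise(SIGQUIT);    /* the default action for SIGQUIT dumps core */
    return 0;
}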
When a Tru64 UNIX system crashes, it writes all or part of physical memory to swap space on disk. This information is called a crash dump. During the reboot process, the system moves the crash dump into a file and copies the kernel executable image to another file. Together, these files are the crash dump files. You can use the information in the crash dump files to help you to determine the cause of the system crash.
Crash dump files are required for analysis when a system crashes, or during the development of custom kernels (debugging). You may also have to supply a crash dump file to Technical Support to analyze system problems. To do this, you must understand how crash dump files are created. You must reserve space on disks for the crash dump and crash dump files. The amount of space you reserve depends on your system configuration and the type of crash dump you want the system to perform.
The sections that follow provide information to help you manage crash dumps and crash dump files. For information on analyzing crash dump log files, refer to the Kernel Debugging guide.
The following documentation contains information on crash dumps and related topics, such as swap space requirements:
Installation Guide - Provides information on the initial swap space and dump settings configured during installation
Kernel Debugging - Provides information on analyzing crash dumps. Note that you may need to install Development subsets and appropriate licenses in order to use the debugger. The guide contains information on:
Crash Dump creation and content
Planning and estimating dump sizes and space requirements
Logging and log files
Forcing crash dumps
Archiving dumps
savecore(8)
- Describes the program that copies a core
dump from swap partitions to a file.
expand_dump(8)
- Describes the program that produces a non-compressed
kernel crash dump file.
sysconfig(8)
and
sysconfigdb(8)
- Describe the programs
that maintain the kernel subsystem configuration and are used to set crash
dump attributes in the kernel to control crash behavior.
You can also use
the graphical interface
/usr/bin/X11/dxkerneltuner
to modify
kernel attributes.
See the
dxkerneltuner(8)
reference page for information.
On-line
help is also available for this interface.
The
dxkerneltuner
interface can also be launched from the CDE Desktop by invoking the Application
Manager, System Admin.
swapon(8)
- Describes the program that creates additional
file(s) for paging and swapping.
Use
swapon
if you need
to add additional temporary or permanent swap space to produce full dumps.
dbx(1)
- Describes the source level debugger.
By default, the
savecore
command copies crash dump
files into
/var/adm/crash, although you can redirect crash
dumps to any file system that you designate.
The following files are created
or used during a crash:
/var/adm/crash/vmzcore.n
- The crash
dump file, named
vmcore.n
if the file is not compressed
(no
z)
/var/adm/crash/bounds
- A text file
that specifies the incremental number of the next dump (The
n
in
vmzcore.n)
/var/adm/crash/minfree
- The file
that specifies the minimum number of kilobytes to be left after crash dump
files are written
/var/adm/crash/vmunix.n
- A copy
of the kernel that was running at the time of the crash, typically
/vmunix.
/etc/syslog.conf
and
/etc/binlog.conf
- The logging configuration files
After a system crash, you normally reboot your system by
issuing the
boot
command at the console prompt.
During
a system reboot, the
/sbin/savecore
script invokes the
savecore
command.
This command moves crash dump information from
the swap partitions into a file and copies the kernel that was running at
the time of the crash into another file.
You can analyze these files to
help you determine the cause of a crash.
The
savecore
command also logs the crash in system log files.
You can invoke the
savecore
command from the command
line.
For information about the command syntax, see the
savecore(8)
reference
page.
When the
savecore
command begins running during the reboot process, it determines whether a
crash dump occurred and whether the file system contains enough space to
save it.
(The system saves no crash dump if you shut it down and reboot
it; that is, the system saves a crash dump only when it crashes.)
If a crash dump exists and
the file system contains enough space to save the crash dump files, the
savecore
command moves the crash dump and a copy of the kernel
into files in the default crash directory,
/var/adm/crash.
(You can modify the location of the crash directory.) The
savecore
command stores the kernel image in a file named
vmunix.n, and by default it stores the (compressed) contents
of physical memory in a file named
vmzcore.n.
The
n
variable specifies the number of the crash.
The number of
the crash is recorded in the
bounds
file in the crash
directory.
After the first crash, the
savecore
command
creates the
bounds
file and stores the number 1 in it.
The command increments that value for each succeeding crash.
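The relationship between the bounds file and the dump file names can be sketched as follows. The function name next_dump_names is hypothetical. The sketch assumes that a missing bounds file means the next crash is number 0, which is inferred from the statement that savecore stores 1 in the file after the first crash.

```sh
#!/bin/sh
# Hypothetical helper: print the file names that the next saved crash
# dump would use, based on the number stored in the bounds file.
#   $1 - crash directory (default: /var/adm/crash)
next_dump_names() {
    crashdir=${1:-/var/adm/crash}
    # Assumption: no bounds file yet means no crash has been saved,
    # so the next crash is number 0.
    n=$(cat "$crashdir/bounds" 2>/dev/null) || n=0
    echo "vmunix.$n"
    echo "vmzcore.$n"
}

# Example: next_dump_names /var/adm/crash
```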
The
savecore
command runs early in the reboot process
so that little or no system swapping occurs before the command runs.
This
practice helps ensure that crash dumps are not corrupted by swapping.
Once the
savecore
command writes the crash dump files, it performs the following steps to log
the crash in system log files:
Writes a reboot message to the
/var/adm/syslog/auth.log
file.
If the system crashed due to a panic condition, the panic
string is included in the log entry.
You can cause
the
savecore
command to write the reboot message to another
file by modifying the
auth
facility entry in the
syslog.conf
file.
If you remove the
auth
entry
from the
syslog.conf
file, the
savecore
command does not save the reboot message.
Attempts to save the kernel message buffer from the crash dump. The kernel message buffer contains messages created by the kernel that crashed. These messages might help you determine the cause of the crash.
The
savecore
command saves the kernel message buffer in the
/var/adm/crash/msgbuf.savecore
file, by default.
You can change
the location to which
savecore
writes the kernel message
buffer by modifying the
msgbuf.err
entry in the
/etc/syslog.conf
file.
If you remove the
msgbuf.err
entry from the
/etc/syslog.conf
file,
savecore
does not save the kernel message buffer.
Later in the
reboot process, the
syslogd
daemon starts up, reads the
contents of the
msgbuf.err
file, and moves those contents
into the
/var/adm/syslog/kern.log
file, as specified in
the
/etc/syslog.conf
file.
The
syslogd
daemon then deletes the
msgbuf.err
file.
For more information
about how system logging is performed, see the
syslogd(8)
reference page.
Attempts to save the binary event buffer from the crash dump. The binary event buffer contains messages that can help you identify the problem that caused the crash, particularly if the crash was due to a hardware error.
The
savecore
command saves the binary event buffer in the
/usr/adm/crash/binlogdumpfile
file by default.
You can change the
location to which
savecore
writes the binary event buffer
by modifying the
dumpfile
entry in the
/etc/binlog.conf
file.
If you remove the
dumpfile
entry from
the
/etc/binlog.conf
file,
savecore
does not save the binary event buffer.
Later in the reboot process
the
binlogd
daemon starts up, reads the contents of the
/usr/adm/crash/binlogdumpfile
file, and moves those contents into
the
/usr/adm/binary.errlog
file, as specified in the
/etc/binlog.conf
file.
The
binlogd
daemon then
deletes the
binlogdumpfile
file.
For more information
about how binary error logging is performed, see the
binlogd(8)
reference page.
When the system creates a crash dump, it writes the dump to the swap partitions. The system uses the swap partitions because the information stored in those partitions has meaning only for a running system. Once the system crashes, the information is useless and can be safely overwritten.
Before the system writes a crash dump, it determines how the dump fits
into the swap partitions, which are defined in the
/etc/fstab
file.
For example, the following fragment of the
/etc/fstab
file shows three swap partitions available:
/dev/rz1b swap1 ufs sw 0 2
/dev/rz3h swap2 ufs sw 0 2
/dev/rz4b swap3 ufs sw 0 2
You use the
swapon
command to modify available
swap space.
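As a rough illustration of how the swap entries in such a fragment can be identified, the following Python sketch lists the devices whose options field contains sw. The function name swap_partitions is hypothetical, the field layout is taken from the fragment above, and the handling of # comments is an assumption rather than something this chapter specifies.

```python
def swap_partitions(fstab_text):
    """Return device names of entries whose options field contains 'sw'."""
    devices = []
    for line in fstab_text.splitlines():
        # Assumption: '#' starts a comment; blank lines are ignored.
        line = line.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 4 and "sw" in fields[3].split(","):
            devices.append(fields[0])
    return devices


# The three entries shown in the fragment above.
FRAGMENT = """\
/dev/rz1b swap1 ufs sw 0 2
/dev/rz3h swap2 ufs sw 0 2
/dev/rz4b swap3 ufs sw 0 2
"""

if __name__ == "__main__":
    print(swap_partitions(FRAGMENT))
```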
The following list describes how the system determines where to write the crash dump:
If the crash dump fits in the
primary swap partition, the system writes it to that partition,
/dev/rz1b.
The system writes the dump as far toward the end of the partition as possible,
leaving the beginning of the partition available for boot-time swapping.
If the crash dump is too large for the primary swap partition
but fits in the secondary or tertiary swap space, the system writes the crash
dump to the other swap partitions,
/dev/rz3h
and
/dev/rz4b.
If the crash dump is too large for all the available swap partitions, the system writes the crash dump to the swap partitions until those partitions are full. It then writes the remaining crash dump information to the end of the primary swap partition, possibly filling that partition.
Note
If the aggregate size of all the swap partitions is too small to contain the crash dump, the system creates no crash dump.
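The placement rules in the preceding list can be summarized in a short Python sketch. It is a deliberately simplified model, not the kernel's algorithm: the function name place_dump is hypothetical, sizes are in arbitrary blocks, the secondary partitions are treated as one pool, and both the dump header and the dump_sp_threshold attribute are ignored.

```python
def place_dump(dump_blocks, primary_blocks, secondary_blocks):
    """Return how many blocks land on the primary and secondary partitions.

    Returns None when the dump exceeds the aggregate swap space,
    in which case no crash dump is created.
    """
    secondary_total = sum(secondary_blocks)
    if dump_blocks > primary_blocks + secondary_total:
        return None
    if dump_blocks <= primary_blocks:
        # The dump fits in the primary swap partition.
        return {"primary": dump_blocks, "secondary": 0}
    if dump_blocks <= secondary_total:
        # Too large for the primary, but fits in the other partitions.
        return {"primary": 0, "secondary": dump_blocks}
    # Fill the secondary partitions, then spill the rest onto the primary.
    return {"primary": dump_blocks - secondary_total,
            "secondary": secondary_total}


if __name__ == "__main__":
    print(place_dump(3500, 2000, [1000, 1000]))
```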
Each crash
dump contains a header, which the system always writes to the end of the primary
swap partition.
The header contains information about the size of the dump
and where the dump is stored.
This information allows
savecore
to find and save the dump at system reboot time.
The
way that a crash dump is taken can be controlled by the
dump_sp_threshold
kernel attribute, which controls the partitions to which the crash
dump is written.
The default value of 4096 causes the primary swap partition
to be used exclusively for crash dumps that are small enough to fit the partition.
In most cases, compressed dumps fit on the primary swap partition and
you do not need to modify this value.
If required, you can configure
the system so that it fills the secondary swap partitions with dump information
before writing any information (except the dump header) to the primary swap
partition.
The value in the
dump_sp_threshold
attribute indicates
the amount of space you normally want available for swapping as the system
reboots.
By default, this attribute is set to 4096 blocks, meaning that the
system attempts to leave 2 MB of disk space open in the primary swap partition
after the dump is written.
Refer to the
Kernel Debugging
guide for additional
information on this setting.
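Because the threshold is expressed in blocks, it can help to convert between blocks and megabytes when choosing a value. The following Python sketch assumes 512-byte blocks, which is what the default of 4096 blocks for 2 MB implies; the function names are hypothetical.

```python
BLOCK_SIZE = 512  # bytes per block, implied by 4096 blocks = 2 MB


def blocks_to_mb(blocks):
    """Convert a count of 512-byte blocks to megabytes."""
    return blocks * BLOCK_SIZE / (1024 * 1024)


def mb_to_blocks(megabytes):
    """Convert whole megabytes to a count of 512-byte blocks."""
    return megabytes * 1024 * 1024 // BLOCK_SIZE


if __name__ == "__main__":
    print(blocks_to_mb(4096))  # the default threshold, in MB
    print(mb_to_blocks(2))     # 2 MB, in blocks
```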
To allow space for crash dumps, adjust the size of the swap
partitions to create temporary or permanent swap space.
For information about
modifying the size of swap partitions, see the
swapon(8)
reference page.
Note
Be sure all permanent swap partitions are listed in the
/etc/fstab file. The savecore command, which copies the crash dump from swap partitions to a file, uses the information in the /etc/fstab file to find the swap partitions. If you omit a swap partition, the savecore command might be unable to find the omitted partition.
You can control
the default location of the crash directory with the
rcmgr
command.
For example, to save crash dump files in the
/usr/adm/crash2
directory by default (at each system startup), issue
the following command:
# /usr/sbin/rcmgr set SAVECORE_DIR /usr/adm/crash2
If you want the system to return to multiuser mode, regardless of whether it saved a crash dump, issue the following command:
# /usr/sbin/rcmgr set SAVECORE_FLAGS M
Crash dumps are compressed and partial by default, but can be full, noncompressed, or both if required. Normally, partial crash dumps provide the information that you need to determine the cause of a crash. However, you might want the system to generate full crash dumps if you have a recurring crash problem and partial crash dumps have not been helpful in finding the cause of the crash.
A partial crash dump contains the following:
The crash dump header
A copy of part of physical memory
The system writes the part of physical memory believed to contain significant information at the time of the system crash, basically kernel node code and data. By default, the system omits user page table entries.
A full crash dump contains the following:
The crash dump header
A copy of the entire contents of physical memory at the time of the crash
You can modify how crash dumps are taken by adjusting the crash dump threshold as described in the following section.
To configure your system so that it writes even small crash dumps to
secondary swap partitions before the primary swap partition, use a large value
for the
dump_sp_threshold
attribute.
As described in
Section 13.6.3, the value you assign to this attribute
indicates the amount of space that you normally want available for system
swapping after a system crash.
To adjust the
dump_sp_threshold
attribute, issue
the
sysconfig
command.
For example, suppose your primary
swap partition is 40 MB.
To raise the value so that the system writes crash
dumps to secondary partitions, issue the
following command:
# sysconfig -r generic dump_sp_threshold=81920
In this
example, the
dump_sp_threshold
attribute, which is in the
generic
subsystem, is set to 81,920 512-byte blocks (40 MB).
Because this value equals the size of the primary swap partition, the system
attempts to leave the entire primary swap partition open for system swapping.
The system automatically writes the
crash dump to secondary swap partitions and the crash dump header to the end
of the primary swap partition.
The
sysconfig
command changes the value of system
attributes for the currently running kernel.
To store the new value of the
dump_sp_threshold
attribute in the
sysconfigtab
database, modify that database using the
sysconfigdb
command.
For information about the
sysconfigtab
database and the
sysconfigdb
command, see the
sysconfigdb(8)
reference page.
Note
Once the
savecore program has copied the crash dump to a file, all swap devices are immediately available for mounting and swapping. The sharing of swap space only occurs for a short time during boot, and usually on systems with a small amount of physical memory.
By default, crash dumps are compressed to save disk space, allowing
you to dump a larger crash dump file to a smaller partition.
This can offer
significant advantages on systems with a large amount of physical memory,
particularly if you want to tune the system to discourage swapping for realtime
operations.
On reboot after a crash, the crash dump utility,
savecore, automatically detects that the dump is compressed, using information
in the crash dump header in swap.
It then copies the crash dump file from
swap to the
/var/adm/crash
directory.
The compressed crash
dump files are identified by the letter
z
in the file name,
to distinguish them from noncompressed crash dump files.
For example:
vmzcore.1.
Refer to the reference pages
savecore(8),
expand_dump(8), and
sysconfig(8)
for information on crash dump compression and how to produce a noncompressed
crash dump file.
You can manually create a crash dump file by forcing a dump using the
console command,
crash, which causes a crash dump file
to be created on a system that is not responding (hung).
It is assumed that
you have planned adequate space for the crash dump file and set any kernel
parameters as described in the preceding sections.
On most hardware platforms, you force a crash dump by performing the following steps:
If your system has a switch for enabling and disabling the Halt button, set that switch to the Enable position.
Press the Halt button.
At the console prompt, enter the
crash
command.
Some systems have no Halt button. In this case, perform the following steps to force a crash dump on a hung system:
Press Ctrl/p at the console.
At the console prompt, enter the
crash
command.
If your system hangs and you force a crash dump, the panic string recorded in the crash dump is the following:
hardware restart
This panic string is always the one recorded when system operation is interrupted by pressing the Halt button or Ctrl/p.
If you are working entirely with
compressed (vmzcore.n) crash
dump files, they should already be sufficiently compressed for efficient
archiving.
However, if you are short of storage space, the following sections
discuss options for further compressing dump files for storage or transmission
in the following cases:
You are working with uncompressed (vmcore.n) crash dump files.
You need the maximum amount of compression possible, for example, if you need to transmit a crash dump file over a slow transmission line.
This section describes how to minimize the size of crash dump files, depending on the type of file.
To compress a
vmcore.n
crash dump file, use a utility such as
gzip,
compress, or
dxarchiver.
For example, the following
command creates a compressed file named
vmcore.3.gz:
% gzip vmcore.3
A
vmzcore.n
crash dump
file uses a special compression method that makes it readable by the current Tru64 UNIX
debuggers and crash analysis tools without requiring decompression.
A
vmzcore.n
file is substantially compressed
compared to the equivalent
vmcore.n
file, but not as much as if the latter had been compressed using a standard
UNIX compression utility such as
gzip.
Standard compression
applied to a
vmcore.n
file
makes the resulting file about 40 percent smaller than the equivalent
vmzcore.n
file.
If you need to apply the maximum compression possible to a
vmzcore.n
file, perform the following
steps:
Uncompress the
vmzcore.n
file using the
expand_dump
command (see
expand_dump(8)).
The following example creates an uncompressed file named
vmcore.3
from the file
vmzcore.3:
% expand_dump vmzcore.3
Compress the resulting
vmcore.n
file using a standard UNIX utility.
The following example uses the
gzip
command to create a compressed file named
vmcore.3.gz:
% gzip vmcore.3
Note
You can uncompress a
vmzcore.n file only with the expand_dump command. (Do not use gunzip, uncompress, or any other utility.) After a
vmzcore.n file has been uncompressed into a vmcore.n file with expand_dump, you cannot compress it back into a vmzcore.n file.
Use care when uncompressing a partial crash
dump file that was compressed from a
vmcore.n
file.
Using the
gunzip
or
uncompress
command with no flags results in a
vmcore.n
file that requires storage space equal to the size of memory.
In other words,
the uncompressed file requires the same amount of disk space as a
vmcore.n
file from a full crash dump.
This situation occurs because the original
vmcore.n
file contains UNIX File System (UFS) file holes,
which are regions that have no associated data blocks.
When a process,
such as the
gunzip
or
uncompress
command,
reads from a hole in a file, the file system returns zero-valued data.
Thus,
memory omitted from the partial dump is added back into the uncompressed
vmcore.n
file as disk blocks containing
all zeros.
To ensure that the uncompressed core file remains at its partial dump
size, you must pipe the output from the
gunzip
or
uncompress
command with the
-c
flag to the
dd
command with the
conv=sparse
option.
For
example, to uncompress a file named
vmcore.0.Z, issue
the following command:
# uncompress -c vmcore.0.Z | dd of=vmcore.0 conv=sparse
262144+0 records in
262144+0 records out
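The idea behind writing a sparse copy can be illustrated with a short Python sketch: blocks that contain only zeros are skipped by seeking instead of writing, so the file system has the opportunity to leave holes in their place. This is a sketch of the general technique, not a description of how dd implements conv=sparse. The function name copy_sparse and the 512-byte default block size are assumptions, and whether holes are actually created depends on the file system.

```python
import os


def copy_sparse(src, dst_path, block_size=512):
    """Copy a binary stream to dst_path, seeking over all-zero blocks.

    Returns the logical size of the copy in bytes.
    """
    zero_block = bytes(block_size)
    total = 0
    with open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            if block == zero_block[:len(block)]:
                # Skip the write; the gap reads back as zeros.
                dst.seek(len(block), os.SEEK_CUR)
            else:
                dst.write(block)
            total += len(block)
        # Set the final length in case the data ended with skipped blocks.
        dst.truncate(total)
    return total
```

The src argument can be any binary stream, for example sys.stdin.buffer when the script sits at the end of a pipeline in place of the dd command.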
On any system, thermal levels can increase because of poor ventilation, overheating conditions, or fan failure. If these conditions go undetected, an unscheduled shutdown could ensue, causing loss of data or damage to the system itself. By using Environmental Monitoring, the thermal state of AlphaServer systems can be detected and users can be alerted in time to recover or perform an orderly shutdown of the system.
This section discusses how Environmental Monitoring is implemented on AlphaServer systems.
The Environmental Monitoring framework consists of four components:
a loadable kernel module and its associated APIs, the Server
System MIB subagent daemon, the
envmond
daemon, and the
envconfig
utility.
The loadable kernel module and its associated APIs contain the parameters needed to monitor and return status on your system's threshold levels. The kernel module exports server management attributes as described in Section 13.7.1.1.1 through the kernel configuration manager (CFG) interface only. It works across all platforms that support server management, and provides compatibility for other server management systems under development. The kernel module is supported on all Alpha systems running Version 4.0A or higher of the Tru64 UNIX operating system.
The loadable kernel module does not include platform specific code (such as the location of status registers). It is transparent to the kernel module which options are supported by a platform. That is, the kernel module and platform are designed to return valid data if an option is supported, a fixed constant for unsupported options, or null.
The loadable kernel module exports the parameters listed in Table 13-1 to the kernel configuration manager (CFG).
| Parameter | Purpose |
| env_current_temp | Specifies the current temperature of the system. If a system is configured with the KCRCM module, the temperature returned is in Celsius. If a system does not support temperature readings and a temperature threshold has not been exceeded, a value of -1 is returned. If a system does not support temperature readings and a temperature threshold is exceeded, a value of -2 is returned. |
| env_high_temp_thresh | Provides a system specific operating temperature threshold. The value returned is a hardcoded, platform specific temperature in Celsius. |
| env_fan_status | Specifies a noncritical fan status. The value returned is a bit value of zero (0). This value will differ when the hardware support is provided for this feature. |
| env_ps_status | Provides the status of the redundant power supply. On platforms that provide interrupts for redundant power supply failures, the corresponding error status bits are read to determine the return value. A value of 1 is returned on error; otherwise, a value of zero (0) is returned. |
| env_supported | Indicates whether or not the platform supports server management and environmental monitoring. |
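A monitoring script has to distinguish the two sentinel values of env_current_temp from a real reading. The following Python sketch shows one way to do that; the function name describe_current_temp is hypothetical, and how the value is obtained from the kernel module is outside the scope of the sketch.

```python
def describe_current_temp(value):
    """Interpret an env_current_temp value as described in the table."""
    if value == -1:
        return "temperature readings unsupported; threshold not exceeded"
    if value == -2:
        return "temperature readings unsupported; threshold exceeded"
    # Any other value is treated as a reading in degrees Celsius.
    return f"{value} degrees Celsius"


if __name__ == "__main__":
    for sample in (35, -1, -2):
        print(sample, "->", describe_current_temp(sample))
```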
The loadable kernel module must return environmental status based on
the platform being queried.
This section describes the kernel interfaces
used.
To obtain environmental status, the
get_info()
function is used.
Calls to the
get_info()
function are filtered through the
platform_callsw[]
table.
The
get_info()
function obtains dynamic environmental
data using the function types described in
Table 13-2.
get_info() Function Types
| Function Type | Use of Function |
| GET_SYS_TEMP | Reads the system's internal temperature on platforms that have a KCRCM module configured. |
| GET_FAN_STATUS | Reads fan status from error registers. |
| GET_PS_STATUS | Reads redundant power supply status from error registers. |
The
get_info()
function obtains static data using the
HIGH_TEMP_THRESH
function type, which reads the platform specific upper threshold
operational temperature.
The Server System MIB Agent (which is an eSNMP subagent) is used to export a subset of the Environmental Monitoring parameters specified in the Server System MIB. The Compaq Server System MIB exports a common set of hardware specific parameters across all server platforms on all operating systems offered by Compaq. Table 13-3 maps the subset of Server System MIB variables that support Environmental Monitoring to the kernel parameters described in Section 13.7.1.1.1.
| Server System MIB Variable Name | Kernel Module Parameter |
| svrThSensorReading | env_current_temp |
| svrThSensorStatus | env_current_temp |
| svrThSensorHighThresh | env_high_temp_thresh |
| svrPowerSupplyStatus | env_ps_status |
| svrFanStatus | env_fan_status |
An SNMP MIB compiler and other
tools are used to compile the MIB description into code for a skeletal subagent
daemon.
Communication between the subagent daemon and the eSNMP daemon is
handled by interfaces in the eSNMP shared library (libesnmp.so).
The subagent daemon must be started when the system boots and after the eSNMP
daemon has started.
For each Server System MIB variable listed in Table 13-3, code is provided in the subagent daemon, which accesses the appropriate parameter from the kernel module through the CFG interface.
To monitor the system environment, the
envmond
daemon
is used.
You can customize the daemon by using the
envconfig
utility or customize the messages that are broadcast.
The following sections
discuss the daemon and utility.
For more information, see the
envmond
and
envconfig
reference pages.
By using the Environmental Monitoring daemon,
envmond,
threshold levels can be checked and corrective action can ensue before damage
occurs to your system.
The
envmond
daemon performs the
following:
Broadcasts a message to users and provides corrective action when a high threshold level or redundant power supply failure has been encountered. When the cooling fan on an AlphaServer 1000A fails, the kernel logs the error, synchronizes the disks, then powers the system down. On all other fan failures, a hard shutdown ensues. (Note that messages can be customized.)
Notifies users when a high temperature threshold condition has been resolved.
Notifies all users that an orderly shutdown is in progress if recovery is not possible.
To query the system, the
envmond
daemon
uses the base operating system command
/usr/sbin/snmp_request
to obtain
the current values of the environment variables specified in the Server System
MIB.
To enable Environmental Monitoring, the
envmond
daemon
must be started during the system boot, but after the eSNMP and Server System
MIB agents have been started.
You can customize the
envmond
daemon using the
envconfig
utility.
You can use the
envconfig
utility to customize how
the environment is queried by the
envmond
daemon.
These
customizations are stored in the
/etc/rc.config
file, which
is read by the
envmond
daemon during startup.
Use the
envconfig
utility to perform the following:
Turn environmental monitoring on or off during the system boot.
Specify the frequency between queries of the system by the
envmond
daemon.
Set
the highest threshold level that can be encountered before a temperature event
is signaled by the
envmond
daemon.
Specify the path of
a user defined script that you want the
envmond
daemon
to execute when a high threshold level is encountered.
Specify the grace period allotted to save data if a shutdown message has been broadcast.
Display the values of the Environmental Monitoring variables.
You can modify any messages broadcast or logged by the Environmental
Monitoring utility.
The messages are located in the file:
/usr/share/sysman/envmon/EnvMon_UserDefinable_Msg.tcl.
You must be root to edit this file and you can edit any message
included in braces ({}).
The instructions for editing this file are included
in the comment (#) fields and you should avoid altering any other data in
this file.
For example, you can change the messages to specify the system name (host name) and location as shown in the following procedure:
Save the file
/usr/share/sysman/envmon/EnvMon_UserDefinable_Msg.tcl
to a holding file in case of editing errors.
Edit the file using the editor of your choice (the
/usr/bin/dt/dtpad
editor in CDE, for example).
Search for instances of
EnvmMon_Ovrstr
and locate the associated text string that is contained in braces ({}).
Modify the string as required. For example, prefix messages with the host name and location of the system by changing message strings as shown in the following samples:
Current message:
set EnvmMon_Ovrstr(ENVMON_EVENT_SAFE_MSG) {System temperature is normal
Edited message:
set EnvmMon_Ovrstr(ENVMON_EVENT_SAFE_MSG) {System ntcstr5 in room 1 aisle 4 - temperature is normal
Save the file and exit.
You may want to compare the edited file with the holding file (using diff) to ensure that no other changes were made,
because an error in this file may prevent the correct transmission of warning messages
if a system problem occurs.