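This chapter helps you analyze crash dump files. It provides guidelines for examining crash dump files (Section 5.1) and shows how to identify a crash caused by a software problem (Section 5.2), identify a hardware exception (Section 5.3), find a panic string in a thread other than the current thread (Section 5.4), and identify the cause of a crash on an SMP system (Section 5.5).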
If the system crashed
because of a hardware problem (for example, because a memory board became
corrupt), correcting the problem probably requires repairing or replacing
the hardware. You might be able to disconnect the hardware that caused the
problem and operate without it until it is repaired or replaced. If you need
to repair or replace Digital hardware, call the nearest Digital service center
or sales office.
If a software panic
caused the crash, you can fix the problem if it is in software you or someone
else at your company wrote. Otherwise, you must request that the producer
of the software fix the problem. If the problem is in software from Digital,
you file a Software Performance Report (SPR) to request a correction to the
Digital software.
For information about reporting problems to Digital, contact your local
Digital service center or sales office.
In most cases, only system programmers can fix the problem that caused
a panic because most panics are caused by software errors. However, some
system panics reflect other problems. For example, if a memory board becomes
corrupted, software that attempts to write to that board might call the panic function and crash the system. In this case, the solution might
be to replace the memory board and reboot the system.
The sections that follow demonstrate finding the cause of a software
panic using the dbx and kdbx debuggers. You can also
examine output from the crashdc crash data collection tool to help
you determine the cause of a crash. Sample output from crashdc
is shown and explained in Appendix A.
The sections that follow show how to identify hardware traps using the dbx and kdbx debuggers. You can also examine output from
the crashdc crash data collection tool to help you determine the
cause of a crash. Sample output from crashdc is shown and explained
in Appendix A.
The following example shows a method for stepping through kernel threads
to identify the events that led to the crash:
The following example shows a method for determining which CPU caused
the crash and which function called the panic function:
The panic string "cpu_ip_intr: panic request" indicates that this CPU was not the one that started
the system panic. This CPU was requested to panic and stop operation.
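The check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of dbx or kdbx: the function name originating_cpus and the slots dictionary are hypothetical, and the dictionary is assumed to hold cpu_panicstr values that you copied by hand from the output of p machine_slot[n] for each CPU. The value shown for CPU 0 is assumed from the global panicstr and paniccpu values in the SMP example.

```python
# Panic string left on a CPU that was only asked to stop by another CPU.
PANIC_REQUEST = "cpu_ip_intr: panic request"


def originating_cpus(cpu_panic_strings):
    """Return the CPU numbers whose panic string is not a panic request.

    cpu_panic_strings maps a CPU number to the cpu_panicstr text copied
    from the debugger output (hypothetical input format).
    """
    return [
        cpu
        for cpu, panicstr in sorted(cpu_panic_strings.items())
        if panicstr and panicstr != PANIC_REQUEST
    ]


# Values copied by hand from the debugger; CPU 0's string is an assumption
# based on the global panicstr, CPU 1's string appears in machine_slot[1].
slots = {
    0: "simple_lock: uninitialized lock",
    1: "cpu_ip_intr: panic request",
}
print(originating_cpus(slots))
```

A CPU that is filtered out by this check was stopped on request, so its stack trace is unlikely to show the cause of the crash.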
Notice that the panic function appears twice in the stack
trace. The series of events that resulted in the first call to the panic function caused the crash. The events that occurred after the first
call to the panic function were performed after the system was
corrupt and during an attempt to save data. Normally, any events that occur
after the initial call to the panic function will not help you
determine why the system crashed.
In this example, the problem is in the pgrp_ref function
on line 561 in the kern_proc.c file.
If you follow the stack trace after the pgrp_ref function,
you can see that the pgrp_ref function calls the simple_lock_valid_violation function. This function displays information about simple locks, which
might be helpful in determining why the system crashed.
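The rule that only the first call to the panic function matters can be illustrated with a short Python sketch. It assumes that you have copied the function names from a dbx stack trace into a list, with frame 0 first. The function name first_panic_caller is hypothetical, and skipping frames whose names begin with simple_lock_ is an assumption made for this trace (those frames only report the lock error), not a general rule.

```python
def first_panic_caller(frames, helper_prefixes=("simple_lock_",)):
    """Return the function that led to the first call to panic.

    frames lists function names in dbx order: frame 0 (most recent) first.
    The highest-numbered panic frame is therefore the earliest call in time.
    """
    panic_indexes = [i for i, name in enumerate(frames) if name == "panic"]
    if not panic_indexes:
        return None
    first = panic_indexes[-1]
    # Walk toward older frames, skipping error-reporting helpers (assumed).
    for name in frames[first + 1:]:
        if not name.startswith(helper_prefixes):
            return name
    return None


# Function names copied from the SMP stack trace in Section 5.5.
trace = [
    "boot", "panic", "simple_lock_fault", "simple_lock_hierarchy_violation",
    "xnaintr", "_XentInt", "swap_ipl", "boot", "panic", "simple_lock_fault",
    "simple_lock_valid_violation", "pgrp_ref", "exit", "rexit", "syscall",
    "_Xsyscall",
]
print(first_panic_caller(trace))
```

For this trace the sketch reports pgrp_ref, which agrees with the analysis above. The frames numbered lower than the first call to panic are ignored because they ran after the system was already corrupt.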
5.1 Guidelines for Examining Crash Dump Files
There is no single way to determine the cause of a system crash when you
examine crash dump files. However, following these steps should help you
identify the events that lead to most crashes:
5.2 Identifying a Crash Caused by a Software Problem
When software encounters
a state from which it cannot continue, it calls the system panic
function. For example, if the software attempts to access an area of memory
that is protected from access, the software might call the panic
function and crash the system.
5.2.1 Using dbx to Determine the Cause of a Software Panic
The following
example shows a method for identifying a software panic with the dbx debugger:
# dbx -k vmunix.0 vmcore.0
dbx version 3.11.1
Type 'help' for help.
stopped at [boot:753 ,0xfffffc00003c4ae4] Source not available
(dbx) p panicstr (1)
0xfffffc000044b648 = "ialloc: dup alloc"
(dbx) t (2)
> 0 boot(paniced = 0, arghowto = 0) ["../../../../src/kernel/arch/alpha/machdep.\
c":753, 0xfffffc00003c4ae4]
1 panic(s = 0xfffffc000044b618 = "mode = 0%o, inum = %d, pref = %d fs = %s\n")\
["../../../../src/kernel/bsd/subr_prf.c":1119, 0xfffffc00002bdbb0]
2 ialloc(pip = 0xffffffff8c6acc40, ipref = 57664, mode = 0, ipp = 0xffffffff8c\
f95af8) ["../../../../src/kernel/ufs/ufs_alloc.c":501, 0xfffffc00002dab48]
3 maknode(vap = 0xffffffff8cf95c50, ndp = 0xffffffff8cf922f8, ipp = 0xffffffff\
8cf95b60) ["../../../../src/kernel/ufs/ufs_vnops.c":2842, 0xfffffc00002ea500]
4 ufs_create(ndp = 0xffffffff8cf922f8, vap = 0xfffffc00002fe0a0) ["../../../..\
/src/kernel/ufs/ufs_vnops.c":602, 0xfffffc00002e771c]
5 vn_open(ndp = 0xffffffff8cf95d18, fmode = 4618, cmode = 416) ["../../../../s\
rc/kernel/vfs/vfs_vnops.c":258, 0xfffffc00002fe138]
6 copen(p = 0xffffffff8c6efba0, args = 0xffffffff8cf95e50, retval = 0xffffffff\
8cf95e40, compat = 0) ["../../../../src/kernel/vfs/vfs_syscalls.c":1379, 0xfffffc\
00002fb890]
7 open(p = 0xffffffff8cf95e40, args = (nil), retval = 0x7f4) ["../../../../src\
/kernel/vfs/vfs_syscalls.c":1340, 0xfffffc00002fb7bc]
8 syscall(ep = 0xffffffff8cf95ef8, code = 45) ["../../../../src/kernel/arch/al\
pha/syscall_trap.c":532, 0xfffffc00003cfa34]
9 _Xsyscall() ["../../../../src/kernel/arch/alpha/locore.s":703, 0xfffffc00003\
c31e0]
(dbx) q
5.2.2 Using kdbx to Determine the Cause of a Software Panic
The
following example shows a method of finding a software panic using the kdbx debugger:
# kdbx -k vmunix.3 vmcore.3
dbx version 3.11.1
Type 'help' for help.
stopped at [boot:753 ,0xfffffc00003c4b04] Source not available
(kdbx) sum (1)
Hostname : system.dec.com
cpu: DEC3000 - M500 avail: 1
Boot-time: Mon Dec 14 12:06:31 1992
Time: Mon Dec 14 12:17:16 1992
Kernel : OSF1 release 1.2 version 1.2 (alpha)
(kdbx) p panicstr (2)
0xfffffc0000453ea0 = "wdir: compact2"
(kdbx) t (3)
> 0 boot(paniced = 0, arghowto = 0) ["../../../../src/kernel/arch/alpha/machdep\
.c":753, 0xfffffc00003c4b04]
1 panic(s = 0xfffffc00002e0938 = "p") ["../../../../src/kernel/bsd/subr_prf.c"\
:1119, 0xfffffc00002bdbb0]
2 direnter(ip = 0xffffffff00000000, ndp = 0xffffffff9d38db60) ["../../../../sr\
c/kernel/ufs/ufs_lookup.c":986, 0xfffffc00002e2adc]
3 ufs_mkdir(ndp = 0xffffffff9d38a2f8, vap = 0x100000020) ["../../../../src/ker\
nel/ufs/ufs_vnops.c":2383, 0xfffffc00002e9cbc]
4 mkdir(p = 0xffffffff9c43d7c0, args = 0xffffffff9d38de50, retval = 0xffffffff\
9d38de40) ["../../../../src/kernel/vfs/vfs_syscalls.c":2579, 0xfffffc00002fd930]
5 syscall(ep = 0xffffffff9d38def8, code = 136) ["../../../../src/kernel/arch/a\
lpha/syscall_trap.c":532, 0xfffffc00003cfa54]
6 _Xsyscall() ["../../../../src/kernel/arch/alpha/locore.s":703, 0xfffffc00003\
c3200]
(kdbx) q
dbx (pid 29939) died. Exiting...
5.3 Identifying a Hardware Exception
Occasionally,
your system might crash due to a hardware error. During a hardware exception,
the hardware encounters a situation from which it cannot continue. For example,
the hardware might detect a parity error in a portion of memory that is necessary
for its successful operation. When a hardware exception occurs, the hardware
stores information in registers and stops operation. When control returns
to the software, it normally calls the panic function and the system
crashes.
5.3.1 Using dbx to Determine the Cause of a Hardware Error
The
following example shows a method for identifying a hardware trap with the dbx debugger:
# dbx -k vmunix.1 vmcore.1
dbx version 3.11.1
Type 'help' for help.
(dbx) sh strings vmunix.1 | grep '(Rev' (1)
DEC OSF/1 X2.0A-7 (Rev. 1);
(dbx) p utsname (2)
struct {
sysname = "OSF1"
nodename = "system.dec.com"
release = "2.0"
version = "2.0"
machine = "alpha"
}
(dbx) p panicstr (3)
0xfffffc0000489350 = "trap: Kernel mode prot fault\n"
(dbx) t (4)
> 0 boot(paniced = 0, arghowto = 0) ["/usr/sde/alpha/build/alpha.nightly/src/ker\
nel/arch/alpha/machdep.c":
1 panic(s = 0xfffffc0000489350 = "trap: Kernel mode prot fault\n") ["/usr/sde\
/alpha/build/alpha.nightly/src/kernel/bsd/subr_prf.c":1099, 0xfffffc00002c0730]
2 trap() ["/usr/sde/alpha/build/alpha.nightly/src/kernel/arch/alpha/trap.c":54\
4, 0xfffffc00003e0c78]
3 _XentMM() ["/usr/sde/alpha/build/alpha.nightly/src/kernel/arch/alpha/locore.\
s":702, 0xfffffc00003d4ff4]
(dbx) kps (5)
PID COMM
00000 kernel idle
00001 init
00002 device server
00003 exception hdlr
00663 ypbind
00018 cfgmgr
00219 automount
.
.
.
00265 cron
00293 xdm
02311 inetd
00278 lpd
01443 csh
01442 rlogind
01646 rlogind
01647 csh
(dbx) p $pid (6)
2311
(dbx) p *pmsgbuf (7)
struct {
msg_magic = 405601
msg_bufx = 62
msg_bufr = 3825
msg_bufc = "nknown flag
printstate: unknown flag
printstate: unknown flag
de: table is full
<3>vnode: table is full
.
.
.
<3>arp: local IP address 0xffffffff82b40429 in use by
hardware address 08:00:2B:20:19:CD
<3>arp: local IP address 0xffffffff82b40429 in use by
hardware address 08:00:2B:2B:F6:3B
va=0000000000000028, status word=0000000000000000, pc=fffffc000032972c
panic: trap: Kernel mode prot fault
syncing disks... 3 3 done
printstate: unknown flag
printstate: unknown flag
printstate: unknown flag
printstate: unknown flag
printstate: u"
}
(dbx) px savedefp
0xffffffff89b2b4e0
(dbx) p savedefp
0xffffffff89b2b4e0
(dbx) p savedefp[28]
18446739675666356012
(dbx) px savedefp[28] (8)
0xfffffc000032972c
(dbx) savedefp[28]/i (9)
[nfs_putpage:2344, 0xfffffc000032972c] ldl r5, 40(r1)
(dbx) savedefp[23]/i (10)
[ubc_invalidate:1768, 0xfffffc0000315fe0] stl r0, 84(sp)
(dbx) func nfs_putpage (11)
(dbx) file (12)
/usr/sde/alpha/build/alpha.nightly/src/kernel/kern/sched_prim.c
(dbx) func ubc_invalidate (13)
ubc_invalidate: Source not available
(dbx) file (14)
/usr/sde/alpha/build/alpha.nightly/src/kernel/vfs/vfs_ubc.c
(dbx) q
The results of this example show that the ubc_invalidate function, at line
number 1768 in the /vfs/vfs_ubc.c file, called the nfs_putpage function, and
that the system stopped in nfs_putpage at line number 2344, which dbx reports
in the /kern/sched_prim.c file.
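The link between the message buffer and the saved exception frame can be checked outside the debugger. The following Python sketch is an illustration only; the names TRAP_LINE and parse_trap_line are hypothetical. It parses the va, status word, and pc line that appears in the message buffer, and compares the pc field with the decimal value that p savedefp[28] printed.

```python
import re

# Matches the trap line that the kernel writes to the message buffer.
TRAP_LINE = re.compile(
    r"va=(?P<va>[0-9a-fA-F]+),\s*"
    r"status word=(?P<status>[0-9a-fA-F]+),\s*"
    r"pc=(?P<pc>[0-9a-fA-F]+)"
)


def parse_trap_line(line):
    """Return the va, status, and pc fields as integers, or None."""
    match = TRAP_LINE.search(line)
    if match is None:
        return None
    return {key: int(value, 16) for key, value in match.groupdict().items()}


line = "va=0000000000000028, status word=0000000000000000, pc=fffffc000032972c"
trap = parse_trap_line(line)

saved_pc = 18446739675666356012  # decimal output of: p savedefp[28]
print(hex(saved_pc))             # same value that px savedefp[28] displays
print(trap["pc"] == saved_pc)    # True when both refer to the same instruction
print(trap["va"])                # faulting virtual address as a decimal number
```

The faulting address 0x28 is decimal 40, which equals the displacement in the ldl r5, 40(r1) instruction at that pc. This is consistent with r1 holding zero at the time of the fault, although the dump excerpt shown here does not display the register contents.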
5.3.2 Using kdbx to Determine the Cause of a Hardware Error
The
following example shows a method for identifying a hardware error by using
the kdbx debugger:
# kdbx -k vmunix.5 vmcore.5
dbx version 3.11.1
Type 'help' for help.
stopped at [boot:753 ,0xfffffc00003c4b04] Source not available
(kdbx) sum (1)
Hostname : system.dec.com
cpu: DEC3000 - M500 avail: 1
Boot-time: Thu Jan 7 08:12:30 1993
Time: Thu Jan 7 08:13:23 1993
Kernel : OSF1 release 1.2 version 1.2 (alpha)
(kdbx) p panicstr (2)
0xfffffc0000471030 = "ECC Error"
(kdbx) t (3)
> 0 boot(paniced = 0, arghowto = 0) ["../../../../src/kernel/arch/alpha/machdep.\
c":753, 0xfffffc00003c4b04]
1 panic(s = 0x670) ["../../../../src/kernel/bsd/subr_prf.c":1119, 0xfffffc00002\
bdbb0]
2 kn15aa_machcheck(type = 1648, cmcf = 0xfffffc00000f8050 = , framep = 0xffff\
ffff94f79ef8) ["../../../../src/kernel/arch/alpha/hal/kn15aa.c":1269, 0xfffffc000\
03da62c]
3 mach_error(type = -1795711240, phys_logout = 0x3, regs = 0x6) ["../../../../s\
rc/kernel/arch/alpha/hal/cpusw.c":323, 0xfffffc00003d7dc0]
4 _XentInt() ["../../../../src/kernel/arch/alpha/locore.s":609, 0xfffffc00003c3\
148]
(kdbx) q
dbx (pid 337) died. Exiting...
5.4 Finding a Panic String in a Thread Other Than the Current Thread
The dbx and kdbx debuggers have the concept of the current thread. In many cases,
when you invoke one of the debuggers to analyze a crash dump, the panic string
is in the current thread. At times, however, the current thread contains
no panic string and so is probably not the thread that caused the crash.
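When the thread list is long, it can help to search saved tlist output for the thread that stopped in the boot function, because that is the thread to select with tset in the example in this section. The following Python sketch assumes that you saved the tlist output to a string; the names THREAD_LINE and threads_stopped_in are hypothetical, and the pattern is based only on the line format shown in this section.

```python
import re

# Matches lines such as:
#   thread 0x8d42f5d0 stopped at [boot:696 ,0xfffffc00003e119c]
THREAD_LINE = re.compile(
    r"thread\s+(0x[0-9a-fA-F]+)\s+stopped at\s+\[(\w+):(\d+)"
)


def threads_stopped_in(tlist_output, function="boot"):
    """Return the addresses of threads that stopped in the given function."""
    hits = []
    for match in THREAD_LINE.finditer(tlist_output):
        address, name, _line_number = match.groups()
        if name == function:
            hits.append(address)
    return hits


# Three lines taken from the tlist output in this section.
sample = """\
thread 0x8d431a60 stopped at [thread_block:1305 +0x114,0xfffffc000033961c]
thread 0x8d42f5d0 stopped at [boot:696 ,0xfffffc00003e119c]
thread 0x8d42f3c8 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8]
"""
print(threads_stopped_in(sample))
```

The address that the sketch reports is the argument you would pass to the tset command before requesting a stack trace.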
# dbx -k ./vmunix.2 ./vmcore.2
dbx version 3.11.1
Type 'help' for help.
thread 0x8d431c68 stopped at [thread_block:1305 +0x114,0xfffffc000033961c] \
Source not available
(dbx) p panicstr (1)
0xfffffc000048a0c8 = "kernel memory fault"
(dbx) t (2)
> 0 thread_block() ["../../../../src/kernel/kern/sched_prim.c":1305, 0xfffffc0\
00033961c]
1 mpsleep(chan = 0xffffffff8d4ef450 = , pri = 282, wmesg = 0xfffffc000046f\
290 = "network", timo = 0, lockp = (nil), flags = 0) ["../../../../src/kernel/\
bsd/kern_synch.c":267, 0xfffffc00002b772c]
2 sosleep(so = 0xffffffff8d4ef408, addr = 0xffffffff906cfcf4 = "^P", pri = 2\
82, tmo = 0) ["../../../../src/kernel/bsd/uipc_socket2.c":612, 0xfffffc00002d3784]
3 accept1(p = 0xffffffff8f8bfde8, args = 0xffffffff906cfe50, retval = 0xffff\
ffff906cfe40, compat_43 = 1) ["../../../../src/kernel/bsd/uipc_syscalls.c":300\
, 0xfffffc00002d4c74]
4 oaccept(p = 0xffffffff8d431c68, args = 0xffffffff906cfe50, retval = 0xffff\
ffff906cfe40) ["../../../../src/kernel/bsd/uipc_syscalls.c":250, 0xfffffc00002d\
4b0c]
5 syscall(ep = 0xffffffff906cfef8, code = 99, sr = 1) ["../../../../src/kern\
el/arch/alpha/syscall_trap.c":499, 0xfffffc00003ec18c]
6 _Xsyscall() ["../../../../src/kernel/arch/alpha/locore.s":675, 0xfffffc000\
03df96c]
(dbx) tlist (3)
thread 0x8d431a60 stopped at [thread_block:1305 +0x114,0xfffffc000033961c] \
Source not available
thread 0x8d431858 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d431650 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d431448 stopped at [thread_block:1305 +0x114,0xfffffc000033961c] \
Source not available
thread 0x8d431240 stopped at [thread_block:1305 +0x114,0xfffffc000033961c] \
Source not available
.
.
.
thread 0x8d42f5d0 stopped at [boot:696 ,0xfffffc00003e119c] \
Source not available
thread 0x8d42f3c8 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d42f1c0 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d42efb8 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d42dd70 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
(dbx) tset 0x8d42f5d0 (4)
thread 0x8d42f5d0 stopped at [boot:696 ,0xfffffc00003e119c] Source not ava\
ilable
(dbx) t (5)
> 0 boot(paniced = 0, arghowto = 0) ["../../../../src/kernel/arch/alpha/mac\
hdep.c":694, 0xfffffc00003e1198]
1 panic(s = 0xfffffc000048a098 = " sp contents at time of fault: 0x%l01\
6x\r\n\n") ["../../../../src/kernel/bsd/subr_prf.c":1110, 0xfffffc00002beef4]
2 trap() ["../../../../src/kernel/arch/alpha/trap.c":677, 0xfffffc00003ecc70]
3 _XentMM() ["../../../../src/kernel/arch/alpha/locore.s":828, 0xfffffc000\
03dfb1c]
4 pmap_release_page(pa = 18446744071785586688) ["../../../../src/kernel/ar\
ch/alpha/pmap.c":640, 0xfffffc00003e3ecc]
5 put_free_ptepage(page = 5033216) ["../../../../src/kernel/arch/alpha/pma\
p.c":534, 0xfffffc00003e3ca0]
6 pmap_destroy(map = 0xffffffff8d5bc428) ["../../../../src/kernel/arch/alp\
ha/pmap.c":1891, 0xfffffc00003e6140]
7 vm_map_deallocate(map = 0xffffffff81930ee0) ["../../../../src/kernel/vm/\
vm_map.c":482, 0xfffffc00003d03c0]
8 task_deallocate(task = 0xffffffff8d568d48) ["../../../../src/kernel/kern\
/task.c":237, 0xfffffc000033c1dc]
9 thread_deallocate(thread = 0x4e4360) ["../../../../src/kernel/kern/threa\
d.c":689, 0xfffffc000033d83c]
10 reaper_thread() ["../../../../src/kernel/kern/thread.c":1952, 0xfffffc00\
0033e920]
11 reaper_thread() ["../../../../src/kernel/kern/thread.c":1901, 0xfffffc00\
0033e8ac]
(dbx) q
5.5 Identifying the Cause of a Crash on an SMP System
If
you are analyzing crash dump files from an SMP system, you must first determine
on which CPU the panic occurred. You can then continue crash dump analysis
as you would on a single processor system.
% dbx -k ./vmunix.1 ./vmcore.1
dbx version 3.11.6
Type 'help' for help.
stopped at [boot:1494 ,0xfffffc0000442918] Source not available
(dbx) p utsname (1)
struct {
sysname = "OSF1"
nodename = "wasted.zk3.dec.com"
release = "V3.0"
version = "358"
machine = "alpha"
}
(dbx) print paniccpu (2)
0
(dbx) p machine_slot[1] (3)
struct {
is_cpu = 1
cpu_type = 15
cpu_subtype = 3
running = 1
cpu_ticks = {
[0] 416162
[1] 83260
[2] 1401080
[3] 11821212
[4] 1095581
}
clock_freq = 1024
error_restart = 0
cpu_panicstr = 0xfffffc000059f6a0 = "cpu_ip_intr: panic request"
cpu_panic_thread = 0xffffffff8109a780
}
(dbx) p panicstr (4)
0xfffffc0000558ad0 = "simple_lock: uninitialized lock"
(dbx) tset active_threads[paniccpu] (5)
stopped at [boot:1494 ,0xfffffc0000442918]
(dbx) t (6)
> 0 boot(0x0, 0x4, 0xac35c0000000a, 0xfffffc00004403fc, 0xfffffc000000000e) \
["../../../../src/kernel/arch/alpha/machdep.c":1494, 0xfffffc0000442918]
1 panic(s = 0xfffffc0000558b40 = "simple_lock: hierarchy violation") ["../\
2 simple_lock_fault(slp = 0xfffffc00006292f0, state = 0, caller = 0xfffffc\
000046f384, arg = 0xfffffc0000534fd8 = "session.s_fpgrp_lock", fmt = 0xfffffc\
0000558de8 = " class already locked: %s\n", error = 0xfffffc0000558b40 = "\
simple_lock: hierarchy violation") ["../../../../src/kernel/kern/lock.c":1558\
, 0xfffffc00003c34ec]
3 simple_lock_hierarchy_violation(slp = 0xfffffc000046f384, state = 184467\
39675668500440, caller = 0xfffffc0000558de8, curhier = 5606208) ["../../../..\
/src/kernel/kern/lock.c":1616, 0xfffffc00003c3620]
4 xnaintr(0xfffffc00005a5158, 0x2, 0xffffffffb53ef238, 0xfffffc000068a754,\
0xfffffc000055891d) ["../../../../src/kernel/io/dec/netif/if_xna.c":1077, 0x\
fffffc000046f384]
5 _XentInt(0x2, 0xfffffc0000447174, 0xfffffc00005b7d40, 0x2, 0x0) ["../../\
6 swap_ipl(0x2, 0xfffffc0000447174, 0xfffffc00005b7d40, 0x2, 0x0) ["../../\
7 boot(0x0, 0x0, 0xffffffffa52c6000, 0xffffffffb53ef1f8, 0xfffffc00003bf4f\
c) ["../../../../src/kernel/arch/alpha/machdep.c":1434, 0xfffffc000044280c]
8 panic(s = 0xfffffc0000558ad0 = "simple_lock: uninitialized lock") ["../.\
9 simple_lock_fault(slp = 0xffffffffa52c6000, state = 1719, caller = 0xfff\
ffc00003734c4, arg = (nil), fmt = (nil), error = 0xfffffc0000558ad0 = "simple\
_lock: uninitialized lock") ["../../../../src/kernel/kern/lock.c":1558, 0xfff\
ffc00003c34ec]
10 simple_lock_valid_violation(slp = 0xfffffc00003734c4, state = 0, caller \
= (nil)) ["../../../../src/kernel/kern/lock.c":1584, 0xfffffc00003c3578]
11 pgrp_ref(0xffffffffa52c6000, 0x0, 0xfffffc000023ee20, 0x6b7, 0xfffffc000\
05e1080) ["../../../../src/kernel/bsd/kern_proc.c":561, 0xfffffc00003734c4]
12 exit(0xffffffffb53ef740, 0x100, 0x1, 0xffffffffa42e5e80, 0x1) ["../../..\
/../src/kernel/bsd/kern_exit.c":868, 0xfffffc000023ef30]
13 rexit(0xffffffff814d2d80, 0xffffffffb53ef758, 0xffffffffb53ef8b8, 0x1000\
00001, 0x0) ["../../../../src/kernel/bsd/kern_exit.c":546, 0xfffffc000023e7dc]
14 syscall(0xffffffffb53ec000, 0xfffffc000068a300, 0x0, 0x51, 0x1) ["../../\
15 _Xsyscall(0x8, 0x3ff800e6938, 0x14000d0f0, 0x1, 0x11ffffc18) ["../../../\
(dbx) p *pmsgbuf (7)
struct {
msg_magic = 405601
msg_bufx = 701
msg_bufr = 134
msg_bufc = "0.64.143, errno 22
NFS server: stale file handle fs(742,645286) file 573 gen 32779
getattr, client address = 16.140.64.143, errno 22
simple_lock: uninitialized lock
pc of caller: 0xfffffc00003734c4
lock address: 0xffffffffa52c6000
lock class name: (unknown_simple_lock)
current lock state: 0x00000000e0e9b04a (cpu=0,pc=0xfffffc00e0e9b048,free)
panic (cpu 0): simple_lock: uninitialized lock
simple_lock: hierarchy violation
pc of caller: 0xfffffc000046f384
lock address: 0xfffffc00006292f0
lock info addr: 0xfffffc0000672cc0
lock class name: xna_softc.lk_xna_softc
class already locked: session.s_fpgrp_lock
.
.
.
}
(dbx) quit