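This chapter helps you analyze crash dump files. It explains how to decide whether a hardware problem or a software panic caused a crash, and it shows examples of identifying software panics and hardware traps with the dbx and kdbx debuggers.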
If the system crashed because of a hardware problem (for example, because a memory board became corrupt), correcting the problem probably requires repairing or replacing the hardware. You might be able to disconnect the hardware that caused the problem and operate without it until it is repaired or replaced. If you need to repair or replace Digital hardware, call the nearest Digital service center or sales office.
If a software panic caused the crash, you can fix the problem yourself only if it is in software that you or someone else at your company wrote. Otherwise, you must ask the producer of the software to fix the problem. If the problem is in software from Digital, file a Software Performance Report (SPR) to request a correction.
In most cases, only system programmers can fix the problem that caused a panic because most panics are caused by software errors. However, some system panics reflect other problems. For example, if a memory board becomes corrupted, software that attempts to write to that board might call the panic function and crash the system. In this case, the solution might be to replace the memory board and reboot the system.
The sections that follow demonstrate finding the cause of a software panic using the dbx and kdbx debuggers. You can also examine output from the crashdc crash data collection tool to help you determine the cause of a crash. Sample output from crashdc is shown and explained in Appendix A.
# dbx -k vmunix.0 vmcore.0
dbx version 3.11.1
Type 'help' for help.

stopped at [boot:753 ,0xfffffc00003c4ae4] Source not available
(dbx) p panicstr (1)
0xfffffc000044b648 = "ialloc: dup alloc"
(dbx) t (2)
> 0 boot(paniced = 0, arghowto = 0) ["../../../../src/kernel/arch/alpha/machdep.\
c":753, 0xfffffc00003c4ae4]
  1 panic(s = 0xfffffc000044b618 = "mode = 0%o, inum = %d, pref = %d fs = %s\n")\
["../../../../src/kernel/bsd/subr_prf.c":1119, 0xfffffc00002bdbb0]
  2 ialloc(pip = 0xffffffff8c6acc40, ipref = 57664, mode = 0, ipp = 0xffffffff8c\
f95af8) ["../../../../src/kernel/ufs/ufs_alloc.c":501, 0xfffffc00002dab48]
  3 maknode(vap = 0xffffffff8cf95c50, ndp = 0xffffffff8cf922f8, ipp = 0xffffffff\
8cf95b60) ["../../../../src/kernel/ufs/ufs_vnops.c":2842, 0xfffffc00002ea500]
  4 ufs_create(ndp = 0xffffffff8cf922f8, vap = 0xfffffc00002fe0a0) ["../../../..\
/src/kernel/ufs/ufs_vnops.c":602, 0xfffffc00002e771c]
  5 vn_open(ndp = 0xffffffff8cf95d18, fmode = 4618, cmode = 416) ["../../../../s\
rc/kernel/vfs/vfs_vnops.c":258, 0xfffffc00002fe138]
  6 copen(p = 0xffffffff8c6efba0, args = 0xffffffff8cf95e50, retval = 0xffffffff\
8cf95e40, compat = 0) ["../../../../src/kernel/vfs/vfs_syscalls.c":1379, 0xfffffc\
00002fb890]
  7 open(p = 0xffffffff8cf95e40, args = (nil), retval = 0x7f4) ["../../../../src\
/kernel/vfs/vfs_syscalls.c":1340, 0xfffffc00002fb7bc]
  8 syscall(ep = 0xffffffff8cf95ef8, code = 45) ["../../../../src/kernel/arch/al\
pha/syscall_trap.c":532, 0xfffffc00003cfa34]
  9 _Xsyscall() ["../../../../src/kernel/arch/alpha/locore.s":703, 0xfffffc00003\
c31e0]
(dbx) q
# kdbx -k vmunix.3 vmcore.3
dbx version 3.11.1
Type 'help' for help.

stopped at [boot:753 ,0xfffffc00003c4b04] Source not available
(kdbx) sum (1)
Hostname : system.dec.com
cpu: DEC3000 - M500 avail: 1
Boot-time: Mon Dec 14 12:06:31 1992
Time: Mon Dec 14 12:17:16 1992
Kernel : OSF1 release 1.2 version 1.2 (alpha)
(kdbx) p panicstr (2)
0xfffffc0000453ea0 = "wdir: compact2"
(kdbx) t (3)
> 0 boot(paniced = 0, arghowto = 0) ["../../../../src/kernel/arch/alpha/machdep\
.c":753, 0xfffffc00003c4b04]
  1 panic(s = 0xfffffc00002e0938 = "p") ["../../../../src/kernel/bsd/subr_prf.c"\
:1119, 0xfffffc00002bdbb0]
  2 direnter(ip = 0xffffffff00000000, ndp = 0xffffffff9d38db60) ["../../../../sr\
c/kernel/ufs/ufs_lookup.c":986, 0xfffffc00002e2adc]
  3 ufs_mkdir(ndp = 0xffffffff9d38a2f8, vap = 0x100000020) ["../../../../src/ker\
nel/ufs/ufs_vnops.c":2383, 0xfffffc00002e9cbc]
  4 mkdir(p = 0xffffffff9c43d7c0, args = 0xffffffff9d38de50, retval = 0xffffffff\
9d38de40) ["../../../../src/kernel/vfs/vfs_syscalls.c":2579, 0xfffffc00002fd930]
  5 syscall(ep = 0xffffffff9d38def8, code = 136) ["../../../../src/kernel/arch/a\
lpha/syscall_trap.c":532, 0xfffffc00003cfa54]
  6 _Xsyscall() ["../../../../src/kernel/arch/alpha/locore.s":703, 0xfffffc00003\
c3200]
(kdbx) q
dbx (pid 29939) died. Exiting...
The sections that follow show how to identify hardware traps using the dbx and kdbx debuggers. You can also examine output from the crashdc crash data collection tool to help you determine the cause of a crash. Sample output from crashdc is shown and explained in Appendix A.
# dbx -k vmunix.1 vmcore.1
dbx version 3.11.1
Type 'help' for help.
(dbx) sh strings vmunix.1 | grep '(Rev' (1)
DEC OSF/1 X2.0A-7 (Rev. 1);
(dbx) p utsname (2)
struct {
    sysname = "OSF1"
    nodename = "system.dec.com"
    release = "2.0"
    version = "2.0"
    machine = "alpha"
}
(dbx) p panicstr (3)
0xfffffc0000489350 = "trap: Kernel mode prot fault\n"
(dbx) t (4)
> 0 boot(paniced = 0, arghowto = 0) ["/usr/sde/alpha/build/alpha.nightly/src/ker\
nel/arch/alpha/machdep.c":
  1 panic(s = 0xfffffc0000489350 = "trap: Kernel mode prot fault\n") ["/usr/sde\
/alpha/build/alpha.nightly/src/kernel/bsd/subr_prf.c":1099, 0xfffffc00002c0730]
  2 trap() ["/usr/sde/alpha/build/alpha.nightly/src/kernel/arch/alpha/trap.c":54\
4, 0xfffffc00003e0c78]
  3 _XentMM() ["/usr/sde/alpha/build/alpha.nightly/src/kernel/arch/alpha/locore.\
s":702, 0xfffffc00003d4ff4]
(dbx) kps (5)
  PID COMM
00000 kernel idle
00001 init
00002 device server
00003 exception hdlr
00663 ypbind
00018 cfgmgr
00219 automount
.
.
.
00265 cron
00293 xdm
02311 inetd
00278 lpd
01443 csh
01442 rlogind
01646 rlogind
01647 csh
(dbx) p $pid (6)
2311
(dbx) p *pmsgbuf (7)
struct {
    msg_magic = 405601
    msg_bufx = 62
    msg_bufr = 3825
    msg_bufc = "nknown flag
printstate: unknown flag
printstate: unknown flag
de: table is full
<3>vnode: table is full
.
.
.
<3>arp: local IP address 0xffffffff82b40429 in use by hardware address 08:00:2B:20:19:CD
<3>arp: local IP address 0xffffffff82b40429 in use by hardware address 08:00:2B:2B:F6:3B
va=0000000000000028, status word=0000000000000000, pc=fffffc000032972c
panic: trap: Kernel mode prot fault
syncing disks... 3 3 done
printstate: unknown flag
printstate: unknown flag
printstate: unknown flag
printstate: unknown flag
printstate: u"
}
(dbx) px savedefp
0xffffffff89b2b4e0
(dbx) p savedefp
0xffffffff89b2b4e0
(dbx) p savedefp[28]
18446739675666356012
(dbx) px savedefp[28] (8)
0xfffffc000032972c
(dbx) savedefp[28]/i (9)
[nfs_putpage:2344, 0xfffffc000032972c] ldl r5, 40(r1)
(dbx) savedefp[23]/i (10)
[ubc_invalidate:1768, 0xfffffc0000315fe0] stl r0, 84(sp)
(dbx) func nfs_putpage (11)
(dbx) file (12)
/usr/sde/alpha/build/alpha.nightly/src/kernel/kern/sched_prim.c
(dbx) func ubc_invalidate (13)
ubc_invalidate: Source not available
(dbx) file (14)
/usr/sde/alpha/build/alpha.nightly/src/kernel/vfs/vfs_ubc.c
(dbx) q
va=0000000000000028, status word=0000000000000000, pc=fffffc000032972c
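The trap line that appears in the preserved message buffer can be split into its fields mechanically. The following Python sketch is only an illustration and is not part of dbx or kdbx; the function name parse_trap_line is hypothetical, and the script assumes the line has the va, status word, and pc layout shown in the sample session.

```python
import re

# Layout assumed from the sample message buffer:
#   va=<hex>, status word=<hex>, pc=<hex>
TRAP_LINE = re.compile(
    r"va=(?P<va>[0-9a-fA-F]+),\s*"
    r"status word=(?P<status>[0-9a-fA-F]+),\s*"
    r"pc=(?P<pc>[0-9a-fA-F]+)"
)


def parse_trap_line(text):
    """Return the va, status, and pc fields as integers, or None if absent."""
    match = TRAP_LINE.search(text)
    if match is None:
        return None
    return {name: int(value, 16) for name, value in match.groupdict().items()}


sample = "va=0000000000000028, status word=0000000000000000, pc=fffffc000032972c"
fields = parse_trap_line(sample)
print(hex(fields["pc"]))  # 0xfffffc000032972c
print(fields["va"])       # 40
```

In the sample session, the pc field is the same value that px savedefp[28] displays, and the decimal form of that value is what p savedefp[28] prints. The va field is decimal 40, which matches the displacement in the ldl r5, 40(r1) instruction at that address; this would be consistent with register r1 containing zero, although the dump alone does not confirm it.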
# kdbx -k vmunix.5 vmcore.5
dbx version 3.11.1
Type 'help' for help.

stopped at [boot:753 ,0xfffffc00003c4b04] Source not available
(kdbx) sum (1)
Hostname : system.dec.com
cpu: DEC3000 - M500 avail: 1
Boot-time: Thu Jan 7 08:12:30 1993
Time: Thu Jan 7 08:13:23 1993
Kernel : OSF1 release 1.2 version 1.2 (alpha)
(kdbx) p panicstr (2)
0xfffffc0000471030 = "ECC Error"
(kdbx) t (3)
> 0 boot(paniced = 0, arghowto = 0) ["../../../../src/kernel/arch/alpha/machdep.\
c":753, 0xfffffc00003c4b04]
  1 panic(s = 0x670) ["../../../../src/kernel/bsd/subr_prf.c":1119, 0xfffffc00002\
bdbb0]
  2 kn15aa_machcheck(type = 1648, cmcf = 0xfffffc00000f8050 = , framep = 0xffff\
ffff94f79ef8) ["../../../../src/kernel/arch/alpha/hal/kn15aa.c":1269, 0xfffffc000\
03da62c]
  3 mach_error(type = -1795711240, phys_logout = 0x3, regs = 0x6) ["../../../../s\
rc/kernel/arch/alpha/hal/cpusw.c":323, 0xfffffc00003d7dc0]
  4 _XentInt() ["../../../../src/kernel/arch/alpha/locore.s":609, 0xfffffc00003c3\
148]
(kdbx) q
dbx (pid 337) died. Exiting...
The following example shows a method for stepping through kernel threads to identify the events that lead to the crash:
# dbx -k ./vmunix.2 ./vmcore.2
dbx version 3.11.1
Type 'help' for help.

thread 0x8d431c68 stopped at [thread_block:1305 +0x114,0xfffffc000033961c] \
Source not available
(dbx) p panicstr (1)
0xfffffc000048a0c8 = "kernel memory fault"
(dbx) t (2)
> 0 thread_block() ["../../../../src/kernel/kern/sched_prim.c":1305, 0xfffffc0\
00033961c]
  1 mpsleep(chan = 0xffffffff8d4ef450 = , pri = 282, wmesg = 0xfffffc000046f\
290 = "network", timo = 0, lockp = (nil), flags = 0) ["../../../../src/kernel/\
bsd/kern_synch.c":267, 0xfffffc00002b772c]
  2 sosleep(so = 0xffffffff8d4ef408, addr = 0xffffffff906cfcf4 = "^P", pri = 2 \
82,tmo = 0) ["../../../../src/kernel/bsd/uipc_socket2.c":612, 0xfffffc00002d3784]
  3 accept1(p = 0xffffffff8f8bfde8, args = 0xffffffff906cfe50, retval = 0xffff \
ffff906cfe40, compat_43 = 1) ["../../../../src/kernel/bsd/uipc_syscalls.c":300 \
, 0xfffffc00002d4c74]
  4 oaccept(p = 0xffffffff8d431c68, args = 0xffffffff906cfe50, retval = 0xffff \
ffff906cfe40) ["../../../../src/kernel/bsd/uipc_syscalls.c":250, 0xfffffc00002d\
4b0c]
  5 syscall(ep = 0xffffffff906cfef8, code = 99, sr = 1) ["../../../../src/kern \
el/arch/alpha/syscall_trap.c":499, 0xfffffc00003ec18c]
  6 _Xsyscall() ["../../../../src/kernel/arch/alpha/locore.s":675, 0xfffffc000\
03df96c]
(dbx) tlist (3)
thread 0x8d431a60 stopped at [thread_block:1305 +0x114,0xfffffc000033961c] \
Source not available
thread 0x8d431858 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d431650 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d431448 stopped at [thread_block:1305 +0x114,0xfffffc000033961c] \
Source not available
thread 0x8d431240 stopped at [thread_block:1305 +0x114,0xfffffc000033961c] \
Source not available
.
.
.
thread 0x8d42f5d0 stopped at [boot:696 ,0xfffffc00003e119c] Source not \
available
thread 0x8d42f3c8 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d42f1c0 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d42efb8 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
thread 0x8d42dd70 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8] \
Source not available
(dbx) tset 0x8d42f5d0 (4)
thread 0x8d42f5d0 stopped at [boot:696 ,0xfffffc00003e119c] Source not ava\
ilable
(dbx) t (5)
> 0 boot(paniced = 0, arghowto = 0) ["../../../../src/kernel/arch/alpha/mac\
hdep.c":694, 0xfffffc00003e1198]
  1 panic(s = 0xfffffc000048a098 = " sp contents at time of fault: 0x%l01\
6x\r\n\n") ["../../../../src/kernel/bsd/subr_prf.c":1110, 0xfffffc00002beef4]
  2 trap() ["../../../../src/kernel/arch/alpha/trap.c":677, 0xfffffc00003ecc70]
  3 _XentMM() ["../../../../src/kernel/arch/alpha/locore.s":828, 0xfffffc000\
03dfb1c]
  4 pmap_release_page(pa = 18446744071785586688) ["../../../../src/kernel/ar\
ch/alpha/pmap.c":640, 0xfffffc00003e3ecc]
  5 put_free_ptepage(page = 5033216) ["../../../../src/kernel/arch/alpha/pma\
p.c":534, 0xfffffc00003e3ca0]
  6 pmap_destroy(map = 0xffffffff8d5bc428) ["../../../../src/kernel/arch/alp\
ha/pmap.c":1891, 0xfffffc00003e6140]
  7 vm_map_deallocate(map = 0xffffffff81930ee0) ["../../../../src/kernel/vm/\
vm_map.c":482, 0xfffffc00003d03c0]
  8 task_deallocate(task = 0xffffffff8d568d48) ["../../../../src/kernel/kern\
/task.c":237, 0xfffffc000033c1dc]
  9 thread_deallocate(thread = 0x4e4360) ["../../../../src/kernel/kern/threa\
d.c":689, 0xfffffc000033d83c]
 10 reaper_thread() ["../../../../src/kernel/kern/thread.c":1952, 0xfffffc00\
0033e920]
 11 reaper_thread() ["../../../../src/kernel/kern/thread.c":1901, 0xfffffc00\
0033e8ac]
(dbx) q
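In the preceding session, the thread of interest is the one that tlist reports as stopped in the boot function rather than in thread_block. When the thread list is long, a small script can pick out that thread from saved tlist output. The following Python sketch is an illustration only; the function name threads_stopped_in is hypothetical, and the script assumes each entry begins with the "thread ADDRESS stopped at [FUNCTION:LINE" layout shown in the session.

```python
import re

# One tlist entry is assumed to look like:
#   thread 0x8d42f5d0 stopped at [boot:696 ,0xfffffc00003e119c] ...
THREAD_ENTRY = re.compile(
    r"thread (0x[0-9a-fA-F]+) stopped at\s+\[(\w+):(\d+)"
)


def threads_stopped_in(tlist_output, function):
    """Return the addresses of threads whose stop location is in function."""
    return [
        address
        for address, stopped_in, _line in THREAD_ENTRY.findall(tlist_output)
        if stopped_in == function
    ]


sample = """\
thread 0x8d431a60 stopped at [thread_block:1305 +0x114,0xfffffc000033961c]
thread 0x8d42f5d0 stopped at [boot:696 ,0xfffffc00003e119c]
thread 0x8d42f3c8 stopped at [thread_block:1289 +0x18,0xfffffc00003394b8]
"""
print(threads_stopped_in(sample, "boot"))  # ['0x8d42f5d0']
```

The address that the script reports is the value to give to the tset command before displaying the stack trace.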
The following example shows a method for determining which CPU caused the crash and which function called the panic function:
% dbx -k ./vmunix.1 ./vmcore.1
dbx version 3.11.6
Type 'help' for help.

stopped at [boot:1494 ,0xfffffc0000442918] Source not available
(dbx) p utsname (1)
struct {
    sysname = "OSF1"
    nodename = "wasted.zk3.dec.com"
    release = "V3.0"
    version = "358"
    machine = "alpha"
}
(dbx) print paniccpu (2)
0
(dbx) p machine_slot[1] (3)
struct {
    is_cpu = 1
    cpu_type = 15
    cpu_subtype = 3
    running = 1
    cpu_ticks = {
        [0] 416162
        [1] 83260
        [2] 1401080
        [3] 11821212
        [4] 1095581
    }
    clock_freq = 1024
    error_restart = 0
    cpu_panicstr = 0xfffffc000059f6a0 = "cpu_ip_intr: panic request"
    cpu_panic_thread = 0xffffffff8109a780
}
(dbx) p panicstr (4)
0xfffffc0000558ad0 = "simple_lock: uninitialized lock"
(dbx) tset active_threads[paniccpu] (5)

stopped at [boot:1494 ,0xfffffc0000442918]
(dbx) t (6)
> 0 boot(0x0, 0x4, 0xac35c0000000a, 0xfffffc00004403fc, 0xfffffc000000000e) \
["../../../../src/kernel/arch/alpha/machdep.c":1494, 0xfffffc0000442918]
  1 panic(s = 0xfffffc0000558b40 = "simple_lock: hierarchy violation") ["../\
  2 simple_lock_fault(slp = 0xfffffc00006292f0, state = 0, caller = 0xfffffc\
000046f384, arg = 0xfffffc0000534fd8 = "session.s_fpgrp_lock", fmt = 0xfffffc\
0000558de8 = " class already locked: %s\n", error = 0xfffffc0000558b40 = "\
simple_lock: hierarchy violation") ["../../../../src/kernel/kern/lock.c":1558\
, 0xfffffc00003c34ec]
  3 simple_lock_hierarchy_violation(slp = 0xfffffc000046f384, state = 184467\
39675668500440, caller = 0xfffffc0000558de8, curhier = 5606208) ["../../../..\
/src/kernel/kern/lock.c":1616, 0xfffffc00003c3620]
  4 xnaintr(0xfffffc00005a5158, 0x2, 0xffffffffb53ef238, 0xfffffc000068a754,\
0xfffffc000055891d) ["../../../../src/kernel/io/dec/netif/if_xna.c":1077, 0x\
fffffc000046f384]
  5 _XentInt(0x2, 0xfffffc0000447174, 0xfffffc00005b7d40, 0x2, 0x0) ["../../\
  6 swap_ipl(0x2, 0xfffffc0000447174, 0xfffffc00005b7d40, 0x2, 0x0) ["../../\
  7 boot(0x0, 0x0, 0xffffffffa52c6000, 0xffffffffb53ef1f8, 0xfffffc00003bf4f\
c) ["../../../../src/kernel/arch/alpha/machdep.c":1434, 0xfffffc000044280c]
  8 panic(s = 0xfffffc0000558ad0 = "simple_lock: uninitialized lock") ["../.\
  9 simple_lock_fault(slp = 0xffffffffa52c6000, state = 1719, caller = 0xfff\
ffc00003734c4, arg = (nil), fmt = (nil), error = 0xfffffc0000558ad0 = "simple\
_lock: uninitialized lock") ["../../../../src/kernel/kern/lock.c":1558, 0xfff\
ffc00003c34ec]
 10 simple_lock_valid_violation(slp = 0xfffffc00003734c4, state = 0, caller \
= (nil)) ["../../../../src/kernel/kern/lock.c":1584, 0xfffffc00003c3578]
 11 pgrp_ref(0xffffffffa52c6000, 0x0, 0xfffffc000023ee20, 0x6b7, 0xfffffc000\
05e1080) ["../../../../src/kernel/bsd/kern_proc.c":561, 0xfffffc00003734c4]
 12 exit(0xffffffffb53ef740, 0x100, 0x1, 0xffffffffa42e5e80, 0x1) ["../../..\
/../src/kernel/bsd/kern_exit.c":868, 0xfffffc000023ef30]
 13 rexit(0xffffffff814d2d80, 0xffffffffb53ef758, 0xffffffffb53ef8b8, 0x1000\
00001, 0x0) ["../../../../src/kernel/bsd/kern_exit.c":546, 0xfffffc000023e7dc]
 14 syscall(0xffffffffb53ec000, 0xfffffc000068a300, 0x0, 0x51, 0x1) ["../../\
 15 _Xsyscall(0x8, 0x3ff800e6938, 0x14000d0f0, 0x1, 0x11ffffc18) ["../../../\
(dbx) p *pmsgbuf (7)
struct {
    msg_magic = 405601
    msg_bufx = 701
    msg_bufr = 134
    msg_bufc = "0.64.143, errno 22
NFS server: stale file handle fs(742,645286) file 573 gen 32779
getattr, client address = 16.140.64.143, errno 22
simple_lock: uninitialized lock
pc of caller: 0xfffffc00003734c4
lock address: 0xffffffffa52c6000
lock class name: (unknown_simple_lock)
current lock state: 0x00000000e0e9b04a (cpu=0,pc=0xfffffc00e0e9b048,free)
panic (cpu 0): simple_lock: uninitialized lock
simple_lock: hierarchy violation
pc of caller: 0xfffffc000046f384
lock address: 0xfffffc00006292f0
lock info addr: 0xfffffc0000672cc0
lock class name: xna_softc.lk_xna_softc
class already locked: session.s_fpgrp_lock
.
.
.
}
(dbx) quit
cpu_ip_intr: panic request
This panic string indicates that this CPU was not the one that started the system panic. This CPU was requested to panic and stop operation.
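On a system with several CPUs, the same check can be applied to the cpu_panicstr member of each machine_slot entry. The following Python sketch is an illustration only and assumes you have already copied the per-CPU strings out of dbx into a dictionary. The function name initiating_cpus is hypothetical, and the string shown for CPU 0 is an assumption taken from the panicstr value and the "panic (cpu 0)" line in the message buffer; the sample session does not display machine_slot[0].

```python
# String that marks a CPU that was asked to panic by another CPU.
PANIC_REQUEST = "cpu_ip_intr: panic request"


def initiating_cpus(panic_strings):
    """Return the CPU numbers whose panic string is not a panic request.

    panic_strings maps a CPU number to the cpu_panicstr text copied from
    machine_slot[N]; use None for a CPU that has no panic string.
    """
    return sorted(
        cpu
        for cpu, text in panic_strings.items()
        if text and text != PANIC_REQUEST
    )


# CPU 1 is taken from the sample session; CPU 0 is an assumed value.
copied_from_dbx = {
    0: "simple_lock: uninitialized lock",
    1: "cpu_ip_intr: panic request",
}
print(initiating_cpus(copied_from_dbx))  # [0]
```

The result should agree with the value of the paniccpu variable, which is 0 in the sample session.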
Notice that the panic function appears twice in the stack trace. The series of events that resulted in the first call to the panic function caused the crash. The events that occurred after the first call to the panic function were performed after the system was corrupt and during an attempt to save data. Normally, any events that occur after the initial call to the panic function will not help you determine why the system crashed.
In this example, the problem is in the pgrp_ref function on line 561 in the kern_proc.c file.
If you follow the stack trace after the pgrp_ref function, you can see that the pgrp_ref function calls the simple_lock_valid_violation function. This function displays information about simple locks, which might be helpful in determining why the system crashed.
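When a stack trace contains more than one call to the panic function, the frames of interest are the ones with higher numbers than the first (earliest) panic call, because dbx numbers frames from the most recent call outward. The following Python sketch illustrates that rule on a saved trace. It is not part of dbx; the function names are hypothetical, and the script assumes that each frame starts on its own line with an optional > marker, the frame number, and the function name.

```python
import re

# Start of a frame line, for example "> 0 boot(" or " 11 pgrp_ref(".
FRAME_START = re.compile(r"^\s*>?\s*(\d+)\s+(\w+)\(")


def parse_frames(lines):
    """Return (frame_number, function_name) pairs; other lines are skipped."""
    frames = []
    for line in lines:
        match = FRAME_START.match(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def callers_of_first_panic(frames):
    """Return the frames that led to the earliest call to panic."""
    panic_levels = [level for level, name in frames if name == "panic"]
    if not panic_levels:
        return []
    first_panic = max(panic_levels)  # highest number is the earliest call
    return [(level, name) for level, name in frames if level > first_panic]


# Frame lines abbreviated from the sample session; arguments are omitted.
trace = [
    "> 0 boot(", "  1 panic(", "  2 simple_lock_fault(",
    "  3 simple_lock_hierarchy_violation(", "  4 xnaintr(", "  5 _XentInt(",
    "  6 swap_ipl(", "  7 boot(", "  8 panic(", "  9 simple_lock_fault(",
    " 10 simple_lock_valid_violation(", " 11 pgrp_ref(", " 12 exit(",
]
for level, name in callers_of_first_panic(parse_frames(trace)):
    print(level, name)
```

For the sample trace, the script skips frames 0 through 8 and lists the simple lock checking functions, followed by pgrp_ref and its callers. Deciding which of those frames contains the actual defect still requires reading the source, as described above.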