MORE INFORMATION
The OS/2 trap screen displays the values of the registers at the time
of failure. Specifically, CS:IP gives the selector:offset address of
the instruction that was executing when the trap occurred. Because OS/2
runs under Intel's protection mechanism, CS holds a selector rather than
a physical memory address, so it is sometimes hard to determine which
component's code CS:IP points to.
Determining the Ring Level
The two least significant bits of CS indicate which of Intel's ring
levels the error occurred in. This narrows down which component failed
and determines how to look for that component later. To determine the
ring, write CS in binary and examine the least significant two bits.
If they are both on (ones), the failure occurred at ring 3. If they are
both off (zeros), the failure occurred at ring 0.
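The bit test above can be sketched in a few lines of Python. This is an illustration only: the function name decode_selector is hypothetical, and the table-indicator and index fields are decoded from the standard Intel selector layout rather than from anything shown on the trap screen. The value 0047 is the NODE_TEXT selector used later in this article.

```python
def decode_selector(cs: int) -> dict:
    """Split a 16-bit protected-mode selector into its three fields."""
    return {
        "ring": cs & 0x3,                        # bits 0-1: privilege (ring) level
        "table": "LDT" if cs & 0x4 else "GDT",   # bit 2: which descriptor table
        "index": cs >> 3,                        # bits 3-15: descriptor index
    }


# 0x0047 is binary 0000 0000 0100 0111: the low two bits are both ones.
print(decode_selector(0x0047))
```

A selector whose low two bits are both zero, such as 0120, decodes to ring 0 in the same way.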
Ring 3 encompasses application-level code. This includes all PM
applications and their DLLs, all LAN Manager services, and anything
started with the RUN= line in CONFIG.SYS. Examples include the Netlogon
service, print manager, printer drivers, print queue drivers, any DLLs,
and Presentation Manager itself.
Ring 0 encompasses kernel-level code. This includes the OS/2 kernel,
all device drivers, and installable file systems. Examples include
OS2KRNL (the OS/2 kernel), HPFS386.IFS, IBMTOK.OS2, TCPDRV.OS2, and
NETWKSTA.SYS.
Usually a ring 3 trap allows you to kill the offending application and
let the operating system continue running. Ring 0 traps require that you
reboot the machine.
Determining the Failing Component
The value of CSLIM in the trap screen represents the limit of the
current code segment. The length of an application's code segment(s)
does not change (with an exception to be discussed shortly). It is
also very unlikely that two components would have code segments of
the same length. Therefore, if we can determine which component has a
code segment whose length matches CSLIM, we will have identified the
failing application or driver.
Use the fact that you are in ring 0 or ring 3, along with any other
evidence that narrows the search, and start with the candidates most
likely to have failed. Remember to check any DLLs an application might
be linked with (PSTAT.EXE will tell you which DLLs an application uses
if you run PSTAT while the application is running).
Determining the Lengths of an Application's Code Segment(s)
- EXEHDR is a utility that provides lots of layout information about
an executable, .DLL, or device driver--including the amount of memory
it needs for its code segments. Here is a sample output of EXEHDR on
LAN Manager's node service executable, NODE.EXE:
F:\LANMAN\SERVICES>exehdr node.exe
Microsoft (R) EXE File Header Utility Version 2.01
Copyright (C) Microsoft Corp 1985-1990. All rights reserved.
Module: node
Description: node.exe
Data: NONSHARED
Initial CS:IP: seg 1 offset 4704
Initial SS:SP: seg 5 offset 0000
Extra stack allocation: 2710 bytes
DGROUP: seg 5
no. type address  file  mem   flags
  1 CODE 00000200 0a06a 0a06a
  2 DATA 0000a600 000c6 00100
  3 DATA 00000000 00000 00850
  4 DATA 00000000 00000 007c4
  5 DATA 0000a800 0243a 02480
The last portion lists the memory segments this application requires.
The "mem" field gives the size of each segment in memory, in
hexadecimal. Notice that segment 1, the only CODE segment, has a size
of A06A, which matches the CSLIM in the trap at the beginning of this
article. That trap almost certainly came from NODE.EXE.
Note: For ring 0 components, the CSLIM in the trap dump may be off by
1 or 2 bytes from what EXEHDR reports. In fact, ring 0 components can
change their code segment sizes altogether, making the component a
little harder to narrow down.
- Although EXEHDR is by far the quickest way to determine a code
segment's length (and for ring 3 applications it is all that is
required), the length can also be found with the OS/2 kernel debugger.
Load the same versions of the software that were running on the
machine that failed. Load the OS/2 kernel debugger along with the
symbol files for the suspected components. Break into the kernel
debugger and activate a suspect's map using the debugger's "w" command.
Now use the "lg" command to list the memory segments used by this
component. You will see something like this:
#lg
0047:0000 NODE_TEXT
00A7:0000 FAR_BSS
00C7:0000 DGROUP
#
This is a list of the memory segments the application is using, along
with their associated selectors. We can dump the global or local
descriptor table entries to find the limit of the segment behind each
selector. For instance, to find the size of the segment for selector 47
above (labeled "NODE_TEXT"), we can use the "dg" command:
#dg 47
LDT
0047 Code Bas=994980 Lim=A06A DPL=3 P RE A
#
"Lim=A06A" tells us that the size of the segment associated with
this selector is A06A and thus matches our CSLIM value in the trap.
As mentioned earlier, some ring 0 components can significantly
change the size of their memory segments from what is reported by
EXEHDR. This change can only make the segment smaller. Therefore,
if the CSLIM on your trap dump is 912E, you need only use the kernel
debugger to test components for which EXEHDR reports a code segment
larger than 912E.
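The comparison described in the two methods above can be sketched as a small Python script. This is a minimal illustration, not a supported tool: the function names, the regular expression, and the two-byte slack for ring 0 are assumptions made for this sketch, and the sample text is the segment table from the NODE.EXE output above.

```python
import re

# Segment table copied from the EXEHDR sample for NODE.EXE.
SAMPLE = """\
no. type address file mem flags
1 CODE 00000200 0a06a 0a06a
2 DATA 0000a600 000c6 00100
3 DATA 00000000 00000 00850
4 DATA 00000000 00000 007c4
5 DATA 0000a800 0243a 02480
"""

# Columns: segment number, type, file address, size in file, size in memory.
ROW = re.compile(
    r"^\s*(\d+)\s+(CODE|DATA)\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)"
)


def code_segment_sizes(exehdr_text):
    """Return (segment number, mem size) for every CODE segment listed."""
    sizes = []
    for line in exehdr_text.splitlines():
        m = ROW.match(line)
        if m and m.group(2) == "CODE":
            sizes.append((int(m.group(1)), int(m.group(5), 16)))
    return sizes


def could_match(cslim, mem_size, ring, slack=2):
    """Decide whether a segment is still a candidate for the trap's CSLIM.

    Ring 3: the size must equal CSLIM.
    Ring 0: the segment may have shrunk, and the value may be off by a
    byte or two, so any size not smaller than CSLIM (minus slack) is kept.
    """
    if ring == 3:
        return mem_size == cslim
    return mem_size + slack >= cslim


for number, size in code_segment_sizes(SAMPLE):
    print(f"segment {number}: mem={size:04X} "
          f"ring 3 candidate={could_match(0xA06A, size, ring=3)}")
```

For a ring 0 trap, a candidate that passes could_match still needs to be confirmed with the kernel debugger, as described above.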
Determining the Failing Function
Once you are in the kernel debugger, have found the failing component,
have loaded the component's symbols, and have found the appropriate
code segment, simply unassemble the segment's selector at the offset
specified by IP in the trap dump. In our example we might enter the
following command:
#U 47:48C1 L1
0047:48C1 C5760A LDS SI,DWORD PTR [BP+0A]
#
Now use the "LN" command to list the nearest symbols. This will most
likely give you the name of the failing function. In our case we get:
#LN
0047:48B8 _strncpy + C | 0047:48E2 _strncmp - 1E
#
This tells us that we are 12 bytes (hex C) past the _strncpy symbol and
30 bytes (hex 1E) before the _strncmp symbol. This means we are in the
middle of the code for strncpy.
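The nearest-symbol lookup can be reproduced by hand or with a short Python sketch such as the one below. It is an illustration of the arithmetic only, not the debugger's actual implementation; the function names are hypothetical, and the two symbol offsets are taken from the sample output above. Note that 48B8 + C = 48C4, so the displacements in the sample are measured from offset 48C4, three bytes past the 48C1 that was unassembled.

```python
from bisect import bisect_right

# (offset, name) pairs sorted by offset, taken from the LN sample output.
SYMBOLS = [(0x48B8, "_strncpy"), (0x48E2, "_strncmp")]


def nearest_symbols(symbols, offset):
    """Return the symbol at or before offset and the first symbol after it."""
    offsets = [off for off, _ in symbols]
    i = bisect_right(offsets, offset)
    before = symbols[i - 1] if i > 0 else None
    after = symbols[i] if i < len(symbols) else None
    return before, after


def describe(symbols, offset):
    """Format the result in the same 'name + disp | name - disp' style."""
    before, after = nearest_symbols(symbols, offset)
    parts = []
    if before:
        parts.append(f"{before[1]} + {offset - before[0]:X}")
    if after:
        parts.append(f"{after[1]} - {after[0] - offset:X}")
    return " | ".join(parts)


print(describe(SYMBOLS, 0x48C4))  # offset implied by the sample displacements
print(describe(SYMBOLS, 0x48C1))  # IP value used with the U command
```

Either offset falls between the two symbols, so the conclusion that the failure is inside strncpy holds in both cases.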
At this point you should be able to find the source code for the
function, and you may learn more about why the failure occurred by
analyzing the values in the other registers. You can also use
breakpoints and other debugging techniques to determine the reason for
the failure. Seeing why and how the function is used may also suggest
ways of avoiding it. This can often help keep production environments
from hanging.
Note that this is only one way to understand problems better and should
not be considered all the information needed to resolve a bug. Ultimately,
a developer will most likely need a reproduction scenario to actually fix
the bug. If this is not possible, a network trace may also be needed.
For more information on traps, Intel architecture, and machine code,
see any of Intel's Microprocessor Programmer's Reference Manuals. For
information on the OS/2 kernel debugger, see Chapter 11 of the Microsoft
OS/2 Device Driver Reference.