Determining the Source of OS/2 Traps (97483)

MORE INFORMATION

The OS/2 trap screen displays the values of the registers at the time of failure. Specifically, CS:IP gives the selector:offset address of the instruction that was trying to execute. Unfortunately, because OS/2 uses Intel's protection mechanism (selectors) instead of physical memory addresses, it is sometimes hard to determine which component's code CS:IP points to.

Determining the Ring Level

The two least significant bits of CS indicate the Intel ring level at which the error occurred. This narrows down which component failed and determines how to locate that component later. To determine the ring, write CS in binary and examine its two least significant bits. If both are on (ones), the failure occurred at ring 3. If both are off (zeros), the failure occurred at ring 0.
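
The bit test described above can be sketched in a few lines of Python. This is only an illustration: the function name `decode_selector` is made up for this example, and the table and index fields come from the standard Intel selector layout rather than from the trap screen itself.

```python
def decode_selector(cs: int) -> dict:
    """Split a 16-bit protected-mode selector into its three fields."""
    return {
        "ring": cs & 0b11,                        # bits 0-1: privilege (ring) level
        "table": "LDT" if cs & 0b100 else "GDT",  # bit 2: which descriptor table
        "index": cs >> 3,                         # bits 3-15: entry within that table
    }


# Selector 0047 is the one used in the kernel debugger examples in this article.
print(decode_selector(0x0047))  # {'ring': 3, 'table': 'LDT', 'index': 8}
```

A CS value of 0047 therefore indicates a ring 3 failure, which agrees with the DPL=3 shown for that selector in the descriptor dump later in this article.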

Ring 3 encompasses application level code. This includes all PM applications and their DLLs, all LAN Manager services and anything started with the RUN= line in CONFIG.SYS. Examples include the Netlogon service, print manager, printer drivers, print queue drivers, any DLLs, and Presentation Manager itself.

Ring 0 encompasses kernel level code. This includes the OS/2 kernel, all device drivers and installable file systems. Examples include OS2KRNL (the OS/2 kernel), HPFS386.IFS, IBMTOK.OS2, TCPDRV.OS2, and NETWKSTA.SYS.

Usually a ring 3 trap allows you to kill the offending application and let the operating system continue running. Ring 0 traps require that you reboot the machine.

Determining the Failing Component

The value of CSLIM in the trap screen represents the limit of the current code segment. The length of an application's code segment(s) does not change (with an exception to be discussed shortly). It is also very unlikely that two components would have code segments with the same lengths. Therefore, if we can determine which component has a code segment that matches CSLIM we will have determined the failing application or driver.

Using the ring level (0 or 3), together with any other evidence that narrows the focus of your search, start with the candidates most likely to have failed. Remember to check any DLLs an application might be linked with (PSTAT.EXE will tell you which DLLs an application uses if you run PSTAT while your application is running).

Determining the Lengths of an Application's Code Segment(s)

EXEHDR is a utility that displays detailed layout information about an executable, DLL, or device driver, including the amount of memory it needs for its code segments. Here is sample EXEHDR output for LAN Manager's node service executable, NODE.EXE:
```
      F:\LANMAN\SERVICES>exehdr node.exe

  Microsoft (R) EXE File Header Utility  Version 2.01
  Copyright (C) Microsoft Corp 1985-1990.  All rights reserved.

  Module:                   node
  Description:              node.exe
  Data:                     NONSHARED
  Initial CS:IP:            seg   1 offset 4704
  Initial SS:SP:            seg   5 offset 0000
  Extra stack allocation:   2710 bytes
  DGROUP:                   seg   5

  no. type address  file  mem   flags

    1 CODE 00000200 0a06a 0a06a
    2 DATA 0000a600 000c6 00100
    3 DATA 00000000 00000 00850
    4 DATA 00000000 00000 007c4
    5 DATA 0000a800 0243a 02480
```
The last portion lists the memory segments this application requires. The "mem" field gives the size of each segment in memory. Notice that segment 1 has a size of A06A, which matches the CSLIM in the trap at the beginning of this article. That trap almost certainly came from NODE.EXE.
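
When many files have to be checked, the comparison can be automated. The following is a minimal sketch, assuming the EXEHDR output has been captured as text and that segment rows always follow the "no. type address file mem" column order shown above. The function names are hypothetical and not part of EXEHDR.

```python
# Segment table copied from the NODE.EXE listing above.
SAMPLE = """\
  no. type address  file  mem   flags

    1 CODE 00000200 0a06a 0a06a
    2 DATA 0000a600 000c6 00100
    3 DATA 00000000 00000 00850
    4 DATA 00000000 00000 007c4
    5 DATA 0000a800 0243a 02480
"""


def code_segment_sizes(exehdr_text: str) -> dict:
    """Map segment number -> "mem" size for every CODE row in the table."""
    sizes = {}
    for line in exehdr_text.splitlines():
        fields = line.split()
        # A segment row starts with a decimal segment number; the header row
        # ("no. type ...") and blank lines are skipped by this check.
        if len(fields) >= 5 and fields[0].isdigit() and fields[1] == "CODE":
            sizes[int(fields[0])] = int(fields[4], 16)  # "mem" column is hex
    return sizes


def matching_segments(exehdr_text: str, cslim: int) -> list:
    """Return the code segment numbers whose size equals CSLIM exactly."""
    return [n for n, size in code_segment_sizes(exehdr_text).items() if size == cslim]


print(matching_segments(SAMPLE, 0xA06A))  # [1]
```

An empty list means that no code segment in the file matches, so the search moves on to the next candidate.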

Note: For ring 0 components, the CSLIM in the trap dump may be off by 1 or 2 bytes from what EXEHDR reports. In fact, ring 0 components can change their code segment sizes altogether, which makes the component a little harder to narrow down.

Although EXEHDR is by far the quickest way to determine a code segment's length (and for ring 3 applications it is all that is required), the length can also be found using the OS/2 kernel debugger. Install the same version of the software that was running on the machine that failed. Load the OS/2 kernel debugger along with the symbol files for the suspected components. Break into the kernel debugger and activate a suspect's map using the kernel debugger's "w" command. Now use the "lg" command to list the memory segments used by this component. You will see something like this:
```
      #lg
      0047:0000 NODE_TEXT
      00A7:0000 FAR_BSS
      00C7:0000 DGROUP
      #
```

This is a list of the memory segments the application is using, along with their associated selectors. We can dump entries from the global or local descriptor table to find the length of the segment behind each selector. For instance, to find the size of the segment for selector 47 above (labeled "NODE_TEXT"), we can use the "dg" command:

```
   #dg 47
   LDT
   0047  Code    Bas=994980 Lim=A06A DPL=3 P  RE    A
   #
```

"Lim=A06A" tells us that the size of the segment associated with this selector is A06A and thus matches our CSLIM value in the trap.
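
If the debugger session is logged to a file, the descriptor line can be picked apart mechanically. The sketch below assumes the exact field layout of the sample line above (selector, type, Bas=, Lim=, DPL=). Other descriptor types may print differently, and `parse_descriptor` is a hypothetical helper, not a debugger feature.

```python
import re

DG_LINE = "0047  Code    Bas=994980 Lim=A06A DPL=3 P  RE    A"

# selector, type, base, limit and DPL, in the order they appear on the line
DG_PATTERN = re.compile(
    r"\s*([0-9A-Fa-f]{4})\s+(\w+)\s+Bas=([0-9A-Fa-f]+)\s+Lim=([0-9A-Fa-f]+)\s+DPL=(\d)"
)


def parse_descriptor(line: str) -> dict:
    """Extract the numeric fields from one descriptor line of "dg" output."""
    m = DG_PATTERN.match(line)
    if m is None:
        raise ValueError("not a descriptor line: " + repr(line))
    selector, kind, base, limit, dpl = m.groups()
    return {
        "selector": int(selector, 16),
        "type": kind,
        "base": int(base, 16),
        "limit": int(limit, 16),
        "dpl": int(dpl),
    }


desc = parse_descriptor(DG_LINE)
print(desc["limit"] == 0xA06A, desc["dpl"])  # True 3
```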

As mentioned earlier, some ring 0 components can significantly change the size of their memory segments from what EXEHDR reports. This change can only make the segment smaller. Therefore, if the CSLIM in your trap dump is 912E, the only components you need to check with the kernel debugger are those for which EXEHDR reports code segments larger than 912E.
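
The two ring 0 rules (a CSLIM that is off by 1 or 2 bytes, and segments that can only shrink) can be combined into a simple triage step. This is a sketch under stated assumptions: the driver names and sizes are invented for illustration, and the sizes are assumed to have already been collected from EXEHDR.

```python
def triage(cslim: int, code_sizes: dict, slack: int = 2):
    """Sort ring 0 candidates by how their EXEHDR code sizes relate to CSLIM.

    code_sizes maps a component name to the list of code segment sizes that
    EXEHDR reports for it. Returns (likely, check_in_debugger).
    """
    likely, check_in_debugger = [], []
    for name, sizes in code_sizes.items():
        if any(abs(size - cslim) <= slack for size in sizes):
            likely.append(name)             # within the 1 or 2 byte difference
        elif any(size > cslim for size in sizes):
            check_in_debugger.append(name)  # segment may have been shrunk
        # components whose segments are all smaller than CSLIM are ruled out
    return likely, check_in_debugger


# Hypothetical drivers and sizes, for illustration only.
candidates = {
    "DRIVER_A.SYS": [0x912F],
    "DRIVER_B.OS2": [0xA400],
    "DRIVER_C.SYS": [0x4000],
}
print(triage(0x912E, candidates))  # (['DRIVER_A.SYS'], ['DRIVER_B.OS2'])
```

Only the components in the second list need to be loaded and examined with the "lg" and "dg" commands.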

Determining the Failing Function

Once you are in the kernel debugger, have found the failing component, have loaded its symbols, and have found the appropriate code segment, unassemble the code at that segment's selector and the offset given by IP in the trap dump. In our example we might enter the following command:

```
  #U 47:48C1 L1
  0047:48C1 C5760A        LDS     SI,DWORD PTR [BP+0A]
  #
```

Now use the "LN" command to list the nearest symbols. This will most likely give you the name of the failing function. In our case we get:

  #LN
  0047:48B8 _strncpy + C | 0047:48E2 _strncmp - 1E
  #

This tells us that we are 12 bytes past the _strncpy symbol and 30 bytes before the _strncmp symbol. This means we are in the middle of the code for strncpy.
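
The hexadecimal arithmetic behind that reading can be checked with a short script. It is a sketch that assumes the "symbol + offset | symbol - offset" layout of the sample line above, and `parse_ln` is a hypothetical name.

```python
import re

LN_LINE = "0047:48B8 _strncpy + C | 0047:48E2 _strncmp - 1E"

# selector:offset, symbol name, sign, hex distance
LN_PATTERN = r"([0-9A-Fa-f]{4}):([0-9A-Fa-f]{4})\s+(\S+)\s+([+-])\s+([0-9A-Fa-f]+)"


def parse_ln(line: str) -> list:
    """Return (symbol, sign, distance, address) for each entry on an LN line."""
    results = []
    for _selector, offset, name, sign, delta in re.findall(LN_PATTERN, line):
        base = int(offset, 16)
        distance = int(delta, 16)
        address = base + distance if sign == "+" else base - distance
        results.append((name, sign, distance, address))
    return results


for name, sign, distance, address in parse_ln(LN_LINE):
    where = "past" if sign == "+" else "before"
    print(f"{distance} bytes {where} {name}: offset {address:04X}")
# 12 bytes past _strncpy: offset 48C4
# 30 bytes before _strncmp: offset 48C4
```

Both entries resolve to offset 48C4, which is three bytes past 48C1. This is presumably because the debugger's current address had already advanced past the 3-byte instruction that was just unassembled.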

At this point you should be able to find the source code for the function, and you may get more information on why the failure occurred by analyzing the values in the other registers. You can also use breakpoints and other debugging techniques to determine the reason for the failure. Examining why and how the function is used may also reveal ways to avoid calling it. This can often help keep production environments from hanging.

Note that this is only one way to understand problems better and should not be considered all the information needed to resolve a bug. Ultimately, a developer will most likely need a reproduction scenario to actually fix the bug. If this is not possible, a network trace may also be needed.

For more information on traps, Intel architecture, and machine code, see any of Intel's Microprocessor Programmer's Reference Manuals. For information on the OS/2 kernel debugger, see Chapter 11 of the Microsoft OS/2 Device Driver Reference.