
sched_stat(8)

NAME

sched_stat - Displays CPU usage and process-scheduling statistics for SMP and NUMA platforms

SYNOPSIS

/usr/sbin/sched_stat [-l] [-s] [-f] [-u] [-R] [command [cmd_arg]...]

OPTIONS

-f  Prints the count of calls that are not multiprocessor safe and are
    therefore funneled to the master CPU. For example:

        Funnelling counts
            unix master calls       11174
            resulting blocks         2876

    The impact of funneled calls on the master CPU needs to be taken
    into account when evaluating statistics for the master CPU.

-l  Prints scheduler load-balancing statistics. For example:

          Scheduler Load Balancing         |    5-second averages
             steal          idle  desired  | current interrupt    RT
        cpu   trys  steals steals    load  |    load      %        %
        -----+--------------------------------------------------------
           0 |  288      3  20609    0.000 |  0.000    0.454    0.156
           1 |  615      6  21359    0.000 |  0.000    0.002    0.203
           2 |  996      4  20135    0.000 |  0.001    0.000    0.237
           3 | 1302      4  16195    0.000 |  0.001    0.000    0.330
           6 |    5      0   3029    0.000 |  0.000    0.000    0.034
           .
           .
           .

    In the displayed table, each row contains per-CPU information as
    follows:

    cpu
        The number identifier of the CPU.

    steal tries
        The number of attempts made to steal processes/threads from
        other CPUs when the CPU was not idle.

    steals
        The number of processes/threads actually stolen from other CPUs
        when the CPU was not idle.

    idle steals
        The number of processes/threads stolen from other CPUs when the
        CPU was idle.

    desired load
        The number of time slices that should be used on this CPU for
        running timeshare threads. This information is calculated by
        comparing the current load, interrupt %, and RT % statistics
        obtained for this CPU with those obtained for other CPUs in the
        same PAG. When current load is less than desired load, the
        scheduler will attempt to migrate timeshare threads to this CPU
        in order to better balance the timeshare workload among CPUs in
        the same PAG. See DESCRIPTION for information about PAGs.

    current load
        Over the last five seconds, the average number of time slices
        used to run timeshare threads on this CPU.

    interrupt %
        Over the last five seconds, the average percentage of time
        slices that this CPU spent in interrupt context.

    RT %
        Over the last five seconds, the average percentage of time
        slices that this CPU used to run threads according to FIFO or
        round-robin policy.

-R  Prints information about CPU locality in two tables:

    Radtab
        Shows the order of preference (in terms of memory affinity)
        that exists between a CPU and different RADs. Order of
        preference indicates, for a given home RAD, the ranking of
        other RADs in terms of increasing physical distance from that
        home RAD. If a process or thread needs more memory or needs to
        be scheduled on a RAD other than its home RAD, the kernel
        automatically searches RADs for additional memory or CPU cycles
        in the order of preference shown in this table.

    Hoptab
        Shows the distance (number of hops) between different RADs and,
        by association, between CPUs. The information in this table is
        coarser-grained than in the preceding Radtab table and more
        relevant to NUMA programming choices. For example, the
        expression RAD_DIST_LOCAL + 2 indicates RADs that are no more
        than two hops from a thread's home RAD.

    For example (a small, switchless mesh NUMA system):

        Radtab (rads in order of preference)

                         CPU #
        Preference    0    1    2    3
        ------------------------------
             0        0    1    2    3
             1        1    0    3    2
             2        2    3    0    1
             3        3    2    1    0

        Hoptab (hops indexed by rad)

                         CPU #
        To rad #      0    1    2    3
        ------------------------------
             0        0    1    1    2
             1        1    0    2    1
             2        1    2    0    1
             3        2    1    1    0

    In these tables, the CPU identifiers are listed across the top from
    left to right and the RAD identifiers are listed on the left from
    top to bottom.
    For example, if a process running on CPU 2 needs additional memory,
    Radtab indicates that the kernel will search for that memory first
    in RAD 2, then in RAD 3, then in RAD 0, and last in RAD 1. Hoptab
    shows the basis of this preference in that RAD 2 is CPU 2's local
    RAD, RADs 0 and 3 are one hop away, and RAD 1 is two hops away.

    The -R option is useful only on NUMA platforms, such as GS1280 and
    ES80 AlphaServer systems, in which memory latency times vary from
    one RAD to another. The information in these tables is less useful
    for GS80, GS160, and GS320 AlphaServer systems because both coarse-
    and finer-grained memory affinity is the same from any CPU in one
    RAD to any CPU in another RAD; however, the displays can tell you
    which CPUs are in which RAD.

    Make sure that you both maximize the size of your terminal emulator
    window and minimize the font size before using the -R option;
    otherwise, line wrapping will render the tables very difficult to
    read on systems that have many CPUs.

-s  Prints scheduling-dispatch (processor-usage) statistics for each
    CPU. For example:

        Scheduler Dispatch Statistics

        cpu 0       local   global      idle   remote |     total  percent
        -------------------------------------------------------------------
        hot         60827    12868  19158991        0 |  19232686     91.6
        warm           78       21   1542019        0 |   1542118      7.3
        cold          315    27289    184784     7855 |    220243      1.0
        -------------------------------------------------------------------
        total       61220    40178  20885794     7855 |  20995047
        percent       0.3      0.2      99.5      0.0

        cpu 1       local   global      idle   remote |     total  percent
        -------------------------------------------------------------------
        hot         33760    11788  16412544        0 |  16458092     89.5
        warm           66       24   1707014        0 |   1707104      9.3
        cold          201    26191    203513        0 |    229905      1.2
        -------------------------------------------------------------------
        .
        .
        .

    These statistics show the count and percentage of thread context
    switches (times that the kernel switches to a new thread) for the
    following categories:

    local
        Threads scheduled from the CPU's Local Run Queue

    global
        Threads scheduled from the Global Run Queue of the PAG to which
        the CPU belongs

    idle
        Threads scheduled from the Idle CPU Queue of the PAG to which
        the CPU belongs

    remote
        Threads stolen from Global or Local Run Queues in another PAG

    Note that these statistics do not count CPU time slices that were
    used to re-run the same thread.

    Each SMP unit (or RAD on a NUMA system) has a Processor Affinity
    Group (PAG). Each PAG contains the following queues:

    ·  A Global Run Queue from which processes or threads are scheduled
       on the first available CPU

    ·  One or more Local Run Queues from which processes or threads are
       scheduled on a specific CPU

    ·  A queue that contains idle CPUs

    A thread that is handed to an idle CPU goes directly to that CPU
    without first being placed on the other queues. If there is
    insufficient work queued locally to keep the PAG's CPUs busy,
    threads are stolen first from the Global and then the Local Run
    Queues in a remote PAG.

    For each of these categories, statistics are grouped into hot,
    warm, and cold subcategories. The hot statistics show context
    switches to threads that last ran on the CPU only a very short time
    before. The warm statistics show context switches to threads that
    last ran on the CPU a somewhat longer time before. The cold
    statistics indicate context switches to threads that never ran on
    the CPU before.
    These statistics are a measure of how well cache affinity is being
    maintained; that is, how likely it is that the data used by threads
    when they last ran is still in the cache when the threads are
    rescheduled. You cannot evaluate this information without knowledge
    of the type of work being done on the system; maintenance of cache
    affinity can be very important on systems (or processor sets) that
    are dedicated to running certain applications (such as those doing
    high-performance technical computing) but is less critical for
    systems serving a variety of applications and users.

-u  Prints processor-usage statistics for each CPU. For example:

        Processor Usage

         cpu | user  nice system  idle widle |  scalls      intr       csw   tbsyc
        -----+-------------------------------+----------------------------------------
           0 |  0.0   0.0    0.7  99.2   0.1 | 3327337  50861486  41885424  317108
           1 |  0.0   0.0    0.4  99.5   0.1 | 3514438         0  36710149  268667
           2 |  0.0   0.0    0.4  99.5   0.1 | 3182064         0  37384120  257749
           3 |  0.0   0.0    0.4  99.5   0.1 | 3528519         0  36468319  249492
           6 |  0.0   0.0    0.1  99.9   0.0 |  668892     11664  11793053  352294
           7 |  0.0   0.0    0.1  99.9   0.0 |  772821         0   9341527  352319
           8 |  0.0   0.0    0.0 100.0   0.0 |  529050     11724   5717059  347267
           9 |  0.0   0.0    0.0 100.0   0.0 |  492386         0   6603681  351509
           .
           .
           .

    In this table:

    cpu
        The number identifier of the CPU.

    user
        The percentage of time slices spent running threads in user
        context.

    nice
        The percentage of time slices in which lower-priority threads
        were scheduled. These are user-context threads whose priority
        was explicitly lowered by using an interface such as the nice
        command or the class-scheduling software.

    system
        The percentage of time slices spent running threads in system
        context. This work includes servicing of interrupts and system
        calls that are made on behalf of user processes. An unusually
        high percentage in the system category might indicate a system
        bottleneck. Running kprofile and lockinfo provides more
        specific information about where system time is being spent.
        See uprofile(1) and lockinfo(8), respectively, for information
        about these utilities.

    idle
        The percentage of time slices in which no threads were
        scheduled.

    widle
        The percentage of time slices in which available threads were
        blocked by pending I/O and the CPU was idle. If this count is
        unusually high, it suggests that a bottleneck in an I/O channel
        might be causing suboptimal performance.

    scalls
        The count of system calls that were serviced.

    intr
        The count of interrupts that were serviced.

    csw
        The count of thread context switches (thread scheduling
        changes) that completed.

    tbsyc
        The number of times that the translation buffer was
        synchronized.
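Options can be combined, in which case only the requested reports are
produced. For example, the following illustrative command (the 30-second
interval is arbitrary) gathers and prints only the load-balancing and
processor-usage statistics:

    # /usr/sbin/sched_stat -l -u sleep 30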

OPERANDS

command
    The command to be executed by sched_stat.

cmd_arg
    Any arguments to the preceding command.

The command and cmd_arg operands are used to limit the length of time during
which sched_stat gathers statistics. Typically, sleep is specified for command
and some number of seconds is specified for cmd_arg. If you do not specify a
command to set a time interval for statistics gathering, the statistics
reflect what has occurred since the system was last booted.
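For example, the following illustrative invocations (the 60-second interval is
arbitrary) show both forms; the first gathers processor-usage statistics while
sleep 60 runs, and the second reports processor-usage statistics accumulated
since the last boot:

    # /usr/sbin/sched_stat -u sleep 60
    # /usr/sbin/sched_stat -u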

DESCRIPTION

The sched_stat utility helps you determine how well the system load is
distributed among CPUs, what kinds of jobs are getting (or not getting)
sufficient cycles on each CPU, and how well cache affinity is being maintained
for these jobs.

Answers to the following questions influence how a process and its threads are
scheduled:

·  Is the request to be serviced multiprocessor safe?

   If not, the kernel funnels the request to the master CPU. The master CPU
   must reside in the default processor set (which contains all system CPUs if
   none were assigned to user-defined processor sets) and is typically CPU 0;
   however, some platforms permit CPUs other than CPU 0 to be the master CPU.

   Few requests generated by software distributed with the operating system
   need to be funneled to the master CPU, and most of these are associated
   with certain device drivers. However, if the system runs many third-party
   drivers, the number of requests that must be funneled to the master CPU
   might be higher.

·  What is the job priority?

   Job priority influences how frequently a thread is scheduled. Realtime
   requests and interrupts have higher priority than time-share jobs, which
   include the majority of user-mode threads. So, if a significant number of
   CPU cycles are spent servicing realtime requests and interrupts, fewer
   cycles are available for time-share jobs. The default priority for
   time-share jobs can also be changed by using the nice command or the
   runclass command, or through class-scheduling software.

   On a busy system, cache affinity is less likely to be maintained for a
   thread from a time-share job whose priority was lowered, because more time
   is likely to elapse between rescheduling operations for each thread.
   Conversely, cache affinity is more likely to be maintained for threads of a
   higher-priority time-share job because less time elapses between
   rescheduling operations. Note that the scheduler always prioritizes the
   need for low response latency (as demanded by interrupts and realtime
   requests) higher than maintenance of cache affinity, regardless of the
   priority assigned to a time-share job.

·  Are there user-defined restrictions that limit where a process may run?

   If so, the kernel must schedule all threads of that process on CPUs in the
   restricted set. In some cases, user-defined restrictions are explicit RAD
   or CPU bindings specified either in an application or by a command (such as
   runon) that was used to launch the program or reassign one of its threads.
   The set of CPUs where the kernel can schedule a thread is also influenced
   by the presence of user-defined processor sets. If the process was not
   explicitly started in or reassigned to a user-defined processor set, the
   kernel must run it and all of its threads only on CPUs in the default
   processor set.

·  Are any CPUs idle?

   The scheduler is very aggressive in its attempts to steal jobs from other
   CPUs to run on an idle CPU. This means that the scheduler will migrate
   processes or threads across RAD boundaries to give an idle CPU work to do
   unless one of the preceding restrictions is in place to prevent that. For
   example, the scheduler does not cross processor set boundaries when
   stealing work from another CPU, even when a CPU is idle. In general,
   keeping CPUs busy with work has higher priority than maintaining memory or
   cache affinity during load-balancing operations.

Explicit memory-allocation advice provided in application code influences
scheduling only to the extent that the preceding factors do not override that
advice. However, explicit memory-allocation advice does make a difference (and
thereby can improve performance) when CPUs in the processor set where the
program is running are kept busy but are not overloaded.

To gather statistics with sched_stat, you typically follow these steps:

1.  Start a system workload and wait for it to reach a steady state.

2.  Start sched_stat with sleep as the specified command and some number of
    seconds as the specified cmd_arg. This causes sched_stat to gather
    statistics for the length of time it takes the sleep command to execute.

For example, the following command causes sched_stat to collect statistics for
60 seconds and then print a report:

    # /usr/sbin/sched_stat sleep 60

If you include options on the command line, only statistics for the specified
options are reported. If you run sched_stat without any options, all options
except -R are assumed. (See the descriptions of the -f, -l, -s, and -u options
in the OPTIONS section.)
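An illustrative session that follows these steps might look as follows;
my_workload is a placeholder for whatever application you are measuring, and
the 120-second interval is arbitrary. The workload is started and allowed to
reach a steady state, and then dispatch and processor-usage statistics are
collected:

    # ./my_workload &
    # /usr/sbin/sched_stat -s -u sleep 120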

NOTES

Running the sched_stat command has minimal impact on system performance.

RESTRICTIONS

The sched_stat utility is subject to change, without advance notice, from one release to another. The utility is intended mainly for use by other software applications included in the operating system product, kernel developers, and software support representatives. Therefore, sched_stat should be used only interactively; any customer scripts or programs written to depend on its output data or display format might be broken by changes in future versions of the utility or by patches that might be applied to it.

EXIT STATUS

0 (Zero)
    Success.

>0  An error occurred.
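When using sched_stat interactively, you can confirm the exit value
immediately after the command completes. In this illustrative exchange the
report output is omitted, and the 0 shown assumes the command succeeded:

    # /usr/sbin/sched_stat sleep 5
      ...
    # echo $?
    0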

FILES

/dev/sysdev0
    The pseudo driver that is opened by the sched_stat utility for RAD-related
    statistics gathering.

SEE ALSO

Commands: iostat(1), netstat(1), nice(1), renice(1), runclass(1), runon(1),
uprofile(1), vmstat(1), advfsstat(8), collect(8), lockinfo(8), nfsstat(8),
sys_check(8)

Others: numa_intro(3), class_scheduling(4), processor_sets(4)
