
sched_stat(8)

NAME

sched_stat - Displays CPU usage and process-scheduling statistics for SMP and NUMA platforms

SYNOPSIS

/usr/sbin/sched_stat [-l] [-s] [-f] [-u] [-R] [command [cmd_arg]...]

OPTIONS

-f  Prints the count of calls that are not multiprocessor safe and are
    therefore funneled to the master CPU. For example:

        Funnelling counts
            unix master calls       11174
            resulting blocks         2876

    The impact of funneled calls on the master CPU needs to be taken
    into account when evaluating statistics for the master CPU.

-l  Prints scheduler load-balancing statistics. For example:

          Scheduler Load Balancing         |    5-second averages
             steal          idle  desired  | current interrupt    RT
        cpu   trys  steals steals    load  |    load      %        %
        -----+--------------------------------------------------------
           0 |  288      3  20609    0.000 |  0.000    0.454    0.156
           1 |  615      6  21359    0.000 |  0.000    0.002    0.203
           2 |  996      4  20135    0.000 |  0.001    0.000    0.237
           3 | 1302      4  16195    0.000 |  0.001    0.000    0.330
           6 |    5      0   3029    0.000 |  0.000    0.000    0.034
           .
           .
           .

    In the displayed table, each row contains per-CPU information as
    follows:

    cpu
        The number identifier of the CPU.

    steal tries
        The number of attempts made to steal processes/threads from
        other CPUs when the CPU was not idle.

    steals
        The number of processes/threads actually stolen from other CPUs
        when the CPU was not idle.

    idle steals
        The number of processes/threads stolen from other CPUs when the
        CPU was idle.

    desired load
        The number of time slices that should be used on this CPU for
        running timeshare threads. This information is calculated by
        comparing the current load, interrupt %, and RT % statistics
        obtained for this CPU with those obtained for other CPUs in the
        same PAG. When current load is less than desired load, the
        scheduler will attempt to migrate timeshare threads to this CPU
        in order to better balance the timeshare workload among CPUs in
        the same PAG. See DESCRIPTION for information about PAGs.

    current load
        Over the last five seconds, the average number of time slices
        used to run timeshare threads on this CPU.

    interrupt %
        Over the last five seconds, the average percentage of time
        slices that this CPU spent in interrupt context.

    RT %
        Over the last five seconds, the average percentage of time
        slices that this CPU used to run threads according to FIFO or
        round-robin policy.

-R  Prints information about CPU locality in two tables:

    Radtab
        Shows the order of preference (in terms of memory affinity)
        that exists between a CPU and different RADs. Order of
        preference indicates, for a given home RAD, the ranking of
        other RADs in terms of increasing physical distance from that
        home RAD. If a process or thread needs more memory or needs to
        be scheduled on a RAD other than its home RAD, the kernel
        automatically searches RADs for additional memory or CPU cycles
        in the order of preference shown in this table.

    Hoptab
        Shows the distance (number of hops) between different RADs and,
        by association, between CPUs. The information in this table is
        coarser-grained than in the preceding Radtab table and more
        relevant to NUMA programming choices. For example, the
        expression RAD_DIST_LOCAL + 2 indicates RADs that are no more
        than two hops from a thread's home RAD.

    For example (a small, switchless mesh NUMA system):

        Radtab (rads in order of preference)

                         CPU #
        Preference    0    1    2    3
        ------------------------------
             0        0    1    2    3
             1        1    0    3    2
             2        2    3    0    1
             3        3    2    1    0

        Hoptab (hops indexed by rad)

                         CPU #
        To rad #      0    1    2    3
        ------------------------------
             0        0    1    1    2
             1        1    0    2    1
             2        1    2    0    1
             3        2    1    1    0

    In these tables, the CPU identifiers are listed across the top from
    left to right and the RAD identifiers are listed on the left from
    top to bottom.
    For example, if a process running on CPU 2 needs additional memory,
    Radtab indicates that the kernel will search for that memory first
    in RAD 2, then in RAD 3, then in RAD 0, and last in RAD 1. Hoptab
    shows the basis of this preference in that RAD 2 is CPU 2's local
    RAD, RADs 0 and 3 are one hop away, and RAD 1 is two hops away.

    The -R option is useful only on NUMA platforms, such as GS1280 and
    ES80 AlphaServer systems, in which memory latency times vary from
    one RAD to another. The information in these tables is less useful
    for GS80, GS160, and GS320 AlphaServer systems because both coarse-
    and finer-grained memory affinity is the same from any CPU in one
    RAD to any CPU in another RAD; however, the displays can tell you
    which CPUs are in which RAD.

    Make sure that you both maximize the size of your terminal emulator
    window and minimize the font size before using the -R option;
    otherwise, line wrapping will render the tables very difficult to
    read on systems that have many CPUs.

-s  Prints scheduling-dispatch (processor-usage) statistics for each
    CPU. For example:

        Scheduler Dispatch Statistics

        cpu 0       local   global      idle   remote |     total  percent
        -------------------------------------------------------------------
        hot         60827    12868  19158991        0 |  19232686     91.6
        warm           78       21   1542019        0 |   1542118      7.3
        cold          315    27289    184784     7855 |    220243      1.0
        -------------------------------------------------------------------
        total       61220    40178  20885794     7855 |  20995047
        percent       0.3      0.2      99.5      0.0

        cpu 1       local   global      idle   remote |     total  percent
        -------------------------------------------------------------------
        hot         33760    11788  16412544        0 |  16458092     89.5
        warm           66       24   1707014        0 |   1707104      9.3
        cold          201    26191    203513        0 |    229905      1.2
        -------------------------------------------------------------------
        .
        .
        .

    These statistics show the count and percentage of thread context
    switches (times that the kernel switches to a new thread) for the
    following categories:

    local
        Threads scheduled from the CPU's Local Run Queue

    global
        Threads scheduled from the Global Run Queue of the PAG to which
        the CPU belongs

    idle
        Threads scheduled from the Idle CPU Queue of the PAG to which
        the CPU belongs

    remote
        Threads stolen from Global or Local Run Queues in another PAG

    Note that these statistics do not count CPU time slices that were
    used to re-run the same thread.

    Each SMP unit (or RAD on a NUMA system) has a Processor Affinity
    Group (PAG). Each PAG contains the following queues:

    ·  A Global Run Queue from which processes or threads are scheduled
       on the first available CPU

    ·  One or more Local Run Queues from which processes or threads are
       scheduled on a specific CPU

    ·  A queue that contains idle CPUs

    A thread that is handed to an idle CPU goes directly to that CPU
    without first being placed on the other queues. If there is
    insufficient work queued locally to keep the PAG's CPUs busy,
    threads are stolen first from the Global and then the Local Run
    Queues in a remote PAG.

    For each of these categories, statistics are grouped into hot,
    warm, and cold subcategories. The hot statistics show context
    switches to threads that last ran on the CPU only a very short time
    before. The warm statistics show context switches to threads that
    last ran on the CPU a somewhat longer time before. The cold
    statistics indicate context switches to threads that never ran on
    the CPU before.
    These statistics are a measure of how well cache affinity is being
    maintained; that is, how likely it is that the data used by threads
    when they last ran is still in the cache when the threads are
    rescheduled. You cannot evaluate this information without knowledge
    of the type of work being done on the system; maintenance of cache
    affinity can be very important on systems (or processor sets) that
    are dedicated to running certain applications (such as those doing
    high-performance technical computing) but is less critical for
    systems serving a variety of applications and users.

-u  Prints processor-usage statistics for each CPU. For example:

        Processor Usage

         cpu | user  nice system  idle widle |  scalls      intr       csw   tbsyc
        -----+-------------------------------+----------------------------------------
           0 |  0.0   0.0    0.7  99.2   0.1 | 3327337  50861486  41885424  317108
           1 |  0.0   0.0    0.4  99.5   0.1 | 3514438         0  36710149  268667
           2 |  0.0   0.0    0.4  99.5   0.1 | 3182064         0  37384120  257749
           3 |  0.0   0.0    0.4  99.5   0.1 | 3528519         0  36468319  249492
           6 |  0.0   0.0    0.1  99.9   0.0 |  668892     11664  11793053  352294
           7 |  0.0   0.0    0.1  99.9   0.0 |  772821         0   9341527  352319
           8 |  0.0   0.0    0.0 100.0   0.0 |  529050     11724   5717059  347267
           9 |  0.0   0.0    0.0 100.0   0.0 |  492386         0   6603681  351509
           .
           .
           .

    In this table:

    cpu
        The number identifier of the CPU.

    user
        The percentage of time slices spent running threads in user
        context.

    nice
        The percentage of time slices in which lower-priority threads
        were scheduled. These are user-context threads whose priority
        was explicitly lowered by using an interface such as the nice
        command or the class-scheduling software.

    system
        The percentage of time slices spent running threads in system
        context. This work includes servicing of interrupts and system
        calls that are made on behalf of user processes. An unusually
        high percentage in the system category might indicate a system
        bottleneck. Running kprofile and lockinfo provides more
        specific information about where system time is being spent.
        See uprofile(1) and lockinfo(8), respectively, for information
        about these utilities.

    idle
        The percentage of time slices in which no threads were
        scheduled.

    widle
        The percentage of time slices in which available threads were
        blocked by pending I/O and the CPU was idle. If this count is
        unusually high, it suggests that a bottleneck in an I/O channel
        might be causing suboptimal performance.

    scalls
        The count of system calls that were serviced.

    intr
        The count of interrupts that were serviced.

    csw
        The count of thread context switches (thread scheduling
        changes) that completed.

    tbsyc
        The number of times that the translation buffer was
        synchronized.
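Options can be combined, in which case only the requested reports are
produced. For example, the following illustrative command (the 30-second
interval is arbitrary) gathers and prints only the load-balancing and
processor-usage statistics:

    # /usr/sbin/sched_stat -l -u sleep 30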

OPERANDS

command
    The command to be executed by sched_stat.

cmd_arg
    Any arguments to the preceding command.

The command and cmd_arg operands are used to limit the length of time during
which sched_stat gathers statistics. Typically, sleep is specified for command
and some number of seconds is specified for cmd_arg. If you do not specify a
command to set a time interval for statistics gathering, the statistics
reflect what has occurred since the system was last booted.
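For example, the following illustrative invocations (the 60-second interval is
arbitrary) show both forms; the first gathers processor-usage statistics while
sleep 60 runs, and the second reports processor-usage statistics accumulated
since the last boot:

    # /usr/sbin/sched_stat -u sleep 60
    # /usr/sbin/sched_stat -u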

DESCRIPTION

The sched_stat utility helps you determine how well the system load is
distributed among CPUs, what kinds of jobs are getting (or not getting)
sufficient cycles on each CPU, and how well cache affinity is being maintained
for these jobs.

Answers to the following questions influence how a process and its threads are
scheduled:

·  Is the request to be serviced multiprocessor safe?

   If not, the kernel funnels the request to the master CPU. The master CPU
   must reside in the default processor set (which contains all system CPUs if
   none were assigned to user-defined processor sets) and is typically CPU 0;
   however, some platforms permit CPUs other than CPU 0 to be the master CPU.

   Few requests generated by software distributed with the operating system
   need to be funneled to the master CPU, and most of these are associated
   with certain device drivers. However, if the system runs many third-party
   drivers, the number of requests that must be funneled to the master CPU
   might be higher.

·  What is the job priority?

   Job priority influences how frequently a thread is scheduled. Realtime
   requests and interrupts have higher priority than time-share jobs, which
   include the majority of user-mode threads. So, if a significant number of
   CPU cycles are spent servicing realtime requests and interrupts, fewer
   cycles are available for time-share jobs. The default priority for
   time-share jobs can also be changed by using the nice command or the
   runclass command, or through class-scheduling software.

   On a busy system, cache affinity is less likely to be maintained for a
   thread from a time-share job whose priority was lowered, because more time
   is likely to elapse between rescheduling operations for each thread.
   Conversely, cache affinity is more likely to be maintained for threads of a
   higher-priority time-share job because less time elapses between
   rescheduling operations. Note that the scheduler always prioritizes the
   need for low response latency (as demanded by interrupts and realtime
   requests) higher than maintenance of cache affinity, regardless of the
   priority assigned to a time-share job.

·  Are there user-defined restrictions that limit where a process may run?

   If so, the kernel must schedule all threads of that process on CPUs in the
   restricted set. In some cases, user-defined restrictions are explicit RAD
   or CPU bindings specified either in an application or by a command (such as
   runon) that was used to launch the program or reassign one of its threads.
   The set of CPUs where the kernel can schedule a thread is also influenced
   by the presence of user-defined processor sets. If the process was not
   explicitly started in or reassigned to a user-defined processor set, the
   kernel must run it and all of its threads only on CPUs in the default
   processor set.

·  Are any CPUs idle?

   The scheduler is very aggressive in its attempts to steal jobs from other
   CPUs to run on an idle CPU. This means that the scheduler will migrate
   processes or threads across RAD boundaries to give an idle CPU work to do
   unless one of the preceding restrictions is in place to prevent that. For
   example, the scheduler does not cross processor set boundaries when
   stealing work from another CPU, even when a CPU is idle. In general,
   keeping CPUs busy with work has higher priority than maintaining memory or
   cache affinity during load-balancing operations.

Explicit memory-allocation advice provided in application code influences
scheduling only to the extent that the preceding factors do not override that
advice. However, explicit memory-allocation advice does make a difference (and
thereby can improve performance) when CPUs in the processor set where the
program is running are kept busy but are not overloaded.

To gather statistics with sched_stat, you typically follow these steps:

1.  Start a system workload and wait for it to reach a steady state.

2.  Start sched_stat with sleep as the specified command and some number of
    seconds as the specified cmd_arg. This causes sched_stat to gather
    statistics for the length of time it takes the sleep command to execute.

For example, the following command causes sched_stat to collect statistics for
60 seconds and then print a report:

    # /usr/sbin/sched_stat sleep 60

If you include options on the command line, only statistics for the specified
options are reported. If you run sched_stat without any options, all options
except -R are assumed. (See the descriptions of the -f, -l, -s, and -u options
in the OPTIONS section.)
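An illustrative session that follows these steps might look as follows;
my_workload is a placeholder for whatever application you are measuring, and
the 120-second interval is arbitrary. The workload is started and allowed to
reach a steady state, and then dispatch and processor-usage statistics are
collected:

    # ./my_workload &
    # /usr/sbin/sched_stat -s -u sleep 120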

NOTES

Running the sched_stat command has minimal impact on system performance.

RESTRICTIONS

The sched_stat utility is subject to change, without advance notice, from one release to another. The utility is intended mainly for use by other software applications included in the operating system product, kernel developers, and software support representatives. Therefore, sched_stat should be used only interactively; any customer scripts or programs written to depend on its output data or display format might be broken by changes in future versions of the utility or by patches that might be applied to it.

EXIT STATUS

0 (Zero)
    Success.

>0  An error occurred.
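When using sched_stat interactively, you can confirm the exit value
immediately after the command completes. In this illustrative exchange the
report output is omitted, and the 0 shown assumes the command succeeded:

    # /usr/sbin/sched_stat sleep 5
      ...
    # echo $?
    0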

FILES

/dev/sysdev0
    The pseudo driver that is opened by the sched_stat utility for RAD-related
    statistics gathering.

SEE ALSO

Commands: iostat(1), netstat(1), nice(1), renice(1), runclass(1), runon(1),
uprofile(1), vmstat(1), advfsstat(8), collect(8), lockinfo(8), nfsstat(8),
sys_check(8)

Others: numa_intro(3), class_scheduling(4), processor_sets(4)
