numa_intro(3)
NAME
numa_intro - Introduction to NUMA support
DESCRIPTION
NUMA, or Non-Uniform Memory Access, refers to a hardware architectural
feature in modern multiprocessor platforms that attempts to address the
increasing disparity between requirements for processor speed and bandwidth
and the bandwidth capabilities of memory systems, including the
interconnect between processors and memory. NUMA systems address this
problem by grouping resources--processors, I/O buses, and memory--into
building blocks that balance an appropriate number of processors and I/O
buses with a local memory system that delivers the necessary bandwidth. The
local building blocks are combined into a larger system by means of a
system-level interconnect with a platform-specific topology.
The local processor and I/O components on a particular building block can
access their own "local" memory with the lowest possible latency for a
particular system design. The local building block can in turn access the
resources (processors, I/O, and memory) of remote building blocks at the
cost of increased access latency and decreased global access bandwidth. The
term "Non-Uniform Memory Access" refers to the difference in latency
between "local" and "remote" memory accesses that can occur on a NUMA
platform.
Overall system throughput and individual application performance are
optimized on a NUMA platform by maximizing the ratio of local resource
accesses to remote accesses. This is achieved by recognizing and preserving
the "affinity" that processes have for the various resources on the system
building blocks. For this reason, the building blocks are called "Resource
Affinity Domains" or RADs.
RADs are supported only on a class of platforms known as Cache Coherent
NUMA, or CC NUMA, where all memory is accessible and cache coherent with
respect to all processors and I/O buses. The Tru64 UNIX operating system
includes enhancements to optimize system throughput and application
performance on CC NUMA platforms for legacy applications as well as those
that use NUMA-aware APIs. System enhancements to support NUMA are discussed
in the following subsections. Along with system performance monitoring and
tuning facilities, these enhancements allow the operating system to make a
"best effort" to optimize the performance of any given collection of
applications or application components on a CC NUMA platform.
NUMA Enhancements to Basic UNIX Algorithms and Default Behaviors
For NUMA, modifications to basic UNIX algorithms (scheduling, memory
allocation, and so forth) and to default behaviors maximize local accesses
transparently to applications. These modifications, which include the
following, directly benefit legacy and non-NUMA-aware applications that
were designed for uniprocessors or Uniform Memory Access Symmetric
Multiprocessors but run on CC NUMA platforms:
· Topology-aware placement of data
The operating system attempts to allocate memory for application (and
kernel) data on the RAD closest to where the data will be accessed;
or, for data that is globally accessed, the operating system may
allocate memory across the available RADs. When there is insufficient
free memory on optimal RADs, the memory allocations for data may
"overflow" onto nearby RADs.
· Replication of read-only code and data
The operating system will attempt to make a local copy of read-only
text, such as shared library and program code. Kernel code and kernel
read-only data are replicated on all RADs at boot time. If
insufficient free local memory is available, the operating system may
choose to utilize a remote copy rather than wait for free local
memory.
· Memory affinity-aware scheduling
The operating system scheduler takes "cache affinity" into account
when choosing a processor to run a process thread on multiprocessor
platforms. Cache affinity assumes that a process thread builds a
"memory footprint" in a particular processor's cache. On CC NUMA
platforms, the scheduler also takes into account the fact that
processes will have memory allocated on particular RADs, and will
attempt to keep processes running on processors that are in the same
RAD as their memory footprints.
· Load balancing
To minimize the requirement for remote memory allocation (overflow),
the scheduler will take into account memory availability on a RAD as
well as the processor load average for the RAD. Although these two
factors may at times conflict with one another, the scheduler will
attempt to balance the load so that processes run where there are
memory pages as well as processor cycles available. This balancing
involves both the initial selection of a RAD at process creation and
migration of processes or individual pages in response to changing
loads as processes come and go or their resource requirements or
access patterns change.
NUMA Enhancements to Application Programming Interfaces
Application programmers can use new or modified library routines to further
increase local accesses on CC NUMA platforms. Using these APIs, programmers
can write new applications or modify old ones to provide additional
information to the operating system or to take explicit control over the
placement of processes, threads, memory objects, or some combination of these.
Following are tables that list the NUMA library routines that deal with
RADs and RAD sets, processes and threads, memory management, CPUs and CPU
sets, and NUMA Scheduling Groups. Routines are listed alphabetically in
each table, and some routines are listed in more than one table.
For information about NUMA types, structures, and symbolic values, see
numa_types(4). For information about NUMA Scheduling Groups, see
numa_scheduling_groups(4).
RADs and RAD Sets
___________________________________________________________________________________________
Function                Purpose                         Library     Reference Page
___________________________________________________________________________________________
nloc()                  Returns the RAD set that is a   libnuma     nloc(3)
                        specified distance from a
                        resource.
rad_attach_pid()        Attaches a process to a RAD     libnuma     rad_attach_pid(3)
                        (assigns a home RAD but allows
                        execution on other RADs).
rad_bind_pid()          Binds a process to a RAD        libnuma     rad_attach_pid(3)
                        (assigns a home RAD and
                        restricts execution to the
                        home RAD).
rad_foreach()           Scans a RAD set for members     libnuma     rad_foreach(3)
                        and returns the first member
                        found.
rad_get_current_home()  Returns the caller's home RAD.  libnuma     rad_get_current_home(3)
rad_get_cpus()          Returns the set of CPUs that    libnuma     rad_get_num(3)
                        are in a RAD.
rad_get_freemem()       Returns a snapshot of the free  libnuma     rad_get_num(3)
                        memory pages that are in a
                        RAD.
rad_get_info()          Returns information about a     libnuma     rad_get_num(3)
                        RAD, including its state
                        (online or offline) and the
                        number of CPUs and memory
                        pages it contains.
rad_get_max()           Returns the number of RADs in   libnuma     rad_get_num(3)
                        the system. **
rad_get_num()           Returns the number of RADs in   libnuma     rad_get_num(3)
                        the caller's partition. **
rad_get_physmem()       Returns the number of memory    libnuma     rad_get_num(3)
                        pages assigned to a RAD.
rad_get_state()         Reserved for future use.        libnuma     rad_get_num(3)
                        (Currently, RAD state is
                        always set to RAD_ONLINE.)
radaddset()             Adds a RAD to a RAD set.        libnuma     radsetops(3)
radandset()             Performs a logical AND          libnuma     radsetops(3)
                        operation on two RAD sets,
                        storing the result in a RAD
                        set.
radcopyset()            Copies the contents of one RAD  libnuma     radsetops(3)
                        set to another RAD set.
radcountset()           Returns the number of members   libnuma     radsetops(3)
                        in a RAD set.
raddelset()             Removes a RAD from a RAD set.   libnuma     radsetops(3)
raddiffset()            Finds the logical difference    libnuma     radsetops(3)
                        between two RAD sets, storing
                        the result in another RAD set.
rademptyset()           Initializes a RAD set such      libnuma     radsetops(3)
                        that no RADs are included.
radfillset()            Initializes a RAD set such      libnuma     radsetops(3)
                        that it includes all RADs.
radisemptyset()         Tests whether a RAD set is      libnuma     radsetops(3)
                        empty.
radismember()           Tests whether a RAD belongs to  libnuma     radsetops(3)
                        a given RAD set.
radorset()              Performs a logical OR           libnuma     radsetops(3)
                        operation on two RAD sets,
                        storing the result in another
                        RAD set.
radsetcreate()          Allocates a RAD set and sets    libnuma     radsetops(3)
                        it to empty.
radsetdestroy()         Releases the memory allocated   libnuma     radsetops(3)
                        for a RAD set.
radxorset()             Performs a logical XOR          libnuma     radsetops(3)
                        operation on two RAD sets,
                        storing the result in another
                        RAD set.
___________________________________________________________________________________________
** On a partitioned system, the system and the partition are equivalent.
In this case, the operating system returns information only for the
partition in which it is installed.
Processes and Threads
___________________________________________________________________________________________
Function                Purpose                         Library     Reference Page
___________________________________________________________________________________________
nfork()                 Creates a child process that    libnuma     nfork(3)
                        is an exact copy of its parent
                        process. See also the table
                        entry for rad_fork().
nmadvise()              Tells the system what behavior  libnuma     nmadvise(3)
                        to expect from a process with
                        respect to referencing mapped
                        files and shared memory
                        regions.
nsg_attach_pid()        Attaches a process to a NUMA    libnuma     nsg_attach_pid(3)
                        scheduling group.
nsg_detach_pid()        Detaches a process from a NUMA  libnuma     nsg_attach_pid(3)
                        scheduling group.
pthread_nsg_attach()    Attaches a thread to a NUMA     libpthread  pthread_nsg_attach(3)
                        scheduling group.
pthread_nsg_detach()    Detaches a thread from a NUMA   libpthread  pthread_nsg_detach(3)
                        scheduling group.
pthread_rad_attach()    Attaches a thread to a RAD      libpthread  pthread_rad_attach(3)
                        set.
pthread_rad_bind()      Attaches a thread to a RAD set  libpthread  pthread_rad_attach(3)
                        and restricts its execution to
                        the home RAD.
pthread_rad_detach()    Detaches a thread from a RAD    libpthread  pthread_rad_detach(3)
                        set.
rad_attach_pid()        Attaches a process to a RAD     libnuma     rad_attach_pid(3)
                        (assigns a home RAD but allows
                        execution on other RADs).
rad_bind_pid()          Binds a process to a RAD        libnuma     rad_attach_pid(3)
                        (assigns a home RAD and
                        restricts execution to the
                        home RAD).
rad_fork()              Creates a child process on a    libnuma     rad_fork(3)
                        RAD that optionally does not
                        inherit the RAD assignment of
                        its parent. See also the table
                        entry for nfork().
___________________________________________________________________________________________
Memory Management
___________________________________________________________________________________________
Function                Purpose                         Library     Reference Page
___________________________________________________________________________________________
memalloc_attr()         Returns the memory allocation   libnuma     memalloc_attr(3)
                        policy for a RAD set specified
                        by its virtual address.
nacreate()              Sets up an arena for memory     libc        amalloc(3)
                        allocation for use with the
                        amalloc() function. An arena
                        is used in multithreaded
                        programs when there is a need
                        for thread-specific heap
                        memory allocation.
nmadvise()              Tells the system what behavior  libnuma     nmadvise(3)
                        to expect from a process with
                        respect to referencing mapped
                        files and shared memory
                        regions.
nmmap()                 Maps an open file (or           libnuma     nmmap(3)
                        anonymous memory) onto the
                        address space for a process by
                        using a specified memory
                        allocation policy.
nshmget()               Returns or creates the ID for   libnuma     nshmget(3)
                        a shared memory region.
___________________________________________________________________________________________
CPUs and CPU Sets
___________________________________________________________________________________________
Function                Purpose                         Library     Reference Page
___________________________________________________________________________________________
cpu_foreach()           Enumerates the members of a     libc        cpu_foreach(3)
                        CPU set.
cpu_get_current()       Returns the identifier of the   libc        cpu_get_current(3)
                        current CPU on which the
                        calling process is running.
cpu_get_info()          Returns CPU information for     libc        cpu_get_info(3)
                        the system. **
cpu_get_max()           Returns the number of CPU       libc        cpu_get_info(3)
                        slots available in the
                        caller's partition. **
cpu_get_num()           Returns the number of           libc        cpu_get_info(3)
                        available CPUs.
cpu_get_rad()           Returns the RAD identifier for  libnuma     cpu_get_rad(3)
                        a CPU.
cpuaddset()             Adds a CPU to a CPU set.        libc        cpusetops(3)
cpuandset()             Performs a logical AND          libc        cpusetops(3)
                        operation on the contents of
                        two CPU sets, storing the
                        result in a third CPU set.
cpucopyset()            Copies the contents of one CPU  libc        cpusetops(3)
                        set to another CPU set.
cpucountset()           Returns the number of CPUs in   libc        cpusetops(3)
                        a CPU set.
cpudelset()             Deletes a CPU from a CPU set.   libnuma     cpusetops(3)
cpudiffset()            Finds the logical difference    libnuma     cpusetops(3)
                        between two CPU sets, storing
                        the result in a third CPU set.
cpuemptyset()           Initializes a CPU set such      libnuma     cpusetops(3)
                        that it includes no CPUs.
cpufillset()            Initializes a CPU set such      libnuma     cpusetops(3)
                        that it includes all CPUs.
cpuisemptyset()         Tests whether a CPU set is      libnuma     cpusetops(3)
                        empty.
cpuismember()           Tests whether a CPU is a        libnuma     cpusetops(3)
                        member of a particular CPU
                        set.
cpuorset()              Performs a logical OR           libnuma     cpusetops(3)
                        operation on the contents of
                        two CPU sets, storing the
                        result in a third CPU set.
cpusetcreate()          Allocates a CPU set and sets    libnuma     cpusetops(3)
                        it to empty.
cpusetdestroy()         Releases the memory allocated   libnuma     cpusetops(3)
                        to a CPU set.
cpuxorset()             Performs a logical XOR          libnuma     cpusetops(3)
                        operation on the contents of
                        two CPU sets, storing the
                        result in a third CPU set.
___________________________________________________________________________________________
** On a partitioned system, the system and the partition are equivalent.
In this case, the operating system returns information only for the
partition in which it is installed.
NUMA Scheduling Groups
___________________________________________________________________________________________
Function                Purpose                         Library     Reference Page
___________________________________________________________________________________________
nsg_attach_pid()        Attaches a process to a NUMA    libnuma     nsg_attach_pid(3)
                        scheduling group.
nsg_destroy()           Removes a NUMA scheduling       libnuma     nsg_destroy(3)
                        group and deallocates its
                        structures.
nsg_detach_pid()        Detaches a process from a NUMA  libnuma     nsg_attach_pid(3)
                        scheduling group.
pthread_nsg_attach()    Attaches a thread to a NUMA     libpthread  pthread_nsg_attach(3)
                        scheduling group.
pthread_nsg_detach()    Detaches a thread from a NUMA   libpthread  pthread_nsg_detach(3)
                        scheduling group.
nsg_get()               Returns the status of a NUMA    libnuma     nsg_get(3)
                        scheduling group.
nsg_get_nsgs()          Returns a list of NUMA          libnuma     nsg_get_nsgs(3)
                        scheduling groups that are
                        active.
nsg_get_pids()          Returns a list of processes     libnuma     nsg_get_pids(3)
                        attached to a NUMA scheduling
                        group.
nsg_init()              Looks up (and possibly          libnuma     nsg_init(3)
                        creates) a NUMA scheduling
                        group.
nsg_set()               Sets group ID, user ID, and     libnuma     nsg_set(3)
                        permissions for a NUMA
                        scheduling group.
pthread_nsg_get()       Returns a list of threads       libpthread  pthread_nsg_get(3)
                        attached to a NUMA scheduling
                        group.
___________________________________________________________________________________________
NUMA Enhancements to System Utilities and Daemons
A number of system commands display RAD-specific information or perform
RAD-specific operations. The following list briefly describes the NUMA
options supported by system utilities and daemons:
· The runon -r command executes an application on a specific RAD.
· The vmstat -r command displays virtual memory statistics for a
specific RAD.
· The netstat -R command displays network routing tables for each RAD.
· The ps -o RAD command includes RAD binding in the information
displayed about processes running on the system.
· The hwmgr -view hier command displays the RAD location of CPUs and
devices. In this case, in place of a RAD identifier, the command
identifies the construct in hardware that corresponds to a RAD. When
run on a GS80, GS160, or GS320 AlphaServer platform, the command shows
the hierarchy of CPUs and devices within QBBs. When run on an ES80 or
GS1280 AlphaServer platform, the command shows the hierarchy of CPUs
and devices within PIDs (processing unit IDs).
· The sched_stat -R command also displays the RAD location of system
CPUs. In addition, this command shows the relative distance (number of
hops) between CPUs.
· The -t and -u options on the nfsd command allow customization of the
number of TCP and UDP server threads, respectively, that are spawned
per RAD. This feature allows the NFS server to automatically scale the
number of TCP and UDP server threads according to the size of the
system.
· The -r option on the inetd command allows customization of the RAD
locations on which to start Internet server child daemons. By default,
one child daemon is started on each RAD.
· The route -R command of the kdbx kernel debugger displays network
route tables for all RADs.
SEE ALSO
NUMA Overview
The NUMA Overview is a web-only document that includes a complete NUMA
programming example. Starting with Tru64 UNIX Version 5.1, this web-only
document can be accessed through the version-specific web pages for Tru64
UNIX documentation. Links to documentation sets for different product
versions are available at the following URL:
http://www.Tru64UNIX.compaq.com/docs/pub_page/doc_list.html