numa_intro(3)
NAME
numa_intro - Introduction to NUMA support
DESCRIPTION
NUMA, or Non-Uniform Memory Access, refers to a hardware architectural
feature of modern multiprocessor platforms that addresses the growing
disparity between the speed and bandwidth requirements of processors and
the bandwidth that memory systems, including the interconnect between
processors and memory, can deliver. NUMA systems address this problem by
grouping resources (processors, I/O busses, and memory) into building
blocks that balance an appropriate number of processors and I/O busses
with a local memory system that delivers the necessary bandwidth. These
building blocks are combined into a larger system by means of a
system-level interconnect with a platform-specific topology.
The processor and I/O components on a particular building block can
access their own "local" memory with the lowest latency possible for a
given system design. A building block can in turn access the resources
(processors, I/O, and memory) of remote building blocks, at the cost of
increased access latency and decreased global access bandwidth. The term
"Non-Uniform Memory Access" refers to this difference in latency between
"local" and "remote" memory accesses on a NUMA platform.
Overall system throughput and individual application performance are
optimized on a NUMA platform by maximizing the ratio of local resource
accesses to remote accesses. This is achieved by recognizing and preserving
the "affinity" that processes have for the various resources on the system
building blocks. For this reason, the building blocks are called "Resource
Affinity Domains" or RADs.
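The benefit of a high local-access ratio can be seen with a simple
effective-latency model. The following sketch is illustrative only; the
latency figures are hypothetical placeholders, since actual local and
remote access latencies are platform-specific:

     #include <stdio.h>

     /*
      * Illustrative model only: the latencies below are hypothetical
      * placeholders, not measurements from any particular platform.
      */
     #define LOCAL_NS   100.0  /* assumed local-memory access latency  */
     #define REMOTE_NS  400.0  /* assumed remote-memory access latency */

     int
     main(void)
     {
         int pct;

         /* Effective latency as the local-access percentage falls. */
         for (pct = 100; pct >= 50; pct -= 10) {
             double f = pct / 100.0;
             double effective = f * LOCAL_NS + (1.0 - f) * REMOTE_NS;
             printf("local accesses: %3d%%  effective latency: %5.1f ns\n",
                    pct, effective);
         }
         return 0;
     }

Even a modest shift of accesses from local to remote memory raises the
effective latency sharply, which is why the system works to recognize and
preserve affinity.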
RADs are supported only on a class of platforms known as Cache Coherent
NUMA, or CC NUMA, where all memory is accessible and cache coherent with
respect to all processors and I/O busses. The Tru64 UNIX operating system
includes enhancements to optimize system throughput and application
performance on CC NUMA platforms for legacy applications as well as those
that use NUMA-aware APIs. System enhancements to support NUMA are discussed
in the following subsections. Along with system performance monitoring and
tuning facilities, these enhancements allow the operating system to make a
"best effort" to optimize the performance of any given collection of
applications or application components on a CC NUMA platform.
NUMA Enhancements to Basic UNIX Algorithms and Default Behaviors
For NUMA, modifications to basic UNIX algorithms (scheduling, memory
allocation, and so forth) and to default behaviors maximize local accesses
transparently to applications. These modifications, which include the
following, directly benefit legacy and non-NUMA-aware applications that
were designed for uniprocessors or Uniform Memory Access Symmetric
Multiprocessors but run on CC NUMA platforms:
· Topology-aware placement of data
The operating system attempts to allocate memory for application (and
kernel) data on the RAD closest to where the data will be accessed;
or, for data that is globally accessed, the operating system may
allocate memory across the available RADs. When there is insufficient
free memory on optimal RADs, the memory allocations for data may
"overflow" onto nearby RADs.
· Replication of read-only code and data
The operating system will attempt to make a local copy of read-only
data, such as shared program and library code. Kernel code and kernel
read-only data are replicated on all RADs at boot time. If
insufficient free local memory is available, the operating system may
choose to utilize a remote copy rather than wait for free local
memory.
· Memory affinity-aware scheduling
On multiprocessor platforms, the operating system scheduler takes "cache
affinity" into account when choosing a processor on which to run a
process thread. Cache affinity assumes that a process thread builds a
"memory footprint" in a particular processor's cache. On CC NUMA
platforms, the scheduler also takes into account the fact that
processes will have memory allocated on particular RADs, and will
attempt to keep processes running on processors that are in the same
RAD as their memory footprints.
· Load balancing
To minimize the requirement for remote memory allocation (overflow),
the scheduler will take into account memory availability on a RAD as
well as the processor load average for the RAD. Although these two
factors may at times conflict with one another, the scheduler will
attempt to balance the load so that processes run where there are
memory pages as well as processor cycles available. This balancing
involves both the initial selection of a RAD at process creation and the
migration of processes or individual pages in response to changing loads
as processes come and go or as their resource requirements or access
patterns change, as illustrated by the sketch following this list.
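To make the balancing tradeoff concrete, the following sketch scores each
RAD on both free memory and processor load and picks the best candidate at
process creation. It is a simplified model written for this discussion; the
rad_info structure, the scoring function, and the equal weighting are all
invented for illustration and do not represent the actual Tru64 UNIX
scheduler implementation:

     #include <stdio.h>

     /* Hypothetical per-RAD statistics; not a real Tru64 UNIX structure. */
     struct rad_info {
         long   free_pages;   /* free memory pages on this RAD        */
         long   total_pages;  /* total memory pages on this RAD       */
         double load_avg;     /* processor load average for this RAD  */
     };

     /*
      * Score a RAD by combining free-memory headroom with processor
      * headroom.  Higher is better.  The equal weighting is an arbitrary
      * choice for this illustration; a real scheduler would tune it.
      */
     static double
     rad_score(const struct rad_info *rad)
     {
         double mem = (double)rad->free_pages / (double)rad->total_pages;
         double cpu = 1.0 / (1.0 + rad->load_avg);
         return 0.5 * mem + 0.5 * cpu;
     }

     /* Choose the RAD with both memory pages and processor cycles free. */
     static int
     choose_rad(const struct rad_info *rads, int nrads)
     {
         int i, best = 0;

         for (i = 1; i < nrads; i++)
             if (rad_score(&rads[i]) > rad_score(&rads[best]))
                 best = i;
         return best;
     }

     int
     main(void)
     {
         /* Two sample RADs: one memory-rich but busy, one lightly loaded. */
         struct rad_info rads[] = {
             { 60000, 65536, 3.5 },
             { 20000, 65536, 0.5 },
         };

         printf("chosen RAD: %d\n", choose_rad(rads, 2));
         return 0;
     }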
NUMA Enhancements to Application Programming Interfaces
Application programmers can use new or modified library routines to further
increase local accesses on CC NUMA platforms. Using these APIs, programmers
can write new applications or modify existing ones to provide additional
information to the operating system or to take explicit control over the
placement of processes, threads, memory objects, or some combination of
these. NUMA-aware routines are included in the following libraries:
· The Standard C Library (libc)
· The POSIX Threads Library (libpthread)
· The NUMA Library (libnuma)
The reference pages that document NUMA-aware APIs note their library
location.
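As a brief example, the following sketch queries the number of RADs with
rad_get_num() and attaches the calling process to RAD 0 through a radset
built with the radset manipulation routines. The routine names are taken
from the Tru64 UNIX NUMA interfaces, but the header name, exact
prototypes, and supported flag values shown here are assumptions that
should be confirmed against the individual reference pages:

     #include <stdio.h>
     #include <stdlib.h>
     #include <unistd.h>
     #include <numa.h>   /* assumed header for the NUMA library routines */

     int
     main(void)
     {
         radset_t radset;

         /* Report how many RADs this platform contains. */
         printf("system has %d RAD(s)\n", rad_get_num());

         /* Build a radset that contains only RAD 0. */
         if (radsetcreate(&radset) != 0) {
             perror("radsetcreate");
             exit(1);
         }
         rademptyset(radset);
         radaddset(radset, 0);

         /*
          * Attach this process to RAD 0 so that its threads run there
          * and its memory is preferentially allocated there.  A flags
          * value of 0 accepts default behavior; consult the
          * rad_attach_pid() reference page for the supported flags.
          */
         if (rad_attach_pid(getpid(), radset, 0) != 0) {
             perror("rad_attach_pid");
             exit(1);
         }

         radsetdestroy(&radset);
         return 0;
     }

Equivalent thread-level control is available through the NUMA-aware
routines in libpthread, and memory-object placement can be influenced
through the memory routines in libc and libnuma.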
SEE ALSO
Files: numa_types(4)