numa_intro(3)

NAME

numa_intro - Introduction to NUMA support

DESCRIPTION

NUMA, or Non-Uniform Memory Access, refers to a hardware architectural feature in modern multiprocessor platforms that attempts to address the increasing disparity between requirements for processor speed and bandwidth and the bandwidth capabilities of memory systems, including the interconnect between processors and memory. NUMA systems address this problem by grouping resources--processors, I/O buses, and memory--into building blocks that balance an appropriate number of processors and I/O buses with a local memory system that delivers the necessary bandwidth. The local building blocks are combined into a larger system by means of a system-level interconnect with a platform-specific topology.

The local processor and I/O components on a particular building block can access their own "local" memory with the lowest possible latency for a particular system design. The local building block can in turn access the resources (processors, I/O, and memory) of remote building blocks at the cost of increased access latency and decreased global access bandwidth. The term "Non-Uniform Memory Access" refers to the difference in latency between "local" and "remote" memory accesses that can occur on a NUMA platform.

Overall system throughput and individual application performance are optimized on a NUMA platform by maximizing the ratio of local resource accesses to remote accesses. This is achieved by recognizing and preserving the "affinity" that processes have for the various resources on the system building blocks. For this reason, the building blocks are called "Resource Affinity Domains" or RADs. RADs are supported only on a class of platforms known as Cache Coherent NUMA, or CC NUMA, where all memory is accessible and cache coherent with respect to all processors and I/O buses.

The Tru64 UNIX operating system includes enhancements to optimize system throughput and application performance on CC NUMA platforms for legacy applications as well as those that use NUMA-aware APIs.
System enhancements to support NUMA are discussed in the following subsections. Along with system performance monitoring and tuning facilities, these enhancements allow the operating system to make a "best effort" to optimize the performance of any given collection of applications or application components on a CC NUMA platform.

NUMA Enhancements to Basic UNIX Algorithms and Default Behaviors

For NUMA, modifications to basic UNIX algorithms (scheduling, memory allocation, and so forth) and to default behaviors maximize local accesses transparently to applications. These modifications, which include the following, directly benefit legacy and non-NUMA-aware applications that were designed for uniprocessors or Uniform Memory Access Symmetric Multiprocessors but run on CC NUMA platforms:

· Topology-aware placement of data

  The operating system attempts to allocate memory for application (and kernel) data on the RAD closest to where the data will be accessed; or, for data that is globally accessed, the operating system may allocate memory across the available RADs. When there is insufficient free memory on optimal RADs, the memory allocations for data may "overflow" onto nearby RADs.

· Replication of read-only code and data

  The operating system will attempt to make a local copy of read-only text, such as shared library and program code. Kernel code and kernel read-only data are replicated on all RADs at boot time. If insufficient free local memory is available, the operating system may choose to utilize a remote copy rather than wait for free local memory.

· Memory affinity-aware scheduling

  The operating system scheduler takes "cache affinity" into account when choosing a processor to run a process thread on multiprocessor platforms. Cache affinity assumes that a process thread builds a "memory footprint" in a particular processor's cache. On CC NUMA platforms, the scheduler also takes into account the fact that processes will have memory allocated on particular RADs, and will attempt to keep processes running on processors that are in the same RAD as their memory footprints.

· Load balancing

  To minimize the requirement for remote memory allocation (overflow), the scheduler will take into account memory availability on a RAD as well as the processor load average for the RAD. Although these two factors may at times conflict with one another, the scheduler will attempt to balance the load so that processes run where there are memory pages as well as processor cycles available. This balancing involves both the initial selection of a RAD at process creation and migration of processes or individual pages in response to changing loads as processes come and go or their resource requirements or access patterns change.

NUMA Enhancements to Application Programming Interfaces

Application programmers can use new or modified library routines to further increase local accesses on CC NUMA platforms. Using these APIs, programmers can write new applications or modify old ones to provide additional information to the operating system or to take explicit control over process, thread, or memory object placement, or some combination of these.

Following are tables that list the NUMA library routines that deal with RADs and RAD sets, processes and threads, memory management, CPUs and CPU sets, and NUMA Scheduling Groups. Routines are listed alphabetically in each table, and some routines are listed in more than one table. For information about NUMA types, structures, and symbolic values, see numa_types(4). For information about NUMA Scheduling Groups, see numa_scheduling_groups(4).
RADs and RAD Sets
_______________________________________________________________________________
Function                Library     Reference Page           Purpose
_______________________________________________________________________________
nloc()                  libnuma     nloc(3)                  Returns the RAD set that is a specified distance from a resource.
rad_attach_pid()        libnuma     rad_attach_pid(3)        Attaches a process to a RAD (assigns a home RAD but allows execution on other RADs).
rad_bind_pid()          libnuma     rad_attach_pid(3)        Binds a process to a RAD (assigns a home RAD and restricts execution to the home RAD).
rad_foreach()           libnuma     rad_foreach(3)           Scans a RAD set for members and returns the first member found.
rad_get_current_home()  libnuma     rad_get_current_home(3)  Returns the caller's home RAD.
rad_get_cpus()          libnuma     rad_get_num(3)           Returns the set of CPUs that are in a RAD.
rad_get_freemem()       libnuma     rad_get_num(3)           Returns a snapshot of the free memory pages that are in a RAD.
rad_get_info()          libnuma     rad_get_num(3)           Returns information about a RAD, including its state (online or offline) and the number of CPUs and memory pages it contains.
rad_get_max()           libnuma     rad_get_num(3)           Returns the number of RADs in the system. **
rad_get_num()           libnuma     rad_get_num(3)           Returns the number of RADs in the caller's partition. **
rad_get_physmem()       libnuma     rad_get_num(3)           Returns the number of memory pages assigned to a RAD.
rad_get_state()         libnuma     rad_get_num(3)           Reserved for future use. (Currently, RAD state is always set to RAD_ONLINE.)
radaddset()             libnuma     radsetops(3)             Adds a RAD to a RAD set.
radandset()             libnuma     radsetops(3)             Performs a logical AND operation on two RAD sets, storing the result in a RAD set.
radcopyset()            libnuma     radsetops(3)             Copies the contents of one RAD set to another RAD set.
radcountset()           libnuma     radsetops(3)             Returns the number of members of a RAD set.
raddelset()             libnuma     radsetops(3)             Removes a RAD from a RAD set.
raddiffset()            libnuma     radsetops(3)             Finds the logical difference between two RAD sets, storing the result in another RAD set.
rademptyset()           libnuma     radsetops(3)             Initializes a RAD set such that no RADs are included.
radfillset()            libnuma     radsetops(3)             Initializes a RAD set such that it includes all RADs.
radisemptyset()         libnuma     radsetops(3)             Tests whether a RAD set is empty.
radismember()           libnuma     radsetops(3)             Tests whether a RAD belongs to a given RAD set.
radorset()              libnuma     radsetops(3)             Performs a logical OR operation on two RAD sets, storing the result in another RAD set.
radsetcreate()          libnuma     radsetops(3)             Allocates a RAD set and sets it to empty.
radsetdestroy()         libnuma     radsetops(3)             Releases the memory allocated for a RAD set.
radxorset()             libnuma     radsetops(3)             Performs a logical XOR operation on two RAD sets, storing the result in another RAD set.
_______________________________________________________________________________

** On a partitioned system, the system and the partition are equivalent. In this case, the operating system returns information only for the partition in which it is installed.

Processes and Threads
_______________________________________________________________________________
Function                Library     Reference Page           Purpose
_______________________________________________________________________________
nfork()                 libnuma     nfork(3)                 Creates a child process that is an exact copy of its parent process. See also the table entry for rad_fork().
nmadvise()              libnuma     nmadvise(3)              Tells the system what behavior to expect from a process with respect to referencing mapped files and shared memory regions.
nsg_attach_pid()        libnuma     nsg_attach_pid(3)        Attaches a process to a NUMA scheduling group.
nsg_detach_pid()        libnuma     nsg_attach_pid(3)        Detaches a process from a NUMA scheduling group.
pthread_nsg_attach()    libpthread  pthread_nsg_attach(3)    Attaches a thread to a NUMA scheduling group.
pthread_nsg_detach()    libpthread  pthread_nsg_detach(3)    Detaches a thread from a NUMA scheduling group.
pthread_rad_attach()    libpthread  pthread_rad_attach(3)    Attaches a thread to a RAD set.
pthread_rad_bind()      libpthread  pthread_rad_attach(3)    Attaches a thread to a RAD set and restricts its execution to the home RAD.
pthread_rad_detach()    libpthread  pthread_rad_detach(3)    Detaches a thread from a RAD set.
rad_attach_pid()        libnuma     rad_attach_pid(3)        Attaches a process to a RAD (assigns a home RAD but allows execution on other RADs).
rad_bind_pid()          libnuma     rad_attach_pid(3)        Binds a process to a RAD (assigns a home RAD and restricts execution to the home RAD).
rad_fork()              libnuma     rad_fork(3)              Creates a child process on a RAD that optionally does not inherit the RAD assignment of its parent. See also the table entry for nfork().
_______________________________________________________________________________

Memory Management
_______________________________________________________________________________
Function                Library     Reference Page           Purpose
_______________________________________________________________________________
memalloc_attr()         libnuma     memalloc_attr(3)         Returns the memory allocation policy for a RAD set specified by its virtual address.
nacreate()              libc        amalloc(3)               Sets up an arena for memory allocation for use with the amalloc() function. An arena is used in multithreaded programs when there is a need for thread-specific heap memory allocation.
nmadvise()              libnuma     nmadvise(3)              Tells the system what behavior to expect from a process with respect to referencing mapped files and shared memory regions.
nmmap()                 libnuma     nmmap(3)                 Maps an open file (or anonymous memory) onto the address space for a process by using a specified memory allocation policy.
nshmget()               libnuma     nshmget(3)               Returns or creates the ID for a shared memory region.
_______________________________________________________________________________

CPUs and CPU Sets
_______________________________________________________________________________
Function                Library     Reference Page           Purpose
_______________________________________________________________________________
cpu_foreach()           libc        cpu_foreach(3)           Enumerates the members of a CPU set.
cpu_get_current()       libc        cpu_get_current(3)       Returns the identifier of the current CPU on which the calling process is running.
cpu_get_info()          libc        cpu_get_info(3)          Returns CPU information for the system. **
cpu_get_max()           libc        cpu_get_info(3)          Returns the number of CPU slots available in the caller's partition. **
cpu_get_num()           libc        cpu_get_info(3)          Returns the number of available CPUs.
cpu_get_rad()           libnuma     cpu_get_rad(3)           Returns the RAD identifier for a CPU.
cpuaddset()             libc        cpusetops(3)             Adds a CPU to a CPU set.
cpuandset()             libc        cpusetops(3)             Performs a logical AND operation on the contents of two CPU sets, storing the result in a third CPU set.
cpucopyset()            libc        cpusetops(3)             Copies the contents of one CPU set to another CPU set.
cpucountset()           libc        cpusetops(3)             Returns the number of CPUs in a CPU set.
cpudelset()             libnuma     cpusetops(3)             Deletes a CPU from a CPU set.
cpudiffset()            libnuma     cpusetops(3)             Finds the logical difference between two CPU sets, storing the result in a third CPU set.
cpuemptyset()           libnuma     cpusetops(3)             Initializes a CPU set such that it includes no CPUs.
cpufillset()            libnuma     cpusetops(3)             Initializes a CPU set such that it includes all CPUs.
cpuisemptyset()         libnuma     cpusetops(3)             Tests whether a CPU set is empty.
cpuismember()           libnuma     cpusetops(3)             Tests whether a CPU is a member of a particular CPU set.
cpuorset()              libnuma     cpusetops(3)             Performs a logical OR operation on the contents of two CPU sets, storing the result in a third CPU set.
cpusetcreate()          libnuma     cpusetops(3)             Allocates a CPU set and sets it to empty.
cpusetdestroy()         libnuma     cpusetops(3)             Releases the memory allocated to a CPU set.
cpuxorset()             libnuma     cpusetops(3)             Performs a logical XOR operation on the contents of two CPU sets, storing the result in a third CPU set.
_______________________________________________________________________________

** On a partitioned system, the system and the partition are equivalent. In this case, the operating system returns information only for the partition in which it is installed.

NUMA Scheduling Groups
_______________________________________________________________________________
Function                Library     Reference Page           Purpose
_______________________________________________________________________________
nsg_attach_pid()        libnuma     nsg_attach_pid(3)        Attaches a process to a NUMA scheduling group.
nsg_destroy()           libnuma     nsg_destroy(3)           Removes a NUMA scheduling group and deallocates its structures.
nsg_detach_pid()        libnuma     nsg_attach_pid(3)        Detaches a process from a NUMA scheduling group.
pthread_nsg_attach()    libpthread  pthread_nsg_attach(3)    Attaches a thread to a NUMA scheduling group.
pthread_nsg_detach()    libpthread  pthread_nsg_detach(3)    Detaches a thread from a NUMA scheduling group.
nsg_get()               libnuma     nsg_get(3)               Returns the status of a NUMA scheduling group.
nsg_get_nsgs()          libnuma     nsg_get_nsgs(3)          Returns a list of NUMA scheduling groups that are active.
nsg_get_pids()          libnuma     nsg_get_pids(3)          Returns a list of processes attached to a NUMA scheduling group.
nsg_init()              libnuma     nsg_init(3)              Looks up (and possibly creates) a NUMA scheduling group.
nsg_set()               libnuma     nsg_set(3)               Sets group ID, user ID, and permissions for a NUMA scheduling group.
pthread_nsg_get()       libpthread  pthread_nsg_get(3)       Returns a list of threads attached to a NUMA scheduling group.
_______________________________________________________________________________

NUMA Enhancements to System Utilities and Daemons

A number of system commands display RAD-specific information or perform RAD-specific operations.
The following list briefly describes the NUMA options supported by system utilities and daemons:

· The runon -r command executes an application on a specific RAD.

· The vmstat -r command displays virtual memory statistics for a specific RAD.

· The netstat -R command displays network routing tables for each RAD.

· The ps -o RAD command includes RAD binding in the information displayed about processes running on the system.

· The hwmgr -view hier command displays the RAD location of CPUs and devices. In this case, in place of a RAD identifier, the command identifies the construct in hardware that corresponds to a RAD. When run on a GS80, GS160, or GS320 AlphaServer platform, the command shows the hierarchy of CPUs and devices within QBBs. When run on an ES80 or GS1280 AlphaServer platform, the command shows the hierarchy of CPUs and devices within PIDs (processing unit IDs).

· The sched_stat -R command also displays the RAD location of system CPUs. In addition, this command shows the relative distance (number of hops) between CPUs.

· The -t and -u options on the nfsd command allow customization of the number of TCP and UDP server threads, respectively, that are spawned per RAD. This feature allows the NFS server to automatically scale the number of TCP and UDP server threads according to the size of the system.

· The -r option on the inetd command allows customization of the RAD locations on which to start Internet server child daemons. By default, one child daemon is started on each RAD.

· The route -R command of the kdbx kernel debugger displays network route tables for all RADs.

SEE ALSO

NUMA Overview

The NUMA Overview is a web-only document that includes a complete NUMA programming example. Starting with Tru64 UNIX Version 5.1, this web-only document can be accessed through the version-specific web pages for Tru64 UNIX documentation. Links to documentation sets for different product versions are available at the following URL: http://www.Tru64UNIX.compaq.com/docs/pub_page/doc_list.html