The Windows Processes and Threads 4






Multiple Processors


Computers with multiple processors are typically designed for one of two architectures:


  1. Non-uniform memory access (NUMA) or
  2. Symmetric multiprocessing (SMP)


In a NUMA computer, each processor is closer to some parts of memory than to others, so access to nearby memory is faster than access to distant memory. Under the NUMA model, the system attempts to schedule threads on processors that are close to the memory being used.

In an SMP computer, two or more identical processors or cores connect to a single shared main memory. Under the SMP model, any thread can be assigned to any processor. Therefore, scheduling threads on an SMP computer is similar to scheduling threads on a computer with a single processor, except that the scheduler has a pool of processors and can run several threads concurrently. Scheduling is still determined by thread priority, but it can be influenced by setting thread affinity and the thread ideal processor, as discussed in the following sections.


Thread Affinity


Thread affinity forces a thread to run on a specific subset of processors.


Figure: Windows thread affinity as seen in Task Manager


Setting thread affinity should generally be avoided, because it can interfere with the scheduler's ability to schedule threads effectively across processors, which can reduce the performance gains of parallel processing. An appropriate use of thread affinity is testing each processor.

The system represents affinity with a bitmask called a processor affinity mask. The affinity mask has as many bits as the maximum number of processors in the system, with bits set to identify a subset of processors. Initially, the system determines the subset of processors in the mask.

You can obtain the current thread affinity for all threads of the process by calling the GetProcessAffinityMask() function. Use the SetProcessAffinityMask() function to specify thread affinity for all threads of the process. To set the thread affinity for a single thread, use the SetThreadAffinityMask() function. The thread affinity must be a subset of the process affinity.

On systems with more than 64 processors, the affinity mask initially represents processors in a single processor group. However, thread affinity can be set to a processor in a different group, which alters the affinity mask for the process.
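The subset rule for affinity masks can be illustrated with a short sketch. The helper names mask_is_subset(), lowest_processor() and pin_current_thread_to_first_cpu() are hypothetical and chosen only for this example. The Win32 calls are wrapped in #ifdef _WIN32 so that the pure mask helpers can also be compiled on other platforms.

```c
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Nonzero when thread_mask is non-empty and every processor in it
   is also present in process_mask. */
static int mask_is_subset(uint64_t thread_mask, uint64_t process_mask)
{
    return thread_mask != 0 && (thread_mask & ~process_mask) == 0;
}

/* Returns a mask holding only the lowest-numbered processor in mask
   (0 if mask is empty). */
static uint64_t lowest_processor(uint64_t mask)
{
    return mask & (~mask + 1);
}

#ifdef _WIN32
/* Restricts the calling thread to the first processor that the process
   is allowed to use. Returns 1 on success, 0 on failure. */
static int pin_current_thread_to_first_cpu(void)
{
    DWORD_PTR process_mask = 0, system_mask = 0;
    DWORD_PTR thread_mask;

    if (!GetProcessAffinityMask(GetCurrentProcess(),
                                &process_mask, &system_mask))
        return 0;

    thread_mask = (DWORD_PTR)lowest_processor(process_mask);
    if (!mask_is_subset(thread_mask, process_mask))
        return 0;

    /* SetThreadAffinityMask() returns the previous mask, or 0 on failure. */
    return SetThreadAffinityMask(GetCurrentThread(), thread_mask) != 0;
}
#endif
```

Checking the mask before the call is only a convenience in this sketch. As noted above, pinning a thread in this way is mainly appropriate for testing individual processors.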


Thread Ideal Processor


When you specify a thread ideal processor, the scheduler runs the thread on the specified processor when possible. Use the SetThreadIdealProcessor() function to specify a preferred processor for a thread. This does not guarantee that the ideal processor will be chosen but provides a useful hint to the scheduler. On systems with more than 64 processors, you can use the SetThreadIdealProcessorEx() function to specify a preferred processor in a specific processor group.
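As a rough illustration, the following sketch spreads worker threads over the processors available to the process by giving each worker an ideal-processor hint instead of a hard affinity. The helpers count_processors(), nth_processor() and hint_ideal_processor() are hypothetical names, and the round-robin policy is an assumption made for this example rather than a recommended strategy. The Win32 part is guarded by #ifdef _WIN32.

```c
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Counts the processors (set bits) in an affinity mask. */
static unsigned count_processors(uint64_t mask)
{
    unsigned count = 0;
    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
}

/* Returns the processor number of the n-th set bit in mask
   (n starts at 0), or -1 if the mask has too few processors. */
static int nth_processor(uint64_t mask, unsigned n)
{
    int cpu;
    for (cpu = 0; cpu < 64; ++cpu) {
        if (mask & ((uint64_t)1 << cpu)) {
            if (n == 0)
                return cpu;
            --n;
        }
    }
    return -1;
}

#ifdef _WIN32
/* Gives the thread a preferred processor chosen round-robin from the
   processors the process may use. Returns 1 on success, 0 on failure. */
static int hint_ideal_processor(HANDLE thread, unsigned worker_index)
{
    DWORD_PTR process_mask = 0, system_mask = 0;
    unsigned count;
    int cpu;

    if (!GetProcessAffinityMask(GetCurrentProcess(),
                                &process_mask, &system_mask))
        return 0;

    count = count_processors(process_mask);
    if (count == 0)
        return 0;

    cpu = nth_processor(process_mask, worker_index % count);

    /* SetThreadIdealProcessor() returns the previous ideal processor,
       or (DWORD)-1 on failure. */
    return SetThreadIdealProcessor(thread, (DWORD)cpu) != (DWORD)-1;
}
#endif
```

Because the ideal processor is only a hint, the scheduler remains free to run the thread elsewhere when the preferred processor is busy.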


NUMA Support


The traditional model for multiprocessor support is symmetric multiprocessing (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance. System designers use non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.

In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.

The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. The system also provides an API that makes its topology available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.

First of all, you will need to determine the layout of nodes in the system. To retrieve the highest numbered node in the system, use the GetNumaHighestNodeNumber() function. Note that this number is not guaranteed to equal the total number of nodes in the system. Also, nodes with sequential numbers are not guaranteed to be close together. To retrieve the list of processors on the system, use the GetProcessAffinityMask() function. You can determine the node for each processor in the list by using the GetNumaProcessorNode() function. Alternatively, to retrieve a list of all processors in a node, use the GetNumaNodeProcessorMask() function.
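The discovery steps above can be sketched as follows. The helper group_by_node() and the function print_numa_layout() are hypothetical names for this example, and the constants MAX_CPUS and NO_NODE are assumptions that fit the non-extended functions (at most 64 processors, UCHAR node numbers). The Win32 part is guarded by #ifdef _WIN32.

```c
#include <stdint.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

#define MAX_CPUS 64     /* the non-extended functions cover one group */
#define NO_NODE  0xFF   /* marks a processor that is absent */

/* node_of_cpu[i] is the node of processor i, or NO_NODE.
   Fills node_masks[0..node_count-1] with one processor mask per node
   and returns the number of processors that were placed. */
static int group_by_node(const unsigned char node_of_cpu[MAX_CPUS],
                         uint64_t *node_masks, unsigned node_count)
{
    int placed = 0;
    unsigned n;
    int cpu;

    for (n = 0; n < node_count; ++n)
        node_masks[n] = 0;
    for (cpu = 0; cpu < MAX_CPUS; ++cpu) {
        unsigned char node = node_of_cpu[cpu];
        if (node == NO_NODE || node >= node_count)
            continue;
        node_masks[node] |= (uint64_t)1 << cpu;
        ++placed;
    }
    return placed;
}

#ifdef _WIN32
/* Prints one line per node with the mask of processors in that node. */
static void print_numa_layout(void)
{
    ULONG highest = 0, n;
    DWORD_PTR process_mask = 0, system_mask = 0;
    unsigned char node_of_cpu[MAX_CPUS];
    uint64_t node_masks[256];
    int cpu;

    if (!GetNumaHighestNodeNumber(&highest) || highest > 255)
        return;
    if (!GetProcessAffinityMask(GetCurrentProcess(),
                                &process_mask, &system_mask))
        return;

    for (cpu = 0; cpu < MAX_CPUS; ++cpu) {
        UCHAR node = NO_NODE;
        int present = (int)(((uint64_t)system_mask >> cpu) & 1);
        if (!present || !GetNumaProcessorNode((UCHAR)cpu, &node))
            node = NO_NODE;
        node_of_cpu[cpu] = node;
    }

    group_by_node(node_of_cpu, node_masks, (unsigned)highest + 1);
    for (n = 0; n <= highest; ++n)
        printf("node %lu: processors 0x%llx\n",
               (unsigned long)n, (unsigned long long)node_masks[n]);
}
#endif
```

The same per-node masks could be obtained more directly by calling GetNumaNodeProcessorMask() once per node; the per-processor loop is used here only to show the GetNumaProcessorNode() route.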

After you have determined which processors belong to which nodes, you can optimize your application's performance. To ensure that all threads for your process run on the same node, use the SetProcessAffinityMask() function with a process affinity mask that specifies processors in the same node. This increases the efficiency of applications whose threads need to access the same memory. Alternatively, to limit the number of threads on each node, use the SetThreadAffinityMask() function.

Memory-intensive applications will need to optimize their memory usage. To retrieve the amount of free memory available to a node, use the GetNumaAvailableMemoryNode() function. The VirtualAllocExNuma() function enables the application to specify a preferred node for the memory allocation. VirtualAllocExNuma() does not allocate any physical pages, so it will succeed whether or not the pages are available on that node or elsewhere in the system. The physical pages are allocated on demand. If the preferred node runs out of pages, the memory manager will use pages from other nodes. If the memory is paged out, the same procedure is used when it is brought back in.
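A minimal sketch of a node-aware allocation is shown below. The helper pick_node() and the function alloc_on_roomiest_node() are hypothetical names, and the policy of choosing the node with the most free memory is an assumption made for illustration only. The Win32 part is guarded by #ifdef _WIN32 and assumes a target of Windows Vista or later, where VirtualAllocExNuma() is available.

```c
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600   /* VirtualAllocExNuma() needs Vista or later */
#endif
#include <windows.h>
#endif
#include <stdint.h>
#include <string.h>

/* Returns the index of the node with the most free memory among those
   with at least `needed` bytes free, or -1 if no node qualifies. */
static int pick_node(const uint64_t *free_bytes, unsigned node_count,
                     uint64_t needed)
{
    int best = -1;
    unsigned n;
    for (n = 0; n < node_count; ++n) {
        if (free_bytes[n] >= needed &&
            (best < 0 || free_bytes[n] > free_bytes[best]))
            best = (int)n;
    }
    return best;
}

#ifdef _WIN32
/* Allocates `size` bytes with the chosen node as the preferred node.
   Returns NULL on failure. */
static void *alloc_on_roomiest_node(SIZE_T size)
{
    ULONG highest = 0, n;
    uint64_t free_bytes[256] = {0};
    int node;
    void *p;

    if (!GetNumaHighestNodeNumber(&highest) || highest > 255)
        return NULL;
    for (n = 0; n <= highest; ++n) {
        ULONGLONG avail = 0;
        if (GetNumaAvailableMemoryNode((UCHAR)n, &avail))
            free_bytes[n] = avail;
    }

    node = pick_node(free_bytes, (unsigned)highest + 1, size);
    if (node < 0)
        return NULL;

    p = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                           MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                           (DWORD)node);
    if (p != NULL)
        memset(p, 0, size);   /* touching the pages faults them in */
    return p;
}
#endif
```

Memory obtained this way would be released with VirtualFree(p, 0, MEM_RELEASE). Touching the pages from a thread that runs on the preferred node gives the memory manager the best chance of backing them with pages from that node.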


NUMA Support on Systems With More Than 64 Logical Processors


On systems with more than 64 logical processors, nodes are assigned to processor groups according to the capacity of the nodes. The capacity of a node is the number of processors that are present when the system starts, together with any additional logical processors that can be added while the system is running. (Windows Server 2008, Windows Vista, Windows Server 2003, and Windows XP/2000: processor groups are not supported.)

Each node must be fully contained within a group. If the capacities of the nodes are relatively small, the system assigns more than one node to the same group, choosing nodes that are physically close to one another for better performance. If a node's capacity exceeds the maximum number of processors in a group, the system splits the node into multiple smaller nodes, each small enough to fit in a group.

An ideal NUMA node for a new process can be requested using the PROC_THREAD_ATTRIBUTE_PREFERRED_NODE extended attribute when the process is created. Like a thread ideal processor, the ideal node is a hint to the scheduler, which assigns the new process to the group that contains the requested node if possible.

The extended NUMA functions GetNumaAvailableMemoryNodeEx(), GetNumaNodeProcessorMaskEx(), GetNumaProcessorNodeEx(), and GetNumaProximityNodeEx() differ from their unextended counterparts in that the node number is a USHORT value rather than a UCHAR, to accommodate the potentially greater number of nodes on a system with more than 64 logical processors. Also, the processor specified with or retrieved by the extended functions includes the processor group; the processor specified with or retrieved by the unextended functions is group-relative. For details, see the individual function reference topics.

A group-aware application can assign all of its threads to a particular node in a similar fashion to that described earlier in this topic, using the corresponding extended NUMA functions. The application uses GetLogicalProcessorInformationEx() to get the list of all processors on the system. Note that the application cannot set the process affinity mask unless the process is assigned to a single group and the intended node is located in that group. Usually the application must call SetThreadGroupAffinity() to limit its threads to the intended node.
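The group-aware approach can be sketched roughly as follows. The type struct proc_id and the functions expand_group_mask() and bind_current_thread_to_node() are hypothetical names used only in this example; struct proc_id simply pairs a group number with a group-relative processor number. The Win32 part is guarded by #ifdef _WIN32 and assumes a target of Windows 7 or Windows Server 2008 R2 or later, where processor groups are supported.

```c
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0601   /* processor groups need Windows 7 or later */
#endif
#include <windows.h>
#endif
#include <stdint.h>
#include <stdio.h>

/* Illustrative type: a processor named by group and group-relative number. */
struct proc_id {
    uint16_t group;
    uint8_t  number;
};

/* Expands a group-relative mask into (group, number) pairs.
   Writes at most `capacity` entries and returns how many were written. */
static unsigned expand_group_mask(uint16_t group, uint64_t mask,
                                  struct proc_id *out, unsigned capacity)
{
    unsigned count = 0, bit;
    for (bit = 0; bit < 64 && count < capacity; ++bit) {
        if (mask & ((uint64_t)1 << bit)) {
            out[count].group = group;
            out[count].number = (uint8_t)bit;
            ++count;
        }
    }
    return count;
}

#ifdef _WIN32
/* Limits the calling thread to the processors of the given node.
   Returns 1 on success, 0 on failure. */
static int bind_current_thread_to_node(USHORT node)
{
    GROUP_AFFINITY node_affinity, wanted;
    struct proc_id ids[64];
    unsigned count, i;

    if (!GetNumaNodeProcessorMaskEx(node, &node_affinity) ||
        node_affinity.Mask == 0)
        return 0;

    /* Copy into a zeroed structure so the reserved fields are zero. */
    ZeroMemory(&wanted, sizeof(wanted));
    wanted.Mask = node_affinity.Mask;
    wanted.Group = node_affinity.Group;
    if (!SetThreadGroupAffinity(GetCurrentThread(), &wanted, NULL))
        return 0;

    count = expand_group_mask(wanted.Group, (uint64_t)wanted.Mask, ids, 64);
    for (i = 0; i < count; ++i)
        printf("group %u, processor %u\n",
               (unsigned)ids[i].group, (unsigned)ids[i].number);
    return 1;
}
#endif
```

The printed list is there only to make the group-relative numbering visible; an application would normally call SetThreadGroupAffinity() for each of its threads and skip the output.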




The following table describes the NUMA APIs.

Function | Description
AllocateUserPhysicalPagesNuma() | Allocates physical memory pages to be mapped and unmapped within any Address Windowing Extensions (AWE) region of a specified process and specifies the NUMA node for the physical memory.
CreateFileMappingNuma() | Creates or opens a named or unnamed file mapping object for a specified file, and specifies the NUMA node for the physical memory.
GetLogicalProcessorInformation() | Retrieves information about logical processors and related hardware.
GetLogicalProcessorInformationEx() | Retrieves information about the relationships of logical processors and related hardware.
GetNumaAvailableMemoryNode() | Retrieves the amount of memory available in the specified node.
GetNumaAvailableMemoryNodeEx() | Retrieves the amount of memory available in a node specified as a USHORT value.
GetNumaHighestNodeNumber() | Retrieves the node that currently has the highest number.
GetNumaNodeProcessorMask() | Retrieves the processor mask for the specified node.
GetNumaNodeProcessorMaskEx() | Retrieves the processor mask for a node specified as a USHORT value.
GetNumaProcessorNode() | Retrieves the node number for the specified processor.
GetNumaProcessorNodeEx() | Retrieves the node number as a USHORT value for the specified processor.
GetNumaProximityNode() | Retrieves the node number for the specified proximity identifier.
GetNumaProximityNodeEx() | Retrieves the node number as a USHORT value for the specified proximity identifier.
MapViewOfFileExNuma() | Maps a view of a file mapping into the address space of a calling process, and specifies the NUMA node for the physical memory.
VirtualAllocExNuma() | Reserves or commits a region of memory within the virtual address space of the specified process, and specifies the NUMA node for the physical memory.



