Improving the scalability of neutron cross-section
lookup codes on multicore NUMA systems
Kazutomo Yoshii, John Tramm, Andrew Siegel, Pete Beckman
Argonne National Laboratory
Mathematics and Computer Science Division
9700 South Cass Avenue, Argonne, IL
arXiv:1909.03632v1 [cs.DC] 9 Sep 2019
Abstract—We use the XSBench proxy application, a memory-
intensive OpenMP program, to explore the source of on-node
scalability degradation of a popular Monte Carlo (MC) reactor
physics benchmark on non-uniform memory access (NUMA)
systems. As background, we present the details of XSBench, a
performance abstraction “proxy app” for the full MC simulation,
as well as the internal design of the Linux kernel. We explain
how the physical memory allocation inside the kernel affects
the multicore scalability of XSBench. On a sixteen-core, two-
socket NUMA testbed, the scaling efficiency is improved from
a nonoptimized 70% to an optimized 95%, and the optimized
version consumes 25% less energy than does the nonoptimized
version. In addition to the NUMA optimization we evaluate a
page-size optimization to XSBench and observe a 1.5x perfor-
mance improvement, compared with a nonoptimized one.
I. INTRODUCTION
Improved processor performance on the path to exascale
computing [1], [2] is widely expected to involve increasing
degrees of on-node concurrency. This hardware trajectory in
turn has deep implications for the adoption of mathematical
models, specific choice of discretization, and implementation
on next-generation compute resources.
An excellent example involves the choice between determin-
istic and Monte Carlo (MC) approaches to particle transport.
Since MC methods fundamentally are embarrassingly parallel
over particle tracks, for many classes of MC problems the
expected performance improvements are likely to enable cal-
culations that were once impractical because of unrealistically
long integration times. This situation is true especially for
classic reactor physics calculations, where robust computations
of detailed neutron distributions in a realistic reactor core are
largely still beyond the reach of current compute capabilities.
As we begin to take advantage of new multicore architec-
tures, however, the picture is starting to change. Indeed,
contrary to initial expectations, new challenges are arising.
For example, although MC methods are naturally parallel, we
have already observed scalability limits [3] even on today’s
multicore processors, particularly with non-uniform memory
access (NUMA) systems. Therefore, identifying the specific
causes of these scalability bottlenecks is urgent.
Typically, researchers have relied on processors’ perfor-
mance counters to identify bottlenecks in high-performance
computing (HPC) applications. The counter number itself,
however, is no longer sufficient for analyzing scalability prob-
lems in a modern compute node because of operating system
(OS) abstractions. The OS abstraction is an important concept
for portability, but it creates a gap between the hardware and
the application. With multicore scaling, how OS subsystems
such as the task scheduler and memory management manage
hardware resources becomes key. The performance counters
cannot tell us the OS behavior directly. The OS scheduler may
load-balance processes automatically in order to maximize
hardware utilization unless explicitly specified, thus affecting
cache locality. The OS memory management basically hides
physical memory allocation from applications, but it affects
the performance if memory access latency is nonuniform.
Additionally, a runtime system such as OpenMP [4] does
dynamic balancing, with almost no coordination between the
OS and runtime. Thus, cross-cutting analysis is becoming vital
in order to understand the interaction between each layer and
to identify the real cause of the scalability problems of MC
methods on multicore.
This paper focuses on popular modern multicore NUMA
processors such as Intel Xeon SandyBridge processors [5].
More than 80% of the processors in the TOP500 list are Intel
Xeon processors, which are multicore processors with
simultaneous multithreading. Xeon systems are usually
multisocket with NUMA, an architectural design that mitigates
the memory bottleneck by attaching a separate memory controller
to a core or a set of cores. Bigger systems may have more than
100 cores; 15-core Intel Ivy Bridge processors can scale up to
8 sockets (120 cores in total). With NUMA, memory access
latency is not uniform. Thus, exploiting memory locality is an
important factor in achieving good performance. This factor,
however, poses a big challenge to both the system software
and applications.
In the present analysis we focus on the Linux kernel, the de
facto standard OS kernel in high-performance computing. The
Linux environment offers numerous benefits to users. One ben-
efit is that myriad tools and libraries are available in the Linux
environment, such as the GNU debugger, strace, valgrind, and
LLVM. These software tools not only are powerful but also
are continuously debugged and updated because Linux or open
source communities attract numerous users and developers.
According to the TOP500 [6] statistics in June 2014, 485 of the
500 entries (97%) are categorized as belonging to the Linux
family.
Technically, a few of those listed are not true Linux kernels.
IBM’s Blue Gene architecture, for example, runs CNK [7], a
lightweight OS kernel optimized for HPC applications and written
from scratch by IBM. CNK supports the Linux application binary
interface and a subset of the Linux system calls. It is designed
to deliver the real hardware performance to applications by
statically mapping memory. Because of its deliberately simple
memory management and task scheduler, however, multitask-related
system calls such as clone() and mprotect() are either not
implemented or implemented only in a limited way; thus CNK users
cannot benefit from some of the Linux tools, such as strace.
Unlike CNK, Linux kernels are primarily optimized to handle
desktop and server workloads that consist of various tasks that
could be short lived, CPU intensive, I/O intensive, highly
interactive, or a combination of these. Most tasks are loosely
coupled or independent. Handling them fairly and efficiently is
challenging, and the current Linux subsystems tend to have
negative effects on parallel HPC applications (e.g., OpenMP, MPI)
that monopolize virtually all node resources. The Linux kernel
treats those MPI or OpenMP applications as a collection of
regular processes or threads and schedules them without any bias.
The Linux kernel provides several transparent mechanisms to
optimize performance, such as process migration, NUMA balancing,
and transparent huge pages. However, these transparent mechanisms
may also widen the gap between the hardware and the application
and complicate resource identification in user space.
Our contributions include the following:
• Understanding of the multicore scalability problem on
NUMA, based on a detailed analysis of both XSBench and the
Linux kernel design.
• Analysis of the importance of physical memory allocation
in the Linux kernel. We present a few user-space NUMA
optimizations to mitigate the scalability problem and
improve the performance.
• Detailed scalability and performance analysis of the
NUMA optimizations, including energy consumption, on
different running modes.
II. THE XSBENCH PROXY APPLICATION
Monte Carlo methods of reactor simulation have a pro-
hibitively long time to solution on current-generation super-
computers, although the embarrassingly parallel nature of the
MC particle transport algorithm suggests that it should be
an exceptional candidate for good performance scaling on
exascale class supercomputers. Thus, exascale supercomputers
offer the possibility of completing a robust, full-core nuclear
reactor simulation with hundreds of nuclides and millions of
geometric regions within a reasonable wall time, opening new
avenues in reactor design. Recent studies [3], [8], however,
have found that the MC transport algorithm is generally bound
by bandwidth and DRAM latency, rather than by the floating-point
capabilities of modern processors. Since the likely path to
exascale will involve greatly increasing the floating-point
capacities of nodes, while only marginally increasing the
bandwidth [9]–[12], it is extremely important to optimize the
application, software runtime, OS, and hardware in order to
maximize bandwidth efficiency.
parallel for:
    p_energy = pick energy randomly
    p_mat = pick material randomly
    xs_vector = 0
    for j in num_nucs[p_mat]:
        nuc = mats[p_mat][j]  // get nuclide ID
        pos = binary_search p_energy in nuclide_grids[nuc]
        xs_vector += interpolate nuclide_grids[nuc] at pos
    (output: xs_vector)
Fig. 1. Pseudo code: cross-section lookup (basic algorithm)
To this end, the XSBench proxy application was created.
It abstracts the key performance aspects of full-scale MC
transport codes, such as OpenMC [13] and MCNP [14], into
a smaller package that is easier to port, run, and analyze
on various novel and experimental architectures. XSBench
executes only macroscopic neutron cross-section lookups, a
key computational kernel in MC transport applications that
constitutes 85% of the total runtime of OpenMC [3]. XS-
Bench has been shown to accurately mimic the computational
requirements of full-scale MC transport applications [15], so
performance analysis done with XSBench will translate well
to full-scale applications. XSBench is written in C, with node-
level parallelism supported by OpenMP. Reactor parameters that
define the size and scope of the problem, such as the number of
nuclides and materials used, are based on a well-known com-
munity reactor benchmark model, the Hoogenboom-Martin
model [16]. XSBench is developed by the Center for Exascale
Simulation of Advanced Reactors (CESAR) and is an open
source software project [17].
A. Algorithm and Data Structure
Figure 1 is C-like pseudo code that presents the basic idea of
the cross-section lookup algorithm. The major data structures
involved in the basic algorithm are the material data and the
nuclide grids data shown in Figure 2 and Figure 3. With the
default runtime configuration (355 nuclides tracked, roughly
4 million gridpoints, and 5 cross-section interaction types), the
nuclide grids data consumes approximately 184 MB. Compared with
the nuclide grids data, the material data size is negligible.
Because each iteration is independent, one can easily exploit
node-level parallelism using the OpenMP parallel for loop.
However, the computational cost of the lookup is quasilinear
(see Table I) because of a binary search in the innermost loop.
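The basic lookup of Figure 1 can be illustrated with a small, self-contained sketch. It is not XSBench code: the helper names (`make_nuclide_grid`, `interpolate`, `macro_xs_basic`), the toy grid sizes, and the use of linear interpolation are assumptions made for illustration. It follows the pseudo code in summing the interpolated values of every nuclide in a material, with one binary search per nuclide.

```python
import bisect
import random

def make_nuclide_grid(n_points, rng):
    """Toy grid for one nuclide: sorted energies, each with 5 cross sections."""
    energies = sorted(rng.random() for _ in range(n_points))
    xs_rows = [[rng.random() for _ in range(5)] for _ in range(n_points)]
    return energies, xs_rows

def interpolate(grid, energy):
    """Binary-search the grid, then linearly interpolate the 5 cross sections."""
    energies, xs_rows = grid
    hi = bisect.bisect_right(energies, energy)
    hi = min(max(hi, 1), len(energies) - 1)   # clamp so that lo = hi - 1 is valid
    lo = hi - 1
    f = (energy - energies[lo]) / (energies[hi] - energies[lo])
    return [a + f * (b - a) for a, b in zip(xs_rows[lo], xs_rows[hi])]

def macro_xs_basic(energy, material, nuclide_grids):
    """Accumulate the interpolated values of every nuclide in a material."""
    xs_vector = [0.0] * 5
    for nuc in material:                      # material = list of nuclide IDs
        micro = interpolate(nuclide_grids[nuc], energy)
        xs_vector = [t + m for t, m in zip(xs_vector, micro)]
    return xs_vector

rng = random.Random(0)
nuclide_grids = [make_nuclide_grid(16, rng) for _ in range(4)]
material = [2, 0, 3]                          # toy material with three nuclides
xs = macro_xs_basic(0.5, material, nuclide_grids)
```

Each call performs one binary search per nuclide in the material, which is the source of the logarithmic factor in the lookup cost.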
Fig. 2. Material. [Figure: num_nucs[] holds the number of
nuclides in each material (e.g., 12 materials; 0: 321, 1: 5,
2: 4, 3: 4, 4: 27, ..., 10: 9, 11: 9). mats[] holds the nuclide
IDs of each material (e.g., 321 nucs: (58,59,60,61,40,...);
9 nucs: (24,41,4,5,...)).]
Fig. 3. Nuclide grids. [Figure: nuclide_grids[][] has n_isotopes
(default: 355) rows of n_gridpoints (default: 11303) entries,
each row sorted by energy; size labels in the figure: ~3KB and
~184MB. Each NuclideGridPoint holds energy, total_xs, elastic_xs,
absorbtion_xs, fission_xs, and nu_fission_xs.]
Using the unionized energy grid, described by Leppänen [18] and
Romano [13], one can reduce the computational cost of the lookup
from quasilinear to linear (see Table I). The drawback of the
unionized energy grid, however, is its memory footprint. In this
study we use the unionized energy grid. Figure 5 presents the
unionized energy grid implemented in XSBench. This structure
holds the sorted energies of all nuclide grid points and, for
each of them, the precalculated closest corresponding energy
level on each of the different nuclide grids.
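To make the unionized grid idea concrete, the following sketch builds a tiny unionized grid and checks that one binary search on it yields the same per-nuclide positions as a separate binary search per nuclide. It is a toy under stated assumptions: the function names are hypothetical, and "closest corresponding energy level" is modeled as the last grid point whose energy does not exceed the query, which may differ in detail from XSBench.

```python
import bisect

def build_unionized_grid(nuclide_energies):
    """nuclide_energies[nuc] is the sorted energy list of nuclide nuc.

    Returns (energy_grid, xs_ptrs), where xs_ptrs[i][nuc] is the index of the
    last point on nuclide nuc's grid whose energy is <= energy_grid[i]
    (clamped to 0 when there is none).
    """
    energy_grid = sorted(e for grid in nuclide_energies for e in grid)
    xs_ptrs = []
    for e in energy_grid:
        row = [max(bisect.bisect_right(grid, e) - 1, 0) for grid in nuclide_energies]
        xs_ptrs.append(row)
    return energy_grid, xs_ptrs

def positions_unionized(energy, energy_grid, xs_ptrs):
    """One binary search on the unionized grid, then a table read per nuclide."""
    idx = max(bisect.bisect_right(energy_grid, energy) - 1, 0)
    return xs_ptrs[idx]

def positions_basic(energy, nuclide_energies):
    """Reference: one binary search per nuclide, as in the basic algorithm."""
    return [max(bisect.bisect_right(grid, energy) - 1, 0) for grid in nuclide_energies]

nuclide_energies = [
    [0.1, 0.4, 0.9],
    [0.2, 0.3, 0.8],
    [0.05, 0.5, 0.7],
]
energy_grid, xs_ptrs = build_unionized_grid(nuclide_energies)
```

The table `xs_ptrs` has one row per unionized grid point and one column per nuclide, which is why the memory footprint grows so much faster than that of the nuclide grids themselves.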
TABLE I
LOOKUP ALGORITHM COMPARISON
Algorithm        Lookup Cost    Memory Requirement
Basic            quasilinear    184 MB
Unionized grid   linear         5617 MB
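The memory figures in Table I can be reproduced approximately from the default problem size. The sketch below assumes 8-byte doubles, six doubles per NuclideGridPoint (the fields listed in Figure 3), 4-byte indices in the per-point index table (as suggested by the "(4*355)" label in Figure 5), and 1 MB = 2^20 bytes; these are assumptions about the layout, not statements taken from the XSBench source.

```python
N_ISOTOPES = 355
N_GRIDPOINTS = 11303        # grid points per nuclide (default)
DOUBLES_PER_POINT = 6       # energy + 5 cross sections (assumed 8 bytes each)
INDEX_BYTES = 4             # assumed size of one xs_ptr entry
MIB = 1024 * 1024

total_points = N_ISOTOPES * N_GRIDPOINTS                      # roughly 4 million
nuclide_grids_bytes = total_points * DOUBLES_PER_POINT * 8
xs_ptrs_bytes = total_points * N_ISOTOPES * INDEX_BYTES       # (4*355)*355*11303

print(f"grid points        : {total_points}")
print(f"nuclide grids (MiB): {nuclide_grids_bytes / MIB:.1f}")
print(f"xs_ptr tables (MiB): {xs_ptrs_bytes / MIB:.1f}")
print(f"sum           (MiB): {(nuclide_grids_bytes + xs_ptrs_bytes) / MIB:.1f}")
```

Under these assumptions the nuclide grids come to about 184 MiB and the sum with the index tables to about 5617 MiB, consistent with Table I.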
The algorithm exhibits highly random memory access patterns due
to multiple levels of indirect memory accesses. With the default
configuration, nuclide grids is 184 MB in size, which does not
fit even in the last level of cache and is likely to cause
translation lookaside buffer (TLB) misses. On the Intel Xeon
node, nearly every access to this structure causes a first-level
TLB miss with the default 4 KB page size because the first-level
data TLB has only 64 entries for 4 KB pages (see Table II).
parallel for:
    p_energy = pick energy randomly
    p_mat = pick material randomly
    idx = binary_search energy on energy_grid
    xs_vector = 0
    for j in num_nucs[p_mat]:
        nuc = mats[p_mat][j]  // get nuclide ID
        pos = energy_grid[idx].xs_ptrs[nuc]
        xs_vector += interpolate nuclide_grids[nuc] at pos
    (output: xs_vector)
Fig. 4. Pseudo code: cross-section lookup (with unionized energy grids)
Fig. 5. Energy Grids. [Figure: energy_grid[] has
n_isotopes*n_gridpoints entries (~30MB); each entry refers to an
xs_ptr[] array of n_isotopes elements ((4*355) bytes, ~1.4KB);
all xs_ptr[] arrays together occupy (4*355)*355*11303 bytes
(~5435MB).]
B. Multicore Scalability
To observe the multicore scaling efficiency, we run XSBench on
an Intel Sandy Bridge node, changing the number of OpenMP
threads. The node includes two Xeon E5-2670 processors, each
with 8 cores and 16 hardware threads with hyper-threading (HT),
running at 2.6 GHz and up to 3.3 GHz with the Intel turboboost
technology. The two processors are connected via two Intel
QuickPath Interconnect (QPI) links, forming a cache-coherent
NUMA node (see Figure 6). The node runs Linux kernel version
3.16 with NUMA enabled.
Fig. 6. A dual socket Sandy Bridge NUMA node. [Figure: socket0
and socket1, each with eight cores, a last-level cache (LLC), a
memory controller (MC), and attached memory, connected via QPI;
memory attached to a core's own socket is local, memory attached
to the other socket is remote.]
Figure 7 shows the cross-section lookup performance, measured
with XSBench running at 2.6 GHz. Figure 8 shows the multicore
scaling efficiency, which is calculated by the following equation.
$$\mathrm{Efficiency}_n(\%) = \frac{P_n}{P_1 \times n} \times 100 \qquad (1)$$
where $n$ is the number of threads and $P_n$ is the measured
performance with $n$ threads. The scaling efficiency drops to
approximately 70% on 16 OpenMP threads. This result matches the
scaling problem previously reported by Tramm et al. [3] on an
Intel Xeon NUMA node with two eight-core E5-2650 processors
running at 2 GHz and up to 2.8 GHz with turboboost. They
observed approximately 1.6 million lookups/s and 70% efficiency
at 16 threads. On an IBM BlueGene/Q (BGQ) node, however, the
scaling efficiency drops only to 96% at 16 threads. The major
difference between the BGQ node and the Xeon node is that the
BGQ node is a uniform memory access shared memory architecture
while the Xeon nodes are NUMA. Another notable difference is
that the BGQ node runs a custom OS kernel that is optimized for
HPC workloads while the Xeon nodes run the Linux kernel. In the
previous study only marginal performance gains were observed
with HT. We find that XSBench compiled by an Intel compiler
(icc) shows a reasonable performance gain with HT.
Fig. 7. Cross section lookup. [Figure: million lookups/s (axis
0.0 to 4.0) versus number of OpenMP threads (1, 8, 16, 32) for
icc (reference), gcc, and ideal; the hyper-threading region is
marked.]
Fig. 8. Multicore scaling. [Figure: efficiency [%] (axis 30 to
100) versus number of OpenMP threads (1, 8, 16, 32) for icc
(reference) and gcc; the hyper-threading region is marked.]
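Equation (1) is simple enough to check by hand, but a short sketch makes the relationship between the reported numbers explicit. The function names are hypothetical; the only values taken from the text are the 1.6 million lookups/s and 70% efficiency at 16 threads reported by Tramm et al. [3], from which the implied single-thread rate is derived rather than measured.

```python
def scaling_efficiency(p_n, p_1, n):
    """Equation (1): rate at n threads over n times the 1-thread rate, in %."""
    return p_n / (p_1 * n) * 100.0

def implied_single_thread_rate(p_n, efficiency_percent, n):
    """Invert Equation (1) to recover P_1 from a reported P_n and efficiency."""
    return p_n / (n * efficiency_percent / 100.0)

# Derived, not measured: the 1-thread rate implied by the figures quoted above.
p1 = implied_single_thread_rate(1.6e6, 70.0, 16)
print(f"implied single-thread rate: {p1:,.0f} lookups/s")
print(f"efficiency at 16 threads  : {scaling_efficiency(1.6e6, p1, 16):.1f}%")
```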
III. OPERATING SYSTEM AND RUNTIME SYSTEM
Data locality is a critical factor for performance on a NUMA
node. However, controlling NUMA-aware data placement is
not a simple task for user-space codes because of the OS
abstraction, the virtual address, and load balancing. In this
section we detail the Linux kernel’s internal design and explain
why the current design affects the multicore scalability of a
memory-intensive OpenMP parallel code on a NUMA node.
A. Kernel Management Structure and OpenMP
Figure 9 is a simplified view of the major management
structures in the Linux kernel related to this study. In this
figure, each rounded rectangle represents a process or thread,
and each sharp-edged rectangle represents an internal data
structure. The task_struct structure contains all the
information related to a process or thread, such as the process
ID. The difference between a process and a thread is that a
process has its own mm_struct while a thread shares the parent’s
mm_struct. An OpenMP program, for example, is simply a
collection of a process (the master thread) and one or more
threads that are cloned by the master thread. Thus the Linux
kernel does not distinguish an OpenMP program from others.
When a new process/thread is created, the child inherits many
attributes from its parent, including the CPU affinity mask and
the NUMA policy. In fact, the taskset command leverages this
behavior in order to control the child process’s CPU affinity
mask (e.g., start a program on the second core). However, the
taskset command cannot give the non-master threads of an OpenMP
program individual affinity masks, because each thread simply
inherits the mask of its parent. For a multithreaded program,
each thread needs to call the sched_setaffinity() system call in
order to bind itself to a specific CPU core.
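The per-thread binding described above can be sketched with Python threads standing in for OpenMP threads. This is an illustration only: `assign_cpus`, `pin_current_thread`, and `run_pinned` are hypothetical helper names, and the round-robin assignment is an assumption. `os.sched_setaffinity` is a thin wrapper over the Linux system call of the same name and is available on Linux only; with a pid of 0 the system call applies to the calling thread.

```python
import os
import threading

def assign_cpus(n_threads, allowed_cpus):
    """Round-robin plan: thread i gets the i-th allowed CPU (wrapping around)."""
    cpus = sorted(allowed_cpus)
    return [cpus[i % len(cpus)] for i in range(n_threads)]

def pin_current_thread(cpu):
    """Bind the calling thread to one CPU and return its resulting mask."""
    os.sched_setaffinity(0, {cpu})          # pid 0: the calling thread (Linux)
    return os.sched_getaffinity(0)

def run_pinned(n_threads):
    """Start workers that each pin themselves, as each OpenMP thread must."""
    plan = assign_cpus(n_threads, os.sched_getaffinity(0))
    seen = [None] * n_threads

    def worker(i):
        seen[i] = pin_current_thread(plan[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return plan, seen
```

Calling `run_pinned(4)` on a Linux machine returns the plan together with the mask each worker observed after pinning itself; the launching thread keeps its original mask because only the workers call `sched_setaffinity`.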
Fig. 9. Kernel management structure. [Figure: a launcher process
(e.g., bash, taskset, numactl) clones the master thread of a
multithreaded code (e.g., XSBench), which in turn clones its
subthreads. Each of them has a task_struct (pid, ..., CPU
affinity, NUMA policy). The task_struct refers to an mm_struct
(vma list, page tables), whose vm_area_struct entries record
start, end, and NUMA policy.]
TABLE II
INTEL SANDY BRIDGE’S DATA TLB
               4 KB   2 MB   1 GB
First level    64     32     4
Second level   512    none   none
B. Physical Memory Allocation
One of the important roles of the OS kernel is to provide a
linear virtual address space to each process/thread while
keeping physical memory utilization high. The Linux kernel, for
example, relies on a memory management unit (MMU) that supports
paged virtual memory, which divides a virtual address space into
contiguous memory blocks called pages. In a normal operating
mode (e.g., long mode on x86-64), CPU instructions cannot access
data with physical addresses directly, even inside the OS
kernel, because processors always translate a virtual address to
a physical address. The OS kernel is responsible for managing
in-memory page tables for virtual-to-physical address
translation. Processors normally have a translation lookaside
buffer (TLB) to cache recently used page table entries, because
walking the in-memory page tables for every translation is
prohibitively expensive. The TLB is a scarce resource; the
number of TLB entries is limited, so TLB misses still occur,
leading to performance degradation [19], [20]. A larger page
size¹, if available, can reduce TLB misses and improve the
memory access performance [21]. Table II shows the number of
data TLB entries for each available page size in the Intel Sandy
Bridge microarchitecture. With 2 MB pages, for example, a 64 MB
memory area can be accessed without TLB misses.
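The 64 MB figure follows from multiplying the number of TLB entries by the page size, often called the TLB reach. The sketch below applies this to the first-level entries of Table II and compares the reach with the 184 MB nuclide grids; the function names are hypothetical and the model ignores the second-level TLB.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

# First-level data TLB entries per page size, copied from Table II.
FIRST_LEVEL_DTLB = {4 * KIB: 64, 2 * MIB: 32, 1 * GIB: 4}

def tlb_reach(page_size, entries):
    """Bytes of address space that the TLB can map at once."""
    return page_size * entries

def covered_fraction(working_set, page_size, entries):
    """Share of a working set whose translations can be resident together."""
    return min(1.0, tlb_reach(page_size, entries) / working_set)

NUCLIDE_GRIDS = 184 * MIB   # size quoted in Section II-A
for page_size, entries in FIRST_LEVEL_DTLB.items():
    reach = tlb_reach(page_size, entries)
    share = covered_fraction(NUCLIDE_GRIDS, page_size, entries)
    print(f"page {page_size:>10} B: reach {reach / MIB:8.2f} MiB, covers {share:6.1%}")
```

With 4 KB pages the first-level reach is only 256 KB, a tiny fraction of the structure, which is why random accesses to it almost always miss.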
¹ The start address of each page must be aligned with its own page size.
Fig. 10. Memory allocation. [Figure: malloc() invokes the mmap()
system call, which creates a virtual memory range (0x1000 to
0x4000 in the example); touching the memory (e.g., store r8,
0x1100) raises a page fault exception, upon which the kernel
allocates a physical page from physical memory (domain0 or
domain1) and installs a PTE.]
Fig. 11. A multithreaded program. [Figure: the master thread and
the parallel region with threads 0 to 3 spread over domain0 and
domain1, with the page fault marked.]
Figure 10 is an example of the flow of a typical memory
allocation in the Linux environment. First an application
process calls a memory allocation function such as the malloc()
library function, which for large requests internally invokes
the mmap() system call. The mmap() call itself only creates a
virtual memory area when it is invoked; physical memory is not
allocated at this point unless the MAP_POPULATE flag is
specified. In Figure 9, mmap() manipulates vm_area_struct. When
the user-space program attempts to store to or touch the virtual
memory range created by mmap() for the first time, the processor
raises a page fault exception, and the Linux kernel allocates a
physical memory page and installs a page table entry (PTE) in
the in-memory page table for virtual-to-physical address
translation.
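The map-then-touch sequence can be observed from user space by counting minor page faults around the two steps. The following sketch uses Python's `mmap` and `resource` modules (Unix only); `map_then_touch` is a hypothetical helper, and the fault counts it reports depend on the kernel, the page size, and the interpreter's own memory activity, so they should be read as indicative rather than exact.

```python
import mmap
import resource

def minor_faults():
    """Minor page faults taken by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

def map_then_touch(n_pages):
    """Create an anonymous mapping, then store one byte into every page."""
    page = mmap.PAGESIZE
    before_map = minor_faults()
    buf = mmap.mmap(-1, n_pages * page)       # virtual range only, no pages yet
    after_map = minor_faults()
    for i in range(n_pages):
        buf[i * page:i * page + 1] = b"\x01"  # first store to each page
    after_touch = minor_faults()
    touched = sum(buf[i * page] for i in range(n_pages))
    buf.close()
    return after_map - before_map, after_touch - after_map, touched
```

On a typical Linux system one would expect the second delta (faults while touching) to be much larger than the first (faults while mapping), reflecting that physical pages are allocated at first touch rather than at mmap() time.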
C. NUMA Memory Domain
A NUMA system has multiple memory domains. Our Sandy
Bridge testbed, for example, has two NUMA domains. By
default, the Linux kernel allocates a physical page from the
NUMA domain where a thread causes the page fault. This
is called “first-touch.” XSBench, an OpenMP multithreaded
application, allocates and touches the data buffer from the
master thread (Figure 11). By default, a physical page is
allocated from the domain where the master thread causes
a page fault; hence, the application’s data buffer is likely to
be in one of the NUMA domains. During its computational
phase in the parallel region (Figure 11) that spreads over
all domains, half of the CPU cores have to incur expensive
remote memory accesses. This is a typical cause of the
multicore scalability problem on a NUMA node, particularly
for a memory-intensive OpenMP code such as XSBench. The
problem is expected to become much more pronounced as the
number of NUMA domains increases, and it drastically reduces
the system’s energy efficiency.
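A toy model helps show why first-touch initialization by the master thread is harmful. The sketch below is a simulation, not a measurement: it assumes two domains, eight threads per domain, and that every thread reads every page equally often (a stand-in for XSBench's random lookups). All names are hypothetical.

```python
def first_touch_placement(n_pages, toucher_domain_of_page):
    """Each page lands in the domain of the thread that touches it first."""
    return [toucher_domain_of_page(p) for p in range(n_pages)]

def remote_fraction(placement, thread_domains):
    """Fraction of (thread, page) accesses that cross sockets, assuming every
    thread reads every page equally often."""
    total = len(placement) * len(thread_domains)
    remote = sum(1 for d in thread_domains for p in placement if p != d)
    return remote / total

N_PAGES = 1024
THREAD_DOMAINS = [0] * 8 + [1] * 8            # 16 threads on two sockets

# The master thread, running on domain 0, initializes the whole buffer.
master_init = first_touch_placement(N_PAGES, lambda page: 0)
per_domain_remote = {d: remote_fraction(master_init, [d]) for d in (0, 1)}
share_on_domain0 = master_init.count(0) / N_PAGES
```

In this model the threads on domain 0 never go remote while the threads on domain 1 always do, and every request, local or remote, is served by the memory of domain 0.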
The Linux kernel detects the NUMA topology of the system
during the boot time (e.g., by parsing the system resource
affinity table on the Sandy Bridge system) and initializes the
NUMA scheduler domains with the detected topology. The
Linux kernel internally has service routines to identify the
NUMA domain ID associated with the CPU or the memory
range of each NUMA domain. The Linux kernel exposes the
NUMA topology information to the user space via the sysfs
virtual file system, but user-space programs are unable to
identify the NUMA domain ID related to a specific memory
address. Although one can look up a physical address from
a process’s virtual address using the pagemap interface, the
Linux kernel currently does not provide a mechanism for the
user space to look up the NUMA domain ID related to the
physical address.
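The CPU side of the topology is straightforward to read from sysfs. The sketch below parses the `cpulist` files under `/sys/devices/system/node/`, which use a compact range notation, and builds a CPU-to-domain map. The helper names are hypothetical, and the CPU numbering used in the usage note is an invented example, not the numbering of the testbed.

```python
import glob
import os

def parse_cpulist(text):
    """Parse the sysfs cpulist format, e.g. "0-7,16-23"."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def cpu_to_node_map(node_cpulists):
    """Turn {node: cpulist text} into {cpu: node}."""
    return {cpu: node
            for node, text in node_cpulists.items()
            for cpu in parse_cpulist(text)}

def read_node_cpulists(root="/sys/devices/system/node"):
    """Read node*/cpulist from sysfs; returns {} where the files do not exist."""
    result = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        name = os.path.basename(os.path.dirname(path))   # e.g. "node0"
        with open(path) as f:
            result[int(name[len("node"):])] = f.read()
    return result
```

For example, `cpu_to_node_map({0: "0-7,16-23", 1: "8-15,24-31"})` maps CPU 9 to domain 1 and CPU 16 to domain 0; on a real machine the input would come from `read_node_cpulists()`.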
D. Load Balancing
The Linux kernel distributes workloads over the CPU cores by
migrating processes/threads in order to utilize all the
resources efficiently. It takes into account various runtime
attributes such as CPU busyness/idleness, cache hotness, memory
pressure, and scheduling domains in order to choose the CPU core
on which a process runs, subject to the process’s CPU affinity
mask.² In fact the Linux kernel scheduler is aware of the NUMA
domain, migrating processes within the associated NUMA domain as
much as possible. In the previous XSBench example, the master
thread is therefore likely to stay in the same NUMA domain.
The recent Linux kernel (e.g., 3.8 or later) has a configura-
tion option for automatic NUMA-aware memory placement,
which utilizes the page fault handling mechanism to detect
expensive remote accesses. Once these are detected, it attempts
to migrate the physical pages close to the thread that caused
the page fault.
However, the process migration and the automatic NUMA
balancing basically conflict with each other, because locality
is important to the NUMA balancing while equal distribution
is important to the CPU load-balancing. In general the Linux
kernel is optimized primarily for general-purpose workloads
such as server workloads that consist of independent, irregular
tasks. On the other hand, HPC workloads are usually parallel
and tend to monopolize the resources for a long duration;
thus such automatic or transparent mechanisms tend to have
a negative effect on HPC workloads.
E. NUMA Memory Policy
The Linux kernel provides two system calls to control the NUMA
policy: the set_mempolicy() system call sets the policy per
task, and the mbind() system call sets the policy of each
virtual memory range individually. The major NUMA policies are
default, bind, and interleave (see Figure 12). The default
policy is the first-touch policy previously described. The bind
policy restricts physical memory allocation to the specified
domains. The interleave policy interleaves physical memory
allocation across the specified domains. The granularity of the
interleave is the page size (e.g., 4 KB). The unionized energy
grids would be a perfect candidate for this policy.
² By default, a process’s CPU affinity mask is set to all CPUs.
Fig. 12. NUMA memory policy. [Figure: (a) Bind: the data resides
in the memory of one socket; (b) Interleaved: the data is spread
over the memory of socket0 and socket1. Each socket is drawn
with two cores.]
The interleave policy makes a NUMA node behave approximately
like a uniform memory access node, at the cost of a performance
penalty; it potentially mitigates the scalability problem but
stresses the interconnect.
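The effect of page-granular interleaving can be modeled in a few lines. The sketch assumes a simple round-robin assignment of consecutive pages to the listed domains; the kernel's actual page-to-domain mapping may differ in detail, and the function names are hypothetical.

```python
def interleaved_node(offset, page_size, nodes):
    """Domain backing the page that contains `offset` under round-robin interleave."""
    return nodes[(offset // page_size) % len(nodes)]

def local_fraction(buffer_size, page_size, nodes, my_node):
    """Share of a buffer's pages that are local to a thread running on my_node."""
    n_pages = -(-buffer_size // page_size)           # ceiling division
    local = sum(1 for p in range(n_pages)
                if interleaved_node(p * page_size, page_size, nodes) == my_node)
    return local / n_pages

PAGE = 4096
half_local = local_fraction(1 << 20, PAGE, [0, 1], my_node=1)
```

With two domains every thread finds half of the pages local and half remote regardless of where it runs, which matches the observation that interleaving trades peak locality for uniform behavior.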
IV. MEMORY OPTIMIZATIONS
In the previous section we explained how an unbalanced dis-
tribution of the physical memory allocation prevents XSBench
from scaling on a multicore NUMA system. In this section we
present three memory optimizations: “numactl,” “numag,” and
“numag+hugetlb.”
1) default: The default provides no memory optimization.
XSBench is executed with the default CPU affinity mask (set
to all CPUs), which means the Linux kernel can migrate
XSBench threads according to its default scheduling policy.
2) numactl: The numactl optimization requires no code
modification. XSBench is executed via the numactl command with
the “--interleave=all” option, which sets the interleave policy
for all XSBench threads, with the granularity of the default
page size (4 KB). Internally the numactl command sets the
interleave policy in the NUMA policy attribute of task_struct
(see Figure 9) and starts XSBench. The NUMA policy is inherited
by the child threads from the XSBench master thread; thus all
XSBench threads use the interleave policy.
3) numag: The numag optimization requires a minimal code
modification: the target memory allocation functions need to be
replaced with custom NUMA-aware memory allocation functions
provided by a small library called “numag,” which we implemented
for this experiment. The major functionality of the “numag”
library is summarized as follows.
• Find a socket ID (a NUMA domain ID) from a CPU ID.
• Find a per-socket master from a CPU ID.
• Allocate a buffer in all sockets. This is used to duplicate
read-only data.
• Allocate a buffer with interleaving enabled, which calls
the mbind() system call internally.
In order to provide this functionality, the numag library has
to disable process migration by strictly binding OpenMP
threads to CPUs.
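The bookkeeping behind the first three functions in the list can be sketched as follows. This is not the numag API, which the text does not specify: the class and method names are hypothetical, choosing the lowest-numbered CPU as the per-socket master is an assumption, and a Python list copy only models the existence of one replica per socket without controlling where its pages are physically placed.

```python
class SocketReplicas:
    """One private copy of read-only data per socket (NUMA domain)."""

    def __init__(self, data, cpu_to_socket):
        self.cpu_to_socket = dict(cpu_to_socket)
        sockets = sorted(set(self.cpu_to_socket.values()))
        self.copies = {s: list(data) for s in sockets}   # one replica per socket

    def socket_of(self, cpu):
        """Socket ID (NUMA domain ID) for a CPU ID."""
        return self.cpu_to_socket[cpu]

    def master_of(self, cpu):
        """Per-socket master: here, the lowest-numbered CPU on the same socket."""
        s = self.socket_of(cpu)
        return min(c for c, sock in self.cpu_to_socket.items() if sock == s)

    def local_copy(self, cpu):
        """Replica that a thread bound to `cpu` should read from."""
        return self.copies[self.socket_of(cpu)]

# Hypothetical numbering: CPUs 0-7 on socket 0, CPUs 8-15 on socket 1.
topology = {cpu: 0 if cpu < 8 else 1 for cpu in range(16)}
replicas = SocketReplicas([1.0, 2.0, 3.0], topology)
```

The lookup from CPU to replica is only stable if a thread stays on its CPU, which is one way to see why strict thread binding is a prerequisite.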
As for the XSBench data structures, the nuclide grids data
structure (Figure 3) is a good candidate for duplication because
it is the most frequently accessed structure and the access
pattern to this structure is highly random. Given the data
structure size (184 MB) and the Sandy Bridge TLB specification
(Table II), every data access is likely to end up with a TLB
miss. If this structure is allocated on a remote node, for
example, the page tables associated with this structure are also
located in the remote node. Refilling TLB entries and loading
the actual data remotely increase both the access latency and
the energy consumption. The good news is that the data size is
relatively small; duplicating this structure in every socket
will not create critical memory pressure. On the other hand, the
unionized energy grids (see Figure 5) are huge and are less
frequently accessed than the nuclide grids. We therefore
interleave the unionized energy grids in the numag optimization.
Figure 13 depicts how the numag optimization allocates XSBench’s
data structures.
Fig. 13. XSBench data placement. [Figure: the nuclide grids are
duplicated in socket0 and socket1; the unionized energy grids
are interleaved across both sockets.]
4) numag+hugetlb: The numag library also provides a function
that allocates a buffer mapped explicitly with 2 MB pages, using
the Linux kernel hugetlb support. Note that the Linux kernel’s
transparent huge page support is disabled in our test
environment. In this optimization we allocate only the nuclide
grids with 2 MB pages; the other structures are allocated with
the default 4 KB pages. This is technically not a NUMA
optimization; however, it does reduce the memory traffic caused
by TLB misses.
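A quick count shows how much the number of translations shrinks when the nuclide grids are mapped with 2 MB pages. The buffer size below is derived from the default problem size under the assumption of six 8-byte doubles per grid point; it is an estimate, not a value read from XSBench.

```python
def pages_needed(size_bytes, page_size):
    """Number of pages required to map a buffer (ceiling division)."""
    return -(-size_bytes // page_size)

# Assumed layout: 355 nuclides x 11303 points x 6 doubles x 8 bytes.
NUCLIDE_GRIDS_BYTES = 355 * 11303 * 6 * 8

small_pages = pages_needed(NUCLIDE_GRIDS_BYTES, 4 * 1024)
huge_pages = pages_needed(NUCLIDE_GRIDS_BYTES, 2 * 1024 * 1024)
print(f"4 KB pages: {small_pages}, 2 MB pages: {huge_pages}")
```

Tens of thousands of 4 KB translations collapse into fewer than a hundred 2 MB translations, so a far larger share of them can stay resident in the TLB.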
V. RESULTS
We evaluate the memory optimizations on the Intel Sandy
Bridge–based dual-socket NUMA node (described in Section II-B).
We first compare the performance of the optimizations and show
that they mitigate the multicore scalability problem
(Section V-A). We then compare the performance and energy
consumption of the optimizations in four different running
modes (Section V-B).
A. Memory Optimizations
Figure 14 shows a comparison of the multicore scalability
efficiency of XSBench with the three memory optimizations
described in the previous section; the running environment is
the same as that used in Section II-B. The scaling efficiency
in this comparison is calculated by the following equation:
$$\mathrm{Efficiency}_n(\%) = \frac{P_n}{P_{d1} \times n} \times 100 \qquad (2)$$
where $n$ is the number of threads, $P_n$ is the measured
performance with $n$ threads, and $P_{d1}$ is the measured
performance of “default” with one thread.
The results clearly show that both “numactl” and “numag”
improve the multicore scalability. With one thread, however,
the “numactl” performance is lower than that of “default.” The
reason is that the data is interleaved, so half the data
requires expensive remote access. At 16 threads, the efficiency
improves from a nonoptimized 70% to 80% with “numactl,” which
requires no code modification, and to approximately 95% with
“numag,” which is close to that of the IBM BG/Q uniform memory
access node observed by Tramm et al. [3]. The efficiency drops
drastically at 32 threads, where hyper-threading (HT) is in
use; this is expected because the hardware threads share core
resources such as the L1 cache.
Fig. 14. Multicore scaling. [Figure: efficiency [%] (axis 40 to
140) versus number of OpenMP threads (1, 8, 16, 32) for default,
numactl, numag, and numag+hugetlb; the hyper-threading region is
marked.]
Fig. 15. Cross-section lookup. [Figure: million lookups/s (axis
0.0 to 4.5) versus number of OpenMP threads (1, 8, 16, 32) for
default, numactl, numag, and numag+hugetlb; the hyper-threading
region is marked.]
Figure 15 shows a comparison of the neutron cross-section
lookup performance among the memory optimizations. We
note that hyper-threading is effective for all memory opti-
mizations. With “default,” XSBench achieves 2.7 MLookups/s.
Without any code modification (“numactl”), it achieves 3.6
MLookups/s. With the maximum optimization, it achieves 4.4
MLookups/s with hyper-threading, which is a considerable
improvement.
In addition to the regular scalability and performance
analysis, we measure the detailed energy consumption of each
memory optimization (see Figure 16) at 16 threads, by reading
Intel running average power limit (RAPL) counters through the
sysfs interface.³ CPU0 and CPU1 are the 8-core CPUs in socket0
(domain0) and socket1 (domain1), respectively. DRAM0 and DRAM1
are CPU0’s DRAM and CPU1’s DRAM, respectively.
Fig. 16. Energy breakdown at 16 threads. [Figure: energy per
lookup [uJ] (axis 0 to 40) for default, numactl, numag, and
numag+hugetlb, broken down into CPU0, CPU1, DRAM0, and DRAM1.]
With “default,” DRAM0 consumes more energy than DRAM1 does. The
reason is that XSBench’s data are primarily allocated in
socket0’s memory. The DRAM energy consumption is basically
proportional to the total number of memory requests. With the
NUMA optimizations, DRAM0 and DRAM1 consume about the same
energy because data accesses are balanced out. Moreover, the
total DRAM energy consumption is about the same. On the other
hand, the CPU energy consumption is basically proportional to
the time to solution; that is, the total CPU energy consumption
decreases with higher optimization.
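Energy per lookup can be derived from two readings of a cumulative RAPL energy counter. The sketch below shows the arithmetic, including the case where the counter wraps between readings. The helper names and the numbers in the usage note are hypothetical. On Linux kernels with the powercap driver the counters are typically exposed as files such as `/sys/class/powercap/intel-rapl:0/energy_uj` with the wrap limit in `max_energy_range_uj`; whether the measurements in this paper used exactly these files is an assumption.

```python
def read_counter(path):
    """Read one integer value (microjoules) from a sysfs counter file."""
    with open(path) as f:
        return int(f.read())

def energy_delta_uj(before, after, max_range_uj):
    """Counter difference in microjoules, allowing for a single wraparound."""
    if after >= before:
        return after - before
    return after + max_range_uj - before

def energy_per_lookup_uj(before, after, max_range_uj, lookups):
    """Average energy per lookup over the measured interval."""
    return energy_delta_uj(before, after, max_range_uj) / lookups
```

For example, a counter that advances by 3,000,000 uJ while 100,000 lookups complete corresponds to 30 uJ per lookup; repeating the calculation per package and per DRAM domain gives a breakdown of the kind shown in Figure 16.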
B. Running Modes
In addition to the memory optimizations, we explore the
running modes listed in Table III; each running mode is
described below. Figure 17 compares the normalized lookup
performance among the memory optimizations in the four
running modes. Similarly, Figure 18
compares the energy consumption per lookup.
Disabling the kernel-level, automatic NUMA-aware memory
placement (“nobal”) improves “default” and “numactl,” but
it has no effect on “numag.” This behavior needs further
investigation; one possible cause is that, with balancing enabled,
the physical pages associated with the
nuclide grids migrate back and forth between socket0
and socket1. The running mode “gen” has a negative impact
on “numag,” presumably because the unionized energy grids
are not interleaved properly. We observe that “turbo” is
always effective. However, Turbo Boost exploits the thermal
headroom, so its performance benefit may vary. In terms of the total
energy consumption (CPU and DRAM), the most optimized
version consumes 25% less energy than does the nonoptimized
version in the base running mode.
TABLE III
RUNNING MODES
Label   Initialization   NUMA Balancing   Frequency
base    file             enable           2.6 GHz
gen     runtime          enable           2.6 GHz
nobal   file             disable          2.6 GHz
turbo   file             enable           boost (up to 3.3 GHz)
Fig. 17. Normalized lookup performance at 16 threads (y-axis: normalized lookups/s; x-axis: default, numactl, numag, numag+hugetlb; series: base, gen, nobal, turbo)
Footnote 3: The newer Linux kernel provides the sysfs interface for both measuring
energy and capping power using RAPL.
1) Initialization: In Table III, the column “Initialization”
refers to how XSBench initializes its data set. The “file”
mode is described in Section II-B and Section III-C. XSBench
reads a precomputed data set from a file into a memory
buffer allocated during its OpenMP serial region (by the
master thread), which creates an unbalanced data placement.
Initializing the data set in the “runtime” mode takes XSBench
longer than initializing in the “file” mode because initializing
the unionized energy grid is expensive; the computational
cost of the initialization is $n^2 m \log m$, where $n$ is the total
number of isotopes and $m$ is the total number of gridpoints.
With the default XSBench configuration, initialization takes
approximately 300 seconds on our Sandy Bridge testbed with
one thread. To reduce the initialization time, an
OpenMP parallel for loop is used. With the first-touch policy,
the physical memory allocation is then likely to be interleaved
across the sockets in some way.
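The effect of the first-touch policy on the two initialization styles can be sketched as follows. This is a minimal illustration, not XSBench's initialization code: the function names and the fill value are hypothetical, and it assumes a Linux-style first-touch policy in which a page is physically allocated on the NUMA node of the thread that first writes to it. If OpenMP is not enabled at compile time, the pragma is ignored and both functions behave identically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Serial initialization: the master thread touches every page first,
 * so under first-touch all pages tend to land on the master's node. */
double *init_serial(size_t n, double fill)
{
    assert(n > 0);
    double *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        buf[i] = fill;
    return buf;
}

/* Parallel initialization: each thread first touches its own chunk,
 * so pages tend to be spread over the nodes where the threads run. */
double *init_parallel(size_t n, double fill)
{
    assert(n > 0);
    double *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        buf[i] = fill;
    return buf;
}
```

The resulting placement depends on where the threads happen to run, which is why the distribution is only "likely" to be interleaved rather than controlled.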
2) NUMA Balancing: This refers to the automatic NUMA-aware
memory placement kernel option described in
Section III-C. The running mode “nobal” disables this kernel
option; otherwise, it is enabled.
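On kernels built with automatic NUMA balancing, the current setting can be inspected at run time. The sketch below assumes the `kernel.numa_balancing` sysctl exposed at `/proc/sys/kernel/numa_balancing` (0 means disabled, nonzero means enabled). The function name is hypothetical, and the file is absent on kernels without this feature.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if automatic NUMA balancing is enabled, 0 if disabled,
 * and -1 if the setting cannot be read. The caller passes the path so
 * that the sysctl location stays an explicit assumption. */
int numa_balancing_enabled(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int value = 0;
    int ok = (fscanf(f, "%d", &value) == 1);
    fclose(f);
    if (!ok)
        return -1;
    return value != 0;
}
```

A typical call would be `numa_balancing_enabled("/proc/sys/kernel/numa_balancing")`; checking this before each run helps confirm which running mode is actually in effect.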
Fig. 18. Energy consumption per lookup at 16 threads (y-axis: energy per lookup [uJ]; x-axis: default, numactl, numag, numag+hugetlb; series: base, gen, nobal, turbo)
3) Frequency: This refers to the processor frequency
configuration. Except in the running mode “turbo,” the frequency
is set to 2.6 GHz, which is the maximum user-configurable
frequency on our testbed. With “turbo,” the processor can
raise the frequency (up to 3.3 GHz) under a heavy workload, depending on
the current thermal headroom. Since the energy consumption
depends on temperature and since Turbo Boost is a thermal-aware
mechanism, we cool the processor down below 40 degrees
Celsius before starting XSBench each time in this study.
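A pre-run temperature check of this kind can be scripted. The sketch below assumes the Linux thermal sysfs interface, where a zone's `temp` file reports millidegrees Celsius. Which zone corresponds to the processor package (for example `/sys/class/thermal/thermal_zone0/temp`) is system dependent, and the function names and polling loop are illustrative rather than the procedure used in this study.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Read a thermal zone temperature in millidegrees Celsius.
 * Returns 0 on success and -1 if the file cannot be read. */
int read_temp_millic(const char *path, long *millic)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%ld", millic) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* True when the reading is strictly below the limit in degrees Celsius. */
int is_cool_enough(long millic, long limit_c)
{
    return millic < limit_c * 1000L;
}

/* Poll once per second until the zone is below limit_c or max_polls
 * attempts are used up. Returns 0 when cool, -1 on timeout or error. */
int wait_until_cool(const char *path, long limit_c, int max_polls)
{
    assert(max_polls > 0);
    for (int i = 0; i < max_polls; i++) {
        long millic;
        if (read_temp_millic(path, &millic) != 0)
            return -1;
        if (is_cool_enough(millic, limit_c))
            return 0;
        sleep(1);
    }
    return -1;
}
```

Calling `wait_until_cool(path, 40, max_polls)` before launching the benchmark gives each run a comparable thermal starting point.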
VI. RELATED WORK
Since the concept of NUMA has been around for a while,
there are numerous studies on NUMA architecture and
NUMA-related memory management techniques [22]–[25]. The advent
of higher core-count multicore processors and modern
interconnects poses new, interesting challenges at every level, from
algorithms to system software design. Majo et al. [26] analyzed the
memory controller behavior of the Intel Xeon 5520 processor
and developed a model to characterize the memory
bandwidth. They found that maximizing data locality does not
necessarily improve performance and suggested that allocating data
on a remote processor may benefit applications. Li et al. [27]
optimized a data shuffling algorithm for
modern NUMA architectures. They characterized the bandwidth
and latency of a 4-socket Nehalem-EX system, presented the
problem of data shuffling on NUMA, and showed that their
optimized version is three times faster than the naive version.
Furthermore, the scalability of multithreaded, shared-memory
programming languages and APIs such as OpenMP
is becoming a major issue on multicore NUMA
systems because those languages and APIs were originally designed
on the assumption of uniform memory access (UMA).
Many studies have been conducted to extend shared-memory
programming languages to NUMA. Broquedis et al. [28]
combined a NUMA-aware memory manager with their runtime
system to enable dynamic load distribution, utilizing
information from the application structure and hardware topology.
Olivier et al. [29] proposed a hierarchical scheduling strategy
to improve performance and demonstrated that
several benchmarks scale to 192 CPUs of an
SGI Altix with their strategy.
In this study we detail the internal design of the Linux kernel
as well as the algorithm because we believe that the increasingly
complex system software layer tends to widen the gap between the
hardware and the application. In addition, we investigate
the influence of running modes and the energy consumption.
VII. CONCLUSION
The multicore scalability of OpenMP programs on NUMA systems
is becoming a significant issue. We explain what degrades multicore
scalability, presenting details of both XSBench and the Linux
kernel. We demonstrate that precise control of the physical
memory allocation is important on NUMA systems and that a simple
technique such as duplication can be very effective. Using the
technique we present, we realize a significant improvement:
the scaling efficiency is improved from a nonoptimized 70%
to an optimized 95%, and the optimized version consumes
25% less energy than does the nonoptimized version. The
lookup performance of XSBench is also improved from 2.1
to 3.2 million lookups/s, a gain of roughly 1.5x, with minimal
code modification. We also note that both
Turbo Boost and hyper-threading are effective with the memory
optimization. We plan to evaluate the optimization technique
on a four-way or higher NUMA system. Our interests include
energy and operating system scalability as well as
scalability on larger NUMA systems.
Acknowledgments
The submitted manuscript has been created by UChicago
Argonne, LLC, Operator of Argonne National Laboratory
(“Argonne”). Argonne, a U.S. Department of Energy Office
of Science laboratory, is operated under Contract No. DE-
AC02-06CH11357. The U.S. Government retains for itself,
and others acting on its behalf, a paid-up nonexclusive,
irrevocable worldwide license in said article to reproduce,
prepare derivative works, distribute copies to the public, and
perform publicly and display publicly, by or on behalf of the
Government.
REFERENCES
[1] Bergman, K., Borkar, S., Campbell, D., Carlson, W., Dally, W., Denneau,
M., Franzon, P., Harrod, W., Hill, K., Hiller, J., Karp, S., Keckler, S.,
Klein, D., Lucas, R., Richards, M., Scarpelli, A., Scott, S., Snavely,
A., Sterling, T., Williams, R.S., Yelick, K., et al.: ExaScale computing
study: Technology challenges in achieving exascale systems. Technical
report, DARPA IPTO, AFRL (September 2008)
[2] Dongarra, J., et al.: The international exascale software project roadmap.
http://www.exascale.org/mediawiki/images/a/a1/Iesp-roadmap-draft-0.
93-complete.pdf
[3] Tramm, J.R., Siegel, A.R.: Memory bottlenecks and memory contention
in multi-core Monte Carlo transport codes. In: Joint International
Conference on Supercomputing in Nuclear Applications + Monte Carlo,
Paris (2013)
[4] OpenMP Architecture Review Board: OpenMP Application Program
Interface. 3.0 edn. OpenMP Architecture Review Board (April 2008)
[5] Rotem, E., Naveh, A., Ananthakrishnan, A., Rajwan, D., Weissmann,
E.: Power-management architecture of the Intel microarchitecture
code-named Sandy Bridge. IEEE Micro 32(2) (2012) 20–27
[6] Top500. http://www.top500.org
[7] Giampapa, M., Gooding, T., Inglett, T., Wisniewski, R.W.: Experi-
ences with a lightweight supercomputer kernel: Lessons learned from
Blue Gene’s CNK. In: Proceedings of the ACM/IEEE International
Conference for High Performance Computing, Networking, Storage and
Analysis. (2010)
[8] Siegel, A.R., Smith, K., Romano, P.K., Forget, B., Felker, K.G.: Multi-
core performance studies of a Monte Carlo neutron transport code.
International Journal of High Performance Computing Applications
28(1) (2014) 87–96
[9] Dosanjh, S., Barrett, R., Doerfler, D., Hammond, S., Hemmert, K.,
Heroux, M., Lin, P., Pedretti, K., Rodrigues, A., Trucano, T., Luitjens,
J.: Exascale design space exploration and co-design. Future Generation
Computer Systems 30(0) (2014) 46–58. Special Issue on Extreme Scale
Parallel Architectures and Systems, Cryptography in Cloud Computing
and Recent Advances in Parallel and Distributed Systems, ICPADS
2012 Selected Papers.
[10] Attig, N., Gibbon, P., Lippert, T.: Trends in supercomputing: The
European path to exascale. Computer Physics Communications 182(9)
(2011) 2041–2046
[11] Rajovic, N., Vilanova, L., Villavieja, C., Puzovic, N., Ramirez, A.: The
low power architecture approach towards exascale computing. Journal
of Computational Science 4(6) (2013) 439–443 Scalable Algorithms for
Large-Scale Systems Workshop (ScalA2011), Supercomputing 2011.
[12] Engelmann, C.: Scaling to a million cores and beyond: Using
light-weight simulation to understand the challenges ahead on the road to
exascale. Future Generation Computer Systems 30(0) (2014) 59–65.
Special Issue on Extreme Scale Parallel Architectures and Systems,
Cryptography in Cloud Computing and Recent Advances in Parallel and
Distributed Systems, ICPADS 2012 Selected Papers.
[13] Romano, P.K., Forget, B.: The OpenMC Monte Carlo particle transport
code. Annals of Nuclear Energy 51(C) (2013) 274–281
[14] X-5 Monte Carlo Team: MCNP - a general Monte Carlo N-particle
transport code, version 5, volume III: Developer’s guide. (LA-CP-03-0284)
(2008)
[15] Tramm, J.R., Siegel, A.R., Islam, T., Schulz, M.: XSBench - the
development and verification of a performance abstraction for Monte
Carlo reactor analysis. In: PHYSOR 2014 - The Role of Reactor Physics
toward a Sustainable Future, Kyoto
[16] Hoogenboom, J.E., Martin, W.R., Petrovic, B.: Monte Carlo perfor-
mance benchmark for detailed power density calculation in a full size
reactor core benchmark specifications. http://www.oecd-nea.org/dbprog/
documents/MonteCarlobenchmarkguideline 004.pdf (2010)
[17] Tramm, J.: XSBench: The Monte Carlo macroscopic cross section
lookup benchmark. https://github.com/ANL-CESAR/XSBench (2014)
[18] Leppänen, J.: Two practical methods for unionized energy grid
construction in continuous-energy Monte Carlo neutron transport calculation.
Annals of Nuclear Energy 36(7) (2009) 878–885
[19] Chen, J.B., Borg, A., Jouppi, N.P.: A simulation based study of TLB
performance. SIGARCH Comput. Archit. News 20(2) (April 1992) 114–
123
[20] Yoshii, K., Iskra, K., Naik, H., Beckman, P., Broekema, P.C.:
Performance and scalability evaluation of “Big Memory” on Blue Gene Linux.
International Journal of High Performance Computing Applications (2010)
[21] Gorman, M.: Huge pages part 1 (Introduction). http://lwn.net/Articles/
374424/
[22] Stenström, P., Joe, T., Gupta, A.: Comparative performance evaluation
of cache-coherent NUMA and COMA architectures. In: ISCA ’92:
Proceedings of the 19th Annual International Symposium on Computer
Architecture, New York, ACM (1992) 80–91
[23] Bolosky, W., Fitzgerald, R., Scott, M.: Simple but effective techniques
for NUMA memory management. SIGOPS Oper. Syst. Rev. 23(5)
(November 1989) 19–31
[24] LaRowe, P.R., Ellis, S.C.: NUMA multiprocessor page placement
policies. Technical report, Duke University, Durham, NC (1989)
[25] Bolosky, W.J., Scott, M.L., Fitzgerald, R.P., Fowler, R.J., Cox, A.L.:
NUMA policies and their relation to memory architecture. In: ASPLOS-
IV: Proceedings of the Fourth International Conference on Architectural
support for Programming Languages and Operating Systems, New York,
NY, ACM (1991) 212–221
[26] Majo, Z., Gross, T.R.: Memory system performance in a NUMA mul-
ticore multiprocessor. In: Proceedings of the 4th Annual International
Conference on Systems and Storage. SYSTOR ’11, New York, NY,
USA, ACM (2011) 12:1–12:10
[27] Li, Y., Pandis, I., Müller, R., Raman, V., Lohman, G.M.: NUMA-aware
algorithms: the case of data shuffling. In: CIDR, www.cidrdb.org (2013)
[28] Broquedis, F., Furmento, N., Goglin, B., Wacrenier, P.A., Namyst,
R.: ForestGOMP: An efficient OpenMP environment for NUMA
architectures. International Journal of Parallel Programming 38(5-6)
(2010) 418–439
[29] Olivier, S.L., Porterfield, A.K., Wheeler, K.B., Spiegel, M., Prins, J.F.:
OpenMP task scheduling strategies for multicore NUMA systems. Int.
J. High Perform. Comput. Appl. 26(2) (May 2012) 110–124
