Calculation of Stochastic Heating and Emissivity of Cosmic Dust Grains
  with Optimization for the Intel Many Integrated Core Architecture by Porter, Troy A. & Vladimirov, Andrey E.
Calculation of Stochastic Heating and Emissivity of Cosmic Dust Grains
with Optimization for the Intel Many Integrated Core Architecture
Troy A. Porter1 , Andrey E. Vladimirov1,2
1 Hansen Experimental Physics Laboratory, Stanford University, 452 Lomita Mall, Stanford, CA 94305-4085, USA
2 Colfax International, 750 Palomar Ave, Sunnyvale, CA 94085, USA
Abstract
Cosmic dust particles effectively attenuate starlight. Their absorption of starlight produces emission spectra from the near- to
far-infrared, which depends on the sizes and properties of the dust grains, and spectrum of the heating radiation field. The near-
to mid-infrared is dominated by the emissions by very small grains. Modeling the absorption of starlight by these particles is,
however, computationally expensive and a significant bottleneck for self-consistent radiation transport codes treating the heating
of dust by stars. In this paper, we summarize the formalism for computing the stochastic emissivity of cosmic dust, which was
developed in earlier works, and present a new library HEATCODE implementing this formalism for the calculation for arbitrary
grain properties and heating radiation fields. Our library is highly optimized for general-purpose processors with multiple cores and
vector instructions, with hierarchical memory cache structure. The HEATCODE library also efficiently runs on co-processor cards
implementing the Intel Many Integrated Core (Intel MIC) architecture. We discuss in detail the optimization steps that we took in
order to optimize for the Intel MIC architecture, which also significantly benefited the performance of the code on general-purpose
processors, and provide code samples and performance benchmarks for each step. The HEATCODE library performance on a single
Intel Xeon Phi coprocessor (Intel MIC architecture) is ∼ 2 times a general-purpose two-socket multicore processor system with
approximately the same nominal power consumption. The library supports heterogeneous calculations employing host processors
simultaneously with multiple coprocessors, and can be easily incorporated into existing radiation transport codes.
Keywords:
interstellar radiation field, cosmic dust, stochastic emission, radiative transport, optimization, many integrated core architecture
1. Introduction
Astrophysical objects can be difficult to study in the ultra-
violet (UV) and optical wavebands because of the effective at-
tenuation of the visible light by cosmic dust particles. To un-
derstand the broadband electromagnetic emissions from them
therefore requires radiation transfer (RT) calculations that self-
consistently account for the dust absorption and scattering of
the UV/optical light, and subsequent re-emission of the ab-
sorbed radiation at near infrared (NIR) to far infrared (FIR)
wavelengths. These calculations can be time consuming, partic-
ularly for determining the dust re-emission spectrum because it
depends on the sizes of the emitting grains and their associated
properties.
Dust grains with sizes & 1 micron typically attain a thermal
equilibrium with the heating radiation field because the time-
averaged vibrational energy of a grain is greater than the energy
of the heating photons. The emissivity for such dust particles
can be straightforwardly obtained by balancing the absorption
and re-emission rates. However, for smaller grains attaining
thermal equilibrium is often not the case because the absorp-
tion of a photon results in stochastic heating temperature spikes.
Email addresses: tporter@stanford.edu (Troy A. Porter1),
avladim@galprop.stanford.edu (Andrey E. Vladimirov1,2)
Most of the energy of the grain is re-emitted in these events
and no thermal equlibrium is attained. Consequently, stochas-
tic grain heating cannot be modeled using methods that assume
thermal equilibrium.
Methods for treating the stochastic heating of very small dust
grains have been developed [e.g., 1, 2, 3, 4, 5]. However, in RT
simulations of systems with a fine spatial voxelization, the cal-
culation of the stochastic grain emissivity for each cell can be
prohibitively computationally expensive. A typical approxima-
tion used to speed up these calculations involves pre-calculating
a table of emission spectra corresponding to different spectra of
a parameterized synthetic heating radiation field [e.g., 6, 7, 8],
and interpolating within the table on the basis of the incident
total energy density of the heating radiation field in each sim-
ulation cell to obtain the stochastic heating emissivity. While
this approximation has in some cases been sufficient, it does
not allow a fully self-consistent treatment for the small grain
heating.
The use of graphics cards to provide improvements in speed
has been explored for Monte Carlo RT simulations by different
authors [e.g., 6, 9]. For the former, this usage was restricted
to increasing the speed of the calculations for grains in ther-
mal equilbrium with the heating radiation field using Nvidia’s
CUDA architecture [10]. For the latter, the calculations for
Preprint submitted to Computer Physics Communications September 5, 2018
ar
X
iv
:1
31
1.
46
27
v1
  [
as
tro
-p
h.I
M
]  
19
 N
ov
 20
13
the stochastic and thermal equilibrium heating of grains were
made employing CUDA. However, a significant effort is often
required to recode algorithms in the C-like CUDA language to
exploit the hardware. An attractive alternative is the recently
released Intel Many Integrated Core (MIC) architecture [11]. A
significant advantage of the MIC architecture allows the same
programming frameworks and optimization methods as those
employed for multi-core general purpose processors1. There-
fore, the same code in a high-level programming language (C,
C++ or Fortran) can be used to perform calculations on com-
mon general-purpose processors and coprocessor cards that im-
plement the MIC architecture.
In this paper we describe the implementation and optimiza-
tion of a common method [4] for treating the stochastic heating
of small dust grains that is based on a matrix formalism. We
provide details for a general-purpose processor target as well as
the MIC architecture. Our goal was to make the stochastic grain
heating emissivity calculation for an arbitrary radiation field ef-
ficient enough to make it feasible for fully self-consistent cal-
culations in simulations with high spatial voxelizations, while
using essentially the same code for the general-purpose pro-
cessors and the additional co-processor accelerators. We show
that significant effort must be invested into the optimization of
the calculation to utilize the available compute power of the
MIC architecture. However, we find that most optimizations
for the MIC architecture also improve the performance of the
algorithm on the host system using general-purpose proces-
sors. In contrast, with GPGPU2 -like architectures the acceler-
ator code is generally distinct from the processor code, and the
architecture-level optimization of the application must be done
independently for the CPU code and for the GPGPU code.
The result of the work presented in this paper is a new astro-
physical tool, the HEATCODE3 library, which efficiently com-
putes the stochastic grain emissivity for arbitrary heating ra-
diation fields both for general-purpose CPUs and co-processor
cards implementing the Intel MIC architecture. In Section 2 we
summarize the theoretical model for stochastic emissivity cal-
culation. Section 3 summarizes the performance improvement
measures that were taken in order to optimize HEATCODE for
the multi-core and many-core architectures and demonstrates
the performance on a benchmark server equipped with copro-
cessors featuring the Intel MIC architecture. We summarize
our findings in Section 4. Appendix A describes the user in-
terface of the HEATCODE library and outlines the structure of
the code. HEATCODE can be obtained freely from the CPC
Program Library. In Appendix B the optimization steps are
described in detail.
1The term “general purpose processor” is used throughout this paper as a
synonym for the terms CPU (Central Processing Unit), “host”, and simply “pro-
cessor”. The term “coprocessor” is used interchangeably with “Intel Xeon Phi
coprocessor” to denote an accelerator card implementing the Intel MIC archi-
tecture.
2General-Purpose Graphics Processing Units
3HEATCODE is an acronym for HEterogeneous Architecture library for
sTochastic COsmic Dust Emissions
2. Stochastic Heating of Very Small Dust Grains
Cosmic ‘dust’ is a mixture of particles of different sizes and
compositions; see the book by Draine [12] for an overview. For
a given grain species, e.g., silicates, graphitic grains, there is
a distribution of sizes for the particles with the optical proper-
ties depending on the grain species. The size distribution is, in
general, unknown but parameterizations for various grain mix-
tures have been developed to explain features in the extinction
curves for the Milky Way and other galaxies, e.g., [13, 14, 15].
HEATCODE can take the size distribution and optical prop-
erties as user-supplied inputs. However, for convenience we
have included the size distributions from the authors referenced
above, along with the grain absorption cross sections for sil-
icate, graphite, and polycyclic aromatic hydrocarbons (PAHs)
from B. Draine’s website4; our calculations below use the PAH-
s/graphite/silicate mixture and optical properties from Li and
Draine [16] and the size distributions for Milky Way dust from
Weingartner and Draine [14].
For a given dust species, the heating radiation field is an input
parameter supplied to the HEATCODE library, and the spec-
trum of the re-emitted photons is the output of the calculation.
This is obtained by first calculating the absorption of stellar
light and the spectrum of re-emitted light produced when tran-
sient excited states of the grains relax for each grain size. De-
termining the re-emitted spectrum for each grain size involves
two steps. The temperature (or vibrational energy E) distribu-
tion of grains in response to the incident radiation spectrum is
first computed. We assume static equilibrium for this calcula-
tion, i.e., the stochastic grain heating is assumed to be exactly
balanced by the grain relaxation with re-emission of photons.
This assumption is valid as long as the time scale of incident
light variability is long compared to the relaxation times of the
transient excited grain states. Then, given the temperature dis-
tribution, the spectrum of re-emitted light for each grain size
is calculated. The output stochastic emission spectrum for the
ensemble of grains is obtained by convolution of the individual
grain size emission spectra with the grain size distribution.
2.1. Calculation of the Temperature Distribution
Guhathakurta and Draine [2] proposed a matrix formalism
for the treatment of the interaction between dust grains and
photons that was further elaborated by Draine and Li [4]. In
their methodology, the temperature distribution of radiatively
heated dust grains is described by P, the “state vector”. The
elements of this vector Pi are the probabilities for the grains of
the given size to be in the energy bin i. Energy bins span the
range [Emin, Emax], and we assume M + 1 narrow energy bins
(∆E  E), so i = 0, 1, . . .M.
The temperature distribution P is obtained by solving the
system of linear algebraic equations that expresses balance be-
tween the population and depopulation rates of each energy
level,∑
j,i
Ti jP j −
∑
j,i
T jiPi = 0 (i = 0, 1, . . . ,M) , (1)
4http://www.astro.princeton.edu/˜draine/dust/dust.diel.html
2
with the normalization condition
M∑
i=0
Pi = 1. (2)
Here the matrix elements Ti j are the probabilities per unit time
for a grain in energy bin j to transition to one of the states in
bin i.
Draine and Li [4] describe the exact statistical treatment for
the cooling transitions (i < j) as well as less computationally
demanding thermal approximations for the corresponding cool-
ing terms. The thermal approach introduces the notion of grain
temperature to approximate calculation of the downward tran-
sition rates. The “thermal discrete” approximation allows dis-
crete downward transitions to the same energy levels as the ex-
act statistical treatment. The “thermal continuous” approxima-
tion assumes that the only allowed spontaneous transition from
an excited vibrational level i is to the adjacent level i − 1:
Ti j = 0, if i < j − 1. (3)
This approximation has been shown by Draine and Li [4] to
be accurate for grains comprised of as few as 30 atoms and al-
lows for a fast solution method compared to inverting the whole
system of equations for the other methods. The HEATCODE li-
brary uses the “thermal continuous” approximation because it
allows the computational cost of solving for the temperature
distribution P to be considerably reduced while achieving rea-
sonably accurate results.
The solution of the system of equations (1), (2) may seem
straightforward. However, the diagonal elements of the ma-
trix corresponding to the system contain a difference of positive
numbers with nearly equal values. The solution can be unstable
because of finite numerical precision. Guhathakurta and Draine
[2] suggested that to minimize this effect on the solution, the
system of equations (1) can be transformed by adding to ev-
ery equation all preceding equations. This opreation gives the
following system:
f−1∑
j=0
B f jP j − T( f−1) f P f = 0 ( f = 1, . . . ,M), (4)
where
B f j =
M∑
k= f
Tk j ( f > j). (5)
Now the temperature distribution P can be determined by in-
troducing a temporary vector X ≡ P/P0 so that X0 = 1. Other
values of X can be computed recursively:
X f =
1
T( f−1) f
f−1∑
j=0
B f jX j ( f = 1, 2, . . . ,M). (6)
Then the values of P are computed by re-normalization in order
to satisfy Equation (2):
Pi =
Xi
X0 + X1 + . . . + XM
(i = 0, 1, . . . ,M). (7)
2.2. Heating and Cooling Rates
Heating transition rates are proportional to the incident light
energy density W(λ) at the wavelength corresponding to the
transition energy ∆Eul, and to the photon absorption cross sec-
tion at this wavelength σ(λ), which is a quantum mechanical
property of the grains. The heating rate is calculated as
Tul = W(λ)σ(λ)
λ3∆Eul
hc2
for u > l. (8)
The numerical wavelength grid on which W(λ) and σ(λ) are
defined can map to an energy grid different from the grid used
to represent the grain excitation levels. Therefore, in the nu-
merical solution, the quantity W(λ)σ(λ) ≡ Ω(λ) in Equation (8)
must be computed using interpolation:
log
[
Ω(λ)
Ω(λ j−1)
]
=
log
(
λ/λ j−1
)
log
(
λ j/λ j−1
) log [ Ω(λ j)
Ω(λ j−1)
]
. (9)
Here λ j−1 and λ j are, respectively, the lower and upper bound-
aries of the wavelength bins at which the energy density W and
cross section σ are defined.
The rate of spontaneous relaxation to the adjacent level with
photon emission is calculated as
T f−1, f ≈ 1E f − E f−1
8pi
h3c3
E f∫
0
E3σ(E)dE
exp(E/kT f ) − 1 , f > 0. (10)
This rate does not depend on the incident radiation field, and
therefore the convolution of the absorption cross section with
the energy distribution function in Equation (10) can be pre-
computed.
2.3. Calculation of the Grain Emission Spectrum
The emission spectrum for grains of size a at wavelength ν =
c/λ can be calculated using the following expression based on
Equation (56) in [4]:
νFa(ν) = σ(c/ν)
M∑
i=0
Pi(a)Λ(ν, Ei), (11)
where
Λ(ν, Ei) =

0, if Ei < hν,
2hν4
c2
Pi
exp(hν/kTi) − 1 , otherwise.
(12)
The contribution of stimulated emission is neglected in this ex-
pression.
The combined emissivity from grains of all sizes is obtained
by integration over the grain size distribution Q(a):
νF(ν) =
amax∫
amin
νFa(ν)Q(a)da, (13)
where amin and amax constrain the range of grain sizes consid-
ered in the model. The maximum size amax is chosen so that
for a > amax, the radiative grain heating process is no longer
stochastic, and can be treated in the thermal equilibrium ap-
proximation.
3
10-3
10-2
10-1
100
10-1 100 101 102 103
 
Blackbody at 4000 K(re-normalized)
Incident radiation field
PAH grains
Graphitic grains
Silicate grains
10-3
10-2
10-1
100
En
er
gy
 d
en
sit
y 
  λ
⋅
W
(λ
)   
[eV
 cm
-
3 ]
Blackbody at 10000 K(re-normalized)
10-3
10-2
10-1
100
10-1 100 101 102 103
 
Wavelength λ, µm
Composite spectrum Cosmic microwave(Mathis et al., 1983) background
Figure 1: Stochastic grain heating emissivities calculated by HEATCODE for
three incident light spectra. See text for details.
2.3.1. Example
An example of the output of HEATCODE is given in Fig-
ure 1, where the stochastic grain emissivity is computed for
three heating radiation fields. The heating radiation fields are
plotted with solid red lines in each panel of the figure, and cor-
respond to: (top) a blackbody spectrum at 4,000 K, (middle) a
blackbody spectrum at 10,000 K, and (bottom) the interstellar
radiation field (ISRF) from Mathis et al. [17]. The two black-
bodies are normalized to the same energy density as the spec-
trum from Mathis et al. [17], 0.46 eV cm−3. (For comparison
with the heating radiation field energy densities, the Cosmic
Microwave Background 2.735 K blackbody spectrum is shown
in the bottom panel.) The calculated emissivity is converted to
an equivalent energy density using the full solid angle and a gas
column density NH = 1022 cm−2. This conversion allows us
to depict the incident radiation field and the dust-emitted light
on the same plot for illustrative purposes. Multiplication by the
above mentioned gas column density is an optically thin ap-
proximation of the radiative transport throughout the Galaxy,
assuming that
a) the local radiation field is characteristic of the radiation field
across the Galaxy, and
b) infrared light is not attenuated by absorption.
Dashed and dash-dotted curves correspond to the emissions
from the three grain models included in the library: graphitic,
polycyclic aromatic hydrocarbons (PAH), and silicate grains.
3. Performance Results
The HEATCODE library can be used in self-consistent sim-
ulations that compute the stochastic heating and emissivity of
small dust grains in every simulation cell using the in-cell heat-
ing radiation field. Compared to an implementation of the al-
gorithm relying solely on standard compiler optimizations and
parallelization (e.g., the GNU C++ compiler using OpenMP –
see Section 3.2) we obtain almost a factor of 100 times speed-
up. Achieving this performance is made possible by
a) Parallelization of calculations by distributing a set of inci-
dent radiation spectra across multiple processor cores;
b) Tuning various aspects of code performance for the general-
purpose CPU architecture and for the Intel MIC architec-
ture;
c) Relying on the Intel C++ compiler to perform automatic
vectorization and other architecture-specific optimizations;
d) Performing the offload of the whole calculation or a part of
it to Intel Xeon Phi coprocessors, if the computing system
is enabled with them. Offload leads to additional speedup if
the set of incident radiation fields is large enough.
This section summarizes the optimizations applied to produce
HEATCODE and reports the benchmarks of performance. For
more detail on the optimization of stochastic heating and emis-
sivity calculation, refer to Appendix B.
3.1. Performance Tuning
HEATCODE is optimized for general-purpose processors,
and is able to run on Intel Xeon Phi coprocessors if they
are available. The performance tuning methods used in
HEATCODE apply to CPUs and Intel Xeon Phi coprocessors
alike. But, in this section we emphasize the optimization pro-
cedure for the coprocessors because of the novel nature of the
architecture.
Intel Xeon Phi coprocessors that we used for benchmarks in
the present paper contain 60 independent cores (technical spec-
ifications stated hereafter are obtained from [18]). These cores
have a low clock frequency (∼1 GHz) and four-way hyper-
threading. Each core relies on a dedicated vector arithmetics
unit to maintain high performance with low power consump-
tion. The vector units support 512-bit vector registers for SIMD
(Single Instruction Multiple Data) operations5. The SIMD in-
struction set supported by Intel Xeon Phi coprocessors is called
IMCI (Initial Many-Core Instructions). IMCI includes floating-
point multiplication, addition, fused multiply-add and division,
some transcendental instructions, such as exponentials, loga-
rithms and trigonometric functions, and a range of bitwise log-
ical operators, shuffle, gather/scatter memory accesses and bit-
masked versions for most operations.
Onboard memory in Intel Xeon Phi coprocessors used for
benchmarks in the present work comes in the form of 8 GB of
5For comparison, the SSE registers supported by most general-purpose
CPUs prior to the year 2011 are 128-bit long, and the AVX (Advanced Vector
Extensions) registers supported by the Intel Sandy Bridge and AMD Bulldozer
architectures are 256-bit long.
4
GDDR5 (Graphics Double Data Rate) with bandwidth capabil-
ities typically exceeding that of the current generation CPU-
based hosts. The onboard memory is cached. The Level-2
caches of each core are linked with a ring interconnect. This
effectively merges local Level-2 caches of the cores into an ag-
gregate coherent Level-2 cache of the coprocessor with non-
uniform access and a typical access time of 11 cycles. In addi-
tion, each core has a Level-1 cache with a typical access time of
1 cycle. Coherency is maintained in all caches at the hardware
level. This means that after one core modifies data in a cache
line, all other cores are guaranteed to see this modification when
they access this cache line at a later time.
There are five major optimization methods used in
HEATCODE appropriate for the MIC architecture, as described
below.
3.1.1. Thread parallelism improvement
With 60 × 4 = 240 logical cores in the Intel Xeon Phi co-
processor, the performance-critical part of the application must
have sufficiently large iteration space for work distribution, per-
form little synchronization between threads, and incur a small
enough per-thread memory footprint. In HEATCODE, the first
two issue are addressed by implementing thread parallelism in
the following way:
a) the stochastic heating and emissivity for a single incident
radiation field is calculated serially (i.e., by only one thread)
b) the library interface allows to call the emisivity calculation
on a set of radiation fields corresponding to distinct voxels
in the encompassing RT simulation
c) for processing a large number of radiation fields, multiple
OpenMP threads are spawned, and each thread processes its
share of radiation fields
d) dynamic loop scheduling in OpenMP is used to maintain
load balance across threads
As long as the number of input radiation fields supplied by the
user is significantly greater than the number of threads (240),
all coprocessor cores are utilized. This model of parallelism
requires no synchronization between threads, because each ra-
diation field is processed independently. The third issue, the
memory footprint of HEATCODE, is addressed by optimiz-
ing the re-use of scratch space in each thread, so that, even
with 240 threads, the application fits in memory. This allows
HEATCODE to employ all of the logical cores, thus maximis-
ing the computational capacity of the co-processor.
3.1.2. Vectorization refinements
Within each thread, the calculation can be further parallelized
through vectorization, i.e., by utilizing SIMD instructions of the
computing platform. Vector instructions apply the same arith-
metical or logical operation to multiple numbers, thus multiply-
ing the arithmetic performance by a factor equal to the width
of the vector register. The 512-bit vector registers in the MIC
architecture can fit 16 single precision floating-point numbers
or 8 double precision numbers, which corresponds to potential
speedups of 16x and 8x, respectively, compared to scalar (non-
SIMD) code.
We chose the most portable way to utilize vectorization in
HEATCODE: automatic vectorization by the compiler. Auto-
matic vectorization in the Intel C++ compiler is enabled by de-
fault at high optimization levels. With this functionality, the
compiler analyzes regular C++ loops, partitions them in blocks
of 16 (or 8) iterations, and produces an assembly code that ex-
ecutes these blocks with SIMD instructions. This occurs au-
tomatically and usually does not require the programmer’s in-
volvement. However, we assisted the compiler by designing the
C++ code of HEATCODE with the following properties
a) Data in automatically vectorized loops are accessed with
unit stride (i.e., contiguously)
b) The constructor of the HEATCODE data containers allo-
cates memory structures so that the address of the first el-
ement of each array is a multiple of 64 (i.e., the data are
64-byte aligned), which enables faster loading of data into
SIMD registers.
c) The inner dimensions of multidimensional arrays are a mul-
tiple of 64 bytes, so that the first element of every row is
64-byte aligned.
d) Most vector loops in HEATCODE are constructed so that
number of iterations is a multiple of 16 (the number of iter-
ations was padded to a multiple of 16 when possible)
e) Compiler hints in the form of vectorization pragmas were
added to some vector loops in order to inform the compiler
of data alignment, and to enable automatic vectorization in
situations where the compiler could not detect whether vec-
torization is safe.
With automatic vectorization, the C++ files of HEATCODE
can be compiled into exectutable code that runs with SIMD in-
structions on regular CPUs, as well as on Intel Xeon Phi copro-
cessors.
An alternative way to employ SIMD instructions would be
through special functions called compiler intrinsics. Intrin-
sics provide direct access to SIMD functionality, much like in-
line assembly. However, sets of SIMD instructions available
in CPUs and in the Intel MIC architecture are different. This
means that with intrinsics, the performance-critical portion of
HEATCODE would have to be written and optimized for every
architecture independently.
3.1.3. Scalar optimizations
The thread-parallel, vectorized algorithm of the stochastic
emissivity calculation in HEATCODE can be compiled to per-
form in serial (i.e., single-threaded), scalar (i.e., non-vector)
mode. The more optimized that underlying scalar algorithm
is, the better performance results are obtained when vectoriza-
tion and thread parallelism are enabled. Scalar optimizations
employed in HEATCODE are common for the many-core and
multi-core architecture:
a) All floating-point calculations are performed in single pre-
cision; floating-point over/underflows are avoided by re-
normalizing the data where necessary. The algorithm with
re-normalization is able to avoid floating-point exceptions
5
as long as the input heating radiation fields do not have inte-
grated energy densities orders of magnitude larger than the
typical ISRF value of ≈ 1 eV cm−3 Otherwise, the ratio be-
tween x f and x f−1 in Equation (6) may be too large to handle
in single precision
b) Optimized implementations for transcendental functions
and strength reduction (substitution of multiple inexpensive
operations for a single expensive operation) in arithmetic ex-
pressions are used. Appendix B.4 describes this in more
detail.
c) In the constructor of the main library object, class
TransientHeatingModelXeonPhi, precomputation of
some quantities (e.g., the radiation field interpolation pa-
rameters) is performed. The precomputed quantites are used
during the evaluation of the stochastic heating and emissiv-
ity of dust (see Appendix B.4).
3.1.4. Memory and cache traffic streamlining
Main memory accesses in most computing architectures are
more time-consuming than arithmetic operations. This dispar-
ity is usually alleviated by storing a limited amount of data in
a smaller, but faster memory known as cache. The minimum
amount of data that can be written to or read from a cache is
called a cache line. Cache lines in Intel Xeon processors and
Xeon Phi coprocessors are 64 contiguous bytes long. In order
to make the most out of caches, HEATCODE is optimized in
the following ways:
a) Data access in loops is made contiguous where possible. In
addition, in order to avoid scattered memory access in the ra-
diation field interpolation routine, the interpolation param-
eters for each wavelength bin are precomputed and stored
in a dense array (see Appendix B.3). These optimizations
rely on the property that when one element of a cache line
is accessed, the whole cache line with adjacent elements
is served by the cache. Additionally, because of the pre-
dictability of streamlined access pattern, data can be auto-
matically prefetched by the cache before it is even requested
by the processor. Contiguous memory access and the pack-
ing of data into dense arrays improve what is called data
access locality in space.
b) Nested loops are tiled (see Appendix B.5), and consecutive
loops of the same form are fused into one loop (see Ap-
pendix B.2). The effect of these optimizations is that when
a portion of data is used multiple times in an algorithm, it is
re-used as soon as possible after the previous usage, while
it is still in cache. These methods, nested loop tiling and
loop fusion, improve what is known as data access locality
in time.
3.1.5. Communication with the coprocessor
HEATCODE utilizes Intel Xeon Phi coprocessors using the
offload programming model. In this approach, the application
runs on the host system (i.e., on the CPU), where it initializes
the physics data (cross sections, precomputed quantities related
to the radiation field interpolation, Planck emissivity spectra,
etc.) and accepts the input radiation fields from the user ap-
plication. With the data prepared on the host, HEATCODE
communicates with the coprocessor driver and initiates offload
(i.e., transfer) via the PCIe (Periperal Component Interconnect
Express) bus of the necessary data and executable code to the
coprocessor, where the emissivity calculation functions are ex-
ecuted. After that, the calculated emissivities are returned to
the host system via the PCIe bus.
The design of HEATCODE allows offloaded work to be dis-
tributed across multiple coprocessors, or across coprocessors
and host CPUs. This is done by splitting up the input array of
incident radiation fields into chunks contatining multiple radi-
ation fields, and offloading the next unprocessed chunk to the
next available compute device. In this process, the emissiv-
ity calculation function is offloaded to the coprocessor multiple
times with different input data. Because the physics data used
by the calculation is the same in every offload, HEATCODE
optimizes the offload process by storing the physics data on
the coprocessor between offloads. In addition, to eliminate the
overhead of memory allocation, a persistent buffer is allocated
on the coprocessor to hold the (input) radiation field array and
the (output) stochatic heating emissivity array.
The constructor of the main library object, class
TransientHeatingModelXeonPhi, offloads all data re-
lated to the grain model or models to the coprocessor. The
desctructor of TransientHeatingModelXeonPhi removes
all persistent data and buffers from the coprocessor.
3.2. Benchmarks
For all benchmarks, we set up the runtime environment as
described in Appendix B.9. The size of the problem is chosen
as n = 104 radiation fields for benchmarks using a single com-
pute device, and n = 105 for heterogeneous calculations. All
radiation fields are set to the same spectrum for the benchmark,
corresponding to the bottom panel of Figure 1. The calcula-
tion time of all stochastic heating emissivities was measured 4
times, and the average of the last 3 is reported. Multiple runs
were used to control that the statistical accuracy of performance
does not exceed the last reported significant figure. Error bars
are therefore omitted in all plots.
For the performance benchmarks, we used a server based
on two Intel Xeon E5-2680 processors with 64 GB of DDR3
memory at 1,333 MHz. This server hosts two Intel Xeon Phi
coprocessors B1QS-5110P, each with 60 cores and 8 GB of on-
board GDDR5 memory. The server runs a Linux operating sys-
tem CentOS 6.4 on the host and the MPSS (Intel MIC Platform
Software Stack) version 2.1.4982-15 on coprocessors. Our tests
were performed with one of the two compilers: Intel C++ com-
piler version 13.1.1 and GNU C++ compiler version 4.4.7. The
former was used to benchmark the host, coprocessors, and the
whole system in a heterogeneous calculation. The latter could
be used only to benchmark the host processors, because the cur-
rent GNU C++ compiler does not produce executable code for
the Intel MIC architecture.
The performance of the optimized stochastic emissivity
calculation by class TransientHeatingModelXeonPhi in
HEATCODE is shown in Figure 2. The performance values in
6
the figure are normalized to the performance of the optimized
code compiled with the Intel C++ compiler and running on the
host CPU. Numerically, this normalization factor is 0.62 ms
per evaluation of the stochastic emissivity function in a batch
of n = 104 calculations. This performance is a factor of 95x
greater than the performance of the unoptimized code (here-
after referred to as the baseline performance). The performance
of optimized HEATCODE compiled with GCC is more than 3
times lower than with the Intel C++ compiler on the same plat-
form.
The difference between the performance of the calculation
on a single Intel Xeon Phi coprocessor and on host system is
a factor of 1.9x. Note that this factor reflects the comparison
of one Intel Xeon Phi coprocessor to a two-socket Intel Xeon
processor. The reason for comparing against two processor
sockets is that two processors consume approximately the same
power under load as one coprocessor. The speedup of 1.9x is
slightly lower than results of synthetic benchmarks SGEMM,
SMP LINPACK and STREAM triad, which provide speedups
of 2.9x, 2.6x and 2.2x, respectively [19]. However, we find the
1.9x performance gain acceptable, considering two factors:
a) The workload of HEATCODE has a latency-bound calcu-
lation phase with scattered memory accesses. This type of
workloads is better processed by many-core procesors than
by the MIC architecture coprocessors.
b) The HEATCODE is expressed completely in a high-level
programming language with tuning via compiler pragmas.
This approach ensures cross-platform portability and for-
ward compatibility. In contrast, some of the synthetic
benchmarks mentioned above are tuned using low-level
code with assembly instructions and/or intrinsic functions,
which is not a portable solution.
The last bar in Figure 2 shows the performance of a heteroge-
neous calculation, which uses the host system simultaneously
with two Intel Xeon Phi coprocessors. The performance gain
of the heterogeneous calculation is 4.4x compared to the CPU-
only calculation. This is approximately 10% lower than the sum
of the raw performances of two coprocessors and host proces-
sors 1.9 + 1.9 + 1.0 = 4.8x. There is a performance loss caused
by the unavoidable load imbalance across compute devices and
the scheduling overhead of the heterogeneous calculation. In
our benchmarking we used a problem size of n=105 spectra;
for a greater problem size, the performance loss due to schedul-
ing would be lower.
4. Summary and Discussion
We have presented HEATCODE, a new library for the cal-
culation of the stochastic heating and emissivity of cosmic dust
grains for arbitrary incident radiation fields. HEATCODE can
be run on general-purpose processors. In addition, it is opti-
mized to also efficiently run on Intel Xeon Phi coprocessors
and in heterogeneous systems with general-purpose processors
and MIC architecture coprocessors.
Overall, the optimized HEATCODE calculation running
on our benchmarking system and utilizing the host CPUs
0.0x
1.0x
2.0x
3.0x
4.0x
5.0x
Host CPU
w/GCC
Host CPU Coprocessor Host + 
Coprocessor
       Host + 
       2 coprocessors
R
el
at
iv
e 
Pe
rfo
rm
an
ce
Normalized to 0.62 ms per spectrum
(optimized code, Intel C++, on host CPUs)
 0.3x
 1.0x
 1.9x
 2.8x
 4.4x
Figure 2: Relative performance of the optimized version of HEATCODE on the
host, coprocessor, and in the heterogeneous mode.
together with two Intel Xeon Phi coprocessors processes
105 individual stochastic emissivity spectra in 14 seconds,
on average 0.14 ms per evaluation. This is signifi-
cantly faster than the unoptimized version (implemented in
class TransientHeatingModelUnoptimised) compiled with
GCC, which takes 6 · 103 seconds for the same calcula-
tion (60 ms per evaluation). An intended application of
HEATCODE is in conjunction with a RT code that is used for
calculating multiple instances of the observed local spectrum
of the interstellar radiation field and comparing each with ob-
servations, where each instance is evaluated for a set of Galac-
tic structural parameters encoding, e.g., the locations of spi-
ral arms, or the geometry of the Galactic bulge, over a pa-
rameter space. The stochastic dust emissivity for every spa-
tial grid cell must be calculated for each realization of the pa-
rameters. We estimate that 105 RT calculations are required
to explore a representative space of 4 parameters, with ∼ 105
voxels for each realization of the Galaxy. The single comput-
ing server with two Intel Xeon Phi coprocessors that we used
for benchmarks spends 14 × 105 s ≈ 16 days to compute the
stochastic heating grain emissivities in such a calculation. That
is a reasonable duration for a research project, as opposed to
6 · 103 × 105 s ≈ 19 years with the unoptimized code unable to
use the coprocessor cards.
This performance improvement was made possible by using
optimization practices that improve the performance of HPC
applications on general-purpose processors and on the Intel
MIC architecture. An extended Appendix to this paper is ded-
icated to the discussion of these optimization practices. As a
supplement to this discussion, we provide in Appendix B nine
code revisions corresponding to each optimization step. This
supplementary material in combination with the discussion in
the text can be used to further improve the performance of the
HEATCODE library, and to transfer the methods practiced here
to other applications with similar workloads.
7
The results discussed in Appendix B show that for almost all
optimization steps, a performance increase is observed both in
the host and in the coprocessor version of the library. With the
exception of minor fragments, the optimized C++ code is the
same for all platforms and compilers. Therefore, the common
trend in all three curves in Figure B.4 corroborates the claim
that the optimization for the MIC architecture relies on the same
principles as optimization for general-purpose processors [20].
At the same time, the unoptimized code on the coprocessor per-
forms a factor of 3x worse than the host system. This obser-
vation supports the recommendation that applications must be
optimized to fully exploit the general-purpose CPU capabilities
before they can be efficiently run on the MIC architecture [20].
Acknowledgements
The work on the HEATCODE library is supported by NASA
grants NNX12AO69G and NNX12AO73G.
Appendix A. The HEATCODE Library Interface and
Structure
The HEATCODE library is written in the C++ language and
designed to function in the Linux environment. It can be com-
piled using the supplied Makefile, which supports the usage
of the GNU Compiler Collection (GCC) and of the Intel C++
compiler. The Intel compiler is necessary in order to utilize
the MIC architecture support, which offloads parts of the cal-
culation to Intel Xeon Phi coprocessors. In addition, the Intel
compiler produces a faster performing executable code than the
GNU compiler, due to better support for automatic vectoriza-
tion and single precision transcendental arithmetics.
The linking of HEATCODE to a user application is done in
the usual way by passing the corresponding object files to the
linker. The compilation of an example application utilizing the
HEATCODE library is demonstrated in the Makefile.
Appendix A.1. Interface
The interface to the library is demonstrated in the files
GrainModel.h and TransientHeatingModel.h:
a) Class GrainModel is an abstract class which declares the
data containers for the properties of dust interaction with
light and size distribution of dust grains. It must be
overridden by concrete derived classes in order to imple-
ment specific grain models. HEATCODE contains de-
rived classes GraphiteGrainModel, PAHGrainModel and
SilicateGrainModel that implement the corresponding
models of grains. The constructor of GrainModel must be
called with an input argument that defines the light wave-
length grid used throughout the library.
b) Abstract class TransientHeatingModel instru-
ments the interface and the computing engine (see
Section 2) that uses a GrainModel object to com-
pute the stochastic emissivity of the corresponding
type of grains heated by incident light spectrum.
HEATCODE contains two concrete implementations
of this class: TransientHeatingModelUnoptimised,
which is an unoptimized version of the calculation, and
TransientHeatingModelXeonPhi, which is optimized
for heterogeneous calculations using general-purpose multi-
core processors and Intel Xeon Phi coprocessors. The latter
implementation is recommended for best performance even
on systems not equipped with Intel Xeon Phi coprocessors.
The constructor of TransientHeatingModel must be
called with two arguments that define the GrainModel
class and the temperature grid used in the calculations. The
implementation TransientHeatingModelXeonPhi takes
additional arguments that define the set of compute devices,
which may be used for calculations.
c) The method CalculateTransientEmissivity of class
TransientHeatingModel is the main interface function
of the HEATCODE library. Its signature is shown in Fig-
ure A.3. In this listing, classes vector and valarray are
defined in the namespace std in the C++ Standard Library.
The input vector rf contains any number of incident radi-
ation fields. These radiation fields are direction-integrated
light intensities measured in the units of eV cm−2 s−1 m−1
and defined on the wavelength grid used to initialize the
class GrainModel. The output vector emissivity upon
the exit from the function contains the stochastic grain
emissivities corresponding to each of the incident radiation
fields. The emissivity is the rate of energy emission multi-
plied by the wavelength. In the code, it is measured in the
units of eV s−1 sr−1 (H-atom)−1 and defined on the same
wavelength grid.
1 v i r t u a l vo id C a l c u l a t e T r a n s i e n t E m i s s i v i t y (
2 cons t v e c t o r < cons t v a l a r r a y <double >∗ > &r f ,
3 cons t v e c t o r < v a l a r r a y <double >∗ > &e m i s s i v i t y
4 ) = 0 ;
Figure A.3: Signature of the main interface function of HEATCODE.
Appendix A.2. Dependencies
HEATCODE uses OpenMP in order to distribute calculations
across multiple threads on many-core processors and Intel Xeon
Phi coprocessors. Therefore, applications using HEATCODE
must be compiled with the compiler argument -fopenmp with
GCC or -openmp with Intel C++ compiler.
Appendix A.3. Usage in Parallel Computing Platforms
The function CalculateTransientEmissivity automat-
ically distributes work across multiple CPU cores and, if re-
quested and possible, performs offload of parts of the calcu-
lation to coprocessors. Usually, there is is no need to call
this function from a parallel region, because it exploits all
parallelism in the hardware system available to it. How-
ever, CalculateTransientEmissivity may be safely in-
voked from multiple OpenMP threads. In this case, only one
instance of the calculation proceeds at a time, because the cal-
culation is executed inside a critical OpenMP barier.
8
HEATCODE may be used in cluster environments if MPI
processes operating on independent nodes call independent
instances of the library. However, within HEATCODE,
distributed-memory parallelism with MPI is not supported.
When multiple instances of an application using
HEATCODE are running on a compute node with Intel
Xeon Phi coprocessors, each instance will try to claim all avail-
able compute resources: CPU cores and coprocessors. In cases
when it is necessary to share these resources between several
instances of the HEATCODE library, the user must implement
inter-process locking or otherwise modify HEATCODE in
order to avoid resource contention.
Appendix B. Optimization of HEATCODE
In this section we discuss the optimizations that had to be
made in order to improve the performance of the stochastic
heating and emissivity calculation and utilize the Intel MIC
architecture. Most of these optimizations apply to all mod-
ern general-purpose processors and are not specific to the na-
ture of the HEATCODE algorithms. We report our optimiza-
tion work in detail in order to facilitate future development of
HEATCODE and of other applications relying on similar com-
putational algorithms. As of the writing of this work, the Intel
MIC architecture is a young technology with only eight months
of public exposure. We believe that the sharing of best opti-
mization practices at this time is essential to the evolution of
the scientific developer community.
Optimization steps discussed below are revisions of the class
TransientHeatingModelXeonPhi. Most optimization steps
address a specific category of hardware capabilities or algo-
rithm properties; however, some optimization improve perfor-
mance by affecting multiple issues. We emphasize such multi-
issue optimizations in the discussion. The explicit code of all
optimization steps can be viewed, cross-examined and bench-
marked by downloading the supplement to HEATCODE in the
CPC library.
The unoptimized code (we refer to it as “optimization
step 0”) is our own application developed without the aim to
optimize it for the Intel MIC architecture or to tune it for a
specific C++ compiler. The code of step 0 is not intentionally
handicapped for the purpose of highlighting the optimization
benefits. It was designed as an architecture-oblivious, straight-
forward implementation of the matrix formalism of Draine and
Li [4]. The performance of the code in step 0 compiled with
the GNU C++ compiler is hereafter referred to as the baseline
performance. Quantitatively, the baseline corresponds to an av-
erage of 59 ms per stochastic heating and emissivity calculation
(in a batch of 104 spectra).
Figure B.4 shows the performance of HEATCODE for all op-
timization steps with three curves. The green dashed line shows
the performance of the code compiled with GCC on the host
system utilizing all cores. The blue dash-dotted line is the per-
formance of the same code compiled with Intel C++ compiler,
also on the host system. The red dash-double-dot line is the
performance on a single Intel Xeon Phi coprocessor (for steps 0
through 7), or on the whole system with the host and two co-
processors used concurrently (step 8). All benchmark results
shown in Figure B.4 are measured relative to the baseline.
The performance of the optimized GCC version (step 7) is
approximately 32x better than the baseline. The Intel C++
compiler version on the host is approximately 95x faster than
the baseline, and the code compiled with Intel C++ compiler on
the coprocessor is 180x faster. The heterogeneous system com-
prised of the host processors and two coprocessors performs
420x faster than the baseline version.
With GCC, the performance falls short of the corresponding
performance with Intel C++ compiler. Two factors contribute
to this performance difference. First, according to the vectoriza-
tion report, GCC was not able to vectorize some of the loops in
the code, including the calculation of the interpolated radiation
field. That calculation involves costly transcendental arithmetic
operations and takes a significant fraction of the computation
time. Second, the Intel Math Library provides optimized ex-
ponential and logarithm functions with base 2, which are faster
than their natural base counterparts, and HEATCODE takes ad-
vantage of these optimized functions. The math library used
by GCC does also provides base 2 exponential and logarithm
functions, but they are not faster than the natural base logarithm
and exponential. In order to compensate for that, when GCC is
used for compiling HEATCODE, the code falls back to using
the natural base exponential and logarithmic functions.
In the rest of the Appendix, the specific measures taken in
each optimization step are discussed and supplemented with
fragments of C++ code and explanatory diagrams.
Appendix B.1. Algorithm Optimizations: Pruning, Recurrence
Some applications can be improved by eliminating redun-
dant operations from the algorithm. In the context of computer
algorithms, this procedure is known as “pruning”. Pruning is
generally hardware-independent, but may require adopting a set
of assumptions. For calculations in HEATCODE, two assump-
tions are made:
1) We assume that above a certain threshold grain size, grains
become large enough so that the stochastic heating does not
have to be computed. Instead, adiabatic heating and emis-
sion formalism can be used, which is outside the scope of
the HEATCODE applicability. In step 1, we optimize the
code by precomputing the index gIMax, which corresponds
to the threshold grain size. Using this index, we restrict the
bounds of corresponding loops of and reduce the amount of
memory allocated for some storage containers.
2) The radiation field is assumed to be zero outside the wave-
length range covered by the numerical wavelength grid. Ac-
cordingly, the range of grain excitation energies can be con-
strained. We optimize the code by precomputing the index
of the highest exited level for each grain size. This index is
represented by array int fMax[gIMax]. Using fMax, we
restrict the loop bounds in the respective matrix operations.
Additionally, in step 1, we implement the pruning of grain
sizes for which all heating terms calculated in the course of
9
10-1
100
101
102
103
 0  1  2  3  4  5  6  7  8
Pe
rfo
rm
an
ce
 R
el
at
iv
e 
to
 B
as
el
in
e
Optimization Step
Baseline: unoptimized,
compiled with GCC,
running on host CPUs
(59 ms per spectrum)
GCC on CPUs
Intel C
++ on
 Xeon
 Phi
Intel C
++ on C
PUs
Unoptimized
with
Offload
Algorithm
Optimization:
Pruning, Recurrence
Thread Parallelism:
Fit All Threads
 in Memory
Improved
 Interpolation Method: 
Packed Operations
Scalar Optimizations:
Precomputation,
 Precision Control
Memory Access:
Packed Data,
Loop Tiling
   Vectorization:
   Alignment,
   Padding, Hints
Offload Traffic:
Data Persistence
on Coprocessor
Heterogeneous:          
Using Host +          
+ Two Coprocessors             
Figure B.4: Performance of HEATCODE at optimization stages from the original implementation to the code highly optimized for heterogeneous multi- and
many-core SIMD architectures.
radiation field interpolation are zero. This happens if the radi-
ation field is non-zero only in a limited waveband, which does
not interact with grains of a given size. For such non-interacting
grains, the temperature distribution is immediately obtained as
Pi = δi0, where δi j is the Kronecker delta. The pruning is
done by counting and storing the number of non-zero heating
terms in array int nNonZero[gIMax] and using this array in
all functions to skip through trivial grain sizes. This optimiza-
tion does not involve any assumptions.
Another approach to algorithm optimization is the use of
recurrence relations. These relations express a quantity in a
computed sequence via one or more preceding values in the se-
quence. As Guhathakurta and Draine [2] mention, the elements
of matrix B f j defined by Equation (5) can be computed using a
recurrence relation, which in our notation is
B f ,i =
{
B( f +1),i + T f ,i for i = 0, 1, . . . ,M − 1;
T f ,i for i = M.
(B.1)
In optimization step 0 (previous), the calculation of Bi j is
performed explicitly (i.e., without recurrence) using the code
shown in the left-hand side panel of Figure B.5. In step 1
(currently discussed), we modify this code as shown in the
right-hand side panel of Figure B.5. This optimization sig-
nificantly improves the performance by reducing the cost of
the calculation of Bi j from O
(
M3
)
to O
(
M2
)
. Figures B.5,
variables transientMatrix and bMatrix are 3-dimensional
arrays containing, respectively, matrices Ti j and Bi j for each
grain size. The variable gI is the index of the grain size, and
tempBins is the size of the temperature grid corresponding to
(M + 1) in Equations (5) and (B.1).
Appendix B.2. Thread Parallelism: Inter-Procedural Loop
Fusion for Reduced Memory Footprint
While the host system used in our tests has 32 logical cores,
each of its two coprocessors has 240 logical cores. As discussed
in Section 3.1.1, HEATCODE employs thread parallelism by
distributing across processor cores a large number of radiation
fields for which the stochastic heating and emissivity must be
computed. Intel Xeon Phi coprocessors used for testing have
8 GB of onboard memory, with some of it used for adminis-
trative purposes. Each thread allocates scratch data in order
to perform the calculation of stochastic heating and emissiv-
ity. The greatest amount of memory is used for scratch arrays
which hold the results of intermediate calculations, and occupy
up to 60 MB per thread (the exact size depends on the grain
model). On the coprocessor, each of the 240 threads allocates a
private copy of all three arrays, resulting in a memory footprint
in excess of 10 GB. Therefore, in order to run the calculation,
we have to reduce the number of threads down to an amount
that allows all scratch data to fit in memory, sacrificing the per-
formance of unused cores. In optimization step 2 (current), we
resolve this situation as described below.
In each parallel thread spawned by the function
CalculateTransientEmissivity, three arrays are
10
1 f o r ( i n t f = 0 ; f < tempBins ; ++ f ) {
2 f o r ( i n t i = 0 ; i < tempBins ; ++ i ) {
3
4 double sum = 0 ;
5 f o r ( i n t k = f ; k < tempBins ; ++k )
6 sum += t r a n s i e n t M a t r i x [ g I ∗ tempBins ∗ tempBins + k
∗ tempBins + i ] ;
7
8 bMat r ix [ g I ∗ tempBins ∗ tempBins + f ∗ tempBins + i ] =
sum ;
9 }
1 double rSum [ tempBins ] ;
2 f o r ( i n t i = 0 ; i < tempBins ; i ++)
3 rSum [ i ] = 0 ;
4
5 f o r ( i n t f = fMax [ g I ] ; f >= 1 ; −− f )
6 f o r ( i n t i = 0 ; i < f ; ++ i ) {
7 rSum [ i ] += t r a n s i e n t M a t r i x [ g I ∗ tempBins ∗ tempBins +
f ∗ tempBins + i ] ;
8 bMat r ix [ g I ∗ tempBins ∗ tempBins + f ∗ tempBins + i ] =
rSum [ i ] ;
9 }
Figure B.5: Left: unoptimized calculation of Bi j for each grain size. Right: calculation of Bi j for optimized by using a recurrence relation.
used to temporarily hold scratch data of the calcula-
tion: weightedRadiationField, transientMatrix
and bMatrix. The size of each of these arrays is
gIMax × tempBins × tempBins. The factor gIMax & 100
occurs because the calculation stores one matrix of size
tempBins × tempBins for each grain size. This is nec-
essary because the unoptimized calculation is structured as
a pipeline. Three functions process the radiation field into
the temperature distribution. One function performs the
radiation field interpolation (9) and places it into the array
weightedRadiationField. The second uses the interpolated
radiation fields to prepare matrices Ti j and Bi j stored in
transientMatrix and bMatrix, respectively. The third
function uses bMatrix to solve, for each grain size, the system
of equations (4) with these matrices. The structure of the
unoptimized calculation is shown in Figure B.6.
The optimization that we performe in step 2 involves
loop fusion. Each of the three pipeline functions con-
tains a loop in the variable gI, which spans all grain
sizes. The calculation for each grain size is independent
of all other grain sizes. Therefore, if these three loops
are fused into one, then, instead of a set of 3-dimensional
arrays of size gIMax × tempBins × tempBins, a set of
2-dimensional arrays of size tempBins × tempBins can
be used. We perform the optimization by moving the
code of the three functions into a single function named
RadiationFieldToTemperatureDistribution and fusing
the loop in gI. Naturally, we could once again partition the
body of the new gI loop into three functions operating on only
one index gI. However, we chose to keep the code in one func-
tion, aiming for the best performance. Argument passing and
function calling could incur additional performance overhead.
The optimized structure of the code is listed in Figure B.7. A
schematic view of this optimization is shown in Figure B.8.
This optimization does not significantly affect the perfor-
mance of the host system based on general-purpose Intel pro-
cessors. Indeed, the host system with 64 GB of memory and 32
logical cores has sufficient memory to hold the scratch data for
all 32 threads even in the unoptimized case. However, the per-
formance on Intel Xeon Phi coprocessors is improved by loop
fusion by a factor of 4. One reason for this is that with re-
duced memory footprint, the calculation is able to run as many
OpenMP threads as there are logical cores on the coproces-
sor. Another reason is that instead of an expensive allocation
of large scratch arrays on the heap, threads perform quick allo-
cation of small scratch arrays on the stack. Dynamic allocation
on the heap is particularly costly on the Intel MIC architecture,
because this operation is inherently sequential.
Appendix B.3. Efficient Interpolation Method: Packing and
Precomputing Scattered Memory Accesses
Optimization for multi- and many-core architectures usually
requires the maintenance of data access locality. Locality in
this context means that the data in the main memory requested
by cores should be requested from adjacent addresses (“local-
ity in space”) and, if these data are used multiple times in an
algorithm, they should be re-used as soon as possible (“locality
in time”). These properties of data access make the best use of
hierarchical memory caches.
One of the operations performed in the stochastic heating cal-
culation algorithm has an inherently scattered pattern of mem-
ory access. This operation is the interpolation of the incident
radiation field from the wavelength grid on which it is defined
onto the grid on which the absorption cross section of the grains
is discretized. This pattern of memory access is inefficient be-
cause it involves memory access with poor locality in space.
We also argue below that it is inefficient for vectorization (i.e.,
the use of SIMD instructions). The unoptimized interpolation
algorithm is shown in Figure B.9.
There are a number of issues in the unoptimized interpolation
algorithm that prevent the multi- and many-core architecture
from using its full potential:
1) The function std::lower bound is employed to find the
radiation field wavelength bin interacting with the current
grain wavelength. This function uses the binary search al-
gorithm, which incurs poorly predictable branches and scat-
tered memory accesses to array wavelength. Due to the
combinatorial, non-vector nature of this algorithm, the Intel
MIC architecture is used sub-optimally.
2) The data are read from arrays radiationField and
absorptionCrossSection with a scattered pattern ac-
cording to the index j found in the binary search. This
pattern complicates effective cache use. It also prevents the
vectorization of the transcendental expressions used to com-
pute the interpolated value.
3) The data written to the matrix weightedRadiationField
are re-used in the code when the matrix transientMatrix
11
1 void I n t e r p o l a t e W e i g h t e d R a d i a t i o n F i e l d ( / ∗ o u t p u t ∗ / f l o a t ∗ w e i g h t e d R a d i a t i o n F i e l d , . . . ) {
2 f o r ( i n t gI = 0 ; g I < gMax ; g I ++)
3 { / ∗ . . . c a l c u l a t i n g w e i g h t e d R a d i a t i o n F i e l d [ g I ∗ tempBins ∗ tempBins + o f f s e t ] ∗ / }
4 }
5
6 void C a l c u l a t e M a t r i c e s ( / ∗ i n p u t ∗ / cons t f l o a t ∗ w e i g h t e d R a d i a t i o n F i e l d , / ∗ s c r a t c h ∗ / f l o a t ∗ t r a n s i e n t M a t r i x ,
/ ∗ o u t p u t ∗ / f l o a t ∗ bMatr ix , . . . ) {
7 f o r ( i n t gI = 0 ; g I < gMax ; g I ++)
8 { / ∗ . . . c a l c u l a t i n g bMat r ix [ g I ∗ tempBins ∗ tempBins + o f f s e t ] ∗ / }
9 }
10
11 void C a l c u l a t e T e m p e r a t u r e D i s t r i b u t i o n ( / ∗ i n p u t ∗ / cons t f l o a t ∗ bMatr ix , . . . ) {
12 f o r ( i n t gI = 0 ; g I < gMax ; g I ++)
13 { / ∗ . . . u s i n g bMat r ix [ g I ∗ tempBins ∗ tempBins + o f f s e t ] ∗ / }
14 }
15
16 void C a l c u l a t e T r a n s i e n t E m i s s i v i t y ( . . . ) {
17 / ∗ Reducing t h e number o f OpenMP t h r e a d s t o f i t t h e problem i n memory ∗ /
18 cons t i n t r educedNThreads = max ( 2 4 0 , ava i l ab l eMemory / memoryPerThread ) ;
19 o m p s e t n u m t h r e a d s ( reducedNThreads ) ;
20 #pragma omp p a r a l l e l f o r
21 f o r ( i n t r = 0 ; r < n R a d i a t i o n F i e l d s ; r ++) {
22 / ∗ A l l o c a t i o n o f t e m p o r a r y a r r a y s i n each t h r e a d ∗ /
23 f l o a t ∗ w e i g h t e d R a d i a t i o n F i e l d = ( f l o a t ∗ ) m a l lo c ( s i z e o f ( f l o a t ) ∗gIMax∗ tempBins ∗ tempBins ) ;
24 f l o a t ∗ t r a n s i e n t M a t r i x = ( f l o a t ∗ ) m a l lo c ( s i z e o f ( f l o a t ) ∗gIMax∗ tempBins ∗ tempBins ) ;
25 f l o a t ∗ bMat r ix = ( f l o a t ∗ ) m a l lo c ( s i z e o f ( f l o a t ) ∗gIMax∗ tempBins ∗ tempBins ) ;
26
27 / ∗ P a s s i n g d a t a a c r o s s f u n c t i o n s ∗ /
28 I n t e r p o l a t e W e i g h t e d R a d i a t i o n F i e l d ( w e i g h t e d R a d i a t i o n F i e l d , . . . ) ;
29 C a l c u l a t e M a t r i c e s ( w e i g h t e d R a d i a t i o n F i e l d , t r a n s i e n t M a t r i x , bMatr ix , . . . ) ;
30 C a l c u l a t e T e m p e r a t u r e D i s t r i b u t i o n ( bMatr ix , . . . ) ;
31
32 / ∗ . . . ∗ /
33 }
34 }
Figure B.6: Unoptimized structure of stochastic heating and emissivity calculation. The processing pipeline involves three distinct functions. The output of one
function is the input of the subsequent function. In order to pass the data from one function to another, 3-dimensional temporary arrays are allocated on the stack,
which requires too much memory, and the number of logical cores must be reduced.
1 void R a d i a t i o n F i e l d T o T e m p e r a t u r e D i s t r i b u t i o n ( / ∗ . . . ∗ / ) {
2 / ∗ 2− d i m e n s i o n a l i n s t e a d o f 3− d i m e n s i o n a l a r r a y s
3 S t a c k a l l o c a t i o n h e r e i s f a s t e r t h a n heap a l l o c a t i o n i n t h e u n o p t i m i z e d v e r s i o n . ∗ /
4 f l o a t w e i g h t e d R a d i a t i o n F i e l d [ tempBins ∗ tempBins ] ;
5 f l o a t t r a n s i e n t M a t r i x [ tempBins ∗ tempBins ] ;
6 f l o a t bMat r ix [ tempBins ∗ tempBins ] ;
7
8 f o r ( i n t gI = 0 ; g I < gIMax ; g I ++) {
9 / ∗ Here go t h e c a l c u l a t i o n s p r e v i o u s l y pe r fo rmed i n t h e body of t h e g I loop i n
10 I n t e r p o l a t e W e i g h t e d R a d i a t i o n F i e l d s ( . . . )
11 C a l c u l a t e M a t r i c e s ( . . . )
12 C a l c u l a t e T e m p e r a t u r e D i s t r i b u t i o n ( . . . )
13 ∗ /
14 }
15 }
16
17 void C a l c u l a t e T r a n s i e n t E m i s s i v i t y ( . . . ) {
18 #pragma omp p a r a l l e l f o r
19 f o r ( i n t r = 0 ; r < n R a d i a t i o n F i e l d s ; r ++) {
20 R a d i a t i o n F i e l d T o T e m p e r a t u r e D i s t r i b u t i o n ( / ∗ . . . ∗ / )
21 }
22 }
Figure B.7: Optimized structure of stochastic heating and emissivity calculation. The processing pipeline is placed in one function, and three loops in variable gI
are fused into one. This allows smaller 2-dimensional temporary arrays. Now the problem fits into memory even when all logical cores are used. In addition, stack
allocation of scratch arrays is possible, which is much faster than the heap allocation of large arrays.
12
  
RadiationFieldToTemperatureDistribution() {
  for (int i = 0; i < gIMax; i++)
    { 
 /* … InterpolateWeightedRadiationField … */ 
 /*        … CalculateMatrices …          */ 
 /* … CalculateTemperatureDistribution …  */ 
    }
}
gIMax
weightedRadiationField
transientMatrix
InterpolateWeightedRadiationField() {
  for (int i = 0; i < gIMax; i++)
    { /* … */ }}
CalculateMatrices() {
  for (int i = 0; i < gIMax; i++)
    { /* … */ }}
gIMax
bMarix
gIMax
CalculateTemperatureDistribution() {
  for (int i = 0; i < gIMax; i++)
    { /* … */ }}
Unoptimized Version 
with 3-dimensional Temporary Arrays
Optimized Version with Fused Loops 
and 2-dimensional Temporary Arrays
weightedRadiationField
transientMatrix
bMatrix
#pragma omp parallel for #pragma omp parallel for
Figure B.8: Inter-procedural loop fusion reduces the memory footprint, which increases the usable number of OpenMP threads.
is computed, however, the calculations of these two ma-
trices are performed in distinct loops. This leads to
sub-optimal data access locality in time. By the time
the calculation of transientMatrix begins, the bulk of
weightedRadiationField is likely to have been evicted
from processor caches into the main memory. Loading
weightedRadiationField from memory is the bottleneck
for the calculation of transientMatrix.
In step 3, we optimize the algorithm of radiation field in-
terpolation and the computation of transientMatrix using
several measures, as shown in Figure B.10:
a) First, we interchange the order of nesting in the traver-
sal of the two wavelength grids. Instead of traversing the
grain wavelength grid first and for each set of indices (f,
i) computing the index j in the radiation field grid, we
made the optimized outer loop iterate through index j and
compute the indices (f, i) for it. This improves the pat-
tern of memory access, because the index f*tempBins
+ i remains constant or monotonically increases with in-
creasing j. Even though now the calculation accesses
weightedInterpolationField with a scattered pattern,
the performance increases because the latency of scattered
writes can be overlapped with computation, unlike the la-
tency of scattered reads. The overlap occurs automatically
thanks to out-of-order computation in Intel Xeon cores and
due to hyper-threading in both Intel Xeon and Intel Xeon
Phi cores.
b) The optimized code precomputes during the initializa-
tion phase the set of values (f, i) resonant with each
wavelength bin j. These values are stored in ar-
ray interpolationPatternIndex. An auxiliary ar-
ray interpolationPatternCount stores the number of
grain wavelengths in the former array that resonate with
wavelength[j]. This completely eliminates a sorted array
search, along with the poorly predictable branches caused
by it, from the interpolation code.
c) The algebraic expression for the interpolated radiation field
value is simplified, and most of the terms in it precomputed
and stored in array interpolationLogOffset. This re-
duces the number of transcendental functions called for ev-
ery interpolation event from 9 to an average value in the
range (1 . . . 2]. The specific number depends on how many
grain wavelength bins interact with the given radiation field
bin, on average.
d) The calculation of the elements of transientMatrix is
moved from a separate loop into the interpolation loop. This
improves the temporal locality of data access, because the
values of weightedInterpolationField are still in the
processor’s registers when they are re-used.
The optimizations discussed above increase the code perfor-
mance thanks to improvements on multiple fronts. Loop inter-
change streamlines the data access pattern, replacing scattered
reads with scattered writes. Precomputation of wavelength in-
dices and interpolation expressions eliminates a combinatorial
search and reduces the volume of transcendental arithmetic op-
erations and non-local memory accesses. Fusing the calcula-
tion of weightedRadiationField with its subsequent usage
to compute transientMatrix improves the locality of data
access in time.
A schematic diagram of the optimization of the interpolation
algorithm is shown in Figure B.11.
13
1 f o r ( i n t f = 1 ; f <= fMax [ g I ] ; ++ f ) { / ∗ Loop t h r o u g h a l l h e a t i n g t r a n s i t i o n s f o r t h i s g r a i n s i z e ∗ /
2 f o r ( i n t i = 0 ; i < f ; ++ i ) {
3 / ∗ Wavelength a t which c r o s s s e c t i o n i s d e f i n e d ∗ /
4 cons t double wl = g r a i n W a v e l e n g t h [ g I ∗ tempBins ∗ tempBins + f ∗ tempBins + i ] ;
5
6 i f ( wl >= w a v e l e n g t h [ 0 ] && wl <= w a v e l e n g t h [ wlBins −1] ) { / ∗ Guard a g a i n s t o u t o f r a n g e r e s u l t s ∗ /
7
8 cons t f l o a t ∗ wlVal = s t d : : l ower bound (& w a v e l e n g t h [ 0 ] , &w a v e l e n g t h [ wlBins −1] , wl ) ;
9 cons t i n t j = wlVal − &w a v e l e n g t h [ 0 ] ; / ∗ Index of c o r r e s p o n d i n g w a v e l e n g t h i n r a d i a t i o n f i e l d g r i d ∗ /
10
11 / ∗ Two v a l u e s o f w e i g h t e d r a d i a t i o n f i e l d t o i n t e r p o l a t e between : ∗ /
12 cons t double l ower = r a d i a t i o n F i e l d [ j ]∗ a b s o r p t i o n C r o s s S e c t i o n [ g I ∗wlBins + j ] ;
13 cons t double uppe r = r a d i a t i o n F i e l d [ j −1]∗ a b s o r p t i o n C r o s s S e c t i o n [ g I ∗wlBins + j −1 ] ;
14
15 i f ( ( uppe r > 0) && ( lower > 0) ) {
16 cons t double r e s u l t = / ∗ L i n e a r i n t e r p o l a t i o n i n t h e log− l o g s p a c e ∗ /
17 exp ( l o g ( lower ) + ( l o g ( uppe r ) − l o g ( lower ) ) ∗
18 ( l o g ( wl ) − l o g ( w a v e l e n g t h [ j −1] ) ) / ( l o g ( w a v e l e n g t h [ j ] ) − l o g ( w a v e l e n g t h [ j −1] ) ) ) ;
19
20 w e i g h t e d R a d i a t i o n F i e l d [ g I ∗ tempBins ∗ tempBins + f ∗ tempBins + i ] =
21 r e s u l t ∗ ( u t l : : m/ u t l : : cm ) ∗ ( u t l : : m/ u t l : : cm ) ; / ∗ C o n v e r s i o n o f u n i t s ∗ /
22 }
23 }
24 }
25 }
26
27 / ∗ t r a n s i e n t M a t r i x i s computed from w e i g h t e d R a d i a t i o n F i e l d i n a s e p a r a t e loop ∗ /
28 f o r ( i n t f = 1 ; f <= fMax [ g I ] ; ++ f )
29 f o r ( i n t i = 0 ; i < f ; ++ i )
30 t r a n s i e n t M a t r i x [ f ∗ tempBins + i ] = w e i g h t e d R a d i a t i o n F i e l d [ f ∗ tempBins + i ]∗
31 e n t h a l p y D e l t a [ g I ∗ tempBins + f ]∗
32 g r a i n W a v e l e n g t h [ g I ∗ tempBins ∗ tempBins + f ∗ tempBins + i ]∗
33 g r a i n W a v e l e n g t h [ g I ∗ tempBins ∗ tempBins + f ∗ tempBins + i ]∗
34 g r a i n W a v e l e n g t h [ g I ∗ tempBins ∗ tempBins + f ∗ tempBins + i ]∗
35 i n v e r s e P l a n c k T i m e s S p e e d O f L i g h t S q u a r e d ;
Figure B.9: Unoptimized interpolation algorithm for the calculation of weightedRadiationField and the calculation of heating terms in transientMatrix.
1 f o r ( i n t j = 1 ; j < wlBins ; j ++) { / ∗ Now o u t e r loop t r a v e r s e s t h e r a d i a t i o n f i e l d w a v e l e n g t h g r i d ∗ /
2 / ∗ Two v a l u e s o f w e i g h t e d r a d i a t i o n f i e l d t o i n t e r p o l a t e between : ∗ /
3 cons t double l ower = r a d i a t i o n F i e l d [ j ]∗ a b s o r p t i o n C r o s s S e c t i o n [ g I ∗wlBins + j ] ;
4 cons t double upper = r a d i a t i o n F i e l d [ j −1]∗ a b s o r p t i o n C r o s s S e c t i o n [ g I ∗wlBins + j −1 ] ;
5
6 / ∗ How many g r a i n w a v e l e n g t h s f a l l w i t h i n t h e r a d i a t i o n f i e l d b i n j ( p recomputed ) ∗ /
7 cons t i n t qCount = i n t e r p o l a t i o n P a t t e r n C o u n t [ g I ∗wlBins + j ] ;
8
9 i f ( ( uppe r > 0) && ( lower > 0) ) { / ∗ Guard a g a i n s o u t o f r a n g e r e s u l t s ∗ /
10 cons t double dLogUpperLower = l o g ( uppe r / l ower ) ; / ∗ Same v a l u e f o r a l l b i n s ∗ /
11 f o r ( i n t c = 0 ; c < qCount ; c++) {
12 / ∗ Index of t h e g r a i n w a v e l e n g t h f ∗ tempBins+ i i s p recomputed d u r i n g i n i t i a l i z a t i o n ∗ /
13 cons t i n t i d x = i n t e r p o l a t i o n P a t t e r n I n d e x [ qCt r + c ] ;
14
15 / ∗ I n t e r p o l a t i o n wi th precomputed l o g o f f s e t ∗ /
16 w e i g h t e d R a d i a t i o n F i e l d [ i d x ] = l ower ∗ exp ( dLogUpperLower ∗ i n t e r p o l a t i o n L o g O f f s e t [ qC t r + c ] ) ;
17
18 / ∗ H e a t i n g t e r m s and u n i t c o n v e r s i o n ∗ /
19 t r a n s i e n t M a t r i x [ i d x ] = w e i g h t e d R a d i a t i o n F i e l d [ i d x ]∗ i n t e r p o l a t i o n M a t r i x F a c t o r [ qC t r + c ] ;
20 }
21 }
22 }
Figure B.10: Optimized interpolation algorithm with improved spatial data locality (packed data) and precomputed expensive operations. The calculation of
transientMatrix in the same loop as weightedRadiationField improves the data access locality in time.
14
  
j+1j
Compute slope and offset 
for interpolation in bin j, 
interpolate to grain wavelength
Wavelength                                                                   
Wavelength                                                                  
Write to
weightedRadiationField[f][i]
Iterate over 
grain transitions (f,i)  grainWavelength[gI][f][i]
R
ad
ia
ti o
n 
f ie
ld
 ⋅
 
cr
os
s s
e c
tio
n
Binary search
for wavelength bin index j
R
ad
ia
ti o
n 
f ie
ld
 ⋅
 
cr
os
s s
e c
tio
n
Iterate over 
wavelength index j  
j+1j Look up precomputed (f,i)
and slope & offset for j
 
Vector loop 
through all (f,i) 
for current j  
compute 
interpolated RF,
write to
weighted\
Radiation\
Field[f][i]
Unoptimized Interpolation of Weighted Radiation Field
with Binary Search for Wavelength Bins 
and Scalar Arithmetics
Optimized Interpolation Algorithm
with Grain Wavelength Bin and  Slope&Offset Precomputation
and Automatic Vectorization
Figure B.11: Schematic interpolation algorithm before and after optimization.
Appendix B.4. Scalar Optimizations: Precision Control, Pre-
computation
The preceding optimization steps (see Appendix B.1 through
Appendix B.3) eliminate algorithmically redundant operations,
ensure scalable thread parallelism, facilitate automatic vector-
ization, and improve the memory access pattern. The next step
is optimization on the level of scalar operations by reducing the
cost of arithmetics.
The unoptimized code uses mixed precision arithmetics. The
usage of mixed precision is detrimental to performance for two
reasons. First, double precision operations take more time
than their single precision counterparts. Second, in vector-
ized mixed precision operations, additional type conversion and
load/store operations are necessary, which may make mixed
precision even less efficient than pure double precision. In
the unoptimized code, matrices weightedRadiationField,
transientMatrix and bMatrix are single precision contain-
ers. However, some intermediate arithmetic operations are per-
formed in double precision in order to avoid floating-point un-
derflow and overflow.
In optimization step 4 (currently discussed), we convert all
floating-point variables and operations in the code to single pre-
cision. In order to avoid overflow, we use scaling factors that
decrease the magnitude of intermediate arithmetic results, so
that the overflow does not occur. These modifications are nec-
essary in two locations in the code:
1) The calculation of the temperature distribution according to
Equation (6) can have overflow if the heating is strong. In
single precision, overflow occurs when for some f > 1, the
value X f is a factor of FLT MAX ≈ 1038 greater than X0 = 1.
1 f l o a t sum = 0 . 0 f ;
2 f o r ( i n t i = 0 ; i <= f − 1 ; ++ i )
3 sum += bMat r ix [ f ∗ tempBins + i ]∗ x [ i ] ;
4
5 x [ f ] = sum∗ t r a n s i e n t M a t r i x [ ( f −1) ∗ tempBins + f ] ;
6
7 i f ( x [ f ] > 1 . 0 e30 ) {
8 / ∗ Thi s code a l l o w s t o use s i n g l e p r e c i s i o n f o r
t h e e n t i r e s o l v e r . I f t h e e l e m e n t s o f x [ ]
become t o o l a r g e , t h e v e c t o r i s s c a l e d down .
∗ /
9 cons t f l o a t df = 1 . 0 e −30;
10 f o r ( i n t i = 0 ; i <= f ; i ++)
11 x [ i ] ∗= df ;
12 }
Figure B.12: Optimized calculation of the temperature distribution with dy-
namic re-normalization avoids using double precision floating-point arith-
metics.
We eliminate that overflow by checking that every new com-
puted X f does not exceed a pre-determined threshold value
(we used 1030 as the threshold). If the value of X f exceeds
the threshold, the whole vector X is multiplied by a fac-
tor 10−30. This procedure, illustrated in Figure B.12, affects
the result only to machine precision, because eventually, the
temperature distribution is calculated by re-normalization
according to Equation (7).
2) In the calculation of stochastic emissivity from the com-
puted temperature distributions, out-of-range values occur
because some physical quantities expressed in the SI units
have extremely large or small absolute values. For exam-
ple, photon absorption cross sections expressed in m2 have
values of order 10−20. In the optimized code, we avoid us-
15
ing double precision by introducing the appropriate scaling
factors for intermediate calculations with these quantities.
In order to conform all quantities and functions in the code to
the pure single precision implementation, two additional modi-
fications are made:
1) All floating-point constants are explicitly marked as sin-
gle precision. By default, floating-point constants in C++
expressions, such as “0”, “1.0”, etc., are treated as dou-
ble precision numbers by the compiler. In order to explic-
itly declare them as single precision, the letter “f” should
be appended: “0.0f”, “1.0f”, etc. This avoids unneces-
sary type conversion in single precision expressions that use
these constants.
2) Mathematical functions from the Math library optimized for
single precision are used. That is, the double precision expo-
nential function exp() is replaced with its single precision
version expf(), and the logarithm log() with logf().
The exponential and logarithm transcendental functions can
be further accelerated with the Intel Math library by using base-
2 implementations exp2f() and log2f(). In the Intel Math
library, these implementations are more efficient than the natu-
ral base versions expf() and logf(). This is not true for the
Math library used by the GNU compiler version 4.4.7, which
we used. Therefore, we implemented compiler-dependent code
preprocessing in order to use the most efficient exponential and
logarithm functions available in the compiler used to build the
application, as shown in Figure B.13. The preprocessor macro
INTEL COMPILER is defined by the Intel C++ compiler; it is
not defined by GCC. This macro is also used to protect other
syntax and functionality specific to the Intel C++ compiler,
such as compiler vectorization hints (see below), and another
automatically defined macro INTEL OFFLOAD is used to pro-
tect the offload pragmas that perform calculations on Intel Xeon
Phi coprocessors.
Finally, in order to speed up some arithmetic operations, we
employ a procedure known as strength reduction. This pro-
cedure algebraically transforms arithmetic expressions, aiming
to replace long-latency arithmetic operations, such as division,
with faster operations, such as multiplication. The result may
not be bitwise-equivalent to the original expression, however,
this is acceptable in cases such as ours. Some strength reduc-
tion had already been performed in the interpolation algorithm
in optimization step 3 (see Appendix B.3). In the current op-
timization step, we replace some divisions with multiplications
by the reciprocal number, as illustrated in Figure B.14. The
code in this figure implements the re-normalization of the tem-
perature distribution expressed by Equation (7).
Appendix B.5. Memory Access Optimization: Redundant Stor-
age Elimination, Loop Interchange and Tiling
In optimization steps 2 and 3 (Appendix B.2 and Appendix
B.3) we already addressed the alleviation of some of the mem-
ory access issues by reducing the scratch storage size and im-
proving the memory access locality in space and time. In op-
timization step 5 (currently discussed), we continue memory
1 # i f d e f INTEL COMPILER
2 / ∗ In ICC
3 l o g 2 f i s f a s t e r t h a n l o g and f a s t e r t h a n l o g f ,
4 exp2f i s f a s t e r t h a n exp and f a s t e r t h a n exp f ∗ /
5 # de f i n e FASTLOG l o g 2 f
6 # de f i n e FASTEXP exp2f
7 # e l s e
8 / ∗ In GCC,
9 l o g f i s f a s t e r t h a n l o g and l o g 2 f ,
10 exp f i s f a s t e r t h a n exp and exp2f ∗ /
11 # de f i n e FASTLOG l o g f
12 # de f i n e FASTEXP expf
13 # end i f
14
15 / ∗ . . . L a t e r i n t h e i n t e r p o l a t i o n code : ∗ /
16 cons t f l o a t logUpperLower = FASTLOG( upper / l ower ) ;
17
18 / ∗ Pre−computed p a t t e r n o f a c c e s s ∗ /
19 f o r ( i n t c = 0 ; c < qCount ; c++) {
20 cons t i n t i d x = i n t e r p o l a t i o n P a t t e r n I n d e x [ qCt r+c ] ;
21 / ∗ I n t e r p o l a t i o n ∗ /
22 w e i g h t e d R a d i a t i o n F i e l d [ i d x ] = l ower ∗FASTEXP(
logUpperLower ∗ i n t e r p o l a t i o n L o g O f f s e t [ qC t r+c ] ) ;
23 / ∗ . . . ∗ /
24 }
Figure B.13: Compiler-dependent choice of the fastest implementations of the
logarithm and exponential functions for interpolation.
1 f l o a t pSum = 0 . 0 f ;
2 f o r ( i n t i = 0 ; i < tempBins ; i ++)
3 pSum += x [ i ] ;
4
5 cons t f l o a t invPSum = 1 . 0 f / pSum ;
6 f o r ( i n t f = 0 ; f <= fMax [ g I ] ; ++ f )
7 / ∗ d i s t r i b u t i o n [ g I ∗ tempBins + f ] = x [ f ] / pSum ; ∗ /
8 d i s t r i b u t i o n [ g I ∗ tempBins + f ] = x [ f ]∗ invPSum ;
Figure B.14: Strength reduction example: replacing division with multiplica-
tion by the reciprocal value.
16
1 f l o a t bMat r ix [ tempBins ∗ tempBins ] ;
2 f l o a t ∗ t r a n s i e n t M a t r i x = &bMat r ix [ 0 ] ;
3
4 / ∗ . . . L a t e r i n t h e i n t e r p o l a t i o n a l g o r i t h m : ∗ /
5 f o r ( i n t c = 0 ; c < qCount ; c++) {
6 cons t i n t i d x = i n t e r p o l a t i o n P a t t e r n I n d e x [ qCt r+c ] ;
7 / ∗ Array w e i g h t e d R a d i a t i o n F i e l d i s e l i m i n a t e d : ∗ /
8 cons t f l o a t w e i g h t e d R a d i a t i o n F i e l d = l ower ∗FASTEXP(
logUpperLower ∗ i n t e r p o l a t i o n L o g O f f s e t [ qC t r+c ] ) ;
9 t r a n s i e n t M a t r i x [ i d x ] = w e i g h t e d R a d i a t i o n F i e l d ∗
i n t e r p o l a t i o n M a t r i x F a c t o r [ qC t r + c ] ;
10 }
11
12 / ∗ . . . L a t e r i n t h e c a l c u l a t i o n o f bMat r ix : ∗ /
13 f o r ( i n t f = fMax [ g I ] ; f >= 1 ; −− f )
14 f o r ( i n t i = 0 ; i < f ; ++ i ) {
15 / ∗ bMat r ix and t r a n s i e n t M a t r i x s h a r e t h e same
memory s p a c e . P r e v i o u s l y used l i n e
16 rSum [ i ] += t r a n s i e n t M a t r i x [ f ∗ tempBins + i ] ;
17 was m o d i f i e d i n o r d e r t o p r e s e r v e e f f i c i e n t
v e c t o r i z a t i o n : ∗ /
18 rSum [ i ] += bMat r ix [ f ∗ tempBins + i ] ;
19 bMat r ix [ f ∗ tempBins + i ] = rSum [ i ] ;
20 }
Figure B.15: Improving memory traffic by sharing memory space be-
tween scratch arrays and loop fusion for weightedRadiationField and
transientMatrix calculation.
access optimization and perform three additional code modifi-
cations.
Appendix B.5.1. Sharing Memory Between Arrays
First, we recognize that in the pipeline that processes
the radiation field to compute the temperature distribu-
tion, three temporary arrays: weightedRadiationField,
transientMatrix and bMatrix are used sequentially. When
the next array is filled, the data in the previous array are no
longer needed. This situation lends us the opportunity to fur-
ther reduce the memory footprint and increase data locality by
using a shared memory location for all three arrays. We ac-
cordingly improve the code by allocating space for only one
array and declaring another array as a pointer referring to the
same memory location. The array weightedRadiationField
is eliminated, and only a temporary scalar variable is used in its
place. Some care must has to be taken in order to preserve auto-
matic vectorization, because the compiler is wary of vector de-
pendences in loops operating on arrays with overlapping mem-
ory locations. The vectorization success is controlled using the
compiler argument -vec-report3 of the Intel C++ compiler
in order to produce a detailed vectorization report. This opti-
mization and the construction of the modified loops is shown in
Figure B.15.
Appendix B.5.2. Loop Interchange
In the calculation of stochastic emissivity from the tempera-
ture distribution, expressed by Equations (11) and (13), nested
loops are used to traverse the emissivity wavelengths, grain
sizes and the temperature bins. In the unoptimized calculation,
the wavelengths are traversed first, followed by grain sizes, fol-
lowed by the temperature bins in the inner nested loop. We
have found that permuting the outer two loops (i.e., traversing
grain sizes first and wavelengths afterwards) improves the per-
formance. The unoptimized code (before permutation) and the
optimized code (after) are shown in Figure B.16.
The difference in memory access between unoptimized
and optimized codes is the pattern of read access to arrays
distribution and planckDistribution. In the unopti-
mized code, the latter array has better access locality than the
former. Indeed, for every iteration of the outer loop in vari-
able i, only elements of the j-th row of planckDistribution
are used, but the entirety of array distribution is read from
memory. In contrast, with permuted loops in the right-hand
side panel of Figure B.16, for every iteration of the outer loop
in variable j, only the i-th row of distribution is used,
whereas planckDistribution is read from front to back. The
permuted loop version works faster because when the calcula-
tion is distributed across multiple threads, each thread is oper-
ating on the same copy of planckDistribution, but the data
in distribution are private to each thread. In Intel Xeon pro-
cessors, hyper-threading places two threads per physical core,
and in Intel Xeon Phi coprocessors, four threads per physi-
cal core. Therefore, when the code operates on thread-private
distribution, two (on the CPU) or four (on coprocessor) in-
stances of this array must be fetched into the cache of each core.
In contrast, for shared planckDistribution, only one copy
per core must be fetched for all threads on that physical core to
share. This means that the application pays a greater penalty for
cache misses on distribution, which is why the permuted
loop version, with better locality of access to distribution,
works faster.
Appendix B.5.3. Loop Tiling
Further improvement of cache performance in the calculation
of stochastic emissivity may be achieved with the use of loop
tiling. Loop tiling transforms two nested loops in such a way
that
a) the outer loop variable is incremented with a stride greater
than one (the stride is called the tile size), and
b) for every value of the inner loop variable, a fixed small num-
ber of outer loop iterations are performed to traverse the tile.
This technique is often used in matrix operations with nested
loops in order to increase the re-use of data fetched into pro-
cessor registers or cache before these data are evicted. In the
optimized version in step 5, we employ double tiling, applying
the technique to both the i and j loops. Some empirical tuning
is required in order to find the optimal tile size. The resulting
code (incomplete) is shown in Figure B.17.
The optimizations of loop permutation and tiling in this func-
tion lead to very significant performance improvement on the
multi-core platform, and an even stronger improvement on the
MIC architecture. These methods are often required for mem-
ory traffic optimization in a large class of matrix calculations
with a low arithmetic intensity. Without loop tiling, such ap-
plications may be bottlenecked by memory access latency, nul-
lifying the arithmetic and bandwidth advantages of specialized
computing architectures such as Intel Xeon Phi coprocessors.
17
1 f o r ( i n t i = 0 ; i < wlBins ; ++ i ) { / ∗ i f i r s t ∗ /
2
3 f l o a t sum = 0 . 0 f ;
4 f o r ( i n t j = 0 ; j < gIMax ; ++ j ) {
5 cons t f l o a t gsw = g r a i n S i z e [ j ]∗
g r a i n S i z e D i s t r i b u t i o n [ j ] ;
6 cons t f l o a t c r o s s S e c t i o n = a b s o r p t i o n C r o s s S e c t i o n
[ j ∗wlBins + i ] ;
7 cons t f l o a t p r o d u c t = gsw∗ c r o s s S e c t i o n ∗
s c a l i n g F a c t o r ;
8
9 f l o a t r e s u l t = 0 . 0 f ;
10 / ∗ i − t h row of p l a n c k D i s t r i b u t i o n [ ] i s re −used
f o r a l l j − i t e r a t i o n s , whereas d i s t r i b u t i o n [ ]
i s r e a d f r o n t t o back ∗ /
11 f o r ( i n t k = 0 ; k < tempBins ; ++k )
12 r e s u l t += p l a n c k D i s t r i b u t i o n [ i ∗ tempBins + k ]∗
d i s t r i b u t i o n [ j ∗ tempBins + k ] ;
13 sum += r e s u l t ∗ p r o d u c t ;
14 }
15 t r a n s [ i ] = sum∗w a v e l e n g t h [ i ]∗ u n i t s C o n v e r s i o n ;
16
17 }
1 f o r ( i n t i = 0 ; i < wlBins ; ++ i )
2 t r a n s [ i ] = 0 . 0 f ;
3 f o r ( i n t j = 0 ; j < gIMax ; ++ j ) { / ∗ j f i r s t ∗ /
4 cons t f l o a t gsw = g r a i n S i z e [ j ]∗
g r a i n S i z e D i s t r i b u t i o n [ j ] ;
5 f o r ( i n t i = 0 ; i < wlBins ; ++ i ) {
6 cons t f l o a t c r o s s S e c t i o n = a b s o r p t i o n C r o s s S e c t i o n
[ j ∗wlBins + i ] ;
7 cons t f l o a t p r o d u c t = gsw∗ c r o s s S e c t i o n ∗
s c a l i n g F a c t o r ;
8
9 f l o a t r e s u l t = 0 . 0 f ;
10 / ∗ j − t h row of d i s t r i b u t i o n [ ] i s re −used f o r a l l
i − i t e r a t i o n s , whereas p l a n c k D i s t r i b u t i o n [ ] i s
r e a d f r o n t t o back ∗ /
11 f o r ( i n t k = 0 ; k < tempBins ; ++k )
12 r e s u l t += p l a n c k D i s t r i b u t i o n [ i ∗ tempBins + k ]∗
d i s t r i b u t i o n [ j ∗ tempBins + k ] ;
13 t r a n s [ i ] += r e s u l t ∗ p r o d u c t ;
14 }
15 }
16 f o r ( i n t i = 0 ; i < wlBins ; ++ i )
17 t r a n s [ i ] ∗= w a v e l e n g t h [ i ]∗ u n i t s C o n v e r s i o n ;
Figure B.16: Left: Unoptimized code for stochastic emissivity calculation from the temperature distribution. Right: same code optimized code by a permutation of
the outer loops.
We have contributed the code sample and optimization tech-
niques demonstrated in this section to [21], where this method
is discussed in greater detail.
Appendix B.6. Vectorization Refinements: Data Alignment,
Loop Bound Padding, Compiler Hints
This section describes code modifications that optimize the
automatic vectorization of calculations in HEATCODE.
Appendix B.6.1. Alignment
A memory address is said to be aligned on an B-byte bound-
ary when the numerical offet value of address is a multiple of B.
Most SIMD instruction sets require that when data are loaded
to or from vector registers, the virtual memory address of the
data must be aligned in a certain way. For example, the SSE2
(Streaming SIMD Extensions 2) vector instructions require data
alignment on a 32-byte boundary. The memory alignment re-
quirement is relaxed in the AVX instruction set supported by the
latest, as of this day, generation of general purpose processors.
However, the IMCI instruction set supported by Intel Xeon Phi
coprocessors requires 64-byte alignment.
A block of memory can be aligned on the heap starting at an
aligned address using the function mm malloc(), which is an
intrinsic of the Intel C++ compiler available in malloc.h. In
HEATCODE, for compatibility with other compilers, we imple-
ment a similar function that builds on the standard malloc()
operation to allocate 64-byte aligned arrays. For the alignment
of stack arrays, the Intel C++ compiler supports the corre-
sponding attribute specifier. In optimization step 6 (currently
discussed), we apply these methods to align the first element
of arrays participating in vectorized loops, as illustrated in Fig-
ure B.18.
1 # inc lude <ma l l oc . h>
2 # de f i n e ALIGN BYTES 64
3 # de f i n e FLOATS IN ALIGN BYTES 16
4
5 / ∗ 64− b y t e a l i g n m e n t o f an a r r a y on t h e heap ∗ /
6 p l a n c k D i s t r i b u t i o n F = ( f l o a t ∗ ) mm malloc ( s i z e o f (
f l o a t ) ∗wlBins ∗ tempBins , ALIGN BYTES ) ;
7 mm free ( p l a n c k D i s t r i b u t i o n F ) ;
8
9 / ∗ 64− b y t e a l i g n m e n t o f an a r r a y on t h e s t a c k ∗ /
10 f l o a t bMat r ix [ tempBins ∗ tempBins ]
11 a t t r i b u t e ( ( a l i g n e d ( ALIGN BYTES ) ) ;
12
13 / ∗ In o r d e r f o r each row of t h e above a r r a y s t o be
a l i g n e d , t h e f o l l o w i n g must be s a t i s f i e d : ∗ /
14 a s s e r t ( wlBins % FLOATS IN ALIGN BYTES == 0) ;
15 a s s e r t ( tempBins % FLOATS IN ALIGN BYTES == 0) ;
Figure B.18: Alignment of arrays on the heap and on the stack.
In the case of multi-dimensional arrays, the alignment of the
first element array on a 64-byte boundary may not be suffu-
cient. Indeed, if the length of the inner dimension of this array
is not a multiple of 64 bytes, then the first element in the second
row (and subsequent rows) is not aligned. Therefore, in order
to improve vectorization in nested loops operating on multi-
dimensional arrays, the inner dimension must be a multiple of
the alignment value. We verify that this condition is satisfied
using an assert() check; however, we could also pad the row
length to a multiple of 64 bytes for generality.
Appendix B.6.2. Padding of Loops
The Intel C++ compiler is able to vectorize loops even if the
iteration count in the loop is not a multiple of the SIMD vector
length. For instance, in Intel Xeon Phi coprocessors, a 512-bit
SIMD vector packs sixteen single precision floating-point num-
18
1 cons t i n t i T i l e = 4 ; / ∗ S i z e o f i − t i l e s ∗ /
2 cons t i n t j T i l e = 4 ; / ∗ S i z e o f j − t i l e s ∗ /
3 a s s e r t ( wlBins % i T i l e == 0) ;
4
5 f o r ( i n t j j = 0 ; j j < gIMax−( gIMax%j T i l e ) ; j j += j T i l e ) / ∗ S t r i d i n g t h r o u g h j − t i l e s ∗ /
6 f o r ( i n t i i = 0 ; i i < wlBins ; i i += i T i l e ) { / ∗ S t r i d i n g t h r o u g h i − t i l e s ∗ /
7 f l o a t r e s u l t [ i T i l e ∗ j T i l e ] ;
8 f o r ( i n t c = 0 ; c < i T i l e ∗ j T i l e ; c++)
9 r e s u l t [ c ] = 0 . 0 f ;
10
11 f o r ( i n t k = 0 ; k < tempBins ; ++k ) / ∗ C a l c u l a t i o n s a r e v e c t o r i z e d i n t h e k− l oop ∗ /
12 f o r ( i n t c = 0 ; c < i T i l e ; c++) { / ∗ I n n e r loop i n s i d e t h e i − t i l e ∗ /
13 / ∗ Loop i n s i d e t h e j − t i l e i s u n r o l l e d : ∗ /
14 r e s u l t [ ( 0 ) ∗ i T i l e +c ] += d i s t r i b u t i o n [ ( j j +0) ∗ tempBins+k ]∗ p l a n c k D i s t r i b u t i o n [ ( i i +c ) ∗ tempBins+k ] ;
15 r e s u l t [ ( 1 ) ∗ i T i l e +c ] += d i s t r i b u t i o n [ ( j j +1) ∗ tempBins+k ]∗ p l a n c k D i s t r i b u t i o n [ ( i i +c ) ∗ tempBins+k ] ;
16 r e s u l t [ ( 2 ) ∗ i T i l e +c ] += d i s t r i b u t i o n [ ( j j +2) ∗ tempBins+k ]∗ p l a n c k D i s t r i b u t i o n [ ( i i +c ) ∗ tempBins+k ] ;
17 r e s u l t [ ( 3 ) ∗ i T i l e +c ] += d i s t r i b u t i o n [ ( j j +3) ∗ tempBins+k ]∗ p l a n c k D i s t r i b u t i o n [ ( i i +c ) ∗ tempBins+k ] ;
18 }
19
20 f o r ( i n t b = 0 ; b < j T i l e ; b++) { / ∗ C o l l e c t i n g r e s u l t s f o r t h e t i l e ∗ /
21 cons t f l o a t gsw = g r a i n S i z e [ j j +b ]∗ g r a i n S i z e D i s t r i b u t i o n [ j j +b ] ;
22 cons t f l o a t commonFactor = gsw∗ s c a l i n g F a c t o r ;
23 f o r ( i n t c = 0 ; c < i T i l e ; c++)
24 sum [ i i +c ] += r e s u l t [ b∗ i T i l e + c ]∗ a b s o r p t i o n C r o s s S e c t i o n [ ( j j +b ) ∗wlBins + i i +c ]∗ commonFactor ;
25 }
26 }
27
28 f o r ( i n t j = gIMax−( gIMax%j T i l e ) ; j < gIMax ; ++ j )
29 / ∗ F i n i s h i n g t h e non− t i l a b l e p a r t o f t h e j − l oop ( code n o t shown ) ∗ /
Figure B.17: Loop tiling improves the locality of memory access in the calculation of stochastic emissivity spectrum.
bers. A loop in which the number iterations N is determined at
runtime is not guaranteed to have a multiple of sixteen itera-
tions. In these cases, the compiler produces executable code for
a variety of loop lengths N. Depending on the runtime value of
N and memory alignment situation, the compiler may perform
N modulo 16 iterations at the beginning or at the end of the
loop in a non-vector form, and complete the rest using SIMD
instructions.
Despite the compiler’s flexibility, it is often beneficial for
performance to ensure that the loop count is a multiple of 16.
We have found that the HEATCODE calculations that involve
the quasi-triangular matrix Bi j can be accelerated by changing
the inner loop bounds to a multiple of 16. Figure B.19 shows
examples of code modifications that demonstrated the best per-
formance. In optimization step 6, we increase the inner loop
bounds either to the nearest multiple of 16. This increases the
number of floating-point operations, but significantly decreases
the calculation time due to uninterrupted pattern of vectoriza-
tion. The change in the nested loop pattern expressed by the
first example in Figure B.19 is illustrated in Figure B.20.
In order to avoid overwriting the cooling terms B( f−1) f
in this process, we store these terms in a separate matrix
rTransientMatrixOverDiagonal instead of storing them at
their respective locations in bMatrix.
Additionally, we modify the calculation of the pre-
computed interpolation pattern (see Figure B.10) so that
interpolationCount[i] is a multiple of 16. This achieves
two results.
a) The number of iterations of the loop in variable c in Fig-
1 / ∗ In t h e c a l c u l a t i o n o f bMat r ix . . . ∗ /
2 f o r ( i n t f = fMax [ g I ] ; f >= 1 ; −− f ) {
3 / ∗ Computing iMax such t h a t ( iMax − 1)%0 == 0 , and
f −1 <= iMax <= tempBins −1 ∗ /
4 cons t i n t uB = ( f −1) +(16−( f −1) %16) −1;
5 cons t i n t iMax = ( uB<=tempBins −1?uB : tempBins −1) ;
6
7 / ∗ Unopt imized : ” f o r ( i n t i =0; i < f ; ++ i ) ” ∗ /
8 f o r ( i n t i = 0 ; i <= iMax ; ++ i ) {
9 rSum [ i ] += bMat r ix [ f ∗ tempBins + i ] ;
10 bMat r ix [ f ∗ tempBins + i ] = rSum [ i ] ;
11 }
12 }
13
14 / ∗ L a t e r , c a l c . o f t e m p e r a t u r e d i s t r i b u t i o n : ∗ /
15 x [ 0 ] = 1 . 0 f ;
16 f o r ( i n t f = 1 ; f <= fMax [ g I ] ; ++ f ) {
17 i f ( r T r a n s i e n t M a t r i x O v e r D i a g o n a l [ f ] > 0 . 0 f ) {
18 f l o a t sum = 0 . 0 f ;
19 / ∗ Unopt imized : ” f o r ( i n t i =0; i <=f −1; ++ i ) ” ∗ /
20 f o r ( i n t i = 0 ; i < tempBins ; ++ i )
21 sum += bMat r ix [ f ∗ tempBins + i ]∗ x [ i ] ;
22 x [ f ] = sum∗ r T r a n s i e n t M a t r i x O v e r D i a g o n a l [ f ] ;
23 / ∗ . . . ∗ /
24 }
25 }
Figure B.19: Padding loop count to a multiple of 16 to improve the efficiency
of automatically vectorized loops.
19
 0
 16
 32
 48
 0  16  32  48
Fi
na
l l
ev
el
 f
Initial level i
Unoptimized Loop Pattern
Heating terms
Calculation Pattern
Main Diagonal
 0
 16
 32
 48
 0  16  32  48
Fi
na
l l
ev
el
 f
Initial level i
Optimized Loop Pattern
Heating terms
Calculation Pattern
Main Diagonal
Figure B.20: Pattern of nested loops in f and i in the first example in Figure B.19 before and after optimization. The optimized loop pattern always has a multiple
of 16 iterations in the inner vectorized loop, which is beneficial for performance.
ure B.10 is a multiple of 16, and
b) if interpolationLogOffset[0] is aligned on a 64-byte
boundary, then in all instances of the c-loop, the ele-
ment interpolationLogOffset[qCtr+0] is also 64-byte
aligned.
Note that the GNU compiler was not able to vectorize this loop.
The padding of interpolationCount to a multiple of 16 de-
grades the performance with GCC, because it increases the
number of loop iterations. In order to avoid performance degra-
dation, we restrict the padding of this array to the Intel compiler
compilation by protecting the corresponding code with a check
for the preprocessor macro INTEL COMPILER.
Appendix B.6.3. Compiler Hints
Even if the programmer ensures that every instance of a vec-
tor loop operates on aligned data, the compiler still implements
runtime checks for alignment in order to maintain error-free ex-
ecution. However, it is possible to instruct the compiler to drop
these checks and assume the data aligned, which accelerates the
program execution. This is done by including the hint #pragma
vector aligned before the vectorized loop. If the data are
not aligned at runtime, the use of this pragma results in a seg-
mentation fault.
If the runtime length of some loops is known to the program-
mer, but cannot be determined by the compiler, #pragma loop
count may accelerate execution. Using the loop count sug-
gested by the developer, the compiler more efficiently chooses
the vectorization strategy.
Another useful compiler hint is #pragma simd, which
makes it mandatory for the compiler to vectorize the loop fol-
lowing the pragma. This is particularly helpful in combination
with loop tiling (see Appendix B.5), where the vectorized loop
is not the inner loop.
The hint #pragma vector nontemporal can be used in
order to perform vector write operations in a loop directly
into the main memory, bypassing the cache. This is use-
ful when the output data of a vector loop are not used until
much later in the code, and there is no reason to store them
in a cache. Even though nontemporal write instructions incur
eviction stalls immediately, the overall application performance
may be improved by using #pragma vector nontemporal,
because caches are not contaminated with temporary data.
In optimization step 6, we supplement the code, where ap-
propriate, with the compiler hints described above. Figure B.21
demonstrates some of the loops tuned with the vectorization
pragmas. These pragmas are specific to the Intel C++ com-
piler, and GCC does not recognize them. We protect the
vectorization hints with a check for the preprocessor macro
INTEL COMPILER in order to avoid warnings with GCC.
Appendix B.7. Offload Traffic Reduction: Data Persistence
and Memory Retention on the Coprocessor
In optimization steps 1 through 6 (Appendix B.1 through
Appendix B.6) we focused on optimizing the calculations,
whether they are running on the host system or on the copro-
cessors. As a result, the execution time of the optimized code
has decreased enough that the set-up time of the calculations
has become significant. In this optimization step, we reduce the
set-up time for calculations offloaded a coprocessor.
HEATCODE uses Intel Xeon Phi coprocessors in the explicit
offload mode. In this mode, before a function is executed on
20
1 / ∗ C a l c u l a t i o n o f bMat r ix wi th loop bound padd ing ∗ /
2 cons t i n t uB = ( f − 1) + (16 −( f −1) %16) − 1 ;
3 cons t i n t iMax = ( uB<=tempBins −1?uB : tempBins −1) ;
4
5 # i f d e f INTEL COMPILER
6 #pragma v e c t o r a l i g n e d / ∗ No a l i g n m e n t check ∗ /
7 #pragma l oop c o u n t min ( 1 6 ) / ∗ Expec ted loop c o u n t ∗ /
8 # end i f
9 f o r ( i n t i = 0 ; i <= iMax ; ++ i ) {
10 rSum [ i ] += bMat r ix [ f ∗ tempBins + i ] ;
11 bMat r ix [ f ∗ tempBins + i ] = rSum [ i ] ;
12 }
13
14 / ∗ L a t e r : c a l c u l a t i o n o f e m i s s i v i t y , l oop t i l i n g ∗ /
15 # i f d e f INTEL COMPILER
16 #pragma v e c t o r a l i g n e d / ∗ No a l i g n m e n t check ∗ /
17 #pragma simd / ∗ V e c t o r i z e o u t e r l oop ∗ /
18 # end i f
19 f o r ( i n t k = 0 ; k < tempBins ; ++k )
20 f o r ( i n t c = 0 ; c < i T i l e ; c++) {
21 r e s u l t [ ( 0 ) ∗ i T i l e +c ]+= d i s t r i b u t i o n [ ( j j +0) ∗
tempBins+k ]∗ p l a n c k D i s t r i b u t i o n [ ( i i +c ) ∗
tempBins+k ] ;
22 r e s u l t [ ( 1 ) ∗ i T i l e +c ]+= d i s t r i b u t i o n [ ( j j +1) ∗
tempBins+k ]∗ p l a n c k D i s t r i b u t i o n [ ( i i +c ) ∗
tempBins+k ] ;
23 / ∗ . . . ∗ /
24 }
Figure B.21: Pragma hints indicate to the compiler how to best choose the
automatic vectorization strategy for a loop.
the coprocessor, all the data that the function accesses must be
transferred across the PCIe bus from the host to the coproces-
sor. Which data structures are sent, and in what direction, is
controlled by the clauses of #pragma offload, which initi-
ates the offload. By default, for each array transferred to the
coprocessor, memory is allocated on the coprocessor, the data
are copied across the PCIe bus, and then the calculation is run.
At the end of the offload region, the output data are sent back to
the host, and memory on the coprocessor is deallocated. How-
ever, this behavior may be modified to retain the data and/or
memory on the coprocessor between offloads. This may result
in a significant decrease in the set-up time for two reasons.
a) The rate of data transfer is limited by the PCIe bandwidth,
which, for large arrays, reaches 6 to 7 GB s−1.
b) However, an even more severe limitation is the rate of mem-
ory allocation on the coprocessor. Dynamic memory allo-
cation is an inherently sequential operation, and the low-
frequency cores of the coprocessor have an effective alloca-
tion rate of only around 0.5 GB s−1 (see [21]).
Because all calls to the stochastic heating and emissivity cal-
culation use the same common data, such as grain absorption
cross sections, precomputed interpolation pattern, grids, and
other, it makes sense to retain these data between offloads in
order to reduce the set-up time. In order to retain the memory
allocated for an array, the clause free if(0) is used, and in
order to avoid duplicate memory allocation, alloc if(0) is
placed in the list of clauses of #pragma offload. The copy-
ing of data can be avoided by specifying a zero length of the
offloaded array, i.e., length(0) if the data had already been
transferred to the coprocessor.
The situation is somewhat complicated by the fact that the
encompassing ISRF calculation may use multiple grain mod-
els. However, the offload runtime library is able to deal with
multiple instances of arrays with the same name persistently
stored on the coprocessor. It manages persistent data by storing
the mapping between host pointers to all offloaded arrays and
the respective persistent arrays on the coprocessor.
Another offload-related optimization is the retention of mem-
ory for the array of incident radiation fields and the com-
puted stochastic emissivities. These data are not the same in
all calls to CalculateTransientEmissivityXeonPhi, be-
cause the user or the encompassing ISRF calculation is ex-
pected to call the stochastic heating and emissivity calcula-
tion for different sets of radiation fields. However, as men-
tioned earlier, the effective rate of memory allocation on co-
processors is an order of magnitude lower than the data trans-
fer bandwidth. Therefore, it is still beneficial to retain the
memory allocated for the incident radiation fields and emissiv-
ities, even if the data have to be transferred and copied into
this memory in every offload instance. In order to allow arbi-
trary number of radiation fields to be passed to the function to
CalculateTransientEmissivityXeonPhi (only limited by
the memory capacity of the coprocessor), we implement mem-
ory management for the radiation field container array with dy-
namic expansion of the containers when necessary.
Appendix B.8. Heterogeneous Computing: Concurrent Usage
of Multiple Compute Devices
We have optimized all aspects of HEATCODE to achieve
good performance on the host multi-core system and on a single
coprocessor. The performance on a single coprocessor (see Sec-
tion 3.2) is 1.9x greater than the performance on two host pro-
cessors. This number, given approximately equal power con-
sumption in the host and in the coprocessor, reflects the perfor-
mance per watt improvement. However, for practical applica-
tions, we are interested in completing a calculation in the short-
est possible time, which requires using all system resources in
tandem. That said, the application must be able to utilize the
host simultaneously with the coprocessor or multiple copro-
cessors if they are present. This functionality is extremely im-
portant because hardware system configurations available today
can host up to eight Intel Xeon Phi coprocessors in a single host
server. We describe the implementation of heterogeneous and
multi-coprocessor HEATCODE calculations in this section.
The utilization of multiple compute devices is essentially
an additional level of parallelism. By compute device in this
discussion we mean either an Intel Xeon Phi coprocessor, or
the host processor. As with any task-parallel architecture, the
scheduling of work across compute devices becomes important.
Scheduling in HEATCODE must be dynamic, i.e., work must
be distributed at runtime. This is motivated by two factors:
1) The computing system is heterogeneous if the host and the
coprocessor concurrently participate in calculations. In a
21
  
Chunk 0 Chunk 1 Chunk 2 Chunk 3
Incident Radiation Fields
Thre
ad 0:
 com
pute 
on ho
st
Thread 1: 
offload to 1-st 
coprocessor
Thread 2: offload to 2-nd coprocessor
#pragma omp parallel for schedule(dynamic,1) n_threads(nComputeDevices)
for (int m = 0; m < nChunks; ++m) {
  iDevice = omp_get_thread_num();
  /* … */
omp_set_nested(1);
  #pragma offload target(mic:iDevice) if (iDevice>0) \
  /* … offload specifiers for chunk m … */
/* Same code for the host and for the coprocessor */
omp_set_num_threads(threads[iDevice]); /* 32 threads for host, 236 threads for coprocessor */
#pragma omp parallel for schedule(dynamic)
for (i=chunkStart; i<chunkEnd; i++) /* Inner parallel loop over the spectra in chunk m */
   /* … calculation for chunk m */
Figure B.22: Using nested parallelism and loop scheduling in OpenMP in order to dynamically schedule chunks of work on all available compute devices, including
the host system and the coprocessors.
heterogeneous system, work must be distributed proportion-
ally to the relative performance of compute devices. Other-
wise some devices have to idle, awaiting the completion of
work by other devices.
2) Different incident radiation spectra may take different
amounts of time to process into stochastic emissivities, be-
cause the code uses pruning to exclude heating terms or
whole spectra not interacting with grains in the stochastic
heating mode (see Appendix B.1).
In order to utilize the additional level of parallelism (multi-
ple compute devices), we use the blocking offload syntax and
perform multiple concurrent offloads from the corresponding
number of parallel threads on host. The loop scheduling func-
tionality in OpenMP library takes care of scheduling. When
multiple compute devices are used to calculate the stochastic
grain emissivity for n incident radiation fields, HEATCODE
splits these n tasks into multiple contiguous chunks (i.e., sets of
radiation fields). Then the application spawns a thread-parallel
region on the host with as many threads as there are compute de-
vices. Each thread in the parallel region is associated with one
of the compute devices. In computing systems enabled with
Intel Xeon Phi coprocessors, the host system typically has no
less than 8 logical cores (usually up to 32), but no more than
8 coprocessors (usually 1 or 2). Therefore, we expect that all
threads in this outer parallel region start simultaneously. The
OpenMP threads corresponding to compute devices are then
used to run a parallel for-loop, which iterates over all chunks.
The loop scheduling mode is set to dynamic scheduling with a
grain size of 1. In this mode, initially, one chunk from the pool
is assigned to each thread, which then launches a blocking cal-
culation for this chunk on the respective compute device. When
one of the threads (equivalently, one of the compute devices)
completes the work and unblocks, another chunk from the pool
is assigned to this thread (equivalently, to the corresponding
compute device) by the OpenMP scheduler. This process goes
on until all chunks are processed.
The offloaded calculation spawns its own parallel loop,
which distributes the calculations of all radiation fields in the
current chunk across the cores of the device. This parallel loop
uses as many threads as there are available logical cores on the
respective compute device.
The host CPU can be used as a compute device, too, by map-
ping one of the threads to the host CPU. In order to use the
OpenMP loop that schedules across devices and, at the same
time, spawn OpenMP threads on the host processor for calcula-
tions, nested parallelism in OpenMP must be enabled.
The work scheduling technique described above is illustrated
in Figure B.22.
Appendix B.9. Runtime Environment
In order to achieve maximum performance, some settings of
the OpenMP and offload runtime library have to be set to non-
default values.
Compute-bound workloads, such as the stochastic heat-
ing and emissivity calculation discussed here, usually ben-
efit from binding OpenMP threads to physical or logical
cores of the processor to prevent thread migration across
cores. For HEATCODE, we recommend setting such bind-
ing using the OpenMP thread affinity interface. On the
host, prior to starting the calculation, the Linux envi-
ronment variable KMP AFFINITY must be set by the user
22
to the value “compact,granularity=fine” (hereafter, the
quotes must not be included in the value of the envi-
ronment variables). This sets the OpenMP thread affin-
ity on the host that binds OpenMP threads to the respec-
tive logical cores, and places threads with consecutive num-
bers as close to each other as possible. Additionally, the
environment variable MIC ENV PREFIX may be set to the
value “MIC”, the variable MIC KMP AFFINITY to the value
“compact,granularity=fine”, and MIC PLACE THREADS to
“59C,4t” (for a 60-core coprocessor). These settings effect the
same type of OpenMP affinity on all Intel Xeon Phi coproces-
sors in the system, ensuring that one core dedicated to OS-level
offload tasks does not get loaded with computational work.
Another runtime optimization that improves the coprocessor
performance is using large buffers for offloading data to copro-
cessors. In order to use this feature with HEATCODE, the en-
vironment variable MIC USE 2MB BUFFERS on the host must be
set to the value “64K”, which automatically uses large buffers
for all offloaded arrays exceeding 64 kilobytes in size.
References
[1] J. L. Puget, A. Leger, F. Boulanger, A&A 142 (1985) L19–L22.
[2] P. Guhathakurta, B. T. Draine, ApJ 345 (1989) 230–244.
[3] R. Siebenmorgen, E. Kruegel, J. S. Mathis, A&A 266 (1992) 501–512.
[4] B. T. Draine, A. Li, ApJ 551 (2001) 807–824.
[5] K. A. Misselt, K. D. Gordon, G. C. Clayton, M. J. Wolff, ApJ 551 (2001)
277–293.
[6] P. Jonsson, MNRAS 372 (2006) 2–20.
[7] P. Jonsson, Sunrise: Radiation transfer through interstellar dust, 2013.
Astrophysics Source Code Library.
[8] T. P. Robitaille, E. Churchwell, R. A. Benjamin, B. A. Whitney, K. Wood,
B. L. Babler, M. R. Meade, A&A 545 (2012) A39.
[9] F. Heymann, R. Siebenmorgen, ApJ 751 (2012) 27.
[10] Nvidia Corporation, CUDA Zone, http://developer.nvidia.com/cuda (last
accessed May 13, 2013), 2013.
[11] Intel Corporation, Intel Developer Zone: Intel Xeon Phi Coproces-
sor, http://software.intel.com/mic-developer (last accessed May 13, 2013,
2013.
[12] B. T. Draine, Physics of the Interstellar and Intergalactic Medium, 2011.
[13] J. S. Mathis, W. Rumpl, K. H. Nordsieck, ApJ 217 (1977) 425–433.
[14] J. C. Weingartner, B. T. Draine, ApJ 548 (2001) 296–309.
[15] V. Zubko, E. Dwek, R. G. Arendt, ApJS 152 (2004) 211–249.
[16] A. Li, B. T. Draine, ApJ 554 (2001) 778–802.
[17] J. S. Mathis, P. G. Mezger, N. Panagia, A&A 128 (1983) 212–229.
[18] Intel Corporation, Intel Xeon Phi Coprocessor System Software De-
velopers Guide, http://software.intel.com/en-us/articles/intel-xeon-phi-
coprocessor-system-software-developers-guide (Last accessed July 23,
2013), 2013.
[19] Intel Corporation, Intel Xeon Phi Prod-
uct Family Performance, Rev. 1.1, 2/15/13,
http://www.intel.com/content/dam/www/public/us/en/documents/performance-
briefs/xeon-phi-product-family-performance-brief.pdf (last accessed Apr
16, 2013), 2013.
[20] J. Reinders, An Overview of Programming for In-
tel Xeon Processors and Intel Xeon Phi Coprocessors,
http://software.intel.com/sites/default/files/article/330164/an-overview-
of-programming-for-intel-xeon-processors-and-intel-xeon-phi-
coprocessors.pdf (last accessed Apr 16, 2013), 2012.
[21] Colfax International, Parallel Programming and Optimization with Intel
Xeon Phi Coprocessors, ISBN: 978-0-9885234-1-8, 2013.
23
