Cosmic dust particles effectively attenuate starlight. Their absorption of starlight produces emission spectra from the near-to far-infrared, which depends on the sizes and properties of the dust grains, and spectrum of the heating radiation field. The nearto mid-infrared is dominated by the emissions by very small grains. Modeling the absorption of starlight by these particles is, however, computationally expensive and a significant bottleneck for self-consistent radiation transport codes treating the heating of dust by stars. In this paper, we summarize the formalism for computing the stochastic emissivity of cosmic dust, which was developed in earlier works, and present a new library HEATCODE implementing this formalism for the calculation for arbitrary grain properties and heating radiation fields. Our library is highly optimized for general-purpose processors with multiple cores and vector instructions, with hierarchical memory cache structure. The HEATCODE library also efficiently runs on co-processor cards implementing the Intel Many Integrated Core (Intel MIC) architecture. We discuss in detail the optimization steps that we took in order to optimize for the Intel MIC architecture, which also significantly benefited the performance of the code on general-purpose processors, and provide code samples and performance benchmarks for each step. The HEATCODE library performance on a single Intel Xeon Phi coprocessor (Intel MIC architecture) is ∼ 2 times a general-purpose two-socket multicore processor system with approximately the same nominal power consumption. The library supports heterogeneous calculations employing host processors simultaneously with multiple coprocessors, and can be easily incorporated into existing radiation transport codes.
Introduction
Astrophysical objects can be difficult to study in the ultraviolet (UV) and optical wavebands because of the effective attenuation of the visible light by cosmic dust particles. To understand the broadband electromagnetic emissions from them therefore requires radiation transfer (RT) calculations that selfconsistently account for the dust absorption and scattering of the UV/optical light, and subsequent re-emission of the absorbed radiation at near infrared (NIR) to far infrared (FIR) wavelengths. These calculations can be time consuming, particularly for determining the dust re-emission spectrum because it depends on the sizes of the emitting grains and their associated properties.
Dust grains with sizes 1 micron typically attain a thermal equilibrium with the heating radiation field because the timeaveraged vibrational energy of a grain is greater than the energy of the heating photons. The emissivity for such dust particles can be straightforwardly obtained by balancing the absorption and re-emission rates. However, for smaller grains attaining thermal equilibrium is often not the case because the absorption of a photon results in stochastic heating temperature spikes.
Email addresses: tporter@stanford.edu (Troy A. Porter 1 ), avladim@galprop.stanford.edu (Andrey E. Vladimirov 1, 2 ) Most of the energy of the grain is re-emitted in these events and no thermal equlibrium is attained. Consequently, stochastic grain heating cannot be modeled using methods that assume thermal equilibrium.
Methods for treating the stochastic heating of very small dust grains have been developed [e.g., 1, 2, 3, 4, 5] . However, in RT simulations of systems with a fine spatial voxelization, the calculation of the stochastic grain emissivity for each cell can be prohibitively computationally expensive. A typical approximation used to speed up these calculations involves pre-calculating a table of emission spectra corresponding to different spectra of a parameterized synthetic heating radiation field [e.g., 6, 7, 8] , and interpolating within the table on the basis of the incident total energy density of the heating radiation field in each simulation cell to obtain the stochastic heating emissivity. While this approximation has in some cases been sufficient, it does not allow a fully self-consistent treatment for the small grain heating.
The use of graphics cards to provide improvements in speed has been explored for Monte Carlo RT simulations by different authors [e.g., 6, 9] . For the former, this usage was restricted to increasing the speed of the calculations for grains in thermal equilbrium with the heating radiation field using Nvidia's CUDA architecture [10] . For the latter, the calculations for the stochastic and thermal equilibrium heating of grains were made employing CUDA. However, a significant effort is often required to recode algorithms in the C-like CUDA language to exploit the hardware. An attractive alternative is the recently released Intel Many Integrated Core (MIC) architecture [11] . A significant advantage of the MIC architecture allows the same programming frameworks and optimization methods as those employed for multi-core general purpose processors 1 . Therefore, the same code in a high-level programming language (C, C++ or Fortran) can be used to perform calculations on common general-purpose processors and coprocessor cards that implement the MIC architecture.
In this paper we describe the implementation and optimization of a common method [4] for treating the stochastic heating of small dust grains that is based on a matrix formalism. We provide details for a general-purpose processor target as well as the MIC architecture. Our goal was to make the stochastic grain heating emissivity calculation for an arbitrary radiation field efficient enough to make it feasible for fully self-consistent calculations in simulations with high spatial voxelizations, while using essentially the same code for the general-purpose processors and the additional co-processor accelerators. We show that significant effort must be invested into the optimization of the calculation to utilize the available compute power of the MIC architecture. However, we find that most optimizations for the MIC architecture also improve the performance of the algorithm on the host system using general-purpose processors. In contrast, with GPGPU 2 -like architectures the accelerator code is generally distinct from the processor code, and the architecture-level optimization of the application must be done independently for the CPU code and for the GPGPU code.
The result of the work presented in this paper is a new astrophysical tool, the HEATCODE 3 library, which efficiently computes the stochastic grain emissivity for arbitrary heating radiation fields both for general-purpose CPUs and co-processor cards implementing the Intel MIC architecture. In Section 2 we summarize the theoretical model for stochastic emissivity calculation. Section 3 summarizes the performance improvement measures that were taken in order to optimize HEATCODE for the multi-core and many-core architectures and demonstrates the performance on a benchmark server equipped with coprocessors featuring the Intel MIC architecture. We summarize our findings in Section 4. Appendix A describes the user interface of the HEATCODE library and outlines the structure of the code. HEATCODE can be obtained freely from the CPC Program Library. In Appendix B the optimization steps are described in detail. 1 The term "general purpose processor" is used throughout this paper as a synonym for the terms CPU (Central Processing Unit), "host", and simply "processor". The term "coprocessor" is used interchangeably with "Intel Xeon Phi coprocessor" to denote an accelerator card implementing the Intel MIC architecture. 2 General-Purpose Graphics Processing Units 3 HEATCODE is an acronym for HEterogeneous Architecture library for sTochastic COsmic Dust Emissions
Stochastic Heating of Very Small Dust Grains
Cosmic 'dust' is a mixture of particles of different sizes and compositions; see the book by Draine [12] for an overview. For a given grain species, e.g., silicates, graphitic grains, there is a distribution of sizes for the particles with the optical properties depending on the grain species. The size distribution is, in general, unknown but parameterizations for various grain mixtures have been developed to explain features in the extinction curves for the Milky Way and other galaxies, e.g., [13, 14, 15] . HEATCODE can take the size distribution and optical properties as user-supplied inputs. However, for convenience we have included the size distributions from the authors referenced above, along with the grain absorption cross sections for silicate, graphite, and polycyclic aromatic hydrocarbons (PAHs) from B. Draine's website 4 ; our calculations below use the PAHs/graphite/silicate mixture and optical properties from Li and Draine [16] and the size distributions for Milky Way dust from Weingartner and Draine [14] .
For a given dust species, the heating radiation field is an input parameter supplied to the HEATCODE library, and the spectrum of the re-emitted photons is the output of the calculation. This is obtained by first calculating the absorption of stellar light and the spectrum of re-emitted light produced when transient excited states of the grains relax for each grain size. Determining the re-emitted spectrum for each grain size involves two steps. The temperature (or vibrational energy E) distribution of grains in response to the incident radiation spectrum is first computed. We assume static equilibrium for this calculation, i.e., the stochastic grain heating is assumed to be exactly balanced by the grain relaxation with re-emission of photons. This assumption is valid as long as the time scale of incident light variability is long compared to the relaxation times of the transient excited grain states. Then, given the temperature distribution, the spectrum of re-emitted light for each grain size is calculated. The output stochastic emission spectrum for the ensemble of grains is obtained by convolution of the individual grain size emission spectra with the grain size distribution.
Calculation of the Temperature Distribution
Guhathakurta and Draine [2] proposed a matrix formalism for the treatment of the interaction between dust grains and photons that was further elaborated by Draine and Li [4] . In their methodology, the temperature distribution of radiatively heated dust grains is described by P, the "state vector". The elements of this vector P i are the probabilities for the grains of the given size to be in the energy bin i. Energy bins span the range [E min , E max ], and we assume M + 1 narrow energy bins (∆E E), so i = 0, 1, . . . M. The temperature distribution P is obtained by solving the system of linear algebraic equations that expresses balance between the population and depopulation rates of each energy level,
with the normalization condition
Here the matrix elements T i j are the probabilities per unit time for a grain in energy bin j to transition to one of the states in bin i. Draine and Li [4] describe the exact statistical treatment for the cooling transitions (i < j) as well as less computationally demanding thermal approximations for the corresponding cooling terms. The thermal approach introduces the notion of grain temperature to approximate calculation of the downward transition rates. The "thermal discrete" approximation allows discrete downward transitions to the same energy levels as the exact statistical treatment. The "thermal continuous" approximation assumes that the only allowed spontaneous transition from an excited vibrational level i is to the adjacent level i − 1:
This approximation has been shown by Draine and Li [4] to be accurate for grains comprised of as few as 30 atoms and allows for a fast solution method compared to inverting the whole system of equations for the other methods. The HEATCODE library uses the "thermal continuous" approximation because it allows the computational cost of solving for the temperature distribution P to be considerably reduced while achieving reasonably accurate results. The solution of the system of equations (1), (2) may seem straightforward. However, the diagonal elements of the matrix corresponding to the system contain a difference of positive numbers with nearly equal values. The solution can be unstable because of finite numerical precision. Guhathakurta and Draine [2] suggested that to minimize this effect on the solution, the system of equations (1) can be transformed by adding to every equation all preceding equations. This opreation gives the following system:
where
Now the temperature distribution P can be determined by introducing a temporary vector X ≡ P/P 0 so that X 0 = 1. Other values of X can be computed recursively:
Then the values of P are computed by re-normalization in order to satisfy Equation (2):
Heating and Cooling Rates
Heating transition rates are proportional to the incident light energy density W(λ) at the wavelength corresponding to the transition energy ∆E ul , and to the photon absorption cross section at this wavelength σ(λ), which is a quantum mechanical property of the grains. The heating rate is calculated as
The numerical wavelength grid on which W(λ) and σ(λ) are defined can map to an energy grid different from the grid used to represent the grain excitation levels. Therefore, in the numerical solution, the quantity W(λ)σ(λ) ≡ Ω(λ) in Equation (8) must be computed using interpolation:
Here λ j−1 and λ j are, respectively, the lower and upper boundaries of the wavelength bins at which the energy density W and cross section σ are defined. The rate of spontaneous relaxation to the adjacent level with photon emission is calculated as
This rate does not depend on the incident radiation field, and therefore the convolution of the absorption cross section with the energy distribution function in Equation (10) can be precomputed.
Calculation of the Grain Emission Spectrum
The emission spectrum for grains of size a at wavelength ν = c/λ can be calculated using the following expression based on Equation (56) in [4] :
The contribution of stimulated emission is neglected in this expression. The combined emissivity from grains of all sizes is obtained by integration over the grain size distribution Q(a):
where a min and a max constrain the range of grain sizes considered in the model. The maximum size a max is chosen so that for a > a max , the radiative grain heating process is no longer stochastic, and can be treated in the thermal equilibrium approximation. 
Example
An example of the output of HEATCODE is given in Figure 1 , where the stochastic grain emissivity is computed for three heating radiation fields. The heating radiation fields are plotted with solid red lines in each panel of the figure, and correspond to: (top) a blackbody spectrum at 4,000 K, (middle) a blackbody spectrum at 10,000 K, and (bottom) the interstellar radiation field (ISRF) from Mathis et al. [17] . The two blackbodies are normalized to the same energy density as the spectrum from Mathis et al. [17] , 0.46 eV cm −3 . (For comparison with the heating radiation field energy densities, the Cosmic Microwave Background 2.735 K blackbody spectrum is shown in the bottom panel.) The calculated emissivity is converted to an equivalent energy density using the full solid angle and a gas column density N H = 10 22 cm −2 . This conversion allows us to depict the incident radiation field and the dust-emitted light on the same plot for illustrative purposes. Multiplication by the above mentioned gas column density is an optically thin approximation of the radiative transport throughout the Galaxy, assuming that a) the local radiation field is characteristic of the radiation field across the Galaxy, and b) infrared light is not attenuated by absorption.
Dashed and dash-dotted curves correspond to the emissions from the three grain models included in the library: graphitic, polycyclic aromatic hydrocarbons (PAH), and silicate grains.
Performance Results
The HEATCODE library can be used in self-consistent simulations that compute the stochastic heating and emissivity of small dust grains in every simulation cell using the in-cell heating radiation field. Compared to an implementation of the algorithm relying solely on standard compiler optimizations and parallelization (e.g., the GNU C++ compiler using OpenMPsee Section 3.2) we obtain almost a factor of 100 times speedup. Achieving this performance is made possible by a) Parallelization of calculations by distributing a set of incident radiation spectra across multiple processor cores; b) Tuning various aspects of code performance for the generalpurpose CPU architecture and for the Intel MIC architecture; c) Relying on the Intel C++ compiler to perform automatic vectorization and other architecture-specific optimizations; d) Performing the offload of the whole calculation or a part of it to Intel Xeon Phi coprocessors, if the computing system is enabled with them. Offload leads to additional speedup if the set of incident radiation fields is large enough.
This section summarizes the optimizations applied to produce HEATCODE and reports the benchmarks of performance. For more detail on the optimization of stochastic heating and emissivity calculation, refer to Appendix B.
Performance Tuning
HEATCODE is optimized for general-purpose processors, and is able to run on Intel Xeon Phi coprocessors if they are available. The performance tuning methods used in HEATCODE apply to CPUs and Intel Xeon Phi coprocessors alike. But, in this section we emphasize the optimization procedure for the coprocessors because of the novel nature of the architecture.
Intel Xeon Phi coprocessors that we used for benchmarks in the present paper contain 60 independent cores (technical specifications stated hereafter are obtained from [18] ). These cores have a low clock frequency (∼1 GHz) and four-way hyperthreading. Each core relies on a dedicated vector arithmetics unit to maintain high performance with low power consumption. The vector units support 512-bit vector registers for SIMD (Single Instruction Multiple Data) operations 5 . The SIMD instruction set supported by Intel Xeon Phi coprocessors is called IMCI (Initial Many-Core Instructions). IMCI includes floatingpoint multiplication, addition, fused multiply-add and division, some transcendental instructions, such as exponentials, logarithms and trigonometric functions, and a range of bitwise logical operators, shuffle, gather/scatter memory accesses and bitmasked versions for most operations.
Onboard memory in Intel Xeon Phi coprocessors used for benchmarks in the present work comes in the form of 8 GB of 5 For comparison, the SSE registers supported by most general-purpose CPUs prior to the year 2011 are 128-bit long, and the AVX (Advanced Vector Extensions) registers supported by the Intel Sandy Bridge and AMD Bulldozer architectures are 256-bit long. GDDR5 (Graphics Double Data Rate) with bandwidth capabilities typically exceeding that of the current generation CPUbased hosts. The onboard memory is cached. The Level-2 caches of each core are linked with a ring interconnect. This effectively merges local Level-2 caches of the cores into an aggregate coherent Level-2 cache of the coprocessor with nonuniform access and a typical access time of 11 cycles. In addition, each core has a Level-1 cache with a typical access time of 1 cycle. Coherency is maintained in all caches at the hardware level. This means that after one core modifies data in a cache line, all other cores are guaranteed to see this modification when they access this cache line at a later time.
There are five major optimization methods used in HEATCODE appropriate for the MIC architecture, as described below.
Thread parallelism improvement
With 60 × 4 = 240 logical cores in the Intel Xeon Phi coprocessor, the performance-critical part of the application must have sufficiently large iteration space for work distribution, perform little synchronization between threads, and incur a small enough per-thread memory footprint. In HEATCODE, the first two issue are addressed by implementing thread parallelism in the following way: a) the stochastic heating and emissivity for a single incident radiation field is calculated serially (i.e., by only one thread) b) the library interface allows to call the emisivity calculation on a set of radiation fields corresponding to distinct voxels in the encompassing RT simulation c) for processing a large number of radiation fields, multiple OpenMP threads are spawned, and each thread processes its share of radiation fields d) dynamic loop scheduling in OpenMP is used to maintain load balance across threads
As long as the number of input radiation fields supplied by the user is significantly greater than the number of threads (240), all coprocessor cores are utilized. This model of parallelism requires no synchronization between threads, because each radiation field is processed independently. The third issue, the memory footprint of HEATCODE, is addressed by optimizing the re-use of scratch space in each thread, so that, even with 240 threads, the application fits in memory. This allows HEATCODE to employ all of the logical cores, thus maximising the computational capacity of the co-processor.
Vectorization refinements
Within each thread, the calculation can be further parallelized through vectorization, i.e., by utilizing SIMD instructions of the computing platform. Vector instructions apply the same arithmetical or logical operation to multiple numbers, thus multiplying the arithmetic performance by a factor equal to the width of the vector register. The 512-bit vector registers in the MIC architecture can fit 16 single precision floating-point numbers or 8 double precision numbers, which corresponds to potential speedups of 16x and 8x, respectively, compared to scalar (non-SIMD) code.
We chose the most portable way to utilize vectorization in HEATCODE: automatic vectorization by the compiler. Automatic vectorization in the Intel C++ compiler is enabled by default at high optimization levels. With this functionality, the compiler analyzes regular C++ loops, partitions them in blocks of 16 (or 8) iterations, and produces an assembly code that executes these blocks with SIMD instructions. This occurs automatically and usually does not require the programmer's involvement. However, we assisted the compiler by designing the C++ code of HEATCODE with the following properties a) Data in automatically vectorized loops are accessed with unit stride (i.e., contiguously) b) The constructor of the HEATCODE data containers allocates memory structures so that the address of the first element of each array is a multiple of 64 (i.e., the data are 64-byte aligned), which enables faster loading of data into SIMD registers. c) The inner dimensions of multidimensional arrays are a multiple of 64 bytes, so that the first element of every row is 64-byte aligned. d) Most vector loops in HEATCODE are constructed so that number of iterations is a multiple of 16 (the number of iterations was padded to a multiple of 16 when possible) e) Compiler hints in the form of vectorization pragmas were added to some vector loops in order to inform the compiler of data alignment, and to enable automatic vectorization in situations where the compiler could not detect whether vectorization is safe.
With automatic vectorization, the C++ files of HEATCODE can be compiled into exectutable code that runs with SIMD instructions on regular CPUs, as well as on Intel Xeon Phi coprocessors. An alternative way to employ SIMD instructions would be through special functions called compiler intrinsics. Intrinsics provide direct access to SIMD functionality, much like inline assembly. However, sets of SIMD instructions available in CPUs and in the Intel MIC architecture are different. This means that with intrinsics, the performance-critical portion of HEATCODE would have to be written and optimized for every architecture independently.
Scalar optimizations
The thread-parallel, vectorized algorithm of the stochastic emissivity calculation in HEATCODE can be compiled to perform in serial (i.e., single-threaded), scalar (i.e., non-vector) mode. The more optimized that underlying scalar algorithm is, the better performance results are obtained when vectorization and thread parallelism are enabled. Scalar optimizations employed in HEATCODE are common for the many-core and multi-core architecture: a) All floating-point calculations are performed in single precision; floating-point over/underflows are avoided by renormalizing the data where necessary. The algorithm with re-normalization is able to avoid floating-point exceptions as long as the input heating radiation fields do not have integrated energy densities orders of magnitude larger than the typical ISRF value of ≈ 1 eV cm −3 Otherwise, the ratio between x f and x f −1 in Equation (6) may be too large to handle in single precision b) Optimized implementations for transcendental functions and strength reduction (substitution of multiple inexpensive operations for a single expensive operation) in arithmetic expressions are used. Appendix B.4 describes this in more detail. c) In the constructor of the main library object, class
TransientHeatingModelXeonPhi, precomputation of some quantities (e.g., the radiation field interpolation parameters) is performed. The precomputed quantites are used during the evaluation of the stochastic heating and emissivity of dust (see Appendix B.4).
Memory and cache traffic streamlining
Main memory accesses in most computing architectures are more time-consuming than arithmetic operations. This disparity is usually alleviated by storing a limited amount of data in a smaller, but faster memory known as cache. The minimum amount of data that can be written to or read from a cache is called a cache line. Cache lines in Intel Xeon processors and Xeon Phi coprocessors are 64 contiguous bytes long. In order to make the most out of caches, HEATCODE is optimized in the following ways: a) Data access in loops is made contiguous where possible. In addition, in order to avoid scattered memory access in the radiation field interpolation routine, the interpolation parameters for each wavelength bin are precomputed and stored in a dense array (see Appendix B.3). These optimizations rely on the property that when one element of a cache line is accessed, the whole cache line with adjacent elements is served by the cache. Additionally, because of the predictability of streamlined access pattern, data can be automatically prefetched by the cache before it is even requested by the processor. Contiguous memory access and the packing of data into dense arrays improve what is called data access locality in space. b) Nested loops are tiled (see Appendix B.5), and consecutive loops of the same form are fused into one loop (see Appendix B.2). The effect of these optimizations is that when a portion of data is used multiple times in an algorithm, it is re-used as soon as possible after the previous usage, while it is still in cache. These methods, nested loop tiling and loop fusion, improve what is known as data access locality in time.
3.1.5. Communication with the coprocessor HEATCODE utilizes Intel Xeon Phi coprocessors using the offload programming model. In this approach, the application runs on the host system (i.e., on the CPU), where it initializes the physics data (cross sections, precomputed quantities related to the radiation field interpolation, Planck emissivity spectra, etc.) and accepts the input radiation fields from the user application. With the data prepared on the host, HEATCODE communicates with the coprocessor driver and initiates offload (i.e., transfer) via the PCIe (Periperal Component Interconnect Express) bus of the necessary data and executable code to the coprocessor, where the emissivity calculation functions are executed. After that, the calculated emissivities are returned to the host system via the PCIe bus.
The design of HEATCODE allows offloaded work to be distributed across multiple coprocessors, or across coprocessors and host CPUs. This is done by splitting up the input array of incident radiation fields into chunks contatining multiple radiation fields, and offloading the next unprocessed chunk to the next available compute device. In this process, the emissivity calculation function is offloaded to the coprocessor multiple times with different input data. Because the physics data used by the calculation is the same in every offload, HEATCODE optimizes the offload process by storing the physics data on the coprocessor between offloads. In addition, to eliminate the overhead of memory allocation, a persistent buffer is allocated on the coprocessor to hold the (input) radiation field array and the (output) stochatic heating emissivity array.
The constructor of the main library object, class TransientHeatingModelXeonPhi, offloads all data related to the grain model or models to the coprocessor. The desctructor of TransientHeatingModelXeonPhi removes all persistent data and buffers from the coprocessor.
Benchmarks
For all benchmarks, we set up the runtime environment as described in Appendix B.9. The size of the problem is chosen as n = 10 4 radiation fields for benchmarks using a single compute device, and n = 10 5 for heterogeneous calculations. All radiation fields are set to the same spectrum for the benchmark, corresponding to the bottom panel of Figure 1 . The calculation time of all stochastic heating emissivities was measured 4 times, and the average of the last 3 is reported. Multiple runs were used to control that the statistical accuracy of performance does not exceed the last reported significant figure. Error bars are therefore omitted in all plots.
For the performance benchmarks, we used a server based on two Intel Xeon E5-2680 processors with 64 GB of DDR3 memory at 1,333 MHz. This server hosts two Intel Xeon Phi coprocessors B1QS-5110P, each with 60 cores and 8 GB of onboard GDDR5 memory. The server runs a Linux operating system CentOS 6.4 on the host and the MPSS (Intel MIC Platform Software Stack) version 2.1.4982-15 on coprocessors. Our tests were performed with one of the two compilers: Intel C++ compiler version 13.1.1 and GNU C++ compiler version 4.4.7. The former was used to benchmark the host, coprocessors, and the whole system in a heterogeneous calculation. The latter could be used only to benchmark the host processors, because the current GNU C++ compiler does not produce executable code for the Intel MIC architecture.
The performance of the optimized stochastic emissivity calculation by class TransientHeatingModelXeonPhi in HEATCODE is shown in Figure 2 . The performance values in the figure are normalized to the performance of the optimized code compiled with the Intel C++ compiler and running on the host CPU. Numerically, this normalization factor is 0.62 ms per evaluation of the stochastic emissivity function in a batch of n = 10 4 calculations. This performance is a factor of 95x greater than the performance of the unoptimized code (hereafter referred to as the baseline performance). The performance of optimized HEATCODE compiled with GCC is more than 3 times lower than with the Intel C++ compiler on the same platform.
The difference between the performance of the calculation on a single Intel Xeon Phi coprocessor and on host system is a factor of 1.9x. Note that this factor reflects the comparison of one Intel Xeon Phi coprocessor to a two-socket Intel Xeon processor. The reason for comparing against two processor sockets is that two processors consume approximately the same power under load as one coprocessor. The speedup of 1.9x is slightly lower than results of synthetic benchmarks SGEMM, SMP LINPACK and STREAM triad, which provide speedups of 2.9x, 2.6x and 2.2x, respectively [19] . However, we find the 1.9x performance gain acceptable, considering two factors:
a) The workload of HEATCODE has a latency-bound calculation phase with scattered memory accesses. This type of workloads is better processed by many-core procesors than by the MIC architecture coprocessors. b) The HEATCODE is expressed completely in a high-level programming language with tuning via compiler pragmas. This approach ensures cross-platform portability and forward compatibility. In contrast, some of the synthetic benchmarks mentioned above are tuned using low-level code with assembly instructions and/or intrinsic functions, which is not a portable solution.
The last bar in Figure 2 shows the performance of a heterogeneous calculation, which uses the host system simultaneously with two Intel Xeon Phi coprocessors. The performance gain of the heterogeneous calculation is 4.4x compared to the CPUonly calculation. This is approximately 10% lower than the sum of the raw performances of two coprocessors and host processors 1.9 + 1.9 + 1.0 = 4.8x. There is a performance loss caused by the unavoidable load imbalance across compute devices and the scheduling overhead of the heterogeneous calculation. In our benchmarking we used a problem size of n=10 5 spectra; for a greater problem size, the performance loss due to scheduling would be lower.
Summary and Discussion
We have presented HEATCODE, a new library for the calculation of the stochastic heating and emissivity of cosmic dust grains for arbitrary incident radiation fields. HEATCODE can be run on general-purpose processors. In addition, it is optimized to also efficiently run on Intel Xeon Phi coprocessors and in heterogeneous systems with general-purpose processors and MIC architecture coprocessors.
Overall, the optimized HEATCODE calculation running on our benchmarking system and utilizing the host CPUs together with two Intel Xeon Phi coprocessors processes 10 5 individual stochastic emissivity spectra in 14 seconds, on average 0.14 ms per evaluation. This is significantly faster than the unoptimized version (implemented in class TransientHeatingModelUnoptimised) compiled with GCC, which takes 6 · 10 3 seconds for the same calculation (60 ms per evaluation). An intended application of HEATCODE is in conjunction with a RT code that is used for calculating multiple instances of the observed local spectrum of the interstellar radiation field and comparing each with observations, where each instance is evaluated for a set of Galactic structural parameters encoding, e.g., the locations of spiral arms, or the geometry of the Galactic bulge, over a parameter space. The stochastic dust emissivity for every spatial grid cell must be calculated for each realization of the parameters. We estimate that 10 5 RT calculations are required to explore a representative space of 4 parameters, with ∼ 10 5 voxels for each realization of the Galaxy. The single computing server with two Intel Xeon Phi coprocessors that we used for benchmarks spends 14 × 10 5 s ≈ 16 days to compute the stochastic heating grain emissivities in such a calculation. That is a reasonable duration for a research project, as opposed to 6 · 10 3 × 10 5 s ≈ 19 years with the unoptimized code unable to use the coprocessor cards. This performance improvement was made possible by using optimization practices that improve the performance of HPC applications on general-purpose processors and on the Intel MIC architecture. An extended Appendix to this paper is dedicated to the discussion of these optimization practices. As a supplement to this discussion, we provide in Appendix B nine code revisions corresponding to each optimization step. This supplementary material in combination with the discussion in the text can be used to further improve the performance of the HEATCODE library, and to transfer the methods practiced here to other applications with similar workloads.
The results discussed in Appendix B show that for almost all optimization steps, a performance increase is observed both in the host and in the coprocessor version of the library. With the exception of minor fragments, the optimized C++ code is the same for all platforms and compilers. Therefore, the common trend in all three curves in Figure B .4 corroborates the claim that the optimization for the MIC architecture relies on the same principles as optimization for general-purpose processors [20] . At the same time, the unoptimized code on the coprocessor performs a factor of 3x worse than the host system. This observation supports the recommendation that applications must be optimized to fully exploit the general-purpose CPU capabilities before they can be efficiently run on the MIC architecture [20] .
Acknowledgements
The work on the HEATCODE library is supported by NASA grants NNX12AO69G and NNX12AO73G.
Appendix A. The HEATCODE Library Interface and Structure
The HEATCODE library is written in the C++ language and designed to function in the Linux environment. It can be compiled using the supplied Makefile, which supports the usage of the GNU Compiler Collection (GCC) and of the Intel C++ compiler. The Intel compiler is necessary in order to utilize the MIC architecture support, which offloads parts of the calculation to Intel Xeon Phi coprocessors. In addition, the Intel compiler produces a faster performing executable code than the GNU compiler, due to better support for automatic vectorization and single precision transcendental arithmetics.
The linking of HEATCODE to a user application is done in the usual way by passing the corresponding object files to the linker. The compilation of an example application utilizing the HEATCODE library is demonstrated in the Makefile.
Appendix A.1. Interface
The interface to the library is demonstrated in the files GrainModel.h and TransientHeatingModel.h: a) Class GrainModel is an abstract class which declares the data containers for the properties of dust interaction with light and size distribution of dust grains. It must be overridden by concrete derived classes in order to implement specific grain models. HEATCODE contains derived classes GraphiteGrainModel, PAHGrainModel and SilicateGrainModel that implement the corresponding models of grains. The constructor of GrainModel must be called with an input argument that defines the light wavelength grid used throughout the library. b) Abstract class TransientHeatingModel instruments the interface and the computing engine (see Section 2) that uses a GrainModel object to compute the stochastic emissivity of the corresponding type of grains heated by incident light spectrum. HEATCODE contains two concrete implementations of this class: TransientHeatingModelUnoptimised, which is an unoptimized version of the calculation, and TransientHeatingModelXeonPhi, which is optimized for heterogeneous calculations using general-purpose multicore processors and Intel Xeon Phi coprocessors. The latter implementation is recommended for best performance even on systems not equipped with Intel Xeon Phi coprocessors. The constructor of TransientHeatingModel must be called with two arguments that define the GrainModel class and the temperature grid used in the calculations. The implementation TransientHeatingModelXeonPhi takes additional arguments that define the set of compute devices, which may be used for calculations. and defined on the wavelength grid used to initialize the class GrainModel. The output vector emissivity upon the exit from the function contains the stochastic grain emissivities corresponding to each of the incident radiation fields. The emissivity is the rate of energy emission multiplied by the wavelength. In the code, it is measured in the units of eV s −1 sr −1 (H-atom) −1 and defined on the same wavelength grid.
1 v i r t u a l v o i d C a l c u l a t e T r a n s i e n t E m i s s i v i t y ( Appendix A.2. Dependencies HEATCODE uses OpenMP in order to distribute calculations across multiple threads on many-core processors and Intel Xeon Phi coprocessors. Therefore, applications using HEATCODE must be compiled with the compiler argument -fopenmp with GCC or -openmp with Intel C++ compiler.
Appendix A.3. Usage in Parallel Computing Platforms
The function CalculateTransientEmissivity automatically distributes work across multiple CPU cores and, if requested and possible, performs offload of parts of the calculation to coprocessors. Usually, there is is no need to call this function from a parallel region, because it exploits all parallelism in the hardware system available to it. However, CalculateTransientEmissivity may be safely invoked from multiple OpenMP threads. In this case, only one instance of the calculation proceeds at a time, because the calculation is executed inside a critical OpenMP barier.
HEATCODE may be used in cluster environments if MPI processes operating on independent nodes call independent instances of the library. However, within HEATCODE, distributed-memory parallelism with MPI is not supported.
When multiple instances of an application using HEATCODE are running on a compute node with Intel Xeon Phi coprocessors, each instance will try to claim all available compute resources: CPU cores and coprocessors. In cases when it is necessary to share these resources between several instances of the HEATCODE library, the user must implement inter-process locking or otherwise modify HEATCODE in order to avoid resource contention.
Appendix B. Optimization of HEATCODE
In this section we discuss the optimizations that had to be made in order to improve the performance of the stochastic heating and emissivity calculation and utilize the Intel MIC architecture. Most of these optimizations apply to all modern general-purpose processors and are not specific to the nature of the HEATCODE algorithms. We report our optimization work in detail in order to facilitate future development of HEATCODE and of other applications relying on similar computational algorithms. As of the writing of this work, the Intel MIC architecture is a young technology with only eight months of public exposure. We believe that the sharing of best optimization practices at this time is essential to the evolution of the scientific developer community.
Optimization steps discussed below are revisions of the class TransientHeatingModelXeonPhi. Most optimization steps address a specific category of hardware capabilities or algorithm properties; however, some optimization improve performance by affecting multiple issues. We emphasize such multiissue optimizations in the discussion. The explicit code of all optimization steps can be viewed, cross-examined and benchmarked by downloading the supplement to HEATCODE in the CPC library.
The unoptimized code (we refer to it as "optimization step 0") is our own application developed without the aim to optimize it for the Intel MIC architecture or to tune it for a specific C++ compiler. The code of step 0 is not intentionally handicapped for the purpose of highlighting the optimization benefits. It was designed as an architecture-oblivious, straightforward implementation of the matrix formalism of Draine and Li [4] . The performance of the code in step 0 compiled with the GNU C++ compiler is hereafter referred to as the baseline performance. Quantitatively, the baseline corresponds to an average of 59 ms per stochastic heating and emissivity calculation (in a batch of 10 4 spectra). Figure B .4 shows the performance of HEATCODE for all optimization steps with three curves. The green dashed line shows the performance of the code compiled with GCC on the host system utilizing all cores. The blue dash-dotted line is the performance of the same code compiled with Intel C++ compiler, also on the host system. The red dash-double-dot line is the performance on a single Intel Xeon Phi coprocessor (for steps 0 through 7), or on the whole system with the host and two coprocessors used concurrently (step 8). All benchmark results shown in Figure B .4 are measured relative to the baseline.
The performance of the optimized GCC version (step 7) is approximately 32x better than the baseline. The Intel C++ compiler version on the host is approximately 95x faster than the baseline, and the code compiled with Intel C++ compiler on the coprocessor is 180x faster. The heterogeneous system comprised of the host processors and two coprocessors performs 420x faster than the baseline version.
With GCC, the performance falls short of the corresponding performance with Intel C++ compiler. Two factors contribute to this performance difference. First, according to the vectorization report, GCC was not able to vectorize some of the loops in the code, including the calculation of the interpolated radiation field. That calculation involves costly transcendental arithmetic operations and takes a significant fraction of the computation time. Second, the Intel Math Library provides optimized exponential and logarithm functions with base 2, which are faster than their natural base counterparts, and HEATCODE takes advantage of these optimized functions. The math library used by GCC does also provides base 2 exponential and logarithm functions, but they are not faster than the natural base logarithm and exponential. In order to compensate for that, when GCC is used for compiling HEATCODE, the code falls back to using the natural base exponential and logarithmic functions.
In the rest of the Appendix, the specific measures taken in each optimization step are discussed and supplemented with fragments of C++ code and explanatory diagrams. Some applications can be improved by eliminating redundant operations from the algorithm. In the context of computer algorithms, this procedure is known as "pruning". Pruning is generally hardware-independent, but may require adopting a set of assumptions. For calculations in HEATCODE, two assumptions are made: 1) We assume that above a certain threshold grain size, grains become large enough so that the stochastic heating does not have to be computed. Instead, adiabatic heating and emission formalism can be used, which is outside the scope of the HEATCODE applicability. In step 1, we optimize the code by precomputing the index gIMax, which corresponds to the threshold grain size. Using this index, we restrict the bounds of corresponding loops of and reduce the amount of memory allocated for some storage containers.
2) The radiation field is assumed to be zero outside the wavelength range covered by the numerical wavelength grid. Accordingly, the range of grain excitation energies can be constrained. We optimize the code by precomputing the index of the highest exited level for each grain size. This index is represented by array int fMax [gIMax] . Using fMax, we restrict the loop bounds in the respective matrix operations.
Additionally, in step 1, we implement the pruning of grain sizes for which all heating terms calculated in the course of 
GCC on CPUs
In te l C + + o n X e o n P h i
In te l C + + o n C P U s radiation field interpolation are zero. This happens if the radiation field is non-zero only in a limited waveband, which does not interact with grains of a given size. For such non-interacting grains, the temperature distribution is immediately obtained as P i = δ i0 , where δ i j is the Kronecker delta. The pruning is done by counting and storing the number of non-zero heating terms in array int nNonZero[gIMax] and using this array in all functions to skip through trivial grain sizes. This optimization does not involve any assumptions.
Another approach to algorithm optimization is the use of recurrence relations. These relations express a quantity in a computed sequence via one or more preceding values in the sequence. As Guhathakurta and Draine [2] mention, the elements of matrix B f j defined by Equation (5) can be computed using a recurrence relation, which in our notation is
In optimization step 0 (previous), the calculation of B i j is performed explicitly (i.e., without recurrence) using the code shown in the left-hand side panel of Figure B Fusion for Reduced Memory Footprint While the host system used in our tests has 32 logical cores, each of its two coprocessors has 240 logical cores. As discussed in Section 3.1.1, HEATCODE employs thread parallelism by distributing across processor cores a large number of radiation fields for which the stochastic heating and emissivity must be computed. Intel Xeon Phi coprocessors used for testing have 8 GB of onboard memory, with some of it used for administrative purposes. Each thread allocates scratch data in order to perform the calculation of stochastic heating and emissivity. The greatest amount of memory is used for scratch arrays which hold the results of intermediate calculations, and occupy up to 60 MB per thread (the exact size depends on the grain model). On the coprocessor, each of the 240 threads allocates a private copy of all three arrays, resulting in a memory footprint in excess of 10 GB. Therefore, in order to run the calculation, we have to reduce the number of threads down to an amount that allows all scratch data to fit in memory, sacrificing the performance of unused cores. In optimization step 2 (current), we resolve this situation as described below.
In each parallel thread spawned by the function CalculateTransientEmissivity, three arrays are 1 f o r ( i n t f = 0 ; f < t e m p B i n s ; ++ f ) { 2 f o r ( i n t i = 0 ; i < t e m p B i n s ; ++ i ) { f o r ( i n t k = f ; k < t e m p B i n s ; ++k ) 6 sum += t r a n s i e n t M a t r i x [ g I * t e m p B i n s * t e m p B i n s + k * t e m p B i n s + i ] ; used to temporarily hold scratch data of the calculation: weightedRadiationField, transientMatrix and bMatrix.
The size of each of these arrays is gIMax × tempBins × tempBins. The factor gIMax 100 occurs because the calculation stores one matrix of size tempBins × tempBins for each grain size. This is necessary because the unoptimized calculation is structured as a pipeline. Three functions process the radiation field into the temperature distribution. One function performs the radiation field interpolation (9) and places it into the array weightedRadiationField. The second uses the interpolated radiation fields to prepare matrices T i j and B i j stored in transientMatrix and bMatrix, respectively. The third function uses bMatrix to solve, for each grain size, the system of equations (4) with these matrices. The structure of the unoptimized calculation is shown in Figure B .6.
The optimization that we performe in step 2 involves loop fusion. Each of the three pipeline functions contains a loop in the variable gI, which spans all grain sizes. The calculation for each grain size is independent of all other grain sizes. Therefore, if these three loops are fused into one, then, instead of a set of 3-dimensional arrays of size gIMax × tempBins × tempBins, a set of 2-dimensional arrays of size tempBins × tempBins can be used.
We perform the optimization by moving the code of the three functions into a single function named RadiationFieldToTemperatureDistribution and fusing the loop in gI. Naturally, we could once again partition the body of the new gI loop into three functions operating on only one index gI. However, we chose to keep the code in one function, aiming for the best performance. Argument passing and function calling could incur additional performance overhead. The optimized structure of the code is listed in Figure B This optimization does not significantly affect the performance of the host system based on general-purpose Intel processors. Indeed, the host system with 64 GB of memory and 32 logical cores has sufficient memory to hold the scratch data for all 32 threads even in the unoptimized case. However, the performance on Intel Xeon Phi coprocessors is improved by loop fusion by a factor of 4. One reason for this is that with reduced memory footprint, the calculation is able to run as many OpenMP threads as there are logical cores on the coprocessor. Another reason is that instead of an expensive allocation of large scratch arrays on the heap, threads perform quick allocation of small scratch arrays on the stack. Dynamic allocation on the heap is particularly costly on the Intel MIC architecture, because this operation is inherently sequential.
Appendix B.3. Efficient Interpolation Method: Packing and Precomputing Scattered Memory Accesses
Optimization for multi-and many-core architectures usually requires the maintenance of data access locality. Locality in this context means that the data in the main memory requested by cores should be requested from adjacent addresses ("locality in space") and, if these data are used multiple times in an algorithm, they should be re-used as soon as possible ("locality in time"). These properties of data access make the best use of hierarchical memory caches.
One of the operations performed in the stochastic heating calculation algorithm has an inherently scattered pattern of memory access. This operation is the interpolation of the incident radiation field from the wavelength grid on which it is defined onto the grid on which the absorption cross section of the grains is discretized. This pattern of memory access is inefficient because it involves memory access with poor locality in space. We also argue below that it is inefficient for vectorization (i.e., the use of SIMD instructions). The unoptimized interpolation algorithm is shown in Figure B. 9.
There are a number of issues in the unoptimized interpolation algorithm that prevent the multi-and many-core architecture from using its full potential:
1) The function std::lower bound is employed to find the radiation field wavelength bin interacting with the current grain wavelength. This function uses the binary search algorithm, which incurs poorly predictable branches and scattered memory accesses to array wavelength. Due to the combinatorial, non-vector nature of this algorithm, the Intel MIC architecture is used sub-optimally. 2) The data are read from arrays radiationField and absorptionCrossSection with a scattered pattern according to the index j found in the binary search. This pattern complicates effective cache use. It also prevents the vectorization of the transcendental expressions used to compute the interpolated value.
3) The data written to the matrix weightedRadiationField are re-used in the code when the matrix transientMatrix
v o i d I n t e r p o l a t e W e i g h t e d R a d i a t i o n F i e l d ( / * o u t p u t * / f l o a t * w e i g h t e d R a d i a t i o n F i e l d , . . . ) {
f o r ( i n t g I = 0 ; g I < gMax ; g I ++)
c a l c u l a t i n g w e i g h t e d R a d i a t i o n F i e l d [ g I * t e m p B i n s * t e m p B i n s + o f f s e t ] * / }
v o i d C a l c u l a t e M a t r i c e s ( / * i n p u t * / c o n s t f l o a t * w e i g h t e d R a d i a t i o n F i e l d , / * s c r a t c h * / f l o a t * t r a n s i e n t M a t r i x , / * o u t p u t * / f l o a t * b M a t r i x , . . . ) { 
I n t e r p o l a t e W e i g h t e d R a d i a t i o n F i e l d ( w e i g h t e d R a d i a t i o n F i e l d , . . . ) ;

C a l c u l a t e M a t r i c e s ( w e i g h t e d R a d i a t i o n F i e l d , t r a n s i e n t M a t r i x
S t a c k a l l o c a t i o n h e r e i s f a s t e r t h a n h e a p a l l o c a t i o n i n t h e u n o p t i m i z e d v e r s i o n . * / 4 f l o a t w e i g h t e d R a d i a t i o n F i e l d [ t e m p B i n s * t e m p B i n s ] ;
5 f l o a t t r a n s i e n t M a t r i x [ t e m p B i n s * t e m p B i n s ] ; is computed, however, the calculations of these two matrices are performed in distinct loops. This leads to sub-optimal data access locality in time. By the time the calculation of transientMatrix begins, the bulk of weightedRadiationField is likely to have been evicted from processor caches into the main memory. Loading weightedRadiationField from memory is the bottleneck for the calculation of transientMatrix.
In step 3, we optimize the algorithm of radiation field interpolation and the computation of transientMatrix using several measures, as shown in Figure B .10: a) First, we interchange the order of nesting in the traversal of the two wavelength grids. Instead of traversing the grain wavelength grid first and for each set of indices (f, i) computing the index j in the radiation field grid, we made the optimized outer loop iterate through index j and compute the indices (f, i) for it. This improves the pattern of memory access, because the index f*tempBins + i remains constant or monotonically increases with increasing j. Even though now the calculation accesses weightedInterpolationField with a scattered pattern, the performance increases because the latency of scattered writes can be overlapped with computation, unlike the latency of scattered reads. The overlap occurs automatically thanks to out-of-order computation in Intel Xeon cores and due to hyper-threading in both Intel Xeon and Intel Xeon Phi cores. b) The optimized code precomputes during the initialization phase the set of values (f, i) resonant with each wavelength bin j. These values are stored in array interpolationPatternIndex. An auxiliary array interpolationPatternCount stores the number of grain wavelengths in the former array that resonate with wavelength [j] . This completely eliminates a sorted array search, along with the poorly predictable branches caused by it, from the interpolation code.
c) The algebraic expression for the interpolated radiation field value is simplified, and most of the terms in it precomputed and stored in array interpolationLogOffset. This reduces the number of transcendental functions called for every interpolation event from 9 to an average value in the range (1 . . . 2]. The specific number depends on how many grain wavelength bins interact with the given radiation field bin, on average.
d) The calculation of the elements of transientMatrix is moved from a separate loop into the interpolation loop. This improves the temporal locality of data access, because the values of weightedInterpolationField are still in the processor's registers when they are re-used.
The optimizations discussed above increase the code performance thanks to improvements on multiple fronts. Loop interchange streamlines the data access pattern, replacing scattered reads with scattered writes. Precomputation of wavelength indices and interpolation expressions eliminates a combinatorial search and reduces the volume of transcendental arithmetic operations and non-local memory accesses. Fusing the calculation of weightedRadiationField with its subsequent usage to compute transientMatrix improves the locality of data access in time.
A schematic diagram of the optimization of the interpolation algorithm is shown in Figure B .11. 1 f o r ( i n t f = 1 ; f <= fMax [ g I ] ; ++ f ) { / * Loop t h r o u g h a l l h e a t i n g t r a n s i t i o n s f o r t h i s g r a i n s i z e * / f o r ( i n t i = 0 ; i < f ; ++ i ) { 3 / * W a v e l e n g t h a t which c r o s s s e c t i o n i s d e f i n e d * / 4 c o n s t d o u b l e wl = g r a i n W a v e l e n g t h [ g I * t e m p B i n s * t e m p B i n s + f * t e m p B i n s + i ] ; 5 6 i f ( wl >= w a v e l e n g t h [ 0 ] && wl <= w a v e l e n g t h [ wlBins − 1 ] ) { / * Guard a g a i n s t o u t o f r a n g e r e s u l t s * / / * How many g r a i n w a v e l e n g t h s f a l l w i t h i n t h e r a d i a t i o n f i e l d b i n j ( p r e c o m p u t e d ) * / 7 c o n s t i n t qCount = i n t e r p o l a t i o n P a t t e r n C o u n t [ g I * w l B i n s + j ] ; 8 9 i f ( ( u p p e r > 0 ) && ( l o w e r > 0 ) ) { / * Guard a g a i n s o u t o f r a n g e r e s u l t s * / c o n s t d o u b l e dLogUpperLower = l o g ( u p p e r / l o w e r ) ; / * Same v a l u e f o r a l l b i n s * / 11 f o r ( i n t c = 0 ; c < qCount ; c ++) { 12 / * I n d e x o f t h e g r a i n w a v e l e n g t h f * t e m p B i n s+ i i s p r e c o m p u t e d d u r i n g i n i t i a l i z a t i o n * / 13 c o n s t i n t i d x = i n t e r p o l a t i o n P a t t e r n I n d e x [ q C t r + c ] ; 14 15 / * I n t e r p o l a t i o n w i t h p r e c o m p u t e d l o g o f f s e t * / 
Appendix B.4. Scalar Optimizations: Precision Control, Precomputation
The preceding optimization steps (see Appendix B.1 through Appendix B.3) eliminate algorithmically redundant operations, ensure scalable thread parallelism, facilitate automatic vectorization, and improve the memory access pattern. The next step is optimization on the level of scalar operations by reducing the cost of arithmetics.
The unoptimized code uses mixed precision arithmetics. The usage of mixed precision is detrimental to performance for two reasons. First, double precision operations take more time than their single precision counterparts. Second, in vectorized mixed precision operations, additional type conversion and load/store operations are necessary, which may make mixed precision even less efficient than pure double precision. In the unoptimized code, matrices weightedRadiationField, transientMatrix and bMatrix are single precision containers. However, some intermediate arithmetic operations are performed in double precision in order to avoid floating-point underflow and overflow.
In optimization step 4 (currently discussed), we convert all floating-point variables and operations in the code to single precision. In order to avoid overflow, we use scaling factors that decrease the magnitude of intermediate arithmetic results, so that the overflow does not occur. These modifications are necessary in two locations in the code:
1) The calculation of the temperature distribution according to Equation (6) can have overflow if the heating is strong. In single precision, overflow occurs when for some f > 1, the value X f is a factor of FLT MAX ≈ 10 38 greater than X 0 = 1. We eliminate that overflow by checking that every new computed X f does not exceed a pre-determined threshold value (we used 10 30 as the threshold). If the value of X f exceeds the threshold, the whole vector X is multiplied by a factor 10 −30 . This procedure, illustrated in Figure B .12, affects the result only to machine precision, because eventually, the temperature distribution is calculated by re-normalization according to Equation (7). In order to conform all quantities and functions in the code to the pure single precision implementation, two additional modifications are made: 1) All floating-point constants are explicitly marked as single precision. By default, floating-point constants in C++ expressions, such as "0", "1.0", etc., are treated as double precision numbers by the compiler. In order to explicitly declare them as single precision, the letter "f" should be appended: "0.0f", "1.0f", etc. This avoids unnecessary type conversion in single precision expressions that use these constants. 2) Mathematical functions from the Math library optimized for single precision are used. That is, the double precision exponential function exp() is replaced with its single precision version expf(), and the logarithm log() with logf().
The exponential and logarithm transcendental functions can be further accelerated with the Intel Math library by using base-2 implementations exp2f() and log2f(). In the Intel Math library, these implementations are more efficient than the natural base versions expf() and logf(). This is not true for the Math library used by the GNU compiler version 4.4.7, which we used. Therefore, we implemented compiler-dependent code preprocessing in order to use the most efficient exponential and logarithm functions available in the compiler used to build the application, as shown in Figure B .13. The preprocessor macro INTEL COMPILER is defined by the Intel C++ compiler; it is not defined by GCC. This macro is also used to protect other syntax and functionality specific to the Intel C++ compiler, such as compiler vectorization hints (see below), and another automatically defined macro INTEL OFFLOAD is used to protect the offload pragmas that perform calculations on Intel Xeon Phi coprocessors.
Finally, in order to speed up some arithmetic operations, we employ a procedure known as strength reduction. This procedure algebraically transforms arithmetic expressions, aiming to replace long-latency arithmetic operations, such as division, with faster operations, such as multiplication. The result may not be bitwise-equivalent to the original expression, however, this is acceptable in cases such as ours. Some strength reduction had already been performed in the interpolation algorithm in optimization step 3 (see Appendix B.3). In the current optimization step, we replace some divisions with multiplications by the reciprocal number, as illustrated in Figure B. 14. The code in this figure implements the re-normalization of the temperature distribution expressed by Equation (7).
1 f l o a t b M a t r i x [ t e m p B i n s * t e m p B i n s ] ; 2 f l o a t * t r a n s i e n t M a t r i x = &b M a t r i x [ 0 ] ; 3 4 / * . . . L a t e r i n t h e i n t e r p o l a t i o n a l g o r i t h m : * / 5 f o r ( i n t c = 0 ; c < qCount ; c ++) { 6 c o n s t i n t i d x = i n t e r p o l a t i o n P a t t e r n I n d e x [ q C t r +c ] ; access optimization and perform three additional code modifications.
Appendix B.5.1. Sharing Memory Between Arrays First, we recognize that in the pipeline that processes the radiation field to compute the temperature distribution, three temporary arrays: weightedRadiationField, transientMatrix and bMatrix are used sequentially. When the next array is filled, the data in the previous array are no longer needed. This situation lends us the opportunity to further reduce the memory footprint and increase data locality by using a shared memory location for all three arrays. We accordingly improve the code by allocating space for only one array and declaring another array as a pointer referring to the same memory location. The array weightedRadiationField is eliminated, and only a temporary scalar variable is used in its place. Some care must has to be taken in order to preserve automatic vectorization, because the compiler is wary of vector dependences in loops operating on arrays with overlapping memory locations. The vectorization success is controlled using the compiler argument -vec-report3 of the Intel C++ compiler in order to produce a detailed vectorization report. This optimization and the construction of the modified loops is shown in Figure B .15.
Appendix B.5.2. Loop Interchange
In the calculation of stochastic emissivity from the temperature distribution, expressed by Equations (11) and (13), nested loops are used to traverse the emissivity wavelengths, grain sizes and the temperature bins. In the unoptimized calculation, the wavelengths are traversed first, followed by grain sizes, followed by the temperature bins in the inner nested loop. We have found that permuting the outer two loops (i.e., traversing grain sizes first and wavelengths afterwards) improves the performance. The unoptimized code (before permutation) and the optimized code (after) are shown in Figure B .16.
The difference in memory access between unoptimized and optimized codes is the pattern of read access to arrays distribution and planckDistribution. In the unoptimized code, the latter array has better access locality than the former. Indeed, for every iteration of the outer loop in variable i, only elements of the j-th row of planckDistribution are used, but the entirety of array distribution is read from memory. In contrast, with permuted loops in the right-hand side panel of Figure B .16, for every iteration of the outer loop in variable j, only the i-th row of distribution is used, whereas planckDistribution is read from front to back. The permuted loop version works faster because when the calculation is distributed across multiple threads, each thread is operating on the same copy of planckDistribution, but the data in distribution are private to each thread. In Intel Xeon processors, hyper-threading places two threads per physical core, and in Intel Xeon Phi coprocessors, four threads per physical core. Therefore, when the code operates on thread-private distribution, two (on the CPU) or four (on coprocessor) instances of this array must be fetched into the cache of each core. In contrast, for shared planckDistribution, only one copy per core must be fetched for all threads on that physical core to share. This means that the application pays a greater penalty for cache misses on distribution, which is why the permuted loop version, with better locality of access to distribution, works faster.
Appendix B.5.3. Loop Tiling
Further improvement of cache performance in the calculation of stochastic emissivity may be achieved with the use of loop tiling. Loop tiling transforms two nested loops in such a way that a) the outer loop variable is incremented with a stride greater than one (the stride is called the tile size), and b) for every value of the inner loop variable, a fixed small number of outer loop iterations are performed to traverse the tile.
This technique is often used in matrix operations with nested loops in order to increase the re-use of data fetched into processor registers or cache before these data are evicted. In the optimized version in step 5, we employ double tiling, applying the technique to both the i and j loops. Some empirical tuning is required in order to find the optimal tile size. The resulting code (incomplete) is shown in Figure B .17. The optimizations of loop permutation and tiling in this function lead to very significant performance improvement on the multi-core platform, and an even stronger improvement on the MIC architecture. These methods are often required for memory traffic optimization in a large class of matrix calculations with a low arithmetic intensity. Without loop tiling, such applications may be bottlenecked by memory access latency, nullifying the arithmetic and bandwidth advantages of specialized computing architectures such as Intel Xeon Phi coprocessors. In the case of multi-dimensional arrays, the alignment of the first element array on a 64-byte boundary may not be suffucient. Indeed, if the length of the inner dimension of this array is not a multiple of 64 bytes, then the first element in the second row (and subsequent rows) is not aligned. Therefore, in order to improve vectorization in nested loops operating on multidimensional arrays, the inner dimension must be a multiple of the alignment value. We verify that this condition is satisfied using an assert() check; however, we could also pad the row length to a multiple of 64 bytes for generality.
Appendix B.6.2. Padding of Loops
The Intel C++ compiler is able to vectorize loops even if the iteration count in the loop is not a multiple of the SIMD vector length. For instance, in Intel Xeon Phi coprocessors, a 512-bit SIMD vector packs sixteen single precision floating-point num-1 c o n s t i n t i T i l e = 4 ; / * S i z e o f i − t i l e s * / 2 c o n s t i n t j T i l e = 4 ; / * S i z e o f j − t i l e s * / 3 a s s e r t ( w l B i n s % i T i l e == 0 ) ; 4 5 f o r ( i n t j j = 0 ; j j < gIMax −( gIMax%j T i l e ) ; j j += j T i l e ) / * S t r i d i n g t h r o u g h j − t i l e s * / 6 f o r ( i n t i i = 0 ; i i < w l B i n s ; i i += i T i l e ) { / * S t r i d i n g t h r o u g h i − t i l e s * / f o r ( i n t b = 0 ; b < j T i l e ; b++) { / * C o l l e c t i n g r e s u l t s f o r t h e t i l e * / bers. A loop in which the number iterations N is determined at runtime is not guaranteed to have a multiple of sixteen iterations. In these cases, the compiler produces executable code for a variety of loop lengths N. Depending on the runtime value of N and memory alignment situation, the compiler may perform N modulo 16 iterations at the beginning or at the end of the loop in a non-vector form, and complete the rest using SIMD instructions.
Despite the compiler's flexibility, it is often beneficial for performance to ensure that the loop count is a multiple of 16. We have found that the HEATCODE calculations that involve the quasi-triangular matrix B i j can be accelerated by changing the inner loop bounds to a multiple of 16. In order to avoid overwriting the cooling terms B ( f −1) f in this process, we store these terms in a separate matrix rTransientMatrixOverDiagonal instead of storing them at their respective locations in bMatrix.
Additionally, we modify the calculation of the precomputed interpolation pattern (see Note that the GNU compiler was not able to vectorize this loop. The padding of interpolationCount to a multiple of 16 degrades the performance with GCC, because it increases the number of loop iterations. In order to avoid performance degradation, we restrict the padding of this array to the Intel compiler compilation by protecting the corresponding code with a check for the preprocessor macro INTEL COMPILER.
Appendix B.6.3. Compiler Hints
Even if the programmer ensures that every instance of a vector loop operates on aligned data, the compiler still implements runtime checks for alignment in order to maintain error-free execution. However, it is possible to instruct the compiler to drop these checks and assume the data aligned, which accelerates the program execution. This is done by including the hint #pragma vector aligned before the vectorized loop. If the data are not aligned at runtime, the use of this pragma results in a segmentation fault.
If the runtime length of some loops is known to the programmer, but cannot be determined by the compiler, #pragma loop count may accelerate execution. Using the loop count suggested by the developer, the compiler more efficiently chooses the vectorization strategy.
Another useful compiler hint is #pragma simd, which makes it mandatory for the compiler to vectorize the loop following the pragma. This is particularly helpful in combination with loop tiling (see Appendix B.5), where the vectorized loop is not the inner loop.
The hint #pragma vector nontemporal can be used in order to perform vector write operations in a loop directly into the main memory, bypassing the cache. This is useful when the output data of a vector loop are not used until much later in the code, and there is no reason to store them in a cache. Even though nontemporal write instructions incur eviction stalls immediately, the overall application performance may be improved by using #pragma vector nontemporal, because caches are not contaminated with temporary data.
In optimization step 6, we supplement the code, where appropriate, with the compiler hints described above. Figure B .21 demonstrates some of the loops tuned with the vectorization pragmas. These pragmas are specific to the Intel C++ compiler, and GCC does not recognize them. We protect the vectorization hints with a check for the preprocessor macro INTEL COMPILER in order to avoid warnings with GCC.
Appendix B.7. Offload Traffic Reduction: Data Persistence and Memory Retention on the Coprocessor In optimization steps 1 through 6 (Appendix B.1 through Appendix B.6) we focused on optimizing the calculations, whether they are running on the host system or on the coprocessors. As a result, the execution time of the optimized code has decreased enough that the set-up time of the calculations has become significant. In this optimization step, we reduce the set-up time for calculations offloaded a coprocessor.
HEATCODE uses Intel Xeon Phi coprocessors in the explicit offload mode. In this mode, before a function is executed on heterogeneous system, work must be distributed proportionally to the relative performance of compute devices. Otherwise some devices have to idle, awaiting the completion of work by other devices. 2) Different incident radiation spectra may take different amounts of time to process into stochastic emissivities, because the code uses pruning to exclude heating terms or whole spectra not interacting with grains in the stochastic heating mode (see Appendix B.1).
In order to utilize the additional level of parallelism (multiple compute devices), we use the blocking offload syntax and perform multiple concurrent offloads from the corresponding number of parallel threads on host. The loop scheduling functionality in OpenMP library takes care of scheduling. When multiple compute devices are used to calculate the stochastic grain emissivity for n incident radiation fields, HEATCODE splits these n tasks into multiple contiguous chunks (i.e., sets of radiation fields). Then the application spawns a thread-parallel region on the host with as many threads as there are compute devices. Each thread in the parallel region is associated with one of the compute devices. In computing systems enabled with Intel Xeon Phi coprocessors, the host system typically has no less than 8 logical cores (usually up to 32), but no more than 8 coprocessors (usually 1 or 2). Therefore, we expect that all threads in this outer parallel region start simultaneously. The OpenMP threads corresponding to compute devices are then used to run a parallel for-loop, which iterates over all chunks. The loop scheduling mode is set to dynamic scheduling with a grain size of 1. In this mode, initially, one chunk from the pool is assigned to each thread, which then launches a blocking calculation for this chunk on the respective compute device. When one of the threads (equivalently, one of the compute devices) completes the work and unblocks, another chunk from the pool is assigned to this thread (equivalently, to the corresponding compute device) by the OpenMP scheduler. This process goes on until all chunks are processed.
The offloaded calculation spawns its own parallel loop, which distributes the calculations of all radiation fields in the current chunk across the cores of the device. This parallel loop uses as many threads as there are available logical cores on the respective compute device.
The host CPU can be used as a compute device, too, by mapping one of the threads to the host CPU. In order to use the OpenMP loop that schedules across devices and, at the same time, spawn OpenMP threads on the host processor for calculations, nested parallelism in OpenMP must be enabled. The work scheduling technique described above is illustrated in Figure B .22.
Appendix B.9. Runtime Environment
In order to achieve maximum performance, some settings of the OpenMP and offload runtime library have to be set to nondefault values.
Compute-bound workloads, such as the stochastic heating and emissivity calculation discussed here, usually benefit from binding OpenMP threads to physical or logical cores of the processor to prevent thread migration across cores. For HEATCODE, we recommend setting such binding using the OpenMP thread affinity interface. On the host, prior to starting the calculation, the Linux environment variable KMP AFFINITY must be set by the user to the value "compact,granularity=fine" (hereafter, the quotes must not be included in the value of the environment variables). This sets the OpenMP thread affinity on the host that binds OpenMP threads to the respective logical cores, and places threads with consecutive numbers as close to each other as possible. Additionally, the environment variable MIC ENV PREFIX may be set to the value "MIC", the variable MIC KMP AFFINITY to the value "compact,granularity=fine", and MIC PLACE THREADS to "59C,4t" (for a 60-core coprocessor). These settings effect the same type of OpenMP affinity on all Intel Xeon Phi coprocessors in the system, ensuring that one core dedicated to OS-level offload tasks does not get loaded with computational work.
Another runtime optimization that improves the coprocessor performance is using large buffers for offloading data to coprocessors. In order to use this feature with HEATCODE, the environment variable MIC USE 2MB BUFFERS on the host must be set to the value "64K", which automatically uses large buffers for all offloaded arrays exceeding 64 kilobytes in size.
