Experimentation Procedure for Offloaded Mini-Apps Executed on Cluster
  Architectures with Xeon Phi Accelerators by Lawson, Gary et al.
arXiv:1509.02135v2 [cs.DC] 14 Sep 2015
Experimentation Procedure for Offloaded Mini-Apps
Executed on Cluster Architectures with
Xeon Phi Accelerators
Gary Lawson†, Vaibhav Sundriyal, Masha Sosonkina, and Yuzhong Shen
Old Dominion University, Norfolk VA 23529, USA,
Department of Modeling, Simulation, and Visualization Engineering,
{ glaws003, vsundriy, msosonki, yshen } @odu.edu
Abstract—A heterogeneous cluster architecture is complex.
It contains hundreds or thousands of devices connected by
a tiered communication system in order to solve a problem.
As a heterogeneous system, these devices will have varying
performance capabilities. To better understand the interactions
which occur between the various devices during execution, an
experimentation procedure has been devised to capture, store,
and analyze important and meaningful data. The procedure
consists of various tools, techniques, and methods for capturing
relevant timing, power, and performance data for a typical
execution. This procedure currently applies to architectures with
Intel Xeon processors and Intel Xeon Phi accelerators. It has been
applied to the Co-Design Molecular Dynamics mini-app, courtesy
of the ExMatEx team. This work aims to provide end-users with a
strategy for investigating codes executed on heterogeneous cluster
architectures with Xeon Phi accelerators.
I. INTRODUCTION
Measuring the performance and energy of an application
can be a challenge. There are tools and methods for obtaining
power and performance measurements, but accurately com-
bining these with execution can be difficult. Further, from the
point-of-view of a single developer, determination of the crit-
ical and non-critical execution points can be tedious or over-
whelming. This work aims to provide an easy-to-reproduce
procedure for accurately profiling a generic application with
minimal code changes. This is beneficial to the solo developer
looking to optimize an application, because key phases of
execution within the application will fall into the authors’
outlined measurement scheme and promote isolation of these
application phases. From the point-of-view of co-design, this
work provides meaningful insights into the performance of
an application; metrics such as memory bandwidth and com-
putational throughput are used in place of application phases
to describe execution time. However, obtaining these metrics
does require execution; hence, a well-described procedure has
been developed.
Accelerators are often adopted to reduce time-to-solution
with low energy costs. According to Choi et al. [1],
the Xeon Phi is capable of 11 GFLOP/J and 880 MB/J for
single-precision operations (measured throughput of 2 TFLOP/s
and 180 GB/s memory bandwidth). The Intel Xeon Phi is
an accelerator that promotes high memory bandwidth (i.e.
data movement) in addition to high computational throughput,
and that supports various execution modes [10], [16]. The
Xeon Phi also offers user-level access to important power
data, but how to utilize this information is poorly documented;
this work sheds light on an easy-to-implement method to read
power at the highest available sampling frequency for the
device. Further, unlike tools or methods which rely on reading
“window” power, such as MICSMC (Software Management
Controller) [14], this work yields true, instantaneous power
measurements based on the connectors that supply power.
The Xeon Phi co-processor is an accelerator with three
execution modes: native, offload, and symmetric. This is unusual
among accelerators because normal operation of an accelerator
is considered the offload execution mode; this is the mode
used by GPUs. The Xeon Phi supports these other modes
because of the micro-OS (running a special version of Linux)
which enables the device to execute applications from the
device itself; to the host, the Xeon Phi may be considered an
additional node. Native execution mode allows an application
to be executed only on the Xeon Phi; a user logs onto
the device and executes the application. Symmetric execution
mode allows an application to be executed on the host and
on the Xeon Phi, but each device is to solve a different
sub-domain. Offload execution mode allows the host and
accelerator to share the workload, but this execution mode
requires some code changes. This is the ideal starting place
for a procedure that requires some code instrumentation.
A heterogeneous cluster architecture is composed of many
nodes connected by a network interconnect. Each node is
composed of multiple devices with varying performance
capabilities. An application executed on such an architecture
must implement domain decomposition [4]: the problem is
divided into independent sub-problems, or sub-domains.
These sub-domains are then distributed across the nodes, with
one or more sub-domains per node. However, domain
decomposition requires data sharing between sub-domains to solve
the total problem completely. For simpler implementations,
computation of the problem and communication between
sub-domains do not overlap. In this work, overlap between these
two phases is not considered.
A heterogeneous cluster architecture is complex. Measuring
the performance and energy consumption of such an archi-
tecture is also complex. Ideally, a measurement procedure
should have low performance impact and energy overhead;
measuring an execution should not degrade the performance of
the application, nor dramatically increase the required power
draw. A procedure should also be easy to implement on other
systems, and should not require dramatic code changes. In this
work, such a procedure is presented to capture important data
for all devices utilized in a heterogeneous cluster architecture.
Important data includes bandwidth for various data transfers,
work performed by the Xeon Phi accelerator, and power
consumption for the various devices.
This work specifically investigates the offload execution
mode [11], [9]; however, the procedure may be applied to
native and symmetric execution modes as well with minimal
changes. Some of these changes are introduced in this
work but have not been thoroughly tested; testing them is left
as future work. To measure performance, hardware counters
are read using the Performance API (PAPI) [5]. Host power
is measured using the Running Average Power Limit (RAPL)
interface [18], and Xeon Phi power is measured by reading
the power file /sys/class/micras/power [15]. Timings
are gathered during execution of an application; however,
additional timings are necessary beyond what is provided by
default. This work will explain the additional code changes
required for an application to provide accurate event timings
to be used to synchronize execution flow with the power and
performance samples.
The remainder of this paper is organized as follows: Sec-
tion II provides details for cluster and application require-
ments, such as software, hardware, and code changes. Sec-
tion III discusses defining the experiment and post-execution
data processing involved in profiling execution output. Finally,
Section IV concludes.
II. INITIAL STEPS
Before experimentation may proceed, certain software and
hardware are required to accurately measure an execution.
In general, the authors assume the cluster is composed of
Intel Xeon processors of the Sandy-Bridge or newer micro-
architecture supplied with one or more Intel Xeon Phi ac-
celerators. This architecture configuration is assumed to be
most common (among hybrid Xeon and Xeon Phi clusters),
and provides the necessary hardware counters for model
parameter estimations. However, it is not a requirement that
the processors be of the Intel brand or the accelerator be a
Xeon Phi if similar measurements may be obtained.
Beyond compiling the application [8] and executing in an
MPI (Message Passing Interface) environment [7], the cluster
architecture should have access to RAPL on the host and a
version of PAPI for both the host and the native Xeon Phi, even
for offload execution. Although the application will require
calls to PAPI from within the offload section, PAPI itself
should be natively compiled and the library accessible during
compilation and execution of the application. For offload
execution, a host version of PAPI is also required because
offload sections may be executed on the host in the event of a
conditional offload, or should “no-offload” be enabled during
compilation. It is also important to note that, for native Xeon
Phi hardware counters, only PAPI versions 5.3.0 and 5.3.2
currently support this functionality; more recent versions of
PAPI (up to 5.4.1 as of this writing) are unable to convert native
hardware counter names into codes and therefore are unable to
access the counters. Because PAPI provides limited support for
the Xeon Phi, the authors plan to remove the dependency on PAPI
from the parameter estimation procedure in the future. Note also
that, with the more recent versions of MPSS (the Intel Manycore
Platform Software Stack), configuring PAPI requires more than
what is provided in the instructions for the PAPI 5.3.X versions.
Specifically, the authors followed the instructions from version
5.4.1 [5] to use the following configuration options:
• --with-mic
• --host=x86_64-k1om-linux
• --with-arch=k1om
• --with-ffsll
• --with-walltimer=cycle
• --with-tls=__thread
• --with-virtualtimer=clock_thread_cputime_id,
along with all the other configuration steps given in the
instructions for PAPI versions 5.3.X.
A. Code Instrumentation
This section defines the required code changes to implement
offload execution in CoMD and to obtain highly accurate
timings for the computation and communication phases on
each device. These timings may then be cross-referenced to
the power and performance measurements to define various
performance metrics, and total energy consumption. The associated
code changes and micro-measurement apps are provided. Each
micro-measurement app spawns only a single thread to sample
and print.
1) CoMD Overview: CoMD is a proxy application developed
by the Extreme Materials at Extreme Scale (ExMatEx) center [3]
as part of the Department of Energy co-design research
effort [2]. CoMD is a compute-intensive application where
approximately 85-90% of the execution time is spent computing
forces. Although two methods are available for the force
computation, this work focuses only on one of them, the
more complex and accurate EAM force kernel for short-range
material response simulations, such as uncharged metallic
materials [17]. The EAM kernel was chosen because its
parallel performance generally receives less attention than the
more commonly used Lennard-Jones potential, which readily
lends itself to parallelism.
2) Setup of the Offload Execution Mode: Offloading to
MIC requires the use of special pragmas defined for the Intel
compilers. These pragmas specify the code sections to be
processed by the Xeon Phi accelerator. Within the pragma
statement, one must specify the MIC device to communicate
with, the data to be transferred with the associated parameters
(such as array length, data persistence, variable reassignment,
etc.), offload conditional, and whether the offload event is
asynchronous, among other options [8]. However, in addition
to simply specifying what code sections to process on the
Xeon Phi, the arrays must be properly formatted for optimal
transfer bandwidth.
It is possible, although inefficient, to transfer multi-dimensional
arrays between the host and Xeon Phi. Therefore,
data structures should conform to the structure-of-arrays
data layout; CoMD is originally organized as an array-of-structures,
which does not transfer easily. This code change simply
requires converting the multi-dimensional arrays into
one-dimensional arrays. In the most recent experiments, CoMD is
measured to obtain more than 3 GB/s of bandwidth over the PCI
bus (out of 8 GB/s) for a problem size of 50 (500,000 atoms);
the resulting communication time is insignificant with respect
to the remaining computational requirements. This is one of
many measurements to be obtained using the procedure. One
final code improvement is the re-assignment of the maximum
number of atoms per link cell: by default, a link cell may
contain 64 atoms, but this has been reduced to 16 to lower the
memory requirements per thread. PAPI must be instrumented into
the offload sections so that memory usage and bandwidth may be
approximated during execution. PAPI is simply started and
stopped for each offload section so that the counters are always
reset for the next offload; the result is printed with the
application output. SSE3 instructions have been enforced during
compilation [12], and the 2 MB buffers available through
the environment variable [6] are utilized.
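The layout change described above can be illustrated with a minimal Python sketch. It is not taken from CoMD, whose data structures are written in C; the field names `r` and `p` and the helper `aos_to_soa` are hypothetical and serve only to show how per-atom records become one-dimensional arrays.

```python
from typing import Dict, List, Sequence


def aos_to_soa(atoms: Sequence[Dict[str, Sequence[float]]],
               fields: Sequence[str] = ("r", "p")) -> Dict[str, List[float]]:
    """Flatten per-atom records into one contiguous 1-D list per field.

    Component d (0=x, 1=y, 2=z) of a field for atom i ends up at
    index 3*i + d of that field's list.
    """
    soa: Dict[str, List[float]] = {name: [] for name in fields}
    for atom in atoms:
        for name in fields:
            soa[name].extend(atom[name])  # append x, y, z in order
    return soa


# Hypothetical two-atom system: position r and momentum p per atom.
atoms = [
    {"r": (0.0, 0.0, 0.0), "p": (1.0, 0.0, 0.0)},
    {"r": (1.5, 0.0, 0.0), "p": (0.0, 2.0, 0.0)},
]
flat = aos_to_soa(atoms)
```

With this layout, the y-momentum of atom 1 is read as `flat["p"][3 * 1 + 1]`, and each field can be handed to a transfer routine as a single contiguous buffer.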
3) Synchronization of Measuring Event Timers: Power
and performance measurements are obtained for each device
individually. This approach removes unnecessary overhead
because devices are not required to communicate measurement
data during execution. However, this approach also requires the
use of three (or more) separate timers: the algorithm timer, host
timer, and accelerator timer(s). To synchronize these data files,
two timings are output with each measurement or event output
statement: local time in the format [HH:MM:SS], and the time
from start as a decimal value. The time from start (TFS) value
represents the time elapsed from the start of each measurement
tool or algorithm execution. The local timestamp alone
ensures that all timings are accurate to within one second;
combined with TFS, the timing error is reduced to a fraction
of a second (within 20 ms for the host, and 100 ms for the Xeon
Phi).
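One possible way to combine the two timings is sketched below in Python. The log line format `[HH:MM:SS] TFS label` and all function names are assumptions made for illustration, and the interval-intersection estimate is not necessarily the method used by the authors. The sketch also assumes that timestamps are truncated to whole seconds and that a run does not cross midnight.

```python
import time
from typing import Iterable, Optional, Tuple


def format_event(label: str, start: float, now: Optional[float] = None) -> str:
    """Return '[HH:MM:SS] <TFS> <label>' for one event or sample."""
    now = time.time() if now is None else now
    stamp = time.strftime("%H:%M:%S", time.localtime(now))
    return f"[{stamp}] {now - start:.3f} {label}"


def parse_event(line: str) -> Tuple[int, float, str]:
    """Split a line into (seconds since midnight, TFS, label)."""
    stamp, tfs, label = line.split(" ", 2)
    h, m, s = (int(x) for x in stamp.strip("[]").split(":"))
    return h * 3600 + m * 60 + s, float(tfs), label


def estimate_start(lines: Iterable[str]) -> float:
    """Estimate when a log's timer started (seconds since midnight).

    Each line bounds the start time: stamp - TFS <= start < stamp + 1 - TFS.
    Intersecting the bounds of all lines narrows the one-second uncertainty.
    """
    lo, hi = float("-inf"), float("inf")
    for line in lines:
        sec, tfs, _ = parse_event(line)
        lo = max(lo, sec - tfs)
        hi = min(hi, sec + 1 - tfs)
    return (lo + hi) / 2.0


def offset_to_app(tool_lines: Iterable[str], app_lines: Iterable[str]) -> float:
    """Seconds to add to a tool's TFS to place it on the application's TFS axis."""
    return estimate_start(tool_lines) - estimate_start(app_lines)
```

A power sample with tool TFS `t` then corresponds to application TFS `t + offset_to_app(tool_lines, app_lines)`.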
Until a more sophisticated and automatic method is developed,
direct source code manipulation is the simplest solution
to start with. Because offload execution already requires
source code manipulation, the additional event timing statements
are a reasonable cost, especially for CoMD, which features
robust profiling output by default. Specifically, CoMD provides
four specific functions which must occur within one iteration
of the simulation; for other codes one or more functions may
be required, but in general it is most important to quantify total
simulation time, the time spent performing offload execution,
and the time spent transferring data during the communication
phase.
4) Obtaining Execution Time Values: The execution tim-
ings of interest are specifically: the time to compute on the
host, communicate on the host, compute on the Xeon Phi,
and transfer data over the PCI bus. To obtain these timings, the
application output must be consulted: for CoMD, the timings
are thoroughly profiled, although only the root timings are
provided in their entirety. For other applications, additional
timings may be required to obtain host computation; this will be
investigated in the future. The execution timings, which are
already collected for each sub-domain, have been exposed to
obtain exact timings for each sub-domain. Although this is
of limited interest in these offload-only execution experiments,
the authors’ preliminary investigation into executions
with multiple execution modes showed this information to be
crucial; it has therefore been retained for future investigations
until an improved method has been determined.
5) Using CPU Hardware Counters: The host CPU
micro-measurement application has been developed to
continuously read the RAPL power counter for CPU Core
and DRAM power; the sum is regarded as total CPU power
because uncore device power is not considered. Additionally,
host performance is measured using PAPI, which reads the
last-level-cache memory fill counter and the unhalted CPU
cycles; the hardware counter names differ
slightly depending on micro-architecture. For Sandy-Bridge,
“MEM_LOAD_UOPS_MISC_RETIRED:LLC_MISS”,
and “CPU_CLK_UNHALTED:THREAD_P” are the
native hardware counters used to approximate host
memory usage and bandwidth. For Ivy-Bridge, unhalted
cycles are captured using the same counter as on
Sandy-Bridge, but the LLC memory counter is:
“MEM_LOAD_UOPS_RETIRED:L3_MISS”. Power and
performance are sampled every 10 ms; the resulting
data is printed to output with the timestamp and TFS for
synchronization.
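The structure of such a sampling loop can be sketched in Python as follows. This is only an outline under stated assumptions: the `read` callable stands in for the actual RAPL and PAPI reads, which are not shown, and the function and field names are hypothetical. The clock and sleep functions are injectable so the loop can be exercised without hardware.

```python
import time
from typing import Callable, Dict, Iterator


def sample(read: Callable[[], Dict[str, float]],
           period_s: float = 0.010,
           n_samples: int = 5,
           clock: Callable[[], float] = time.time,
           sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield one '[HH:MM:SS] TFS name=value ... total=value' line per sample."""
    start = clock()
    for i in range(n_samples):
        now = clock()
        values = read()  # placeholder for the RAPL core and DRAM reads
        total = sum(values.values())  # core + DRAM is taken as total CPU power
        stamp = time.strftime("%H:%M:%S", time.localtime(now))
        fields = " ".join(f"{k}={v:.3f}" for k, v in values.items())
        yield f"[{stamp}] {now - start:.3f} {fields} total={total:.3f}"
        # Sleep until the next multiple of the period to limit drift.
        delay = start + (i + 1) * period_s - clock()
        if delay > 0:
            sleep(delay)
```

Scheduling against the start time, rather than sleeping a fixed 10 ms after each read, keeps the sample spacing close to the intended period even when a read takes a noticeable fraction of it.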
6) Using MIC Hardware Counters: The MIC1
micro-measurement application has been developed
to continuously read the available power file [15]:
/sys/class/micras/power which provides
approximated power over two time windows, power to
each connector, and voltage and power readings to the core,
uncore, and DRAM devices. Unlike the host CPU definition
of power, MIC power is based on the power draw measured
for each connector: PCI-E, 2x3, and 2x4. The sum over all
connectors is the absolute power draw for the device as defined
in the Xeon Phi data sheet [9]. The power file is updated
only every 50 ms, so this is the shortest available sampling
interval for the device.
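Summing the connector readings can be sketched in Python as below. The line positions of the PCI-E, 2x3, and 2x4 readings within the power file and the microwatt unit are assumptions that must be verified against the documentation of the installed MPSS release; the helper names are hypothetical.

```python
from typing import Dict, Mapping

# ASSUMPTION: zero-based line positions of the connector readings and
# microwatt units. Check both against your MPSS release before use.
ASSUMED_LAYOUT = {"pcie": 4, "c2x3": 5, "c2x4": 6}


def connector_power_watts(text: str,
                          layout: Mapping[str, int] = ASSUMED_LAYOUT,
                          scale: float = 1e-6) -> Dict[str, float]:
    """Return per-connector power in watts plus their sum under 'total'."""
    lines = text.strip().splitlines()
    power = {name: int(lines[idx].split()[0]) * scale
             for name, idx in layout.items()}
    power["total"] = sum(power.values())
    return power


def read_mic_power(path: str = "/sys/class/micras/power") -> Dict[str, float]:
    """Read the power file on the device and sum the connector readings."""
    with open(path) as f:
        return connector_power_watts(f.read())


# Invented sample content, used only to exercise the parser.
sample_text = "\n".join([
    "100000000", "101000000", "99000000", "120000000",
    "30000000", "35000000", "36000000", "0 0 1000000",
])
reading = connector_power_watts(sample_text)
```

Polling `read_mic_power` faster than once every 50 ms would only return repeated values, since the file is not updated more often than that.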
For offload execution, the MIC micro-measurement app
only measures power; however, for native or symmetric
execution, this micro-app would also measure performance
with PAPI. For the Xeon Phi, the native hardware counters
of interest are “L2_DATA_READ_MISS_MEM_FILL” and
“CPU_CLK_UNHALTED”; these are used to obtain estimates
for memory usage and bandwidth. For offload execution,
these hardware counters are instead measured over the
duration of each offload section and printed with the application
output. To obtain an estimate for vectorization intensity, a
few executions (although one is sufficient) should be measured
using the hardware counters “VPU_ELEMENTS_ACTIVE” and
“VPU_INSTRUCTIONS_EXECUTED”,
where elements divided by instructions gives the vectorization
intensity. In general, the value should be between 1 and 8
for double-precision, and between 1 and 16 for single-precision [13].
1 MIC stands for “Many Integrated Core”, the technology used in Intel Xeon Phi.
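The vectorization intensity ratio can be written as a short Python helper. The function name and the returned range flag are illustrative choices; the counter totals are assumed to have been accumulated over the same offload sections.

```python
from typing import Tuple


def vectorization_intensity(elements_active: int,
                            instructions_executed: int,
                            double_precision: bool = True) -> Tuple[float, bool]:
    """Return (intensity, in_expected_range).

    intensity = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED.
    The expected range is 1..8 for double precision, 1..16 for single.
    """
    if instructions_executed <= 0:
        raise ValueError("instructions_executed must be positive")
    intensity = elements_active / instructions_executed
    upper = 8.0 if double_precision else 16.0
    return intensity, 1.0 <= intensity <= upper
```

A value outside the expected range usually points to mismatched counter totals rather than to a property of the code, so the flag is worth checking before the value is used further.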
III. PROFILING EXECUTIONS
In this work, an execution is more than simply running the
application; it requires properly measuring CPU and Xeon Phi
power and performance, and synchronizing this output with the
application’s execution. A properly defined experiment must be
presented, and the process of mining the raw output data is also
discussed. The results of this process are measured performance
metrics that describe various attributes of an application, such
as the total workload for the accelerator, defined in floating-point
operations, and bandwidths for many different data transfer situations.
The executions are always run with the offload report
environment variable (OFFLOAD_REPORT) set to 2; this provides MIC
time, CPU time (if available), and data transferred to and from the device.
To distinguish offload reports between various sub-domains, it
is advised that MPI is executed with the ‘-l’ option to print
the sub-domain identification number [?].
A. The Experiment
The experiment should be designed such that all investi-
gated parameters are meaningful. In this work, six meaningful
parameters have been chosen: the system, number of nodes,
number of Xeon Phi per node, total problem size, host
frequency (no DVFS), and Xeon Phi cores used. In general, the
authors are interested in determining the optimal configuration
(defined by all six parameters). The parameters are split into a
static configuration (defined by: system, nodes, MICs/node,
and problem size) and a configuration space (defined by: host
frequency and MIC cores). On the Borges system, two static
configurations per problem size are investigated, MIC 1 and
MIC 2, because the system consists of a single node. On Bolt,
six static configurations per problem size are investigated, labeled
N# MIC #: Bolt has only three nodes with two Xeon Phi each,
hence one-, two-, and three-node configurations are investigated,
each with one and two Xeon Phi used. The parameters have
been grouped into static configurations and a configuration space
because static configurations may be easily compared with one
another, each being characterized by a minimum energy; that
minimum energy is found by searching the configuration space,
because these parameters impact execution energy and performance.
Note that, although the number of Xeon Phi also impacts energy, it
is often desirable to compare the performance and energy for
each number of Xeon Phi investigated.
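The split between static configurations and the configuration space can be enumerated with a few lines of Python. The label format follows the N# MIC # convention above; the function names and the example frequency and core values are hypothetical and do not reflect the settings used in the experiments.

```python
from itertools import product
from typing import List, Sequence, Tuple


def static_configurations(max_nodes: int, max_mics_per_node: int) -> List[str]:
    """Label every (nodes, MICs per node) pair, e.g. 'N2 MIC 1'."""
    return [f"N{n} MIC {m}"
            for n, m in product(range(1, max_nodes + 1),
                                range(1, max_mics_per_node + 1))]


def configuration_space(host_freqs_ghz: Sequence[float],
                        mic_cores: Sequence[int]) -> List[Tuple[float, int]]:
    """All (host frequency, MIC cores) pairs searched per static configuration."""
    return list(product(host_freqs_ghz, mic_cores))


bolt = static_configurations(max_nodes=3, max_mics_per_node=2)
# Hypothetical example values for the two swept parameters.
space = configuration_space([1.2, 2.0], [30, 60])
```

The number of executions per problem size is then `len(bolt) * len(space)`, which makes the cost of adding one more frequency or core setting explicit.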
An experiment is composed of many executions; each may
vary in configuration, but each follows the same execution
process to ensure minimal measurement overhead. The process
for a typical execution is as follows:
1) Start CPU micro-measurement app on all nodes
2) Start MIC micro-measurement app on all Xeon Phi
3) Sleep 20 seconds
4) Execute CoMD
5) Sleep 10 seconds
6) Stop MIC micro-measurement app on all Xeon Phi
7) Stop CPU micro-measurement app on all nodes
8) Copy MIC power output files from MIC to storage
9) Sleep 60 seconds.
Ample idle time is provided before execution begins to
ensure a sufficient number of power samples may be obtained
for each device such that idle power may be measured. For
larger clusters, the timing for step three may need to be
adjusted. The command to start each MIC power measurement
is issued using SSH which incurs a slight delay before power
measurements may begin. Idle power measurements are based
on at least 10 seconds of sample data. CoMD is then executed
according to the execution configuration parameters. Upon
completion, a brief idle period is provided to capture power
measurements before the CPU and MIC measurement threads
are halted. Finally, a rest period of one minute is provided to
allow the system to cool-down. Ideally this should be longer,
but a typical experiment consists of hundreds or thousands
of executions. The time spent allowing the system to rest
accounts for the majority of the total execution time for an
experiment.
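The nine-step process can be expressed as a small Python driver. The sketch assumes that the start, stop, run, and copy actions are supplied as callables (for instance, wrappers around SSH commands, which are not shown); all names are hypothetical.

```python
import time
from typing import Callable


def run_execution(start_cpu: Callable[[], None],
                  start_mic: Callable[[], None],
                  run_app: Callable[[], None],
                  stop_mic: Callable[[], None],
                  stop_cpu: Callable[[], None],
                  copy_mic_output: Callable[[], None],
                  sleep: Callable[[float], None] = time.sleep,
                  idle_before_s: float = 20.0,
                  idle_after_s: float = 10.0,
                  cooldown_s: float = 60.0) -> None:
    """Run one measured execution following the nine steps listed above."""
    start_cpu()            # 1) CPU micro-measurement app on all nodes
    start_mic()            # 2) MIC micro-measurement app on all Xeon Phi
    sleep(idle_before_s)   # 3) idle period used to measure idle power
    run_app()              # 4) execute the application (CoMD)
    sleep(idle_after_s)    # 5) brief idle period after completion
    stop_mic()             # 6) stop MIC measurement
    stop_cpu()             # 7) stop CPU measurement
    copy_mic_output()      # 8) copy MIC power output files to storage
    sleep(cooldown_s)      # 9) rest period so the system can cool down
```

Keeping the three sleep durations as parameters makes it straightforward to lengthen the initial idle period on larger clusters, where starting the MIC measurements over SSH takes longer.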
B. Post-Execution Profiling
Upon successful completion of the experiment, a large
number of output files are available for post-execution
processing. This process involves synchronizing measurement output
with application execution to properly quantify timings, power
measurements, and performance for various phases and states
involved in execution. These raw metrics are then to be used
to establish estimated global parameters that define each static
configuration.
1) Obtaining Execution Time Values: Extracting execution
time is fairly simple with CoMD. There are four main
functions that occur every iteration: position update, velocity
update, force computation, and data sharing between sub-domains
(redistribution); for EAM, there is an additional data transfer
during the force computation since it is a more specific algorithm.
Host execution time is defined as the time to update position,
plus the time to update velocity, plus the data redistribution
time, minus the data transfer occurring within
the redistribution phase. Host communication time is the sum
of both communication phases (within the redistribution and
force phases), but is more accurately defined as two communication
timings: halo exchange and reduce. This separation is preferred
because, with respect to optimization, it is more informative to
separate the point-to-point data transfers from the reduce function
timings. Xeon Phi computation time is the sum of the offload
report MIC time over all offloads throughout the simulation.
PCI transfer time is based on the offload report as well:
it is the difference between the sum of CPU time and the sum of
MIC time over all offloads during the simulation. If CPU
time is undefined (reported as 0.0000 seconds), PCI time may be
approximated as the difference between the total simulation time
(defined as loop in the CoMD output timings) and the time
to compute and communicate on the host and to compute on the
accelerator.
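The timing definitions above translate into the following Python sketch. The record and field names are hypothetical; the inputs are assumed to have been parsed already from the CoMD timing output and from the offload reports, given as (CPU time, MIC time) pairs.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class LoopTimings:
    position: float           # time to update positions
    velocity: float           # time to update velocities
    redistribute: float       # redistribution, including its data transfer
    comm_redistribute: float  # data transfer inside redistribution
    comm_force: float         # data transfer inside the force phase (EAM)
    loop: float               # total simulation time ("loop" in CoMD output)


def execution_times(t: LoopTimings,
                    offloads: Sequence[Tuple[float, float]]) -> Dict[str, float]:
    """Derive the four timings from (cpu_time, mic_time) offload reports."""
    host_compute = t.position + t.velocity + t.redistribute - t.comm_redistribute
    host_comm = t.comm_redistribute + t.comm_force
    cpu_total = sum(cpu for cpu, _ in offloads)
    mic_compute = sum(mic for _, mic in offloads)
    if cpu_total > 0.0:
        pci = cpu_total - mic_compute
    else:
        # CPU time not reported: fall back to the remainder of the loop time.
        pci = t.loop - (host_compute + host_comm + mic_compute)
    return {"host_compute": host_compute, "host_comm": host_comm,
            "mic_compute": mic_compute, "pci": pci}
```

The fallback branch attributes all unaccounted loop time to PCI transfer, so it should be read as an upper estimate when other untimed work occurs inside the loop.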
2) Obtaining Power Consumption Values: Extracting the power
draw for each state, idle or active, is accomplished slightly
differently for each device because the idle periods and their
durations differ for each device. For the host, the active state is
defined by the host computation time; during host communication,
PCI data transfer, or computation on the accelerator, the host is
considered idle. For the Xeon Phi, the active state is defined by
the time to compute on the accelerator, and the device is otherwise
idle. To synchronize power draw to execution state, the TFS for
each file has to be synchronized to the TFS value from the
application for the key execution phases. Host idle power is
accumulated during host communication and the entire force
computation because it includes accelerator computation and
PCI data transfer. All other power is accumulated in host active
power. For the Xeon Phi, each power sample taken during
offload execution is accumulated as active power, and every other
sample is accumulated as idle power. The accumulated power is
finally divided by the number of samples for each state (always
greater than 100 samples) to obtain the average power.
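The accumulation by state can be sketched in Python as follows. It assumes that the sample times have already been shifted onto the application's TFS axis and that the active phases are given as half-open (start, end) intervals; the function name is hypothetical.

```python
from typing import Dict, Sequence, Tuple


def average_power_by_state(samples: Sequence[Tuple[float, float]],
                           active_intervals: Sequence[Tuple[float, float]]
                           ) -> Dict[str, float]:
    """Average power (W) per state from (tfs, watts) samples.

    A sample is 'active' if its time falls in any [start, end) interval,
    otherwise 'idle'. A state with no samples yields NaN.
    """
    sums = {"active": 0.0, "idle": 0.0}
    counts = {"active": 0, "idle": 0}
    for t, watts in samples:
        is_active = any(a <= t < b for a, b in active_intervals)
        state = "active" if is_active else "idle"
        sums[state] += watts
        counts[state] += 1
    return {s: (sums[s] / counts[s] if counts[s] else float("nan"))
            for s in sums}
```

For the Xeon Phi the active intervals are the offload sections; for the host they are the host computation phases, with everything else counted as idle.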
3) Obtaining Performance Values: Extracting the performance
metrics for each phase of execution is fairly straightforward
with the required code instrumentation. For the
Xeon Phi, memory usage is simply the LLC miss counter
multiplied by 64 bytes per cache line; bandwidth is memory
usage times frequency divided by unhalted clock cycles.
Because these are measured explicitly for the offload sections,
they may simply be summed over all offloads during the
simulation. For the host, performance samples must first be
synchronized with execution and assigned to the appropriate phase;
because only the host communication phase is of interest
with respect to performance, this is the only phase in which the
counters are accumulated. The host follows the same simple
formula for computing memory usage and bandwidth as the
Xeon Phi. PCI memory usage is summed over each offload
report within execution of the simulation by adding together
the data sent to and received from the device. PCI bandwidth is
estimated as the amount of data transferred divided by the transfer
time.
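These formulas are restated below as a minimal Python sketch. The function names are hypothetical; the counter values, frequency, and transfer sizes are assumed to come from the instrumented offload sections and the offload reports.

```python
CACHE_LINE_BYTES = 64  # bytes moved per last-level-cache miss


def memory_usage_bytes(miss_count: int) -> int:
    """Memory usage = miss counter x 64 bytes per cache line."""
    return miss_count * CACHE_LINE_BYTES


def memory_bandwidth(miss_count: int, unhalted_cycles: int,
                     frequency_hz: float) -> float:
    """Bandwidth (bytes/s) = memory usage x frequency / unhalted cycles."""
    if unhalted_cycles <= 0:
        raise ValueError("unhalted_cycles must be positive")
    return memory_usage_bytes(miss_count) * frequency_hz / unhalted_cycles


def pci_bandwidth(bytes_sent: int, bytes_received: int,
                  transfer_time_s: float) -> float:
    """PCI bandwidth (bytes/s) = data sent plus received / transfer time."""
    if transfer_time_s <= 0:
        raise ValueError("transfer_time_s must be positive")
    return (bytes_sent + bytes_received) / transfer_time_s
```

Dividing the unhalted cycles by the frequency gives the time the cores were running, so the bandwidth formula is simply bytes moved over that running time.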
Finally, work on the Xeon Phi may be estimated by
multiplying computational throughput and computation time.
Computational throughput may be calculated as the product of
the number of cores, vectorization intensity, average number
of operations per cycle, and operational frequency. For CoMD,
vectorization intensity is measured to be 2.6; operations per
cycle is estimated to be 1.15 as few fused-multiply operations
are vectorized for this version of CoMD. These two values
depend heavily on the implementation and application, and
must be measured or approximated appropriately for each
application. Operations per cycle may be approximated by
cross-referencing the compiler vectorization report with the source
code to determine which fused-multiply operations have been
vectorized; all other operations count as one. Then, assuming
each operation were to count only as one, take the ratio of the
number of vectorized to non-vectorized operations.
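The work estimate can be written out as a short Python sketch. The function names are hypothetical, and the core count and frequency in the usage note below are placeholder values rather than the settings used by the authors.

```python
def computational_throughput(cores: int, vectorization_intensity: float,
                             ops_per_cycle: float, frequency_hz: float) -> float:
    """Throughput (operations/s) = cores x VI x ops per cycle x frequency."""
    return cores * vectorization_intensity * ops_per_cycle * frequency_hz


def xeon_phi_work(cores: int, vectorization_intensity: float,
                  ops_per_cycle: float, frequency_hz: float,
                  compute_time_s: float) -> float:
    """Work (operations) = computational throughput x computation time."""
    throughput = computational_throughput(
        cores, vectorization_intensity, ops_per_cycle, frequency_hz)
    return throughput * compute_time_s
```

For example, `xeon_phi_work(60, 2.6, 1.15, 1.0e9, 10.0)` combines the CoMD values of 2.6 and 1.15 reported above with a hypothetical 60 cores at 1 GHz and 10 seconds of accelerator computation.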
IV. CONCLUSIONS AND FUTURE WORK
This work has provided a detailed procedure through which
developers may profile applications offloaded to accelerators
and produce meaningful conclusions and insights. At this
stage, only the mini-application CoMD has been investigated
thoroughly with the procedure, but it is of the utmost
importance to validate the technique on many other
mini-applications in the future. Additionally, reducing the number
of executions needed to accurately measure the application is a high
priority because this would provide larger datasets in a fraction
of the time. Currently, the experiment requires several days to
complete in the cluster environment because each configura-
tion change requires a new execution and the associated system
cool-down time. Finally, collecting hardware counter measure-
ments without the aid of PAPI is also to be investigated.
ACKNOWLEDGMENTS
This work was supported in part by the Air Force Office
of Scientific Research under the AFOSR award FA9550-12-1-0476,
by the National Science Foundation grants NSF/OCI-0941434,
0904782, 1047772, and by the U.S. Department of
Energy, Office of Advanced Scientific Computing Research,
through the Ames Laboratory, operated by Iowa State Univer-
sity under contract No. DE-AC02-07CH11358.
REFERENCES
[1] Jee Choi, Marat Dukhan, Xing Liu, and Richard Vuduc. Algorithmic
time, energy, and power on candidate HPC compute building blocks.
In 2014 IEEE 28th International Symposium on Parallel Distributed
Processing (IPDPS), Arizona, USA, May 2014.
[2] DOE. Co-design, 2013. http://science.energy.gov/ascr/research/scidac/co-design/.
[3] ExMatEx. CoMD proxy application, 2012.
http://www.exmatex.org/comd.html.
[4] R. Hayashi and S. Horiguchi. Domain decomposition scheme for parallel
molecular dynamics simulation. In High Performance Computing on the
Information Superhighway, 1997. HPC Asia ’97, pages 595–600, Apr
1997.
[5] ICL:UT. Performance application programming interface PAPI, 2015.
http://icl.cs.utk.edu/papi/.
[6] Intel. How to use huge pages to improve application
performance on Intel Xeon Phi coprocessor, 2012.
https://software.intel.com/sites/default/files/Large_pages_mic_0.pdf.
[7] Intel. Intel MPI library reference manual, 2014.
https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/index.htm?wapkw=.
[8] Intel. Intel C++ and Fortran compilers, 2015.
https://software.intel.com/en-us/intel-compilers.
[9] Intel. Intel Xeon Phi coprocessor: Datasheet, 2015.
http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-coprocessor-datasheet.html.
[10] James Reinders Intel. An overview of programming for
Intel Xeon processors and Intel Xeon Phi coprocessors, 2013.
https://software.intel.com/sites/default/files/article/330164/an-overview-of-programming-for-intel-xeon-processors-and-intel-xeon-phi-coprocessors_1.pdf.
[11] Kevin Davis Intel. Effective use of
the Intel compiler’s offload features, 2013.
https://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features.
[12] Martyn Corden Intel. How to compile for Intel AVX, 2012.
https://software.intel.com/en-us/articles/how-to-compile-for-intel-avx.
[13] Shannon Cepeda Intel. Optimization and performance tuning for Intel
Xeon Phi coprocessors, part 2: Understanding and using hardware
events, 2012. https://software.intel.com/en-us/articles/.
[14] Taylor Kidd Intel. Intel Xeon Phi coprocessor power management
configuration: Using the micsmc command-line interface, 2014.
https://software.intel.com/en-us/blogs/2014/01/31/intel-xeon-phi-coprocessor-power-management-configuration-using-the-micsmc-command.
[15] Intel Forums. How to measure energy consumption
on the coprocessor?, 2013.
https://software.intel.com/en-us/forums/topic/488782?language=en.
[16] loc-nguyen Intel. Intel Xeon Phi coprocessor
developer’s quick start guide, 2015.
https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide.
[17] Jamaludin Mohd-Yusof, Sriram Swaminarayan, and Timothy C. Germann.
Co-design for molecular dynamics: An exascale proxy application, 2013.
http://www.lanl.gov/orgs/adtsc/publications/science_highlights_2013/docs/Pg88_89.pdf.
[18] PAPI. Accessing the Intel RAPL registers, 2013.
http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:RAPL_Access.
