Abstract: Intel's Xeon Phi coprocessor has proved its capability through its use in Tianhe-2 and Stampede, two of the ten most powerful supercomputers today. The popularity of Xeon Phi in heterogeneous computing will almost certainly grow significantly, which calls for comprehensive studies of different aspects of this newly arrived many-core chip. Despite a number of previous studies on the performance of Xeon Phi, the power and energy behavior of the coprocessor has not been fully studied. In this paper, we present the performance, power and energy results of multiple parallel programs with contrasting workloads running on Intel Xeon Phi. Several interesting findings are derived from these results: 1) Xeon Phi threads are power-hungry even when idle, and altering the number of executing threads largely affects the power consumption; 2) performance improvement and energy savings are highly related; 3) running code in native mode yields better performance and consumes less energy; and 4) co-running programs with complementary workloads has the potential to conserve energy with negligible performance influence. In addition, we discuss an incorrect way of measuring the power of Xeon Phi using the on-chip power sensors and present our solutions.
I. INTRODUCTION
As the amount of data available for processing continues to grow quickly, more computation power is needed to process it effectively. Intel's Xeon Phi coprocessor [1], a new line of many-core, multi-threaded high-performance computing chips, has become a popular answer to this demand. Both Tianhe-2 [2] (today's fastest supercomputer) and Stampede [3] (ranked 7th on the Top500 list) are powered by the Xeon Phi coprocessor.
The Xeon Phi coprocessor is equipped with 60 x86-based cores, each capable of running 4 threads. This gives the Xeon Phi much higher computation throughput than traditional multi-core/multi-thread processors. What makes the Xeon Phi a very enticing solution for developers who wish to exploit more parallelism is that code doesn't need to be re-written in order to run natively on the Xeon Phi, unlike GPU codes. Additionally, code sections can be easily offloaded to the Xeon Phi with minor modification.
The Xeon Phi coprocessor is designed for high computation density and energy efficiency. The majority of previous work focused on the performance aspect of the Xeon Phi. It is paramount to understand the power and energy behavior of the Xeon Phi in order to exploit all of the possible performance and energy gains or trade-offs. A previous paper [4] that studied the energy behavior of the Xeon Phi concentrates on the energy per instruction, while our work focuses on the behavior of the processor's power and energy consumption when running different workloads and configurations. More specifically, we present the performance, power and energy results of four carefully selected applications: the Barnes-Hut algorithm, the Shellsort algorithm, Single Source Shortest Paths (Dijkstra's algorithm) and Fibonacci calculation (cf. Section II for details). These results demonstrate the energy and power characteristics of various workloads and improve our understanding of when and where the Xeon Phi can be used to exploit its strengths for both performance and energy efficiency. Additionally, to the best of our knowledge, there is no published work that explains the details of profiling the runtime power of Xeon Phi using its built-in power sensors.
This paper makes the following contributions: 1) We demonstrate how to correctly profile the instantaneous power consumption of Xeon Phi using its built-in power sensors. We also demonstrate that the real-time power trace can accurately reflect program behavior (cf. Figs. 1-4). 2) We show that Xeon Phi threads are power-hungry even when idle and that altering the number of executing threads can largely affect the power consumption of an application. 3) We show that performance improvement and energy conservation are closely related for programs running on Xeon Phi. 4) We illustrate the benefit of running codes on Xeon Phi in the native mode. 5) We study the impact of co-running Xeon Phi programs on performance and energy efficiency.
The remainder of the paper is organized as follows. Section II provides details about the selected applications. Section III discusses the methodology of measuring power consumption of code running on Xeon Phi and power traces of selected applications. Section IV presents the performance and energy results of selected applications. Section V summarizes our work and draws conclusions.
II. SELECTED APPLICATIONS
To fully reveal the energy and power characteristics of the Xeon Phi coprocessor, we carefully select four applications with completely different workloads. The Barnes-Hut algorithm is a well balanced workload with irregular memory references; the Shellsort algorithm is computation intensive with a gradually reduced workload; Single Source Shortest Paths (Dijkstra's algorithm) has an unbalanced workload because the amount of work available during the execution of SSSP varies with the number of neighbors of each node; and the Fibonacci calculation (recursive version) is a typical application with a skewed workload.
1) Barnes-Hut:
The Barnes-Hut algorithm [5] is designed to solve the n-body simulation problem [6] by approximating the forces acting on each body. It hierarchically partitions the volume around the n bodies into successively smaller cells and records this spatial hierarchy in an unbalanced octree. Each cell forms an internal node of the octree and summarizes information about all the bodies it contains. The leaves of the octree are the individual bodies. This spatial hierarchy reduces the time complexity to O(n log n) because, for cells that are sufficiently far away, the algorithm performs only one force calculation with the cell instead of one force calculation with each body inside the cell, thus drastically reducing the amount of computation. However, different parts of the octree have to be traversed to compute the force on each body, making its control flow and memory-access patterns quite irregular. The Barnes-Hut implementation that we measure is parallelized with OpenMP to simulate 20,000 bodies for 1,000 time steps. The main computation-intensive portion of the algorithm is parallelized with a "parallel for" pragma, where each thread is given an even portion of the data set for processing.
2) Shellsort: Shellsort [7] is a comparison-based in-place sorting algorithm. It starts by sorting elements far apart from each other and progressively reduces the gap between them. It is computation intensive (primarily comparing and swapping elements) and the workload reduces gradually because more elements will be in a relatively sorted order toward the end of the program (i.e. less swapping will occur). The Shellsort implementation that we measure is also parallelized with OpenMP to sort 100 million numbers. The "parallel for" pragma that surrounds the computation portion of the algorithm divides the data set evenly among the requested number of threads.
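The gap-reduction idea can be illustrated with a minimal serial sketch. It is not the measured OpenMP code: the halving gap sequence (n/2, n/4, ..., 1) and the per-pass move counter are assumptions made for illustration, since the text does not state which gap sequence the measured implementation uses.

```python
def shellsort(values):
    """Sort `values` in place and return a list of (gap, element_moves) per pass.

    Assumption: Shell's original halving gap sequence; the paper does not
    specify the sequence used by the measured implementation.
    """
    n = len(values)
    moves_per_pass = []
    gap = n // 2
    while gap > 0:
        moves = 0
        # Gapped insertion sort: elements `gap` apart form sorted subsequences.
        for i in range(gap, n):
            item = values[i]
            j = i
            while j >= gap and values[j - gap] > item:
                values[j] = values[j - gap]  # shift the larger element right
                j -= gap
                moves += 1
            values[j] = item
        moves_per_pass.append((gap, moves))
        gap //= 2
    return moves_per_pass


if __name__ == "__main__":
    data = [23, 4, 42, 15, 8, 16, 1, 99, 7, 3]
    print(shellsort(data))  # per-pass (gap, moves) bookkeeping
    print(data)             # the list is now sorted in place
```

The returned per-pass counts give a rough view of how much data movement each gap requires for a given input.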
3) SSSP:
Single Source Shortest Paths (Dijkstra's algorithm) [8] is a graph searching algorithm that computes the shortest distances from a chosen source node to the other nodes of a graph. SSSP is a typical example of an unbalanced workload. We can see a spike (as shown in Fig. 3) at the end of SSSP because the amount of parallelism available during the execution of SSSP varies with the number of neighbors during each iteration. For each iteration, threads are created and compute the distance of each neighbor before moving on to another node. The number of neighbors varies from node to node, so the amount of parallelism changes throughout the runtime of SSSP. Choosing a different source node for SSSP changes when the high degree of parallelism occurs during execution, since the search propagates through the graph from a different starting location and therefore reaches the highly parallel sections at different times. The SSSP implementation that we measure is parallelized using OpenMP and takes an input graph with 265,000 nodes and 733 edges. The main computation is surrounded with a "parallel for" pragma, so each thread is given an iteration to execute. Each iteration is assigned to a starting graph node in SSSP, so each executing thread has a different amount of computation to perform depending on the number of neighbors of its graph node.
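For reference, the following is a minimal serial sketch of Dijkstra's algorithm using a binary heap. It is not the parallel OpenMP implementation measured here; the adjacency-list format and the function name are assumptions for illustration. The inner loop over neighbors is where the per-node work varies, which is the source of the imbalance described above.

```python
import heapq


def dijkstra(graph, source):
    """Return a dict of shortest distances from `source`.

    `graph` maps each node to a list of (neighbor, non_negative_weight)
    pairs (an assumed input format, chosen for brevity).
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        # The amount of work here depends on how many neighbors u has.
        for v, w in graph.get(u, ()):
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


if __name__ == "__main__":
    toy = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
    print(dijkstra(toy, "a"))
```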
4) Fibonacci:
The Fibonacci sequence [9] is a well-known math problem. The Fibonacci calculation code that we measure calculates 45 Fibonacci numbers (i.e. Fib(2), Fib(3), ..., Fib(46)). Each Fibonacci calculation generates a task, which recursively calculates the Fibonacci number at the respective sequence position. Each task is assigned to a waiting thread, which completes the actual computation of a Fibonacci number. Since the work required to calculate large Fibonacci numbers (e.g. Fib(46) and Fib(45)) is much heavier than for small Fibonacci numbers (e.g. Fib(2) and Fib(3)), this implementation has a skewed (i.e. highly unbalanced) workload.
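The degree of skew can be estimated with a short sketch. It assumes a naive recursion with base cases Fib(0) and Fib(1) and uses the number of recursive calls as a stand-in for work; both are assumptions, since the measured code's base cases and per-call cost are not given here.

```python
def recursive_calls(n):
    """Number of calls made by a naive recursive fib(n) with base cases n < 2.

    Computed iteratively from calls(n) = 1 + calls(n-1) + calls(n-2),
    so no deep recursion is actually executed.
    """
    calls = [1, 1]
    for k in range(2, n + 1):
        calls.append(1 + calls[k - 1] + calls[k - 2])
    return calls[n]


def work_share(first, last, top):
    """Fraction of total work carried by the `top` largest tasks in
    Fib(first)..Fib(last), using call counts as the work estimate."""
    work = [recursive_calls(n) for n in range(first, last + 1)]
    return sum(work[-top:]) / sum(work)


if __name__ == "__main__":
    share = work_share(2, 46, 3)
    print(f"Fib(44)..Fib(46) carry about {share:.0%} of the estimated work")
```

Under these assumptions, the three largest tasks account for roughly three quarters of the total call count, which is why adding threads beyond a handful yields little additional useful parallelism.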
III. XEON PHI POWER MEASUREMENT
The energy consumption of computer components can be obtained either via power models or via direct measurement. The idea of power modeling (a.k.a. indirect measurement) is to estimate the power consumption of a node by correlating power with hardware performance counters or other events. Two widely used CPU power models are Wattch [10] and McPAT [11]. Direct measurement methods periodically sample the current and voltage, calculate the power by multiplying the two values, and compute the total energy as the integral of the power over the execution time. WattsUP [12] is a widely used power meter that can directly measure the total energy consumed by an entire node. While WattsUP is easy to use, its sampling frequency is very low (1 Hz). More importantly, it cannot profile the power of individual components (e.g. CPU or DRAM), which makes it insufficient for analyzing the energy efficiency of complicated code. To tackle this problem, several tools have been developed to provide fine-grained power consumption information. PowerPack [13] is the most well-known tool, which was developed at Virginia Tech for the power-aware cluster System G [14]. PowerPack is able to measure the power consumption of individual components (e.g., the CPU or DRAM) within a node. However, its profiling approach is fairly expensive, difficult to implement, and hard to scale. Another widely used tool is PowerMon [15], [16], which comprises a power monitoring card that plugs into the motherboard. Compared to PowerPack, PowerMon is cheaper and easier to implement because it only contains a single integrated circuit (no wiring or soldering required). However, it can only measure a limited number of channels. Sandia National Laboratories also made substantial efforts in developing component-level power measurement tools for large-scale systems [17], [18], [19]. They recently presented the PowerInsight tool [20], which can instrument accelerators that draw power from the PCI bus and external power supplies. It is worth noting that built-in power sensors are also gaining popularity in accelerators and coprocessors. For example, GPUs such as the Tesla C2075 and K20 and the Intel Xeon Phi include on-board power sensors that allow direct power measurement while a program is running [21].
More specifically, Xeon Phi provides a C/C++ library with a set of APIs (a.k.a MICAccessAPI [1] ) that allows users to monitor and configure several metrics (including power) of the coprocessor. MICAccessAPI [1] is primarily responsible for establishing connections with the host driver and coprocessor OS, allowing software to monitor and configure the Xeon Phi parameters. The power results that we present in this paper are measured and recorded by issuing the MicGetPowerUsage() call to the MICAccessAPI during execution of each experiment.
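Once timestamped power samples have been recorded, total energy follows from integrating power over time. The sketch below uses the trapezoidal rule; the function name and the list-based input format are assumptions for illustration, and how the samples are obtained from the sensor API is outside its scope.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate power (watts) over time (seconds) with the trapezoidal rule.

    Works for unevenly spaced samples, which matters when the sensor's
    sampling rate is not constant.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


if __name__ == "__main__":
    # Synthetic example values, not measurements.
    print(energy_joules([0.0, 0.5, 2.0], [100.0, 120.0, 120.0]))
```

Because the interval between samples enters the sum explicitly, the result does not depend on the sampling rate staying constant during a run.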
Figs. 1-4 show the power traces of Barnes-Hut, Shellsort, SSSP and Fibonacci respectively. These traces support several important observations. First, the power trace can accurately capture the run-time behavior of a program and correctly represent its characteristics. For example, Fig. 1 shows that the Barnes-Hut algorithm is a well balanced workload; consequently, its power trace is stable when using different numbers of threads. Fig. 2 captures the reduced workload of Shellsort: the power consumption gradually drops at the end of the program. Fig. 3 shows varied power consumption (especially the two spikes at the beginning and end of the program), which captures the varied workload of SSSP (determined by the number of neighbors of each node at runtime). Second, the power trace can help us identify features of different applications that are not easily found otherwise. For example, comparing Figs. 1, 2, and 3, we can see that Shellsort constantly consumes more power (when using the same number of threads) than Barnes-Hut and SSSP. Recalling that both Barnes-Hut and SSSP have more memory operations than Shellsort, we can conclude from the power trace that Shellsort has a higher utilization of Xeon Phi cores and that Barnes-Hut tends to have less memory waiting time than SSSP. Third, the power trace can yield new findings. For example, in the power trace of Fibonacci (Fig. 4), we surprisingly find that Xeon Phi threads are power-hungry even when they are sitting idle. Since we only calculate 45 Fibonacci numbers, the majority of threads are idle when using 120 and 240 threads. It is expected that the performance will degrade due to the overhead of generating threads without producing useful work. However, the power increase is certainly beyond our expectation because we assumed that idle threads do not consume much power. The power trace clearly demonstrates that this assumption is wrong.
To investigate whether OMP_WAIT_POLICY affects the power results, we conduct two experiments, one with the ACTIVE policy (i.e. threads consume processor cycles while waiting) and one with the PASSIVE policy (i.e. threads do not consume processor cycles if they are not actively computing). Unfortunately, the power consumption results of both experiments (as shown in Figs. 5 and 6) are almost identical, with the passive policy only slightly reducing power for 30 and 60 threads. We also notice from the power trace that the performance of Fibonacci is almost the same when using 5, 10, and 45 threads, which indicates that we can achieve much better energy efficiency with negligible performance degradation for highly unbalanced workloads by using fewer threads.
The built-in power sensors of Xeon Phi usually report power data at a steady rate (around 85 Hz) in our meter implementation. It should be noted that the sampling rate of the MICAccessAPI can drop to a lower rate (around 50 Hz) when running 240 threads because all cores are fully utilized in this state. Therefore, simply recording the power data is not sufficient to correctly graph Xeon Phi power traces. It is critical to record the timing information simultaneously and use it as the X axis. Fig. 7 illustrates an incorrect power trace of Fibonacci in which the number of samples is used as the X axis. Such a plot leads to the wrong conclusion that using 240 threads improves the performance of Fibonacci. The early power drop of the dark green line (on top) is misleading because the runtime with 240 threads is actually longer. The line drops early because far fewer power samples are collected at the lower sampling rate (around 50 Hz).
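To make the pitfall concrete, the following sketch builds two synthetic constant-power traces and compares where each would end on the X axis. The runtimes (100 s and 120 s) and power levels are made-up illustration values, not measurements; only the approximate sampling rates (85 Hz and 50 Hz) come from the text above.

```python
def synthetic_trace(runtime_s, rate_hz, power_w):
    """Return (timestamp_s, power_w) pairs for a constant-power run."""
    count = int(round(runtime_s * rate_hz))
    return [(i / rate_hz, power_w) for i in range(count)]


def last_x(trace, use_timestamps):
    """Right-most X value of the plotted trace under the chosen axis."""
    if use_timestamps:
        return trace[-1][0]   # elapsed seconds
    return len(trace) - 1     # sample index


# Hypothetical runs: a shorter run sampled quickly, a longer run sampled slowly.
run_fast_sampling = synthetic_trace(100.0, 85.0, 150.0)
run_slow_sampling = synthetic_trace(120.0, 50.0, 180.0)

if __name__ == "__main__":
    for name, trace in (("85 Hz run", run_fast_sampling),
                        ("50 Hz run", run_slow_sampling)):
        print(name,
              "ends at sample", last_x(trace, use_timestamps=False),
              "or at", round(last_x(trace, use_timestamps=True), 2), "s")
```

With the sample index as the X axis, the slower-sampled run appears to finish first even though it ran longer; plotting against the recorded timestamps restores the correct ordering.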
IV. PERFORMANCE AND ENERGY CHARACTERIZATION
In this section, we present the performance and energy results of the four selected applications. In addition, we evaluate and analyze the results from several different aspects, which include the impact of thread count and hyper-threading, the correlations between performance and energy, the comparison of native execution and offload execution, and the impact of co-running multiple applications. We can observe from Fig. 8 that the number of threads has a large impact on both the performance and the energy of each application. Since Xeon Phi has 60 physical cores, each thread (if using fewer than 60 threads) will be scheduled to a physical core for execution. However, when more than 60 threads are used, effective support of hyper-threading becomes critical to ensure further performance improvement.
A. Impact of Thread Count and Hyper-threading
Hyper-threading is a technology introduced by Intel to support simultaneous multithreading (SMT). The key idea is to create multiple logical cores (a.k.a virtual cores) over a single physical core. Each logical core can individually execute a specified thread and all logical cores (up to 4 in Xeon Phi) in a hyper-threaded core share execution resources (e.g. cache). These shared resources allow multiple logical cores to work together efficiently. For example, when one logical core is stalled waiting for I/O, another logical core can be scheduled for execution. This improves parallelization by allowing multiple tasks to be executed simultaneously. 
1) Barnes-Hut:
Barnes-Hut is a balanced workload with numerous irregular memory accesses. Fig. 8 shows that increasing the number of threads continually increases performance and energy savings. We observe that the Barnes-Hut code benefits in both performance and energy consumption when executing 120 and 240 threads. This is because the Barnes-Hut algorithm has irregular memory accesses, which provide more opportunity for scheduling another thread when one thread is waiting for I/O. Meanwhile, we notice significantly diminishing returns once the number of threads exceeds 120, which is probably caused by the sharing of hardware resources and the overhead of context switching.
2) Shellsort: Shellsort is another balanced workload like Barnes-Hut, but less work is done toward the end of the program. From Fig. 8, we can see that Shellsort and Barnes-Hut exhibit a similar trend in terms of performance improvement and energy savings. Although they scale in a similar fashion, Shellsort does not scale as well per thread as Barnes-Hut because the ratio of computation to memory access is low. We also notice that Shellsort does not benefit much from hyper-threading: its performance and energy efficiency do not improve much between 120 and 240 threads.
3) SSSP:
Topology-driven SSSP is an unbalanced workload, unlike both previous examples. SSSP's energy trend shows that it benefits from increasing the number of executing threads, but not in the same manner as the two balanced workloads. Unlike Barnes-Hut and Shellsort, SSSP continues to scale past the 60-thread mark (the point at which hyper-threading is used). The performance and energy savings from 30 to 60 threads are much smaller than the savings from 60 to 120 threads. It is likely that 120 threads are enough to exploit the maximum amount of parallelism that this specific graph (265k nodes and 733 edges) can offer at any point. This would also explain why using 240 threads does not offer much performance and energy benefit.
4) Fibonacci:
The algorithm calculating 45 Fibonacci numbers is the most unbalanced workload of all four selected applications. The method of parallelization involves each executing thread independently computing one Fibonacci number between Fib(2) and Fib(46). The Fibonacci computation is recursive, so the thread that calculates Fib(46) has a much heavier load than the one that calculates Fib(2). In fact, most of the Fibonacci numbers take a small portion of the execution time of Fib(44)-Fib(46). In this case, it is meaningless to run more threads. Our results in Fig. 8 show that the performance when using 5, 10, 45 and 60 threads is almost identical and that the performance drops dramatically when hyper-threading is used. This can be explained by the overhead of creating extra threads without producing useful work. Those idle threads not only hurt performance but also burn a large amount of energy while sitting idle (cf. Fig. 4). Such a skewed workload benefits most in both performance and energy when executing on a small thread pool.
B. Performance and Energy Correlations
To further study the correlations between performance and energy, we plot the runtime and energy consumption of each application in Figs. 9-12.
In Fig. 9, we observe speedups of 1.9x, 3.3x, and 3.6x when using 60, 120, and 240 threads respectively in the Barnes-Hut algorithm (compared to using 30 threads). Meanwhile, we observe energy savings of 42%, 65%, and 66%. Shellsort (Fig. 10) achieves speedups of 1.26x, 1.39x and 1.42x and saves 18%, 26%, and 29% of energy when using 60, 120 and 240 threads respectively (compared to using 30 threads). The SSSP code (Fig. 11) sees speedups of 1.15x, 1.63x, and 2x with energy savings of 10%, 37%, and 48%. In the Fibonacci application (Fig. 12), we see that the performance drops by 11% and 34% when using 120 threads and 240 threads (compared to using 45 threads) and the energy increases by 22% and 80%. The most important observation derived from this set of experiments is that performance and energy are highly related. When the performance of an application running on Xeon Phi improves, it is highly likely that its energy consumption is reduced as well. Conversely, degraded performance can also lead to increased energy consumption.
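One way to read these numbers is through the identity E = P_avg × T: relative energy equals the average-power ratio divided by the speedup. The sketch below back-computes the implied average-power ratio from the rounded speedups and savings quoted above. The function and table names are illustrative, and the results are only as precise as the rounded inputs.

```python
def implied_power_ratio(speedup, energy_saving):
    """Average-power ratio (new / baseline) implied by E = P_avg * T.

    relative_energy = power_ratio / speedup, so
    power_ratio = (1 - energy_saving) * speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - energy_saving) * speedup


# (threads, speedup vs. 30 threads, energy saving vs. 30 threads), as quoted above.
REPORTED = {
    "Barnes-Hut": [(60, 1.9, 0.42), (120, 3.3, 0.65), (240, 3.6, 0.66)],
    "Shellsort": [(60, 1.26, 0.18), (120, 1.39, 0.26), (240, 1.42, 0.29)],
    "SSSP": [(60, 1.15, 0.10), (120, 1.63, 0.37), (240, 2.0, 0.48)],
}

if __name__ == "__main__":
    for app, rows in REPORTED.items():
        for threads, speedup, saving in rows:
            ratio = implied_power_ratio(speedup, saving)
            print(f"{app:10s} {threads:3d} threads: power ratio ~ {ratio:.2f}")
```

Under this reading, the implied average power rises by roughly 1% to 22% across these configurations while runtime shrinks by a much larger factor, which is consistent with energy tracking runtime so closely.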
C. Native Execution vs. Offloaded Execution
The Intel Xeon Phi coprocessor offers two execution modes: native mode and offload mode. Native execution occurs when an application runs entirely on the coprocessor. Building a native application is a fast way to get existing software running with minimal code changes. The offload mode is a heterogeneous programming model, in which the programmer designates specific code sections (using simple pragmas/directives) to run on the Xeon Phi coprocessor. It is important that we investigate the intrinsic energy differences between executing code on the Xeon Phi in its native environment, versus offloading computation from the host CPU. 
1) SSSP:
The results of SSSP show that the offloaded version is consistently less energy-efficient than the native code across all thread counts. This is because the runtime of the offloaded version is always slightly longer for each number of threads. It should be noted that during native execution, the host CPU does no effective computation. Unless some scheduler is used to take advantage of the CPU after a native Xeon Phi job is launched, the host CPU will be wasting energy waiting for the Xeon Phi to return data. Fig. 14 shows that although the energy efficiency is slightly better for native execution, the energy difference becomes smaller as the number of threads increases. Intuitively, using the Xeon Phi as a coprocessor and offloading the computation with 120 or 240 threads implies energy savings, assuming the host CPU is utilized in the meantime. The performance and energy difference between native execution and offloaded execution for SSSP is around 25% (as shown in Fig. 13) for all the thread counts under study except 120, for which the difference is 5% for performance and 9% for energy.
2) Offloaded Shellsort: Shellsort's offloaded execution reveals a very different energy outcome from SSSP's offloaded execution. In the previous example, SSSP is able to compute its distances at little performance cost. The difference in runtimes for SSSP generally lies around 25%, even falling as low as 5% when 120 threads are used. For Shellsort, the difference in runtimes for each thread count lies between 65% and 75% (as shown in Fig. 15). This considerable loss in performance causes the large difference in energy consumption between the two versions of Shellsort. The native execution of Shellsort is consistently 3-4 times as energy efficient and 3-4 times as fast as the offloaded version.
These results clearly show the great benefit (for both performance improvement and energy savings) of running Xeon Phi codes in the native mode. Generally speaking, a code that does not perform extensive I/O operations, requires a modest memory footprint, and scales well should be executed in the native mode instead of the offload mode.
D. Co-Running Programs
The Xeon Phi coprocessor contains 60 physical cores and is capable of high computation density. It is worth exploring the viability of co-running jobs on the Xeon Phi. In this subsection, we present the experimental results (Figs. 17-20) of co-running workloads that complement each other on the Xeon Phi.
1) Barnes-Hut/Fibonacci: Co-running the Barnes-Hut and Fibonacci calculations natively on the Xeon Phi shows that the processor can deliver generous energy savings in this manner. These codes co-run well because Barnes-Hut is a very balanced workload and benefits from using more threads, while Fibonacci actually declines in performance if more than a certain number of threads is used. This allows us to give as many threads as possible to Barnes-Hut while leaving a small thread pool to execute Fibonacci. There is still some performance cost when natively co-running these programs, likely caused by resource sharing and memory contention, but the resultant energy saving is much greater.
2) SSSP/Fibonacci: Co-running SSSP and Fibonacci in the native mode shows results similar to co-running Barnes-Hut and Fibonacci. Fibonacci is an example of a workload that performs well when co-running with other codes that have a high degree of parallelism. As such, SSSP is still a good candidate to co-run with Fibonacci 46 despite its runtime and energy not scaling between 120 and 240 threads. Therefore, it makes sense to run SSSP and Fibonacci cooperatively with the same thread distribution used for Barnes-Hut/Fibonacci. Co-running codes on the Xeon Phi is analogous to space-sharing on a GPU, where different execution kernels co-run on the GPU using independent hardware blocks. Assuming memory contention is low, each of these co-running codes will complete with little performance cost. This is what we see when co-running Fibonacci with the two other workloads using the optimum thread distribution (235/5).
V. CONCLUSION
This paper studies the power and energy characteristics of four parallel programs (Barnes-Hut, Shellsort, SSSP, and Fibonacci) running on the Intel Xeon Phi coprocessor. These four applications have different workloads, and we have shown their detailed performance, power and energy characterization results. A number of conclusions can be drawn from these results: 1) The power trace generated from the built-in power sensors of Xeon Phi can accurately capture the run-time program behavior; 2) Xeon Phi threads consume considerable power even when they are sitting idle, which indicates that we should make sure each thread has sufficient work to do once created; 3) Altering the number of executing threads can largely affect both performance and energy, and performance improvement tends to result in energy savings as well; 4) Running code in native mode yields better performance and consumes less energy compared to the widely used offload mode, which indicates that all codes that are suitable for native execution should be run in the native mode; and 5) Co-running programs with diverse workloads has the potential to conserve energy with negligible performance degradation.
