Per-call Energy Saving Strategies in All-to-all
Communications∗
Vaibhav Sundriyal and Masha Sosonkina
Department of Electrical and Computer Engineering
Ames Laboratory
Iowa State University
Ames, IA 50011 USA
{vaibhavs,masha}@scl.ameslab.gov
Abstract. With the increase in the peak performance of modern computing platforms, their energy consumption grows as well, which may lead to overwhelming operating costs and failure rates. Techniques such as Dynamic Voltage and Frequency Scaling (called DVFS) and CPU Clock Modulation (called throttling) are often used to reduce the power consumption of the compute nodes. However, these techniques should be used judiciously during the application execution to avoid significant performance losses. In this work, two implementations of the all-to-all collective operation are studied as to their augmentation with energy saving strategies on a per-call basis. Experiments were performed on the OSU MPI benchmark as well as on the NAS and CPMD application benchmarks, in which power consumption was reduced by up to 10% and 15.7%, respectively, with little performance degradation.
Keywords: Collective Communications, MPI, DVFS, CPU Throttling.
1 Introduction
Power consumption is rapidly becoming one of the critical design constraints in modern high-end computing systems. While the focus of the high-performance computing (HPC) community has been to maximize performance, system operating costs and failure rates can reach prohibitive levels.
The Message Passing Interface¹ has become a de facto standard for the design
of parallel applications. It defines both point-to-point and collective communi-
cation primitives widely used in parallel applications. This work examines the
nature of all-to-all communications because they are among the most intensive
∗This work was supported in part by Iowa State University under the contract
DE-AC02-07CH11358 with the U.S. Department of Energy, by the Director, Office of
Science, Division of Mathematical, Information, and Computational Sciences of the
U.S. Department of Energy under contract number DE-AC02-05CH11231, and by the
National Science Foundation grants NSF/OCI – 0749156, 0941434, 1047772.
¹ MPI Forum: http://www.mpi-forum.org
and time-consuming collective operations while being widespread in parallel applications. By definition, a collective operation requires the participation of all the processes in a given communicator. Hence, such operations incur a significant network phase, during which there exist excellent opportunities for applying energy saving techniques such as DVFS and CPU throttling.
The all-to-all operation is studied here on a per-call (fine-grain) basis, as opposed to a “black-box” approach that treats the communication phase as an indivisible operation contributing to the parallel overhead. In this work, the energy saving strategies are incorporated within the existing all-to-all algorithms.
CPU Throttling and DVFS in Intel Architectures. The current generation of Intel processors provides various P-states for DVFS and T-states for throttling. In particular, the Intel “Core” microarchitecture provides four P-states and eight T-states, from T0 to T7, where state Tj refers to introducing j idle cycles per eight cycles of CPU execution. The delay of switching from one P-state to another can depend on the current and desired P-states and is discussed in [2]. The user may write a specific value to Model Specific Registers (MSRs) to change the P- and T-states of the system.
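For illustration, the following C sketch sets a T-state by writing the IA32_CLOCK_MODULATION register (0x19A) through the Linux msr driver; the helper name and error handling are ours, and the duty-cycle encoding should be verified against the Intel Software Developer's Manual for the target processor. P-states can be requested analogously through IA32_PERF_CTL (0x199).

```c
/* A minimal sketch, assuming the Linux "msr" driver is loaded and the
 * caller may open /dev/cpu/<cpu>/msr. In IA32_CLOCK_MODULATION (0x19A),
 * bit 4 enables on-demand clock modulation and bits 3:1 hold the duty
 * cycle in 1/8 steps on this microarchitecture (an assumption; check
 * the Intel manuals for the exact layout). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_CLOCK_MODULATION 0x19A

/* Enter state T<idle>, i.e., <idle> idle cycles out of every eight;
 * idle == 0 disables modulation (state T0). Returns 0 on success. */
static int set_tstate(int cpu, unsigned idle)
{
    char path[64];
    uint64_t val = 0;

    if (idle > 7)
        return -1;
    if (idle > 0)                       /* duty cycle = (8 - idle)/8 */
        val = (1u << 4) | ((uint64_t)(8 - idle) << 1);

    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int ok = pwrite(fd, &val, sizeof val, MSR_IA32_CLOCK_MODULATION)
             == (ssize_t)sizeof val;
    close(fd);
    return ok ? 0 : -1;
}
```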
Infiniband has become one of the most popular interconnect standards, marking its presence in more than 43% of the systems in the TOP 500² list. Several network protocols are offloaded to the Host Channel Adapters (HCAs) in an Infiniband³ network. Here, the MVAPICH⁴ implementation of MPI, which is designed for Infiniband networks, is considered. MVAPICH2 uses the “polling” communication mode by default since a lower communication overhead is incurred with polling, in which an MPI process constantly samples for the arrival of a new message, than with “blocking”, which causes the CPU to wait for an incoming message.
1.1 Effect of CPU Throttling on Communication
Since point-to-point communication operations underlie collectives, it is reasonable to analyze the CPU throttling effects on them first. Fig. 1(a) shows the point-to-point internode communication times for the communicating processes at T-states T0 and T5. Similarly, Fig. 1(b) depicts the change in intranode communication time for the states T0 and T1. It can be observed that the effect of throttling on internode communication is minimal. In fact, the average performance loss was just 5% at state T5 across various message sizes. However, introducing just one idle cycle per eight cycles degrades the intranode communication considerably (by about 25%). This is expected since intranode communication uses more CPU cycles for a message transfer, whereas in internode transfers RDMA offloads a large part of the communication processing to the NICs [13].
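Such measurements can be reproduced with an ordinary ping-pong loop; the sketch below is illustrative only (it is not the authors' benchmark) and assumes the desired T-state has already been pinned, e.g., with the MSR helper above.

```c
/* pingpong.c -- a minimal point-to-point timing loop of the kind behind
 * Fig. 1; run with 2 ranks on the same node (intranode) or on different
 * nodes (internode). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 100, size = 1 << 20;     /* 1 MB messages */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)   /* round trip covers two transfers per repetition */
        printf("one-way time: %g us\n",
               (MPI_Wtime() - t0) * 1e6 / (2 * reps));
    MPI_Finalize();
    free(buf);
    return 0;
}
```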
The difference between the intra- and inter-node message transfer types with respect to CPU throttling becomes the basis for the energy saving strategy proposed in this work.
² http://www.top500.org/
³ http://www.infinibandta.org/
⁴ http://mvapich.cse.ohio-state.edu/
The appropriate T-state is selected depending on the communication type of all the cores in a socket. The Intel Xeon processor used in this work supports DVFS and throttling only at the socket level rather than at the core level of granularity. Hence, the lowest T-state T0 is chosen for all the cores when at least one core on a socket is in intranode communication. Conversely, a higher throttling state T5 is selected when all the cores on a socket perform internode communication. In the experiments, throttling higher than T5 resulted in a significant performance loss, which is not desirable since the aim is to minimize the energy consumption without sacrificing performance. However, if all the socket cores are idle during the collective operation, then they can be throttled at the highest state T7. To summarize,
◦ All cores communicate internode → T5;
◦ At least one core communicates intranode → T0;
◦ None of the cores communicate → T7.
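This policy is easy to encode; a minimal sketch with hypothetical names follows (mixed internode/idle sockets are treated as internode here, a simplification of the three rules above).

```c
/* Hypothetical per-socket T-state policy from the rules above. */
enum comm_type { COMM_NONE, COMM_INTERNODE, COMM_INTRANODE };

/* cores[] holds the communication type of each core on one socket. */
static unsigned pick_tstate(const enum comm_type *cores, int ncores)
{
    int idle = 0;
    for (int i = 0; i < ncores; i++) {
        if (cores[i] == COMM_INTRANODE)
            return 0;               /* any intranode traffic -> T0 */
        if (cores[i] == COMM_NONE)
            idle++;
    }
    return (idle == ncores) ? 7     /* all idle              -> T7 */
                            : 5;    /* all internode         -> T5 */
}
```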
[Figure 1: two panels plotting communication time (in microseconds) against message size (in bytes, up to 5 MB); panel (a) compares T-states T0 and T5, panel (b) compares T-states T0 and T1.]
Fig. 1. Point-to-point communication with throttling (a) internode (b) intranode.
The rest of the paper is organized as follows. Section 2 describes the proposed
energy savings in the all-to-all operation. Section 3 shows experimental results
while Sections 4 and 5 provide related work and conclusions, respectively.
2 All-to-all Energy Aware Algorithm
MVAPICH2 implementations of all-to-all are considered in this work. They are based on three algorithms: 1) Bruck Index [14], used for small (less than 8 KB) messages with at least eight participating processes; 2) Pairwise Exchange, used for large messages when the number of processes is a power of two; 3) Send To rank i+k and Receive From rank i−k, used for large messages and all other process counts. These algorithms are referred to further in the text as BIA, PEA, and STRF, respectively.
Bruck Index first does a local copy with an upward shift of the data blocks from the input to the output buffer. Specifically, a process with the rank i rotates its data up by i blocks. The communication proceeds such that, for all the p communicating processes, in each communication step k (0 ≤ k < ⌈log₂ p⌉), process i, (i = 0, . . . , p − 1), sends to (i + 2^k) mod p (with wrap-around) all those data blocks whose kth bit is 1 and receives from (i − 2^k) mod p. The incoming data is stored into the blocks whose kth bit is 1. Finally, the local data blocks are shifted downward to place them in the right order. Fig. 2 shows a cluster having N = 3 nodes with c = 8 cores each, placed on two sockets, and the total number of processes p = 8N. The rank placement is performed in a block manner using consecutive core ordering. Note that, until the step k for which 2^k < c, the communication is still intranode for any socket in the cluster, considering all the cores on a socket collectively. However, after that step, the communication becomes purely internode for all the participating cores. Thus, from this step on, the throttling level T5 may be applied to all the cores without incurring a significant performance loss.
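The step structure, and the point at which throttling becomes safe, can be sketched as follows (the helpers are hypothetical, e.g., built on the MSR helper above; the block exchanges themselves are elided):

```c
/* Sketch of the BIA communication loop with the proposed throttling point. */
void scale_to_min_frequency(void);      /* Stage 1: lowest P-state   */
void restore_max_frequency(void);       /* Stage 4: highest P-state  */
void set_tstate_all_cores(unsigned t);  /* throttle the whole socket */

void bia_alltoall_energy(int rank, int p, int c)
{
    int throttled = 0;

    scale_to_min_frequency();
    for (int k = 0; (1 << k) < p; k++) {
        if (!throttled && (1 << k) >= c) {
            set_tstate_all_cores(5);    /* steps are now purely internode */
            throttled = 1;
        }
        int dst = (rank + (1 << k)) % p;
        int src = (rank - (1 << k) + p) % p;
        /* ... send blocks whose k-th bit is 1 to dst, receive from src ... */
        (void)dst; (void)src;
    }
    if (throttled)
        set_tstate_all_cores(0);        /* back to T0 */
    restore_max_frequency();
}
```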
[Figure 2 diagram: in each of steps 0 through 3, the ranks 0, . . . , 8N−1 are laid out in blocks of eight per node (two sockets of four cores), with progressively more of the per-step transfers crossing node boundaries.]
Fig. 2. First four communication steps in the Bruck Index all-to-all algorithm on three
nodes with two sockets (shown as rectangles) and eight cores (ovals) each. Internode
communications are shown as straight slanted lines across the node boundaries.
“Send-To Receive-From” and Pairwise Exchange. For the block placement of ranks, in each step k (1 ≤ k < p) of STRF, a process with rank i sends data to (i + k) mod p and receives from (i − k + p) mod p. Therefore, for the initial and the final c − 1 steps, the communications are not purely internode. The PEA uses the exclusive-or operation to determine the ranks of the processes for data exchange. It is similar to the BIA in terms of the communication phase since, after the step k = c, the communication remains internode until the end.
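For reference, the per-step partner computations of these two algorithms reduce to one line each; a small sketch:

```c
/* Communication partners per step for the two large-message algorithms
 * (p is the number of processes; for PEA, p is a power of two). */
static inline int pea_partner(int rank, int k)     { return rank ^ k; }
static inline int strf_dst(int rank, int k, int p) { return (rank + k) % p; }
static inline int strf_src(int rank, int k, int p) { return (rank - k + p) % p; }
```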
Energy Saving Strategy. Because all three algorithms exhibit purely internode communications from a certain step k on, the following energy saving strategy may be applied in stages to each of them (a code sketch follows the list).
Stage 1 At the start of all-to-all, scale down the frequency of all the cores involved in the communication to the minimum.
Stage 2 During the communication phase, throttle all the cores to the state T5 in step k if
◦ BIA: 2^k ≥ c;
◦ STRF: c ≤ k < p − c;
◦ PEA: k > c.
Stage 3 For STRF: throttle back to state T0 at the communication step k = p − c.
Stage 4 At the end of all-to-all, throttle all the cores to state T0 (if needed) and restore their operating frequency to the maximum.
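A minimal sketch of Stages 1 through 4 wrapped around the STRF loop is shown below (the same hypothetical helpers as in the BIA sketch are assumed; the message exchanges are elided):

```c
/* Sketch of the staged strategy around the STRF communication loop. */
void strf_alltoall_energy(int rank, int p, int c)
{
    scale_to_min_frequency();              /* Stage 1 */
    for (int k = 1; k < p; k++) {
        if (k == c)
            set_tstate_all_cores(5);       /* Stage 2: purely internode  */
        if (k == p - c)
            set_tstate_all_cores(0);       /* Stage 3: intranode resumes */
        int dst = (rank + k) % p;
        int src = (rank - k + p) % p;
        /* ... send the dst block, receive the src block ... */
        (void)dst; (void)src;
    }
    set_tstate_all_cores(0);               /* Stage 4 (harmless if at T0) */
    restore_max_frequency();
}
```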
Rank Placement Consideration. MVAPICH2 provides two formats of rank placement on multicores, namely block and cyclic. In the block strategy, ranks are placed such that any node j (j = 0, 1, . . . , N − 1) contains the ranks from c × j to c × (j + 1) − 1. In the cyclic strategy, a rank i belongs to node j if i mod N equals j. The block rank placement calls for only two DVFS and throttling switches in the proposed energy saving strategy, and thus minimizes the switching overhead. In the cyclic rank placement, however, after a fixed number of steps the communication would oscillate between intra- and internode, requiring a throttling switch at every such step. Therefore, the block rank placement has been considered for the energy savings application; the two mappings are sketched below.
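For concreteness, the node that owns a given rank under each placement reduces to:

```c
/* Node owning rank i under the two placements (N nodes, c cores each). */
static inline int node_of_block(int i, int c)  { return i / c; }
static inline int node_of_cyclic(int i, int N) { return i % N; }
```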
2.1 Power Consumption Estimates
Let a multicore compute node have frequencies fi, (i = 1, . . . , m), such that f1 < . . . < fm, and throttling states Tj, (j = 0, 1, . . . , n). When all the c cores of the node execute an application at frequency fi, each core consumes the dynamic power Pi proportional to fi³. Let Pij be the dynamic power consumed by the entire node at the frequency fi and throttling state Tj, Ps be the total static power consumption, and Pd be the dynamic power consumption of the compute node components, such as memory, disk, and NIC, which are different from the processor. Then, the power consumption with no idle cycles (at T0) may be assumed to be Pi0 = c × Pi + Ps + Pd, so, at Tj, it is
\[
P_{ij} = \frac{j\,(P_s + P_d) + (n - j)\,P_{i0}}{n}. \tag{1}
\]
The Pi0 expression serves just to give an idea of the effect of frequency scaling on power consumption. It may vary with the application characteristics since each application may have a different power consumption pattern depending on its utilization of the compute node components.
Bruck Index Algorithm. Let tintra and tinter be the times to transfer a single byte of data in an intranode and an internode message transfer, respectively. Let Cnet(x) and Cmem(y) be the parameters accounting for the effects of network and memory contention when x and y processes within a node are involved in internode and intranode message transfers, respectively. Also, let Odvfs and Othrottle be the overheads associated with a frequency scaling and a throttling switch. Then, for a cluster having N nodes with c cores each, an all-to-all exchange of M bytes in total undergoes communication in two stages. In the first stage, message transfers are both internode and intranode in nature. The total communication time for the first stage can be written as
\[
T_{B1} = O_{dvfs} + \sum_{i=1}^{t} \max\left( t_{intra}\, C_{mem}(c - 2^{i-1})\, \frac{M}{2},\; t_{inter}\, C_{net}(2^{i-1})\, \frac{M}{2} \right), \tag{2}
\]
where t = ⌈log₂ c⌉. In the second stage of the all-to-all operation, all the message transfers are internode in nature, and the time spent in this phase can be expressed as
\[
T_{B2} = 2 \times O_{throttle} + (r - t)\left( t_{inter} + O_5\!\left(\tfrac{M}{2}\right) \right) C_{net}(c)\, \frac{M}{2} + O_{dvfs}, \tag{3}
\]
where r = ⌈log₂ p⌉ and Oj(M/2) is the increase in the internode transfer time for a message of size M/2 due to throttling state Tj.
The power consumption during the first stage of the communication is P10 across a node, since the execution is at the minimum frequency and no throttling is applied, whereas in the second stage it is P15, as throttling level T5 is applied. Therefore, the average power consumption for the proposed algorithm can be expressed as
\[
\bar{P}_B = \frac{P_{10}\, T_{B1} + P_{15}\, T_{B2}}{T_{B1} + T_{B2}}. \tag{4}
\]
As the number of nodes in a cluster increases, the communication time for the second stage dominates, and hence the average power consumption approaches P15.
STRF. For this algorithm, communication takes place in three stages: in the first and third stages, both intranode and internode message transfers take place, while in the second stage the message transfers are purely internode. The communication time for the first and third stages can be expressed as
\[
T_{S1} = T_{S3} = O_{dvfs} + \sum_{i=1}^{c-1} \max\left( t_{intra}\, C_{mem}(c - i)\, \frac{M}{p},\; t_{inter}\, C_{net}(i)\, \frac{M}{p} \right). \tag{5}
\]
The communication time for stage 2 can be expressed as
\[
T_{S2} = 2 \times O_{throttle} + (p + 1 - 2c)\left( t_{inter} + O_5\!\left(\tfrac{M}{p}\right) \right) C_{net}(c)\, \frac{M}{p}. \tag{6}
\]
So, the average power consumption for the proposed energy saving strategy when the STRF algorithm is used for the all-to-all operation is
\[
\bar{P}_S = \frac{2\, P_{10}\, T_{S1} + P_{15}\, T_{S2}}{2\, T_{S1} + T_{S2}}. \tag{7}
\]
Similar to the BIA, the execution time of the second stage dominates the total execution time of the all-to-all operation in the STRF algorithm, and therefore the average power consumption is close to P15.
3 Experimental Results
The computing platform used comprises ten Infiniband-connected compute nodes, each of which has 16 GB of main memory and two Intel Xeon E5450 quad-core processors arranged as two sockets, with the operating frequency ranging from 2.0 GHz to 3.0 GHz and eight levels of throttling from T0 to T7. For measuring the node power and energy consumption, a Wattsup⁵ power meter is used with a sampling rate of 1 Hz. Due to such a low measuring resolution, a large number of all-to-all operations have to be performed.
⁵ https://www.wattsupmeters.com
OSU MPI Benchmarks. This set of benchmarks⁶ is used here to determine the change in execution time and power consumption of “stand-alone” all-to-all operations. From Fig. 3(left), it can be observed that the all-to-all execution time incurs a very low performance penalty when the proposed energy savings are used. The average performance loss observed over various message sizes was just 0.97% relative to the Full power case. While somewhat higher than in the DVFS only case, which was 0.5%, it is quite acceptable taking into consideration the large reductions in the power consumption achieved with the Proposed strategy (Fig. 3(right)). Note, however, that in all the cases the power consumption increases with the message size since the memory dynamic power consumption grows because of message copying [13]. Similar power reductions have been obtained for the all-to-all-vector operation.
⁶ OSU MPI Benchmarks: http://mvapich.cse.ohio-state.edu
The static power consumption Ps of a node was around 150 watts, as measured by the Wattsup power meter. The most significant contribution to the dynamic power consumption Pd of the node components other than the processor comes from the memory. To measure the memory dynamic power consumption, an extrapolation method discussed in [7] was used since each node has four memory modules of 4 GB each. The memory dynamic power consumption Pd was thus determined to be around 20 watts. When the all-to-all message exchange is for 1 MB, P15 can be calculated from (1) as
\[
P_{15} = \frac{5 \times (150 + 20) + 3 \times 222}{8} = 189.5. \tag{8}
\]
It can be observed from Fig. 3(right) that the value of P15 calculated in (8) is close to the average power consumption measured for an all-to-all message exchange of 1 MB.
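As a sanity check, (8) is just (1) evaluated at j = 5 and n = 8; a small sketch (the 222-watt figure is our reading of the node power at the minimum frequency with no throttling):

```c
/* Numeric check of Eq. (1) with the values used in Eq. (8):
 * Ps + Pd = 150 + 20 = 170 W, Pi0 = 222 W (assumed node power at the
 * minimum frequency, state T0), j = 5, n = 8. */
#include <stdio.h>

static double p_ij(int j, int n, double ps_plus_pd, double p_i0)
{
    return (j * ps_plus_pd + (n - j) * p_i0) / n;
}

int main(void)
{
    printf("P15 = %.1f W\n", p_ij(5, 8, 170.0, 222.0));  /* prints 189.5 */
    return 0;
}
```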
[Figure 3: two panels against message size (in bytes, up to 1.2 MB) for the Full power, DVFS only, and Proposed cases; the left panel plots execution time (in microseconds), the right panel plots node power (in watts), with annotations marking power reductions of 18% and 29.2%.]
Fig. 3. The all-to-all execution time on 80 processes (left) and the power consumption across a compute node (right) for the three cases: executing at the highest frequency and no throttling (Full power); only frequency scaling without throttling (DVFS only); and using the proposed energy saving strategies (Proposed).
CPMD and NAS Application Benchmarks. CPMD (Car-Parrinello Molecular Dynamics)⁷ is an ab initio quantum mechanical molecular dynamics technique using pseudopotentials and a plane wave basis set. Eleven input sets are used here. MPI_Alltoall is the key collective operation in CPMD. Since most messages have sizes in the range of 128 B to 8 KB, the BIA is used. From the NAS benchmarks [1], the FT and IS Class C benchmarks are chosen because they use the all-to-all operation. Fig. 4 shows the execution time and energy consumption of the CPMD and NAS benchmarks on 80 and 64 processes, respectively, normalized to the Full power case. For CPMD with the Proposed strategies, the performance loss ranges from 0.4% to 4.3%, averaging 2.78%, leading to energy savings in the range of 9.8% to 15.7% (13.4% on average). For NAS, the performance loss ranges from 1.1% to 4.5%, and the average energy savings are about 10%. Hence, the benchmark applications tested suffer little performance loss and achieve significant energy savings.
⁷ CPMD Consortium: http://www.cpmd.org
[Figure 4: two bar charts over the benchmark inputs (wat-32-inp-1, wat-32-inp-2, c120-inp-1, c120-inp-2, Si512-inp-1, Vdb-inp-1, lanczos.inp, davidson.inp, annealing.inp, Path-int-inp-1, Tddft-inp-1, NAS FT C, NAS IS C), plotting normalized execution time (top) and normalized energy consumption (bottom) for the DVFS only and Proposed cases.]
Fig. 4. Execution time (top) and energy consumption (bottom) of 11 CPMD inputs on 80 processes and of NAS benchmarks on 64 processes for the DVFS only and Proposed cases, normalized to the Full power case.
4 Related Work
In [13], the authors study the power efficiency and communication performance of RDMA over TCP/IP and conclude that RDMA performs better in both respects. The energy efficiency delivered by modern interconnects in high performance clusters is discussed in [4]. The characterization of communication phases to obtain energy savings by using DVFS is done in [3, 5, 6]. Performance counter based algorithms for determining stall cycles to save energy are used in [8, 10, 11]. In [9], the authors have developed a tool that estimates the power consumption characteristics of a parallel application in terms of various CPU components. In [12], algorithms to save energy in collectives, such as MPI_Alltoall and MPI_Bcast, are proposed. However, they differ significantly from the approach presented in this paper. Specifically, [12] assumes that throttling has a negative effect on internode communication. Thus, the authors of [12] redesign the all-to-all operation such that a set of sockets does not take part in the communication at some point in time and thus may be throttled. But as the number of cores within a node keeps increasing, this approach of forcing sockets to remain idle during communication can introduce significant performance overheads. The power savings achieved in [12] are equivalent to executing the two sockets in a node at the minimum frequency and throttling state T4, whereas the approach proposed here achieves better power savings by keeping both sockets at the minimum frequency and throttling to the higher state T5. A detailed experimental comparison of the two algorithms is left as future work.
5 Conclusions
Energy-saving strategies have been proposed for the all-to-all operation and implemented in the MPI_Alltoall collective of MVAPICH2 without modifying the standard algorithms used to perform this operation. The sensitivity of inter- and intranode message transfers to CPU throttling has been assessed, and it was observed that throttling has almost no negative effect on the performance of internode communications. Thus, both DVFS and CPU throttling were applied in the appropriate communication steps within the three different all-to-all algorithms used by MVAPICH2. The experiments demonstrate that the proposed strategies can deliver up to 15.7% energy savings without introducing significant performance overhead for the CPMD and NAS application benchmarks, which reflects their potential beneficial effect on scientific applications in general.
Similar energy saving strategies may be extended to other collectives, including reduction operations. Furthermore, as the number of cores within a node keeps increasing, the opportunity of applying throttling in intranode communication must also be explored. The point-to-point operations should also be studied for energy savings since many modern architectures may support socket-level DVFS and CPU throttling.
References
1. D. H. Bailey et al. The NAS parallel benchmarks: summary and preliminary results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing '91, pages 158–165, New York, NY, USA, 1991. ACM.
2. J. Park et al. Accurate modeling and calculation of delay and energy overheads
of dynamic voltage scaling in modern high-performance microprocessors. In Low-
Power Electronics and Design (ISLPED), 2010 ACM/IEEE International Sympo-
sium on, pages 419–424, 2010.
3. M. Y. Lim et al. Adaptive, transparent frequency and voltage scaling of communication phases in MPI programs. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC '06, New York, NY, USA, 2006. ACM.
4. R. Zamani et al. A feasibility analysis of power-awareness and energy minimization
in modern interconnects for high-performance computing. In Proceedings of the
2007 IEEE International Conference on Cluster Computing, CLUSTER ’07, pages
118–128, Washington, DC, USA, 2007. IEEE Computer Society.
5. V. W. Freeh et al. Exploring the energy-time tradeoff in MPI programs on a power-scalable cluster. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01, IPDPS '05, pages 4.1–, Washington, DC, USA, 2005. IEEE Computer Society.
6. V. W. Freeh et al. Using multiple energy gears in MPI programs on a power-scalable cluster. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '05, pages 164–173, New York, NY, USA, 2005. ACM.
7. Xizhou Feng, Rong Ge, and Kirk W. Cameron. Power and energy profiling of
scientific applications on distributed systems. In Proceedings of the 19th IEEE
International Parallel and Distributed Processing Symposium (IPDPS’05) - Papers
- Volume 01, IPDPS ’05, pages 34–, Washington, DC, USA, 2005. IEEE Computer
Society.
8. R. Ge, X. Feng, W. Feng, and K.W. Cameron. CPU MISER: A performance-
directed, run-time system for power-aware clusters. In Parallel Processing, 2007.
ICPP 2007. International Conference on, page 18, 2007.
9. R. Ge, X. Feng, S. Song, H. C. Chang, D. Li, and K.W. Cameron. PowerPack: En-
ergy profiling and analysis of high-performance systems and applications. Parallel
and Distributed Systems, IEEE Transactions on, 21(5):658–671, May 2010.
10. C. H. Hsu and W. Feng. A power-aware run-time system for high-performance computing. In Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, page 1, Nov. 2005.
11. S. Huang and W. Feng. Energy-efficient cluster computing via accurate workload
characterization. In Cluster Computing and the Grid, 2009. CCGRID ’09. 9th
IEEE/ACM International Symposium on, pages 68–75, May 2009.
12. K. Kandalla, E. P. Mancini, S. Sur, and D. K. Panda. Designing power-aware collective communication algorithms for Infiniband clusters. In Parallel Processing (ICPP), 2010 39th International Conference on, pages 218–227, Sept. 2010.
13. J. Liu, D. Poff, and B. Abali. Evaluating high performance communication: a power
perspective. In Proceedings of the 23rd international conference on Supercomputing,
ICS ’09, pages 326–337, New York, NY, USA, 2009. ACM.
14. R. Thakur and R. Rabenseifner. Optimization of collective communication operations in MPICH. International Journal of High Performance Computing Applications, 19:49–66, 2005.
