Energy Wall for Exascale Supercomputing by Wang, Zhiyuan et al.
Computing and Informatics, Vol. 35, 2016, 941–962
ENERGY WALL FOR EXASCALE SUPERCOMPUTING
Zhiyuan Wang, Yuhua Tang, Juan Chen
State Key Laboratory of High Performance Computing
National University of Defense Technology
Changsha, Hunan, 410073, China
e-mail: yhtang62@163.com, juanchen@nudt.edu.cn
Jingling Xue
School of Computer Science and Engineering
University of New South Wales, Australia
Yun Zhou, Yong Dong
School of Computer
National University of Defense Technology
Changsha, Hunan, 410073, China
Abstract. “Sustainable development” is one of the major issues in the 21st cen-
tury. Thus the notions of green computing, green development and so on show up
one after another. As the large-scale parallel computing systems develop rapidly,
energy consumption of such systems is becoming very huge, especially system per-
formance reaches Petascale (1015 Flops) or even Exascale (1018 Flops). The huge
energy consumption increases the system temperature, which seriously undermines
the stability and reliability, and limits the growth of system size. The effects of
energy consumption on scalability become a growing concern. Against the back-
ground, this paper proposes the concept of “Energy Wall” to highlight the signifi-
cance of achieving scalable performance in peta/exascale supercomputing by taking
energy consumption into account. We quantify the effect of energy consumption on
scalability by building the energy-efficiency speedup model, which integrates com-
puting performance and system energy. We define the energy wall quantitatively,
and provide the theorem on the existence of the energy wall, and categorize the
large-scale parallel computers according to the energy consumption. In the context
of several representative types of HPC applications, we analyze and extrapolate the
existence of the energy wall under three kinds of topologies: 3D-Torus, binary
n-cube and Fat tree. The analysis provides insights on how to mitigate the energy wall
effect in system design and through hardware/software optimization in peta/exascale
supercomputing.
Keywords: Energy consumption, scalability, exascale computing, energy-efficiency
speedup, energy wall
Mathematics Subject Classification 2010: 68M01
1 INTRODUCTION
Scalable parallel computing has become the common approach for
achieving high performance, and system sizes are scaling up rapidly to meet the
growing demands. Despite great strides made by supercomputing systems, much
of scientific computation’s potential remains untapped, “because many scientific
challenges are far too enormous and complex for the computational resources at
hand” [1]. Planned exascale supercomputers (capable of an exaflop, i.e. $10^3$ petaflops
or $10^{18}$ floating point operations per second) in this decade promise to overcome
these challenges by a revolution in computing at a greatly accelerated pace [2].
However, exascale supercomputing faces a number of challenges, including
technology, architecture, energy, reliability, programmability and usability. In [3], we
have discussed the reliability challenge for exascale supercomputing. This treatise
is aimed at addressing the energy challenge in building scalable supercomputing
systems, particularly those at the peta/exascale levels.
According to the top ten supercomputers in the Top500 list [4] published in November
2015, the highest system power is the 17 808 kW of Tianhe-2, whose yearly power
bill will be more than 25 million US dollars. This shows that current high performance
systems consume huge amounts of energy, and energy consumption may continue to climb
as performance increases, which is unsustainable and brings serious
challenges for stability, cooling, and so on.
“Sustainable development” is one of the major issues in the 21st century. Thus
the notions of green computing, green development and so on show up one after
another. As the large-scale parallel computing systems develop rapidly, the effect of
energy consumption on system scalability becomes a growing concern.
Against this background, we quantify for the first time the concept of the “Energy
Wall”. We highlight the significance of achieving scalable performance in
peta/exascale supercomputing by taking energy consumption into account.
For a parallel system, we present the energy-efficiency speedup model, define
the energy wall quantitatively, give the existence theorem of the energy wall, and
categorize the system according to its energy consumption as a red scalable system,
a yellow scalable system or a green scalable system. Finally, we take the 3D-Torus,
binary n-cube and Fat tree topologies as examples to demonstrate our “Energy Wall” theory.
2 THE ENERGY WALL
Table 1 provides a list of symbols, their meanings, and where they are defined in
the paper.
Symbol               Meaning                                                      Definition
$P$                  The number of cores                                          Section 2.1
$E^R_P$              The overall energy consumption of the system                 (1)
$E^E_P$              Effective energy consumption of the overall system           Section 2.1
$E_E$                Effective computing energy consumption                       (1)
$E^C_E$              Communication energy consumption of network                  (1)
$E^S_W$              Redundant computing energy for serial part of application    (1)
$E^I_W$              Idling energy consumption for computation                    (1)
$E^C_W$              Idling energy consumption of network                         (1)
$E^W_P$              Extra energy consumption of introducing parallelization      (1)
$S_P$                Traditional speedup                                          (2)
$E_P$                Energy consumption speedup                                   (3)
$E(P)$               Factor of parallel energy consumption                        (4)
$V^E_P$              Energy efficiency                                            (5)
$S^E_P$              Energy-efficiency speedup                                    (6)
$S^E_{Amdahl}$       Amdahl energy-efficiency speedup                             (7)
$S^E_{Gustafson}$    Gustafson energy-efficiency speedup                          (8)
$U^E_P$              Efficiency of parallel energy consumption                    Section 2.2
Table 1. List of main symbols, their meanings, and definitions
Section 2.1 describes the composition of energy consumption of the system.
Section 2.2 introduces a new energy-efficiency speedup model. Section 2.3 quantifies
the energy wall based on our energy-efficiency speedup model.
2.1 Energy Composition
The energy consumption of nodes and that of the interconnection network are the two
major components of the energy consumption of a high performance computing system [5],
and each of them is further divided into static energy consumption and dynamic energy
consumption. Dynamic energy consumption is generated only by components that are
operating. For example, node dynamic energy consumption occurs when the node is
computing, and network dynamic energy consumption occurs when data is transmitted,
while static energy consumption always exists even when nodes or network resources are idle.
Node energy consumption mainly consists of the energy consumption of Effective
computing $E_E$, the energy consumption of redundant computing for the Serial part
of the parallel application $E^S_W$ (the serial part of a parallel application needs only
one process to complete, and the other processes must either wait until the serial part
finishes or compute the serial part simultaneously, which consumes much “Wasted” energy),
and the energy consumption of Idling $E^I_W$. Network energy consumption
mainly comes from network resources (e.g. routers) and consists of the energy consumption
of Communication $E^C_E$ and that of idling $E^C_W$. Shared-memory-based communications
among different cores within a single computation node (CN) do not incur
energy consumption of network resources, and therefore we treat the energy consumed
by such inter-core communications as a portion of the energy consumption of effective
computation $E_E$ on the corresponding node. In conclusion, the dynamic energy
consumption of redundant computing for the serial part $E^S_W$ plus the static energy
consumption, i.e. the idling energy consumption of cores and network resources
$E^I_W + E^C_W$, is the waste when the system is running.
Suppose a parallel system consists of P cores and the number of cores per node
is a constant. A parallel program runs on the system with one process per core,
resulting in a total of P processes.
Figure 1 a) shows the energy consumption of an ideal parallel computing system
which runs without any waste of energy, i.e. all energy consumption is used for
effective operation (computing and communication), as the number of cores increases
from 1 to $P$. $E^E_P$ is the energy consumption of this ideal system, which includes
the energy consumption of computation $E_E$ and the communication energy consumption
of the network $E^C_E$, i.e. $E^E_P = E_E + E^C_E$. $E_E$ is the energy consumption of
the sequential version of the parallel application executed on a single core, which does
not change with the system size.
Figure 1. The increase of energy consumption of parallel system as its size scales up.
a) Energy consumption of an ideal parallel system. b) Energy consumption of
a parallel system in practice.
Figure 1 b) shows the energy consumption of a parallel system in practice, where
$E^R_P$ represents the total energy consumption during the period of Running the
programs, including both dynamic and static energy consumption. Then we obtain the
energy composition of the system as follows:

$$E^R_P = E_E + E^C_E + E^S_W + E^I_W + E^C_W = E_E + E^W_P. \tag{1}$$
As shown in Figure 1 b) and Equation (1), we define $E^W_P = E^C_E + E^S_W + E^I_W + E^C_W$.
Its four components represent the extra energy consumption of
introducing parallelization by increasing the system size. This means that an increase
in performance rarely comes without additional energy consumption. In other words, when
system performance increases, especially as it approaches exascale, the effect of energy
consumption on scalability should be taken into account. Thus, we integrate energy
into the traditional speedup model to measure how the scalability of parallel computing
changes.
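The energy composition described above can be sketched in a few lines of Python. This is only an illustrative bookkeeping aid, not part of the original model: the class name `EnergyBreakdown`, its field names and the numeric values are hypothetical, and all components are assumed to be expressed in the same unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyBreakdown:
    """Energy components of Equation (1), all in the same unit (e.g. joules)."""

    effective_compute: float  # E_E: effective computing energy
    communication: float      # E^C_E: communication energy of the network
    serial_waste: float       # E^S_W: redundant computing for the serial part
    idle_compute: float       # E^I_W: idling energy of cores
    idle_network: float       # E^C_W: idling energy of the network

    @property
    def ideal(self) -> float:
        """E^E_P = E_E + E^C_E, the energy of the ideal system of Figure 1 a)."""
        return self.effective_compute + self.communication

    @property
    def extra(self) -> float:
        """E^W_P, the extra energy introduced by parallelization."""
        return (self.communication + self.serial_waste
                + self.idle_compute + self.idle_network)

    @property
    def total(self) -> float:
        """E^R_P = E_E + E^W_P, the energy of the system in practice."""
        return self.effective_compute + self.extra


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
example = EnergyBreakdown(100.0, 20.0, 15.0, 5.0, 10.0)
print(example.ideal, example.extra, example.total)
```

Keeping the five components separate makes it easy to see which of them are waste ($E^S_W$, $E^I_W$, $E^C_W$) and which are effective ($E_E$, $E^C_E$).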
2.2 Energy-Efficiency Speedup
The increase of computing performance is at the cost of energy consumption. As
usual, the workload of parallel computing is measured by FLoating-point OPera-
tions (FLOP) or Instruction Counts (IC). Without loss of generality, we measure
the system workload by FLOP, and the computing performance is measured by
FLoating-point OPerations per Second (FLOPS).
Given an application G, traditional speedup is the ratio of the computing
performance of the parallel execution to that of the serial execution. Since the FLOP
of both parallel and serial executions are identical, i.e. $FLOP_1 = FLOP_P$, the
traditional speedup is

$$S_P = \frac{FLOPS_P}{FLOPS_1} = \frac{FLOP_P / T_P}{FLOP_1 / T_1} = \frac{T_1}{T_P}, \tag{2}$$

where $T_j$ ($j = 1, P$) is the time for completing workload $FLOP_j$ on $j$ core(s) of the
system, and $FLOPS_j$ is the computing performance on $j$ core(s) of the system. The
two are related by $FLOPS_j = FLOP_j / T_j$.
Similarly, energy consumption speedup is defined as:
Definition 1 (Energy Consumption Speedup $E_P$). Given an application G
running on a $P$-core system, the energy consumption speedup is the ratio of the energy
consumption of parallel execution $E^R_P$ to that of the sequential version of the parallel
application executed on a single core $E^R_1$, i.e.

$$E_P = \frac{E^R_P}{E^R_1}. \tag{3}$$
Energy consumption speedup reflects the increase of energy consumption of
the system as the system size is scaled up. According to Section 2.1, the energy
consumption of effective computing $E_E = E^R_1$. According to Equation (3), energy
consumption speedup can be written as

$$E_P = \frac{E^R_P}{E^R_1} = \frac{E_E + E^W_P}{E_E} = 1 + \frac{E^W_P}{E_E} = 1 + E(P), \tag{4}$$

where $E(P) = E^W_P / E_E$ is called the factor of parallel energy consumption. Since
$E^W_P$ is the extra energy consumption after introducing parallelization to improve
performance, $E(P)$ reveals the relationship between the extra energy consumption for
parallelization and the energy consumption of effective computing.
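As a small worked example of the factor of parallel energy consumption, the sketch below computes $E(P)$ and the energy consumption speedup $E_P = 1 + E(P)$ from two energy totals. The function names and the sample values are assumptions made for illustration.

```python
def parallel_energy_factor(e_effective: float, e_extra: float) -> float:
    """E(P): extra parallelization energy relative to effective computing energy."""
    if e_effective <= 0:
        raise ValueError("effective computing energy must be positive")
    return e_extra / e_effective


def energy_consumption_speedup(e_effective: float, e_extra: float) -> float:
    """E_P = 1 + E(P), i.e. total parallel energy divided by single-core energy."""
    return 1.0 + parallel_energy_factor(e_effective, e_extra)


# Hypothetical run: 100 J of effective computing, 50 J of extra energy.
print(energy_consumption_speedup(100.0, 50.0))  # 150 J / 100 J = 1.5
```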
Next, we give the definition of energy efficiency and then build up the energy-
efficiency speedup model.
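Definition 2 (Energy Efficiency $V^E_P$). Given an application G running on a $P$-core
system, the energy efficiency is the ratio of the computing performance to the energy
consumption of the system, i.e.

$$V^E_P = \frac{FLOPS_P}{E^R_P}. \tag{5}$$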
According to the definition of energy efficiency, the energy-efficiency speedup
model is built up as follows.
Definition 3 (Energy-Efficiency Speedup $S^E_P$). Given an application G running on
a $P$-core system, the energy-efficiency speedup is the ratio of the energy efficiency of
parallel execution $V^E_P$ to that of the sequential version of the parallel application
executed on a single core $V^E_1$, i.e.

$$S^E_P = \frac{V^E_P}{V^E_1} = \frac{FLOPS_P / E^R_P}{FLOPS_1 / E^R_1} = \frac{S_P}{E_P} = \frac{S_P}{1 + E(P)} = S_P \cdot U^E_P, \tag{6}$$

where $U^E_P = \frac{1}{1 + E(P)}$ is called the efficiency of parallel energy consumption,
which reflects how efficiently scalability is gained from a unit of energy. Thus
the energy-efficiency speedup $S^E_P$ is the ratio of traditional speedup to energy
consumption speedup, i.e. it quantifies the performance speedup gained per unit of
energy consumed.
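The ratio in Definition 3 can be checked numerically. The following is a minimal sketch under assumed measurements (the FLOPS and energy figures are invented for illustration): it computes the energy-efficiency speedup from energy efficiencies and confirms that it equals the traditional speedup divided by the energy consumption speedup.

```python
def energy_efficiency(flops: float, energy: float) -> float:
    """Computing performance (FLOPS) delivered per unit of energy consumed."""
    return flops / energy


def energy_efficiency_speedup(flops_1: float, energy_1: float,
                              flops_p: float, energy_p: float) -> float:
    """Ratio of parallel energy efficiency to single-core energy efficiency."""
    return energy_efficiency(flops_p, energy_p) / energy_efficiency(flops_1, energy_1)


# Hypothetical measurements: 8x the performance for 1.6x the energy.
flops_1, energy_1 = 1e9, 100.0
flops_p, energy_p = 8e9, 160.0

s_p = flops_p / flops_1      # traditional speedup
e_p = energy_p / energy_1    # energy consumption speedup
s_e = energy_efficiency_speedup(flops_1, energy_1, flops_p, energy_p)
print(s_p, e_p, s_e)         # s_e matches s_p / e_p up to rounding
```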
Unlike traditional speedup, energy-efficiency speedup $S^E_P$ integrates both the
energy consumption and the computing performance of a parallel system from the
perspective of scalability. The variation of $S^E_P$ is determined by both the application
characteristics and the system architecture. Typically, if the proportion of
computation overhead is larger than that of communication overhead while the
parallel program runs on the system, the application is computation-intensive;
otherwise it is communication-intensive. For example, if there are many
communication operations, e.g. point-to-point communication in communication-intensive
applications, the traditional speedup $S_P$ increases slowly when the system
size is scaled up; even worse, it then decreases and approaches zero if there are
massive collective communications, e.g. one-to-all or all-to-all broadcasting.
Generally, the scalability of computation-intensive applications is better than that of
communication-intensive applications, hence much research has focused on the
optimization of communication, such as overlapping computation and communication.
Traditional speedup $S_P$ varies mainly in three cases based on the types of
applications, i.e. $\lim_{P\to\infty} S_P = \infty$, $\lim_{P\to\infty} S_P = C$, where $C$ is a
positive constant, and $\lim_{P\to\infty} S_P = 0$, which we define as scalable, weak-scalable
and unscalable applications, respectively.
Amdahl's law [6] and Gustafson's law [7] give two representative traditional
speedups. When the system size is scaled up, Amdahl speedup is limited by the
serial part of the program $f$, i.e. $\lim_{P\to\infty} S_{Amdahl} = \frac{1}{f}$, which quantifies the
scalability of weak-scalable applications, and Gustafson speedup increases linearly,
i.e. $\lim_{P\to\infty} S_{Gustafson} = \infty$, which quantifies the scalability of scalable
applications.
The Amdahl and Gustafson laws are readily generalized to the Amdahl
energy-efficiency speedup and the Gustafson energy-efficiency speedup.
Amdahl energy-efficiency speedup may be defined as

$$S^E_{Amdahl} = \frac{S_{Amdahl}}{1 + E(P)} = \frac{P \, U^E_P}{1 + f(P - 1)}. \tag{7}$$

Gustafson energy-efficiency speedup may be defined as

$$S^E_{Gustafson} = \frac{S_{Gustafson}}{1 + E(P)} = ((1 - f)P + f) \, U^E_P. \tag{8}$$
In particular, if $E^W_P = 0$, i.e., there is no extra energy consumption after
introducing parallelization, then $E(P) = 0$ and $U^E_P = 1$, and expressions (7)
and (8) are identical to the traditional Amdahl and Gustafson speedup
models. However, the communication energy consumption of parallel computing
cannot be ignored in practice, therefore our work only considers the case
where $E^W_P > 0$.
Taking Gustafson energy-efficiency speedup as an example, Figure 2 shows the
trends of Gustafson speedup and Gustafson energy-efficiency speedup when the
system size is scaled up. As shown in Figure 2, the energy-efficiency speedup $S^E_P$
never exceeds the traditional speedup $S_P$ because the efficiency of parallel energy
consumption $U^E_P < 1$.
Figure 2. Gustafson speedup vs. Gustafson energy-efficiency speedup
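Equations (7) and (8) translate directly into code. The sketch below is illustrative only: the serial fraction `f = 0.1` and the choice $E(P) = 0.01 P$ are assumptions, not values taken from the paper. It shows that the energy-efficiency speedup never exceeds the traditional one and that both coincide when $E(P) = 0$.

```python
def amdahl_speedup(p: int, f: float) -> float:
    """Traditional Amdahl speedup with serial fraction f."""
    return p / (1.0 + f * (p - 1))


def gustafson_speedup(p: int, f: float) -> float:
    """Traditional Gustafson speedup with serial fraction f."""
    return (1.0 - f) * p + f


def energy_efficiency_speedup(speedup: float, energy_factor: float) -> float:
    """Divide a traditional speedup by 1 + E(P), as in Equations (7) and (8)."""
    return speedup / (1.0 + energy_factor)


f = 0.1  # assumed serial fraction
for p in (1, 16, 256, 4096):
    e_factor = 0.01 * p  # assumed E(P), growing linearly with P
    s_a, s_g = amdahl_speedup(p, f), gustafson_speedup(p, f)
    print(p, s_a, energy_efficiency_speedup(s_a, e_factor),
          s_g, energy_efficiency_speedup(s_g, e_factor))
```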
2.3 Energy Wall
In Section 2.2, we discussed the relationship between energy consumption and
scalability, where we tried to answer how and to what extent energy consumption
limits scalability. This section first gives the quantitative definition of the energy
wall, and then proposes the existence theorem of the energy wall along with its proof.
Definition 4 (Energy Wall). Given an application G running on a $P$-core system,
the energy wall is the maximum of the energy-efficiency speedup $S^E_P$, which is denoted by
$\max S^E_P$.
We write $f(x) \succeq g(x)$ if $\lim_{x\to\infty} \frac{f(x)}{g(x)}$ is a positive constant or $\infty$,
and $f(x) \succ g(x)$ if $\lim_{x\to\infty} \frac{f(x)}{g(x)}$ is $\infty$. In addition, the operators
$\preceq$ and $\prec$ are used in the standard manner. If $f(x) = \Theta(g(x))$, as is customary, the
$\Theta$ notation¹ describes asymptotically both an upper bound and a lower bound.
Theorem 1 (Existence Theory of Energy Wall). Given an application G running
on a $P$-core system, the energy wall exists if and only if $\lim_{P\to\infty} S^E_P = 0$.
Proof. By Definition 3, we explore the trends of energy-efficiency speedup $S^E_P$ when
$S_P$ follows different laws.
Case 1: $\lim_{P\to\infty} S_P = \infty$.
According to the definition of the factor of parallel energy consumption,
$E(P)$ is a monotonically increasing function of $P$. By Definition 3,
energy-efficiency speedup $S^E_P$ has three curve shapes.
¹ Suppose $R(x)$ is the set consisting of all functions of $x$, $f(x) \in R(x)$ and $g(x) \in R(x)$.
Then $f(x) = \Theta(g(x))$ if there exist positive constants $c_1$, $c_2$ and $x_0$ such that
$c_1 g(x) \le f(x) \le c_2 g(x)$ for all $x \ge x_0$.
1. If $E(P) \succ S_P$, when the system size is scaled up, energy-efficiency speedup
$S^E_P$ first increases, then decreases and approaches zero.
2. If $E(P) = \Theta(S_P)$, when the system size is scaled up, energy-efficiency
speedup $S^E_P$ increases monotonically and approaches a positive constant.
3. If $\Theta(1) \prec E(P) \prec S_P$, when the system size is scaled up, energy-efficiency
speedup $S^E_P$ increases monotonically and approaches $\infty$.
=⇒ If the energy wall exists, by Definition 4, then the maximum of the energy-efficiency
speedup exists, i.e., energy-efficiency speedup $S^E_P$ first increases, then
decreases and approaches zero.
⇐= Since $\lim_{P\to\infty} S^E_P = 0$, energy-efficiency speedup $S^E_P$ first increases and
then decreases. So, the maximum of energy-efficiency speedup $S^E_P$ exists,
i.e., the energy wall exists.
Case 2: $\lim_{P\to\infty} S_P = C$, where $C > 0$ is a constant.
Energy-efficiency speedup $S^E_P$ has two curve shapes.
1. If $E(P) \succ S_P$, when the system size is scaled up, energy-efficiency speedup
$S^E_P$ first increases, then decreases and approaches zero.
2. If $E(P) \preceq S_P$, when the system size is scaled up, energy-efficiency speedup
$S^E_P$ increases monotonically and approaches a constant.
Thus the proof is similar to that of Case 1. Similarly, the energy
wall exists if and only if $\lim_{P\to\infty} S^E_P = 0$.
Case 3: $\lim_{P\to\infty} S_P = 0$.
By Equation (6), $\lim_{P\to\infty} S^E_P = 0$: the energy-efficiency speedup first increases,
then decreases and approaches zero. So the energy wall always exists. □
This theorem has the following important implication. If $\lim_{P\to\infty} S_P = 0$, then
$\lim_{P\to\infty} S^E_P = 0$ according to Equation (6). Thus, the energy wall always exists no
matter how energy consumption changes. In the rest of the paper, we therefore
focus on the cases where $\lim_{P\to\infty} S_P \neq 0$.
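Definition 4 and Theorem 1 can be illustrated with a brute-force scan over system sizes. This is a hedged sketch rather than the authors' procedure: `find_energy_wall` is a hypothetical helper, the Gustafson speedup with `f = 0.1` stands in for $S_P$, and the two forms of $E(P)$ are assumptions chosen so that one case has an interior maximum and the other does not.

```python
import math
from typing import Callable, Tuple


def gustafson_speedup(p: int, f: float) -> float:
    """Gustafson speedup, used here as the traditional speedup S_P."""
    return (1.0 - f) * p + f


def find_energy_wall(speedup: Callable[[int], float],
                     energy_factor: Callable[[int], float],
                     max_p: int) -> Tuple[int, float]:
    """Return (P, S^E_P) maximizing S_P / (1 + E(P)) over 1 <= P <= max_p."""
    best_p, best = 1, speedup(1) / (1.0 + energy_factor(1))
    for p in range(2, max_p + 1):
        value = speedup(p) / (1.0 + energy_factor(p))
        if value > best:
            best_p, best = p, value
    return best_p, best


f = 0.1  # assumed serial fraction

# Assumed E(P) = 1e-4 * P^2 grows faster than S_P: an interior peak appears.
wall = find_energy_wall(lambda p: gustafson_speedup(p, f),
                        lambda p: 1e-4 * p ** 2, 10_000)

# Assumed E(P) = lg P grows more slowly than S_P: no peak inside the range.
no_wall = find_energy_wall(lambda p: gustafson_speedup(p, f),
                           lambda p: math.log10(p), 10_000)
print(wall, no_wall)
```

When the reported maximizer equals `max_p`, the scan found no interior peak within the scanned range; this is a numerical hint only and does not replace the asymptotic argument of Theorem 1.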
By Equation (6), the factor of parallel energy consumption $E(P)$ is the key factor
of energy-efficiency speedup. According to the existence theory of the energy wall, we
derive the existence corollary of the energy wall, which reveals the decisive effect of this
factor on the existence of the energy wall.
Corollary 1 (Existence Corollary of Energy Wall). Suppose an application G that
satisfies $\lim_{P\to\infty} S_P \neq 0$ runs on a $P$-core system. The energy wall exists if and only
if $E(P) \succ S_P$.
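Proof.
=⇒ If the energy wall exists, by Theorem 1, $\lim_{P\to\infty} \frac{S_P}{1 + E(P)} = \lim_{P\to\infty} S^E_P = 0$.
Since $\lim_{P\to\infty} S_P \neq 0$, we have $E(P) \succ S_P$.
⇐= If $E(P) \succ S_P$, then $\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{S_P}{1 + E(P)} = 0$.
By Theorem 1, the energy wall exists. □
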
This section categorizes parallel systems by the different characteristics of energy-
efficiency speedup when the system size is scaled up, to better understand the effects
of energy consumption on scalability.
Section 3.1 provides a classification of supercomputing systems based on their
energy consumption. Section 3.2 analyzes the existence of energy wall for different
systems.
3.1 Categorizing Systems
When the large-scale parallel system is running, the energy-efficiency speedup ex-
hibits different trends according to different application characteristics or different
system architectures. Based on the different trends of energy-efficiency speedup,
parallel systems are categorized as red scalable system, green scalable system and
yellow scalable system, which are defined as:
Definition 5 (Red Scalable System). Given an application G running on a $P$-core
system, if energy-efficiency speedup satisfies $S^E_P \prec \Theta(1)$, then the system is called
a red scalable system.
Definition 6 (Green Scalable System). Given an application G running on a $P$-core
system, if energy-efficiency speedup satisfies $S^E_P \succ \Theta(1)$, then the system is called
a green scalable system.
Definition 7 (Yellow Scalable System). Given an application G running on a $P$-core
system, if energy-efficiency speedup satisfies $S^E_P = \Theta(1)$, then the system is
called a yellow scalable system.
By the definitions of system categorization and Equation (6), we give intuitive
explanations of the above three systems as follows:

Red scalable system: $S^E_P \prec \Theta(1)$, i.e.

$$\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{S_P}{1 + E(P)} = 0.$$

When the system size is scaled up, energy consumption grows faster than computing
performance. So the energy consumption of a red scalable system is inefficient and not
preferable, since a large amount of energy is wasted while the system runs.
Unfortunately, almost all current parallel systems are red scalable systems, as
analyzed in Section 3.2.

Green scalable system: $S^E_P \succ \Theta(1)$, i.e.

$$\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{S_P}{1 + E(P)} = \infty.$$

The green scalable system is the energy-efficient parallel system, i.e. energy
consumption grows more slowly than computing performance. This means that the energy
consumption introduced by parallelization is used effectively to improve computing
performance, which is the goal of future exascale supercomputing systems.

Yellow scalable system: $S^E_P = \Theta(1)$, i.e.

$$\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{S_P}{1 + E(P)} = C,$$

where $C > 0$ is a constant.
The energy efficiency of the yellow scalable system sits between that of the red scalable
system and the green scalable system. When the system size is scaled up, the ratio of
the growth rate of energy consumption to that of computing performance approaches
a constant. The yellow scalable system is a compromise in system and application design.
For ease of understanding, Figure 3 shows the variations of $E(P)$ for the red, green
and yellow scalable systems and their corresponding energy-efficiency speedup $S^E_P$.
Traditional speedup is taken to be the Gustafson speedup $S_{Gustafson}$, and $E(P)$ takes
the forms $P^2$, $P$ and $\lg P$, where $\lg$ is the logarithm with base 10.

If $E(P) = P^2$, then

$$\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{(1 - f)P + f}{1 + P^2} = 0,$$

and the corresponding system is a red scalable system.

If $E(P) = P$, then

$$\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{(1 - f)P + f}{1 + P} = C,$$

where $C = 1 - f > 0$ is a constant. The corresponding system is a yellow scalable system.

If $E(P) = \lg P$, then

$$\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{(1 - f)P + f}{1 + \lg P} = \infty,$$

and the corresponding system is a green scalable system.

Figure 3. Trends of the factor of parallel energy consumption $E(P)$ and the
energy-efficiency speedup $S^E_P$ for three kinds of systems. a) $E(P)$ for three kinds
of systems. b) $S^E_P$ for three kinds of systems.
In Figure 3 a), when the system size is scaled up, the factor of parallel energy
consumption $E(P)$ for the red scalable system grows faster than the
traditional speedup $S_P$, hence the gap between $S_P$ and $E(P)$ keeps growing and the
corresponding $S^E_P$ in Figure 3 b) approaches zero. For the green scalable system,
$E(P)$ grows more slowly than $S_P$, so the gap between $S_P$ and $E(P)$
widens as shown in Figure 3 a), and the corresponding $S^E_P$ in Figure 3 b) approaches
infinity. The trend of $E(P)$ for the yellow scalable system is consistent with that of
$S_P$ as shown in Figure 3 a), and the corresponding $S^E_P$ in Figure 3 b) approaches
a positive constant.
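The three trends can be reproduced numerically. The sketch below evaluates the energy-efficiency speedup with Gustafson speedup and $E(P) = P^2$, $P$ and $\lg P$; the serial fraction `f = 0.1` and the sampled system sizes are assumptions made for illustration.

```python
import math


def gustafson_speedup(p: int, f: float) -> float:
    """Gustafson speedup, used as the traditional speedup S_P."""
    return (1.0 - f) * p + f


# The three forms of E(P) discussed for Figure 3.
ENERGY_FACTORS = {
    "red": lambda p: p ** 2,
    "yellow": lambda p: p,
    "green": lambda p: math.log10(p),
}


def energy_efficiency_speedup(p: int, f: float, energy_factor) -> float:
    """S^E_P = S_P / (1 + E(P))."""
    return gustafson_speedup(p, f) / (1.0 + energy_factor(p))


f = 0.1  # assumed serial fraction
for p in (10, 1_000, 1_000_000):
    row = {name: energy_efficiency_speedup(p, f, factor)
           for name, factor in ENERGY_FACTORS.items()}
    print(p, row)
```

As $P$ grows, the red value shrinks toward zero, the yellow value settles near $1 - f$, and the green value keeps increasing.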
3.2 System Categorization and Energy Wall
Section 3.1 categorizes parallel systems into red, green and yellow scalable systems.
The improvement of computing performance gained per unit of energy consumption
differs among these three systems, hence
the variations of energy-efficiency speedup are different. This section explores the
relationship between the energy wall and the system categories.
Based on Corollary 1, we further derive the existence corollaries for different
systems.
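Corollary 2 (Existence of Energy Wall for Red Scalable System). Suppose an
application G that satisfies $\lim_{P\to\infty} S_P \neq 0$ runs on a red scalable system. The
energy wall always exists.
Proof. By Definition 5 and Equation (6), $\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{S_P}{1 + E(P)} = 0$
and $\lim_{P\to\infty} S_P \neq 0$, so
$E(P) \succ S_P$. Based on Corollary 1, the energy wall exists. □
Corollary 3 (Existence of Energy Wall for Yellow/Green Scalable System).
Suppose an application G that satisfies $\lim_{P\to\infty} S_P \neq 0$ runs on a yellow or green
scalable system. The energy wall does not exist.
Proof. By Definitions 6 and 7 and Equation (6), $\lim_{P\to\infty} S^E_P = \lim_{P\to\infty} \frac{S_P}{1 + E(P)} \neq 0$,
and so $E(P) \preceq S_P$. Based on Corollary 1, the energy wall does
not exist. □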
The existence of the energy wall for red, yellow and green scalable systems is shown in
Table 2, which shows that the red scalable system is unscalable, while the yellow
and green scalable systems are scalable. A scalable system, i.e. a green or
yellow scalable system, is the design objective of system designers and
application developers.

System Categorization     Condition                        Energy Wall
Red scalable system       $E(P) \succ S_P$                 Exists
Yellow scalable system    $\Theta(E(P)) = \Theta(S_P)$     Does not exist
Green scalable system     $E(P) \prec S_P$                 Does not exist
Table 2. Existence of energy wall for categorizations
According to Table 2, we conclude that the relationship between $S_P$ and
$E(P)$ is the decisive factor for the existence of the energy wall. $E(P)$ is determined by
$E^W_P$, the extra energy consumption after introducing parallelization for improving
performance. As shown in Figure 1 b), $E^W_P$ consists of four parts (two kinds of energy
consumption of computation and two kinds of energy consumption of network).
Load imbalance accounts for idling, and in the following analysis we
assume the system is load balanced, i.e. $E^I_W = 0$. Therefore, we are mainly concerned
with the trends of the other three components of $E^W_P$, i.e. the redundant energy of the
serial part of the parallel application $E^S_W$, the energy consumption of communication
$E^C_E$ and that of network idling $E^C_W$.
Suppose the execution time of the serial part is fixed and denoted by $T^f_P$, and the
average dynamic power per core equals the ratio of a single CN's dynamic power to the
number of cores per CN. Then $E^S_W$ equals this per-core power multiplied by
$T^f_P \cdot (P - 1)$.
Thus, without regarding the energy consumption of the network, $\Theta(E(P)) = \Theta(E^S_W)
= \Theta(P)$. For the cases of $\lim_{P\to\infty} S_P = C$ ($C > 0$) and $0$, $\Theta(E(P)) = \Theta(P) \succ \Theta(S_P)$,
the parallel computing system is red scalable, and the energy wall always exists. For
a scalable application, $S_P = S_{Gustafson} = \Theta(P)$, so $\Theta(E(P)) = \Theta(P) = \Theta(S_P)$ and
the energy-efficiency speedup $S^E_P = \Theta(1)$; the parallel computing system is yellow
scalable and the energy wall does not exist according to Table 2. Hence $E^S_W$ is not the
decisive factor for the existence of the energy wall, while the energy consumption of the
network decides it. Thus, in the following case studies we only analyze the influence of
the energy consumption of the network on the existence of the energy wall for scalable
applications, i.e. $S_P = S_{Gustafson} = \Theta(P)$.
The energy consumption of the network consists mainly of the energy consumption
of network nodes (NNs) rather than that of communication links, which is negligible;
hence we focus on analyzing the influence of the energy consumption of NNs on the
energy wall.
Currently, large-scale systems usually adopt optoelectronic hybrid networks. For
example, Tianhe-2 uses a proprietary interconnect called the TH-Express-2 network,
which uses a 3-level fat tree topology, and active optical cables (AOCs) are used for
cabinet-to-cabinet connections to decrease the communication latency at reasonable
cost [8]. In the 3-level fat tree TH-Express-2 network, all the switches are built from
Network Router Chips (NRCs), and messages are transmitted through the fabric in the
second and top levels of the fat tree, where the photoelectric conversion power
of an NRC is a constant (No. of optical ports per NRC × power per optical port) and
should be taken into account in the energy consumption of NNs. In addition, the
power of each NRC is also a constant. Therefore, the average power of each NN
is constant in $P$, and from the perspective of scalability the number of NNs, $N$,
rather than the power of each NN, is the decisive factor for the existence of the
energy wall.
Suppose the average static power of a single node in the network is denoted by
$p_0$, the average dynamic power of a single node in the network is denoted by $p_1$, and
$l$ is the total number of communications in which messages are transmitted through
NNs. $p_0$, $p_1$, $T_P$ and $l$ are all constant in $P$.
The NNs of many current parallel systems integrate special hardware for routing
and switching. A message from the source NN to the target NN may pass through several
intermediate NNs, each of which is called a hop. Thus, the trend of network energy
consumption with $P$ is decided by three quantities: $N$, $hop_j$ (the total number of
network hops for the $j$th communication) and $t^c_j$ (the time of the $j$th communication).
3D-Torus, binary n-cube and Fat tree are three commonly used network topologies
in large-scale parallel systems [9]. We use these three representative topologies to
verify and extrapolate the existence of the energy wall. The number of cores per CN
is a constant, denoted by $CC$. In the 3D-Torus and binary n-cube topologies, $N$ is
equal to the number of CNs, i.e. $N = \frac{P}{CC}$. The m-branch Fat tree topology needs
$\Theta(P \cdot \log_m \frac{P}{CC})$ NNs to connect computation nodes.
We list the energy consumption of the network for the three topologies in Table 3, where
each formula has two terms. The first
term is derived from the static energy consumption of the network. For the Fat tree
topology, we have $S^E_P \prec \Theta(1)$ even if we only consider the first term, so the parallel
computing system is red scalable and the energy wall always exists. For the 3D-Torus
and binary n-cube topologies, we have $S^E_P \prec \Theta(1)$ or $S^E_P = \Theta(1)$ (depending on
$hop_j$ and $t^c_j$), and the system is either red or yellow scalable.
Topology             Energy Consumption of Network                                                                          Categorization
3D-Torus             $p_0 \cdot T_P \cdot \Theta(P) + (p_1 - p_0) \sum_{j=1}^{l} hop_j \cdot t^c_j$                           Yellow/Red
Binary n-Cube        $p_0 \cdot T_P \cdot \Theta(P) + (p_1 - p_0) \sum_{j=1}^{l} hop_j \cdot t^c_j$                           Yellow/Red
m-branch Fat Tree    $p_0 \cdot T_P \cdot \Theta(P \cdot \log_m \frac{P}{CC}) + (p_1 - p_0) \sum_{j=1}^{l} hop_j \cdot t^c_j$   Red
Table 3. Existence of energy wall for three commonly adopted network topologies
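The formulas in Table 3 can be turned into a rough calculator. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the $\Theta$ terms are replaced by concrete counts with constant factor 1 (so only the growth trend is meaningful), and the sample powers, times and hop counts are invented.

```python
import math
from typing import Sequence


def network_nodes(topology: str, p: int, cores_per_cn: int, m: int = 2) -> float:
    """Order-of-growth estimate of the number of network nodes N.

    Assumption: constant factors hidden by the Theta notation are set to 1.
    """
    cns = p / cores_per_cn  # number of computation nodes, P / CC
    if topology in ("3d-torus", "binary-n-cube"):
        return cns  # one NN per CN
    if topology == "fat-tree":
        levels = max(1.0, math.log(cns, m))  # log_m(P / CC), at least one level
        return cns * levels
    raise ValueError(f"unknown topology: {topology}")


def network_energy(n_nodes: float, p0: float, p1: float, t_p: float,
                   hops: Sequence[float], comm_times: Sequence[float]) -> float:
    """Static term p0 * T_P * N plus dynamic term (p1 - p0) * sum(hop_j * t^c_j)."""
    static = p0 * t_p * n_nodes
    dynamic = (p1 - p0) * sum(h * t for h, t in zip(hops, comm_times))
    return static + dynamic


# Hypothetical system: 1024 cores, 16 cores per CN, two communications.
n_torus = network_nodes("3d-torus", 1024, 16)
n_fat = network_nodes("fat-tree", 1024, 16, m=4)
print(n_torus, n_fat)
print(network_energy(n_torus, p0=10.0, p1=30.0, t_p=100.0,
                     hops=[2, 3], comm_times=[0.5, 1.0]))
```

With these sample values the static term dominates, which matches the observation that the network static energy consumption cannot be ignored.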
Furthermore, we take one-to-all broadcasting as an example, and analyze its
energy consumption on the 3D-Torus and binary n-cube topologies
respectively. One-to-all broadcasting occupies all network resources, so
$hop_j = P$ and $t^c_j$ satisfies at least $t^c_j \succ \Theta(1)$. Thus, $E(P) \succ \Theta(P) = \Theta(S_{Gustafson})$, and
then the energy-efficiency speedup $S^E_P \prec \Theta(1)$. This means that the system is red scalable,
and the energy wall exists whenever there is a one-to-all broadcast on a 3D-Torus or
binary n-cube topology, even if it occurs only once.
In conclusion, systems with 3D-Torus, binary n-cube or fat tree topologies
are all not green scalable systems, because the static energy consumption
makes $E(P) \succeq \Theta(S_P)$ always hold. In order to build a green scalable system, we should
adopt appropriate methods, such as low-overhead techniques for shutting down NNs, to
avoid the linear or superlinear increase of network energy consumption.
Our energy wall theory provides us insights on reducing energy consumption for
improved scalability in a number of principal directions.
• First, the relationship between energy consumption and scalability can be re-
vealed. An energy-efficiency speedup model can be developed and the existence
of the energy wall can be analytically predicted.
• Then, by performing an energy wall analysis, we revealed that the key scalability-
limiting energy consumption is the network energy consumption. The 3D-Torus
and binary n-cube topologies are better than fat tree from the perspective of
scalability. So the appropriate topologies should be chosen in the system design
in order to mitigate the effect of energy wall.
• Besides, according to the analysis of network energy consumption, the network
static energy consumption cannot be ignored. The development of new low
energy techniques can be guided by our theory to focus on optimizing the key
scalability-limiting factors, such as network static energy consumption.
4 RELATED WORK
Speedup has been almost exclusively used for measuring scalability in parallel com-
puting, such as Amdahl speedup, Gustafson speedup or memory-bounded speed-
up [6, 7, 10, 11]. These models are only concerned with the computing performance.
When supercomputing evolves from high performance to high productivity, it is vi-
tal to rethink the performance metric by integrating computing performance and
reliability, energy consumption, communication and so on [3, 12, 13, 14]. Recent
work has also discussed performance scalability and energy for multi-core
systems using Amdahl's Law [15, 16]. Ge et al. [17, 18] incorporated varying
power modes into Amdahl's speedup, so that different power-performance pairings
were applied and lower power was assumed to mean slower performance of a
component. In comparison, we integrate performance and energy consumption from
the perspective of scalability, emphasize the energy efficiency of the parallel
system, and then quantitatively define the energy wall. We also provide
existence theorems for the energy wall, which are significant for analyzing
energy wall effects under different topologies and for guiding system design and
hardware/software development. These are the major differences between their
work and ours.
Song et al. [19] proposed a system-level iso-energy-efficiency model for
predicting the energy-performance of data-intensive parallel applications, and
demonstrated that the model is helpful in various application contexts and in
scalability decision-making. As an energy efficiency model for parallel systems,
the iso-energy-efficiency model is similar to ours. In comparison, our
energy-efficiency speedup model focuses on the impact of scalability on energy
efficiency, and derives the condition under which the following theorem holds:
as the system scale increases, the energy-efficiency speedup has a maximum,
i.e., the energy wall exists. Song's work, to some extent, supports our
conclusion.
The “power wall” is currently one of the major obstacles computer architecture
is facing. The power wall proposed in [20, 21, 22] means that uniprocessor
performance improvements have come to an end due to power constraints. It
emphasizes instantaneous energy consumption or the average energy consumption
over a period of time, but it does not reflect the energy consumed cumulatively
over that period. Unlike the power wall, the energy wall is a synthesized
concept that focuses on the combined characteristics of performance and energy.
It should also be noted that the concepts of the energy wall and the power wall
are not contradictory.
Energy consumption is one of the major factors that limit the development of
high-efficiency computer systems. Current low power optimization in parallel
systems includes task scheduling, load balancing and so on [23, 24, 25, 26, 27,
28, 29]. Dynamic compilation compiles, alters and optimizes the execution
sequence of the application, and is used to reduce the energy consumption of
scientific computing and data processing [30, 31, 32].
Most of the above mentioned low power technologies are confined to the
optimization of dynamic power consumption, which can reduce the value of the
energy wall, but cannot remove the wall fundamentally. Future low power
technology should pay more attention to network energy consumption, especially
static network energy consumption, in order to remove the energy wall.
As device sizes decrease, the number of transistors increases dramatically,
which makes static power (e.g. from leakage current) exceed dynamic power (e.g.
from dynamic switching) and dominate the power consumption of interconnection
networks. Many recent studies focus on optimizing static power consumption from
several aspects, such as buffers, arbitration components, switches, links,
network topology and routing algorithms.
The large number of buffers is the main source of static leakage power in
interconnection networks. Chen and Peh found that buffers consumed the majority
of the static power of a network-on-chip and proposed several designs of
power-sensitive buffers [33]. Leakage power optimization methods developed for
caches are also applied to interconnection networks. In [34], Hanson et al.
evaluated the effects of several cache power optimization methods that are also
used in interconnection networks, such as Gated-VDD [35]. Matsutani et al.
applied virtual channels to address static power and dynamic power
simultaneously [36, 37]. Topology optimization is a powerful approach to
reducing the static power consumption of interconnection networks, and an
important method is to customize the network topology according to the
characteristics of applications [38, 39, 40, 41]. Turnoff-based optimization
methods reduce static power by turning off some of the routers, links and so on,
or by putting them to sleep [36, 42].
To sum up, almost no holistic metrics or theories are available to direct the
development of system-level low power technology. The energy wall theory aims at
improving scalability, and can thereby direct system-level low power technology.
5 CONCLUSION
This paper quantifies for the first time the concept of “Energy Wall” and proposes
an energy wall theory that allows the effects of energy consumption on scalability of
parallel computing systems to be understood and predicted analytically, particularly
those at the peta/exascale levels.
The significance of this work is demonstrated by three representative
topologies. Our work enables us to mitigate energy wall effects in system design
(e.g. by choosing appropriate topologies) and by applying network static energy
consumption optimizations in hardware and software.
Acknowledgement
This work is supported by the National Natural Science Foundation of China (NSFC)
No. 61303068, No. 61221491 and No. 61303061.
REFERENCES
[1] Kothe, D. B.: Science Prospects and Benefits with Exascale Computing. Technical
Report ORNL/TM-2007/232, Oak Ridge National Laboratory, 2007.
[2] Simon, H.—Zacharia, T.—Stevens, R.: Modeling and Simulation at
the Exascale for Energy and the Environment. http://www.sc.doe.gov/ascr/
ProgramDocuments/ProgDocs.html.
[3] Yang, X.—Wang, Z.—Xue, J.—Zhou, Y.: The Reliability Wall for Exascale
Supercomputing. IEEE Transactions on Computers, Vol. 61, 2012, No. 6, pp. 767–779.
[4] Top 500, Website, http://www.top500.org.
[5] Shang, L.—Peh, L.-S.—Jha, N. K.: Dynamic Voltage Scaling with Links for
Power Optimization of Interconnection Networks. Proceedings of the 9th Interna-
tional Symposium on High-Performance Computer Architecture (HPCA ’03), IEEE
Computer Society, Washington, DC, USA, 2003.
[6] Amdahl, G. M.: Validity of the Single Processor Approach to Achieving Large Scale
Computing Capabilities. Proceedings of the 1967 Spring Joint Computer Conference
(AFIPS ’67), 1967, pp. 483–485.
[7] Gustafson, J. L.: Reevaluating Amdahl’s Law. Multiprocessor Performance
Measurement and Evaluation, IEEE Computer Society Press, Los Alamitos, CA, USA,
1995, pp. 92–93.
[8] Liao, X. K.—Pang, Z. B.—Wang, K. F.—Lu, Y. T.—Xie, M.—Xia, J.—
Dong, D. Z.—Suo, G.: High Performance Interconnect Network for Tianhe System.
Journal of Computer Science and Technology, Vol. 30, 2015, No. 2, pp. 259–272.
[9] Jurczyk, M.—Siegel, H. J.—Stunkel, C.: Interconnection Networks for Parallel
Computers. 1998.
[10] Sun, X. H.—Ni, L. M.: Scalable Problems and Memory-Bounded Speedup. Journal
of Parallel and Distributed Computing, Vol. 19, 1993, No. 1, pp. 27–37.
[11] Sun, X. H.—Rover, D. T.: Scalability of Parallel Algorithm-Machine Combina-
tions. IEEE Transactions on Parallel and Distributed Systems, Vol. 5, 1994, No. 6,
pp. 599–613.
[12] SPECpower ssj2008 (2008). http://www.spec.org/power_ssj2008/.
[13] Green 500. http://www.green500.org.
[14] Yang, X.—Du, J.—Wang, Z.: An Effective Speedup Metric for Measuring Pro-
ductivity in Large-Scale Parallel Computer Systems. The Journal of Supercomputing,
Vol. 56, 2011, No. 2, pp. 164–181.
[15] Woo, D. H.—Lee, H.-H. S.: Extending Amdahl’s Law for Energy-Efficient
Computing in the Many-Core Era. IEEE Computer, Vol. 41, 2008, No. 12, pp. 24–31.
[16] Cho, S.—Melhem, R. G.: Corollaries to Amdahl’s Law for Energy. Computer
Architecture Letters, Vol. 7, 2008, No. 1, pp. 25–28.
[17] Ge, R.—Cameron, K. W.: Power-Aware Speedup. IPDPS, 2007, pp. 1–10.
[18] Cameron, K. W.—Ge, R.: Generalizing Amdahl’s Law for Power and Energy.
IEEE Computer, Vol. 45, 2012, No. 3, pp. 75–77.
[19] Song, S.—Su, C.-Y.—Ge, R.—Vishnu, A.—Cameron, K. W.: Iso-Energy-
Efficiency: An Approach to Power-Constrained Parallel Computation. IPDPS, 2011,
pp. 128–139.
[20] Kuroda, T.: CMOS Design Challenges to Power Wall. International Microprocesses
and Nanotechnology Conference, Shimane, Japan, 2001, pp. 6–7.
[21] Meenderinck, C.—Juurlink, B.: (When) Will CMPs Hit the Power Wall? Euro-
Par 2008 Workshops – Parallel Processing. Springer-Verlag Berlin, Heidelberg, 2009,
pp. 184–193.
[22] Gioiosa, R.: Towards Sustainable Exascale Computing. 18th IEEE/IFIP VLSI Sys-
tem on Chip Conference, September 2010, pp. 270–275.
[23] Olsen, C. M.—Morrow, L. A.: Multi-Processor Computer System Having Low
Power Consumption. Proceedings of the 2nd International Conference on Power-
Aware Computer Systems (PACS ’02), Springer-Verlag, Berlin, Heidelberg, 2003,
pp. 53–67.
[24] Kadayif, I.—Kandemir, M.—Karakoy, M.: An Energy Saving Strategy Based
on Adaptive Loop Parallelization. Proceedings of the 39th Annual Design Automation
Conference (DAC ’02), ACM, New York, NY, USA, 2002, pp. 195–200.
[25] Freeh, V. W.—Pan, F.—Kappiah, N.—Lowenthal, D. K.—Springer, R.:
Exploring the Energy-Time Tradeoff in MPI Programs on a Power-Scalable Clus-
ter. Proceedings of the 19th IEEE International Parallel and Distributed Processing
Symposium (IPDPS ’05), IEEE Computer Society, Washington, DC, USA, 2005.
[26] Pan, F.—Freeh, V. W.—Smith, D. M.: Exploring the Energy-Time Tradeoff in
High-Performance Computing. Proceedings of the 19th IEEE International Parallel
and Distributed Processing Symposium (IPDPS ’05), IEEE Computer Society, Wash-
ington, DC, USA, 2005.
[27] Freeh, V. W.—Lowenthal, D. K.: Using Multiple Energy Gears in MPI Pro-
grams on a Power-Scalable Cluster. Proceedings of the Tenth ACM SIGPLAN Sym-
posium on Principles and Practice of Parallel Programming (PPoPP ’05), ACM, New
York, NY, USA, 2005, pp. 164–173.
[28] Springer, R.—Lowenthal, D. K.—Rountree, B.—Freeh, V. W.: Minimiz-
ing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable
Cluster. Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming (PPoPP ’06), ACM, New York, NY, USA, 2006,
pp. 230–238.
[29] Kappiah, N.—Freeh, V. W.—Lowenthal, D. K.: Just in Time Dynamic Volt-
age Scaling: Exploiting Inter-Node Slack to Save Energy in MPI Programs. Proceed-
ings of the 2005 ACM/IEEE Conference on Supercomputing (SC ’05), IEEE Com-
puter Society, Washington, DC, USA, 2005.
[30] Unnikrishnan, P.—Chen, G.—Kandemir, M.—Mudgett, D. R.: Dynamic
Compilation for Energy Adaptation. Proceedings of the 2002 IEEE/ACM Interna-
tional Conference on Computer-Aided Design (ICCAD ’02), ACM, New York, NY,
USA, 2002, pp. 158–163.
[31] Son, S. W.—Chen, G.—Kandemir, M. T.—Choudhary, A. N.: Dynamic Com-
pilation for Reducing Energy Consumption of I/O-Intensive Applications. Languages
and Compilers for Parallel Computing (LCPC 2005), Lecture Notes in Computer
Science, Vol. 4339, 2006, pp. 450–457.
[32] Wu, Q.—Martonosi, M.—Clark, D. W.—Reddi, V. J.—Connors, D.—
Wu, Y.—Lee, J.—Brooks, D.: A Dynamic Compilation Framework for Con-
trolling Microprocessor Energy and Performance. Proceedings of the 38th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO 38), IEEE
Computer Society, Washington, DC, USA, 2005, pp. 271–282.
[33] Chen, X.—Peh, L.-S.: Leakage Power Modeling and Optimization in Intercon-
nection Networks. Proceedings of the 2003 International Symposium on Low Power
Electronics and Design (ISLPED ’03), 2003, pp. 90–95.
[34] Hanson, H.—Hrishikesh, M. S.—Agarwal, V.—Keckler, S. W.—Bur-
ger, D.: Static Energy Reduction Techniques for Microprocessor Caches. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 11, 2003,
pp. 303–313.
[35] Hu, Z.—Buyuktosunoglu, A.—Srinivasan, V.—Zyuban, V.—Jacob-
son, H.—Bose, P.: Microarchitectural Techniques for Power Gating of Execution
Units. Proceedings of the 2004 International Symposium on Low Power Electronics
and Design (ISLPED ’04), 2004, pp. 32–37.
[36] Matsutani, H.—Koibuchi, M.—Amano, H.—Wang, D.: Run-Time Power Gat-
ing of On-Chip Routers Using Look-Ahead Routing. Proceedings of the 2008 Asia and
South Pacific Design Automation Conference (ASP-DAC ’08), 2008, pp. 55–60.
[37] Matsutani, H.—Koibuchi, M.—Wang, D.—Amano, H.: Adding Slow-Silent
Virtual Channels for Low-Power On-Chip Networks. Proceedings of the Second
ACM/IEEE International Symposium on Networks-on-Chip (NOCS ’08), 2008,
pp. 23–32.
[38] Jalabert, A.—Murali, S.—Benini, L.—De Micheli, G.: Xpipescompiler:
A Tool for Instantiating Application Specific Networks on Chip. Proceedings of De-
sign, Automation and Test in Europe Conference and Exposition, 2004, pp. 884–889.
[39] Srinivasan, K.—Chatha, K. S.—Konjevod, G.: An Automated Technique for
Topology and Route Generation of Application Specific On-Chip Interconnection Net-
works. Proceedings of the 2005 IEEE/ACM International Conference on Computer-
Aided Design (ICCAD ’05), 2005, pp. 231–237.
[40] Xu, J.—Wolf, W.—Henkel, J.—Chakradhar, S.: A Design Methodology for
Application-Specific Networks-on-Chip. ACM Transactions on Embedded Computing
Systems (TECS), Vol. 5, 2006, No. 2, pp. 263–280.
[41] Stensgaard, M. B.—Sparsø, J.: ReNoC: A Network-on-Chip Architecture with
Reconfigurable Topology. Proceedings of the Second ACM/IEEE International Sym-
posium on Networks-on-Chip (NOCS ’08), 2008, pp. 55–64.
[42] Powell, M.—Yang, S. H.—Falsafi, B.—Roy, K.—Vijaykumar, T. N.:
Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache
Memories. International Symposium on Low Power Electronics and Design, July,
2000, pp. 90–95.
Zhiyuan Wang received her B.Sc., M.Sc. and Ph.D. degrees
from the National University of Defense Technology in 2003,
2005 and 2011 respectively. She is now Assistant Professor at the
State Key Laboratory of High Performance Computing, National
University of Defense Technology and the School of Computer,
National University of Defense Technology. Her research inter-
ests focus on parallel and distributed systems, machine learning,
data mining and robotics.
Yuhua Tang is now Full Professor at the State Key Laboratory
of High Performance Computing, National University of Defense
Technology and the School of Computer, National University of
Defense Technology. Her research interests focus on computer
architecture.
Juan Chen received her B.Sc. degree from the Department of
Computer Science, Southeast University in 2001 and her Ph.D.
degree from the School of Computer, National University of De-
fense Technology, China in 2007. She is now Associate Professor
in the State Key Laboratory of High Performance Computing
at NUDT, China. Her research interests focus on energy-aware
HPC interconnection networks, and parallel software frame-
works.
Jingling Xue received his B.Sc. and M.Sc. degrees in computer
science from the Tsinghua University, China, in 1984 and 1987,
respectively. He received his Ph.D. degree in computer science
from the University of Edinburgh, United Kingdom, in 1992.
He is currently Professor in the School of Computer Science and
Engineering, University of New South Wales. His current re-
search interests include programming languages, compiler opti-
misations, program analysis, high-performance computing and
embedded systems. He is a senior member of the IEEE and
a member of ACM.
Yun Zhou received his Ph.D. degree from National University of
Defense Technology. He is now Assistant Professor at the State
Key Laboratory of High Performance Computing, National Uni-
versity of Defense Technology and the School of Computer, Na-
tional University of Defense Technology. His research interests
focus on machine learning, deep learning and robotics.
