Portland State University

PDXScholar
Electrical and Computer Engineering Faculty
Publications and Presentations

Electrical and Computer Engineering

8-2022

Warp-Aware Adaptive Energy Efficiency Calibration
for Multi-GPU Systems
Zhuowei Wang
Guangdong University of Technology

Xiaoyu Song
Portland State University, song@ece.pdx.edu

Lianglun Cheng
Guangdong University of Technology

Hai Wan
Tsinghua University, Beijing

Wuqing Zhao
Digital Grid Research Institute

Citation Details
Published as: Wang, Z., Song, X., Cheng, L., Wan, H., Zhao, W., & Wang, T. (2022). Warp-Aware Adaptive
Energy Efficiency Calibration for Multi-GPU Systems. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 1–1. https://doi.org/10.1109/TCAD.2022.3200528

This Post-Print is brought to you for free and open access. It has been accepted for inclusion in Electrical and
Computer Engineering Faculty Publications and Presentations by an authorized administrator of PDXScholar.
Please contact us if we can make this document more accessible: pdxscholar@pdx.edu.

Authors
Zhuowei Wang, Xiaoyu Song, Lianglun Cheng, Hai Wan, Wuqing Zhao, and Tao Wang

This post-print is available at PDXScholar: https://pdxscholar.library.pdx.edu/ece_fac/700

This article has been accepted for publication in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. This is the author's version which has not been fully edi
content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2022.3200528

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS

Warp-Aware Adaptive Energy Efficiency Calibration
for Multi-GPU Systems
Zhuowei Wang, Xiaoyu Song, Lianglun Cheng, Hai Wan, Wuqing Zhao, Tao Wang

Abstract-Massive numbers of GPU acceleration processors are used in
high-performance computing systems. The breakdown of Dennard scaling has
led to power and thermal constraints that limit the performance of such
systems, so both higher performance and better energy efficiency are in
demand. This paper presents a multi-layer low-power optimisation method
for warp and task parallelism. We present a dynamic frequency regulation
scheme for performance parameters under both load balance and load
imbalance. The method monitors the energy parameters at runtime and
adaptively adjusts the voltage level to preserve performance while
reducing energy. The experimental results show that the multi-layer
low-power optimisation with dynamic frequency regulation achieves a 40%
reduction in energy consumption with only 1.6% performance degradation,
and a maximum energy reduction of 59%. It further saves about 30% of
energy consumption in comparison with single-layer energy optimisation.
Index Terms-GPU, multi-layer energy optimisation, warp and task
parallelism, dynamic frequency regulation
I. INTRODUCTION

WITH the increasing scale of high-performance computer systems, the
power consumption of these systems is increasing rapidly, which brings
great challenges to the power supply and cooling systems [1] [2] [3].
Heterogeneous parallel systems have become one of the important trends
in the development of high-performance computer systems. Compared with
traditional homogeneous parallel systems, heterogeneous parallel systems
with special acceleration components have better peak computing speed
and peak efficiency. Although the GPU can improve the system's
performance-to-power ratio as an acceleration

This work was sponsored in part by Guangdong Natural Science Foundation
under Grant 2020A1515011409, in part by Construction Project of Regional
Innovation Capability and Support Guarantee System in Guangdong Province
under Grant 2021A1414030004, in part by Provincial Agricultural Science and
Technology Innovation and Extension Project of Guangdong Province under
Grant 2022KJ147, and in part by the Guangdong Provincial Key Laboratory
of Cyber-Physical System under Grant 2020B1212060069. This article was
recommended by XXXX. (Corresponding author: Tao Wang)
Z. Wang is with the School of Computers, Guangdong University of
Technology, Guangzhou 510000, China, and with the School of Computer
Science, Wuhan Donghu University, Wuhan 430074, China (e-mail:
zwwang@gdut.edu.cn).
X. Song is with the Department of Electrical and Computer Engineering,
Portland State University, Portland, OR 97207, USA (e-mail:
songx@pdx.edu).
L. Cheng is with the School of Computers, Guangdong University of
Technology, Guangzhou 510000, China (e-mail: llcheng@gdut.edu.cn).
H. Wan is with the School of Software, Tsinghua University, Beijing
100084, China (e-mail: wanhai@tsinghua.edu.cn).
W. Zhao is with Digital Grid Research Institute, China Southern Power
Grid, Guangzhou 510000, China (e-mail: zhaowq@csg.cn).
T. Wang is with the School of Automation, Guangdong University of
Technology, Guangzhou 510000, China (e-mail: wangtao_cps@gdut.edu.cn).

component, the problem of high power consumption persists.
Optimising the power consumption of heterogeneous parallel systems
remains an important challenge in constructing future E-level
supercomputer systems [4] [5] [6] [7].
There are hundreds of GPU acceleration processors in the architecture
of supercomputers. The execution of an application program on the
system involves two levels of parallelism. At one level, the application
is divided into several parallel tasks, which are executed on each GPU
in units of blocks; at the other level, on a single GPU, a task is
executed on multiple SMs in units of warps. In the existing power
optimisation research, the main concern is the power optimisation of a
single task executed on a single GPU (parallelism of warps) [8] [9] [10]
[11], or the power optimisation of multiple tasks executing on multiple
GPUs (parallelism of tasks) [12] [13] [14] [15]. However, in
heterogeneous parallel systems with multiple GPUs, a large-scale
parallel application usually contains multiple loop iterations, that is,
multiple parallel tasks without dependencies. Reasonably dividing
multiple parallel tasks among GPUs for parallel execution, and
scheduling multiple warps to SMs for parallel execution of a single
task, are core issues in achieving the energy optimisation of the
application program.
Our main contributions in this paper are summarized as follows.
(1) We establish a hierarchical energy consumption optimisation model
based on warp parallelism and task parallelism for the first time. The
experimental results show that the proposed multi-layer energy
optimisation model can reduce the energy consumption of heterogeneous
systems, and the maximum energy consumption can be reduced by 59%.
Compared with single-level energy optimisation, it can further save
about 30% of the previous energy consumption.
(2) In this model, the core parameters affecting energy consumption are
determined through analysis, and the range of core parameters is
discussed for load balance and load imbalance, respectively.
(3) We propose a dynamic energy optimisation methodology. The algorithm
monitors the power consumption of the system in real time, adjusts the
frequency adaptively to the appropriate value, and ensures that the
system energy consumption reaches its minimum without increasing the
average execution time of the program. We evaluate the dynamic
optimisation method against the static optimisation method. The
experiments show that the power optimisation algorithm of dynamic
voltage regulation can control the performance at the specified level.
Compared with the performance before optimization,

the performance after multi-layer low-power optimization is reduced by
only 1.6% on average, with a power reduction of 40.01%.
The rest of this paper is organised as follows. Section II reviews
related work. Section III presents the target system architecture.
Section IV analyses the hierarchical pipeline structures. Section V
establishes the energy consumption model based on warp and task
parallelism. Section VI gives the constraints for performance and
energy. Section VII proposes a dynamic frequency regulation methodology.
Section VIII discusses the application by way of an analysis of case
studies. Section IX evaluates and analyses the experimental results, and
Section X concludes the paper.
II. RELATED WORK
In the existing power optimisation research, the main concern is the
performance optimisation of a single task on a single GPU (warp
parallelism). For example, Khailany et al. [16] have proposed an
approach for programming GPUs with tightly-coupled specialized DMA warps
for performing memory transfers between on-chip and off-chip memories.
Brunie et al. [17] have considered both co-issuing instructions from
different divergent paths of the same warp and co-issuing instructions
from different warps as two complementary techniques to mitigate the
effect of thread divergence on SIMT micro-architectures. Choi et al.
[18] have analysed GPU performance according to warp formation with real
GPU hardware configurations and proposed several warp formations for
handling branch divergence to improve GPU performance. Chiou et al. [19]
have proposed an intelligent policy selection process for a GPU warp
scheduler based on a machine learning approach.
Other research focuses on the problem of power optimisation (task
parallelism) when multiple tasks are executed on GPUs. Guo et al. [12]
analyse the energy consumption optimisation problem of parallel tasks on
a mobile cloud computing system, where it is necessary to consider the
effect of task scheduling, resource unloading, and frequency regulation
on system energy consumption. Goraczko et al. [20] propose a task
partitioning method with optimal energy consumption for heterogeneous
multi-core processors. By mapping tasks onto heterogeneous multi-core
processors and combining this with processor frequency regulation
technology, the method optimises processor energy consumption under the
constraints of real-time applications. Morad et al. [13] discuss the
partitioning of parallel tasks on heterogeneous multiprocessors, and
conclude that heterogeneous processors have higher performance under the
same power consumption. Wang et al. [21] have investigated task-level
power optimisation. They propose a new framework for managing system
power consumption with a three-level power control mechanism. The memory
of a task execution model of a heterogeneous system is abstracted, and
top-down power control is applied at system, group, and unit levels.
In multi-GPU heterogeneous systems, the execution of programs involves
two levels of parallelism. The first refers to the fact that the program
is divided into multiple tasks, and the tasks are executed in parallel
on multiple GPUs. The second


refers to the fact that, when a single task is executed on a GPU, the
task is divided into multiple thread blocks, and the thread blocks are
executed on multiple SMs in the form of warps. None of the research
above considers two-level parallelism. Targeting the special
architecture of the GPU, the present research applies two-layer parallel
power optimisation, and discusses the influences of frequency regulation
and task division on system power consumption at the task level and the
warp level.
At the same time, for the power optimisation of heterogeneous systems,
much research has focused on the low-power optimisation method of static
voltage regulation [22]. This method can provide performance improvement
without increasing power consumption, and obtains better performance
than simple frequency regulation while consuming the same power.
However, its frequency setting lacks practical operability: several
static voltage regulation operations are needed to obtain the ideal
voltage or frequency value. To overcome this defect, we propose a power
optimisation algorithm based on dynamic frequency regulation. The
algorithm monitors the power consumption of the system in real time and
adapts the frequency to the appropriate value, thus ensuring the optimal
power consumption of the system without loss of performance.
III. TARGET SYSTEM ARCHITECTURE MODEL
A heterogeneous parallel system consists of one CPU (the host) processor
and multiple GPU-accelerated processors. The host processor and the
acceleration processors have separate memories. Data transmission
between the CPU and the GPU-accelerated processors is completed through
the PCI-E bus. Each GPU contains multiple stream multiprocessors (SMs).
Multiple SMs share data through a shared memory. There are two levels of
parallelism: parallel tasks and parallel warps. The hierarchical
parallel execution mode is illustrated in Figure 1.
The important parameters that appear in the formulas are listed here.
Others are defined immediately following the equations that contain
them.
A. Modeling Parallel Tasks
The execution space of a kernel program is composed of many threads that
can be executed in parallel. These threads are divided into thread
blocks and assigned to SMs in the GPU for execution. To exploit the
processor-level parallelism of multiple GPUs, the thread space of the
kernel program can usually be divided: different subspaces form
different sub-kernels, which are scheduled to be executed on different
GPUs at the same time. We define a sub-kernel containing a collection of
thread blocks as a task.
A GPU generally contains multiple SMs. Therefore, in order to make full
use of the parallel performance of a single GPU, there should be a basic
granularity for task division to ensure that each SM on the GPU
participates in computing. The number of thread blocks allocated to a
GPU should not be less than this basic division unit; otherwise some SMs
on the GPU will idle, wasting static power. On the contrary, if

TABLE I
Symbol: Description
$(N_{GPU})_{max}$: The number of GPUs in the heterogeneous system
$N_{GPU}$: The number of GPUs participating in the computation
$t$: The application, modeled as a sequence of $n$ tasks $t_i$
$(N_{SM})_{max}$: The number of SMs in a GPU
$N_{SM}$: The number of SMs involved in the calculation
$N_{warp}$: The number of warps in a thread block
$W_p$: The average number of warps executed per SM for a task $t_i$
$S_{warp}$: The size of the warp
$S_{simd}$: The SIMD width within the processor
$N_{inst}$: The number of thread instructions for a warp
$C_{wc}$: The instruction clock cycle of a warp
$f_c$: The SM operating frequency
$f_m$: The memory access frequency
$N_{bytes}$: The number of bytes of a warp accessing off-chip memory
$M_{bw}$: The GPU off-chip memory bandwidth
$N_B$: The number of thread blocks contained in the kernel program
$e$: The least number of thread blocks that can meet the needs of all SMs participating in the calculation when a task is executed on a GPU
$r \cdot e$: The number of thread blocks contained in a computation task when the number of thread blocks assigned to a GPU is larger than $e$ (the number of basic thread blocks)
$\frac{N_B}{N_{GPU} \cdot r \cdot e}$: The number of tasks assigned to each GPU
$F$: The total task data amount of the kernel program
$Data(r \cdot e \cdot \frac{F}{N_B})$: The amount of data that the CPU needs to transmit to the GPU to complete the computing task $t_i$
$T_K(r)$: The execution time of task $t_i$ on a GPU
$T_T(Data(r \cdot e \cdot \frac{F}{N_B}))$: The transfer time of task $t_i$ between CPU and GPU
$T(r)$: The total execution time of the program

the number of thread blocks divided is more than this basic division
unit, the computing efficiency can be improved to a certain extent.
Therefore, for a given task, the task should contain a basic thread
block unit $e$ during GPU execution. If the number of thread blocks
$r \cdot e$ included in the computing task allocated to a GPU is more
than the basic thread block unit $e$, it can improve the performance of
the SIMD computing pipeline more effectively and hide the delay caused
by GPU memory access. The application Transpose (TP) can be used as an
example to illustrate the basic granularity of tasks. The TP program's
iteration space is 256 x 4096, comprising 4096 thread blocks. We divide
the original iteration space using a subspace of size 256 x 64 as the
basic task granularity. Assume that each GPU contains 16 SMs. We set the
size of each thread block on the GPU to 256. When the number of thread
blocks allocated to the GPU is not an integer multiple of the number of
SMs, a load imbalance will occur, and some SMs will become idle.
Therefore, during the execution of this application, the basic division
granularity of tasks is 16 x ($e$ x 256) = 64 x 256. There is a basic
thread block unit $e = 4$ for the task, which ensures that each SM on
the GPU participates in the calculation. If a computation task assigned
to a GPU includes $r \cdot e$ ($r = 1, 2, \ldots, 64$) thread blocks,
more than the basic thread block unit $e$, the computational density
inside

the SM can be improved to a certain extent.
It should be noted that the multiple GPUs contained in the system are
homogeneous, that is, their computing power is the same. Therefore, in
terms of the granularity of task division, the task allocation for GPUs
with the same computing power should be the same. Without loss of
generality, we assume that a kernel program contains $N_B$ thread blocks
and that $N_{GPU}$ GPUs in the system participate in the calculation.
So, the number of tasks assigned to each GPU is
$\frac{N_B}{N_{GPU} \cdot r \cdot e}$. The tasks executed on each GPU
can be modeled as a sequence of $\frac{N_B}{N_{GPU} \cdot r \cdot e}$
tasks $t_i$; $t = (t_1, t_2, \ldots, t_{N_B/(N_{GPU} \cdot r \cdot e)})$
represents the task sequence on a single GPU. Therefore, task $t_i$ will
be mapped to $N_{GPU}$ GPUs for execution. For
$1 \le i \le \frac{N_B}{N_{GPU} \cdot r \cdot e}$, $Type(t_i)$
represents the operation type of task $t_i$, and $Type(t_i) = \{K, T\}$.
$K$ indicates the computing operation of task $t_i$ on the GPU. $T$
indicates the communication operation of task $t_i$ between CPU and GPU.
The parallel execution time of a program depends on the longest
execution unit, which is the execution unit on the critical path.
Therefore, by establishing a spatio-temporal pipeline diagram of the
tasks' GPU execution operations and CPU-GPU communication operations, we
find the critical time that affects program execution time.
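The task-division arithmetic of this subsection can be illustrated with a minimal Python sketch. The function names are hypothetical, and the numeric values are assumptions chosen to echo the Transpose example (4096 thread blocks, 16 SMs); in particular, the sketch takes the basic unit as 64 thread blocks (16 SMs with 4 blocks each), which is one reading of $e$ and not a statement of the authors' implementation.

```python
def tasks_per_gpu(n_blocks: int, n_gpu: int, r: int, e: int) -> int:
    """Number of tasks assigned to each GPU: N_B / (N_GPU * r * e)."""
    blocks_per_round = n_gpu * r * e  # thread blocks consumed by one task on every GPU
    if n_blocks % blocks_per_round != 0:
        raise ValueError("thread blocks do not divide evenly among GPUs and tasks")
    return n_blocks // blocks_per_round


def idle_sms(blocks_on_gpu: int, n_sm: int) -> int:
    """SMs left without a thread block in the final round when blocks are
    dealt out evenly; a non-zero result signals load imbalance."""
    remainder = blocks_on_gpu % n_sm
    return 0 if remainder == 0 else n_sm - remainder


if __name__ == "__main__":
    N_B, N_GPU, N_SM = 4096, 4, 16   # assumed system: 4 GPUs with 16 SMs each
    E_UNIT = 64                      # assumed basic unit: 16 SMs x 4 blocks
    for r in (1, 2, 4):
        print(f"r={r}: {tasks_per_gpu(N_B, N_GPU, r, E_UNIT)} tasks per GPU")
    print("idle SMs with 70 blocks:", idle_sms(70, N_SM))
```

Larger values of $r$ give fewer, larger tasks per GPU; a block count that is not a multiple of the SM count leaves some SMs idle, which is the load-imbalance case described above.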

B. Modeling Parallel Warps
A task consists of multiple thread blocks. Each thread block consists of
many threads. During the execution of the task, the thread scheduler on
the GPU dynamically allocates computing tasks to idle SMs with thread
blocks as the granularity. Because different SMs complete computing
tasks independently, this article considers the SM as the processor core
of a single GPU. The threads in a thread block are organised and
dispatched to the SM for execution in units of warps. Task distribution
is completed by the compute scheduler, whose unit of distribution is the
CTA (cooperative thread array). CTA and block are names for the same
thing in the execution model and the programming model, respectively.
Because threads in a block use the same shared memory, all threads in a
CTA must be allocated to the same SM. The SM executes a CTA in units of
warps. All threads in a warp must belong to the same CTA. If only one
CTA is allocated to each SM, there is a high probability that several
warps belonging to the same CTA will enter a long-delay operation at the
same time, which makes the execution units in the SM wait for a long
time. When warps belonging to different CTAs are executed on the same
SM, if all the warps of one CTA enter a long-delay operation, there is a
high probability that warp instructions in other CTAs are in the ready
state. If multiple warp contexts are kept on an SM, latency can be
hidden better than with only one warp context. Therefore, an SM can hold
multiple active warps waiting to be executed at the same time. That is,
multiple warp contexts can exist on the same SM, but only one warp is
being executed at any time. The warps executed in parallel occupy the SM
by time multiplexing.
$W = (w_1, w_2, w_3, \ldots, w_{r \cdot e \times N_{warp}})$ represents
the warp sequence of a task $t_i$. For
$1 \le i \le r \cdot e \times N_{warp}$, $Type(w_i)$ represents the
operation type of warp $w_i$, and $Type(w_i) = \{C, M\}$. $C$


[Figure 1 labels: $N_B$ thread blocks contained in the kernel program; GPU 1, ..., GPU $N_{GPU}$; multiple GPUs system.]
Fig. 1: The hierarchical parallel execution mode.
represents a warp calculation operation, and $M$ represents a warp fetch
operation. By establishing a spatio-temporal pipeline diagram of warp
calculation operations and warp fetch operations, we find the critical
time that affects the execution time of tasks.
Our objective is to explore power efficiency hierarchically in
heterogeneous parallel systems. We analyse the parallel execution of
multiple warps at the thread level to get the execution time of the
task. Then we analyse the parallel execution of multiple tasks at the
program level to get the program execution time. We establish the task
parallel execution pipeline and the warp parallel execution pipeline in
layers to find the critical path that affects the final run-time of the
program. Through reasonable task division, reducing the SM frequency or
the memory access frequency on non-critical paths can minimise system
energy consumption without affecting performance.
IV. CRITICAL EXECUTION TIMES
A. Critical Execution Times in Warp Pipelines
The warp pipelines describe the relationship between the concurrent
execution of the calculation operations and fetch operations of multiple
warps on an SM. $C$ represents the calculation operation delay of each
memory access operation interval of a warp. The memory access operation
interval refers to the elapsed time from the end of the last memory
access to the beginning of the next memory access. $M$ indicates the
memory access bandwidth delay of the next memory access operation.
Therefore, operation $C$ should precede memory access operation $M$.
Since only one warp is being executed on an SM at any time, the
calculation operations $C$ of different warps in the pipelines cannot be
hidden from each other and need to be executed sequentially. The fetch
bandwidth delays caused by different fetch operations $M$ cannot be
hidden from each other either. Therefore, the calculation operation
delays, and likewise the fetch bandwidth delays, of the same warp and of
different warps cannot overlap one another in the pipelines.
The metrics used in parallel warp pipelines are as follows. The program
execution time consists of calculation operation delay and memory access
delay. $\alpha$ represents the average calculation operation delay of a
warp. The memory access delay is divided into the delay caused by
insufficient memory bandwidth and the delay caused by absolute memory
access time. The memory access bandwidth delay $\beta$ refers to the
delay caused by insufficient memory bandwidth. The absolute memory
access time $\gamma$ indicates the shortest time of a memory access
operation, that is, the time from starting a memory operation to
completing it. Specifically, it refers to the time from the issuance of
a read command to the completion of the instruction, when the data has
been read into the data buffer register.

The delay for a warp to issue an instruction is $S_{warp}/S_{simd}$
clock cycles, so the instruction clock cycle of a warp can be expressed
as $C_{wc} = N_{inst} \times \frac{S_{warp}}{S_{simd}}$. The average
calculation operation delay of a warp can be expressed as
$\alpha = \frac{C_{wc}}{f_c}$.
Since the off-chip memory bandwidth is shared by all processor cores
(SMs), the bandwidth allocated to each core is $\frac{M_{bw}}{N_{SM}}$.
The fetch bandwidth clock cycle of a warp is
$C_{wm} = \frac{N_{bytes} \times f_m}{M_{bw}/N_{SM}} = \frac{N_{bytes} \times f_m \times N_{SM}}{M_{bw}}$.
The average fetch bandwidth delay of a warp can be expressed as
$\beta = \frac{C_{wm}}{f_m} = \frac{N_{bytes} \times N_{SM}}{M_{bw}}$.
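As a rough illustration of the two delays just defined, the following Python sketch evaluates $\alpha$ and $\beta$ and compares them to label a kernel as memory-intensive ($\alpha \le \beta$) or warp calculation-intensive ($\alpha > \beta$). The class and function names and all numeric values are assumptions made for this example; they are not measurements from the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class WarpParams:
    n_inst: int     # N_inst: thread instructions per warp
    s_warp: int     # S_warp: warp size
    s_simd: int     # S_simd: SIMD width
    f_c: float      # f_c: SM operating frequency in Hz
    n_bytes: float  # N_bytes: bytes a warp fetches from off-chip memory
    n_sm: int       # N_SM: SMs taking part in the calculation
    m_bw: float     # M_bw: off-chip memory bandwidth in bytes per second


def alpha(p: WarpParams) -> float:
    """Average calculation delay of a warp: C_wc / f_c."""
    c_wc = p.n_inst * p.s_warp / p.s_simd
    return c_wc / p.f_c


def beta(p: WarpParams) -> float:
    """Average fetch bandwidth delay of a warp: N_bytes * N_SM / M_bw."""
    return p.n_bytes * p.n_sm / p.m_bw


def regime(p: WarpParams) -> str:
    """Label the kernel by comparing the two delays."""
    return "memory-intensive" if alpha(p) <= beta(p) else "calculation-intensive"


if __name__ == "__main__":
    heavy = WarpParams(n_inst=100, s_warp=32, s_simd=8, f_c=1e9,
                       n_bytes=128, n_sm=16, m_bw=1e11)
    light = replace(heavy, n_inst=4)  # same memory traffic, far fewer instructions
    for name, p in (("heavy", heavy), ("light", light)):
        print(name, alpha(p), beta(p), regime(p))
```

Note that $\beta$ does not depend on $f_m$: the memory frequency cancels between $C_{wm}$ and the division by $f_m$, so in this model only the bandwidth $M_{bw}$ and the number of active SMs set the bandwidth delay.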
(1) When $\alpha \le \beta$, the SM is not a critical path that
restricts the performance of the program. We call this situation a
memory-intensive program. At this time, reducing the core operating
frequency (increasing $\alpha$) or turning off some SMs can reduce
energy consumption without affecting the performance. To increase
$\alpha$, we must consider three cases of the distribution of parallel
warp pipelines.
Case 1: when $\alpha_1 + \gamma < \beta \times W_p$, the distribution of
parallel warp pipelines is shown in Figure 2(a). When the calculation
time $\alpha_1$ is increased, the total execution time of the kernel
depends on the fetch bandwidth delay and is independent of the
calculation time as long as the condition
$\alpha_1 + \gamma < \beta \times W_p$ is satisfied. In this situation,
the total execution time of the kernel does not increase.
Case 2: when $\alpha_2 + \gamma = \beta \times W_p$, the distribution of
parallel warp pipelines is shown in Figure 2(b). As we continue
increasing the calculation time $\alpha_2$, the total execution time of
the kernel still depends on the fetch bandwidth delay and is independent
of the calculation


time as long as $\alpha_2 + \gamma = \beta \times W_p$ is satisfied. In
this situation, the total execution time of the kernel does not
increase.
Case 3: when $\alpha_3 + \gamma > \beta \times W_p$, the distribution of
parallel warp pipelines is shown in Figure 2(c). As we continue
increasing the calculation time $\alpha_3$ beyond the point
$\alpha + \gamma = \beta \times W_p$, the calculation time can no longer
be completely hidden. So, the total execution time of the kernel
increases (the non-overlap in Figure 2(c) represents the increased
time).
From the above analysis, if the program is a memory-intensive program,
the condition under which we can reduce energy consumption without
affecting performance is

$$(\alpha \le \beta) \wedge (\alpha + \gamma \le \beta \times W_p). \quad (1)$$

The total execution time $T_{ka}(r)$ of the program kernel, which
depends on the time of the fetch operations, is

$$T_{ka}(r) = W_p \cdot \beta = W_p \cdot \frac{C_{wm}}{f_m} = \frac{W_p \times N_{bytes} \times N_{SM}}{M_{bw}}. \quad (2)$$

[Figure 2 panels, under the heading "Memory-intensive: $\alpha < \beta$": (a) Case 1: $\alpha_1 + \gamma < \beta \times W_p$; (b) Case 2: $\alpha_2 > \alpha_1$, $\alpha_2 + \gamma = \beta \times W_p$; (c) Case 3: $\alpha_3 > \alpha_2$, $\alpha_3 + \gamma > \beta \times W_p$. Each panel shows the C and M operations of warps 1 to $W_p$.]
Fig. 2: The distribution of memory-intensive parallel warp pipelines.

(2) When $\alpha > \beta$, the fetch bandwidth delay is not a critical
path that restricts the performance of the program. We call this
situation a warp calculation-intensive program. At this time, reducing
the memory access frequency (increasing $\gamma$) or turning off SMs can
reduce energy consumption without affecting performance. To increase
$\gamma$, we must consider three cases of the distribution of parallel
warp pipelines.
Case 1: when $\gamma_1 < \alpha \times (W_p - 1)$, the distribution of
parallel warp pipelines is shown in Figure 3(a). When the fetch
bandwidth delay $\beta_1$ is increased, the total execution time of the
kernel depends on the warp calculation time and is independent of the
fetch bandwidth delay as long as the condition
$\gamma_1 < \alpha \times (W_p - 1)$ is satisfied. In this situation,
the total execution time of the kernel does not increase.
Case 2: when $\gamma_2 = \alpha \times (W_p - 1)$, the distribution of
parallel warp pipelines is shown in Figure 3(b). As we continue
increasing the fetch bandwidth delay $\beta_2$, the total execution time
of the kernel still depends on the warp calculation time. It is
independent of the fetch bandwidth delay as long as the condition
$\gamma_2 = \alpha \times (W_p - 1)$ is satisfied. In this situation,
the total execution time of the kernel does not increase.
Case 3: when $\gamma_3 > \alpha \times (W_p - 1)$, the distribution of
parallel warp pipelines is shown in Figure 3(c). As we continue
increasing the fetch bandwidth delay $\beta_3$ beyond the point
$\gamma = \alpha \times (W_p - 1)$, the memory cycle can no longer be
completely hidden. So, the total execution time of the kernel increases
(the non-overlap in Figure 3(c) represents the increased time).
From the above analysis, if the program is a warp calculation-intensive
program, the condition under which we can reduce energy consumption
without affecting performance is

$$(\alpha > \beta) \wedge (\gamma \le \alpha \times (W_p - 1)). \quad (3)$$

The total execution time $T_{kb}(r)$ of the program kernel, which
depends on the time of the calculation operations, is

$$T_{kb}(r) = W_p \cdot \alpha = W_p \cdot \frac{C_{wc}}{f_c} = \frac{W_p \times N_{inst} \times S_{warp}}{S_{simd} \times f_c}. \quad (4)$$

In summary, the execution time of the kernel is

$$T_K(r) = T_{ka}(r) \wedge (\alpha \le \beta) \wedge (\alpha + \gamma \le \beta \times W_p) + T_{kb}(r) \wedge (\alpha > \beta) \wedge (\gamma \le \alpha \times (W_p - 1)). \quad (5)$$

B. Critical Execution Times in Task Pipelines

Task pipelines describe the relationship between the concurrent
execution of the perform operations and communication operations of
multiple tasks in a program. $K$ denotes the operation that a task
performs on the GPU, and $T$ denotes the communication operation. In a
heterogeneous system including a CPU and multiple GPUs, the CPU main
processor can transmit data to only one GPU at a time, since the main
memory is shared by all GPUs. Each GPU processor runs independently. To
exploit processor-level parallelism between multiple GPUs, the parallel
thread space of a program can usually be divided into tasks, and
different tasks can be scheduled and executed on different GPUs at the
same time. Note that each task requires the main processor to transmit
the corresponding data required to complete the

computing task, and the data is transmitted through the PCI-E bus. So,
the communication process is serial.

[Figure 3 panels, under the heading "Warp calculation-intensive: $\alpha > \beta$": (a) Case 1: $\gamma_1 < \alpha \times (W_p - 1)$; (b) Case 2: $(\beta_2 > \beta_1) \rightarrow (\gamma_2 > \gamma_1)$, $\gamma_2 = \alpha \times (W_p - 1)$; (c) Case 3: $(\beta_3 > \beta_2) \rightarrow (\gamma_3 > \gamma_2)$, $\gamma_3 > \alpha \times (W_p - 1)$. Each panel shows warps 1 to $W_p$ against time.]
Fig. 3: The distribution of calculation-intensive parallel warp pipelines.

The metrics used in parallel task pipelines are as follows. The amount
of data per thread block can be expressed as $\frac{F}{N_B}$. The amount
of data that the task $t_i$ needs to compute is
$r \cdot e \cdot \frac{F}{N_B}$.
The ratio of the time consumed in task computation to the time consumed
in task communication (denoted by $R$) is a fundamental factor
determining the distribution of the pipeline. For task $t_i$, we have

$$R(r) = \frac{T_K(r)}{T_T(Data(r \cdot e \cdot \frac{F}{N_B}))}. \quad (6)$$

(1) When $R(r) \ge N_{GPU}$, the communication time of the task is not a
critical path that restricts the performance of the program. We call
this situation a task calculation-intensive program. The distribution of
calculation-intensive parallel task pipelines is shown in Figure 4. Each
GPU processor participating in the calculation is always at full load,
so it can be considered that all data communication delays are hidden by
the GPU calculation tasks. Since the number of tasks assigned to each
GPU is $\frac{N_B}{N_{GPU} \cdot r \cdot e}$, the total execution time
of the program at this time is

$$T_1(r) = N_{GPU} \cdot T_T(Data(r \cdot e \cdot \frac{F}{N_B})) + \frac{N_B}{N_{GPU} \cdot r \cdot e} \cdot T_K(r). \quad (7)$$

[Figure 4 legend: task perform operation, communication operation, overlap. Rows: GPU 1 to GPU n; horizontal axis: time.]
Fig. 4: The distribution of calculation-intensive parallel task pipelines.

(2) When $R(r) < N_{GPU}$, the computation time of the task is not a
critical path that restricts the performance of the program. We call
this situation a communication-intensive program. The distribution of
communication-intensive parallel task pipelines is shown in Figure 5.
Each GPU processor participating in the calculation has an idle state,
and it can be considered that the communication delay of all tasks
cannot be hidden well by the tasks' calculation. The total execution
time of the program is

$$T_2(r) = \frac{N_B}{r \cdot e} \cdot T_T(Data(r \cdot e \cdot \frac{F}{N_B})) + T_K(r). \quad (8)$$

[Figure 5 legend: GPU processor idle. Rows: GPU 1 to GPU n; horizontal axis: time.]
Fig. 5: The distribution of communication-intensive parallel task pipelines.

In summary, the total execution time of the program is

$$T(r) = (R(r) \ge N_{GPU}) \wedge T_1(r) + (R(r) < N_{GPU}) \wedge T_2(r),$$

where

$$T_1(r) = N_{GPU} \cdot T_T(Data(r \cdot e \cdot \frac{F}{N_B})) + \frac{N_B}{N_{GPU} \cdot r \cdot e} \cdot T_K(r),$$
$$T_2(r) = \frac{N_B}{r \cdot e} \cdot T_T(Data(r \cdot e \cdot \frac{F}{N_B})) + T_K(r),$$
$$T_K(r) = T_{ka}(r) \wedge (\alpha \le \beta) \wedge (\alpha + \gamma \le \beta \times W_p) + T_{kb}(r) \wedge (\alpha > \beta) \wedge (\gamma \le \alpha \times (W_p - 1)). \quad (9)$$

V. HIERARCHICAL ENERGY MODEL

According to the laws of physics, energy consumption $E$ is the product
of power consumption $P$ and execution time $T$. Complementary
metal-oxide-semiconductor (CMOS) transistors are basic computer devices.
The relationship between the dynamic power consumption and the execution
frequency of the processor can be approximately expressed as
$P \propto f^3$ [23].

This article has been accepted for publication in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. This is the author's version which has not been fully edi
content may change prior to final publication. Citation info「mation: DOI 10.1109/TCAD.2022.3200528

During the implementation of a program, a task is usually mapped to a GPU for execution. Suppose there are $N_{GPU}$ GPUs in the system to participate in the calculation. A task refers to loop iterations without circular dependencies. In CPU+GPUs heterogeneous systems, the tasks we used refer to the thread blocks allocated to each GPU. We assume that a complete application contains $n$ tasks $t_i$; $t = (t_1, t_2, t_3, \ldots, t_n)$ represents the task sequence of an application. A task $t_i$ contains $r \cdot e$ thread blocks. The threads in the thread block are organised and dispatched to the SM on the GPU for execution in units of warp. A warp contains 32 threads, so we can say that a task $t_i$ contains $r \cdot e/32$ warps. $W_i = (w_1, w_2, w_3, \ldots, w_{r \cdot e/32})$ represents a warp sequence of a task. According to the analysis of the pipeline distribution of the warp, the calculation time of the task $t_i$ on a single GPU is equal to the calculation time $T_{ka}(r)$ of the parallel execution of the $r \cdot e/32$ warps contained in task $t_i$. The fetch time of task $t_i$ on a single GPU is equal to the fetch time $T_{kb}(r)$ of the parallel execution of the $r \cdot e/32$ warps contained in task $t_i$. For task $t_i$, the power consumption of the GPU processor at frequency $f_c$ is defined as $P_c(f_c)$, and the power consumption of the GPU memory at fetch frequency $f_m$ is defined as $P_m(f_m)$. The dynamic energy consumption of the system is

$$E_d = \sum_{j=1}^{N_{GPU}} \sum_{i=1}^{N_{SM}} (P_c)_{ij}(f_c) \cdot T_{ka}(r) + \sum_{j=1}^{N_{GPU}} \sum_{i=1}^{N_{SM}} (P_m)_{ij}(f_m) \cdot T_{kb}(r). \tag{10}$$

We assume that the static power consumption of a GPU is $P_s$ [23]. The remaining GPUs are in the shutdown or lowest power operating state, and their static power is ignored. The total program execution time is $T(r)$. It is assumed that the static power consumption of the GPU remains unchanged during program execution, so the static energy consumed by the system is

$$E_s = N_{GPU} \cdot P_s \cdot T(r). \tag{11}$$

In heterogeneous systems, although dynamic power is the main source of system power consumption, the frequent communication operations between the CPU processor and multiple GPU acceleration processors also impose non-negligible communication power overhead. It is assumed that the bus does not support dynamic voltage/frequency scaling technology, that is, the execution speed and power consumption of data communication operations are constant. Considering the PCI-E bus as a special type of functional unit, its power consumption overhead is $P_{b,1}$ during execution and $P_{b,0}$ in the idle state. Therefore, the communication power consumption of the system is

$$P_b = P_{b,1} + P_{b,0}. \tag{12}$$

The bus idle state time can be expressed as $T(r) - T_1(Data(r \cdot e \cdot \frac{F}{N_B})) \cdot N_{GPU}$. According to formula (12), the system communication energy consumption is

$$E_b = P_{b,1} \cdot T_1(Data(r \cdot e \cdot \frac{F}{N_B})) + P_{b,0} \cdot \left(T(r) - T_1(Data(r \cdot e \cdot \frac{F}{N_B})) \cdot N_{GPU}\right). \tag{13}$$

The total energy consumption of the system $E_t$, including dynamic energy consumption, static energy consumption, and communication energy consumption, is

$$E_t = E_d + E_s + E_b. \tag{14}$$

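By substituting Eqs. (10), (11), and (13) into Eq. (14), the objective of the energy optimisation problem is as follows

$$\begin{aligned}
E_t = {} & \sum_{j=1}^{N_{GPU}} \sum_{i=1}^{N_{SM}} (P_c)_{ij}(f_c) \cdot T_{ka}(r) + \sum_{j=1}^{N_{GPU}} \sum_{i=1}^{N_{SM}} (P_m)_{ij}(f_m) \cdot T_{kb}(r) \\
& + N_{GPU} \cdot P_s \cdot T(r) + P_{b,1} \cdot T_1(Data(r \cdot e \cdot \frac{F}{N_B})) \\
& + P_{b,0} \cdot \left(T(r) - T_1(Data(r \cdot e \cdot \frac{F}{N_B})) \cdot N_{GPU}\right).
\end{aligned} \tag{15}$$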
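As an illustration only, the two timing cases of Eq. (9) and the static and communication terms of Eqs. (11), (13), and (14) can be written as a short Python sketch. The class and function names are hypothetical, the number of tasks $N_B/(r \cdot e)$ and the per-task times are passed in directly as assumed measurements, and the dynamic energy $E_d$ of Eq. (10) is taken as a given input rather than modelled.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # All field names are illustrative assumptions, not the authors' code.
    n_gpu: int        # N_GPU, GPUs taking part in the calculation
    n_tasks: float    # N_B / (r * e), number of tasks in the program
    t_comm: float     # T_1(Data(r * e * F / N_B)), transfer time of one task
    t_kernel: float   # T_k(r), kernel time of one task on one GPU


def total_time(w: Workload) -> float:
    """Total execution time T(r) following the two cases of Eq. (9)."""
    ratio = w.t_kernel / w.t_comm                      # R(r), Eq. (6)
    if ratio >= w.n_gpu:                               # calculation-intensive, Eq. (7)
        return w.n_gpu * w.t_comm + (w.n_tasks / w.n_gpu) * w.t_kernel
    return w.n_tasks * w.t_comm + w.t_kernel           # communication-intensive, Eq. (8)


def total_energy(w: Workload, dynamic_energy: float, p_static: float,
                 p_bus_busy: float, p_bus_idle: float) -> float:
    """E_t = E_d + E_s + E_b (Eq. (14)); E_d is supplied by the caller."""
    t = total_time(w)
    e_static = w.n_gpu * p_static * t                  # Eq. (11)
    bus_idle_time = t - w.t_comm * w.n_gpu             # idle time used in Eq. (13)
    e_bus = p_bus_busy * w.t_comm + p_bus_idle * bus_idle_time
    return dynamic_energy + e_static + e_bus
```

With two GPUs, eight tasks, a transfer time of 1 and a kernel time of 4 (arbitrary units), the ratio is 4, so the calculation-intensive branch applies.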

Performance is an important constraint in energy optimisation. Performance here refers to the execution time of the application after energy optimisation. If the performance constraint satisfies (9), the power consumption can be reduced without affecting the performance. The energy optimisation problem can be modeled as

$$\begin{aligned}
\text{Minimisation} \quad & E_t = E_d + E_b + E_s, \\
\text{Subject to} \quad & T(r) = \begin{cases} T_1(r), & (R(r) \ge N_{GPU}), \\ T_2(r), & (R(r) < N_{GPU}). \end{cases}
\end{aligned} \tag{16}$$


VI. CONSTRAINTS FOR PERFORMANCE AND ENERGY
From (15), reducing the processor frequency $f_c$ or the memory frequency $f_m$ can reduce the processor power consumption $P_c$ and the memory power consumption $P_m$. The four parameters $N_{GPU}$, $r$, $N_{SM}$, $W_p$ have a direct influence on the processor operating frequency $f_c$ and program runtime $T(r)$. Therefore, the key to energy consumption optimisation is to determine the effects of these parameters on program execution performance and energy consumption. In a multi-GPUs environment, we discuss the parameter value ranges under load balancing and load imbalance conditions.
A. Constraints for Load Balance
During the parallel execution of tasks, when the number of tasks in the application is greater than the number of GPUs in the system, that is, $\frac{N_B}{r \cdot e} > (N_{GPU})_{max}$, all GPUs can run at full capacity. At this time, the GPUs can basically guarantee load balance. The number of GPUs $N_{GPU}$ participating in the calculation in formula (15) ranges over $1 \le N_{GPU} \le (N_{GPU})_{max}$.

During the execution of tasks on a GPU, when tasks contain more thread blocks than the number of SM processor cores in the GPU, that is, $r \cdot e > (N_{SM})_{max}$, there are enough parallel thread blocks to fill the SMs in the GPU. Then, the number of processor cores $N_{SM}$ in a single GPU ranges over $1 \le N_{SM} \le (N_{SM})_{max}$, and the number of warps concurrently executed on one SM is $W_p = (W_p)_{max}$. $(W_p)_{max}$ represents the maximum number of warps allowed for concurrent execution by hardware resources, which is determined by the hardware resources of the SM and the amount of hardware resources used by kernel programs.

When the number of running thread blocks included in tasks is less than the maximum allowed by the SMs, that is, $r \cdot e < (N_{SM})_{max}$, there must be $[(N_{SM})_{max} - r \cdot e]$ SMs in an idle state. The range of the number of SMs is modified to $1 \le N_{SM} \le r \cdot e$. The parameter $W_p$ is related to the value of $N_{SM}$. $N_{warp}$ denotes the number of warps contained in each thread block. According to the previous analysis, we know that $\frac{r \cdot e \times N_{warp}}{32 \times N_{SM}}$ represents the average number of warps executed by each SM in parallel. Therefore, $W_p = \min\left((W_p)_{max}, \frac{r \cdot e \times N_{warp}}{32 \times N_{SM}}\right)$.
B. Constraints for Load Imbalance
In some applications, due to the effect of the problem size, the number of tasks included in the application may be less than the number of GPUs in the system, that is, $\frac{N_B}{r \cdot e} < (N_{GPU})_{max}$. At this time, the task load will be imbalanced in a multi-GPU environment. There are $\frac{N_B}{r \cdot e}$ GPUs participating in the calculation, and the remaining $[(N_{GPU})_{max} - \frac{N_B}{r \cdot e}]$ GPUs are idle. Our strategy is to turn off the processor cores of these GPUs to reduce the overhead of static power consumption. In (15), the value range of the number of GPUs participating in the calculation $N_{GPU}$ should be modified to $1 \le N_{GPU} \le \frac{N_B}{r \cdot e}$.

Then, in the process of executing a single task on a GPU, as analyzed in the previous section, when $r \cdot e > (N_{SM})_{max}$, the number of processor cores $N_{SM}$ in a single GPU ranges over $1 \le N_{SM} \le (N_{SM})_{max}$, and the number of warps concurrently executed on one SM is $W_p = (W_p)_{max}$. When the number of running thread blocks included in tasks is less than the maximum allowed by the SMs, that is, $r \cdot e < (N_{SM})_{max}$, the number of processor cores $N_{SM}$ in a single GPU ranges over $1 \le N_{SM} \le r \cdot e$, and the number of warps concurrently executed on one SM is $W_p = \min\left((W_p)_{max}, \frac{r \cdot e \times N_{warp}}{32 \times N_{SM}}\right)$.
Suppose $(f_c)_{low}$ and $(f_c)_{high}$ represent the lowest and highest value of the SM operating frequency, respectively; $(f_m)_{low}$ and $(f_m)_{high}$ represent the lowest and highest value of the memory fetch frequency, respectively; $r_{max}$ is the maximum number of thread blocks contained in tasks on a GPU. The range of $N_{GPU}$, $N_{SM}$, $W_p$ can be expressed as

$$\left(1 \le N_{GPU} \le (N_{GPU})_{max}\right) \wedge \left(\frac{N_B}{r \cdot e} \ge (N_{GPU})_{max}\right) \vee \left(1 \le N_{GPU} \le \frac{N_B}{r \cdot e}\right) \wedge \left(\frac{N_B}{r \cdot e} < (N_{GPU})_{max}\right) \tag{17}$$

$$\left(1 \le N_{SM} \le (N_{SM})_{max}\right) \wedge \left(r \cdot e \ge (N_{SM})_{max}\right) \vee \left(1 \le N_{SM} \le r \cdot e\right) \wedge \left(r \cdot e < (N_{SM})_{max}\right). \tag{18}$$
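The ranges in (17) and (18), together with the bound on $W_p$ derived for the load-balance and load-imbalance cases, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the product $r \cdot e$ is passed as a single argument `blocks_per_task`, and a fractional task count is rounded down when it caps the number of GPUs.

```python
def gpu_range(n_blocks: int, blocks_per_task: int, n_gpu_max: int) -> tuple:
    """Admissible (low, high) for N_GPU, following (17)."""
    tasks = n_blocks / blocks_per_task          # N_B / (r * e)
    if tasks >= n_gpu_max:
        return (1, n_gpu_max)
    return (1, int(tasks))                      # assumption: round down to whole GPUs


def sm_range(blocks_per_task: int, n_sm_max: int) -> tuple:
    """Admissible (low, high) for N_SM, following (18)."""
    if blocks_per_task >= n_sm_max:
        return (1, n_sm_max)
    return (1, blocks_per_task)


def warps_per_sm(blocks_per_task: int, n_warp: int, n_sm: int,
                 n_sm_max: int, wp_max: int) -> float:
    """W_p: the hardware limit when the SMs are filled, otherwise the
    smaller of the limit and the average r*e*N_warp / (32*N_SM)."""
    if blocks_per_task >= n_sm_max:
        return wp_max
    return min(wp_max, blocks_per_task * n_warp / (32 * n_sm))
```

For example, 40 thread blocks split into tasks of 20 blocks give two tasks, so on a four-GPU system at most two GPUs are useful.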

Algorithm 1 Dynamic frequency regulation

Input: Parallel programs consisting of multiple tasks;
Output: $f_c$, $f_m$;
1: $N_{GPU} = (N_{GPU})_{max}$, $N_{SM} = (N_{SM})_{max}$, $f_c = f_{c0}$, $f_m = f_{m0}$;
2: Execution is repeated at an interval of $N_B$ thread blocks until the program is completed
3: Executing $r \cdot e$ thread blocks when frequency regulation is not allowed;
4: $T_{ka}(r)$ = the current kernel calculation time;
5: $T_{kb}(r)$ = the memory access time;
6: $T(r)$ = the program execution time;
7: $P_t$ = the total power consumption of the system;
8: if $R(r) \ge (N_{GPU})_{max}$ then
9:     $N_{SM} = (N_{SM})_{max}$, $N_{GPU} = (N_{GPU})_{max}$;
10:    Executing $(N_{GPU})_{max} \times r \cdot e$ thread blocks when the frequency regulation is allowed;
11:    $T_{ka}(r)'$ = the current kernel calculation time;
12:    $T_{kb}(r)'$ = the memory access time;
13:    $T(r)'$ = the program execution time;
14:    $P_t'$ = the total power consumption of the system;
15: else
16:    $N_{SM} = (N_{SM})_{max}$, $N_{GPU} = \lceil R(r) \rceil$;
17:    Executing $\lceil R(r) \rceil \times r \cdot e$ thread blocks when the frequency regulation is allowed;
18:    $T_{ka}(r)'$ = the current kernel calculation time;
19:    $T_{kb}(r)'$ = the memory access time;
20:    $T(r)'$ = the program execution time;
21:    $P_t'$ = the total power consumption of the system;
22: end if
23: while $T(r)' \le T(r)$ do
24:    if $P_t' \, T(r)' > P_t \, T(r)$ then
25:        Executing the remaining threads $N_B - 2 \times (N_{GPU})_{max} \times r \cdot e$ in the current task division mode and frequency, and do not allow frequency adjustment;
26:    else
27:        Executing the remaining threads $N_B - ((N_{GPU})_{max} + \lceil R(r) \rceil) \times r \cdot e$ in the current task division mode and frequency, and do not allow frequency adjustment;
28:        $\mu_c = T_{ka}(r) / T_{ka}(r)'$;
29:        $\mu_m = T_{kb}(r) / T_{kb}(r)'$;
30:        $f_{c1} = \mu_c \times f_c$;
31:        $f_{m1} = \mu_m \times f_m$;
32:        $C$ points are uniformly sampled as frequency scaling factors within the interval of $[\mu_c, 1]$;
33:        $C$ points are uniformly sampled as frequency scaling factors within the interval of $[\mu_m, 1]$;
34:        Executing $r$ thread blocks repeatedly, record the dynamic power consumption $P_i$ and the program execution time of each execution $T_i$ ($1 \le i \le C$). The execution time and dynamic power consumption curves are fitted by polynomial;
35:        minimise $P_x \times T_x$ ← select $\mu_{cx}$ and $\mu_{mx}$;
36:    end if
37: end while
38: if $P_x T_x < P_t T_t$ then
39:    $f_{cx} = \mu_{cx} \times f_c$;
40:    $f_{mx} = \mu_{mx} \times f_m$;
41:    $f_c = f_{cx}$;
42:    $f_m = f_{mx}$;
43: else
44:    $f_c = f_{c1}$;
45:    $f_m = f_{m1}$;
46: end if
47: return $f_c$, $f_m$;
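The sampling and selection steps of Algorithm 1 (uniformly sampling $C$ scaling factors and choosing the one with the smallest power-time product) can be illustrated with a simplified Python sketch. It is not the authors' implementation: the polynomial curve fitting is replaced by a direct search over the measured samples, the explicit time budget is an added assumption that stands in for the performance constraint, and all names are hypothetical.

```python
def scaling_factors(mu_low: float, count: int) -> list:
    """Uniformly sample `count` frequency scaling factors in [mu_low, 1]."""
    if count == 1:
        return [1.0]
    step = (1.0 - mu_low) / (count - 1)
    return [mu_low + i * step for i in range(count)]


def pick_factor(samples: list, time_budget: float):
    """Pick the scaling factor with the smallest power * time product.

    `samples` holds (mu, power, time) tuples measured at each sampled
    factor. Samples slower than `time_budget` are discarded (assumption:
    this models the no-performance-loss constraint). Returns None when no
    sample meets the budget.
    """
    feasible = [s for s in samples if s[2] <= time_budget]
    if not feasible:
        return None
    best = min(feasible, key=lambda s: s[1] * s[2])
    return best[0]
```

In this simplified form the chosen factor would then scale the current frequency, as in lines 39 and 40 of the listing.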

$$\begin{aligned}
& \left(W_p = (W_p)_{max}\right) \wedge \left(r \cdot e \ge (N_{SM})_{max}\right) \vee \\
& \left(W_p = \min\left((W_p)_{max}, \frac{r \cdot e \times N_{warp}}{32 \times N_{SM}}\right)\right) \wedge \left(r \cdot e < (N_{SM})_{max}\right).
\end{aligned} \tag{19}$$

VII. DYNAMIC FREQUENCY REGULATION METHODOLOGY

We assume that the static power consumption of the GPU remains unchanged during execution, so the total static energy consumed by multiple GPUs is $E = N_{GPU} \cdot P_s \cdot T(r)$. Since $P_s$ remains unchanged during program execution, there is $E \propto N_{GPU} \cdot T(r)$. In other words, the static energy consumption generated by multiple GPUs is proportional to the rectangular area surrounded by the abscissa and ordinate in the parallel task pipelines spatio-temporal graph. For each possible number of GPUs $N_{GPU}$, we find $r$ in the range of $[1, \frac{N_B}{N_{GPU} \cdot e}]$ to minimize the total kernel execution time $T(r)$ and record the results as $(k_1, S_1), (k_2, S_2), \ldots, (k_{N_{GPU}}, S_{N_{GPU}})$. $S_i$ ($1 \le i \le N_{GPU}$) represents the shortest execution time that the program can obtain when the system contains $i$ GPUs. $k_i$ represents the granularity of task division when optimal performance is achieved. It should be noted that if $R(r) \ge i$ is obtained for optimal performance, all $i$ GPUs must participate in the calculation; when $R(r) < N_{GPU}$, the execution time of the program is independent of the number of GPUs $N_{GPU}$. In fact, it is not difficult to see from the spatio-temporal diagram that the system only needs $\lceil R(r) \rceil$ GPUs to completely hide the data communication behind the computing tasks. At this time, using more GPUs will only increase the number of active GPUs without reducing the execution time of the program. Therefore, only $\lceil R(r) \rceil$ GPUs are required to participate in the calculation, and the static energy consumption of the system is the lowest.

For the value of $N_{SM}$, we analyzed it earlier. When the number of thread blocks on a single GPU is much larger than the number of SMs, that is, $r \cdot e > (N_{SM})_{max}$, there are enough parallel tasks occupying all SMs. In this case, all SMs on the GPU participate in the calculation to ensure the load balance between SMs. Thus, $N_{SM} = (N_{SM})_{max}$. In some applications, due to the impact of factors such as problem scale or application concurrency, the number of threads running may be less than the maximum allowed by the SMs, that is, $r \cdot e < (N_{SM})_{max}$; then there must be $(N_{SM})_{max} - r \cdot e$ SMs in an idle state. Obviously, closing the idle SMs can reduce the static power consumption.

Our approach samples and counts the data periodically. By estimating the increase of power consumption, the appropriate frequency is calculated to guide real-time frequency regulation and task partitioning on the processor, to achieve the optimal power consumption. We establish the repetition period in the process. Each repetition period contains multiple sampling periods and a sampling guidance period. In terms of the data from the sampling period, the proper frequency values of processor and memory are calculated to guide the tasks for the rest of the sampling guidance periods.

Suppose the system contains $(N_{GPU})_{max}$ GPUs, each of which contains $(N_{SM})_{max}$ SMs, and the initial operating frequencies of the processor and memory are $f_{c0}$ and $f_{m0}$, respectively. The number of thread blocks that can run on a single GPU ($r$) is derived through analysing the program code of the original application. Our strategy is to set the number of thread blocks used by the whole program $N_B$ as the repetition period, and the data are counted every $r$ thread blocks. The execution of the $r$ thread blocks becomes a sampling period. We use $\left(\frac{N_B}{r \cdot e} > (N_{GPU})_{max}\right) \wedge \left(r \cdot e > (N_{SM})_{max}\right)$ as an example to introduce the whole execution process. Algorithm 1 proceeds as follows. According to the previous analysis, we know $N_{SM} = (N_{SM})_{max}$. When $R(r) \ge (N_{GPU})_{max}$, all $N_{GPU}$ GPUs should participate in the calculation. When $R(r) < (N_{GPU})_{max}$, only $\lceil R(r) \rceil$ GPUs should participate in the calculation.

Step 1: When $r \cdot e > (N_{SM})_{max}$, in the initial state ($N_{GPU} = (N_{GPU})_{max}$, $N_{SM} = (N_{SM})_{max}$, $f_c = f_{c0}$, $f_m = f_{m0}$), the first $(N_{GPU})_{max} \times r \cdot e$ thread blocks are executed, and the current kernel calculation time $T_{ka}(r)$, memory access time $T_{kb}(r)$, and the total power consumption of the system $P_t$ are recorded (Lines 1-7).

Step 2: If $R(r) \ge (N_{GPU})_{max}$, then $N'_{SM} = (N_{SM})_{max}$, $N'_{GPU} = (N_{GPU})_{max}$. We use the same thread space settings as in step 1 to execute $(N_{GPU})_{max} \times r \cdot e$ thread blocks. The program calculation time $T_{ka}(r)'$, memory access time $T_{kb}(r)'$, and the total power consumption of the system $P_t'$ are recorded (Lines 8-16). If $R(r) < (N_{GPU})_{max}$, then $N'_{SM} = (N_{SM})_{max}$, $N'_{GPU} = \lceil R(r) \rceil$, and $\lceil R(r) \rceil \times r \cdot e$ thread blocks are executed. The program calculation time $T_{ka}(r)'$, memory access time $T_{kb}(r)'$, and the total power consumption of the system $P_t'$ are also recorded (Lines 16-21).

Step 3: In the remaining sampling guidance period $N_B - 2 \times (N_{GPU})_{max} \times r \cdot e$ or $N_B - ((N_{GPU})_{max} + \lceil R(r) \rceil) \times r \cdot e$, we use the factors $\mu_c = \frac{T_{ka}(r)}{T_{ka}(r)'}$ and $\mu_m = \frac{T_{kb}(r)}{T_{kb}(r)'}$ to adjust the processor frequency and the memory frequency, respectively. Repeat step 2 until the total execution time of the application program satisfies $T(r)' \le T(r)$. Note that the operating frequency of the processor is $f_{c1}$ and the memory access frequency is $f_{m1}$, therefore $f_{c1} = \mu_c \times f_c$, $f_{m1} = \mu_m \times f_m$ (Lines 17-26).

Step 4: We uniformly sample $C$ points in the interval $[\mu_c, 1]$ and the interval $[\mu_m, 1]$, respectively. Repeat executing $r$ thread blocks, and record the dynamic power consumption $P_i$ and the program execution time of each execution $T_i$ ($1 \le i \le C$). The execution time and dynamic power consumption curves are fitted by polynomial. The frequency tuning factors $\mu_{cx}$ and $\mu_{mx}$ are selected to minimize the total energy consumption $P_x T_x$. If $P_x T_x < P_t T_t$, the current processor frequency $f_{cx}$ and memory frequency $f_{mx}$ are used; otherwise the original processor frequency $f_c$ and memory frequency $f_m$ are used (Lines 27-44).

Similarly, according to the analysis in Section VI, in other cases, we can also dynamically adjust the frequencies of the program according to the algorithm with different parameter values, to achieve the optimal energy consumption without loss of performance.

VIII. APPLICATION CASE ANALYSIS

We take a specific program as an example to introduce the proposed model-based power optimisation process. The

key parameters of model solution mainly include system parameters, program parameters, profile parameters, indirect parameters, and solution parameters (Table II).

TABLE II: Model based parameter classification
Category               Parameter name
System parameters      $S_{warp}$, $S_{simd}$, $\gamma$
Program parameters     $W_p$, $N_{warp}$, $r$, $N_{inst}$, $F$, $N_B$
Profile parameters     $T_1(Data(r \cdot e \cdot \frac{F}{N_B}))$
Indirect parameters    $N_{bytes}$, $C_{wc}$, $C_{wm}$, $T_{ka}(r)$, $T_{kb}(r)$, $T_k(r)$, $T_1(r)$, $T_2(r)$, $T(r)$
Solving parameters     $N_{SM}$, $f_c$, $f_m$, $N_{GPU}$

Take the program Matrixmul (MM) as an example and execute it on the simulated Quadro FX5600. System parameters: The SIMD width within the GPU is $S_{simd} = 8$, the size of a warp is $S_{warp} = 32$, and the memory cycle of a fetch operation is $\gamma = 420$ cycles. Program parameters: The delay of a command sent by a warp is four cycles. If the pipeline depth of the simulated GPU is 24, the pipeline delay can be effectively hidden when the number of concurrent warps on an SM exceeds six (24/4).

In the MM application, the number of warps contained in each thread block is $N_{warp} = \frac{N_{block} \times N_{thread}}{S_{warp}} = \frac{40 \times 256}{32}$. Suppose that in the MM application, 256 KB is used as the basic granularity of each of the tasks, and the total task is divided into two subtasks, each of which contains 20 thread blocks, so the average number of warps executed per SM for a task is $W_p = \frac{20 \times 256 / 32}{16} = 10$. The CUDA programming language provides a platform-independent intermediate code, the PTX instruction, and the execution of the simulator is driven by PTX instructions, so we use PTX instructions to analyse the program parameters. Through the analysis, we find the number of computing instructions $N_{c\text{-}inst} = 228$ and the number of accessing instructions $N_{m\text{-}inst} = 7$ in each thread of the MM program, so the instruction clock cycle of a warp is $C_{wc} = N_{c\text{-}inst} \times \frac{S_{warp}}{S_{simd}} = 228 \times \frac{32}{8} = 912$ cycles.

The bandwidth per memory chip of the Quadro FX5600 is 4 bytes/cycle. The number of bytes of a warp accessing off-chip memory is $N_{bytes} = N_{m\text{-}inst} \times 4 \text{ bytes/cycle} \times S_{warp} = 7 \times 4 \text{ bytes/cycle} \times 32 = 896$ bytes.
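The arithmetic of this MM example can be checked with a few lines of Python. The function names are illustrative only; the numeric inputs are the values quoted in the text for the simulated Quadro FX5600.

```python
def min_warps_to_hide_pipeline(pipeline_depth: int, issue_delay: int) -> int:
    """Concurrent warps needed on an SM to cover the pipeline depth."""
    return pipeline_depth // issue_delay


def warp_compute_cycles(n_c_inst: int, s_warp: int, s_simd: int) -> int:
    """C_wc = N_c-inst * S_warp / S_simd."""
    return n_c_inst * s_warp // s_simd


def warp_bytes(n_m_inst: int, bytes_per_cycle: int, s_warp: int) -> int:
    """N_bytes = N_m-inst * bandwidth per memory chip * S_warp."""
    return n_m_inst * bytes_per_cycle * s_warp


def avg_warps_per_sm(blocks: int, threads_per_block: int,
                     s_warp: int, n_sm: int) -> float:
    """Average number of warps per SM for one task."""
    return blocks * threads_per_block / s_warp / n_sm


print(min_warps_to_hide_pipeline(24, 4))   # pipeline depth 24, issue delay 4
print(warp_compute_cycles(228, 32, 8))     # cycles per warp
print(warp_bytes(7, 4, 32))                # bytes per warp
print(avg_warps_per_sm(20, 256, 32, 16))   # 20 blocks of 256 threads on 16 SMs
```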
According to the system parameters and program parameters, we find an expression for the indirect parameters through the previous analysis. By introducing the system parameters, program parameters, and indirect parameters into (15), we can get the solution parameters: the optimal number of active cores $N_{SM}$, the optimal number of GPUs involved in the calculation $N_{GPU}$, the core operating frequency $f_c$, and the memory access frequency $f_m$. In the solution process, we use a dynamic frequency regulation algorithm to find the optimal solution. This method starts from the highest configuration (all SM cores participate in the calculation, all GPUs participate in the calculation, and the processor runs at the highest frequency), and searches through the dichotomy in four dimensions: the number of SMs, the number of GPUs involved in the calculation, the SM operating frequency, and the memory access frequency.


IX. EXPERIMENTAL EVALUATION
A. Experimental Platform and Test Cases
GPGPU-Sim is a mature GPU simulator at this time. The version of GPGPU-Sim used is v3.2.2. It is a clock-level performance simulator, which supports the CUDA PTX (parallel thread execution) instruction set and can run CUDA and OpenCL programs. However, for the experimental environment needed in this paper, GPGPU-Sim has two shortcomings: 1) it does not support a multi-GPUs environment, and 2) it does not support the performance simulation of data communication. Therefore, we modify GPGPU-Sim, add the performance statistics of the data communication part, and implement a simple simulator in the application layer (the CPU side of the CUDA program), which simulates the simultaneous execution of multiple GPUs in an event-driven manner. In addition, GPGPU-Sim configures the GPU by reading the configuration file at runtime. Therefore, we can simulate the multi-GPUs environment by dynamically modifying the configuration file in the program.

For the interconnection network between CPU and GPU, we refer to the power consumption modeling method used in PowerRed [24]. For the network topology between multiple GPUs, we use the communication event-driven method to simulate the communication between multiple GPUs. We maintain a global event list in the CUDA main program. Each event contains the following four pieces of information: time stamp, GPU number, event type, and parameter list. "Time stamp" indicates the clock cycle of the event. "GPU number" refers to the GPU to which the event belongs. "Event type" includes four categories: communication start event, communication end event, calculation start event, and calculation end event. "Parameter list" refers to the parameter list required for the CUDA program to call the communication operation and kernel calculation. According to the general method of event-driven simulation, the event with the smallest time stamp is taken from the global event list every time.
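The event list described above can be sketched in Python with a priority queue keyed on the time stamp. This is only an illustration of the general event-driven pattern, not the modified GPGPU-Sim code: the `Event` class, the event-type strings, and the convention that a start event carries its duration as the first parameter are all assumptions.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    timestamp: int                              # clock cycle of the event
    gpu: int = field(compare=False)             # GPU the event belongs to
    kind: str = field(compare=False)            # e.g. "comm_start", "calc_end"
    params: tuple = field(compare=False, default=())


def simulate(initial_events):
    """Always process the event with the smallest time stamp next."""
    queue = list(initial_events)
    heapq.heapify(queue)
    trace = []
    while queue:
        ev = heapq.heappop(queue)
        trace.append((ev.timestamp, ev.gpu, ev.kind))
        if ev.kind.endswith("_start"):
            duration = ev.params[0]             # assumption: first parameter is the duration
            end_kind = ev.kind.replace("_start", "_end")
            heapq.heappush(queue, Event(ev.timestamp + duration, ev.gpu, end_kind))
    return trace
```

The sketch does not model the serial PCI-E bus; a fuller simulator would delay a communication start event until the bus is free.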
GPGPU-Sim provides a flexible architecture-level configuration method, covering the core number, core frequency, access bandwidth, and other important structural parameters. In the experiment, we simulated three high-performance GPUs from NVIDIA: Quadro FX5600, GeForce 8800GT, and GeForce GTX 280. The hardware configuration parameters are given in Table III. The real processor only supports discrete frequency regulation, so we define nine discrete frequency values with 100 MHz as the step.

This paper selects nine common test cases from the NVIDIA CUDA software development kit that are used in much scientific computing. Table IV lists the relevant parameters of each calculation core function, where thread represents the number of threads contained in each block, and Comp inst and Mem inst represent the number of calculation instructions and memory access instructions in each thread, respectively.
TABLE III: Simulator configuration parameters
Parameter                           Quadro FX5600                    GeForce 8800GT                   GeForce GTX 280
Shader core                         16                               14                               30
SIMD pipeline width                 8                                8                                8
Shared memory                       16 KB                            16 KB                            16 KB
Memory chip                         16                               16                               16
Bandwidth per memory chip           4 bytes/cycle                    4 bytes/cycle                    4 bytes/cycle
Fetch latency                       420                              420                              450
Shader core frequency               1.35 GHz                         1.5 GHz                          1.3 GHz
Memory frequency                    1.6 GHz                          1.8 GHz                          1.1 GHz
Shader core frequency regulation    0.5-1.35 GHz, step = 100 MHz     0.65-1.5 GHz, step = 100 MHz     0.45-1.3 GHz, step = 100 MHz
Memory frequency regulation         0.75-1.6 GHz, step = 100 MHz     0.95-1.8 GHz, step = 100 MHz     0.25-1.1 GHz, step = 100 MHz

TABLE IV: Kernel parameters
Kernel              Data size      block   thread   Comp inst   Mem inst   Data/thread block         Data/r thread block
Bitonic(BI)         256            1       256      610         2          256 KB                    256r KB
BlackScholes(BS)    2000000        480     128      3350        163        4 MB + 67 KB              4r MB + 67 KB
fwtBatch1(FB)       8M             4096    512      302         8          2 KB                      2r KB
Matrixmul(MM)       128 x 80       40      256      228         7          256 KB                    256r KB
RandomGPU(RG)       4096 x 5860    32      128      287268      5864       128 x (5 MB + 740 KB)     128 x (5r MB + 740 KB)
scalarProd(SP)      256 x 4096     128     256      655         64         8 MB                      8r MB
Scan_best(SB)       512            1       512      295         4          512 KB                    512r KB
dwtHaar1D(DH)       4096           4       512      142         4          1 MB                      1r MB
Transpose(TP)       256 x 4096     4096    256      52          2          256 KB                    256r KB

B. Single GPU Environment
We evaluate and analyse the energy optimisation effect of the task in a single GPU environment at the thread level. For a single task, according to the model we proposed, we obtain the optimal number of active SMs $N_{SM}$, the optimal SM operating frequency $f_c$, and the memory access frequency $f_m$ on a single GPU (Table V). The values in the angle brackets in Table V, in turn, represent the number of active SMs $N_{SM}$, the SM frequency operation level $f_c$-OL, and the memory access frequency operation level $f_m$-OL. The value range of the $f_c$-OL operation level is 0-8, and that of the $f_m$-OL operation level is 0-4. The smaller the number, the lower the frequency. As seen from Table IV, the five programs BI, BS, FB, SB, and TP are computationally intensive, so they are limited by the SMs in the GPU, and the main power optimisation space is to reduce the frequency of memory. BI, SB, and DH have a very low level of memory frequency operation, because the number of blocks in these three applications is very small, and only part of the SMs is started when the program is running. In this case, the number of active SMs in BI, SB, and DH is far lower than that in BS, FB, and TP. According to the theoretical model, the storage bandwidth allocated to each warp is very high, so to save power, we need to reduce the frequency of memory to match the running speed of the processor. MM, RG, SP, and DH are memory-intensive applications, so they are limited by the memory in the GPU. The main power optimisation space is to reduce the frequency of the SM. To ensure little performance loss in the process of energy consumption reduction, it can be seen from Table V that the number of SM cores will be lower upon the increase of the SM operation level and vice versa.

Figure 6 shows the ratio of computing energy consumption and memory energy consumption of SMs in a single GPU to the original energy consumption after adjusting the parameters in Table II. From the figure, we can see that the memory energy consumption of BI, SB and DH is far less than that of SMs. This is consistent with the conclusion in Table V that the optimal number of active SMs $N_{SM}$ of BI, SB and DH on a single GPU is smaller, resulting in a lower memory frequency operation level. MM, RG, SP and DH consume less computing energy than BS, FB and TP on a single GPU. This is mainly because the power optimization space of MM, RG, SP and DH is to reduce the frequency of the processor core, which leads to the reduction of computing energy consumption. To sum up, it can be seen from Figure 6 that by properly selecting the number of SMs involved in the calculation and adjusting the core frequency of the processor and the frequency of the memory, the energy consumption of each application program is decreased to a different degree, and the average energy consumption can be reduced by up to about 29%.

C. Multiple GPUs Environment
We evaluate and analyse the energy optimisation effect of the program in multiple GPU environments at the task level. For a program containing multiple tasks, according to our task level model, we can get BI[1, 1], BS[7, 8], FB[10, 8], MM[6, $\lceil R(6 \cdot 256\,\text{KB}) \rceil$], RG[5, $\lceil R(128 \times (5 \cdot 5\,\text{MB} + 740\,\text{KB})) \rceil$], SP[7, $\lceil R(7 \cdot 8\,\text{MB}) \rceil$], DH[1, 1], TP[10, 8]. The first parameter in square brackets represents the optimal task partition granularity, and the second parameter is the optimal number of GPUs involved in the calculation. From the experiments, we find that the optimal task partition strategy is usually obtained at 5 to 10 times the basic task partition granularity. On the one hand, this is because a task granularity that is too small does not allow full development of the instruction-level parallelism in the SM or hiding of the memory access delay, so the computing power of the GPU is not exerted. On the other hand, the CUDA programming model stipulates that a maximum of eight thread blocks can run on an SM at the same time, which is limited by the system resources occupied by each thread, so more than eight thread blocks cannot continue to develop parallelism between thread blocks.


Fig. 6: The ratio of optimized energy consumption to original energy consumption (through warps parallel). Bars show SMs and memory energy for Quadro FX5600, GeForce 8800GT, and GeForce GTX 280 across the kernels BI, BS, FB, MM, RG, SP, SB, DH, and TP.
TABLE V: The number of active SMs and the processor and memory
operation levels, given as <N_SM, f_c-OL, f_m-OL>

Program   Quadro FX 5600   GeForce 8800GT   GeForce GTX 280
BI        <7,8,1>          <10,3,2>         <16,8,1>
BS        <15,8,1>         <27,3,3>         <8,8,1>
FB        <8,8,1>          <14,3,2>         <7,8,1>
MM        <15,8,3>         <25,3,4>         <6,3,1>
RG        <7,8,4>          <14,5,3>         <19,4,1>
SP        <9,8,1>          <16,5,3>         <2,3,2>
SB        <13,8,1>         <24,3,3>         <14,8,2>
DH        <6,8,2>          <8,3,2>          <6,8,3>
TP        <7,8,2>          <13,3,2>         <7,8,2>

[Figure 7 legend: SMs, Memory and Communication series for the Quadro FX 5600, GeForce 8800GT and GeForce GTX 280; the vertical axis is a percentage.]

The choice of the optimal number of GPUs is affected by
the parameter R(r). Because the number of blocks in the BI,
SB, and DH applications is very small, there is no need
to divide the tasks of these applications, and a single GPU
can satisfy the calculation demand. When R(r) ≥ N_GPU, the
five programs BI, BS, FB, SB, and TP are calculation-intensive.
These applications achieve the best performance
when all GPUs participate in the calculation. When
R(r) < N_GPU, MM, RG, and SP are memory-intensive
programs, and the execution time of the program is independent
of the number of GPUs N_GPU. From the distribution along
the task pipeline, it follows that
only ⌈R(r)⌉ GPUs are needed to completely hide
data communication behind computing tasks. At this point,
using more GPUs will only increase the number of active
GPUs, but not reduce the execution time of the program.
Therefore, only ⌈R(r)⌉ GPUs need to participate in the
calculation to achieve optimal program performance. At the
same time, turning off the remaining GPUs reduces the total
energy consumption of the system as much as possible.
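
The selection rule described above can be sketched in a few lines. This is an illustrative reading of the paragraph, not the authors' implementation: the function name `active_gpu_count`, the `needs_partition` flag and the sample values of R(r) in the usage note are assumptions.

```python
import math


def active_gpu_count(r_value: float, n_gpu: int, needs_partition: bool = True) -> int:
    """Return how many GPUs should take part in the calculation.

    r_value         -- value of R(r) for the chosen task granularity (assumed given)
    n_gpu           -- number of GPUs in the system, N_GPU
    needs_partition -- False for programs with so few blocks that one GPU suffices
    """
    if n_gpu < 1:
        raise ValueError("n_gpu must be at least 1")
    if not needs_partition:
        return 1                      # e.g. very small block counts: no task division
    if r_value >= n_gpu:
        return n_gpu                  # calculation-intensive: use every GPU
    # Memory-intensive: ceil(R(r)) GPUs already hide the data communication.
    return max(1, math.ceil(r_value))


def gpus_to_power_off(r_value: float, n_gpu: int, needs_partition: bool = True) -> int:
    """GPUs that can be turned off without lengthening execution time."""
    return n_gpu - active_gpu_count(r_value, n_gpu, needs_partition)
```

For a hypothetical memory-intensive program with R(r) = 2.4 on an 8-GPU system, `active_gpu_count(2.4, 8)` gives 3 and `gpus_to_power_off(2.4, 8)` gives 5.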
Figure 7 shows the ratio of the computing energy consumption
and memory energy consumption of SMs in the multi-GPU system to
the original energy consumption. By selecting the appropriate
task partition granularity and the number of GPUs involved in
the calculation, the energy consumption of each application
program after two-tier energy optimisation is significantly
lower than that after only one-tier energy optimisation.
The number of blocks in the BI, SB, and DH applications
is very small, so there is no need to divide the tasks of
these applications. At the same time, the optimal number
of GPUs involved in the calculation is 1, which is
unchanged compared with the warp level, so the computing
energy consumption and memory energy consumption of the SMs
do not change either. The optimal number of GPUs for
BS, FB, and TP is the number of GPUs in the system, that
is, 8 GPUs. Therefore, although the energy consumption of
these three applications is reduced, the reduction is not as
pronounced as that of MM, RG and SP.
The main reason is that MM, RG and SP only need
some of the GPUs to participate in the calculation to achieve
optimal program performance. Our strategy is to turn off the
remaining GPUs, so the corresponding SM computing energy
consumption and communication energy consumption are
greatly reduced. The computing energy consumption and
communication energy consumption are the main part of the total
energy consumption of the system. We combine the dynamic
frequency regulation technology with the core shutdown
technology to minimize the total energy consumption of the
system, and the average energy consumption can be reduced by
up to 57%.

Fig. 7: The ratio of optimised energy consumption to original
energy consumption (through warp and task parallelism).

D. Dynamic and Static Optimisation Method

To highlight the effect of dynamic frequency regulation, we
compared it with static frequency regulation. The static
frequency regulation here refers to the result of a one-time
adjustment. It is difficult to reduce the average energy consumption
to the specified level by one-time static frequency regulation.


TABLE VI: Performance and energy consumption results of dynamic frequency regulation and static frequency regulation

Program   Performance degradation   Energy consumption reduction   Performance degradation   Energy consumption reduction
          (dynamic algorithm)       (dynamic algorithm)            (static algorithm)        (static algorithm)
BI        -4.93%                    24.33%                         25.82%                    33.76%
BS        -0.53%                    46.34%                         26.93%                    54.59%
FB         0.24%                    36.56%                         28.94%                    45.47%
MM        -1.19%                    53.67%                         22.85%                    59.67%
RG        -1.23%                    57.15%                         30.65%                    64.15%
SP        -0.37%                    46.34%                         15.45%                    56.34%
SB        -4.75%                    28.67%                         23.45%                    38.14%
DH        -4.84%                    34.67%                         24.91%                    43.56%
TP         3.16%                    32.33%                         31.56%                    38.34%
Average   -1.60%                    40.01%                         25.62%                    48.22%

Table VI compares the energy consumption and performance
results of the one-time static frequency regulation and dynamic
frequency regulation algorithms. All our experimental results
simulate the average energy consumption
and performance of three high-performance GPUs: Quadro
FX5600, GeForce 8800GT, and GeForce GTX 280. It can be
seen from Table VI that there is still an average 25.62% performance
degradation under one-time static frequency regulation (compared
with that without optimisation). Therefore, to achieve the
required performance goals, frequency regulation is often
required several times. However, this method lacks
practical operability, so a real-time dynamic frequency
regulation algorithm is needed to guide the processor in adjusting
the frequency.
According to the results in Table VI, the power optimisation
algorithm of dynamic frequency regulation can control the
performance at the specified level. The performance after
optimisation is almost equal to that without optimisation, with
an average performance degradation of 1.60% and an energy
reduction of 40.01%. Across all nine programs, the maximum
performance degradation is no more than 5%. BI, SB,
and DH have the worst effect, mainly because the number
of thread blocks of these three programs is small, resulting in
the number of repetition periods being far smaller than for other
programs. The more repetition periods, the higher the sampling
frequency, the more voltage regulation points are inserted into
the program, and the more voltage regulation opportunities
are obtained. Therefore, the performance after optimisation
is closer to the target value. Unlike the other programs,
the performance of FB and TP is improved by 0.24% and
3.16% after optimisation. This is mainly because FB and TP
are computationally intensive applications, which are limited
by the SMs in the GPU. The main room for power optimisation
therefore lies in reducing the memory frequency. The energy consumption
of memory is the main component of the total system energy
consumption, and its proportion is much greater than that of
the calculation component. Therefore, by adjusting the memory
access frequency, the energy consumption of the system can
be reduced rapidly while the performance is unaffected.
Figure 8 shows the results of dynamic frequency regulation:
the abscissa represents the repetition periods, and
the ordinate represents the target frequency. Each point on
the curve represents the optimal frequency calculated from
the sampling period in each repetition period. The frequency
value is used to guide the setting of processor frequency in
the next sampling guidance period. Our experimental platform
is the GeForce 8800GT. The number of thread blocks in BI,
SB, and DH is too small (1, 1, and 4, respectively) to be
suitable for periodic sampling, so we only apply
dynamic frequency adjustment to six applications: BS, FB,
MM, RG, SP and TP. Throughout the process of dynamic
frequency regulation, we set the repetition period to
the block parameter value in Table IV. It can be
seen from Figure 8 that the optimal processor frequency and
memory access frequency decrease from the initial values
of 1.5 GHz and 1.8 GHz, respectively, then fluctuate within a
certain range. Two horizontal straight lines are plotted
through the centre of the curve, which represent the results of
static frequency regulation. If the results of dynamic frequency
regulation are taken as reference values, the horizontal straight
lines representing the static frequency regulation results are
positioned higher. Among them, the horizontal lines of RG and TP are
almost at the top of the curve, showing the greatest deviation
of the horizontal line from the centre of the curve among all
programs, which is consistent with the results in Table VI.
Table VI shows that the performance degradation of
RG and TP under static frequency regulation is the largest,
reaching 30.65% and 31.56%, respectively.
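
The averages and the ranking quoted from Table VI can be re-derived from the per-program values. The snippet below is only an arithmetic cross-check: the dictionary is a transcription of Table VI (values in percent), and the helper names `column_means` and `worst_static` are assumptions introduced for illustration.

```python
from statistics import mean

# Table VI, transcribed in percent. Tuple order:
# (performance change, dynamic; energy reduction, dynamic;
#  performance degradation, static; energy reduction, static)
TABLE_VI = {
    "BI": (-4.93, 24.33, 25.82, 33.76),
    "BS": (-0.53, 46.34, 26.93, 54.59),
    "FB": (0.24, 36.56, 28.94, 45.47),
    "MM": (-1.19, 53.67, 22.85, 59.67),
    "RG": (-1.23, 57.15, 30.65, 64.15),
    "SP": (-0.37, 46.34, 15.45, 56.34),
    "SB": (-4.75, 28.67, 23.45, 38.14),
    "DH": (-4.84, 34.67, 24.91, 43.56),
    "TP": (3.16, 32.33, 31.56, 38.34),
}


def column_means(table):
    """Mean of each column over all programs, rounded to two decimals."""
    return tuple(round(mean(col), 2) for col in zip(*table.values()))


def worst_static(table, k=2):
    """The k programs with the largest performance degradation under static regulation."""
    return sorted(table, key=lambda prog: table[prog][2], reverse=True)[:k]


print(column_means(TABLE_VI))   # (-1.6, 40.01, 25.62, 48.22)
print(worst_static(TABLE_VI))   # ['TP', 'RG']
```

The computed means agree with the Average row of Table VI, and TP and RG come out as the two programs with the largest static-regulation degradation.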
X. CONCLUSION
With the continuous improvement of GPU integration, GPU
power consumption is increasing steadily. Research into
power optimisation for multi-GPU heterogeneous
parallel system architectures has therefore become a focus
in this area. The present work analyses the
parallel pipeline execution mechanism of multiple tasks and
multiple warps and establishes a hierarchical energy
consumption optimisation model. A dynamic frequency regulation
scheme for performance parameters in terms of load balance
and load imbalance is presented. The evaluation results of
nine typical applications show that the proposed multi-layer
energy optimisation with dynamic frequency regulation can
achieve a 40% energy consumption reduction with only 1.6%
performance degradation, with a maximum energy consumption
reduction of 59%. It can further save about 30% energy
consumption in comparison with the single-layer energy optimisation.

[Figure 8: six panels, one per application (MM, RG, SP, BS, FB and TP); the horizontal axis of each panel is the repetition period and the vertical axis is the frequency.]
Fig. 8: Dynamic frequency regulation result.


Zhuowei Wang received the B.S. degree in computer
science and technology from the China University
of Geosciences, Wuhan, China, in 2007, and
the M.S. and Ph.D. degrees in computer systems
architecture from Wuhan University, Wuhan, in 2009
and 2012, respectively.
From 2019 to 2020, she worked as a Visiting
Scholar with the Norwegian University of Science
and Technology, Gjøvik, Norway. She is currently an
Associate Professor with the School of Computers,
Guangdong University of Technology, Guangzhou,
China. Her research interests focus on high-performance computing,
low-power optimization, and distributed systems.

Xiaoyu Song received the Ph.D. degree from the
University of Pisa, Pisa, Italy, in 1991. From 1992
to 1998, he was on the faculty with the University
of Montreal, Montreal, QC, Canada. He joined the
Department of Electrical and Computer Engineering,
Portland State University, Portland, OR, USA, in
1998, where he is currently a Professor. His research
interests include formal methods, design automation,
embedded systems, and emerging technologies.
Dr. Song was awarded an Intel Faculty Fellowship
from 2000 to 2005. He was an Editor of the IEEE
TRANSACTIONS ON VLSI SYSTEMS and IEEE TRANSACTIONS ON
CIRCUITS AND SYSTEMS.

Lianglun Cheng received the M.S. degree in automation
from the Huazhong University of Science
and Technology, Wuhan, China, in 1990, and the
Ph.D. degree in machinery manufacturing and automation
from the Changchun Institute of Optical
Precision Machinery and Physics, Chinese Academy
of Sciences, Changchun, China, in 2000.
He is currently a Professor with the Institute of
Computer, Guangdong University of Technology,
Guangzhou, China. His research interests focus on
Internet of Things, cyber-physical systems, and sensor networks.

Hai Wan received the B.S. degree from the National
University of Defense Technology, Changsha, China,
in 2003, and the Ph.D. degree from Tsinghua University,
Beijing, China, in 2011. From 2011 to 2012,
he was a Post-Doctoral Researcher with INRIA
Nancy-Grand Est, France.
Since 2013, he has been on the Faculty with
Tsinghua University. He is currently an Associate
Professor with the School of Software, Tsinghua
University. His research interests include real-time
systems and embedded systems.

Wuqing Zhao received the B.S. degree in information
management and information system from the China
University of Petroleum, Qingdao, China, in 2005,
and the Ph.D. degree in computer systems architecture
from Wuhan University, Wuhan, in 2011.
He works at the Digital Grid Research Institute, China
Southern Power Grid. He is especially interested in
high-performance computing, low-power optimization,
and distributed systems.

Tao Wang received the Ph.D. degree from the
School of Information Science and Technology, Sun
Yat-Sen University in 2010.
He is now an Associate Professor with the School of
Automation, Guangdong University of Technology.
His current research interests include smart manufacturing,
industrial intelligence and high-performance
computing, collective intelligence and collaborative
optimization.

