Energy Minimization of System Pipelines Using Multiple Voltages by Qu, Gang et al.
Energy Minimization of System Pipelines Using Multiple Voltages 
Gang Qu†, Darko Kirovski†, Miodrag Potkonjak†, and Mani B. Srivastava‡
† Computer Science Department, University of California, Los Angeles, CA 90095-1596
‡ Electrical Engineering Department, University of California, Los Angeles, CA 90095-1596
Abstract 
Modern computer and communication system design has to consider the
timing constraints imposed by communication and system pipelines, and
minimize the energy consumption. We adopt the recently proposed model
for communication pipeline latency [23] and address the problem of how
to minimize the power consumption in system-level pipelines under
latency constraints by selecting the supply voltage for each pipeline
stage using the variable voltage core-based system design methodology
[11]. We define the problem, solve it optimally under realistic
assumptions, and develop algorithms for power minimization of system
pipeline designs based on our theoretical results. We apply this new
approach to the 4-stage Myrinet GAM pipeline. With the appropriate
voltage profiles, we achieve 93.4%, 91.3% and 26.9% power reduction on
three pipeline stages over the traditional design.
1 Introduction 
System level pipelines are widely acknowledged as the most likely
bottleneck of many computer systems [16, 20]. For example, a read miss
in the system data or instruction cache blocks the application program
until the entire block with the requested data arrives [1, 22]. The
trade-off is clear: longer blocks imply fewer misses, but also longer
interrupt latency. Similarly, in high speed local and wide-area
networks, properly selecting the block size to exploit the intrinsic
concurrency in communication pipelines is a key issue [2, 6, 25]. As a
final example where communication pipelines dictate performance, we
mention path-oriented operating systems [17].
Therefore, it is not surprising that the question of how to improve the
performance of a system pipeline has recently received a great deal of
attention in the computer architecture, operating systems, and
compilers communities. The essence of the problem is abstracted in
recent work by Wang et al. [23], where they discuss how to minimize the
transmission latency through careful packet fragmentation.
On the other hand, the increasing use of portable systems (such as
personal computing devices, wireless communications and imaging
systems) makes power consumption one of the primary circuit and system
design goals. The most effective method to reduce power consumption is
to lower the supply voltage level, which exploits the quadratic
dependence of power on voltage [4]. However, reducing the supply
voltage increases circuit delay and decreases the clock speed. The
resulting processor core consumes lower average power while still
meeting its deadlines. Unfortunately, this technique becomes
ineffective when tight deadlines are present in the system.
Recent progress in power supply technology along with custom 
and commercial CMOS chips that are capable of operating reliably 
over a range of supply voltages makes it possible to build processor 
cores with supply voltages that can be varied at run time according 
to the application latency constraints [18]. The variable voltage 
processor core is capable of operating at different optimal points 
along the power and speed curve in order to achieve high energy 
efficiency. In particular, with multiple supply voltages on the chip, 
the processor core can use high voltage for applications with tight
deadlines while keeping the voltage low for others to reduce the total
energy consumption.
In this paper, we address the energy minimization problem in 
system-level pipelines under latency constraints. We use the recent 
advances in power supply technologies and the variable voltage de- 
sign methodology to choose a voltage profile for each pipeline stage 
which optimally minimizes the energy consumption of the entire 
pipeline system. 
A Motivational Example 
To illustrate the key ideas behind our new approach, we consider the
small communication system shown in Figure 1. The system consists of 3
store-and-forward stages operated by three identical processors. Assume
stage 1 has the slowest transmission speed, and a packet of 4
equal-size fragments has to be sent through this system by a deadline
T.
The transmission starts at time 0 and is completed at T. Figure 1 shows
three different strategies. Each rectangle represents the transmission
of one fragment at one of the three stages. The base of the rectangle
is the time that the fragment stays at that stage, while the height can
be considered as the supply voltage.





Figure 1: A 3-stage pipeline system transmits a 4-fragment packet. 
Traditional processors run at a fixed supply voltage. The total energy
consumption is minimized at the lowest possible supply voltage which
guarantees a finishing time T. Further calculation results in solution
(a) in Figure 1, where we can see that the processors on stages 0 and 2
are idle part of the time due to stage 1's slow transmission speed. The
total energy can be reduced by applying different voltages on different
stages. As shown in (b), all stages are synchronized after reducing the
supply voltages on stages 0 and 2. More energy efficiency is possible
when we vary the supply voltage levels on each processor. Since the
total energy consumption is dominated by stage 1, which requires the
highest voltage, using high voltages on stage 0 for the first fragment
and on stage 2 for the last fragment frees up transmission time for
stage 1. With more transmission time, stage 1 requires a lower voltage,
which can reduce the total energy consumption. The concept is shown in
(c).
0-7803-5471-0/99/$10.00 © 1999 IEEE
The rest of the paper is organized as follows. We review the related
work in communication pipelines and low power design techniques in
section 2, and then define the problem in section 3. In section 4, we
solve the problem optimally in two cases: (i) each pipeline stage has a
fixed voltage which varies from stage to stage; (ii) every stage can
have variable supply voltages. We present the experimental results in
section 5 and conclude in section 6.
2 Related Work 
The most relevant related work consists of efforts in communication
pipeline design and evaluation, and low power design techniques. In
particular, within the former domain, fragmentation techniques for
managing congestion control, packet buffering, and packet losses, and
the optimization techniques for improvement of distributed file systems
and high-speed local area networks are directly relevant. Within the
latter, we focus our survey on system-level power minimization
techniques and variable voltage techniques.
In the introduction, we already surveyed a number of
communication-pipeline systems and research efforts for latency
optimization of these systems. It is important to note that many
application specific systems operate, at the highest level of
abstraction, as processing pipelines on blocks of input. Fragmentation
has been used in the design of the Internet for quite a long time. More
recently, studies of how to exploit flexible block fragmentation to
improve the performance of DEC workstations have also been conducted
[13]. A more detailed survey of fragmentation techniques is given in
[23].
Dynamically adapting the voltage, and therefore the clock frequency, to
operate at the point of lowest power consumption for given temperature
and process parameters was first proposed by Macken et al. [14, 15].
Later, [12] described the implementation of several digital power
supply controllers based on this idea. Nielsen et al. [19] extended the
dynamic voltage adaptation idea to take into account data dependent
computation times in self-timed circuits. Recently, several researchers
developed efficient DC-DC converters that allow the output voltage to
be rapidly changed under external control [18]. Researchers at MIT
[5, 10] have applied the idea of voltage adaptation based on data
dependent computation time from [19] to synchronously clocked circuits.
In the software world, there has also been recent research on
scheduling strategies for adjusting CPU speed so as to reduce power
consumption. The existing work is in the context of non-real-time
workstation-like environments. [24] proposed an approach where time is
divided into 10-50 ms intervals, and the CPU clock speed (and voltage)
is adjusted by the task-level scheduler based on the processor
utilization over the preceding interval. [9] concluded that smoothing
helps more than prediction in voltage changing. Finally, [27] described
an off-line minimum-energy schedule and an average rate heuristic for
job scheduling for independent processes with deadlines.
A great variety of system-level low power techniques have been
proposed. For comprehensive surveys see [7, 26]. Energy efficient
microprocessor design has been discussed in [3, 8]. Hong et al. [11]
describe a design methodology for real-time systems-on-chip based on
dynamically variable voltage processor cores.
3 Problem Formulation 
The variable voltage is generated by DC-DC switching regulators; the
time for the voltage to reach steady state at the new voltage is on the
order of 10 cycles in a microprocessor [18]. In most of this paper, we
use the ideal variable voltage processor [21], where the supply voltage
can be changed from 0 to $\infty$ instantaneously without any overhead.
Although this ideal processor is not feasible, the study of this model
gives us insight into the problem and, more importantly, it provides a
lower bound on the energy consumption achievable with variable voltage
processors.
As proposed in [23], we represent the communication system as a
sequence of store-and-forward pipeline stages characterized by
$\{n, g_i, T_i(v_{ref})\}$. There are $n$ pipeline stages in the
system; for each stage $i$, $g_i$ is the fixed per-fragment overhead
and $T_i(v_{ref})$ is the per-byte transmission time at a reference
supply voltage $v_{ref}$. $g_i$ can be considered as the context switch
time; it may vary from stage to stage. $T_i(v_{ref})$ is proportional
to the inverse of the bandwidth, and a high voltage implies a high
transmission speed.
A packet of size $B$ (e.g., in bytes) has to be transmitted through the
pipeline with latency constraint $T$. We send the packet in $k$
fragments to utilize the pipeline, and denote by $x_i$ the size of the
$i$th fragment and by $t_{i,j}$ the time that the $i$th fragment stays
on stage $j$.
Let $v_j(t)$ be the voltage at which the $j$th processor operates at
time $t$; then
$$E_j = \int_0^T P(v_j(t))\,dt \qquad (1)$$
is the energy consumed by this processor, where $P(v)$ is the power
dissipation at supply voltage $v$. We want to minimize
$E = \sum_{j=0}^{n-1} E_j$ by finding the best voltage and
fragmentation schemes.
The problem is formulated as:
Problem: Energy Minimization with Deadline on Variable Voltage
Processor (EMDVVP).
Instance: A pipeline with parameters $n$, $g_i$ and $T_i(v_{ref})$, a
packet with size $B$ and deadline $T$.
Question: Find the voltage scheme $v_j(t)$ for each processor and a
fragmentation $\{x_0, x_1, \ldots\}$ of the packet, such that the
entire packet is transmitted within $T$ and the total energy
consumption $E = \sum_{j=0}^{n-1} \int_0^T P(v_j(t))\,dt$ is minimized.
Figure 2: Problem formulation. 
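For a voltage profile that is constant over each interval, the integral in (1) becomes a finite sum of power times duration. The following minimal Python sketch illustrates this. The names `Segment`, `quadratic_power`, `stage_energy` and `total_energy` are hypothetical, and the quadratic power model is only an assumption standing in for an empirical $P(v)$; it is not data from the paper.

```python
from typing import Callable, Sequence, Tuple

# One (voltage, duration) pair per constant-voltage interval on a stage.
Segment = Tuple[float, float]


def quadratic_power(v: float, p_ref: float = 1.0, v_ref: float = 1.0) -> float:
    """Assumed power model P(v) = p_ref * (v / v_ref)**2 (illustrative only)."""
    return p_ref * (v / v_ref) ** 2


def stage_energy(segments: Sequence[Segment],
                 power: Callable[[float], float] = quadratic_power) -> float:
    """Equation (1) for a piecewise-constant v_j(t): sum of P(v) * duration."""
    return sum(power(v) * dt for v, dt in segments)


def total_energy(profiles: Sequence[Sequence[Segment]],
                 power: Callable[[float], float] = quadratic_power) -> float:
    """E = sum over all stages j of E_j."""
    return sum(stage_energy(p, power) for p in profiles)


if __name__ == "__main__":
    # Hypothetical two-stage profile: stage 0 runs fast then slow, stage 1 stays slow.
    profiles = [[(2.0, 1.0), (1.0, 3.0)], [(1.0, 4.0)]]
    print(total_energy(profiles))
```

Any measured power curve can be passed in through the `power` argument in place of the assumed quadratic model.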
4 Design of System Pipelines Using Multiple Voltages 
To design application specific and energy efficient system pipelines
with variable voltage processors, we have to solve the EMDVVP problem
based on the user-specified packet information (i.e., packet size $B$
and transmission latency $T$) and the parameters of the system pipeline
(number of pipeline stages $n$, per-fragment overhead $g_i$,
transmission speed $T_i(v_{ref})$, as well as the power dissipation
functions).
Lemma 4.1 A necessary condition for the energy to be minimized is to
finish the transmission exactly at the deadline $T$.
The intuition behind Lemma 4.1 is that the system will use as much time
as possible so that the processors can be scheduled at low voltages,
thus minimizing energy consumption. On the other hand, from the
convexity of the energy and voltage function [21], we have:
Lemma 4.2 On every stage, to minimize the energy, the supply voltage
changes only upon the arrival of a new fragment or the completion of
sending the current fragment.
Recall that $t_{i,j}$ is the time that the $i$th fragment stays in the
$j$th stage, which includes both the overhead $g_j$ and the actual
transmission time. For each single stage, the best strategy is to
transmit a fragment immediately upon its reception or upon the
completion of sending the previous fragment, whichever comes later.
This observation leads to the next lemma:
Lemma 4.3 In the optimal voltage and fragmentation schemes, for all
$0 \le i \le k-2$ and $1 \le j \le n-1$, the following holds:
$$t_{i,j} = t_{i+1,j-1} \qquad (2)$$
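The per-stage rule above (start a fragment as soon as it has arrived and the stage has finished the previous fragment) can be sketched as a simple recurrence over completion times. This is an illustrative Python sketch, not the authors' implementation. The function names are hypothetical, and the stay times `t[i][j]` are taken as given inputs.

```python
from typing import List, Sequence


def finish_times(t: Sequence[Sequence[float]]) -> List[List[float]]:
    """Completion time of fragment i on stage j, given stay times t[i][j].

    A fragment starts on stage j at the later of (a) its completion on
    stage j-1 and (b) the completion of fragment i-1 on stage j.
    """
    k, n = len(t), len(t[0])
    done = [[0.0] * n for _ in range(k)]
    for i in range(k):
        for j in range(n):
            arrived = done[i][j - 1] if j > 0 else 0.0
            stage_free = done[i - 1][j] if i > 0 else 0.0
            done[i][j] = max(arrived, stage_free) + t[i][j]
    return done


def satisfies_lemma_4_3(t: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """Check t[i][j] == t[i+1][j-1] for 0 <= i <= k-2 and 1 <= j <= n-1."""
    k, n = len(t), len(t[0])
    return all(abs(t[i][j] - t[i + 1][j - 1]) <= tol
               for i in range(k - 1) for j in range(1, n))


if __name__ == "__main__":
    # Hypothetical stay times: 2 fragments, 3 stages, every slot lasts 2 time units.
    t = [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
    print(finish_times(t)[-1][-1], satisfies_lemma_4_3(t))
```

When condition (2) holds, a fragment never waits for the next stage and no stage idles between consecutive fragments, so the overall latency is the completion time of the last fragment on the last stage.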
4.1 Fixed Voltage on the Same Stage 
We first consider the simple case when the processor at each stage
operates at a fixed voltage, which can be arbitrary. The voltage scheme
problem then reduces to finding a constant $v_j$ for the processor at
the $j$th stage, and $t_{i,j}$ can be expressed as:
$$t_{i,j} = g_j + T_j(v_j)\,x_i \qquad (3)$$
Assume that the packet can only be fragmented into equal-size
fragments; then from (2),
Lemma 4.4 A voltage scheme $\{v_0, v_1, \ldots, v_{n-1}\}$ minimizes
the energy consumption only if
$$t_{i,j} = \text{constant} \qquad (4)$$
From (3), the processor at the stage that has the largest per-fragment
overhead has to operate at a high voltage to achieve a small per-byte
transmission time $T_j(v_j)$ due to (4). Therefore, this stage will
consume more energy than the other stages, and we call such a stage the
dominant stage because it dominates the total energy consumption.
Theorem 4.1 Let stage d be the dominant stage, then there is a 
unique solution for the EMDVVP problem. The number of frag- 
ments is given by: 
and the constant on the r.h.s. of (4) is A. 
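Relations (3) and (4), combined with Lemma 4.1, can be illustrated numerically for a given number of fragments. The sketch below does not reproduce the closed form of Theorem 4.1; it takes $k$ as an input. It assumes that, with every stay time equal to a common slot length, $k$ fragments cross $n$ stages in $n + k - 1$ slots, so the slot length is $T/(n+k-1)$. The function names and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def slot_time(deadline: float, n_stages: int, k_fragments: int) -> float:
    """Common stay time when the pipeline finishes exactly at the deadline.

    Assumes a fully synchronized pipeline: k fragments need n + k - 1 slots.
    """
    return deadline / (n_stages + k_fragments - 1)


def required_per_byte_times(deadline: float, packet_size: float,
                            g: Sequence[float], k: int) -> List[float]:
    """Solve (3) for T_j(v_j) on every stage, with equal-size fragments."""
    tau = slot_time(deadline, len(g), k)
    x = packet_size / k  # equal-size fragments
    times = [(tau - gj) / x for gj in g]
    if min(times) <= 0:
        raise ValueError("slot is too short to cover a per-fragment overhead")
    return times


if __name__ == "__main__":
    # Hypothetical numbers, not the Myrinet GAM parameters.
    print(required_per_byte_times(deadline=100.0, packet_size=4.0,
                                  g=[5.0, 10.0, 5.0], k=2))
```

In this example the stage with the largest overhead is left with the smallest per-byte time and therefore needs the highest voltage, which matches the description of the dominant stage. Mapping a required per-byte time back to a supply voltage needs an empirical speed-voltage curve, which is outside this sketch.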
4.2 Variable Voltages on the Same Stage
Now we assume each fragment can have variable size and each processor
can run at different voltage levels.
As formulated in Figure 2, a solution to the EMDVVP problem consists of
a supply voltage function for each processor and a packet
fragmentation.
Lemma 4.2 outlines the shape of the voltage functions, which are step
functions whose only possible break points are the times when a new
fragment arrives or the current one leaves. Therefore we only need to
determine the supply voltage $v_{i,j}$ for each processor to transmit
each fragment, which reduces the problem from finding $n$ functions to
determining $nk$ numbers, where $k$ is the number of fragments.
Lemma 4.3 gives a recursive relation among the times that fragments
stay at each stage, from which $(n-1)(k-1)$ of the $v_{i,j}$'s can be
easily determined.
Lemma 4.1 tells us the energy is minimized only when the entire
transmission finishes at the deadline, so one more variable can be
eliminated. Combining all these, we propose an approach to the optimal
scheme in Figure 3 and draw the following conclusion:
Theorem 4.2 Given the number of fragments, the EMDVVP problem with
variable-sized fragments and variable voltage at each stage is reduced
to solving a nonlinear system (step 6 in Figure 3) of $n + 2k - 3$ free
variables.
A numerical solution can be obtained from the empirical power and speed
functions. Furthermore, by repeating this approach for all possible
values of $k$, we can solve the EMDVVP problem optimally.
1. Let $\{x_0, x_1, \ldots, x_{k-1}\}$ be a fragmentation, with
$x_{k-1}$ given by $B - \sum_{i=0}^{k-2} x_i$.
2. Let $\{v_{0,0}, v_{1,0}, \ldots, v_{k-1,0}\}$ be the voltage scheme
for the processor at stage 0.
3. Let $\{v_{k-1,0}, v_{k-1,1}, \ldots, v_{k-1,n-1}\}$ be the voltages
at each stage to transmit the last fragment of the packet, where
$v_{k-1,n-1}$ is solved from the latency constraint
$\sum_{i=0}^{k-2} t_{i,0} + \sum_{j=0}^{n-1} t_{k-1,j} = T$.
4. For each stage $j$ ($1 \le j \le n-1$), calculate its voltage scheme
$\{v_{i,j} : 0 \le i \le k-2\}$ from (2), (3).
5. Total energy consumption:
$E = \sum_{j=0}^{n-1} \sum_{i=0}^{k-1} P(v_{i,j})\,t_{i,j}$.
6. Solve for all the variables in steps 1, 2, and 3 from the system
[system of equations not legible in the source] for $0 < i < k-1$.
Figure 3: An approach to the optimal scheme. 
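The variable count in Theorem 4.2 can be checked by simple bookkeeping over the steps of Figure 3. The following tally is a reader's sketch based on the counts stated in the text, not a derivation taken from the paper:

```latex
\begin{align*}
\text{voltages } v_{i,j} &: \; nk \\
\text{determined by (2) and (3)} &: \; (n-1)(k-1) \\
\text{determined by the latency constraint} &: \; 1 \\
\text{free fragment sizes } x_0, \ldots, x_{k-2} &: \; k-1 \\
\text{free variables} &: \; nk - (n-1)(k-1) - 1 + (k-1) = n + 2k - 3
\end{align*}
```

For instance, with $n = 4$ stages and $k = 3$ fragments this gives $12 - 6 - 1 + 2 = 7 = 4 + 2 \cdot 3 - 3$ free variables.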
5 Experimental Results 
We report the results of applying our new energy minimization approach
to the Myrinet GAM pipeline, which Berkeley researchers used to study
packet fragmentation and to build the model for system pipeline
evaluation [23].
The Myrinet GAM pipeline consists of four stages: stage 0 copies data
on the sender host; stage 1 is the sender host DMA; stage 2 is an
abstract pipeline stage of the network DMAs at both end hosts and a
receiver host DMA; stage 3 is the copy on the receiver host. The
parameters of this pipeline are given in Table 1 [23]. The second
column is the per-fragment overhead, the third column is the
per-kilobyte transmission time at the reference supply voltage, and the
last column is the reference power for each stage at the reference
supply voltage. It is clear that stage 2 is the dominant stage since it
has the largest per-fragment overhead and the slowest transmission
speed.
stage j | g_j (µs) | T_j (µs/KB) | P_j (Watt)
0       | 7.2      | 7.2         | P_0
Table 1: Myrinet GAM pipeline parameters. 
Suppose a 4KB packet is transmitted via this pipeline with various
user-specified latency constraints. We apply the variable voltage
approach with fixed-size fragmentation to schedule the supply voltage
for the processors at each stage. The result is shown in Table 2.
The traditional energy minimization technique tries to find the minimal
supply voltage and then applies it to the processors at all stages to
meet the deadline constraint. In this case, this voltage is the one
required by stage 2. Table 3 compares the energy consumption at each
stage under our new approach and under the traditional method. At both
end hosts (stages 0 and 3), a significant amount of energy (more than
90% on average) is saved due to the high transmission speed at these
two stages. On stage 1, the average 26.9% energy reduction comes from
its small overhead $g_1$.
Table 2: Optimal voltage schemes for Myrinet GAM pipeline to transmit a 4KB packet.
        | Stage 0                      | Stage 1                   | Stage 2                   | Stage 3
        | trad. | new      | reduction | trad. | new  | reduction | trad. | new  | reduction | trad. | new      | reduction
Average | 0.54  | 3.53e-02 | 93.4%     | 0.54  | 0.39 | 26.9%     | 0.54  | 0.54 | 0%        | 0.54  | 4.33e-02 | 91.3%
Table 3: Energy reduction on Myrinet GAM pipeline for transmission of a 4KB packet.
6 Conclusion
Variable-size packet fragmentation can reduce transmission latency, and
variable voltage processors enable power-efficient system design. We
combine these techniques to address the problem of how to minimize the
power consumption in system-level pipelines under latency constraints.
We define the problem and solve it optimally based on the communication
pipeline model [23] and the variable voltage processor model [21]. Even
when restricted to equal-size fragmentation and a fixed voltage on each
processor core, we show that significant power reduction is possible
without additional latency.
References 
[1] T.E. Anderson, M.D. Dahlin, J.M. Neefe, D.A. Patterson, and others. Serverless network file systems. ACM Transactions on Computer Systems, Vol. 14, No. 1, pp. 41-79, 1996.
[2] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, and others. Myrinet: a gigabit-per-second local area network. IEEE Micro, Vol. 15, No. 1, pp. 29-36, 1995.
[3] T.D. Burd, R.W. Brodersen. Processor design for portable systems. Journal of VLSI Signal Processing, Vol. 13, No. 2-3, pp. 203-221, 1996.
[4] A. Chandrakasan, S. Sheng, and R.W. Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, Vol. 27, No. 4, pp. 473-484, 1992.
[5] A. Chandrakasan, V. Gutnik, T. Xanthopoulos. Data driven signal processing: an approach for energy efficient computing. International Symposium on Low Power Electronics and Design, pp. 347-352, 1996.
[6] B.N. Chun, A.M. Mainwaring, D.E. Culler. Virtual network transport protocols for Myrinet. IEEE Micro, Vol. 18, No. 1, pp. 53-63, 1998.
[7] L. Claesen, H. De Kuyper, R. Tits. Low power applications at system level. Low Power Design in Deep Submicron Electronics. Eds.: W. Nebel, J. Mermet. Dordrecht, Netherlands: Kluwer Academic Publishers, pp. 543-64, 1997.
[8] R. Gonzalez, M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, Vol. 31, No. 9, pp. 1277-1284, 1996.
[9] K. Govil, E. Chan, and H. Wasserman. Comparing algorithms for dynamic speed-setting of a low-power CPU. ACM International Conference on Mobile Computing and Networking (MOBICOM'95), pp. 13-25, 1995.
[10] V. Gutnik and A. Chandrakasan. An efficient controller for variable supply-voltage low power processing. Symposium on VLSI Circuits, pp. 158-159, 1996.
[11] I. Hong, D. Kirovski, G. Qu, M. Potkonjak, and M.B. Srivastava. Power Optimization of Variable Voltage Core-Based Systems. Proceedings of 35th Design Automation Conference (DAC'98), pp. 176-181, 1998.
[12] M. Horowitz. Low power processor design using self-clocking. Workshop on Low-power Electronics, 1993.
[13] H.A. Jamrozik et al. Reducing network latency using subpages in a global memory environment. International Conference on Architectural Support for Programming Languages and Operating Systems, SIGPLAN Notices, Vol. 31, No. 9, pp. 258-67, 1996.
[14] V. Von Kaenel, P. Macken, M.G.R. Degrauwe. A voltage reduction technique for battery-operated systems. IEEE Journal of Solid-State Circuits, Vol. 25, No. 5, pp. 1136-1140, 1990.
[15] P. Macken, M. Degrauwe, M. Van Paemel, H. Oguey. A voltage reduction technique for digital systems. 1990 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pp. 238-239, 1990.
[16] R.P. Martin, A.M. Vahdat, D.E. Culler, T.E. Anderson. Effects of communication latency, overhead, and bandwidth in a cluster architecture. (24th Annual International Symposium on Computer Architecture, ISCA '97). Computer Architecture News, Vol. 25, No. 2, pp. 85-97, 1997.
[17] D. Mosberger, L.L. Peterson. Making paths explicit in the Scout operating system. Second USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 28-31, 1996.
[18] W. Namgoong, M. Yu, T. Meng. A high-efficiency variable-voltage CMOS dynamic dc-dc switching regulator. 1997 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pp. 380-381, 1997.
[19] L.S. Nielsen, C. Niessen, J. Sparso, K. van Berkel. Low-power operation using self-timed circuits and adaptive scaling of the supply voltage. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 2, No. 4, pp. 391-397, 1994.
[20] L.L. Peterson & B.S. Davie. Computer networks: a systems approach. San Francisco, Calif.: Morgan Kaufmann Publishers, 1996.
[21] G. Qu. Scheduling Problems for Reduced Energy on Variable Voltage Systems. Master Thesis, Computer Science Dept., Univ. of California, Los Angeles, 1998.
[22] G.M. Voelker et al. Managing server load in global memory systems. ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 97), Performance Evaluation Review, Vol. 25, No. 1, pp. 127-138, 1997.
[23] R.Y. Wang, A. Krishnamurthy, R.P. Martin, T.E. Anderson, D.E. Culler. Modeling Communication Pipeline Latency. Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98/PERFORMANCE '98), Performance Evaluation Review, Vol. 26, No. 1, pp. 22-32, 1998.
[24] M. Weiser, B. Welch, A. Demers, S. Shenker. Scheduling for reduced CPU energy. USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 13-23, 1994.
[25] M. Welsh, A. Basu, T. von Eicken. ATM and fast Ethernet network interfaces for user-level communication. Third International Symposium on High-Performance Computer Architecture, pp. 332-342, 1997.
[26] A. Wolfe. Issues for low-power CAD tools: a system-level design study. Design Automation for Embedded Systems, Vol. 1, No. 4, pp. 315-332, 1996.
[27] F. Yao, A. Demers, S. Shenker. A scheduling model for reduced CPU energy. IEEE Annual Foundations of Computer Science, pp. 374-382, 1995.
