The performance of many modern computer and communication systems is dictated by the latency of communication pipelines. At the same time, power consumption is often another limiting factor in many portable systems. We address the problem of how to minimize the power consumption in system-level pipelines under latency constraints. In particular, we exploit advantages provided by variable voltage design methodology to optimally select the speed and therefore the voltage of each pipeline stage. We define the problem and solve it optimally under realistic and widely accepted assumptions. We apply the obtained theoretical results to develop algorithms for power minimization of computer and communication systems and show that significant power reduction is possible without additional latency.
Introduction
System-level pipelines are widely acknowledged as the most likely bottleneck of many computer systems [11, 15]. For example, a read miss in the system data or instruction cache blocks the application program until the entire block with the requested data arrives [1, 17]. The trade-off is clear: longer blocks imply fewer misses, but also longer interrupt latency. Similarly, in high-speed local and wide area networks, properly selecting the block size to exploit the intrinsic concurrency in communication pipelines is a key issue [2, 4, 20]. As the final example where communication pipelines dictate performance, we mention path-oriented operating systems [12]. Therefore, it is not surprising that the question of how to improve the performance of a system pipeline has recently received a great deal of attention in the computer architecture, operating systems, and compilers communities. The essence of the problem is abstracted in recent work by Wang et al. [18].
In this paper, we address the energy minimization problem in system-level pipelines under latency constraints. We use the recent advances in power supply technologies and the variable voltage design methodology to choose a voltage profile for each pipeline stage which optimally minimizes the energy consumption of the entire pipeline system. The paper is organized as follows. We review the related work in communication pipeline and low power design techniques, then we define the problem in Section 3. We solve the problem optimally in two cases: (i) each pipeline stage has a fixed voltage which varies from stage to stage; (ii) every stage can have variable supply voltages. We present the experimental results in Section 6 and then conclude.
ICCAD98, San Jose, CA, USA. © 1998 ACM.
Related Work
The most relevant related work comprises efforts in communication pipeline design and evaluation, and low power design techniques. In particular, within the former domain, fragmentation techniques for managing congestion control, packet buffering, and packet losses, and the optimization techniques for improvement of distributed file systems and high-speed local area networks are directly relevant. Within the latter, we focus our survey on system-level power minimization techniques and variable voltage techniques.
In the introduction, we already surveyed a number of communication-pipeline systems and research efforts for latency optimization of these systems. It is important to note that many application-specific systems operate at the highest level of abstraction as processing pipelines on blocks of input (e.g., digital TV and audio and segmentation subsystems of communication devices).
Apparently, fragmentation has been used in the design of the Internet for quite a long time. More recently, studies of how to exploit flexible block fragmentation to improve the performance of DEC workstations have also been conducted [8]. A more detailed survey of fragmentation techniques is given in [18].
Dynamically adapting the voltage, and therefore the clock frequency, to operate at the point of lowest power consumption for given temperature and process parameters was first proposed by Macken et al. [9, 10]. Later, [7] described the implementation of several digital power supply controllers based on this idea. Nielsen et al. [14] extended the dynamic voltage adaptation idea to take into account data-dependent computation times in self-timed circuits. Recently, several researchers developed efficient DC-DC converters that allow the output voltage to be rapidly changed under external control [13]. Researchers [3, 6] have applied the idea of voltage adaptation based on data-dependent computation time from [14] to synchronously clocked circuits.
In the software world, there has also been recent research on scheduling strategies for adjusting CPU speed so as to reduce power consumption. The existing work is in the context of non-real-time workstation-like environments. [19] proposed an approach where time is divided into 10-50 ms intervals, and the CPU clock speed (and voltage) is adjusted by the task-level scheduler based on the processor utilization over the preceding interval. [5] concluded that smoothing helps more than prediction in voltage changing. Finally, [21] described an off-line minimum-energy schedule and an average rate heuristic for job scheduling for independent processes with deadlines.
Background and Problem Formulation
We describe the variable voltage processor and the store-and-forward pipelining network, characterize the user packet to be transmitted, and then state the problem.
Variable Voltage Processor
In most of this paper, we use the ideal variable voltage processor [16], where the supply voltage can be changed from 0 to ∞ instantaneously without any overhead. This ideal processor is not feasible, because the voltage needs a non-negligible amount of time to reach steady state at the new level and because of the feedback control behavior of the DC-DC switching regulator. Nevertheless, the study of this model gives us insight into the problem and, more importantly, it provides a lower bound on the energy consumption achievable with variable voltage processors.
With different supply voltages, the processor is able to operate at different speeds, and therefore the time and power used to accomplish the same task will also be different.
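The dependence of speed and energy on the supply voltage can be illustrated with a first-order CMOS model. The sketch below is an assumption made for illustration, not a model stated in this paper: gate delay is taken as proportional to V/(V - V_t)^2 and dynamic energy per operation as proportional to V^2. The function names are hypothetical; the default threshold and reference voltages (0.8 V and 5 V) are the values used in the experimental section.

```python
def delay_scale(v, v_ref=5.0, v_t=0.8):
    """Delay at supply voltage v relative to the delay at v_ref.

    Assumed first-order CMOS model: delay is proportional to v / (v - v_t)**2.
    """
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return (v / (v - v_t) ** 2) / (v_ref / (v_ref - v_t) ** 2)


def energy_scale(v, v_ref=5.0):
    """Dynamic energy per operation relative to v_ref (assumed proportional to v**2)."""
    return (v / v_ref) ** 2


if __name__ == "__main__":
    for v in (5.0, 3.3, 2.5, 1.5):
        print(f"V = {v:.1f} V: delay x{delay_scale(v):.2f}, energy x{energy_scale(v):.2f}")
```

Under this assumed model, lowering the voltage always slows the processor and always reduces the energy per operation, which is the trade-off the rest of the paper exploits.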
Network Model
As proposed in [18], we represent the network as a sequence of store-and-forward pipeline stages characterized by the following parameters:
. n: the number of pipeline stages.
. $g_i$: the fixed per-fragment overhead at stage i, which can be considered as the context switch time. It may vary from stage to stage. If none of the stages has overhead, as we will show soon, the best strategy is to fragment the packet as small as possible.
. $T_i(5)$: the per-byte transmission time at stage i with a 5 V supply voltage, which is proportional to the inverse of the bandwidth of that stage. In the extreme case, if there is no bandwidth limitation for all stages, the entire packet should be sent as a single fragment to achieve the minimum latency.
Problem Formulation
Our objective is to minimize the energy consumption for transmitting a packet through the network under the user-specified latency constraint. The following variables are associated with the packet for the convenience of analysis:
. B: the size of the entire packet.
. T: the deadline to transmit the entire packet.
. k: the number of fragments.
. $x_i$: the size (in bytes) of the ith fragment ($0 \le i \le k-1$).
. $t_{i,j}$: the time that the ith fragment stays in the jth stage.
The packet's size B and the deadline T are given by the user, the network is characterized by n, $g_i$, and $T_i(5)$, and we assume that the processors at all stages are identical.
Let $V_j(t)$ be the voltage at which the jth processor operates at time t; then the energy consumed by this processor is
$$E_j = \int_0^T P(V_j(t))\,dt \qquad (1)$$
We first consider the simple case when the processor at each stage operates at a fixed voltage, which can be arbitrary. The voltage scheme problem then becomes finding a constant $v_j$ for the processor at the jth stage, and the energy consumed by this processor, from (1), simplifies to $E_j = P(v_j)T$. Moreover, the time that the ith fragment stays in the jth stage can be expressed as:
$$t_{i,j} = g_j + x_i T_j(v_j) \qquad (2)$$
Lemma 4.1 A necessary condition for the energy to be minimized is to finish the transmission exactly at the deadline T.
The intuition behind Lemma 4.1 is that the network will use as much time as possible to schedule the processors with low voltages and thus minimize energy consumption. On the other hand, for each single stage, the best strategy is to transmit a fragment immediately upon its reception or at the accomplishment of sending the previous fragment, whichever comes later. This observation leads to the next lemma.
Lemma 4.2 Given that the packet can only be fragmented into fixed-size fragments and the supply voltage for each processor cannot be changed, if a voltage scheme $\{v_0, v_1, \ldots, v_{n-1}\}$ minimizes the energy consumption, then
$$t_{i,j} = \text{constant} \qquad (3)$$
From (2), the processor at the stage that has the largest per-fragment overhead has to operate at a high voltage to achieve a small per-byte transmission time $T_j(v_j)$ due to (3). Therefore, this stage will consume more energy than other stages, and we call such a stage the dominant stage because it dominates the total energy consumption.
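Theorem 4.3 With fixed-size fragments and a fixed voltage at each stage, the optimal number of fragments is
$$k = \sqrt{\frac{(n-1)\,T}{g_d}} - (n-1) \qquad (4)$$
where $g_d$ is the per-fragment overhead at the dominant stage, and the constant on the r.h.s. of (3) is $T/(k+n-1)$.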
How do the network's parameters and the latency affect the optimal scheme?
. T: the deadline. When the latency constraint is loose (i.e., T goes large), we can have more fragments from (4). Energy consumption is reduced since every processor gets a longer transmission time.
. n: the number of stages in the network. If we differentiate (4) with respect to n, the result is positive, which means that the more stages in the network, the more fragments we should have. This takes advantage of the parallelism.
. $g_d$: the per-fragment overhead at the (energy) dominant stage. If this overhead goes large, fewer fragments should be used to cut the total overhead. If there is no overhead, then we should fragment the packet as small as possible so that more parts of the packet can be transmitted in parallel.
. B: the size of the entire packet. The number of fragments in the optimal scheme is independent of the packet size. This is not surprising, since we use the ideal variable voltage processor, which can adjust its speed (by changing the supply voltage) according to the size of the packet.
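The sensitivities listed above can be checked numerically. The sketch below rests on an assumption: the closed form k = sqrt((n-1)T/g_d) - (n-1) is our reading of (4), obtained by maximizing the transmission time available at the dominant stage when every fragment spends T/(k+n-1) in every stage. Treat it as hypothetical if your copy of (4) differs. The function names and the sample numbers are illustrative only.

```python
import math


def optimal_fragments(T, n, g_d):
    """Real-valued k that maximizes the per-byte time at the dominant stage.

    Assumes each fragment spends c = T / (k + n - 1) in every stage, so the
    dominant stage has k * (c - g_d) time units to move the whole packet.
    """
    if n < 2 or g_d <= 0:
        raise ValueError("need n >= 2 stages and a positive overhead g_d")
    return math.sqrt((n - 1) * T / g_d) - (n - 1)


def dominant_stage_budget(k, T, n, g_d):
    """Total transmission time (excluding overhead) available at the dominant stage."""
    return k * (T / (k + n - 1) - g_d)


if __name__ == "__main__":
    T, n, g_d = 400.0, 4, 3.0          # hypothetical deadline, stages, overhead
    k = optimal_fragments(T, n, g_d)   # sqrt(3 * 400 / 3) - 3 = 17
    print(k, dominant_stage_budget(k, T, n, g_d))
```

In practice k must be a positive integer, so one would compare the two integers around the real-valued optimum. Under this reading, the derivative with respect to n is positive only while T > 4(n-1)g_d, that is, when the deadline is loose relative to the total overhead. The packet size B does not appear in the formula, in line with the last bullet.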
Variable Voltages within the Same Stage
The discussion in the previous section is very restricted: the fragments have equal length and each processor runs at a fixed supply voltage, though different processors may run at different voltages. Now we assume that each fragment can have a variable size and each processor can run at different voltage levels.
First of all, Lemma 4.1 still holds, which says that we should finish the transmission on its deadline, not at any earlier time. Another basic fact follows from the convexity of energy as a function of the supply voltage.
Lemma 5.1 In every stage, to minimize the energy, supply voltages change either on the arrival of a new fragment or at the accomplishment of sending the current fragment.
Recall that $t_{i,j}$ is the time that the ith fragment stays in the jth stage, which includes both the overhead $g_j$ and the actual transmission time. Lemma 4.2 synchronizes all processors on fixed-length fragments such that no stage will congest or starve. This can be generalized to the case when the fragments have different sizes.
Lemma 5.2 In the optimal voltage and fragmentation schemes, for all $0 \le i \le k-2$ and $1 \le j \le n-1$, the following holds:
$$t_{i,j} = t_{i+1,j-1}$$
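The synchronization property of Lemma 5.2 can be checked with a small store-and-forward simulation. The sketch below is illustrative only: the function name and the sample stay times are hypothetical. A stage starts a fragment as soon as the fragment has left the previous stage and the stage itself is free. When t[i][j] = t[i+1][j-1], fragment i+1 leaves stage j-1 exactly when fragment i leaves stage j, so no stage congests or starves.

```python
def simulate(t):
    """Return finish[i][j], the time fragment i leaves stage j.

    t[i][j] is the time fragment i stays in stage j (overhead plus
    transmission). A stage starts a fragment once the fragment has left the
    previous stage and the stage has finished the previous fragment.
    """
    k, n = len(t), len(t[0])
    finish = [[0] * n for _ in range(k)]
    for i in range(k):
        for j in range(n):
            arrived = finish[i][j - 1] if j > 0 else 0
            stage_free = finish[i - 1][j] if i > 0 else 0
            finish[i][j] = max(arrived, stage_free) + t[i][j]
    return finish


if __name__ == "__main__":
    k, n = 3, 4
    tau = [2, 3, 5, 4, 3, 2]                  # one stay time per anti-diagonal i + j
    t = [[tau[i + j] for j in range(n)] for i in range(k)]   # satisfies Lemma 5.2
    finish = simulate(t)
    # Fragment i+1 arrives at stage j exactly when fragment i leaves it.
    assert all(finish[i + 1][j - 1] == finish[i][j]
               for i in range(k - 1) for j in range(1, n))
    print(finish[k - 1][n - 1], sum(tau))     # total latency equals the sum of tau
```

Because the stay times are constant along each anti-diagonal i + j, the total latency is simply the sum of the k + n - 1 distinct stay times, which Lemma 4.1 then pins to the deadline T.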
Combining all these, we propose an approach to the optimal scheme.
Figure 2: An approach to the optimal scheme.
As formulated in Section 3.3, a solution to the EMDVVP problem means a supply voltage function for each processor and a packet fragmentation.
Lemma 5.1 outlines the shape of the voltage functions, which are step functions with all possible break points at the times when a new fragment comes or the current one leaves. Therefore we only need to determine the supply voltage $v_{i,j}$ for each processor to transmit each fragment, which reduces the problem from finding n functions to determining nk numbers, where k is the number of fragments. Lemma 5.2 predicts a recursive relation among the times that fragments stay at each stage, from which (n-1)(k-1) of the $v_{i,j}$'s can be easily calculated. Lemma 4.1 tells us the energy is minimized only when the entire transmission finishes at the deadline, so one more variable can be eliminated. The remaining variables $v_{i,j}$ are those listed in steps 2 and 3 in Figure 2. Since we now allow variable supply voltages within the same stage, we can shut down the processor (or run it at the minimum voltage if shutdown is not allowed) to save energy, and this gives the expression for the total energy consumed in step 5. Finally, we apply the first order condition to solve for the optimal scheme.
Considering that the size of each fragment $x_i$ also has to be determined, we have
Theorem 5.3 Given the number of fragments, the EMDVVP problem with variable-sized fragments and variable voltage at each stage is reduced to solving a nonlinear system (step 6 in Figure 2) of n + 2k - 3 free variables.
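The variable count in Theorem 5.3 can be retraced as follows. This is our reading of the argument above; in particular, we assume that the k fragment sizes contribute k - 1 free variables because they must sum to the packet size B.

```latex
\begin{align*}
\underbrace{nk}_{\text{voltages } v_{i,j}}
 - \underbrace{(n-1)(k-1)}_{\text{Lemma 5.2}}
 - \underbrace{1}_{\text{Lemma 4.1}}
 &= n + k - 2, \\
(n + k - 2) + \underbrace{(k-1)}_{\text{sizes } x_i,\ \sum_i x_i = B}
 &= n + 2k - 3 .
\end{align*}
```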
By repeating Theorem 5.3 for all possible values of k, we can solve the EMDVVP optimally. However, the difficulty is that even n + 2k - 3 variables are too many for us to handle, and the nonlinear system is also hard to solve analytically. (See the technical report for a detailed example and discussion.)
Experimental Results
In this section, we report the results obtained when we apply our new energy minimization approach to the Myrinet GAM pipeline [18].
The Myrinet GAM pipeline consists of four stages: stage 0 copies data on the sender host, stage 1 is the sender host DMA, the next stage is an abstract pipeline stage of the network DMAs at both end hosts and a receiver host DMA, and stage 3 is the copy on the receiver host. The parameters of this pipeline are given in Table 1 [18]. The second column is the per-fragment overhead, the third column is the per-kilobyte transmission time at the reference supply voltage, and the last column is the reference power for each stage at the reference supply voltage. Further, we suppose there is a 4~-packet being transmitted via this network with various user-specified latency constraints, and let the threshold and reference supply voltages be 0.8 volts and 5 volts respectively.
As discussed in Section 4, energy consumption on each stage is determined by the supply voltage, which is proportional to ~, where C is a stage-independent constant. (This is clear from the proof of Theorem 4.3, which has been omitted due to space constraints.) Therefore, the larger the per-byte transmission time $T_j(5)$ is, the more energy is consumed. The same holds for the per-fragment overhead $g_j$. In the Myrinet GAM pipeline, it is clear that stage 2 is the dominant stage. We apply our new variable voltage approach with fixed-size fragmentation to schedule the supply voltage for the processors at each stage. The result is shown in Table 2, where the number of fragments is calculated from (4) and the voltage and energy consumption are computed based on this best fixed-length fragmentation.
The traditional energy minimization tries to find the minimal supply voltage and then applies it to the processors at all stages to meet the deadline constraint. In this case, this voltage is that of stage 2. Table 3 compares the power consumption at each stage by our new approach vs. the traditional method. At both end hosts (stages 0 and 3), significant amounts of energy are saved due to the high transmission speed at these two stages. At stage 1, the energy reduction comes from its small overhead $g_1$.
Table 2: Optimal voltage scheme for Myrinet GAM pipeline.
Table 3: Energy reduction on Myrinet GAM pipeline.
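The comparison between per-stage voltages and a single common voltage can be sketched as follows. Everything numeric here is hypothetical and is NOT taken from Table 1; the delay model (delay proportional to V/(V - V_t)^2) and the energy ratio (proportional to V^2 for the same work) are assumptions for illustration, and the function names are ours. Only the threshold and reference voltages (0.8 V and 5 V) come from the text.

```python
import math

V_REF, V_T = 5.0, 0.8   # reference and threshold voltages from the text


def voltage_for_slowdown(r):
    """Supply voltage at which delay is r times the delay at V_REF.

    Assumed model: delay proportional to v / (v - V_T)**2. Solving
    s * (v - V_T)**2 = v with s = r * V_REF / (V_REF - V_T)**2 and taking
    the root above V_T gives the closed form below.
    """
    s = r * V_REF / (V_REF - V_T) ** 2
    return (2 * s * V_T + 1 + math.sqrt(4 * s * V_T + 1)) / (2 * s)


def fixed_voltage_scheme(B, T, g, t_ref, k):
    """Per-stage voltages for k equal fragments finishing exactly at T.

    g[j] is the per-fragment overhead and t_ref[j] the per-byte time at
    V_REF for stage j. Every fragment stays c = T / (k + n - 1) in each stage.
    """
    n = len(g)
    c = T / (k + n - 1)
    volts = []
    for g_j, t_j in zip(g, t_ref):
        per_byte = (c - g_j) * k / B       # per-byte time the stage may take
        if per_byte <= 0:
            raise ValueError("deadline too tight for this fragmentation")
        volts.append(voltage_for_slowdown(per_byte / t_j))
    return volts


if __name__ == "__main__":
    # Hypothetical parameters, NOT the values of Table 1.
    B, T = 4096, 400.0
    g = [1.0, 0.5, 3.0, 1.0]
    t_ref = [0.002, 0.004, 0.004, 0.002]
    volts = fixed_voltage_scheme(B, T, g, t_ref, k=17)
    v_max = max(volts)                     # voltage of the single-voltage baseline
    for j, v in enumerate(volts):
        print(f"stage {j}: {v:.2f} V, energy vs. baseline {(v / v_max) ** 2:.2f}")
```

With these made-up numbers, the stage with the largest overhead needs the highest voltage, and every other stage can run below it, which is the source of the savings over a single common voltage.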
Conclusion
In this paper, we address the problem of how to minimize the power consumption in system-level pipelines under latency constraints. In particular, we exploit advantages provided by variable voltage design methodology to optimally select the speed and therefore the voltage of each pipeline stage. We define the problem and solve it optimally under realistic and widely accepted assumptions. We apply the obtained theoretical results to develop algorithms for power minimization of computer and communication systems and show that significant power reduction is possible without additional latency.
