Abstract-The performance of many modern computer and communication systems is dictated by the latency of communication pipelines. At the same time, power/energy consumption is often another limiting factor in many portable systems. We address the problem of how to minimize the power consumption in system-level pipelines under latency constraints. In particular, we apply a fragmentation technique to achieve parallelism and exploit the advantages provided by the variable voltage design methodology to optimally select the voltage, and therefore the speed, of each pipeline stage. We focus our study on the practical case when each pipeline stage operates at a fixed speed. Unlike the conventional pipeline system where all stages run at the same speed, our system may have different stages running at different speeds to conserve energy while providing guaranteed latency. For a given latency requirement, we find explicit solutions for the most energy-efficient fragmentation and voltage setting. We further study a less practical case when each stage can dynamically change its speed to obtain further energy savings. We define the problem and transform it to a non-linear system whose solution provides a lower bound for energy consumption. We apply the obtained theoretical results to develop algorithms for power/energy minimization of computer and communication systems. The experimental results suggest that significant power/energy reduction is possible without additional latency. In fact, we achieve almost 40% total energy saving over the combined minimal supply voltage selection and system shut-down technique, and 85% if neither of these two energy minimization methods is used.
I. Introduction
SYSTEM level pipelines are widely acknowledged as the most likely bottleneck of many computer systems. For example, a read miss in the system data or instruction cache will block the application program until the entire block with the requested data arrives [1], [23]. The trade-off is clear: longer blocks imply fewer misses, but also longer interrupt latency. Similarly, in high-speed local and wide-area networks, selecting a proper block size to exploit the intrinsic concurrency in communication pipelines is a key issue [7], [27]. As a final example where communication pipelines dictate performance, we mention path-oriented operating systems [16]. Therefore, it is not surprising that the question of how to improve the performance of a system pipeline has recently received a great deal of attention in the computer architecture, operating systems, and compilers communities. The essence of the problem is abstracted in a recent work [24], which discusses how to minimize the transmission latency by careful packet fragmentation.
On the other hand, the increasing use of portable systems (such as personal computing devices, wireless communications and imaging systems) makes power consumption one of the primary circuit and system design goals. The most effective method to reduce power consumption is to lower the supply voltage level, which exploits the quadratic dependence of power on voltage [5]. However, reducing the supply voltage increases circuit delay and decreases the clock speed. The resulting processor core consumes lower average power at the cost of increased latency. Therefore, voltage reduction becomes less effective when tight deadlines are present.
Recent progress in power supply technology along with custom and commercial CMOS chips that are capable of operating reliably over a range of supply voltages makes it possible to build processor cores with supply voltages that can be varied at run time according to the application latency constraints [3] , [17] . The variable voltage processor core is capable of operating at different optimal points along the power and speed curve in order to achieve high energy efficiency. In particular, with multiple supply voltages on the chip, the processor core can use high voltage for applications with tight deadlines and keep the voltage low otherwise to reduce total energy consumption [3] , [22] .
In this paper, we address the energy minimization problem in system-level pipelines under latency constraints. We use the recent advances in power supply technologies and the variable voltage design methodology to choose a voltage profile for each pipeline stage which minimizes the energy consumption of the entire pipeline system. The rest of the paper is organized as follows. Section II describes the related work on communication pipelines and low power design techniques. In Section III, we discuss the pipeline and processor models and formulate the problem. In Sections IV and V, we solve the problem optimally in two cases: (i) each pipeline stage has a fixed voltage, which may vary from stage to stage; (ii) every stage can have variable supply voltages (detailed proofs, examples, and discussion can be found in the technical report [19]). We present the experimental results in Section VI, and Section VII concludes.
II. Related Work
The most relevant related work consists of efforts in communication pipeline design and evaluation, and in low power design techniques. In particular, within the former domain, fragmentation techniques for managing congestion control, packet buffering, and packet losses, and the optimization techniques for improving distributed file systems and high-speed local networks, are directly relevant. Within the latter, we focus our survey on system-level power minimization techniques and variable voltage techniques.
In the introduction, we already surveyed a number of communication pipeline systems and research efforts for latency optimization of these systems. It is important to note that many application-specific systems operate at the highest level of abstraction as processing pipelines on blocks of input (e.g., digital TV and audio and segmentation subsystems of communication devices). Fragmentation has been used in the design of the Internet for quite a long time. More recently, studies on how to exploit flexible block fragmentation to improve the performance of DEC workstations have also been conducted [12]. A more detailed survey of fragmentation techniques is given in [24].
Dynamically adapting the voltage, and therefore the clock frequency, to operate at the point of lowest power consumption for given temperature and process parameters was first proposed by Macken et al. [13], [15]. Later, [11], [26] described the implementation of several digital power supply controllers based on this idea. Several researchers have recently developed efficient DC-DC converters that allow the output voltage to be rapidly changed under external control [17], [21]. We mention that a dynamic voltage-scaled microprocessor system has been reported recently [3] and leave further discussion of variable voltage processors for the next section.
In the software world, there has also been recent research on scheduling strategies for adjusting CPU speed so as to reduce power consumption. For example, Weiser et al. [25] proposed an approach where time is divided into 10-50 ms intervals, and the CPU clock speed (and voltage) is adjusted by the task-level scheduler based on the processor utilization over the preceding interval. Govil, Chan and Wasserman [9] concluded that smoothing helps more than prediction in voltage changing. Yao, Demers, and Shenker [29] described an off-line minimum-energy schedule and an average rate heuristic for scheduling independent processes with deadlines, though under the assumptions that (i) the processor can change its speed arbitrarily, i.e., the changes are instantaneous with no physical bounds, and (ii) the jobs are preemptive with no preemption penalty. Qu [18] extended this by discussing both non-preemptive jobs on such an ideal variable speed processor and general jobs on real variable speed processors, where both the maximal/minimal voltage constraints and the limitation on speed changes are considered. A survey of system-level low power techniques can be found in [28], and energy-efficient microprocessor design has been discussed in [2], [8].
III. Background and Problem Formulation
In this section, we first describe the variable voltage processor and the store-and-forward pipelining network, then characterize the user packet and formulate the problem.
A. Variable Voltage Processor
The variable voltage is generated by DC-DC switching regulators. The time to reach steady state at a new voltage is normally not negligible. However, recent work on DC-DC converters allows the output voltage to be changed rapidly. For example, Burd et al. [3] implemented a microprocessor system that consists of a DC-DC switching regulator, an ARM V4 microprocessor, a bank of SRAM ICs, and an interface IC. The supply voltage and clock frequency can be dynamically varied from 1.2 V to 3.8 V within 70 µs with an energy efficiency of 0.54-5.6 mW/MIP.
To cope with the complexity of real variable voltage systems, there have been many efforts in the following two directions. On one hand, there have been many proposals and implementations of multiple supply voltage systems [6], [14], [20], [22]. These research groups have addressed the use of two or three discrete supply voltages. The idea is to switch among these simultaneously available voltages according to the processing load, computation requirement, latency constraint, etc. On the other hand, the ideal variable voltage system has also been studied theoretically [18], [29]. An ideal variable voltage processor can change its speed from 0 to ∞ instantaneously without any overhead. Such an ideal processor is not feasible, but the study of this model gives insight into the problem and, more importantly, provides a lower bound on the energy consumption achievable with variable voltage processors. Although there are no reported studies that take these overheads into consideration, there is evidence that this bound could be tight. First, Hong et al. [10] reported a task scheduling heuristic which, when applied to multimedia benchmarks, results in a total energy consumption only 1.5% higher on average than the lower bound obtained from the ideal case. Furthermore, Burd and Brodersen [4] discussed various design issues for dynamic voltage scaling systems. In their prototype design, a full-scale transition from 1.2 V to 3.8 V takes 26 µs and 6.5 µJ. They estimated a practical limit on the voltage change rate on the order of 5 V/µs, with the potential of going as high as 20 V/µs, for a 0.6 µm process. This will further reduce the transition time and energy.
With different supply voltages, the processor operates at different speeds, so the time and power consumed to execute the same task (or the same amount of computation) also differ. We adopt the following relationships among voltage, delay, power and energy [5]: suppose that with a 5-volt constant supply voltage, the processor finishes a task in time T(5) with power dissipation P(5). Then, with a supply voltage v, the processing time T(v), the power dissipation P(v), and the energy E(v) to complete the same task are given as follows:
where v t is the threshold voltage.
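A minimal sketch of such a voltage model follows. It assumes the widely used CMOS form in which delay scales as v/(v − v_t)^2 and the energy of a fixed task scales as v^2; this specific form and the function names are assumptions for illustration, chosen because they lead to a quadratic equation in v of the kind solved in Section IV. The 0.8 V threshold is the value used in Section VI.

```python
V_REF = 5.0  # reference supply voltage in volts
V_T = 0.8    # threshold voltage in volts (value used in Section VI)


def delay_ratio(v, v_t=V_T, v_ref=V_REF):
    """T(v) / T(5) under the assumed model: delay ~ v / (v - v_t)**2."""
    if v <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return (v / v_ref) * ((v_ref - v_t) / (v - v_t)) ** 2


def energy_ratio(v, v_ref=V_REF):
    """E(v) / E(5): energy of a fixed task scales quadratically with v."""
    return (v / v_ref) ** 2


def power_ratio(v, v_t=V_T, v_ref=V_REF):
    """P(v) / P(5), using energy = power * time for the same task."""
    return energy_ratio(v, v_ref) / delay_ratio(v, v_t, v_ref)


if __name__ == "__main__":
    for v in (5.0, 3.3, 2.5):
        print(v, delay_ratio(v), power_ratio(v), energy_ratio(v))
```

Under these assumptions, lowering the voltage always slows the processor down while reducing both power and energy, which is the trade-off the rest of the paper exploits.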
B. Pipeline Model
As proposed in [24] , we represent the network as a sequence of store-and-forward pipeline stages characterized by the following parameters:
• n: the number of pipeline stages.
• g j : the fixed per-fragment overhead for stage j.
• T j (5): the per-byte transmission time for stage j with the 5-volt reference supply voltage.
The fixed per-fragment overhead, g_j, can be considered as the context switch time and may vary from stage to stage. If none of the stages has overhead, the best strategy, as we will show, is to fragment the packet as finely as possible to exploit parallelism. T_j(5) is proportional to the inverse of the bandwidth of stage j with a 5-volt supply voltage. In the extreme case where no stage has a bandwidth limitation, the entire packet should be sent as a single fragment to avoid the per-fragment overhead and achieve the minimum latency.
At the sender's end, the packet is fragmented and sent to the first stage of the pipeline. A pipeline stage will start the transmission of a fragment as soon as it receives both the entire fragment from the previous stage and an acknowledgement from the next stage, which is sent when the next stage is ready for the reception of the current fragment. We refer to these as the rules for transmission. The transmission is completed when the receiver's end receives the last fragment of the packet.
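The timing implied by these rules can be sketched with a standard pipeline recurrence. This is a simplified illustration, not the model of [24]: the function name is hypothetical, and the acknowledgement from the next stage is assumed never to delay a sender, so a stage starts a fragment as soon as the fragment has fully arrived and the stage has finished the previous fragment.

```python
def completion_time(lifetimes):
    """Return the time at which the last fragment leaves the last stage.

    lifetimes[i][j] is the lifetime t_{i,j} of fragment i on stage j
    (per-fragment overhead plus transmission time).
    """
    k = len(lifetimes)       # number of fragments
    n = len(lifetimes[0])    # number of pipeline stages
    finish = [[0.0] * n for _ in range(k)]
    for i in range(k):
        for j in range(n):
            # Fragment i must have been fully received from stage j - 1.
            arrived = finish[i][j - 1] if j > 0 else 0.0
            # Stage j must have finished sending fragment i - 1.
            stage_free = finish[i - 1][j] if i > 0 else 0.0
            finish[i][j] = max(arrived, stage_free) + lifetimes[i][j]
    return finish[k - 1][n - 1]


if __name__ == "__main__":
    # Three fragments, four synchronized stages, lifetime 2 each.
    print(completion_time([[2.0] * 4 for _ in range(3)]))
```

When every fragment has the same lifetime c on every stage, the recurrence gives (n + k − 1)·c, the form of the total transmission time used in Section IV.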
C. Problem Formulation
Our objective is to minimize the power consumption for transmitting a packet through the network under the user-specified latency constraint. The two tools that we use to achieve this are packet fragmentation and supply voltage selection. The following variables are associated with the packet for the convenience of analysis:
• B : size of the entire packet.
• T : deadline to transmit the entire packet.
• k : the number of fragments.
• x i : size of the ith fragment (0 ≤ i ≤ k − 1).
• t i,j : (life)time that the ith fragment stays on the jth stage.
The packet's size B and the deadline T are given by the user; the network is characterized by the number of pipeline stages n, the overhead g_j, and the unit transmission time T_j(5) for each pipeline stage. We further assume that the processors at all stages are identical (i.e., with the same T(v), P(v) and E(v)). A fragment's lifetime on a stage is the sum of the per-fragment overhead and the actual transmission time on this stage. Let v_j(t) be the voltage at which the jth processor operates at time t ∈ [0, T]; then the processor's energy consumption is:
$$E_j = \int_0^T P(v_j(t))\,dt \tag{4}$$
where P(v) is the power dissipation at supply voltage v. We want to minimize the total energy $E = \sum_{j=0}^{n-1} E_j$ subject to the deadline T.
We explain our approach and give the main results with proof sketches in the following sections; interested readers can find detailed proofs, examples, and discussion in the technical report [19].
IV. Fixed Voltage within the Same Stage
We first consider a simple case when the processor on each stage operates at a fixed voltage, but the voltages can be different from stage to stage. It is important to study this case because of the extreme simplicity of implementation. Since each processor will operate at a constant supply voltage, no additional hardware is required. Once the voltage level for each processor is determined, the pipeline can be easily set up by applying the required voltages to the corresponding processors. The voltage scheme problem is reduced to finding the constant voltage v_j for the processor at stage j. The energy consumption on this stage, from (4), is simplified to E_j = P(v_j)T. Moreover, the lifetime that the ith fragment (with size x_i) stays on the jth stage can be expressed as:
$$t_{i,j} = g_j + T_j(v_j)\,x_i \tag{5}$$
Lemma 4.1 A necessary condition for the energy consumption to be minimized is to finish the transmission exactly at the deadline T .
[Sketch of the proof:] Suppose that we have a packet fragmentation and a voltage scheme {v_0, v_1, · · · , v_{n−1}}, where v_j is the constant voltage at which the processor on the jth pipeline stage operates, such that the last fragment leaves the last stage before the deadline T. We show that this cannot be optimal by constructing another voltage scheme on the same packet fragmentation that consumes less energy.
We consider the voltage scheme {v_0, v_1, · · · , v'_{n−1}}, where v'_{n−1} < v_{n−1} is the reduced voltage on the last stage such that the transmission completes exactly at the deadline T. This clearly consumes less energy. However, we still need to verify that the new voltage scheme does not violate the rules for transmission. (i) With a lower voltage and hence a slower transmission speed, each fragment spends more time on the last stage. Therefore, the starting transmission time of each fragment is no earlier than its original starting transmission time at v_{n−1}. This implies that we will not start transmitting a fragment that has not yet arrived. (ii) The transmission cannot start until the next stage is ready for the reception of a new fragment. The slow-down of the last stage will delay the transmission of previous stages, but this delay will not be longer than the delay on the last stage caused by the lower voltage, and therefore the deadline T will not be missed. Finally, the new voltage v'_{n−1} on the last stage can be easily determined. Let t be the time that the first fragment arrives at the last stage; then the total transmission time on this stage will be T − t − k · g_{n−1} (if there is no starving), where k is the number of fragments and g_{n−1} is the per-fragment overhead. For a packet of size B, we select v'_{n−1} such that the per-byte transmission time becomes exactly
$$T_{n-1}(v'_{n-1}) = \frac{T - t - k\,g_{n-1}}{B}.$$
Q.E.D.
Intuitively, Lemma 4.1 says that the pipeline will use as much time as possible for transmission so that the processors can be scheduled at low voltages, thus minimizing energy consumption. On the other hand, on each single stage, the best strategy is to transmit a fragment immediately upon its reception and the completion of sending the previous fragment. This implies that the voltages should be adjusted such that all stages are synchronized, and leads to the following lemma.
Lemma 4.2 For an equal-sized fragmentation with fragment size x, the optimal voltage scheme {v_0, v_1, · · · , v_{n−1}} satisfies, for all stages j,
$$g_j + T_j(v_j)\,x = \text{constant}. \tag{6}$$
When we restrict fragmentation to be equal-sized, t i,j = g j + T j (v j )x i = constant for all i when j is fixed, i.e., equal lifetime for all fragments on the same stage. We will show that these constants are the same for all stages by contradiction.
Suppose the t_{i,j}'s are not the same; then there exists l ∈ [0, n − 1] such that either t_{i,l} < t_{i,l−1}, or t_{i,l} < t_{i,l+1}, or both. We can reduce the supply voltage on the lth stage and construct a better solution with less energy consumption. In fact, such a solution can be found in four steps:
(iii) Make appropriate changes on the stages after the lth stage because of the delay of fragments by T l (v l ).
(iv) Modify the voltage schemes to fit the deadline T .
The new solution will consume less energy. Therefore, any strategy with different t i,j 's cannot be optimal.
Q.E.D.
From (6), the processor at the stage that has the largest per-fragment overhead must operate at a high voltage to achieve a short per-byte transmission time T_j(v_j). Therefore, this stage will consume more energy than the other stages, and we call such a stage the dominant stage because it dominates the total energy consumption.
Theorem 4.3 Let d be the dominant stage. For equal-sized fragmentation, the energy consumption is minimized when the number of fragments is
$$k = \sqrt{\frac{(n-1)\,T}{g_d}} - (n-1) \tag{7}$$
and each stage will operate at a fixed supply voltage that can be determined by Equation (6) with the constant on the r.h.s. equal to
$$\sqrt{\frac{g_d\,T}{n-1}}.$$
[Sketch of the proof:] Let k be the number of fragments, x = B/k be the size of each fragment for a packet of size B, and {v_0, v_1, · · · , v_{n−1}} be an optimal voltage scheme. The time to transmit the entire packet is:
$$\sum_{j=0}^{n-1} t_{0,j} + \sum_{i=1}^{k-1} t_{i,n-1}.$$
The first term is the time for the first fragment to travel through the entire pipeline. From Lemma 4.2, it equals n[g_j + T_j(v_j)x] for any j. The second term is the time to send the rest of the packet from the last stage, which equals (k − 1)[g_j + T_j(v_j)x]. Lemma 4.1 requires their sum to be T for the energy to be minimized; therefore we have:
$$(n + k - 1)\left[g_j + T_j(v_j)\,\frac{B}{k}\right] = T \tag{8}$$
T j (v j ) can be easily solved in terms of k from (8) and considering (1), we get:
Solving this quadratic equation gives us, for a given number of fragments k:
where
Next we plug the values of the v_j's into Equation (3) and get the total energy consumption, which is expressed in terms of k. Since the energy is dominated by stage d and low voltage results in low energy, we find the optimal scheme by taking the first derivative of v_d with respect to k and setting it to 0, which yields the unique solution (7).
It follows from (8) immediately that the constant on the r.h.s. of (6) is
$$\frac{T}{n + k - 1},$$
with k given by (7). The voltage level on each stage can be easily determined from Equation (6). Q.E.D.
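The steps of this proof sketch can be illustrated numerically. The sketch below is a reconstruction under stated assumptions rather than the authors' code: it assumes the delay model T(v)/T(5) = (v/5)·((5 − v_t)/(v − v_t))^2, for which solving for v is a quadratic, and a closed form for k obtained by maximizing the per-byte time budget k·(T/(n + k − 1) − g_d)/B that follows from the synchronized-pipeline timing. All function names and the numbers in the example are hypothetical.

```python
import math

V_REF, V_T = 5.0, 0.8  # volts; threshold value from Section VI


def optimal_fragments(T, n, g_d):
    """Real-valued k that maximizes k * (T / (n + k - 1) - g_d)."""
    return math.sqrt((n - 1) * T / g_d) - (n - 1)


def stage_constant(T, n, k):
    """Common fragment lifetime c, from (n + k - 1) * c = T."""
    return T / (n + k - 1)


def voltage_for_delay_ratio(r, v_t=V_T, v_ref=V_REF):
    """Solve (v / v_ref) * ((v_ref - v_t) / (v - v_t))**2 = r for v > v_t.

    Rearranged: a*v**2 - (2*a*v_t + b)*v + a*v_t**2 = 0 with
    a = r * v_ref and b = (v_ref - v_t)**2; the larger root exceeds v_t.
    """
    a = r * v_ref
    b = (v_ref - v_t) ** 2
    p = 2.0 * a * v_t + b
    disc = p * p - 4.0 * a * a * v_t * v_t
    return (p + math.sqrt(disc)) / (2.0 * a)


def stage_voltages(T, B, k, g, t5):
    """Voltage of each stage so that every fragment lifetime equals c."""
    c = stage_constant(T, len(g), k)
    x = B / k  # equal fragment size
    voltages = []
    for g_j, t5_j in zip(g, t5):
        per_byte = (c - g_j) / x  # required per-byte transmission time
        if per_byte <= 0:
            raise ValueError("deadline too tight for this fragmentation")
        voltages.append(voltage_for_delay_ratio(per_byte / t5_j))
    return voltages


if __name__ == "__main__":
    # Hypothetical pipeline: times in microseconds, sizes in bytes.
    T, B = 1200.0, 9000.0
    g = [10.0, 20.0, 25.0, 10.0]      # per-fragment overheads
    t5 = [0.01, 0.03, 0.03, 0.01]     # per-byte times at 5 V
    k = round(optimal_fragments(T, len(g), g_d=25.0))
    print(k, stage_constant(T, len(g), k), stage_voltages(T, B, k, g, t5))
```

In practice k must be a positive integer, so the real-valued optimum is rounded; the dominant stage index is taken as an input here rather than derived.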
Remarks: How do the network's parameters and the latency affect the optimal scheme?
• T: When the latency constraint is loose (i.e., T is large), Equation (7) predicts more fragments. Energy consumption is reduced because each processor gets a long transmission time and thus can use a low voltage.
• n: From (7), we see k is an increasing function with respect to n, the number of pipeline stages. This means that the more stages in the network, the more fragments we should have. This takes advantage of the parallelism.
• g_d: If the per-fragment overhead at the energy dominating stage is high, fewer fragments should be used to avoid a large total overhead. If there is no overhead, then we should fragment the packet as finely as possible so that more parts of the packet can be transmitted in parallel.
• B: The number of fragments in the optimal scheme is independent of the packet size. However, B does play a very important role in the voltage scheme (10). This is not surprising, since we use the ideal variable voltage processor, which can adjust its speed (by changing supply voltage) according to the size of the packet.
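These trends can be checked with a short sketch. It assumes the closed form k = sqrt((n − 1)T/g_d) − (n − 1), which is what maximizing the dominant stage's per-byte time budget k·(T/(n + k − 1) − g_d) gives; the function names and numbers are hypothetical. Under this closed form, k grows with n only while T > 4(n − 1)g_d, which is worth keeping in mind for very deep pipelines or very tight deadlines.

```python
import math


def optimal_fragments(T, n, g_d):
    """Assumed closed form for the real-valued optimal fragment count."""
    return math.sqrt((n - 1) * T / g_d) - (n - 1)


def rounded_fragments(T, n, g_d):
    """Best integer k >= 1 among the two integers around the real optimum.

    The budget k * (T / (n + k - 1) - g_d) is concave in k, so checking
    the floor and the ceiling of the real optimum is sufficient.
    """
    def budget(k):
        return k * (T / (n + k - 1) - g_d)

    k_real = optimal_fragments(T, n, g_d)
    candidates = {max(1, math.floor(k_real)), max(1, math.ceil(k_real))}
    return max(candidates, key=budget)


if __name__ == "__main__":
    for T in (600.0, 1200.0, 2400.0):      # looser deadline, more fragments
        print("T", T, optimal_fragments(T, 4, 25.0))
    for g_d in (12.5, 25.0, 50.0):         # larger overhead, fewer fragments
        print("g_d", g_d, optimal_fragments(1200.0, 4, g_d))
    for n in (4, 5, 13, 14):               # growth in n stops near T = 4(n-1)g_d
        print("n", n, optimal_fragments(1200.0, n, 25.0))
```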
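To end this section, we show that Theorem 4.3 remains valid under the more general delay model in which T(v) is proportional to v/(v − v_t)^α (1 < α < 2; current technology has α around 1.5 or 1.6). Recall that Equation (8) is independent of the voltage-delay model. Equation (9) will be replaced by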
We rewrite this as
Differentiating both sides with respect to D, we get
Therefore, ∂v_d/∂k = 0 if and only if ∂D(k)/∂k = 0, and the latter gives formula (7).
Q.E.D.
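The independence from α can also be checked numerically. The sketch below assumes the general delay model T(v)/T(5) = (v/5)·((5 − v_t)/(v − v_t))^α, solves it for v by bisection (the ratio is decreasing in v for α ≥ 1), and searches for the integer k that minimizes the dominant stage's voltage. The function names and parameter values are hypothetical.

```python
def delay_ratio_alpha(v, alpha, v_t=0.8, v_ref=5.0):
    """T(v) / T(v_ref) under the assumed alpha-power delay model."""
    return (v / v_ref) * ((v_ref - v_t) / (v - v_t)) ** alpha


def voltage_for_delay_ratio(r, alpha, v_t=0.8, v_ref=5.0, iters=100):
    """Bisection for the voltage whose delay ratio equals r (r moderate)."""
    lo, hi = v_t, v_ref
    while delay_ratio_alpha(hi, alpha, v_t, v_ref) > r:
        hi *= 2.0  # widen the bracket until hi is fast enough
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if delay_ratio_alpha(mid, alpha, v_t, v_ref) > r:
            lo = mid  # too slow at mid: need a higher voltage
        else:
            hi = mid
    return hi


def dominant_voltage(k, alpha, T, B, n, g_d, t5_d):
    """Voltage of the dominant stage for k equal-sized fragments."""
    per_byte = (k / B) * (T / (n + k - 1) - g_d)
    return voltage_for_delay_ratio(per_byte / t5_d, alpha)


def best_k(alpha, T, B, n, g_d, t5_d, k_max=30):
    """Integer k in 1..k_max with the lowest dominant-stage voltage."""
    return min(range(1, k_max + 1),
               key=lambda k: dominant_voltage(k, alpha, T, B, n, g_d, t5_d))


if __name__ == "__main__":
    for alpha in (1.5, 1.6, 2.0):
        print(alpha, best_k(alpha, 1200.0, 9000.0, 4, 25.0, 0.03))
```

For these hypothetical parameters the same k is selected for every α, because the voltage is a decreasing function of the per-byte time budget regardless of the exponent.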
V. Variable Voltages within the Same Stage
We first explain how to transform the energy minimization problem to a non-linear system and then discuss implementation challenges for variable voltage on the same stage.
A solution to the Energy Minimization with Deadline on Variable Voltage Processor (EMDVVP) problem requires a supply voltage profile for each processor and a packet fragmentation. Suppose that there are n pipeline stages and the packet is cut into k fragments; an optimal solution to the general EMDVVP problem then consists of n voltage vs. time functions v_j(t) for j = 0, 1, · · · , n − 1 and the fragment sizes x_i for i = 0, 1, · · · , k − 1.
Power/energy is a convex function of the supply voltage, so an optimal voltage scheme will not change the voltage during the transmission of a fragment on a single stage. That is,
Lemma 5.1 In an optimal solution, the supply voltage changes either on the arrival of a new fragment or on the completion of transmission of the current fragment.
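The convexity argument can be illustrated with a small numerical sketch. It assumes the delay model T(v)/T(5) = (v/5)·((5 − v_t)/(v − v_t))^2 with energy proportional to v^2, and compares sending a fragment at one constant voltage against sending half of its bytes faster and half slower within the same total time. The function names and numbers are hypothetical.

```python
import math

V_REF, V_T = 5.0, 0.8  # volts; threshold value from Section VI


def voltage_for_delay_ratio(r):
    """Larger root of the quadratic obtained from the assumed delay model."""
    a = r * V_REF
    b = (V_REF - V_T) ** 2
    p = 2.0 * a * V_T + b
    return (p + math.sqrt(p * p - 4.0 * a * a * V_T * V_T)) / (2.0 * a)


def energy_ratio(v):
    """Energy per byte relative to 5 V, assumed proportional to v**2."""
    return (v / V_REF) ** 2


def constant_voltage_energy(r):
    """Energy when every byte is sent with delay ratio r."""
    return energy_ratio(voltage_for_delay_ratio(r))


def split_voltage_energy(r, delta):
    """Half the bytes at ratio r - delta, half at r + delta (same total time)."""
    fast = energy_ratio(voltage_for_delay_ratio(r - delta))
    slow = energy_ratio(voltage_for_delay_ratio(r + delta))
    return 0.5 * (fast + slow)


if __name__ == "__main__":
    print(constant_voltage_energy(2.5), split_voltage_energy(2.5, 1.0))
```

In this example the constant-voltage schedule uses less energy than the split schedule, consistent with Lemma 5.1.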
This outlines the shape of the voltage functions v_j(t), which are step functions whose only possible discontinuities are at the times when a new fragment arrives or the current fragment leaves. Therefore we only need to determine (k · n) constants v_{i,j} (i = 0, 1, · · · , k − 1; j = 0, 1, · · · , n − 1), the voltage at which processor j transmits fragment i. Lemma 4.2 synchronizes all processors on a fixed-length fragmentation such that no stage will congest or starve. We can generalize this to variable fragment sizes:
Lemma 5.2 The optimal voltage scheme, for a given fragmentation, provides the lifetime t_{i,j} of the ith fragment on the jth stage such that for all 0 ≤ i ≤ k − 2 and 0 ≤ j ≤ n − 2, the following holds:
$$t_{i+1,j} = t_{i,j+1} \tag{11}$$
This gives a recursive relationship among adjacent fragments' transmission times on adjacent stages. From the (n − 1)(k − 1) such recursive formulas in Equation (11), we can easily solve for (n − 1)(k − 1) of the v_{i,j}'s.
Finally, there are two global constraints: the transmission deadline T and the packet size B. Therefore, kn − (n − 1)(k − 1) − 1 = (n + k − 2) v_{i,j}'s and (k − 1) x_i's, a total of (n + 2k − 3) constants, remain to be determined. We express the total energy consumption in terms of these n + 2k − 3 variables from Equation (4), and the EMDVVP problem becomes equivalent to finding the minimum of this function. Applying the first-order condition, we obtain a non-linear system with (n + 2k − 3) variables, where the non-linearity comes from the nature of the power model.
Theorem 5.3 Given the number of fragments k, the EMDVVP problem with n pipeline stages is reduced to solving a non-linear system with n + 2k − 3 free variables.
Unlike the easy-to-implement pipeline systems with a fixed voltage on each stage, a system with variable voltage on the same stage introduces many implementation challenges: What is the most energy-efficient way to change the voltage? With a dynamically changing supply voltage (and therefore clock frequency), what is the system's performance? The extra hardware (e.g., the DC-DC switching regulator) that enables the variable voltage also consumes power; how should the solution change if we take this into consideration? Based on a simplified model, Qu [18] describes how to dynamically vary the voltage to minimize the energy for a given task. More practical multiple supply voltage systems have also been reported, for example, a dual supply voltage media processor, a graphic controller LSI, and an MPEG4 codec core. Many implementation issues (placement, routing, synchronization, etc.) and the empirical power reduction of such systems have also been addressed [19], [22].
VI. Simulation Results
In this section, we report the results of applying our new energy minimization approach to several pipeline models, in particular the Myrinet GAM pipeline that researchers at Berkeley adopted to study transmission latency minimization by variable-sized fragmentation [24].
The Myrinet GAM pipeline consists of four stages: stage 0 copies data on the sender host; stage 1 is the sender host DMA; stage 2 is an abstract pipeline stage of the network DMAs at both end hosts and a receiver host DMA; stage 3 is the copy on the receiver host. The parameters of this pipeline are given in Table I [24]. The second column is the per-fragment overhead, the third column is the per-kilobyte transmission time at the reference supply voltage, and the last column is the (normalized) reference power for each stage at the reference supply voltage. We further suppose there is a packet of fixed size being transmitted via this network with various user-specified latency constraints, and let the threshold and reference supply voltages be 0.8 volts and 5 volts respectively. We first determine the energy dominant stage. As we discussed in Section IV, energy consumption on each stage is determined by the supply voltage, which is proportional to T_j(5)/(C − g_j), where C is a stage-independent constant. (This is clear from Equation (10) and the expression of D_j(k) in the proof of Theorem 4.3.) Therefore, the larger the per-byte transmission time T_j(5) is, the more energy is consumed; the same holds for the per-fragment overhead g_j. In the Myrinet GAM pipeline, it is clear that stage 2 is the dominant stage because it has both the largest per-fragment overhead and the longest per-byte transmission time.
After identifying the energy dominant stage, we can apply Theorem 4.3 to decide the optimal packet fragmentation directly from Equation (7) which is reported in the second column of Table II . Then we can compute the constant on the right hand side of Equation (6) and calculate the supply voltage level for each stage from Equations (1) and (6) . Finally, the power consumption can be obtained from Equation (2) . We normalize it to the power consumption at the 5-volt reference supply voltage and details are shown in Table II .
To demonstrate the energy efficiency of the new approach, we compare the above result with the traditional energy minimization techniques, namely minimal supply voltage selection and system shut-down. We report our power/energy saving over these techniques in Table III. The minimal supply voltage selection method computes the minimal voltage that can meet the transmission deadline and applies it to the (fixed-voltage) processors on all stages. In this case, this voltage is the one that we use for stage 2. Columns 2-5 in the top half of Table III give the energy saving of the new approach over the best (voltage-) configured fixed-voltage system on each individual stage. Average power/energy reductions of 92.3%, 22.6%, and 90.7% are achieved on the three pipeline stages other than the dominant stage 2 (stages 0, 1, and 3, respectively). At both end hosts (stages 0 and 3), significant amounts of power/energy are saved because of the high transmission speed at these two stages (see Table I). Stage 1 has the same per-byte transmission time as stage 2; however, it has a smaller per-fragment overhead g_1, so we can lower the supply voltage (as shown in Table II), and this small difference in the per-fragment overhead results in a more than 22% power/energy saving. There is no saving from stage 2 because this approach uses the same voltage on stage 2 as our approach.
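The comparison against minimal supply voltage selection can be sketched as follows. This is an illustration under assumptions, not the simulation used for Table III: it assumes the power model P(v)/P(5) = (v/5)·((v − v_t)/(5 − v_t))^2, takes the uniform baseline voltage to be the dominant stage's voltage as stated above, and uses hypothetical per-stage voltages. Without shut-down each stage is powered for the whole interval T, so power and energy savings coincide.

```python
V_REF, V_T = 5.0, 0.8  # volts; threshold value from Section VI


def power_ratio(v):
    """P(v) / P(5) under the assumed CMOS power model."""
    return (v / V_REF) * ((v - V_T) / (V_REF - V_T)) ** 2


def saving_over_min_voltage(voltages):
    """Per-stage saving versus running every stage at the highest voltage.

    voltages[j] is the fixed voltage chosen for stage j; the baseline
    applies max(voltages), the dominant stage's voltage, to all stages.
    """
    base = power_ratio(max(voltages))
    return [1.0 - power_ratio(v) / base for v in voltages]


if __name__ == "__main__":
    # Hypothetical voltages for a four-stage pipeline, stage 2 dominant.
    for j, s in enumerate(saving_over_min_voltage([1.5, 2.6, 2.8, 1.5])):
        print(f"stage {j}: {100.0 * s:.1f}% saving")
```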
If we use systems with a fixed 5-volt supply voltage, the energy dominant stage 2 becomes the bottleneck, as it has the largest per-fragment overhead and the slowest transmission speed. For a tight 200 µs latency, it fails to meet the transmission deadline. The use of a variable voltage processor solves this problem, since we can speed up the bottleneck stage by applying a higher voltage, as indicated in Table II. Columns 2-5 in the bottom half of Table III show the power/energy saving for loose latency constraints. The average saving is almost 85%, and we save nearly 70% on the energy dominant stage.
The system shut-down technique shuts the system (or some components of the system) down when the system is idle to save energy. We compare our approach with an ideal system shut-down technique that shuts the system down whenever there is no processing load and turns it back on whenever necessary, with no overhead associated with shut-down and wake-up. Because energy consumption is the product of power and execution time, it becomes necessary to distinguish power and energy consumption when the system shut-down technique is applied. Basically, reducing the supply voltage reduces both power and energy consumption, but not at the same rate, since a low voltage results in a long execution time to complete the same amount of workload. In our simulation, we assume that the system shuts down to save energy when idle, either waiting for a packet from the previous stage or waiting for the acknowledgement from the next stage. In this case, our approach saves more than 85% energy on both end hosts and 56% and 50% respectively on stages 1 and 2. This gives a total energy saving of almost 70% over the fixed 5-volt system combined with the system shut-down technique. When both minimal supply voltage selection and system shut-down are applied, our approach achieves an average 39.3% energy saving. Detailed energy saving on each stage is reported in the right half of Table III.
Comparing the four blocks in Table III, one can see that both the minimal supply voltage selection and system shut-down techniques can reduce the system's power/energy consumption. Their combination, the best fixed-voltage system with shut-down in the top right block, is capable of eliminating more than half of the energy consumed by the fixed 5-volt system without shut-down. Our approach can save 39.3% more on top of this.
Furthermore, energy saving is mainly determined by g j and T j ; loose latency results in less energy saving over the minimal supply voltage selection method and more energy saving over the fixed 5-volt system.
We have constructed several other communication pipelines using the Hyper tool, and Table IV shows the pipeline parameters and our simulation results. The three systems, ADPCM, MC Unit, and RLS, have three, five, and three pipeline stages respectively. The energy dominant stages are marked in bold. The g_j and T_j columns are the same as before. The P_j column is the relative power consumption. We simulate the transmission under different latency constraints, and the optimal voltages for each pipeline stage are reported. The last two columns show the energy saving on each stage over a constant supply voltage without and with the system shut-down technique. On most stages, we see significant (close to or more than 90%) energy saving. The last row of each pipeline system gives the total energy saving over all the pipeline stages when the relative power consumption P_j is considered. For ADPCM, we are able to save 28% even when the system shut-down technique is applied. However, for the other two systems, the energy savings are less than 5%. The reason is that in these systems the energy dominant stages consume a large portion of the system's total energy, e.g., almost 97% on stage 2 in RLS. Therefore, our technique is especially efficient for pipeline systems where the non-dominant stages also contribute significantly to the total energy consumption.
VII. Conclusion
In this paper, we address the problem of how to minimize the power consumption in system-level pipelines under latency constraints. In particular, we exploit the advantages provided by the variable voltage design methodology to optimally select the speed, and therefore the voltage, of each pipeline stage. We define the problem and solve it optimally under realistic and widely accepted assumptions. We apply the obtained theoretical results to develop algorithms for power minimization of computer and communication systems. We discuss two specific cases in detail: (i) the packet has to be equally fragmented and the supply voltage on a stage cannot be changed; and (ii) both the fragment sizes and the voltages are variables. We derive an explicit formula for the first case and transform the latter into the problem of finding the minimum of a nonlinear function. The simulation with real-life pipeline parameters shows that even with the former approach, significant power reduction is possible without additional latency.
