This paper presents a new technique for global energy optimization through coordinated functional partitioning and speed selection for embedded processors interconnected by a high-speed serial bus. Many such serial interfaces can operate at multiple speeds, which opens up a new dimension of trade-offs to complement today's CPU-centric voltage scaling techniques for processors. We propose a multi-dimensional dynamic program- [...] 1394) and USB are commonly used not only for peripheral devices but also for connecting embedded processors. Many have advocated high-speed, serial packet networks for system-on-chip for their compelling advantages, including modularity, composability, scalability, form factor, and power efficiency.
INTRODUCTION
A key trend in embedded systems is towards the use of high-..
ISSS'02, October

Approach
For a given workload on a networked architecture, our problem statement is to generate a functional partitioning scheme and to select the speeds of communication interfaces and processors, such that the total energy is minimized. In general, this problem is extremely difficult. Fortunately, for a class of systems with pipelined multiple processors under a latency constraint, efficient, exact solutions exist. We construct such a system model and formulate the energy consumed by the processors and communication interfaces with their power/speed scaling factors within their available time budget. In [8], we presented the schedulability conditions and the problem of communication speed selection and sketched solutions by exhaustive search. This paper combines communication speed selection with functional partitioning and presents an efficient multi-dimensional dynamic programming solution to minimize system energy. We demonstrate the effectiveness of this technique with an image processing algorithm mapped onto a pipelined multi-processor architecture interconnected by Gigabit Ethernet.
RELATED WORK
Previous works have explored communication synthesis and optimization in distributed multi-processor systems. [13] presents communication scheduling to work with rate-monotonic tasks, while [5] assumes the more deterministic time-triggered protocol (TTP). [10] distributes timing constraints on communication among segments through priority assignment on serial buses (such as controller area network) and customization of device drivers. While these assume a bus or a network protocol, LYCOS [7] integrates the ability to select among several communication protocols (with different delays, data sizes, burstiness) into the main partitioning loop. These techniques do not specifically optimize for energy by exploiting the processors' voltage scaling capabilities [...] characteristics of the communication interfaces' power consumption.
We assume a processor's voltage-scaling characteristics can be expressed by a scaling function $\mathrm{Scale}_p$ that maps the CPU's frequency to its power level. A communication interface also has scaling functions $\mathrm{Scale}_s$ and $\mathrm{Scale}_r$ for sending and receiving. (2) implies $\mathrm{Scale}_p$ is continuous, while communication interfaces support only a few discrete scaling points. Let $P_p$, $P_r$, and $P_s$ denote the power for the processor, receiving, and sending, respectively. Then,
$P_p = \mathrm{Scale}_p(F_p); \quad P_r = \mathrm{Scale}_r(F_r); \quad P_s = \mathrm{Scale}_s(F_s)$ (3)
Let $P_{ovh}$ denote the power overhead associated with adding a node to the system. It captures the power of the memory, the minimum power of the CPU and communication interface, the CPU's power during RECV and SEND (DMA), and the communication interfaces' power during PROC.
The energy consumption of a task is the power-delay product. Let $E_p$, $E_r$, $E_s$, and $E_{ovh}$ denote the energy consumption of tasks PROC, RECV, SEND, and the overhead of a node, respectively. Let $E_{N_i}$ denote the total energy of node $N_i$. Finally, the total energy of the system is the sum of the energy consumption on each node. To summarize, the energy of a node is the sum of its task energies and its overhead energy, and the system energy is the sum over all nodes.
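Written out under the definitions above, the summary takes roughly the following form. This is a sketch rather than the original numbered equations: the task durations $T_p$, $T_r$, $T_s$ and the symbol $E_{sys}$ are assumed names, and charging the overhead power over the full stage delay $D$ is an assumption.

```latex
\[
E_p = P_p\,T_p, \qquad E_r = P_r\,T_r, \qquad E_s = P_s\,T_s, \qquad E_{ovh} = P_{ovh}\,D
\]
\[
E_{N_i} = E_p + E_r + E_s + E_{ovh}, \qquad
E_{sys} = \sum_{i=1}^{M} E_{N_i}
\]
```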
Fig. 2 shows an example of a three-node pipeline. For brevity, the overhead is not shown. Fig. 2(b) shows the pipelined timing diagram by folding the tasks in Fig. 2(a) into a common interval with duration $D$, which is the delay of each pipeline stage. [8] presented the schedulability conditions for an M-node pipeline based on collision and utilization of the shared communication medium.
An M-node pipeline can be partitioned and mapped onto an 
MOTIVATING EXAMPLE
Task to Node Mapping
Given the decomposition of the ATR algorithm into five stages, several partitioning schemes are possible for mapping them onto a number of pipelined nodes. Fig. 4 shows an example of mapping the first two stages onto (a) two nodes and (b) one node. In Fig. 4(a), mapping onto two nodes N1 and N2 enables both processors to operate at a reduced speed (300MHz) for computation. The two nodes together consume lower computation energy than one node at a faster speed but must pay the price of communication energy for SEND1 → RECV2. In Fig. 4(b
Speed Selection for CPU and Communication
The selection of communication speed is an equally critical issue. For example, a 10/100/1000Base-T Ethernet interface can consume more power than a CPU at high (100/1000Mbps) speeds, but less power at the slower 10Mbps data rate. In Fig. 4(b), the processor must operate at a high clock rate due to the low-speed communication at 10Mbps. Because of the deadline D, communication
The communication-computation interaction becomes more intricate in a multi-processor environment. Any data dependency between different nodes must involve their communication interfaces. The communication speed of a sender not only determines the receiver's communication speed but also influences the choice of the receiver's computation speed. The communication speed on the first node of the pipeline therefore has a chain effect on all other nodes in the system. A locally optimal speed for the first node will not necessarily lead to a globally optimal solution.
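The time-budget reasoning behind this interaction can be sketched in a few lines. The sketch assumes that RECV, PROC, and SEND run back to back within one stage of duration D; the function name `required_cpu_hz`, the cycle count, and the 20 ms deadline are hypothetical, and only the 128K-bit input and 14K-bit output sizes come from the experiments.

```python
def required_cpu_hz(cycles, recv_bits, send_bits, recv_bps, send_bps, deadline_s):
    """Minimum CPU frequency that fits RECV + PROC + SEND into one stage.

    Returns None when communication alone already exceeds the deadline.
    """
    comm_time = recv_bits / recv_bps + send_bits / send_bps
    slack = deadline_s - comm_time          # time left for computation
    if slack <= 0:
        return None
    return cycles / slack


# Hypothetical workload: 2e6 cycles per frame and a 20 ms stage deadline.
for mbps in (10, 100, 1000):
    f = required_cpu_hz(2e6, 128_000, 14_000, mbps * 1e6, mbps * 1e6, 0.020)
    label = "infeasible" if f is None else f"{f / 1e6:.1f} MHz"
    print(f"{mbps:>4} Mbps link -> CPU needs {label}")
```

A slower link leaves less slack, so the required clock rate rises, and once the slack reaches zero no CPU speed can meet the deadline.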
Combining Partitioning and Speed Selection
Given a fixed partitioning scheme, the designers can always find the corresponding optimal speed setting that minimizes energy for that scheme. However, energy-optimal speed selection for a partitioning is not necessarily optimal over all partitionings. Instead, partitioning and speed selection are mutually enabling. In this paper, we take a multi-dimensional optimization approach that considers performance requirements, schedulability, load balancing, communication-computation trade-offs, and multi-processor overhead in a system-level context.
PROBLEM FORMULATION
Given an M-node pipeline, choices of partitioning and communication speed settings lead to different levels of energy consumption at the system level. This section formulates three energy minimization problems: by partitioning, by communication speed selection, and by both. In the first two problems, the optimal solution can be obtained by dynamic programming, and the combined optimization problem can be solved by multi-dimensional dynamic programming.
Problem 1 (Optimal Partitioning) Given [...] To avoid exhaustive enumeration in the $O(2^{M-1})$ solution space, we construct a series of optimal solutions to sub-problems by mapping the original M nodes one by one onto new sub-partitionings. We compute the optimal cost function in terms of the minimum energy consumption over the sub-partitionings. Upon mapping each node, the new optimal sub-solution can be computed from past optimal sub-solutions. Therefore, a dynamic programming approach is applicable.
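A minimal sketch of such a dynamic program follows, assuming that each new node merges a contiguous run of original nodes and that a caller-supplied `node_energy(a, b)` returns the energy of a new node covering original nodes a..b (or infinity if that node cannot meet the deadline). The function names, the `WORK` loads, and the quadratic cost model are hypothetical illustrations, not the paper's energy model.

```python
import math


def optimal_partition(m, node_energy):
    """Return (min energy, groups) for mapping m pipeline nodes onto merged nodes.

    E[i][j] is the minimum energy of mapping the first j original nodes
    onto i new nodes; cut[i][j] remembers where the last new node starts.
    """
    INF = math.inf
    E = [[INF] * (m + 1) for _ in range(m + 1)]
    cut = [[0] * (m + 1) for _ in range(m + 1)]
    E[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(i, m + 1):
            for l in range(i - 1, j):        # last new node covers l+1..j
                cand = E[i - 1][l] + node_energy(l + 1, j)
                if cand < E[i][j]:
                    E[i][j], cut[i][j] = cand, l
    best_i = min(range(1, m + 1), key=lambda i: E[i][m])
    if E[best_i][m] == INF:
        return INF, []                       # no feasible partitioning
    groups, j = [], m
    for i in range(best_i, 0, -1):           # walk the cuts backwards
        l = cut[i][j]
        groups.append((l + 1, j))
        j = l
    return E[best_i][m], groups[::-1]


# Hypothetical cost: energy grows quadratically with merged load, plus a
# fixed per-node overhead.
WORK = [4, 1, 1, 4, 2]
OVERHEAD = 3


def demo_energy(a, b):
    return sum(WORK[a - 1:b]) ** 2 + OVERHEAD


energy, groups = optimal_partition(len(WORK), demo_energy)
print(energy, groups)
```

The triple loop evaluates O(M^3) candidates instead of enumerating all $2^{M-1}$ partitionings.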
For dynamic programming, we use an energy matrix $E$ to store the cost function. Each entry $E[i, j]$ indicates the minimum energy of a sub-problem that maps the first $j$ original nodes $N_1, N_2, \ldots, N_j$ onto a new sub-partitioning with $i$ nodes $N'_1, N'_2, \ldots, N'_i$. Matrix $E$ is initialized to $\infty$. [...] node $N'_i$. Its energy is denoted as $E$ [...]. Since $E[i, j]$ is the optimal energy for the sub-problem, it must be the minimum value of (7) among all possible choices of $l$. The dynamic programming algorithm can iterate (7) [...]
Due to the inter-dependency between speed setting and partitioning, the optimal solution cannot be achieved by solving the two previous problems individually. Exhaustively enumerating over one [...]
We propose a multi-dimensional dynamic programming algorithm, given the fact that the previous two problems can be solved by dynamic programming independently. Based on the previous two dynamic programming approaches, the energy matrix $E$ for the combined problem is defined as follows: each element $E[i, j, p]$ stores the minimum energy of a sub-problem that maps the first $j$ original nodes $N_1, N_2, \ldots, N_j$ onto a new $i$-node sub-partitioning, whose last node $N'_i$ has sending speed $F'_s = F_p$,
The optimal-energy E[i, 
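A sketch of a three-index dynamic program in this spirit is given below. It is not a transcription of the recurrence above: it assumes that a receiver runs its link at the sender's speed and that the speed of the external input link is also a free choice. The callback `node_energy(a, b, recv_speed, send_speed)` and the cost tables `RECV_COST` and `SEND_COST` are hypothetical.

```python
import math


def optimal_partition_and_speeds(m, speeds, node_energy):
    """Minimum energy over all partitionings and link-speed assignments.

    E[i][j][p] is the minimum energy of mapping the first j original nodes
    onto i new nodes whose last node sends at speeds[p].
    """
    S = len(speeds)
    INF = math.inf
    E = [[[INF] * S for _ in range(m + 1)] for _ in range(m + 1)]
    for q in range(S):
        E[0][0][q] = 0.0     # q: speed at which the external input arrives
    for i in range(1, m + 1):
        for j in range(i, m + 1):
            for p in range(S):
                # The last new node covers l+1..j, receives at speeds[q]
                # (its predecessor's sending speed) and sends at speeds[p].
                E[i][j][p] = min(
                    E[i - 1][l][q] + node_energy(l + 1, j, speeds[q], speeds[p])
                    for l in range(i - 1, j)
                    for q in range(S)
                )
    return min(E[i][m][p] for i in range(1, m + 1) for p in range(S))


# Hypothetical cost tables in which receiving is cheap at speed 1 and
# sending is cheap at speed 2, so the best assignment mixes speeds.
WORK = [4, 1, 1, 4, 2]
RECV_COST = {1: 1, 2: 3}
SEND_COST = {1: 5, 2: 2}


def demo_energy(a, b, recv_speed, send_speed):
    load = sum(WORK[a - 1:b])
    return load ** 2 + 3 + RECV_COST[recv_speed] + SEND_COST[send_speed]


best = optimal_partition_and_speeds(len(WORK), (1, 2), demo_energy)
print(best)
```

Carrying the last node's sending speed in the state is what lets the choice made for one node constrain the next node without enumerating every speed combination.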
EXPERIMENTAL RESULTS
To evaluate our energy optimization techniques, we experiment with mapping the ATR algorithm onto two fixed partitioning schemes: (a) a single node that combines all blocks, and (b) a five-node pipeline that maps each block onto an individual node (Fig. 6). The input data size is 128K bits, and the output is 14K bits per frame. In scheme (a), the single node combines all the workload of the five nodes in (b), and it eliminates all internal communication instances between nodes in (b). (a) and (b) are two extremes representing serial vs. parallel schemes. For both (a) and (b) we apply optimal speed selection. We also find the optimal partitioning with speed selection as (c) and compare its energy consumption per image frame with ([...] Fig. 7), and an LXT-1000 Ethernet interface [1] with power levels of 0.8W @ 10Mbps, 1.5W @ 100Mbps, and 6W @ 1000Mbps (Fig. 8).
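The power levels quoted above translate directly into energy per bit: 80, 15, and 6 nJ per bit at 10, 100, and 1000Mbps respectively. The short sketch below reproduces that arithmetic and the time and energy needed to move one input frame. It assumes "128K bits" means 128 x 1000 bits, and the helper names are hypothetical.

```python
# Power levels quoted for the Ethernet interface: link rate in Mbps -> watts.
POWER_W = {10: 0.8, 100: 1.5, 1000: 6.0}
INPUT_BITS = 128_000  # assumption: "128K bits" read as 128 x 1000


def energy_per_bit_nj(rate_mbps):
    """Energy per transferred bit in nanojoules: power divided by bit rate."""
    return POWER_W[rate_mbps] / (rate_mbps * 1e6) * 1e9


def transfer(rate_mbps, bits):
    """Return (seconds, joules) needed to move `bits` at the given rate."""
    seconds = bits / (rate_mbps * 1e6)
    return seconds, POWER_W[rate_mbps] * seconds


for rate in sorted(POWER_W):
    t, e = transfer(rate, INPUT_BITS)
    print(f"{rate:>4} Mbps: {energy_per_bit_nj(rate):6.2f} nJ/bit, "
          f"{t * 1e3:7.3f} ms and {e * 1e3:6.3f} mJ per input frame")
```

For this interface, the fastest setting is both the cheapest per bit and the quickest, which is the combination the results rely on.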
We assume each node has a constant power draw $P_{ovh}$ = 100mW. The results are presented in Fig. 9. In all cases, 1000Mbps is the optimal speed setting for communication. The low-power, 10Mbps communication speed results in the highest energy. This is because it leaves so little time for computation that the processors must run faster, with more energy, to meet the deadline, and because it has the highest energy-per-bit rating. The low-speed communication also tends to violate the schedulability conditions [8]. Given the properties of this particular Ethernet interface, 1000Mbps communication will always lead to the lowest energy consumption, since it requires the least energy per bit and leaves the maximum time budget for reducing CPU energy. However, in cases where the energy-per-bit rating does not decrease monotonically with the communication speed, the optimal speed setting may involve some combination of low-speed and high-speed settings between different nodes. For example, the node $N_i$ may communicate with $N_{i-1}$ at 1000Mbps and with $N_{i+1}$ at 100Mbps.
Fig. 9(1) shows the energy consumption of all three partitioning schemes under a tight performance constraint. The single node (a) is heavily loaded with computation. Therefore it is desirable to reduce CPU energy by pipelining. As a result, the five-node pipeline (b) is more energy-efficient, at the cost of additional communication and overhead. However, the optimal partitioning is (c) with three nodes: [N1, N2], [N3, N4], [N5]. It consumes more CPU energy than (b), but overall it is optimal, with less energy spent on communication and overhead. In the case of the moderate performance constraint (Fig. 9(2)), (a) is still dominated by computation but it is not heavily loaded due to the relaxed deadline. The reduction of CPU energy by (b) cannot compensate for the added overhead of new nodes and communication. Therefore (a) is better than (b) and pipelining seems inefficient.
However, the optimal partitioning (c) is still a pipelined solution. It combines N1, N2, N3, N4 into one node and maps N5 to another node. (c) achieves minimum energy by appropriately balancing computation and communication against pipelining overhead.
If the performance constraint is further relaxed, the serial solution (a) will become optimal.
CONCLUSION
We present an energy optimization technique for networked embedded processors and emerging system-on-chip architectures with high-speed on-chip networks. We exploit the multi-speed feature of modern high-speed communication interfaces as an effective way to complement and enhance today's CPU-centric power optimization approaches. In such systems, communication and computation compete over opportunities for operating at the most energy-efficient points. It is critical not only to balance the load among processors by functional partitioning, but also to balance the speeds between communication and computation on each node and across the whole system. Our multi-dimensional dynamic programming formulation is exact and produces the energy-optimal solution as defined by a partitioning scheme and the speed selections for all computation and communication tasks. We expect this technique to be applicable to a large class of data-dominated systems that can be structured in a pipelined organization.
