Transmitting compressed data can reduce inter-processor communication traffic and create new opportunities for DVS (dynamic voltage scaling) in distributed embedded systems. However, data compression alone may not be effective unless coordinated with functional partitioning. This paper presents a dynamic programming technique that combines compression and functional partitioning to minimize energy on multiple voltage-scalable processors running pipelined data-regular applications under performance constraints. Our algorithm computes the optimal functional partitioning, CPU speed for each node, and their respective compression ratios. We validate the algorithm's effectiveness on a real distributed embedded system running an image processing algorithm.
Introduction
Dynamic voltage scaling (DVS) has been studied extensively as a power-saving technique for applications with slack. By lowering the voltage and slowing down the processor to fill the slack, one can potentially achieve quadratic energy savings in CMOS technologies. However, if the application does not have much slack to begin with (that is, if the processor is always around its peak utilization), then DVS alone will not achieve any savings. Instead, it is well known that by increasing parallelism, one can afford to slow down the clock and enable more voltage scaling opportunities without performance loss. When the workload is partitioned onto multiple processors, each processor is responsible for only a fraction of the workload and can afford to slow down by DVS to run at more power-efficient levels. This, of course, assumes that the application is parallelizable and that the architectural overhead of parallelism can be well amortized. In processor-based systems, having multiple processors means either shared memory or message-passing communication. This paper assumes message-passing communication for modularity and scalability reasons.
While distributed systems have many attractive properties, they pay a higher price for message-passing communication. Each node must now handle not only I/O with the external world, but also I/O on the internal network. Common communication interfaces such as RS-232 or Bluetooth are serial and relatively slow. As a result, even if the actual data workload is not large on an absolute scale, it appears expensive relative to the computation performance that can be delivered by today's low-power embedded microprocessors. Since I/O transactions always appear on the critical paths, in that they carry data dependencies between processors, they have become a limiting factor in exploiting DVS opportunities through parallelism.
Compression has been applied to saving energy and increasing effective bandwidth in many areas, ranging from telephone modems and faxes to caches and memories. By compression and decompression before and after communication transactions, it will be possible to save significant amounts of energy in communication. This may sound like an obvious idea, and in fact it has been used from modem standards to caches and memories. For the multi-processor, message-passing architecture studied in this paper, however, the trade-offs are not obvious and may even be counterintuitive. Compression can free up extra time budget by reducing the long communication delays in embedded systems. This extra time can be used either for higher performance or as additional DVS opportunities for energy savings. Different compression algorithms are available with different compression ratios, and even within an algorithm, it may be possible to set different target compression factors for both lossy and lossless algorithms. The compression algorithm chosen by a sender not only dictates the receiver's decompression algorithm, but also determines the receiver's I/O delay and CPU speed. Thus, it can have a global impact on the choices of compression algorithms and CPU clock rates with DVS on all communicating processors. The design space becomes even larger if we also consider multi-speed communication interfaces.
This research was sponsored in part by DARPA contract F33615-00-1-1719 and by the National Science Foundation under grant CCR-0205712.
The main challenge is that the selection of CPU speed, communication speed, and compression algorithms cannot be performed independently or greedily, because a local decision can have a global impact. The CPUs cannot all be run at the slowest, most power-efficient speeds, because they must compete for the available time and power with each other and with the communication interfaces. A high-ratio compression algorithm with time and power overhead may actually save energy by creating opportunities for voltage scaling on the processors. Greedily saving power for communication or computation may actually result in higher overall energy. At the same time, functional partitioning must be an integral part of the optimization loop, because different partitioning schemes can dramatically alter the communication and computation workload for each node. For a given workload on a networked architecture, our problem statement is to generate a functional partitioning scheme, select the corresponding compression/decompression algorithms, and select the speeds of the processors that perform the computation tasks and compression/decompression, such that the total energy is minimized. In general, this is an extremely difficult optimization problem. Fortunately, for a class of systems with pipelined communication patterns under a latency constraint, efficient, exact solutions exist. This paper constructs such a system model and formulates the energy consumed by communication, computation, compression, and decompression within their available time budget. We present an efficient multi-dimensional dynamic programming solution to minimize system energy. We demonstrate the effectiveness of this technique with an image processing algorithm mapped onto a fully implemented distributed embedded system.
Related Work
Besides the well-known DVS techniques, previous studies have also explored compression schemes for caches and memory buses to reduce energy in embedded processors. [1, 2] applied compression to reduce the code size and memory accesses for an SoC architecture. [3, 4] proposed bus encoding schemes to minimize the switching activity on the memory bus. These techniques often do not target inter-processor communication in multi-processor systems.
Many 
Running ATR on Itsy
The structure of the ATR algorithm is shown in Fig. 3 and Fig. 4. It performs four sequential processing stages on an image frame. We constructed a parallel version of the algorithm such that it can be mapped onto 1, 2, 3, or 4 Itsy nodes with pipelined communication patterns. Given a frame delay D as the performance constraint, the host computer provides one image and collects one result every D seconds. Pipelining allows each Itsy node to run at a lower frequency while maintaining the same throughput. However, communication between adjacent nodes costs additional time and power. Fig. 4 also summarizes the performance profile of ATR on Itsy. The performance degrades proportionally with the CPU clock frequency. The maximum data rate of the serial port is 115.2Kbps, though our measured data rate is 70-80Kbps over TCP/IP. Therefore, even though the raw data size is not large, communication still incurs long delays (e.g., 0.85s for 8K bytes). To reduce these long communication delays, we compress the data before transmission and decompress it afterwards using gzip on the host computer (source and sink of images) and on each of the Itsy nodes (image processing stages). Compression and decompression take less than 10ms. For brevity they are omitted in Fig. 4. Fig. 5 shows the measurement results of the current draw (in mA) over different speeds of an Itsy node running different tasks. The horizontal axis represents the frequency and voltage levels. Itsy has a 4V lithium-ion battery supply. Therefore, the curves refer to actual power consumption ranging from 100mW to 700mW. We refine the tasks of this multi-node ATR system as follows:
In Fig. 6(c), the idle period cannot be further utilized for DVS.
Many DVS studies have indicated that the three tasks DECO, PROC, and COMP must operate at the same CPU speed to achieve minimum energy, under the assumption that the CPU clock rate can be scaled continuously to fully utilize the slack time (idle period). However, this does not hold in reality, where the processor can operate only at discrete frequency levels. If the processor reduces its frequency further to the next level, e.g., 100MHz, it will fail to meet the timing constraint. The idle period represents the wasted (or fragmented) time budget that arises when DVS can be performed at only a few discrete frequencies.
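The fragmentation effect can be illustrated with a minimal sketch. The frequency levels, cycle count, and time budget below are hypothetical values chosen for illustration, not measurements from the Itsy platform. The function picks the slowest discrete level that still meets the deadline, and the remainder of the budget is the idle fragment that continuous scaling would have absorbed.

```python
def lowest_feasible_level(cycles, budget_s, levels_hz):
    """Return the slowest discrete frequency that finishes `cycles` within `budget_s`.

    Returns None when even the fastest level misses the deadline.
    """
    for f in sorted(levels_hz):
        if cycles / f <= budget_s:
            return f
    return None


# Hypothetical values: three discrete DVS levels, a 220M-cycle job, a 2s budget.
levels = [100e6, 150e6, 200e6]
cycles = 220e6
budget = 2.0

ideal_hz = cycles / budget                 # continuous scaling would pick 110MHz
chosen_hz = lowest_feasible_level(cycles, budget, levels)
idle_s = budget - cycles / chosen_hz       # fragmented slack left after the job

print(f"ideal {ideal_hz / 1e6:.0f}MHz, chosen {chosen_hz / 1e6:.0f}MHz, idle {idle_s:.2f}s")
```

In this toy case the 100MHz level would need 2.2s and misses the deadline, so the job runs at 150MHz and leaves part of the budget unused.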
Data Compression for Pipelined Nodes
Next, we map the ATR algorithm onto multiple pipelined nodes. The trade-offs between communication and computation with data compression discussed earlier for the single node are generally applicable to multiple pipelined nodes, too. With multiple nodes, network contention tends to have a greater impact on the entire system and therefore must be avoided through the selection of compression algorithms and partitioning schemes.
Compression Algorithm Selection in the Pipeline
Having a choice of compression algorithms adds a new dimension of communication-computation trade-offs in multiple processors. Selecting a compression algorithm for a sender forces the receiver to use the corresponding decompression algorithm, thereby affecting not only the receiver's communication delay but also the receiving node's CPU speed. The choice of the receiver's CPU speed could in turn affect the receiver's own compression algorithm and the subsequent nodes in a chain effect. A locally optimal choice for the first node will not necessarily lead to a globally optimal solution.
Partitioning with Compression
Data compression also affects the choice of partitioning scheme. This is primarily because different data do not compress equally well, even with the same algorithm. As an example, Fig. 7(a) shows the optimal partitioning scheme without data compression for two nodes, with the minimum internal communication payload (8.3KB). However, Fig. 7(a) is no longer optimal with data compression, because the internal data from N[2] to N[3] cannot be effectively compressed (8.3KB down to 7.4KB), and the relatively long communication delay limits DVS opportunities. The optimal partitioning scheme is shown in Fig. 7(b), obtained by remapping task N[2] to the second node. Although the raw data size from N[1] to N[2] is also 8.3KB, the data can be compressed very well (8.3KB down to 0.6KB) to effectively reduce the communication delay. As a result, both nodes are able to operate at much lower power levels with more energy savings, although the computation loads on the two nodes are more imbalanced compared to Fig. 7(a).
Tasks DECO and COMP dominate the power consumption. However, since their execution delays are short, task PROC is the primary energy consumer. SEND and RECV also take long delays, but their power levels are relatively low. We allow PROC, DECO, and COMP to operate at any CPU clock rate enabled by DVS. However, during communication (that is, SEND and RECV), we set the CPU speed to the lowest power state (0.919V at 59MHz), since there is no performance benefit to running the CPU faster during serial communication. When a node idles, we also set its CPU frequency to 59MHz. In addition, to avoid extra power draw from other components, we completely shut down unnecessary peripherals, including the LCD screen and the speaker, during all experiments.
Data Compression for One Node
... compression/decompression delays, then this new slack can be applied towards DVS at a much lower power level (150MHz). Compression/decompression could also allow the node to deliver higher performance with a reduced delay on its critical path.
The Impact of Different Compression Algorithms
Different compression algorithms can achieve different compression ratios over a given piece of raw data. Unlike Fig. 6(b), which uses only one compression algorithm, Fig. 6(c) applies alternative algorithms with higher compression ratios to further reduce communication delays. This creates DVS opportunities for reducing the CPU clock rate to 120MHz. Algorithms with higher compression ratios typically require more CPU cycles and therefore more energy, but if this overhead can be more than compensated by aggressive DVS, then (c) will consume less energy than (b).
Compression to Reduce Network Contention
Given the assumption of a shared communication medium, all communication transactions should be scheduled into different time slots. Since a transaction consists of a pair of send and receive tasks on neighboring nodes, they should be scheduled together. As an example, in Fig. 8(a), SEND[1] and RECV[2] should always occupy the same time slot. In the case of long communication delays, two different transactions such as RECV[1] and SEND[2] might overlap in time, causing network contention. If the network is over-saturated, one way to eliminate network contention is to increase the stage delay D, but this causes performance degradation.
Alternatively, data compression can reduce network utilization and eliminate the network contention while maintaining the same performance, as shown in Fig. 8(b). These trade-offs include the timing budget for both communication and computation, compression algorithm selection with DVS fragmentation, and compression algorithm selection with functional partitioning. We next formulate a multi-dimensional optimization approach to effectively minimize energy consumption for both communication and computation on all nodes.
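A necessary condition for a contention-free schedule on a shared medium is that the transaction delays within one stage period sum to no more than D. The sketch below checks this condition. The bandwidth is assumed to be the midpoint of the measured 70-80Kbps range, and the stage delay and payload sizes are hypothetical, apart from the 8.3KB to 0.6KB example taken from the text.

```python
def link_utilization(payload_bytes, bandwidth_bps, stage_delay_s):
    """Fraction of one stage period D during which the shared link is busy."""
    busy_s = sum(8 * b / bandwidth_bps for b in payload_bytes)
    return busy_s / stage_delay_s


BANDWIDTH_BPS = 75_000   # assumption: midpoint of the measured 70-80Kbps
STAGE_DELAY_S = 1.5      # hypothetical stage delay D

# One payload per transaction: host to node 1, node 1 to node 2, node 2 to host.
raw_bytes = [8000, 8300, 1000]         # hypothetical raw payloads
compressed_bytes = [4000, 600, 500]    # hypothetical payloads after compression

u_raw = link_utilization(raw_bytes, BANDWIDTH_BPS, STAGE_DELAY_S)
u_compressed = link_utilization(compressed_bytes, BANDWIDTH_BPS, STAGE_DELAY_S)

# Utilization above 1.0 means the transactions cannot all fit into disjoint slots.
print(f"raw: {u_raw:.2f}, compressed: {u_compressed:.2f}")
```

A utilization below 1.0 does not by itself guarantee a valid schedule, since data dependencies between tasks also constrain slot placement.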
System Model
This section defines a system-level performance/energy model of a distributed embedded system running an application with a natural pipelined organization. We first define the process-to-architecture mapping, followed by the associated cost functions.
Node
A node is a computer in our system. It consists of a processor, local memory, one or more communication interfaces, and optional compression and decompression units. A processing job assigned to a node is modeled in terms of five tasks, RECV, DECO, PROC, COMP, and SEND, that must be executed serially in this order. A node receives data by RECV and decompresses the data by DECO if necessary. Then task PROC produces the result, which can be compressed by COMP if necessary. Finally, the result is sent to the next node by SEND. Fig. 9 shows the timing vs. power diagram of a node. The total area of these five tasks plus the idle period represents the energy consumption of the corresponding node. Let $F_p$ denote the CPU clock frequency to perform task PROC, $F_r$ and $F_s$ the respective bandwidths for receiving and sending, and let $F_d$ and $F_c$ be the processing speeds of decompression and compression, performed by the processor or other hardware units. Let $P_p$, $P_r$, $P_s$, $P_d$, and $P_c$ denote the power levels of the tasks, $E_p$, $E_r$, $E_s$, $E_d$, and $E_c$ the energy consumption of the tasks, and $P_{idle}$ and $E_{idle}$ the power level and energy consumption of the idle period. Finally, let $E_N$ denote the energy consumption of a node. We have
(2) is a reasonable estimate for processing units executing data-dominated tasks, including PROC, DECO, and COMP, where the total cycles W can be analyzed and bounded statically. The communication bandwidth is normally less than the rated maximum data rate and can be measured or profiled.
A node can choose from a set $a_c[1 : C]$ of compression algorithms. The corresponding set of decompression algorithms $a_d[1 : C]$ must be used by the receiver to correctly recover the raw data. We denote a decompression and a compression algorithm as $A_d$ and $A_c$, respectively. ... directly related to the CPU frequencies. In addition, the tasks may consume different power levels even if they run on the same processor with the same clock rate. Therefore, the power levels $P_p$, $P_d$, and $P_c$ are also functions (lookup tables) of $F_p$, $F_d$, and $F_c$, rather than constant values. For example, the ATR algorithm's power profile on Itsy (Fig. 5) consists of multiple lookup tables. In this paper we omit the details of lookup tables to keep the notation concise.
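As a minimal sketch of the node model, the energy of a node can be computed as the area under its timing vs. power diagram: power times duration for each of the five tasks, plus the idle power over whatever remains of the stage delay D. The `Task` type, the function name, and all numeric values below are illustrative assumptions and do not come from the measured Itsy profile.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    power_w: float   # power level while the task runs
    time_s: float    # task duration


def node_energy(tasks, p_idle_w, stage_delay_s):
    """Energy of one node over one stage period: task areas plus the idle area."""
    busy_s = sum(t.time_s for t in tasks)
    if busy_s > stage_delay_s:
        raise ValueError("tasks do not fit within the stage delay D")
    idle_s = stage_delay_s - busy_s
    return sum(t.power_w * t.time_s for t in tasks) + p_idle_w * idle_s


# Hypothetical power levels and durations for the five serial tasks.
tasks = [
    Task("RECV", 0.1, 0.5),
    Task("DECO", 0.4, 0.01),
    Task("PROC", 0.3, 1.0),
    Task("COMP", 0.4, 0.01),
    Task("SEND", 0.1, 0.2),
]
e_node = node_energy(tasks, p_idle_w=0.1, stage_delay_s=2.0)
print(f"E_N = {e_node:.3f} J")
```

In a fuller model the power levels would be looked up from per-task tables indexed by CPU frequency, as the text describes, instead of being passed in as constants.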
M-node Pipeline
We consider a specialized organization, called an M-node pipeline, in which each node acts as a pipeline stage with delay D. Fig. 10 shows an example of a three-node pipeline. Fig. 10(b) shows the pipelined timing diagram obtained by folding the tasks in Fig. 10(a) into a common in-
Problem Formulation
In this section we formulate three energy minimization problems: (1) compression algorithm and CPU speed selection for one node, (2) compression algorithm and CPU speed selection for a pipeline with a fixed partitioning scheme, and (3) combined compression algorithm and CPU speed selection with functional partitioning for the pipeline. For all three problems, we assume that the delay, power, and compression ratios of all corresponding tasks are known as functions or look-up tables, and the details are omitted.
The incoming data $r_{raw}$ and the choice of the decompression algorithm $A_d$ (assuming the sender also agrees to compress the data with the corresponding compression algorithm) determine the energy consumption $E_r$ for RECV. Similarly, the outgoing data $s_{raw}$ and the choice of the compression algorithm $A_c$ determine the energy $E_s$ for SEND. For DECO and COMP, the decompression algorithm $A_d$, the incoming data $r_{raw}$, and the CPU speed $F_d$ determine $E_d$; and $A_c$, $s_{raw}$, and $F_c$ determine $E_c$. For PROC, $E_p$ depends only on $F_p$. The choices of $A_d$ and $A_c$ are independent for one node. So are the choices of $F_d$ and $F_c$, but together they determine $F_p$ due to the timing constraint D. Therefore, we must enumerate over C choices of both $A_d$ and $A_c$, and S choices of both $F_d$ and $F_c$, to find the minimum energy consumption.
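The one-node enumeration can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm listing: the power model `power_w`, the per-algorithm constant compression ratios and cycle costs, and every numeric value are hypothetical. It also assumes that the slowest feasible $F_p$ is the most energy-efficient choice, which holds for this power model because the idle power equals its static term.

```python
from itertools import product


def power_w(freq_hz):
    """Hypothetical CPU power model standing in for the measured lookup tables."""
    return 0.1 + 0.6 * (freq_hz / 200e6) ** 2


def opt_one_node(r_raw, s_raw, w_proc, algos, speeds, bw_bps, p_comm, p_idle, D):
    """Enumerate (A_d, A_c, F_d, F_c); F_p follows from the time left within D."""
    best = None
    for a_d, a_c, f_d, f_c in product(algos, algos, speeds, speeds):
        t_r = 8 * r_raw / a_d["ratio"] / bw_bps           # RECV delay
        t_s = 8 * s_raw / a_c["ratio"] / bw_bps           # SEND delay
        t_d = a_d["dec_cycles_per_byte"] * r_raw / f_d    # DECO delay
        t_c = a_c["comp_cycles_per_byte"] * s_raw / f_c   # COMP delay
        left = D - (t_r + t_d + t_c + t_s)
        feasible = [f for f in speeds if w_proc / f <= left]
        if not feasible:
            continue                                      # PROC cannot meet D
        f_p = min(feasible)
        t_p = w_proc / f_p
        energy = (p_comm * (t_r + t_s) + power_w(f_d) * t_d + power_w(f_c) * t_c
                  + power_w(f_p) * t_p + p_idle * (left - t_p))
        if best is None or energy < best[0]:
            best = (energy, a_d["name"], a_c["name"], f_d, f_c, f_p)
    return best


# Hypothetical algorithm table: no compression, and an LZ-style compressor.
algos = [
    {"name": "none", "ratio": 1.0, "dec_cycles_per_byte": 0, "comp_cycles_per_byte": 0},
    {"name": "lz", "ratio": 4.0, "dec_cycles_per_byte": 20, "comp_cycles_per_byte": 60},
]
args = dict(r_raw=8000, s_raw=8000, w_proc=200e6, speeds=[50e6, 100e6, 200e6],
            bw_bps=75_000, p_comm=0.1, p_idle=0.1, D=2.9)
best = opt_one_node(algos=algos, **args)
baseline = opt_one_node(algos=algos[:1], **args)   # compression disabled
print(best, baseline)
```

The outer loop visits $C^2 S^2$ combinations. The inner search for $F_p$ is a linear scan over the speed levels here; a precomputed table or binary search would avoid that extra factor.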
The algorithm shown in Fig. 11 has a runtime complexity of $O(C^2 S^2)$. It selects the optimal compression/decompression algorithms, combined with the optimal CPU speed settings, to overcome the DVS fragmentation problem. In reality, C and S are usually small integers ranging from 3 to 10. Therefore the runtime complexity of this algorithm is close to a constant.
In an M-node pipeline, there are M + 1 communication transactions that require a combination of M + 1 pairs of independent compression/decompression algorithms. The CPU speed selection is an $O(S^2)$ procedure to be performed on M nodes. Therefore, the overall enumeration space is $O(C^{M+1} S^{2M})$. Problem 1 becomes a special case when M = 1. We propose a dynamic programming solution to eliminate exhaustive enumeration. We construct a series of optimal solutions to the sub-problems by selecting the compression algorithm for one node at a time. We compute the optimal cost function in terms of the minimum energy consumption over the sub-problems. Upon selecting a compression algorithm for each node, the new optimal sub-solution can be computed from past optimal sub-solutions. Therefore, dynamic programming is applicable.
... (Fig. 12) as a special case. Therefore, we omit it for brevity. Its time complexity is $O(C^2 S^2 M)$, which is practically linear in M.
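One way to realize such a dynamic program for a fixed partitioning is a chain recurrence over the M + 1 links. The sketch below is an assumption about how the recurrence could be organized, not a reproduction of the omitted algorithm. The callback `node_cost(i, c_in, c_out)` is hypothetical: it stands for the minimum energy of node i, already minimized over its CPU speeds, when its incoming and outgoing links use algorithms `c_in` and `c_out`. The cost table is a toy example.

```python
def opt_pipeline(num_nodes, num_algos, node_cost):
    """Chain DP over link algorithms; returns (min energy, algorithm per link)."""
    best = [0.0] * num_algos          # input link: any algorithm, no cost yet
    back_ptrs = []
    for i in range(num_nodes):
        new_best, back = [], []
        for c_out in range(num_algos):
            costs = [best[c_in] + node_cost(i, c_in, c_out)
                     for c_in in range(num_algos)]
            c_star = min(range(num_algos), key=costs.__getitem__)
            new_best.append(costs[c_star])
            back.append(c_star)
        best = new_best
        back_ptrs.append(back)
    # Backtrack from the best algorithm on the last link to the first link.
    c = min(range(num_algos), key=best.__getitem__)
    total, links = best[c], [c]
    for back in reversed(back_ptrs):
        c = back[c]
        links.append(c)
    links.reverse()
    return total, links


# Toy energy table: TABLE[node][c_in][c_out], three nodes and two algorithms.
TABLE = [
    [[5, 4], [3, 6]],
    [[4, 2], [7, 2]],
    [[3, 5], [2, 4]],
]
total, links = opt_pipeline(3, 2, lambda i, c_in, c_out: TABLE[i][c_in][c_out])
print(total, links)
```

Each node contributes $C^2$ evaluations of `node_cost`. If each evaluation is an $O(S^2)$ speed search, the total work is $O(C^2 S^2 M)$, consistent with the complexity stated above.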
Find
We propose a two-dimensional dynamic programming algorithm, shown in Fig. 12, to solve this more complex problem, whose solution space is exponential in M. If we let k = i in this algorithm, the new partitioning is fixed to be the same as the original one, and the two loops over k and m (at lines 9 and 11) will be eliminated. Then, the same algorithm can solve the previous Problem 2 on a fixed partitioning. In reality, algorithms OPT-M and OPT-1 should also compute the optimal partitioning, decompression and compression algorithms, and CPU speeds for all nodes. Since it is not difficult to derive the optimal values of these parameters, we omit the details for brevity.
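A sketch of a dynamic program that also chooses the partition boundaries is given below. It is an illustration under assumptions and is not the algorithm of Fig. 12: the state layout, the hypothetical callback `seg_cost(k, i, c_in, c_out)` (minimum energy of one node running blocks k through i-1 with the given link algorithms), and the toy cost function are all invented for this example.

```python
def opt_partition(num_blocks, num_nodes, num_algos, seg_cost):
    """Minimum energy over contiguous partitions and link algorithms.

    best[m][i][c] is the minimum energy of the first i blocks mapped onto m
    nodes, with algorithm c on the link that leaves block i.
    """
    inf = float("inf")
    best = [[[inf] * num_algos for _ in range(num_blocks + 1)]
            for _ in range(num_nodes + 1)]
    best[0][0] = [0.0] * num_algos    # nothing mapped yet; input link is free
    for m in range(1, num_nodes + 1):
        for i in range(m, num_blocks + 1):
            for c_out in range(num_algos):
                best[m][i][c_out] = min(
                    best[m - 1][k][c_in] + seg_cost(k, i, c_in, c_out)
                    for k in range(m - 1, i)
                    for c_in in range(num_algos)
                )
    return min(best[num_nodes][num_blocks])


# Toy cost: quadratic in the node's workload plus link terms (all hypothetical).
WORK = [1.0, 2.0, 3.0, 2.0]


def seg_cost(k, i, c_in, c_out):
    load = sum(WORK[k:i])
    return load ** 2 + 0.5 * (c_in + 1) * (k + 1) + 0.25 * (c_out + 1) * (5 - i)


e_min = opt_partition(num_blocks=4, num_nodes=2, num_algos=2, seg_cost=seg_cost)
print(e_min)
```

Fixing the boundaries in advance removes the loop over k, and the recurrence then reduces to a chain over the links alone, which mirrors the remark that the fixed-partitioning problem is a special case.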
Experimental Results
We experiment with the ATR algorithm mapped onto one and two Itsy nodes. The delay D for each frame is used as the performance metric. We repeatedly execute the ATR algorithm until the battery is fully discharged. We define I to be the processed image count per node and use it as a measure of energy efficiency.
I. Experiments with One Node
With the one-node configuration, we perform three experiments:
(I.A) The baseline configuration is a single Itsy node running the entire ATR algorithm at the maximum CPU speed of 206.4MHz, without data compression. Its peak performance is D = 2.9s for each frame, and the node can process I = 10.1K images before the battery is exhausted.
... while processing I = 11.5K images. That is, it can speed up the performance by 26% and increase the energy efficiency by 14% at the same time.
II. Experiments with Two Nodes
We also perform three similar experiments for the two-node pipeline. ... Fig. 13(a). With more parallelism, the CPU speeds may be reduced on both processors. However, the first node must still run at the fastest speed of 206.4MHz to achieve the same performance of D = 2.9s, due to the long communication delays. The second node can operate at 88.5MHz with a much lower power level. As a result, the two-node pipeline can process 21.2K frames with two batteries. Therefore, I = 10.6K. Compared with (I.A), the energy efficiency is improved by 5%. The increased parallelism with two nodes cannot further improve performance due to the imbalanced workload, where the first node must run at the highest speed.
(II.B) shows that data compression unveils a new optimal partitioning as N[1], N[2:4] (Fig. 13(b)), because the raw data between the two new partitions can be compressed very well (8.3KB down to 0.6KB). The saved time budget allows both nodes to reduce their CPU clock rates to 59MHz and 73.7MHz, respectively.
While not perfectly balanced because the second node must now process more workload, this is a much better solution. The battery efficiency is increased by 38% with 27.8K images being processed by two nodes (I = 13.9 K).
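The efficiency figures for the two-node experiments follow from dividing the total image count by the number of nodes (one battery each) and comparing against the single-node baseline of I = 10.1K. The short check below reproduces that arithmetic; the function names are illustrative.

```python
def images_per_node(total_images, num_nodes):
    """Energy-efficiency metric I: images processed per node (per battery)."""
    return total_images / num_nodes


def gain_percent(i_new, i_base):
    """Relative improvement of I over the baseline, in percent."""
    return 100.0 * (i_new / i_base - 1.0)


I_BASELINE = 10.1e3                          # (I.A): one node, no compression

i_iia = images_per_node(21.2e3, 2)           # (II.A): 21.2K frames on two nodes
i_iib = images_per_node(27.8e3, 2)           # (II.B): 27.8K frames on two nodes

print(round(gain_percent(i_iia, I_BASELINE)))   # about 5 percent
print(round(gain_percent(i_iib, I_BASELINE)))   # about 38 percent
```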
(II.C) Data compression also allows a 100% speedup with D = 1.45s and a 16% improvement in the energy efficiency with I = 11.6K. Without data compression, it would be impossible to deliver higher performance with two nodes in (II.A).
In summary, Fig. 14 ... (I.B). Finally, (I.C) and (II.C) improve both performance and energy efficiency at the same time. The curves along (I.B) - (I.C) and (II.B) - (II.C) represent many new solutions that were not possible without data compression. They strictly dominate (I.A) and (II.A) with both higher performance levels and lower energy consumption. It should be noted that in experiments (II.B) and (II.C), data compression achieves a much wider range of energy vs. performance trade-offs. This finding validates the concept that multiple processors can support both high-performance and low-power applications. However, as indicated by (II.A), increasing parallelism alone may not be effective unless it is explored jointly with other trade-offs. These important trade-offs include the selection of compression algorithms, CPU speeds, and partitioning schemes discussed in this paper.
Conclusion
We present an energy optimization technique for distributed embedded systems. In such systems, communication and computation compete over time and power budgets for operating at the most energy-efficient states. It is critical to balance the time and power budget for both communication and computation on each node and across the whole system. With data compression, the system can be tuned towards either high performance with shortened critical delays, or low power with extra DVS opportunities. We present an exact multi-dimensional dynamic programming formulation that produces the energy-optimal solution as defined by a partitioning scheme with compression algorithm selections for all tasks. This technique is applicable to a whole class of data-oriented systems that can be structured in a pipelined organization.
