For motion estimation (ME) and discrete cosine transform (DCT) of MPEG video encoding, content variation and perceptual tolerance in video signals can be exploited to gracefully trade quality for low power. As a result, power-aware hardware cores have been proposed for these video encoding subsystems. Adaptive System-on-a-Chip (aSoC) supports power-aware cores by providing an on-chip communications framework designed to promote scalability and flexibility in system-on-a-chip designs. This paper describes aSoC's ability to dynamically control voltage and frequency scaling through a simple voltage and frequency selection scheme. A small demonstration system is tested and shows up to 90% reduction in core power when the aSoC voltage scaling features are enabled.
INTRODUCTION
The demand for portable image and video processing continues to increase in products including PDAs, handheld gaming units and cell phones. To meet the power and performance demands of these applications, many hardware architectures have been proposed for specific subsystems [1, 2]. Many recently proposed subsystems include features that control power and performance trade-offs at run-time [3, 4]. Given the development of these intellectual property (IP) subsystems, or cores, the main challenge lies in integrating them into a single system capable of leveraging their individual flexibilities.
In this paper, Adaptive System-on-a-Chip (aSoC) is used as a backbone for power-aware video processing cores. In our previous work [5] we showed that aSoC, by nature of its statically scheduled mesh interconnect, performs up to 5 times faster than bus-based architectures. Additionally, interconnect usage for typical digital signal processing applications is under 20%. This leaves significant interconnect bandwidth to accommodate the control communications required by the power-aware features of modern cores [6].
scaling is critical to future portable digital signal processing applications. This will allow SoC implementations to exploit the inevitable mismatches in core utilization, due to data content variations or user requirements, to reduce power consumption. A simple clock division scheme makes it possible for each core to select and switch between 8 different frequencies. Based on principles described in [7], the clock selection scheme is complemented by a voltage scaling procedure, which allows the individual core supply voltages to switch between 4 different values. System power consumption is reduced since each core operates at a voltage and frequency that is coarsely tuned for its specific utilization. To demonstrate this, we use a partial video encoding system consisting of a motion estimation (ME) core and a discrete cosine transform (DCT) core. This simple system shows how the availability of multiple core frequencies and voltages reduces individual core power by up to 90%. Additionally, we describe a hardware control system that automatically selects the voltage and frequency of each core at run-time by monitoring interconnect utilization.
This paper proceeds as follows. Section 2 presents a brief overview of the aSoC architecture. Section 3 presents the power management techniques used in aSoC. The experimental approach and results for our demonstration system are described in Section 4. Section 5 concludes the paper and suggests future work.
Figure 1, each tile in these architectures includes a computational core and its interface to the network. Our approach to SoC integration, aSoC, is a tiled architecture that supports the use of heterogeneous processing cores occupying one or more tiles [5]. aSoC connects these tiles using a statically scheduled mesh interconnect, which assures predictable inter-core communication. Data moves between neighboring tiles in a communication pipeline, enabling fast clock rates and time sharing of interconnect resources.
The interconnect is reconfigurable at run-time to allow for dynamic communication patterns.
The core interface uses a synchronized global communications schedule to manage communications through each tile. As shown in Figure 2, the instruction memory holds a list of the communication patterns required at run-time. A program counter (PC) fetches these patterns in succession and a decoder converts them into switch settings for a crossbar. The crossbar routes data between the local core and the neighboring tiles (North, East, South or West). Each incoming data word can contain local interface configuration information to be sent over the local config line to the controller. The core-ports in Figure 2 use a simple protocol to interface communications between the potentially different clock domains of the core and interconnect. Multiple input and output core-ports can be used depending on the core and application requirements. During normal operations, the controller simply loops through the communications schedule.
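The fetch, decode and route loop described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `TileInterface`, the port names, and the dictionary encoding of a communication pattern are hypothetical and are not taken from the aSoC hardware.

```python
from typing import Dict, List, Optional

# One schedule entry maps each driven output port to the input port feeding it.
# This encoding is an assumption made for illustration only.
Pattern = Dict[str, str]


class TileInterface:
    """Toy model of one tile's statically scheduled crossbar."""

    def __init__(self, schedule: List[Pattern]) -> None:
        if not schedule:
            raise ValueError("schedule must contain at least one pattern")
        self.schedule = schedule  # stands in for the instruction memory
        self.pc = 0               # program counter into the schedule

    def step(self, inputs: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
        """Apply the current pattern to the input words, then advance the PC."""
        pattern = self.schedule[self.pc]  # fetch the current pattern
        # Decode and route: each output port receives the word on its source port.
        outputs = {dst: inputs.get(src) for dst, src in pattern.items()}
        # Loop through the schedule, wrapping back to the first entry.
        self.pc = (self.pc + 1) % len(self.schedule)
        return outputs


# Two-entry schedule: pass west to east, then exchange data with the local core.
tile = TileInterface([{"east": "west"}, {"core": "north", "south": "core"}])
first = tile.step({"west": 7})
second = tile.step({"north": 1, "core": 2})
```

Because the schedule is fixed ahead of time, the routing in each cycle depends only on the program counter, which is what makes inter-core communication predictable in this model.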
DYNAMIC POWER MANAGEMENT FOR ASOC
Dynamic power management exploits run-time variations in data content and operational requirements to minimize one or more of the terms in the VLSI power equation.
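For reference, a commonly used first-order expression for dynamic CMOS switching power is given below. It is offered as the standard textbook form rather than as the authors' exact equation, and the symbols (activity factor \alpha, switched capacitance C_L, supply voltage V_{dd}, clock frequency f) are assumptions introduced here.

```latex
% Standard first-order dynamic power model (assumed form, not from the source)
P_{dyn} = \alpha \, C_L \, V_{dd}^{2} \, f

% With \alpha, C_L and f held fixed, lowering the supply from V_{ref} to V
% scales power quadratically:
\frac{P_{dyn}(V)}{P_{dyn}(V_{ref})} = \left( \frac{V}{V_{ref}} \right)^{2}

% Example: V_{ref} = 1.8\,\mathrm{V},\ V = 0.6\,\mathrm{V}
\left( \frac{0.6}{1.8} \right)^{2} = \frac{1}{9} \approx 0.11
\quad \text{(roughly an 89\% reduction)}
```

Under this model, reducing frequency lowers power linearly, while reducing supply voltage lowers it quadratically, which is why the two are scaled together.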
To meet critical path requirements, independently developed heterogeneous cores may require independent clock and voltage domains. Additionally, reconfigurable IP cores may require the reconfigurability of both the clock and supply voltage. As a result, much of the overhead for adaptive clock and supply selection already exists in heterogeneous SoCs.
Figure 3 shows our approach to coupled frequency and voltage scaling. At each core, frequency and voltage are automatically adjusted using a four-part system. The first subsystem, Data Rate Measurement, uses up/down counters to track the data transfer rate between core and interconnect. Blocked or unsuccessful transfers cause the count to increase, while successful transfers decrease the value. If the core input port is blocked consecutively, the core is running too slowly with respect to its predecessors. If the core output port is consecutively blocked, the core is running too quickly for its successors. In either case, these counters send trigger signals to the core configuration unit to increase or decrease the core clock.
To change the clock, the Clock Selector selects a different frequency. Eight different frequencies [...], are used to save system overhead. During the transition, the local clock is disabled until test data can successfully pass through a reconstructed core critical path [13] in the Clock Enable system. When the clock is changed, a latch is reset, blocking the transmission of the new clock signal to the core. In the Clock Enable system, the Critical Path Check models the critical path of the core. When a bit of data can successfully pass through the Critical Path Check, the latch is set and the new clock can propagate to the core. This prevents data loss in the core during voltage and frequency changes.
The Normalized Delay in Figure 4 shows the relative changes in critical path delay for the changes in voltage.
This is done by dividing the critical path delay by the delay when a supply of 1.8 V is applied to the system. With this information, the values of the four selectable voltages, V1 to V4, can be chosen based on the desired relative clock frequencies. The core can be driven by the supply voltage V2 when the core processing delay can be twice the minimum possible delay. When slower frequencies can be used, operating at the voltages V2 = 1 V, V3 = 0.72 V, and V4 = 0.6 V reduces core power by 70%, 84% and 90%, respectively.
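The Data Rate Measurement and Clock Selector behavior described in this section can be sketched in software as follows. This is a minimal model, not the hardware design: the class names, the trigger threshold, and the assumption that the eight frequencies are power-of-two divisions of a base clock are all hypothetical. Voltage selection and the Clock Enable latch are omitted.

```python
from dataclasses import dataclass


@dataclass
class UpDownCounter:
    """Counts up on blocked transfers and down on successful ones."""
    threshold: int  # hypothetical trigger level; the text gives no value
    count: int = 0

    def update(self, blocked: bool) -> bool:
        """Record one transfer attempt; return True when the trigger fires."""
        if blocked:
            self.count += 1
        elif self.count > 0:
            self.count -= 1
        if self.count >= self.threshold:
            self.count = 0  # re-arm after signalling the configuration unit
            return True
        return False


class ClockSelector:
    """Chooses one of eight clock levels from core-port blocking behavior."""

    NUM_LEVELS = 8  # eight selectable frequencies, as stated in the text

    def __init__(self, base_mhz: float, threshold: int = 4) -> None:
        self.base_mhz = base_mhz
        self.level = 0  # level 0 is the fastest clock
        self.input_monitor = UpDownCounter(threshold)
        self.output_monitor = UpDownCounter(threshold)

    @property
    def frequency_mhz(self) -> float:
        # Assumption: each level halves the previous clock.
        return self.base_mhz / (2 ** self.level)

    def observe(self, input_blocked: bool, output_blocked: bool) -> float:
        """Feed one cycle of port status and return the selected frequency."""
        # A repeatedly blocked input port means the core is too slow: speed up.
        if self.input_monitor.update(input_blocked):
            self.level = max(0, self.level - 1)
        # A repeatedly blocked output port means the core is too fast: slow down.
        if self.output_monitor.update(output_blocked):
            self.level = min(self.NUM_LEVELS - 1, self.level + 1)
        return self.frequency_mhz


selector = ClockSelector(base_mhz=105.0, threshold=3)
for _ in range(3):
    selector.observe(input_blocked=False, output_blocked=True)
slowed_mhz = selector.frequency_mhz  # one step slower than the base clock
```

Decrementing the counters on successful transfers means that occasional stalls do not trigger a clock change; only sustained blocking does.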
METHODS AND RESULTS

Schematic and
A major issue with voltage scaling is its overhead in system performance, power and area. Using gating transistors with multi-stage drivers, the power grid for even large (5 x 10^6 gates) cores can be made to switch voltages within 30 ns. Switching time is on the order of 10 aSoC interconnect clock cycles. The multi-stage driver and gating transistor for any core with 10^4 to 5 x 10^6 gates uses only 0.15% of the area used by the core. The energy consumption in switching voltages ranges from 2.49 x 10^-12 J to 1.22 x 10^-9 J per switch over the tested range of core sizes.
Thus, the largest cores could switch voltages 1000 times a second before the power overhead became noticeable at approximately 1 mW.
Our demonstration system, shown in Figure 5, is a simple combination of video encoding cores developed in [3, 15]. The DCT is a replicated row accumulate (RAC) unit implementation [15], which includes dynamic power savings mechanisms such as most significant bit (MSB) rejection and row column classification (RCC), as found in [4]. These mechanisms create a system throughput that varies with data content. The ME core permits selection of several search algorithms including full, spiral and three step [1]. The optimal operating frequency for this core varies dramatically with the selected search algorithm. The second column of Table 1 [...]
Table 1. Power for Modes and Clock Rates
Core:Mode     Optimal frequency (MHz)   Clock rate (MHz)   Power reduction
ME:FS64x32    105                       105                0%
ME:FS16x16    13.1                      13.13              90%
ME:spiral     9.9                       13.13              90%
ME:TSS        2.75                      3.28               90%
DCT           9.6                       13.13              90%
Finally, in a dynamic system, the addition of voltage scaling greatly reduces power consumption. In [3], we show how to use the magnitude of motion vectors to select the search range for motion estimation in the upcoming frames.
With this approach we were able to evaluate motion vectors with small (16 x 16 pixel) search windows nearly 70% of the time. This search window reduction saved nearly 60% of the power for motion estimation with only a 2% reduction in quality [3]. Applying dynamic voltage scaling to this system saves an additional 20% in power.
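The clock rates listed with Table 1 are consistent with picking the slowest divided clock that still meets each mode's optimal frequency. The sketch below illustrates that selection under an assumption not stated in the text: that the eight available clocks are the 105 MHz base clock divided by successive powers of two. The function names are hypothetical.

```python
from typing import List


def available_clocks(base_mhz: float = 105.0, levels: int = 8) -> List[float]:
    """Return the selectable clocks, assuming power-of-two division of the base."""
    return [base_mhz / (2 ** k) for k in range(levels)]


def select_clock(required_mhz: float, base_mhz: float = 105.0, levels: int = 8) -> float:
    """Pick the slowest available clock that is at least the required frequency."""
    candidates = [f for f in available_clocks(base_mhz, levels) if f >= required_mhz]
    if not candidates:
        raise ValueError("required frequency exceeds the base clock")
    return min(candidates)


# Optimal frequencies (MHz) for each mode, taken from the Table 1 values.
required = {
    "ME:FS64x32": 105.0,
    "ME:FS16x16": 13.1,
    "ME:spiral": 9.9,
    "ME:TSS": 2.75,
    "DCT": 9.6,
}
selected = {mode: select_clock(mhz) for mode, mhz in required.items()}
```

Under this assumption the selected values are 105, 13.125 and 3.28125 MHz, which agree with the 105, 13.13 and 3.28 MHz entries once rounded to two decimal places.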
SUMMARY AND FUTURE WORK
This paper presents the dynamic power management capability of aSoC applied to video processing systems. A methodology has been presented to use aSoC core-port monitoring to dynamically vary both the frequency and voltage of individual cores. Reconfigurable clock-based system balancing creates an environment of just-in-time computing, which can reduce overall power usage. When coupled with coarse-grained voltage selection, this method can reduce core power by up to 90%. The overhead of this type of system was shown to be insignificant. Presently, a C-based simulator is being modified to support the adaptive frequency and voltage mechanism. We hope to show that many real applications can benefit from this dynamic voltage scaling.
