Fine-grained power management via local voltage selection has shown much promise through the use of Voltage/Frequency Islands (VFIs). VFI-based designs use fine-grained speed and voltage control to reduce energy requirements while still meeting performance constraints. We propose a hardware-based technique that dynamically changes the clock frequencies, and potentially the voltages, of a VFI system in response to the dynamic workload. The technique adjusts the frequency of each synchronous island so that power is used efficiently while performance constraints are satisfied. We propose a hardware design that can change the frequencies of synchronous islands interconnected by mixed-clock/mixed-voltage FIFO interfaces. Results show up to 65% power savings for the set of benchmarks considered, with no loss in throughput.
INTRODUCTION
One of the main long-term system-level design challenges mentioned in the 2005 ITRS [3] is that global on-chip synchronization is becoming prohibitively costly due to process variability, power dissipation, and multi-cycle cross-chip signaling. Indeed, with increasing clock speeds and shrinking technologies, distributing a single global clock signal throughout a chip is becoming increasingly difficult. A Globally Asynchronous, Locally Synchronous (GALS) design is considered a promising technique for achieving low power consumption and design modularity. Since another long-term system-level design challenge is on-chip power management, such an organization fits nicely with the concept of voltage islands, which can be used effectively as a means of achieving fine-grained system-level power management.
Voltage-Frequency Islands (VFIs) enable the design of systems that use a clock for local synchronization of data, while communication between different blocks is handled asynchronously. This not only reduces the power consumed by the clock network, because fewer buffers are needed to meet the skew requirements, but also reduces the overall power significantly through voltage scaling.
Most systems are overdesigned to meet the performance requirement of the worst-case scenario. Such systems constantly operate at peak performance, consuming peak power all the time. However, cooling and battery technology are not able to keep up with the power requirements of those designs. It therefore becomes necessary to make these systems more power and energy aware, so that they use just enough power to meet the performance requirements of the given workload. Dynamic Voltage and Frequency Scaling (DVFS) schemes have become a commonplace solution for adapting the power/energy consumption of a system to a dynamically changing workload. While DVFS schemes have been applied mostly at the application and system level, exploiting available slack in task scheduling to minimize dynamic power with little or no performance hit, hardware-based DVFS for VFI systems has received less attention. The goal of this paper is to provide a solution for dynamic voltage-frequency selection using a fully hardware-based control scheme driven by workload variations.
The rest of the paper is organized as follows. Related work and the contributions of this paper are presented in Section 2. Section 3 discusses the problem formulation and the assumptions made in this paper. In Section 4, we present the theoretical basis for our method and how it can be used to configure an entire system for low power. Our proposed architecture to enable DVFS in a system is discussed in Section 5. Section 6 discusses the Topology Generation Tool, while in Section 7 we provide the experimental results for the software radio and MPEG-2 encoder benchmarks. Conclusions and directions for future research are provided in Section 8.
RELATED WORK AND PAPER CONTRIBUTION
Previous approaches based on channel availability in multiple-clock systems (e.g., [4]) only gate the clock to the synchronous module. While this can reduce total power consumption, voltage scaling is not used, as each synchronous module still operates at a fixed frequency. Also, too many pauses in the clock produce sharp variations in power consumption, potentially degrading battery performance [13]. Our approach changes the clock frequency to minimize the idle time spent waiting for FIFOs.
There have been several proposals to implement VFIs in modern systems, such as Multiple Clock Domain processors [14], [15]. Such architectures allow a system designer to implement local DVFS algorithms [16], but most of these approaches assume that hardware control is done via FIFO occupancy monitoring, which can lead to incorrect decisions, as will be seen in the sequel. Some of the on-line algorithms are inherently non-linear [16], requiring detailed analysis of queue behavior before actual hardware can be implemented. Our method provides a flexible hardware platform that can be used to enable DVFS for VFI systems with simple data patterns, while also providing methods to support more complicated workloads. The problem of voltage/speed selection in VFI systems has been addressed before [12] by providing an off-line algorithm and a dynamic on-line algorithm with limited efficiency. In our approach, the benefits of DVFS are exploited at a finer level of granularity, while maintaining the possibility of global configuration.
The main contributions of this paper are two-fold:
• First, it provides an online, hardware-based control mechanism for dynamically selecting the operating speed and voltages for individual VFIs in a VFI-based system. As opposed to existing schemes that monitor only FIFO occupancy to determine scaling factors [14] [9] [15] , our approach takes into account the workload dynamics and relies on a combination of producer/consumer stall and FIFO occupancy monitoring. In addition, the approach is cost minimal as it relies on counters associated with stall events, as opposed to complex schemes relying on control theoretic approaches (e.g., PID controllers [16] ).
• Second, we provide a framework that enables any application specified in TGFF format [1] to be automatically converted into a Verilog description of the VFI system including both computation and FIFO-based communication.
PRELIMINARIES AND ASSUMPTIONS
Without loss of generality, we consider the case of systems comprised of a number of synchronous cores, IPs or processing elements (PEs) (homogeneous or heterogeneous). In the case of VFI-based systems, PEs can only be assigned to a single VFI (in other words, cores cannot belong to more than one VFI).
A VFI might consist of a single PE or may include a group of PEs. We assume that power in VFI systems is supplied by an off- or on-chip source and can be controlled independently for each VFI. This may be achieved by using either on-chip voltage regulators or multiple power grids [2]. Since each VFI is locally synchronous, it is assumed to be clocked using a ring oscillator controlled by the intra-island supply voltage using a digital phase-locked loop [11], [10]. Communication is implemented via a modified version of mixed-clock FIFOs [6] that also allows for voltage level conversion. We assume that the allocation and mapping of the processes or computational kernels of the application to PEs, as well as the number and types of the communication links and PEs, have already been determined. We also assume that the processes have already been scheduled on their respective processing elements. For VFI systems, a bounded number of storage cells is available in the mixed-clock FIFOs used between two communicating PEs. To this end, the system comprised of communicating cores is modeled using a component graph. In a component graph G(V, E), cores are modeled as communicating processes (nodes) that have associated communication channels between them (edges).
We will assume the following, without loss of generality:
• The component graph G(V, E) is characterized by the set of nodes represented as V = {1, 2, ...,n} and edges represented as E={(i,j) | i precedes j}.
• Although the underlying component graph model may include feedback paths, in the initial theoretical treatment we restrict ourselves to directed acyclic graphs (DAGs). General graphs have been shown to be reducible to acyclic component graphs by lumping strongly connected components (SCCs) including feedback loops into supernodes [12] , [7] . As shown in [7] , the processing rates of these supernodes (and thus, their latencies in cycle counts) can be found by averaging across all nodes in the SCC. However, the case of feedback loops is addressed and discussed in Section 5.3.
• The component graph includes a single source node (s) and a single sink node (S ). Graphs including multiple sinks or source nodes can be reduced to this case by adding dummy, zero-latency source (sink) nodes feeding into (from) the actual source (sink) nodes.
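The reduction to a single source and a single sink described in the last assumption can be sketched as follows. This is only a minimal illustration: the function name `add_dummy_terminals` and the list-of-pairs edge representation are assumptions made here, not part of the described design flow.

```python
def add_dummy_terminals(nodes, edges, source="s", sink="S"):
    """Return (nodes, edges) with a single source and a single sink.

    nodes: list of hashable node ids; edges: list of (i, j) pairs, i precedes j.
    A dummy (zero-latency) node is added only when more than one
    source or sink exists.
    """
    has_pred = {j for _, j in edges}
    has_succ = {i for i, _ in edges}
    sources = [n for n in nodes if n not in has_pred]
    sinks = [n for n in nodes if n not in has_succ]

    new_nodes, new_edges = list(nodes), list(edges)
    if len(sources) > 1:
        new_nodes.append(source)
        new_edges += [(source, n) for n in sources]
    if len(sinks) > 1:
        new_nodes.append(sink)
        new_edges += [(n, sink) for n in sinks]
    return new_nodes, new_edges


# Hypothetical graph: nodes 1 and 2 are both sources, node 4 is the only sink.
nodes, edges = add_dummy_terminals([1, 2, 3, 4], [(1, 3), (2, 3), (3, 4)])
print(nodes)
print(edges)
```

In this example only a dummy source is added, because the graph already has a single sink.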
THE COMMUNICATION ARCHITECTURE
In this section, we describe the use of mixed-clock FIFO as a point-to-point communication architecture for connecting synchronous islands in a GALS system.
The Producer-Consumer Model
In a VFI design, a mixed-clock/mixed-voltage FIFO provides a communication channel between two VFIs. One of the VFIs (producer) writes data into the FIFO while the other one (consumer) reads data from the FIFO [6] . For proper operation of design, it is required that a producer does not write data into the FIFO if it is full. Similarly, a consumer should not read data from a FIFO if it is empty. The producer and half of the mixed-clock FIFO share a clock (producer clock) while the consumer and other half of the mixed-clock FIFO share the other clock (consumer clock). Such a clock domain partition is shown in Figure 2. 
Rate Matching
Considering a simple producer-consumer model of a mixed-clock FIFO, the ideal frequency of operation can be derived from the read and write data rates.
The time interval between any two write operations by the producer can be written as $T_p = a_p/f_p$, where $a_p$ is the number of clock cycles between any two write operations by the producer and $f_p$ is the frequency of operation of the producer. Similarly, the time interval between any two read operations by the consumer can be written as $T_c = a_c/f_c$, where $a_c$ is the number of clock cycles between any two read operations by the consumer and $f_c$ is the frequency of operation of the consumer.
If $T_p$ is equal to $T_c$, the FIFO utilization will be constant most of the time. However, if $T_p < T_c$, the FIFO will tend to become full. Once the FIFO is full, the producer has to wait until the consumer has taken at least one data item out of the FIFO. Therefore we can write
$$T_p + T_w = T_c \qquad (1)$$
where $T_w$ is the time spent by the producer waiting for an empty slot in the FIFO. To operate the system near its optimal operating point, $T_w$ should be minimized and, in the ideal case, made zero. For such a case, we can write
$$T_{pi} = \frac{a_p}{f_{pi}} = \frac{a_c}{f_c} = \frac{a_c}{k f_p} \qquad (2)$$
where $T_{pi}$ is the ideal time interval between any two write operations by the producer, $f_{pi}$ is the ideal clock frequency of the producer, and $k = f_c/f_p$ is the ratio of the consumer clock frequency to the producer clock frequency. Thus, the ideal clock frequency of the producer can be written as $f_{pi} = S f_p$, where $S = k\,(a_p/a_c)$ is the Frequency Step factor by which the producer frequency should be scaled so that the wasted power is minimized. The choice of the new clock frequency should be made conservatively, so that there is no drop in overall throughput. For example, if $a_p = 2$, $a_c = 6$ and $f_p = f_c$, the ideal speed of the producer is $f_{pi} = (1/3) f_p$. The frequency actually selected should be the lowest available value that is not below $f_{pi}$, so that no throughput loss is experienced. In this case, if a value of $f_{avail} = f_p/2$ is available, the producer will be slow enough to reduce the waiting time $T_w$, but fast enough not to decrease the throughput. If, however, $a_p = 2$ and $a_c = 3$, the ideal producer speed is $f_{pi} = (2/3) f_p$, and an available frequency of $f_{avail} = f_p/2$ will not satisfy the throughput constraint. Hence it is always necessary to have $f_{pi} \le f_{avail}$. This analysis can be applied similarly to the case of $T_p > T_c$, where the FIFO will tend to become empty. In this case, the frequency of the consumer should be kept just high enough to operate the FIFO near the empty state, without any throughput reduction.
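The two numerical examples above can be checked with a short sketch. The ideal producer frequency is obtained directly from the rate-matching condition (producer write interval equal to consumer read interval); the function names and the set of available frequency levels are assumptions chosen for illustration only.

```python
from fractions import Fraction


def ideal_producer_freq(a_p, a_c, f_c):
    """Producer frequency at which a_p / f_p equals a_c / f_c (no waiting)."""
    return Fraction(a_p, a_c) * f_c


def pick_available(f_ideal, available):
    """Lowest available frequency that is not below the ideal one."""
    candidates = [f for f in available if f >= f_ideal]
    if not candidates:
        raise ValueError("no available frequency meets the throughput constraint")
    return min(candidates)


f_p = f_c = Fraction(60)                # MHz; producer and consumer start equal
available = [f_p / 4, f_p / 2, f_p]     # hypothetical frequency levels

slow_case = pick_available(ideal_producer_freq(2, 6, f_c), available)
fast_case = pick_available(ideal_producer_freq(2, 3, f_c), available)
print(slow_case)  # 30: the ideal is 20 MHz, and f_p/2 is the lowest safe level
print(fast_case)  # 60: the ideal is 40 MHz, so f_p/2 would lose throughput
```

Exact rational arithmetic is used so that the comparison against the ideal frequency is not affected by floating-point rounding.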
Problem Formulation
The goal of the work presented in this paper is to reduce the total energy consumption as well as power consumption of a system represented by a component graph G(V, E) subject to rate or throughput constraints.
The energy consumption per sample for every processing element in the component graph G(V, E) is given by:
where the first term corresponds to dynamic power and the second term corresponds to static (leakage) power consumed while core $PE_i$ is not actively executing a process. $C_i$ is proportional to the switched capacitance of $PE_i$, $N_i$ is the number of active execution cycles for $PE_i$, $c_i$ is proportional to the number of off-devices in $PE_i$, $n_i$ is the number of idle cycles for processing a sample, $k$ is a technology-dependent constant, and $V_i$ and $V_t$ are the supply voltage and threshold voltage for $PE_i$, respectively [5].
The cycle time for the $PE_i$ core in G(V, E) can be written as:
$$\tau_i(V_i) = \frac{K_i V_i}{(V_i - V_t)^{\alpha}} \qquad (4)$$
where $K_i$ and $\alpha$ are design- and technology-dependent parameters [8]. Thus, from (4), we get the worst-case execution time of a process on $PE_i$ at voltage $V_i$ as ($W_i$ is the worst-case number of cycles for the process mapped on $PE_i$):
$$WCET_i(V_i) = W_i\,\tau_i(V_i) \qquad (5)$$
For a system to operate as required by an application workload, it is necessary that
$$WCET_i(V_i) \le T_i \qquad (6)$$
where $T_i$ is the required time period of every VFI core. Most modern systems are not only designed for worst-case workload conditions, but also operate at peak performance all the time in order to handle the worst-case workload. As a result, for an average workload we get $WCET_i(V_i) \ll T_i$. This corresponds to a smaller $\tau_i(V_i)$, and hence a larger $V_i$, than necessary, which leads to higher energy consumption. To reduce the amount of wasted energy, $WCET_i(V_i)$ should be as close as possible to $T_i$, i.e.,
$$\text{Minimize}\ \bigl(T_i - WCET_i(V_i)\bigr) \qquad (7)$$
By bringing $WCET_i(V_i)$ closer to $T_i$, the amount of time $T_w$ in (1) wasted waiting for the communication channel is minimized. The reverse is also true, i.e., $T_w \to 0 \Rightarrow (T_i - WCET_i(V_i)) \to 0$. When each PE operates at its ideal frequency/voltage, the wasted time $T_w$ is minimized, resulting in minimum energy and power consumption. However, given the configuration settings available in a real system (for example, the number of available frequency and voltage levels), the best achievable solution will be close, but not identical, to the ideal one. Our hardware-based approach tries to find this solution by dynamically changing speeds/voltages as driven by the workload.
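With a finite set of voltage-frequency levels, the selection implied by this formulation can be sketched as choosing the slowest level whose worst-case execution time still fits in the required period. The sketch below is an illustration under stated assumptions: the level list and the cycle count are hypothetical, the function name is invented here, and voltage is assumed to decrease together with frequency, so that the slowest feasible level is also the lowest-voltage one.

```python
def select_level(worst_case_cycles, period_s, levels):
    """Pick the slowest (voltage, frequency) pair with W / f <= T.

    levels: iterable of (voltage_in_V, frequency_in_Hz) pairs.
    """
    feasible = [(v, f) for v, f in levels if worst_case_cycles / f <= period_s]
    if not feasible:
        raise ValueError("no level satisfies the throughput constraint")
    return min(feasible, key=lambda vf: vf[1])


# Hypothetical levels and workload: 40,000 worst-case cycles per 1 ms period.
levels = [(3.3, 60e6), (2.5, 45e6), (1.3, 23e6)]
v, f = select_level(40_000, 1e-3, levels)
slack = 1e-3 - 40_000 / f          # remaining T_i - WCET_i at the chosen level
print(v, f, slack)
```

Here the 23 MHz level is rejected because it would miss the period, and the 45 MHz level is preferred over 60 MHz because it leaves less unused slack.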
THE FIFO LINK ARCHITECTURE
The derivations shown in Section 4 can be used to calculate the ideal frequencies of the producer and the consumer under a dynamically changing workload. However, in a complex system, the values of $a_p$ and $a_c$ are likely to change due to varying workload conditions. Also, the overhead of the computations needed to find the value of the Frequency Step factor (Section 4) is likely to be significant. We propose an architecture that can predict the value of the Frequency Step factor (and hence the ideal frequency) on the fly.
Proposed Architecture
To implement logic for estimating the optimal operating frequency, we take advantage of the fact that when the producer/consumer is not operating at the ideal frequency, the FIFO will mostly operate near the full/empty state. We call these the mostly full and mostly empty conditions. A simple way to monitor the FIFO utilization is to check the full and empty signals and measure the amount of time they are asserted: the longer either signal is asserted, the greater the deviation of the producer (or consumer) frequency from the ideal frequency.
However, the full/empty signals do not accurately represent the need for scaling the speed/voltage of a VFI up or down. It can happen that the full (empty) signal is asserted while the producer (consumer) has no data to write to (read from) the FIFO. Thus, a decision to slow down a VFI based only on the FIFO occupancy can prove to be incorrect. Figure 3 shows an example of a producer writing data into a FIFO. In the time interval between t1 and t5, the full signal is asserted for the period (t4 − t2). However, the period during which the producer is actually waiting for the FIFO to have an empty slot is (t4 − t3). If the Frequency Step factor is calculated based on the full signal alone, it is likely to overestimate the frequency decrease and can potentially reduce the throughput of the system. A similar argument applies to the empty signal.
A more accurate estimate can be obtained if a signal generated by the producer/consumer (called the stall signal) is used to estimate the ideal frequency. This signal is asserted whenever the producer/consumer has data to write/read to/from the FIFO, but the FIFO is full/empty. Figure 4 shows the architecture that can predict the ideal frequency based on this method. The stall monitors count the number of clock cycles ($S_f$ for the producer part, $S_e$ for the consumer part) during which the stall signal from the producer/consumer is asserted in a sampling window $T_{sample}$.
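The difference between counting full cycles and counting stall cycles can be illustrated with a small behavioral sketch of the producer side. The per-cycle trace format and the function name are assumptions made for this illustration; the actual stall monitor is a hardware counter.

```python
def count_full_and_stall(trace):
    """Count full cycles and true stall cycles in a producer-side trace.

    trace: one (full, wants_to_write) pair of booleans per producer clock cycle.
    A stall is a cycle in which the FIFO is full AND the producer has data.
    """
    full_cycles = sum(1 for full, _ in trace if full)
    stall_cycles = sum(1 for full, wants in trace if full and wants)
    return full_cycles, stall_cycles


# Hypothetical trace in the spirit of Figure 3.
trace = (
    [(False, True)] * 2     # producer writes, FIFO not full
    + [(True, False)] * 3   # FIFO full, but producer has nothing to write
    + [(True, True)] * 2    # FIFO full and producer blocked: a real stall
    + [(False, True)]       # a slot is freed and the write proceeds
)
full_cycles, stall_cycles = count_full_and_stall(trace)
print(full_cycles, stall_cycles)  # 5 2
```

A controller driven by the full signal would see five cycles of apparent waiting, whereas only two of them actually delayed the producer.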
The Frequency Step factor can then be calculated from the non-zero values of $S_e$ and $S_f$. In steady state, $S_e$ and $S_f$ cannot both be non-zero (i.e., the consumer and the producer of a FIFO link cannot stall at the same time). When cumulative stalls over a sampling window are counted, however, both can be non-zero, e.g., for bursty traffic: the producer might stall at the beginning of the sampling interval $T_{sample}$, while the consumer might stall during the last part of it. In such a case, if the amount of stalling is the same on both ends, scaling the speeds of the producer/consumer will not remove the problem. More commonly, within a sampling interval either the producer stalls due to a full FIFO or the consumer stalls due to an empty FIFO. To capture both of these cases, the Frequency Step factor is calculated as $S = 1 - |S_e - S_f|/T_{sample}$. If only one of the producer or consumer stalls, the scaling factor is computed from $S_f$ or $S_e$, respectively. If both stall at different times during the sampling interval, the difference is used to smooth out any differences between the two rates. For a producer, if $S_f > S_e \ge 0$, then
$$f_{new} = f_{curr} \cdot S \qquad (8)$$
where $f_{new}$ is the new frequency and $f_{curr}$ is the current frequency. However, if $S_e > S_f \ge 0$, then
$$f_{new} = f_{curr} / S \qquad (9)$$
as in this case the consumer is experiencing stalls and the producer needs to increase its frequency. The reverse (i.e., changing division to multiplication and vice versa) holds for the consumer. However, for each FIFO link, only one of the producer or consumer modules is scaled up or down, in order to maintain the throughput constraint while minimizing the power wasted during stalls. This approach is described next.
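The producer-side update can be sketched numerically as follows. The direction of each update (multiply by S when the producer stalls, divide by S when the consumer stalls) is taken from the algorithm listing in Figure 6; the function names are assumptions, and the guard against a zero step factor is an addition made here for a window that is stalled in its entirety.

```python
def frequency_step(s_e, s_f, t_sample):
    """Frequency Step factor S = 1 - |S_e - S_f| / T_sample."""
    return 1 - abs(s_e - s_f) / t_sample


def new_producer_freq(f_curr, s_e, s_f, t_sample):
    """New producer frequency from the stall counts of one sampling window."""
    s = frequency_step(s_e, s_f, t_sample)
    if s <= 0:
        # Assumption of this sketch: a fully stalled window is reported
        # instead of scaling the clock to zero or dividing by zero.
        raise ValueError("stalled for the whole window; step factor is zero")
    if s_f > s_e:        # producer stalled on a full FIFO: slow it down
        return f_curr * s
    if s_e > s_f:        # consumer stalled on an empty FIFO: speed producer up
        return f_curr / s
    return f_curr        # equal stall counts: leave the frequency unchanged


# T_sample = 5000 cycles, producer currently at 60 MHz.
slower = new_producer_freq(60.0, s_e=0, s_f=1000, t_sample=5000)
faster = new_producer_freq(60.0, s_e=1000, s_f=0, t_sample=5000)
print(slower, faster)
```

With 1000 stall cycles in a 5000-cycle window, S is 0.8, so the producer is scaled to roughly 48 MHz in the first case and to roughly 75 MHz in the second.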
Throughput Constraint and Scaling State
In general, throughput-constrained systems require a given output rate to be satisfied for correct operation. For example, in the system in Figure 1, the sink node S needs to generate data items at a certain rate. Examples of throughput-constrained applications include most media processing, data communication systems, digital-to-analog converters, etc. In many cases, however, the constraint is given at the input: the incoming data items must be processed at a certain rate to ensure correct operation. One such example is an analog-to-digital converter. Irrespective of where the rate constraint is specified (source s or sink S in Figure 1), we can use it to determine how each producer/consumer port can be configured for possible scaling up or down of the corresponding VFI, as described in Section 5.1.
Let us consider the more common case of output rate constrained systems, depicted in Figure 5. The producer port of the sink node S has no FIFO link associated with it, but a stall monitor can be used to determine whether data is produced at the required rate. If not, a corresponding scaling factor can be associated with the sink: $S_S = T_S^{observed}/T_S$, where $T_S^{observed}$ is the observed period between data items being produced and $T_S$ is the required value. For the rest of the nodes, we need to consider all incoming and outgoing ports associated with each FIFO link. Intuitively, if throughput constraints are propagated from the outputs to the inputs, we need to maintain the required throughput in the downstream VFIs while allowing only producers to be scaled (up or down); the consumer port is assumed to be fixed. We call the state associated with the producer port dvfs_en_prod, and the one associated with the consumer fixed, since the consumer is not allowed to change speeds/voltages based on stall information related to that FIFO link.
In Figure 5, the assignment of port states for VFIs 4, 5, 6 and S is shown for an output rate constrained system (the assignment is similar for the other nodes 1, 2, 3, and s). Similarly, for an input rate constrained system, each consumer in a FIFO link would be in the dvfs_en_cons state (the consumer is allowed to scale) and each producer would be in the fixed state (no scaling).
Functionality of Clock Control Logic
We are now ready to determine the correct scaling factor for each VFI, given the constraints on the output (or input) rate and given that multiple scaling factors may be determined from multiple incoming/outgoing FIFOs. We need to keep in mind that the FIFO link architecture depicted in Figure 4 might be replicated many times, once for each producer-consumer channel. More precisely, the Clock Control Logic gets the prediction value from both stall monitors associated with the FIFO. As described previously (Eqn. 8 and Eqn. 9), in the case of the producer, the stall information from the consumer is used to increase the frequency of that domain if the current frequency is not able to meet the throughput requirements of the design (and similarly for the consumer).
For each VFI, there might be multiple producer and consumer ports, as data may be coming from multiple sources or distributed to multiple sinks. In addition, each VFI has as many stall monitors associated with producer ports as there are outgoing FIFOs, and as many stall monitors associated with consumer ports as there are incoming FIFOs. Figure 4 shows a single one-to-one FIFO link, hence there is only one stall monitor on each side of the FIFO. Since a Clock Control Logic module controls the frequency and voltage of a single VFI, there are as many Clock Control Logic blocks as VFIs in the system, and each block receives as many $S_f$ and $S_e$ signals as there are stall monitors on the FIFO link interfaces of its VFI.
When multiple incoming/outgoing FIFO links dictate different scaling factors for a given VFI, the prevailing scaling factor is chosen conservatively: to ensure that the throughput is not reduced, the highest frequency/voltage is selected. Each VFI can have multiple producer or consumer ports, but only a subset of these are configured in the dvfs_en_prod (or dvfs_en_cons) state. Only these ports, and the scaling factors associated with their stall monitors, are considered in determining the prevailing scaling factor, which is obtained by taking the maximum resulting speed among them. For example, in Figure 5, the new speed/voltage for node 5 depends on the speeds/voltages determined by the FIFO links (5, S) and (5, 6). Assuming that, based on Eqn. 8 and Eqn. 9, $f_{new,5}(5, S)$ and $f_{new,5}(5, 6)$ are the new potential clock speeds, the final clock speed (and associated voltage) is taken as $f_{new,5} = \max(f_{new,5}(5, S), f_{new,5}(5, 6))$. For all the other nodes (VFIs), there is only one port configured as dvfs_en_prod, and the final speed/voltage is assigned based on it and its associated new clock speed.
Based on these observations, the detailed algorithm for the speed/voltage selection of an output (input) rate constrained VFI system is described in Figure 6 . 
If system is sink constrained then
    state_prod(i, j) = dvfs_en_prod; state_cons(i, j) = fixed;
else // source constrained
    state_prod(i, j) = fixed; state_cons(i, j) = dvfs_en_cons;
3. Repeat every T_sample cycles
    If system is sink constrained then
        S_S = T_S^observed / T_S; f_S = f_S / S_S; set corresponding V_S;
    else // source constrained
        S_s = T_s^observed / T_s; f_s = f_s / S_s; set corresponding V_s;
    For all FIFO links (i, j)
        S_{i,j} = 1 − |S_e^{i,j} − S_f^{i,j}| / T_sample;
        If S_e^{i,j} < S_f^{i,j} then S_{i,j} = 1 / S_{i,j};
    If system is sink constrained
        For all nodes i with successors j and state_prod(i, j) = dvfs_en_prod
            f_i = max_j (f_i / S_{i,j}); set corresponding V_i
    Else // system is source constrained
        For all nodes j with predecessors i and state_cons(i, j) = dvfs_en_cons
            f_j = max_i (f_j * S_{i,j}); set corresponding V_j
4. until (source is idle)
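The per-window frequency update for a sink-constrained system can be sketched in software as shown below. This is a simplified illustration of the listing above, not the hardware implementation: it covers only the FIFO-link part of step 3 (the sink's own scaling factor and the voltage selection are omitted), the function name and data layout are assumptions, and the stall counts in the example are hypothetical.

```python
def update_sink_constrained(freqs, stalls, t_sample):
    """One sampling-period update of producer-side VFI frequencies.

    freqs:  {node: current frequency}
    stalls: {(i, j): (s_e, s_f)} for every FIFO link (i, j) whose producer
            port is in the dvfs_en_prod state
    Note: a link stalled for the whole window gives S = 0 and is not
    handled by this sketch.
    """
    candidates = {}
    for (i, _j), (s_e, s_f) in stalls.items():
        s = 1 - abs(s_e - s_f) / t_sample
        if s_e < s_f:
            s = 1 / s
        candidates.setdefault(i, []).append(freqs[i] / s)

    updated = dict(freqs)
    for i, options in candidates.items():
        updated[i] = max(options)   # conservative choice: highest candidate
    return updated


# Node 5 feeds two links, as in the Figure 5 example.
freqs = {5: 90.0}
stalls = {(5, "S"): (0, 1000), (5, 6): (500, 0)}
updated = update_sink_constrained(freqs, stalls, t_sample=5000)
print(updated[5])
```

Link (5, S) alone would allow node 5 to slow down to about 72, but link (5, 6) shows its consumer starving and asks for about 100, so the higher value prevails.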
TOPOLOGY GENERATION TOOL
Embedded applications can be very effectively partitioned into tasks with various, but well defined functionalities. With clearly defined computational boundaries, they are very good candidates for being mapped onto a VFI system. Most of these applications can be represented as task graphs. Embedded Systems Synthesis Benchmarks Suite (E3S) based on benchmarks from The Embedded Microprocessor Benchmark Consortium contains a set of task graphs representing various applications including, but not limited to automotive, consumer, networking, etc. The task graphs available in E3S benchmark suite contain the information about the applications, constraints and various processors that can be used to map the various tasks.
We created a tool (the Topology Generation Tool) that can convert task graphs into behavioral Verilog. This program takes .tgff files [1] as inputs and converts all the tasks to behavioral Verilog models of producers/consumers, while all the edges are converted to FIFO links. The tool uses the processor information from the task graphs to assign the delays of each producer/consumer. With the help of this tool, a designer can test many types of applications just by specifying a high-level description in the form of task graphs. The generated Verilog can be simulated using any Verilog simulator.
EXPERIMENTAL RESULTS
To test our proposed DVFS architecture for a FIFO link, we used Software Defined Radio and an MPEG-2 Encoder as driver applications. These applications were represented as task graphs and implemented as behavioral Verilog models, which were used to determine the benefits of online voltage/frequency scaling for each module. $T_{sample}$ was set to 5000 clock cycles for each of these benchmarks.
Software Radio
The software defined radio application can be partitioned into five components, namely source, low pass filter (LPF), demodulator, equalizer (EQ) and sink (Figure 7). Each of these nodes can be represented using the producer-consumer model. Samples are generated at a fixed rate by the source, which therefore defines the throughput constraint. The samples pass through the various blocks and finally reach the sink node. A base configuration of Hitachi SH3 cores running at a clock frequency of 60 MHz and a supply voltage of 3.3 V, along with an off-line algorithm [12] (with six levels of voltage and frequency), was used for comparison purposes. The six voltage-frequency pairs (in V, MHz) chosen were (3.3, 60), (2.9, 52), (2.5, 45), (2.1, 38), (1.7, 31), and (1.3, 23). The results were obtained for a required sample rate of 1 kHz. As can be seen from Figure 8, some of the modules, such as Demod, Equalizer and Sink, show significant savings in power, while the second instance of the pipelined LPF modules, which is the bottleneck in the system, shows no improvement at all. However, the overall improvement is still around 50% and compares well with the off-line method. When infinitely many frequency and voltage levels are available, the power savings are greater than those with finite levels (six frequency-voltage pairs), as expected (up to 55% power savings).
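To give a rough sense of how much each of the six levels can save, the sketch below compares them using the standard first-order relation for CMOS dynamic power, P ∝ C·V²·f. This is only an illustrative estimate: it assumes the same switched capacitance and activity at every level and ignores leakage, and it is not how the reported results were obtained.

```python
# Voltage-frequency pairs (V, MHz) listed for the software radio experiment.
pairs = [(3.3, 60), (2.9, 52), (2.5, 45), (2.1, 38), (1.7, 31), (1.3, 23)]


def relative_dynamic_power(pairs):
    """Dynamic power of each level relative to the first (base) level,
    using P proportional to V^2 * f with constant switched capacitance."""
    v0, f0 = pairs[0]
    return [(v * v * f) / (v0 * v0 * f0) for v, f in pairs]


ratios = relative_dynamic_power(pairs)
for (v, f), r in zip(pairs, ratios):
    print(f"{v:.1f} V, {f} MHz -> {r:.3f} of base dynamic power")
```

Under these assumptions, a module that can run at the lowest level draws only a small fraction of the base dynamic power, which is why non-bottleneck modules dominate the overall savings.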
MPEG-2 Encoder
The MPEG-2 Encoder is broken down into six components, namely the motion estimator (ME), the motion predictor (Pred), the DCT and quantization block, the IDCT and inverse quantization block, the variable length encoding (VLC) block, and the sink. For the MPEG-2 Encoder, a base configuration with ARM cores running at a clock frequency of 133 MHz and a supply voltage of 1.6 V was chosen. The same off-line algorithm [12] was used for comparison purposes (with six voltage-frequency pairs). The six voltage-frequency pairs (in V, MHz) chosen were (1.6, 133), (1.4, 117), (1.2, 100), (1.0, 83), (0.85, 70), and (0.65, 54). The results were obtained for a frame processing rate of 3.5 frames/s with 99 macroblocks per frame. Figure 10 shows that all blocks except DCT and IDCT show a large improvement in power consumption. DCT, being the bottleneck of the system, operates at the highest available frequency and voltage. For IDCT, our proposed method performs better than the off-line method due to precise detection of workload behavior, providing an additional 30-40% power savings locally and 8% additional power savings globally. The overall savings in power are close to 65% for all three cases, with infinite frequency-voltage levels showing more improvement than the finite case (six frequency-voltage pairs).
CONCLUSION
In this paper, we proposed a hardware based architecture that can be used as a basic building block to build VFI systems and support Dynamic Voltage and Frequency Scaling schemes. The logic to predict the optimal frequency of operation is also presented. A method to propagate the throughput constraint through the entire system is also discussed. We introduced a tool to automatically generate behavioral Verilog from task graphs that can enable and automate analysis of such VFI systems. Future work in this direction can include modification of the FIFO link architecture to address latency constraints, in addition to rate constraints.
REFERENCES
[1] Embedded Systems Synthesis Benchmarks Suite (E3S). http://www.ece.northwestern.edu/~dickrp/e3s/.
[2] IBM Blue Logic Cu-08 voltage islands. http://www.ibm.com/chips/products/asics/products/v_island.html.
