Abstract-This work presents the design and silicon implementation of an on-line energy optimizer unit based on novel analog computation approaches, which is capable of dynamically adjusting the power supply voltages and operating frequencies of multiple processing elements on-chip. The optimized voltage/frequency assignments are tailored to the instantaneous workload information on multiple tasks and are fully adaptive to variations in process and temperature. The optimizer unit has a response time of less than 50 µs, occupies a silicon area of 0.021 mm²/task, and dissipates 2 mW/task.
I. INTRODUCTION

Due to recent developments in embedded systems technologies, multi-core System-on-Chip (SoC) and Network-on-Chip (NoC) architectures are becoming virtually ubiquitous. Such systems are widely used in many wireless applications such as mobile computing, where the available energy is fundamentally limited [1].
In multi-processing element (PE) systems, due to the diversity of the applications that run within the system and their different degrees of parallelism, the workloads imposed on the system components are non-uniform over time. This introduces slack times during which the system can reduce its performance to save energy. The key to energy-efficient designs is the ability to tune PE performance to the non-uniform workload [2] - [4] .
In cases where the performance requirements of a component vary significantly during the active operation regime, dynamic voltage scaling (DVS) is the preferred approach for reducing the overall energy dissipation. DVS exploits the fact that the peak operating frequency of a processing element is proportional to the supply voltage, while the amount of dynamic energy required for a given workload is proportional to the square of the PE's supply voltage [5]. In applications such as MPEG-2 video encoding/decoding, where instantaneous workload variations on the order of 1:10 have been reported between consecutive frames (every 33 ms) [6], [7], DVS typically results in substantial energy savings. The varying workload of a given application (e.g., a user's request or the computational task to be carried out) provides an opportunity to tune the performance (speed) of the system. Each task of the given application requires a finite number of clock cycles (the workload, a dimensionless variable) to be executed, depending on the PE on which it is mapped. The non-stationary workload of an application is predicted based on stochastic models, which use distributions to describe the arrival times of user requests and the time it takes for a device to make the state transitions needed to service a user request. In these approaches, it is assumed that the system and the workload can be modeled by Markov chains [8], [9]. To serve as a motivational example, Fig. 1(a) shows a four-core implementation of MPEG-2 video decoding where the algorithm is decomposed into a set of parallel tasks communicating over a general-purpose data network and mapped onto an SoC consisting of four processing elements [10]. In [6], the authors derive Hidden Markov Models (HMM) for the workload characterization of specific tasks, as shown in Fig. 1(b).
It can be seen that the estimated workload characteristics (light grey) closely follow the actual (dark) workload characteristics, and that the frame-to-frame variations in the workload may be as high as a factor of 10. This observation also reinforces the case for fine-grained DVS to be performed on all PEs in order to minimize energy dissipation, based on instantaneous workload.
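The energy argument behind DVS can be illustrated with a minimal sketch. It assumes the idealized relations stated above (frequency proportional to supply voltage, dynamic energy per cycle proportional to the square of the supply voltage) and ignores leakage and the finite voltage range of a real PE. The function names and the numerical values (a 40 ms slot, a 250 MHz core at 1.8 V) are hypothetical and chosen only for illustration.

```python
def just_in_time_energy(cycles, deadline_s, f_max_hz, v_max, c_eff=1.0):
    """Dynamic energy of one task when the PE is slowed to finish at the deadline.

    Idealized model (assumption): f is proportional to V, and the energy per
    cycle is c_eff * V**2. Leakage and voltage limits are ignored.
    """
    f_required = cycles / deadline_s
    if f_required > f_max_hz:
        raise ValueError("workload cannot meet the deadline even at full speed")
    v_scaled = v_max * f_required / f_max_hz
    return c_eff * cycles * v_scaled ** 2


def full_speed_energy(cycles, v_max, c_eff=1.0):
    """Dynamic energy of the same task run at the maximum voltage, then idling."""
    return c_eff * cycles * v_max ** 2


if __name__ == "__main__":
    # Hypothetical numbers: a 40 ms slot and a 250 MHz, 1.8 V core.
    for cycles in (8e6, 4e6, 1e6):
        ratio = (just_in_time_energy(cycles, 0.04, 250e6, 1.8)
                 / full_speed_energy(cycles, 1.8))
        print(f"{cycles:.0e} cycles: DVS energy is {ratio:.1%} of full-speed energy")
```

Under these assumptions the energy ratio equals the square of the ratio between the required and the maximum frequency, which is why large frame-to-frame workload variations leave considerable room for savings.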
DVS is based on reducing the performance level of the component during periods of low utilization so that the task is always completed just in time, consuming minimum energy [11], [12]. While the local energy dissipation of each PE can be minimized using DVS techniques based on workload predictions, it can be shown that these local minima usually do not represent the global energy minimum, which can only be reached by considering the relative timing dependencies of all tasks running in the system. This problem of minimizing the overall dissipated energy in a multi-PE system under timing constraints and subject to DVS has already been formulated in a rigorous fashion, yet a compact real-time implementation has not been offered [13]-[15]. Our approach solves the problem of on-line optimization of the dissipated energy in multi-PE systems with interrelated tasks under timing constraints using the basic principles of analog computation: the global minima of the constrained optimization problem are represented as stable operating points of a simple resistive network (RN), and the circuit converges on them. The input set of the circuit consists of individual workload estimates for each task and for each PE, while the output consists of the assigned supply voltage/frequency values for each PE as well as the allocated time duration for each task, as illustrated in Fig. 2.
The remainder of this paper is organized as follows. In Sections II and III, we concentrate on demonstrating an on-line solution to a complex multi-variable energy optimization problem. The implementation of the main building blocks is given in Section IV. The configurability of the pseudo-resistor array is discussed in Section V. The closed-loop operation principle of the proposed analog optimizer block is described in Section VI. In Section VII, experimental results are discussed in comparison with the simulation results. Conclusions are provided in Section VIII.
II. FROM TASK GRAPH TO RESISTIVE NETWORK
The authors have previously demonstrated the analogy between the problem of minimizing energy consumption in a complex system under timing constraints and the problem of minimizing power dissipation in a resistive network under Kirchhoff's Current Law (KCL) constraints [16]. According to Maxwell's heat theorem, the RN consumes the lowest possible power at steady state for a given driving current [17]. The equivalence between the two analogous minimization problems is summarized in (1) and (2), given in Fig. 3. Here, individual tasks are modeled with device conductances controlled by a ratio formed from the average power consumption of the PE during the task and the task duration. For a detailed explanation of this analogy, refer to Appendix A.
Consider the example illustrated in Fig. 3(a), where five tasks are mapped and scheduled on two PEs. The total dissipated energy in the system can be written as the summation of all the task energies (1). Since each task requires a given number of cycles, and each cycle consumes a given amount of energy, this amount can be reduced by lowering the supply voltage of the PE at the cost of an increased cycle time. A formal algorithmic solution of (1) in real time is certainly possible, yet the computational overhead may become prohibitive, especially when taking into account realistic timing/delay models and secondary effects such as leakage dissipation. Fig. 3(b) shows the equivalent RN of the given task graph (TG), in which the duration of each task corresponds to the current in a resistor. The TG period T corresponds to the total current driving the circuit. Due to KCL, this driving current is split into parallel branch currents that are inversely proportional to the branch resistances. Hence, it can be seen that the simple RN actually realizes the solution to the dissipated power minimization problem for a given driving current under KCL constraints (2). It is important to emphasize that the mapping of a given TG to its equivalent RN is based on converting the time-domain relation between tasks into equivalent RN currents. Hence, we do not consider this procedure to be equivalent to finding the dual of a given TG.
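The statement that a resistive network settles at the minimum-dissipation current distribution can be checked numerically for the simplest case of parallel branches with fixed resistances. The following is only a sketch with hypothetical resistance values and helper names. It does not model the feedback-controlled pseudo-resistors of the actual optimizer, whose values depend on the branch currents.

```python
def divider_currents(resistances, total_current):
    """Branch currents of parallel resistors fed by one current source (KCL)."""
    conductances = [1.0 / r for r in resistances]
    g_total = sum(conductances)
    return [total_current * g / g_total for g in conductances]


def dissipated_power(resistances, currents):
    """Total Joule power, the sum of R_i * I_i**2 over all branches."""
    return sum(r * i * i for r, i in zip(resistances, currents))


def shift_current(currents, src, dst, delta):
    """Move `delta` of current from branch `src` to branch `dst`; the sum is unchanged."""
    shifted = list(currents)
    shifted[src] -= delta
    shifted[dst] += delta
    return shifted


if __name__ == "__main__":
    resistances = [1.0, 2.0, 4.0]  # hypothetical branch resistances
    natural = divider_currents(resistances, total_current=7.0)
    print("natural split:", natural, dissipated_power(resistances, natural))
    for delta in (0.5, -0.5):
        trial = shift_current(natural, 0, 1, delta)
        print("perturbed    :", trial, dissipated_power(resistances, trial))
```

Any redistribution that keeps the total current constant dissipates more power than the natural current-divider split, which is the property the RN analogy relies on.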
The overall block diagram of the intended system is given in Fig. 4, where the only input to the proposed optimizer block (the task workloads) is highlighted in grey and the outputs of the optimizer block to the SoC/NoC architecture (the supply voltage and the operating frequency) are shown in black for easy identification. The simplified representation of the proposed optimizer unit is also shown in Fig. 4, where each CFL block represents the controlling feedback loop of an individual device conductance corresponding to a single task. The basic idea, continuously (in the time domain) adjusting the control knobs of the overall system in order to minimize the global energy consumption subject to timing constraints, and the design of the proposed central (global) optimizer unit based on simple analog circuit topologies are explained in detail in the following sections. As shown in Fig. 4, the central optimizer unit receives a number of input parameters that represent the estimated workload profiles of each task in the system and implements in real time an energy management policy, determining the voltage/frequency values to be assigned to various modules based on their workload and/or other limitations.
III. IMPLEMENTATION OF THE ANALOG OPTIMIZER
The total energy required by the system to execute the whole set of tasks within a fixed duration T is emulated by the power dissipated in the equivalent RN, driven by a current. Each resistor in the equivalent RN is implemented as a pseudo-resistor, so that its value can be adjusted in proportion to the ratio introduced in Section II by means of a feedback loop that includes a calculation of the power consumption of the PE.
The proposed feedback loop is capable of accurately calculating (estimating) the supply voltage and the corresponding operating frequency for each individual task, guaranteeing that the job (task) will be finished in time. Since the task duration is intrinsically embedded in the resistive network (as a device current) and the workload (the number of clock cycles necessary to complete the task) of each task present in the system is known, the required operating frequency of the related processing element during that task is determined individually by means of (3), which involves the workload of the task and the cycle time (clock period).
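The relation between workload, task duration, and operating frequency described above can be written compactly as follows. The symbols $N_i$ (workload in clock cycles), $t_i$ (task duration), $f_i$ (operating frequency), and $T_{\mathrm{clk},i}$ (cycle time) are assumed names introduced here for illustration, since the original notation of (3) is not reproduced in this text.

```latex
f_i = \frac{N_i}{t_i},
\qquad
T_{\mathrm{clk},i} = \frac{1}{f_i} = \frac{t_i}{N_i}
```

For instance, a hypothetical workload of $10^{6}$ cycles allotted a duration of 5 ms would require a clock frequency of 200 MHz.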
Fig. 5 shows the simplified block diagram implementation of the feedback loop for one branch conductance, where a current-based approach is used to represent key loop variables. A key element of the loop is the dynamic Ghost Circuit (GC) that emulates the maximum operating frequency of the processing element operated at the same supply voltage. This frequency-to-voltage mapping could be done with a look-up table based on analytical approximations (e.g., the alpha-power law MOSFET model) subject to certain modeling errors. There are some distinct advantages of using a representative (yet small) circuit block that mimics the operating characteristics of the actual processing element; hence, the name "ghost circuit" (also called "canary circuit" in the literature [18]) is used.
This GC is essentially a ring oscillator replicating the critical path of the PE. It is used in each loop to continuously determine the minimum supply voltage and the supply current that correspond to a target operating frequency for the PE. As such, the ghost circuit generates the minimum supply voltage level necessary for a given frequency while adapting itself to changing operating conditions such as temperature, process parameters, device aging effects, and voltage variations. It is forced to run at the frequency imposed by its supply current. In this solution, as shown in Fig. 6, the ring oscillator is driven by its supply current rather than its supply voltage, since the instantaneous operation of the oscillator is imposed by the calculated required frequency of operation. Recall that, provided that leakage effects are neglected, a ring oscillator running at the required operating frequency (modeled with a current) dissipates a power that corresponds to a specific supply current. Hence, the supply current of the ring oscillator is calculated as given in (4). Note that the supply voltage of the ghost circuit is intrinsically adjusted to achieve the required frequency of operation, i.e., the supply voltage of the oscillator is a variable quantity. In order to compensate for the variations of the supply voltage level, the current representation of the ghost circuit supply voltage is introduced into the equation. The quantities entering (4) are the current modeling the required frequency of operation, the current representation of the ghost circuit supply voltage, and the proportionality constants. Eq. (4) is a linear relation provided that the total switching capacitance (C) is constant; consequently, it also yields a linear dependence on the operating frequency. According to Fig. 6, the ring oscillator is driven from a current source (high impedance).
Normally, the supply current of a ring oscillator can strongly vary during one period of oscillation. Therefore, a blocking capacitor of proper size (20 pF) is used for AC decoupling in the implementation which provides low impedance at the frequency of operation (refer to Fig. 6 ).
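The first-order relation behind the supply-current calculation of the ghost circuit can be summarized as follows. This is the textbook expression for the dynamic supply current of a switching CMOS circuit with leakage neglected, not the exact form of (4), whose proportionality constants are not reproduced here. The symbols $P_{\mathrm{dyn}}$, $V_{DD}$, $I_{DD}$, and $f$ are assumed names; $C$ is the total switching capacitance mentioned in the text.

```latex
P_{\mathrm{dyn}} = C\,V_{DD}^{2}\,f
\quad\Longrightarrow\quad
I_{DD} = \frac{P_{\mathrm{dyn}}}{V_{DD}} = C\,V_{DD}\,f
```

Because the average supply current depends on the product of supply voltage and frequency, forcing the current while letting the oscillator supply node float allows the voltage to settle at the value required for the target frequency.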
The predicted workload information is injected into each loop in the form of a 4-bit external control variable. Any change in this variable influences the current corresponding to the target operating frequency in the feedback loop. Hence, the simple GC determines the supply voltage level to be applied to the PE for achieving the target frequency, as well as the resulting dynamic current consumption. The voltage and the frequency are transmitted to the PE. They are also converted to current representations in order to calculate the pseudo-resistor controlling currents. An alternative look-up table approach would have required characterizing the core for throughput at a given clock speed and at a given voltage, with ample margins for temperature, power supply, and fabrication variations; building such a hard-coded speed-versus-supply table requires extensive effort. The implementation of the ghost circuit in the control path effectively eliminates the need for such modeling, which is inherently prone to inaccuracies. The amount of current drawn from the supply rail of the oscillator follows the variations in the frequency and, hence, in the supply voltage level. Also, the short-circuit current characteristics of the logic gates constituting the processing element are imitated accurately to a first approximation, based on the expectation that the processing element is designed appropriately, i.e., the transitions and the waveforms during transitions at the inputs/outputs of the internal CMOS gates are homogeneous. Since the transitions in the ghost circuit are homogeneous, the short-circuit power consumption of the processing element is modeled by the implemented ghost circuit.
IV. IMPLEMENTATION OF THE MAIN BUILDING BLOCKS
Current-mode processing in each feedback loop is carried out by single-quadrant current multiplier/dividers. Each current operator is implemented by a simple alternating-topology translinear loop (TLL) of four transistors operated in weak inversion with their bulks connected to the common substrate, as shown in Fig. 7. Here, a clockwise (CW) element is one whose gate-to-source voltage is a voltage drop in the clockwise direction of the loop, and a counterclockwise (CCW) element is one whose gate-to-source voltage is a voltage increase in the clockwise direction of the loop. Recall that the channel current of a saturated MOS transistor operating in the weak inversion regime is given by (7), where V_G, V_S, and V_D are the gate-to-bulk, source-to-bulk, and drain-to-bulk potentials, respectively.
The weak-inversion channel current is given by (5), where I_S is the specific current [given in (6)], n is the slope factor [it can be interpreted as the (incremental) capacitive divider ratio between the gate and the channel potential], V_T0 is the threshold voltage for a channel at equilibrium, and U_T is the thermal voltage [19]-[21].
The given loop has an alternating topology; that is, we alternate between CW and CCW elements as we go around the loop. Applying Kirchhoff's voltage law and the voltage-translinear principle around the loop illustrated in Fig. 7 results in (8). Introducing (7) into (8) for all loop transistors (after dividing by the term that is common to all transistors) and exponentiating both sides of the equation yields (9) if all transistors are identical. The precision of the translinear loop is degraded by device mismatch, which appears as relative errors in the loop currents.
This is a single quadrant current multiplier/divider, thus all currents should be positive. Hence, we can take the inverse of a current with respect to a unit current, provided that all currents entering/leaving the TLL are positive. It should be noted that current mirrors of necessary type (PMOS and/or NMOS) are used whenever needed to drive the TLLs in the feedback loops. The current mirrors are excluded from Fig. 7 for the sake of simplicity. Such multiplier/divider schemes are utilized in the feedback loop to implement the analog optimizer for converting duration (time) to frequency (where all variables are represented as currents) and to implement the pseudo-resistor controlling current definition by means of ratio of current multiplications.
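The translinear principle used by the multiplier/divider can be illustrated numerically. The sketch below assumes the standard bulk-referenced weak-inversion model of a saturated transistor and an idealized alternating loop in which neighboring transistors alternately share a gate node and a source node. The node assignment, device parameters, and function names are assumptions made for illustration and do not reproduce the exact wiring of Fig. 7.

```python
import math

U_T = 0.0258  # thermal voltage near room temperature, in volts


def weak_inversion_current(v_gate, v_source, i_spec=1e-6, n=1.3, v_t0=0.45):
    """Saturated weak-inversion drain current; voltages are referred to the bulk.

    i_spec, n, and v_t0 are hypothetical device parameters.
    """
    return i_spec * math.exp((v_gate - v_t0) / (n * U_T)) * math.exp(-v_source / U_T)


def alternating_loop_currents(v_gate_a, v_gate_b, v_source_a, v_source_b):
    """Currents of four identical transistors in an idealized alternating loop.

    M1 and M2 share gate node a, M2 and M3 share source node b,
    M3 and M4 share gate node b, M4 and M1 share source node a.
    """
    i1 = weak_inversion_current(v_gate_a, v_source_a)
    i2 = weak_inversion_current(v_gate_a, v_source_b)
    i3 = weak_inversion_current(v_gate_b, v_source_b)
    i4 = weak_inversion_current(v_gate_b, v_source_a)
    return i1, i2, i3, i4


def multiplier_divider(i_x, i_y, i_z):
    """Ideal single-quadrant function implied by the loop: i_x * i_y / i_z."""
    return i_x * i_y / i_z


if __name__ == "__main__":
    i1, i2, i3, i4 = alternating_loop_currents(0.30, 0.35, 0.00, 0.02)
    print("I1*I3 =", i1 * i3)
    print("I2*I4 =", i2 * i4)
    print("I4 from the other three:", multiplier_divider(i1, i3, i2), "vs", i4)
```

In this model the slope factor and the threshold voltage cancel around the loop, so the product of the CW currents equals the product of the CCW currents regardless of the node voltages, as long as all devices are matched, saturated, and in weak inversion.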
The simulation results of the implemented single-quadrant current multiplier/divider are given in Fig. 8 for different levels of the three input currents, where the block functionality is given in (9). The TLL calculates the inverse of the current appearing in the denominator of the function. Here, one input current is constant and equal to 1 µA for all curves in the figure. In Fig. 8, the ideal linear output current is also provided at one current level for comparison. The linearity of the translinear loop degrades from ideal as the current levels increase.
Each pseudo-resistor in the resistive network is realized as a single MOS transistor operating in weak inversion, where the equivalent conductance of each transistor is controlled independently by a current by means of a control transistor (Fig. 9), thus requiring only a few transistors.
For a given value of the gate voltage V_G, a transistor is in weak inversion if both V_S and V_D are large enough to keep both ends of its channel weakly inverted. This condition is equivalent to ensuring that its saturation current is much smaller than its specific current. Recall that the channel current of a MOS transistor operated in weak inversion can be written as given in (10). By defining a pseudo-voltage (independent of the gate voltage) as in (11), with opposite signs for p-channel and n-channel devices, in terms of an arbitrary positive scaling voltage [21], [22] and the channel potential V, the channel current of the MOS device can be rewritten as (12), which corresponds to a linear pseudo-Ohm's law. The pseudo-voltage is always negative for an n-channel transistor (positive for a p-channel) and tends to 0 for large values of V. Thus, the pseudo-ground (the zero reference for pseudo-voltages) is obtained by imposing a potential large enough to make sure that the transistor is saturated. The pseudo-conductance, which is controllable by the gate voltage, and thus the pseudo-resistance, can be defined as in (13) in weak inversion operation. In such systems, the linear pseudo-Ohm's law is valid, i.e., a network of transistors remains linear with respect to currents, and the pseudo-resistance of each transistor is controllable independently by the value of its gate voltage. The conductance of each pseudo-resistor may alternatively be controlled by a current by means of a control transistor, as shown in Fig. 9. In this solution, all transistors must share the same substrate, and the reference voltage is common to all the control transistors of the network and selected to ensure weak inversion in all possible situations. The mapping of resistive networks into their transistor-based equivalents on the basis of the pseudo-Ohm's law can be generalized: any arbitrary network of linear resistors can be implemented by replacing each resistor by a transistor, with all the transistors in the same substrate [22]-[24].
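For reference, the pseudo-Ohm's law described above is commonly written in the following form for an n-channel device in weak inversion. This is a sketch based on the standard weak-inversion model and is not guaranteed to match the notation of (10)-(13) term by term. The symbols $V^{*}$ (pseudo-voltage), $G^{*}$ (pseudo-conductance), and $V_0$ (scaling voltage) are assumed names; for a p-channel device the signs of the voltages and currents are reversed.

```latex
\begin{align}
I &= I_S\, e^{\frac{V_G - V_{T0}}{n U_T}}
     \left( e^{-V_S/U_T} - e^{-V_D/U_T} \right) \\
V^{*} &= -V_0\, e^{-V/U_T} \\
I &= G^{*}\left( V_D^{*} - V_S^{*} \right),
\qquad
G^{*} = \frac{1}{R^{*}} = \frac{I_S}{V_0}\, e^{\frac{V_G - V_{T0}}{n U_T}}
\end{align}
```

Written this way, the current is linear in the pseudo-voltage difference, and the pseudo-conductance depends only on the gate voltage, which is why a network of such transistors divides currents like a network of linear resistors.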
A resistor connected to ground potential in the resistive network corresponds to a saturated MOS transistor (operated in weak inversion) that provides a pseudo-ground in the equivalent pseudo-resistive network (refer to Fig. 10). Any current flowing to the pseudo-ground can easily be extracted without influencing the branch current ratio by means of a grounded current mirror made of transistors complementary to those of the network, as shown in Fig. 10 [16]-[18]. Hence, grounded current mirrors are used to sense each branch current separately for further use in the feedback loop presented in Fig. 5.
This solution could become problematic when we consider the mapping of more complex task graphs into their equivalent resistive networks. An example resistive network mapping is given in Fig. 11(a), corresponding to an arbitrary task graph. In order to measure the individual branch currents of the parallel section, we should duplicate the parallel branch consisting of the two corresponding pseudo-resistors and bias this copy with the current flowing through the main parallel branch, as in Fig. 11(b). Since there are now two currents flowing towards ground, we can easily measure the branch currents individually by means of grounded current mirrors. The result of current-mode processing in each loop is the current (14) that drives (controls) the corresponding pseudo-resistor, as illustrated in Fig. 5. The scaling factor in (14) is proportional to the equivalent switching capacitance, which may be different for different processing elements. The static current consumption of the PE (proportional to the total number of gates) is modeled with a static GC, which is added to the loop. This current is added to the dynamic current consumption, resulting in (14). Consequently, the corresponding device conductance value changes according to (14). This change in the value of the device conductance forces all the branch currents in the RN to be adjusted by means of KCL. As the system settles to its new operating point, the new device currents in the pseudo-RN are determined by KCL, dictating the optimum task duration with the prescribed supply voltage and operating frequency for each PE and for each task to minimize system-wide energy dissipation. Detailed simulation results of the implemented pseudo-resistive network can be found in [25].
Minimum current limiter (maximum current selector) blocks are used to restrict the operation range to a defined value. The upper limit of operation is intrinsically set by the technology, since the maximum allowed core operating voltage is fixed at a constant level. To guarantee that the lower limit of operation, which is 1.2 V or 150 MHz, is not violated under any circumstances, two minimum current limiter blocks are used in all the feedback loops. For this purpose, a combination of nMOS transistors is used to carry out addition/subtraction of the replicas of the input currents, as given in Fig. 12(a). The figure also shows the simulation results of the block. Note that the output current follows the higher of the two input currents.
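The behavior of such a block can be sketched at the functional level. The model below assumes that the selection is obtained by adding to one input the rectified difference of the two inputs, using the fact that an ideal current mirror cannot deliver a negative output current. This is one plausible reading of "addition/subtraction of the replicas of the input currents" and is not the transistor-level circuit of Fig. 12(a); the function names are hypothetical.

```python
def mirror_output(current):
    """Ideal unidirectional current mirror: negative differences are clipped to zero."""
    return max(current, 0.0)


def maximum_current_selector(i_signal, i_floor):
    """Return the larger of the two currents using only add/subtract and clipping.

    i_floor plays the role of the minimum allowed current (lower operating limit).
    """
    return i_floor + mirror_output(i_signal - i_floor)


if __name__ == "__main__":
    floor = 5e-6  # hypothetical lower limit, in amperes
    for signal in (2e-6, 5e-6, 8e-6):
        print(signal, "->", maximum_current_selector(signal, floor))
```

When the signal current falls below the floor, the output stays at the floor; otherwise the output tracks the signal, so the loop variable never drops below the lower operating limit.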
V. CONFIGURABLE PSEUDO-RESISTOR ARRAY
Typically, a large number of diverse applications can be mapped and run on high-performance distributed embedded systems (SoC/NoC). Hence, the proposed circuit architecture should be built with a modular approach to support different software applications, as opposed to a hardwired circuit solution. Such a modular architecture can be realized by implementing an array of pseudo-resistors with corresponding control feedback loops (CFL) and the required number and types of current mirrors to pick up the branch currents at the pseudo-ground nodes of the RN, as shown in Fig. 13. In addition to these modular building blocks, several constant current sources can be implemented to model various task graph periods. It should be noted that the switching network is not shown in the figure for the sake of simplicity. Consequently, the implemented array of pseudo-resistors can be easily expanded to support any arbitrary TG that can be mapped on the given system of PEs. Recall that device currents model the corresponding task durations. Hence, each device current should be picked up and fed back to the related CFL. Moreover, extracting currents flowing to the pseudo-ground is preferred in order not to influence the branch currents. Hence, it is favorable to configure the RN in such a way that, in any constructed architecture, the maximum number of parallel branch currents flow to the pseudo-ground nodes of the RN.
A simple example of such a modular configuration, based on the built-in pseudo-resistor array, is shown in Fig. 14(b) for the TG of Fig. 14(a). Here, the connections necessary for the given configuration are indicated as dashed lines and, for the sake of easy identification, only the sub-blocks used for the given configuration are shown. In the given TG, the task on the first PE is executed in parallel with three sequential tasks on the second PE. Therefore, as a consequence of parallelism, the available time is split among the three sequential tasks according to their instantaneous workloads, whereas the task on the first PE can be executed during the whole task graph period T. Consequently, the pseudo-resistors modeling the tasks mapped on the second PE are connected in parallel, and the resulting RN is connected in series with the pseudo-resistor modeling the first task, as shown in Fig. 14(b). Note that the parallel section of the equivalent RN is connected to the pseudo-ground node.
VI. CLOSED-LOOP OPERATION OF THE OPTIMIZER
After showing how the feedback loop is implemented and how the proposed optimizer block can be made configurable, it is now necessary to show how the network of controlled resistors operates for a given set of workload requirements. It is important to highlight that the feedback loop responsible for updating each pseudo-resistor value operates in continuous time (based on the GC response) rather than in a discrete-time iteration. The stability behavior of the feedback loops, taking into account the coupling between loops through the RN, has been thoroughly analyzed (refer to Appendix B). It was shown that the dynamic behavior of each resistive element control loop is governed by a single-dominant-pole transfer function. Therefore, as also shown analytically, the entire system always converges to a stable and unique operating point for a given set of workloads. Also, note that the GC can effectively capture the actual frequency-voltage-power relationship of the PEs, reflecting the actual operating conditions on-chip (inherently taking into account the local variations of temperature as well as process-related fluctuations of device parameters) and eliminating any analytical approximation of the physical behavior that is inherently prone to inaccuracies. Fig. 15 shows the simulated versus measured operation of a three-loop optimizer network, which is used to model the behavior of a TG comprising three sequential tasks. Here, the supply voltages resulting in the optimum system energy dissipation are shown for various workload combinations, indicated for each simulation interval. Although the numerical values differ slightly between simulation and measurement results, the desired voltage range of 1.2 V-1.8 V is fully utilized, with some voltage offset. It should be noted that the measurement results are also plotted in the time domain only to allow simulation results and measured values to be superimposed in a single graph, although the tests are not continuous in time. Similarly, Fig. 16 shows the corresponding simulated and measured task durations (branch currents) for the same set of workload conditions. Note that the sum of the branch currents, monitored off-chip, is slightly higher than the resistive-network biasing current due to the current mirroring error caused by the on-board resistor loads connected to the drains of the NMOS current mirror devices. Still, the current mirroring error encountered is less than 0.5 µA for all settings.
The available time is shared among the three tasks for all workload conditions, guaranteeing timing constraints and optimizing the dissipated energy in the system by means of optimally utilizing the available time. The comparison of measured and simulated branch currents as well as the GC supply voltages shows a good agreement between simulated and measured values.
As can be seen there are minor differences between measured and simulated ghost circuit supply voltages and device currents. This behavior is mainly due to the physical placement of the third loop circuit with respect to the biasing circuitry causing mismatch between bias currents of the three loops. Furthermore, current mismatch in current mirrors extensively used in all three loops and the mismatch in the absolute value of the resistors used in the loop could be the possible secondary cause of the differences between measured and simulated values.
The mismatch in the loops of the analog optimizer is equivalent to a relative error in the predicted (estimated) workload levels. Consequently, the accuracy of the system can be modeled with the precision of the estimated workload conditions. Furthermore, since the workload of a given task is represented with a 4-bit coded value, an error of approximately 6% (one step in 16) in the predicted workload of each task is inevitable due to quantization. Hence, trying to increase the accuracy of the system beyond this value is unnecessary. Still, the mismatch can be calibrated out with a per-loop pre-correction after fabrication. In Table I, the simulated supply voltages (V), operating frequencies (MHz), and task durations (branch currents, µA) of the same system are compared for the proposed global optimization approach versus local energy optimization applied to each task. Note that only the workload of the first task increases throughout the table. Hence, in the local optimization scheme, the core supply voltage levels and operating frequencies remain constant during the second and third tasks, resulting in higher power dissipation and energy consumption in the overall system.
In contrast, when using the proposed global optimization approach, any change in the workload condition of any of the tasks influences all task durations (hence, supply voltages and operating frequencies), corresponding to a minimization of the total system energy dissipation by optimally using the overall available time T. The additional energy savings vary between 11% and 20% for the different cases.
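The difference between local and global optimization can be illustrated with a deliberately simplified model. The sketch assumes that frequency is proportional to supply voltage, that the dynamic energy of a task is proportional to its cycle count times the square of the voltage, and that the "local" scheme gives every task a fixed, equal time slot. Voltage limits, leakage, and the actual circuit behavior are ignored, so the numbers are not comparable with Table I; all names and values are hypothetical.

```python
def task_energy(cycles, duration):
    """Dynamic energy of one task in normalized units.

    Assumed idealized model: f = cycles / duration, V proportional to f,
    energy = cycles * V**2, with all proportionality constants set to 1.
    """
    voltage = cycles / duration
    return cycles * voltage ** 2


def total_energy(workloads, durations):
    """Sum of the task energies for a given allocation of durations."""
    return sum(task_energy(n, t) for n, t in zip(workloads, durations))


def fixed_slots(workloads, period):
    """Local scheme (assumed): each task owns an equal, fixed share of the period."""
    return [period / len(workloads)] * len(workloads)


def shared_time(workloads, period):
    """Global scheme under this model: time is shared in proportion to workload."""
    total = sum(workloads)
    return [period * n / total for n in workloads]


if __name__ == "__main__":
    period = 4.0
    for first in (1.0, 2.0, 3.0):  # only the first task's workload increases
        workloads = [first, 1.0, 1.0]
        e_local = total_energy(workloads, fixed_slots(workloads, period))
        e_global = total_energy(workloads, shared_time(workloads, period))
        print(workloads, f"local={e_local:.3f} global={e_global:.3f}")
```

In this model, the proportional sharing follows from minimizing the sum of the task energies subject to the durations adding up to the period; when the workloads are unequal, redistributing time across tasks always costs less energy than keeping fixed slots.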
VII. SILICON IMPLEMENTATION AND MEASUREMENT RESULTS
The three-loop demonstrator circuit of the proposed analog optimizer architecture has been implemented using a 0.18 µm standard digital CMOS process (Fig. 17). The overall circuit area of the optimizer is 250 µm × 700 µm excluding decoupling capacitors, while each loop circuit occupies only 180 µm × 120 µm. The circuit is capable of supporting the desired frequency range of 170-290 MHz, as well as the voltage range of 1.2-1.8 V. The average power consumption of the entire three-loop optimizer is 6.5 mW.
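As a consistency check, the per-task figures quoted in the abstract follow from these numbers by simple arithmetic, assuming that one loop corresponds to one task and that the total power is divided evenly among the three loops.

```latex
\begin{align}
A_{\mathrm{loop}} &= 180\,\mu\mathrm{m} \times 120\,\mu\mathrm{m}
  = 21\,600\,\mu\mathrm{m}^2 \approx 0.0216\,\mathrm{mm}^2 \\
A_{\mathrm{total}} &= 250\,\mu\mathrm{m} \times 700\,\mu\mathrm{m}
  = 0.175\,\mathrm{mm}^2 \\
P_{\mathrm{loop}} &\approx \frac{6.5\,\mathrm{mW}}{3} \approx 2.2\,\mathrm{mW}
\end{align}
```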
The test chip has been integrated to validate the on-line analog optimization concept and to verify the functionality and performance of the circuit techniques used. The implemented circuit also employs specific circuit blocks and dedicated control signals to ensure maximum configuration flexibility and improve the testability of the circuit. Despite its simple structure, the test configuration circuitry is crucial to the operation of the overall system. This combinatorial circuit processes the control signals applied through dedicated I/O pads and generates the global power-down signal, the individual power-down signals for each feedback loop, and the power-down signal for the stand-alone frequency-to-current converter loop, along with several test configuration signals. The internal signals at critical nodes can be accessed and observed off-chip, and the feedback loops can be configured to operate in open-loop configuration so that the loops can be closed through external signals fed to the blocks. This allows testing the circuit thoroughly in a variety of configurations and conditions. Physical and geometrical symmetry is one of the most important considerations in the back-end design of sensitive analog circuits. Thus, common-centroid placement is preferred for translinear loops, current sources, and differential pair layouts. 45° bends have been used on the signal paths instead of 90° turns, since the parasitic resistance of a metal line with a 90° angle is much larger than that of a metal line with a 45° bend. In the layout of the analog optimizer, each sub-block has been shielded separately. Noise generated by minority and majority carriers has been treated separately, and a shield has been drawn for each, in the n-well or the p-substrate depending on the case.
Although the substrate and n-well biasing contacts are connected to their respective supply nodes, they have not been shorted to the shielding pick-up vias, which are on the same metal layer and close to them. Instead, the two have been connected to the power routing through different sets of vias, and are connected to each other only at the highest metal layer. In this way, when a noise component is picked up by a shield, it does not couple directly to the substrate of the circuit. Instead, it first couples to the highest metal layer of the related power routing, and is carried out of the chip via a few sheet resistances of that metal.
For the three-parallel-loop configuration, 125 different workload combinations were tested. Fig. 18 shows a set of measured ghost circuit supply voltages and pseudo-resistive network branch currents (task durations), together with the clock frequency to be applied to the processing element. Each data set in Fig. 18 consists of three columns with different filling patterns, showing the measured supply voltage, clock frequency and duration, respectively, for the three tasks, and indicates the response of the circuit to a different workload condition for all three loops. The corresponding supply voltage and branch current values indicate that the proposed analog optimizer is capable of responding to varying operating conditions with a wide dynamic range. The analog optimizer block dictates the optimum operating voltage and duration of all three tasks for minimum system energy consumption. It should be noted that the measured supply voltages range from 1.2 V to 1.8 V, showing a significant (approximately 35%) variation.
A comparison of measured and simulated branch currents and loop supply voltages is provided in Fig. 19 for 10 different workload conditions for the third network branch (modeling the third task of the TG). Each data point shows the measured response of the circuit against the simulated one at a different workload combination, and the simulated and measured values are in good agreement. The branch current varies between 3 µA and 8 µA for different workloads during operation, indicating that the duration of each task can be adjusted by more than a factor of 2. Fig. 20 shows the variation of the overall energy dissipation of the same system composed of three sequential tasks as a function of changing workload conditions, calculated from measured voltage/frequency and task duration values. To test the optimality of this solution, the branch current values were perturbed from their actual values (while keeping the sum constant) and the energy surface was recalculated. The resulting energy surface is clearly higher than the original solution for all workload combinations and for all branch current perturbations, demonstrating that the original solution is indeed the minimum energy surface.
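The perturbation test described above can be illustrated numerically. The following is a minimal sketch under a deliberately simplified model, not the measured circuit behavior: it assumes that the clock frequency of a task is its workload divided by its duration, that the supply voltage is proportional to frequency, and that dynamic energy is proportional to workload times voltage squared, so that the energy of a task with workload W and duration t scales as W³/t². The function names and the constant k are hypothetical.

```python
def total_energy(workloads, durations, k=1.0):
    """Energy of sequential tasks under the assumed model E_i = k * W_i**3 / t_i**2."""
    return sum(k * w**3 / t**2 for w, t in zip(workloads, durations))

def optimal_durations(workloads, period):
    """Minimizer of total_energy subject to sum(durations) == period.

    For the assumed model, setting the derivatives equal gives t_i proportional to W_i.
    """
    total = sum(workloads)
    return [period * w / total for w in workloads]

def perturb(durations, i, j, delta):
    """Move delta of time from task j to task i, keeping the sum constant."""
    out = list(durations)
    out[i] += delta
    out[j] -= delta
    return out

workloads = [1.0, 2.0, 3.0]   # illustrative workloads in arbitrary units
period = 6.0                  # illustrative total available time
best = optimal_durations(workloads, period)
e_best = total_energy(workloads, best)

# Every constant-sum perturbation should increase the energy.
for delta in (-0.2, 0.2):
    for i, j in ((0, 1), (1, 2), (0, 2)):
        assert total_energy(workloads, perturb(best, i, j, delta)) > e_best
print(best, e_best)
```

Because W³/t² is convex in t for positive durations, the constrained minimum of this simplified model is global, which mirrors the observation that every perturbed energy surface lies above the original one.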
The settling time of the ghost circuit supply voltages is important because the circuit must be fast enough to track changes in the workload conditions for real-time optimization. The measured worst-case settling time of the ghost circuit supply voltage for the step-up response is less than 50 µs. Similarly, the measured settling time for the step-down response is 60 µs.
VIII. CONCLUSION
In this work, the energy optimization problem in SoC/NoC applications is discussed with a unique analog implementation approach. The analogy that exists between the energy minimization problem under timing constraints in a general TG and the power minimization problem under Kirchhoff's current law constraints in an equivalent RN is exploited. The principles of mapping an arbitrary task graph to an equivalent resistive network are presented. A fully analog, current-based solution to implement on-line energy minimization in complex multi-core systems under varying workload conditions is demonstrated, which achieves significant overall energy savings compared to the local energy minimization approach. The optimized voltage/frequency assignments are tailored to the instantaneous workload information and fully adaptive to variations in process and temperature. The proposed architecture is oriented towards supporting the challenges of energy management of multi-processing element architectures in SoC/NoC applications. The optimizer unit has a fast response time of 50 µs, occupies a silicon area of 0.021 mm²/task and dissipates 2 mW/task.
The proposed analog optimizer overcomes the shortcomings that until today have limited the availability of true on-chip (integrated) solutions for multiple processing elements: the inability to adapt the optimization results to changing environmental conditions (e.g., temperature, process variations), the lack of on-line optimization (fast response time), and high power consumption. As such, the proposed optimizer can be used as a generic building block for on-line energy optimization in complex systems.
APPENDIX A
As shown in Fig. 3, tasks that must be executed in parallel on two different PEs must, as a consequence of parallelism, have the same duration. Similarly, in a resistive network branch consisting of two series-connected resistors, each resistor must carry the same amount of branch current.
Based on this analogy, all parallel tasks can be converted into series-connected branches in the equivalent resistive network.
However, tasks mapped in series on a single PE, e.g., tasks and , can only be run sequentially in time. Consequently, the amount of available time will be split among the sequential tasks according to their actual workload. Hence, the amount of time necessary for execution of task will be split among and according to the actual workload of these two tasks. Similarly, in a RN branch of parallel connected resistors the main branch current will be shared proportionally between the parallel branches according to KCL. Hence, all sequential tasks can be represented by parallel-connected branches in the equivalent resistive network.
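The analogy between sequential tasks sharing the available time and parallel resistors sharing a branch current can be sketched with an ideal current divider. This is an illustration only: it assumes fixed conductances chosen proportional to the task workloads, whereas in the proposed optimizer the conductances are set by feedback loops. The function name and the numeric values are hypothetical.

```python
def split_current(total_current, conductances):
    """Ideal current divider: each parallel branch carries I * G_k / sum(G)."""
    g_sum = sum(conductances)
    return [total_current * g / g_sum for g in conductances]

# Two sequential tasks with workloads in the ratio 1:3 share a time budget of 10 units.
# Assumption for this sketch: branch conductance is proportional to task workload.
shares = split_current(10.0, [1.0, 3.0])
print(shares)  # [2.5, 7.5]
```

By KCL the branch currents always add up to the main branch current, just as the durations of the sequential tasks add up to the available time.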
From the above explanations, the generalized steps involved for mapping the given task graph into a parallel-resistive network equivalent are as follows:
• Identify and assign the processing elements and the tasks mapped and scheduled onto the given system.
• Insert IN and OUT nodes into the given TG.
• Simplify the task graph by replacing the edges representing processing element sharing by equivalent edges capturing the data/control dependencies between PEs.
• Convert all parallel tasks in the simplified TG to series-connected resistor branches in the resistive network.
• Convert all series tasks in the simplified TG to parallel-connected resistor branches in the resistive network.
• Replace the IN node by a DC current source modeling the TG period and the OUT node by the ground connection providing the necessary current path in the resistive network.
However, not every task graph is in series/parallel configuration. The TG given in Fig. 21(a) is an example of such a non-series/parallel configuration. Still, an equivalent RN can be mapped from the given TG without violating the corresponding timing constraints as shown in Fig. 21(b), where the cut-sets are highlighted by dashed lines. Recall that can only start after processing of on the first processing element and and on the second processing element are finished. Similarly, in order to finish the assigned job just in time, execution time of tasks and should be equal to the summation of task durations and , and and respectively.
Here, timing constraints, e.g., , and , are intrinsically satisfied due to KCL constraints in the RN, i.e., , , and respectively. Hence, the equivalent RN of controlled resistors can be mapped for any arbitrary TG, where each device current represents the available time for the corresponding task, the overall available time to complete the job within the defined deadline and the device conductances the corresponding task.
Although the applied mapping scheme has a certain resemblance to creating the dual of a given task graph, it is important to emphasize that the mapping of a given task graph to its equivalent resistive network is based on converting the time domain relation between tasks into equivalent resistive network currents.
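For task graphs that are in series/parallel configuration, the conversion steps listed in this appendix amount to swapping series and parallel composition. The sketch below is a minimal illustration with hypothetical class and function names; it does not handle the non-series/parallel case of Fig. 21, which requires the cut-set treatment described above. It also checks that the total time of the task graph equals the total current of the mapped network when each task duration is reused as the corresponding device current.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str

@dataclass(frozen=True)
class Series:      # elements one after another
    parts: tuple

@dataclass(frozen=True)
class Parallel:    # elements side by side
    parts: tuple

def to_resistive_network(tg):
    """Map a series/parallel task graph to its resistive network (swap the two)."""
    if isinstance(tg, Task):
        return tg  # one controlled resistor per task
    mapped = tuple(to_resistive_network(p) for p in tg.parts)
    return Parallel(mapped) if isinstance(tg, Series) else Series(mapped)

def _common(values):
    """Return the single value shared by all elements, or fail if they differ."""
    first = values[0]
    if not all(math.isclose(v, first) for v in values):
        raise ValueError("elements that must share one value differ")
    return first

def tg_duration(tg, durations):
    """Sequential tasks add their durations; parallel tasks share one duration."""
    if isinstance(tg, Task):
        return durations[tg.name]
    parts = [tg_duration(p, durations) for p in tg.parts]
    return sum(parts) if isinstance(tg, Series) else _common(parts)

def rn_current(rn, currents):
    """Parallel branches add their currents (KCL); series devices share one current."""
    if isinstance(rn, Task):
        return currents[rn.name]
    parts = [rn_current(p, currents) for p in rn.parts]
    return sum(parts) if isinstance(rn, Parallel) else _common(parts)

# Hypothetical example: tasks A and B run in parallel, followed by task C.
tg = Series((Parallel((Task("A"), Task("B"))), Task("C")))
rn = to_resistive_network(tg)
values = {"A": 2.0, "B": 2.0, "C": 3.0}
print(tg_duration(tg, values), rn_current(rn, values))  # 5.0 5.0
```

In this representation the timing constraints of the task graph and the KCL constraints of the network are the same arithmetic, which is the point of the mapping.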
APPENDIX B
As already mentioned, the concept of system stability needs to be considered when several components adopt dynamic policies to control energy consumption and performance. Possible oscillations in energy/performance space that could be caused by applied power management policies are undesirable, and should be avoided. In this section it will be shown that the dynamic behavior of each device control loop is governed by a single-dominant-pole transfer function, and that the entire system (the centralized optimizer unit) always converges to a stable operating point for a given set of workload values. In order to derive the transfer function of the feedback control loop, the loop is opened on the resistive network. Hence, the branch current is treated as the input current (variable) and the pseudo-resistor controlling current is treated as the output current. Note that , , , , are constant biasing currents used in the feedback loop.
From the loop dynamics, the output current can be written as in (B1). Note that one can show the small variations in the value of a variable as , where lower case represents the variations in the value of the variable. Using this definition, the output current can be written as given in (B2).
(B1) (B2)
Note that we can express the ratios of the current representations of the ghost circuit supply voltage and the operating frequency as well as the current consumption of the ghost circuit and their variations in terms of the ratio of the input current and its variation as given in (B3) [25] .
and and
Finally, the device conductance controlling current being the output current and the related device current being the input current the small signal behavior of the feedback loop can be written as given in (B4), since branch conductances are linearly proportional to their controlling current . Hence, it is shown that the dynamic behavior of each branch control loop (feedback loop) is governed by a single-dominant-pole transfer function.
(B4) (B5) (B6)
Now, consider a resistive network consisting of three parallel branches to illustrate the stability properties of the system. If we write the first branch current in the resistive network comprising three parallel branches in terms of the resistive network biasing current and the other branch currents, we get (B5), where and represent the device conductance and the variations in the conductance value, respectively. Now, if we replace each quantity in (B5) by , and substitute (B4) wherever suitable, we can finally write (B6).
Consequently (B6) will look like (B7) after doing the necessary mathematical operations in order to determine the characteristic equation of the system.
(B7)
Now, if we rewrite the characteristic equation of the system as given in (B8), we can check the stability of the system by applying the Routh criterion.
(B8)
The principal stability criterion for linear systems states that a system is stable if all poles of its transfer function lie in the left half of the complex s-plane. Equivalently, a system is stable if the real parts of all roots of its characteristic equation are negative. Note that a root of the characteristic equation is synonymous with a system pole. To apply Routh's criterion, the Routh table should be created as given in Table II . The Routh criterion is applied by examining the sign of the coefficient in the column headed by . The number of sign changes in the elements of this column, taken in order, is equal to the number of roots of the characteristic equation that have positive real parts. Hence, in order to show that the system is stable we should verify that the sign of the expression is positive, since all the other components of the first column of the Routh table are positive quantities. Note that all and quantities are positive real values. Thus, , and are intrinsically positive for all or values. Note that is also always positive for all or values, since the definition of guarantees that the multiplicative factor in is always positive. After carrying out all the necessary multiplications, it is proved that the sign of the mathematical operation is always positive, guaranteeing that the proposed system is stable.
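The Routh test described above can be sketched generically. The code below builds the first column of the Routh table for an arbitrary polynomial and counts its sign changes. The example coefficients are hypothetical textbook polynomials, not the coefficients of (B8), and the sketch does not handle the special case of a zero appearing in the first column before the last row.

```python
def routh_first_column(coeffs):
    """First column of the Routh table.

    coeffs lists the polynomial coefficients from the highest power of s down to
    the constant term, e.g. [1, 6, 11, 6] for s^3 + 6 s^2 + 11 s + 6.
    """
    n = len(coeffs)
    cols = (n + 1) // 2
    row0 = list(coeffs[0::2])
    row1 = list(coeffs[1::2])
    row0 += [0.0] * (cols - len(row0))
    row1 += [0.0] * (cols - len(row1))
    rows = [row0, row1]
    for _ in range(n - 2):
        prev2, prev1 = rows[-2], rows[-1]
        pivot = prev1[0]
        if pivot == 0:
            raise ValueError("zero in first column: special case not handled here")
        new = [(pivot * prev2[j + 1] - prev2[0] * prev1[j + 1]) / pivot
               for j in range(cols - 1)]
        new.append(0.0)
        rows.append(new)
    return [r[0] for r in rows]

def count_sign_changes(column):
    """Number of sign changes, equal to the number of right-half-plane roots."""
    signs = [x > 0 for x in column]
    return sum(a != b for a, b in zip(signs, signs[1:]))

stable = routh_first_column([1, 6, 11, 6])     # roots at -1, -2, -3
unstable = routh_first_column([1, 1, 2, 8])    # roots at -2 and 0.5 +/- j1.94
print(stable, count_sign_changes(stable))      # [1, 6, 10.0, 6.0] 0
print(unstable, count_sign_changes(unstable))  # [1, 1, -6.0, 8.0] 2
```

For a third-order polynomial the only first-column entry that is not a raw coefficient is the third one, which matches the structure of the argument above: once the coefficients are known to be positive, stability reduces to checking the sign of a single expression.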
