Abstract-Chip-multi-processors (CMPs) utilize multiple energy-efficient processing elements (PEs) to deliver high performance while reducing energy consumption. Dynamic frequency-voltage scaling (DVS) balances performance and energy consumption by varying the PEs' frequency-voltage workpoints to save energy while meeting performance requirements. We consider multi-task CMP applications with unknown workloads, and dynamically set workpoints to minimize $ET^2$, the product of energy and squared execution time. Heuristic policies for serial/parallel task graphs are investigated. We compare these policies to a theoretical bound and show that they achieve good results with low complexity. In most cases the simplest policy, which usually assigns constant workpoints, is also the most cost-effective one.
Transient performance on dynamic output change.
TABLE I PERFORMANCE COMPARISON
Seven series regulators were not included.
This includes the controller and power transistors, but not on-chip output capacitor.
Finally, Table I compares this work with the prior art across the major design metrics. With the largest power range, this design exhibits very competitive performance.
IV. CONCLUSION
In this paper, a new integrated one-cycle controlled switching converter is presented. To allow the use of a standard low-cost CMOS process and to achieve fast transient response, a non-inverting SC integrator is employed. Tight load regulation is achieved with the use of an additional outer loop. Experimental results show very competitive performance across the power converter design metrics. Furthermore, the design techniques can be applied to other DC-DC or AC-DC converters.
I. INTRODUCTION
Chip-multi-processor (CMP) architectures integrate multiple relatively small, energy-efficient processing elements (PEs) to enable energy-efficient performance scaling [1]-[3]. Dynamic frequency-voltage scaling (DVS) is a widely practiced [4] and researched [5] technique for energy-performance tradeoff in which a PE's frequency is altered dynamically to meet performance requirements while consuming limited energy. The supply voltage of each PE is kept at the lowest feasible value for the selected frequency. Scaling the frequency-voltage workpoint $(f, V)$ can result in near-quadratic energy savings [6].
In a CMP running multiple dependent tasks, DVS may save energy without degrading performance. At any given time, one or a few tasks typically constitute the performance bottleneck, while the other PEs can be slowed down without affecting total performance. We refer to slowing down non-critical tasks in a CMP as slack-utilization [7], [8].
When task workloads are known in advance, the DVS policy sets workpoints to utilize precisely all available slack. But when task workloads are unknown in advance, slack-utilization is nontrivial. Assuming worst-case workloads achieves limited energy savings since overly aggressive workpoints are needed. Conversely, overestimating slack can lead to performance degradation. Workload estimation is therefore a primary factor in the efficiency of DVS policies.
How DVS policy efficiency is measured depends on the energy/power/performance requirements of the target applications. Mobile battery-operated systems strive to extend battery life, while desktop systems and servers typically optimize power rather than energy. While a maximum-performance system would always run at the maximum frequency (the f-max policy), a minimal-energy system should execute at the minimum frequency (the f-min policy). A system which strives to balance energy and performance could maneuver $(f, V)$ between justifiable energy consumption at the high end and tolerable performance at the low end. This choice is reflected in the criterion selected to assess alternative DVS policies.
If the calculations required to implement a policy are to be integrated into the system itself and performed in real time, then it is essential that the energy consumption and delay of the policy calculation itself be minimal. Spending a substantial amount of execution time and energy merely to calculate workpoints may considerably offset the intended savings, making the policy impractical for a performance- and energy-aware system. Thus, low computational complexity is a key concern.
Previous DVS policies [9] - [12] for CMP attempt to formulate optimization problems which have high computational complexity. In contrast, we introduce lightweight heuristic policies, and show that they achieve good results compared to theoretical bounds. We further show that in most cases the simplest policy is the most cost-effective one.
This paper is organized as follows. Section II defines the minimization criterion and formulates the DVS minimization problem. The DVS policies are presented in Section III, and analyzed in Section IV. Conclusions are drawn in Section V.
II. DEFINITIONS AND PROBLEM FORMULATION
Consider a CMP running a multi-task application. Assume for simplicity that at any time, each PE accommodates one software task.
We assume each PE's $(f, V)$ workpoint is controlled independently of other PEs, although this may not yet be practical in current CMPs. We consider the case where all PEs are identical, although the method of this study may be generalized to heterogeneous-PE systems [3], [13], [14] with few modifications [15].
Each PE can operate in the frequency range $f \in [f_{\min}, f_{\max}]$ cycles/s, or be in a standby mode. The PE operates at the minimum feasible supply voltage for its frequency, defining a frequency-voltage $(f, V)$ curve [15]. A continuous frequency model is employed. In practice, the rate of changing frequency is limited by the energy and time penalties of making the transition, and by the added complexity of calculating a new workpoint [9]. We neglect workpoint transitions here, and later mention the effect of considering them on the final conclusions.
For a task of unknown workload, the cumulative distribution function $\mathrm{cdf}_W(w)$ of the workload $W$ is the probability that the task will be completed within $w$ or fewer cycles: $\mathrm{cdf}_W(w) = \Pr(W \le w)$.
Hence, the probability that the task will take $w$ cycles or more to execute is $1 - \mathrm{cdf}_W(w - 1)$. Some example distributions are displayed in Fig. 1, which shows probability density functions $\mathrm{pdf}_W(w) = \mathrm{cdf}'_W(w)$. These distributions are used in our simulations, as described in Section IV. In practical cases, the pdf is estimated based on previous task instances [15], [16].
The total power consumption of a PE running at frequency $f$ is $P(f)$ J/s, where power consists of active and standby power: $P(f) = P_{act}(f) + P(0)$. The active energy-per-cycle of the PE is $e_{act}(f) = P_{act}(f)/f$. Suppose that the task starts at time $t = 0$ and completes by time $t = T$. If it completes before that time, the PE goes to standby until $t = T$. Given $e_{act}(f)$ and $\mathrm{cdf}_W(w)$ or their estimations, we formulate the expected energy required to execute a task of workload $W$ on a PE $p$
$$E_p = \sum_{w \ge 1} \left[1 - \mathrm{cdf}_W(w - 1)\right] e_{act}(f_w) + P(0)\,T \qquad (1)$$
where $f_w$ is the frequency at cycle $w$. Equation (1) sums the active energy-per-cycle times the probability that the task will still be running at that cycle, adding the total standby energy. Following similar arguments, the expected execution time is
$$T_p = \sum_{w \ge 1} \frac{1 - \mathrm{cdf}_W(w - 1)}{f_w}. \qquad (2)$$
The application is modeled as alternating serial and parallel phases [3], [17]-[19] (see Fig. 2). We focus on DVS policies for the parallel phases. Tasks of equal workloads running on identical PEs at the same [...]
In a system of $N$ PEs, the total system energy and the combined execution time of a parallel phase are, respectively, the sum of the energies of all PEs and the execution time of the PE that completes last (the critical PE).
All tasks that complete before the execution time of the critical PE waste energy, since they could have run slower without degrading performance. We utilize the slack of non-critical PEs in order to save energy.
We employ the criterion of minimal expected value of $ET^2$, which has the useful characteristic of frequency invariance [6], since $E \propto f^2$ and $T \propto 1/f$. $ET^2$ is a widely used criterion for design spaces that consider scaling of frequency-voltage, both at the circuit [20] and system [21] levels. It is therefore specifically suited for fair comparison of DVS policies. We have not found it useful to use $ET^n$ with $n \neq 2$: using $n < 2$ gives an inherent advantage to the f-min policy, while $n > 2$ inherently favors the f-max policy. Thus, the optimal policy can be formulated as the assignment of frequencies to all PEs, within $[f_{\min}, f_{\max}]$, that minimizes the expected value of $ET^2$.
Rigorous analysis of this stochastic optimization problem is beyond the scope of this paper. In the following, we employ heuristic policies to achieve computationally feasible solutions.
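As an illustration of the quantities defined in this section, the following Python sketch evaluates the expected energy and expected execution time of a single PE by weighting each cycle with the probability that the task is still running, and then checks the frequency invariance of the energy-times-squared-delay criterion on a deterministic workload. The function names, the discrete per-cycle probability list `pmf`, the quadratic energy-per-cycle model, and the zero standby power used in the check are assumptions made for this example; they are not taken from the simulations reported here.

```python
from typing import Callable, Sequence, Tuple


def expected_energy_and_time(
    pmf: Sequence[float],
    freqs: Sequence[float],
    e_act: Callable[[float], float],
    p_standby: float,
    t_end: float,
) -> Tuple[float, float]:
    """Expected energy and execution time of one task on one PE.

    pmf[i]   -- assumed probability that the workload is exactly i + 1 cycles
    freqs[i] -- frequency (cycles/s) used while executing cycle i + 1
    e_act    -- active energy-per-cycle as a function of frequency
    p_standby, t_end -- standby power P(0) and the end time of the phase
    """
    still_running = 1.0  # Pr(W >= 1), i.e. 1 - cdf(0)
    energy = 0.0
    time = 0.0
    for prob, f in zip(pmf, freqs):
        energy += still_running * e_act(f)  # a cycle costs energy only if reached
        time += still_running / f           # same weighting for its duration
        still_running -= prob               # Pr(W >= next cycle)
    energy += p_standby * t_end             # standby power is drawn until t_end
    return energy, time


def energy_delay_product(energy: float, time: float, exponent: float = 2.0) -> float:
    """Energy times delay raised to `exponent` (2 gives the criterion used here)."""
    return energy * time ** exponent


# Deterministic 4-cycle task, energy-per-cycle proportional to f^2, no standby power.
quadratic = lambda f: f * f
four_cycles = [0.0, 0.0, 0.0, 1.0]
slow = expected_energy_and_time(four_cycles, [0.5] * 4, quadratic, 0.0, 0.0)
fast = expected_energy_and_time(four_cycles, [1.5] * 4, quadratic, 0.0, 0.0)
print(energy_delay_product(*slow), energy_delay_product(*fast))
```

Under these assumptions both printed values come out at about 64, whereas an exponent below 2 makes the slow schedule look better and an exponent above 2 makes the fast schedule look better. For a stochastic workload the criterion is the expected value of the product, which in general differs from the product of the two expectations computed above.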
III. DVS POLICIES
In this section, we describe a group of simple DVS policies for a CMP. We first describe a common outline (see Fig. 3 ) for all policies and then specify the differences for each policy.
At the fork point of the parallel phase ($t = 0$ on the timeline), all policies perform the same steps:
Step 1) We compute the expected value of the (remaining) estimated work for all PEs, $\hat{W}_{p,rem}$. The PE with the largest expected remaining work, and hence the one expected to complete last, is the ECP.
Step 2) The ECP is assigned $f_{\max}$ in order to minimize standby power. If standby power were neglected, any frequency could be used because $ET^2$ is frequency-invariant. However, when $f_{\max}$ is assigned to the ECP the total run time is minimized and thus the contribution of standby power to the total energy is also minimized [15]. The joint-target-time (JTT) for completion of all tasks is the expected completion time of the ECP
$$\mathrm{JTT} = \frac{\hat{W}_{ECP,rem}}{f_{\max}}. \qquad (7)$$
Frequency assignment of the other PEs differs per policy and is described per each policy as follows.
Step 3) All PEs run at their assigned frequencies until they complete (except in the case of the Interval policy described in the following), or until the ECP completes. The ECP is expected to complete last but since workloads are stochastic this may not be the case.
Step 4) If the ECP completes (or, for Interval, a time interval elapses) while other PEs still have remaining work, re-estimation is performed, taking into account the work done by each PE so far. The cycle is repeated until all PEs complete their work (see Fig. 3).
We now consider each individual policy. The Oracle policy is a noncausal, hypothetical policy for which workloads are known in advance. We use Oracle as a best case against which real policies are compared. Since workloads are known, energy can be minimized by running each PE at the slowest constant workpoint that is still fast enough to finish the task at the JTT. This well-established result is due to the convexity of $e(f)$, the energy-per-cycle [22]. The optimal frequencies $f_p$ assigned to each PE are thus constant: the lowest feasible frequency at which the known workload of PE $p$ completes by the JTT.
The Constant policy assigns a constant frequency to each PE. We set the frequencies of non-critical PEs with an aim to complete at the joint-target-time
$$f_p = \frac{\hat{W}_{p,rem} + \beta\,\sigma_{p,rem}}{\mathrm{JTT}} \qquad (9)$$
where $\hat{W}_{p,rem}$ is the remaining work estimation of PE $p$, $\sigma_{p,rem}$ is the standard deviation of $W_{p,rem}$, and $\beta \ge 0$ is the bias parameter. At time $t = 0$ no work has been done, so the remaining work is the total work. For $\beta = 0$, $f_p$ is set so that PE $p$ completes $\hat{W}_{p,rem}$ work during the time it takes the ECP to complete $\hat{W}_{ECP,rem}$ work. However, since delay outweighs energy for the $ET^2$ criterion, we can achieve better results by setting the bias parameter $\beta > 0$ [15].
If the ECP completes while other PEs still have remaining work [step 3)], then the assumptions by which frequencies were assigned in step 2) no longer hold. We therefore update the estimations $\hat{W}_{p,rem}$ and $\sigma_{p,rem}$ to reflect the work completed so far, and repeat steps 1)-4) until all PEs complete. Fig. 4 shows a Constant policy example.
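One round of the Constant policy (steps 1 and 2 followed by the frequency assignment) can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: it assumes that a non-critical PE is given the frequency (expected remaining work + bias × standard deviation) / JTT, which is one reading of (9), and that the result is clamped to the feasible frequency range. The function name, argument names, and the example workloads are hypothetical; the 0.32 to 1.5 GHz range is the one used in Section IV.

```python
from typing import List, Sequence, Tuple


def constant_policy(
    expected_rem: Sequence[float],
    std_rem: Sequence[float],
    f_min: float,
    f_max: float,
    bias: float = 0.0,
) -> Tuple[int, float, List[float]]:
    """Return (index of the ECP, JTT in seconds, per-PE frequencies in cycles/s)."""
    # Step 1: the PE with the largest expected remaining work is the ECP.
    ecp = max(range(len(expected_rem)), key=lambda p: expected_rem[p])
    # Step 2: the ECP runs at f_max, which fixes the joint-target-time.
    jtt = expected_rem[ecp] / f_max
    freqs: List[float] = []
    for p in range(len(expected_rem)):
        if p == ecp or jtt == 0.0:
            freqs.append(f_max)
            continue
        # Assumed form of (9): aim to finish the biased estimate at the JTT.
        target = (expected_rem[p] + bias * std_rem[p]) / jtt
        freqs.append(min(f_max, max(f_min, target)))  # clamp to the feasible range
    return ecp, jtt, freqs


# Hypothetical estimates for three PEs (cycles), frequencies in Hz.
ecp, jtt, freqs = constant_policy(
    expected_rem=[4e5, 8e5, 2e5],
    std_rem=[1e5, 2e5, 5e4],
    f_min=0.32e9,
    f_max=1.5e9,
    bias=1.0,
)
print(ecp, jtt, freqs)
```

When the ECP completes while other PEs are still running, the same function would be called again with estimates updated for the work already done, which keeps the cost of each round linear in the number of PEs.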
If we regard the complexity of one $(f, V)$ workpoint assignment as $O(1)$, then the complexity of Constant is $O(N)$, $N$ being the number of PEs in the system [15].
The Interval policy assigns constant workpoints as in the Constant policy [see (9) ], but the critical PE is rechosen and workpoints are reassigned at intermediate fixed time intervals. At each time interval, estimated remaining workloads are used to choose the critical PE and frequencies for the next interval. Re-estimation may offer a significant advantage since the ratios between the estimated workloads at time t = 0 may vary significantly from those at a later time.
A bias parameter for the Interval policy allows tuning as in the Constant policy. Fig. 5 shows an Interval policy example. The frequency of non-critical PEs typically increases with time, similar to the behavior of PACE (Processor Acceleration To Conserve Energy) [16], but decreasing frequencies can also occur [15]. Since the complexity of Constant is $O(N)$, the complexity of Interval is $O(kN)$, where $k$ is the number of re-estimation intervals.
Energy-performance tradeoff in a single PE has been studied extensively [16], [23], [24]. A new Multi-PACE policy presented here is a heuristic generalization of the single-PE PACE policy [16] for CMPs. PACE shows that the optimal frequency function is increasing when workloads are unknown. PACE minimizes energy subject to meeting a deadline $D$ with probability PMD (probability of meeting the deadline). The PACE optimal frequency function is
$$f(w) = \mathrm{PACE}([f_{\min}, f_{\max}],\ \mathrm{pdf}_W(w),\ D,\ \mathrm{PMD}). \qquad (10)$$
In Multi-PACE, non-critical PE frequencies are set according to PACE with the JTT as the deadline
$$f_p(w) = \mathrm{PACE}([f_{\min}, f_{\max}],\ \mathrm{pdf}_W(w),\ \mathrm{JTT},\ \mathrm{PMD}). \qquad (11)$$
Note that JTT is computed following (7) and is not an application deadline. Multi-PACE uses the PACE deadline mechanism to synchronize completion times between PEs. PACE does not specifically define which frequency to use for post-deadline cycles ($w > w_{PMD}$). Multi-PACE runs at $f_{\max}$ during the cycles following $w_{PMD}$ to minimize delay past JTT. Fig. 6 shows a Multi-PACE example.
PMD has a significant effect on the $ET^2$ criterion, similarly to the bias parameter used in Constant and Interval. Setting it too high incurs excessive energy consumption, while setting it too low increases the probability of missing the JTT, thereby increasing overall execution time [15]. Multi-PACE requires that PEs have a continuous frequency range and be able to change frequency every cycle, which is impractical. Practical methods of implementing PACE, which can apply to Multi-PACE as well, are described in [16] and [24]. The number of frequency changes in Multi-PACE is in practice proportional to some number $B$ (which may represent the number of histogram bins used to collect previous instance data); thus, the complexity of Multi-PACE is $O(BN)$ [16].
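To make the shape of an increasing frequency schedule concrete, the following Python sketch builds a PACE-like schedule over histogram bins. It is not the algorithm of [16]; it rests on assumptions stated here: energy-per-cycle is taken as proportional to $f^2$, in which case minimizing expected energy subject to a fixed total time gives a frequency proportional to the inverse cube root of the probability that the task reaches each bin (a standard Lagrange-multiplier argument). The schedule is scaled so that the bins up to the PMD quantile finish exactly at the deadline, clamped to the feasible range (clamping can shift the actual completion time), and set to $f_{\max}$ afterwards as Multi-PACE does. The function and parameter names are hypothetical.

```python
from typing import List, Sequence


def pace_like_schedule(
    bin_probs: Sequence[float],
    bin_cycles: float,
    deadline: float,
    pmd: float,
    f_min: float,
    f_max: float,
) -> List[float]:
    """Per-bin frequencies; bin_probs[b] is the probability the task ends in bin b."""
    # survival[b] = probability that the task is still running when bin b starts.
    survival: List[float] = []
    remaining = 1.0
    for prob in bin_probs:
        survival.append(remaining)
        remaining -= prob

    # Last bin that must complete by the deadline: first bin where cdf >= PMD.
    last = len(bin_probs) - 1
    cumulative = 0.0
    for b, prob in enumerate(bin_probs):
        cumulative += prob
        if cumulative >= pmd - 1e-12:
            last = b
            break

    # Unscaled shape: frequency grows as the survival probability shrinks.
    shape = [survival[b] ** (-1.0 / 3.0) for b in range(last + 1)]
    # Scale so that the bins up to `last` take exactly `deadline` seconds.
    scale = sum(bin_cycles / s for s in shape) / deadline
    freqs = [min(f_max, max(f_min, scale * s)) for s in shape]
    # Post-deadline bins run at f_max to limit the delay past the deadline.
    freqs += [f_max] * (len(bin_probs) - last - 1)
    return freqs


# Two equally likely bins of 100 cycles each, deadline of 1 s, PMD = 1.
print(pace_like_schedule([0.5, 0.5], 100.0, 1.0, 1.0, 1.0, 1000.0))
```

In Multi-PACE the `deadline` argument would be the JTT, and the number of bins plays the role of $B$ in the complexity estimate above.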
IV. SIMULATIONS AND RESULTS
We simulated the various DVS policies on a system with six identical PEs, each with a continuous frequency range of 0.32 to 1.5 GHz, using the workloads of Fig. 1. Energy was calculated according to the power [...] running at an arbitrary fixed frequency, but worse than the ideal case where workloads are known in advance. The Interval policy usually achieves the best results while Constant usually achieves the worst results, and Multi-PACE typically lies in between. However, the difference between these policies is generally quite small (4%-13%).
Distribution (iv) of Fig. 1 is an example where the Interval policy stands out. If the bimodal tasks in (iv) complete $1 \times 10^5$ cycles of work and do not finish, their remaining estimated workload changes sharply to a deterministic $9 \times 10^5$ cycles. The re-estimation performed by Interval takes full advantage of this, setting a slow, constant frequency to complete the task precisely at the JTT. The results show that a few re-estimations over a relatively large period can generally make a significant difference, while additional re-estimations have only a marginal effect.
Multi-PACE typically achieves better results than Constant and worse than Interval in the simulated examples. Multi-PACE is more dependent on accurate estimation of the critical task: consider for instance distributions (v) versus (iii).
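The effect of re-estimation on a bimodal workload can be reproduced with a few lines of Python. The sketch below conditions a discrete workload distribution on the work already completed and returns the mean and standard deviation of the remaining work. The two peaks at $1 \times 10^5$ and $1 \times 10^6$ cycles follow the description of distribution (iv) above; the equal weights of 0.5, the dictionary representation, and the function name are assumptions made for illustration.

```python
import math
from typing import Dict, Tuple


def remaining_stats(pmf: Dict[int, float], done: int) -> Tuple[float, float]:
    """Mean and standard deviation of the remaining work, given `done` cycles
    were executed and the task has not finished yet."""
    # Keep only the workloads that are still possible, shifted by the work done.
    tail = [(w - done, p) for w, p in pmf.items() if w > done]
    mass = sum(p for _, p in tail)
    if mass == 0.0:
        return 0.0, 0.0
    mean = sum(r * p for r, p in tail) / mass
    variance = sum(p * (r - mean) ** 2 for r, p in tail) / mass
    return mean, math.sqrt(variance)


# Hypothetical bimodal workload: short or long task with equal probability.
bimodal = {100_000: 0.5, 1_000_000: 0.5}
print(remaining_stats(bimodal, 0))        # estimate at the fork point
print(remaining_stats(bimodal, 100_000))  # estimate once the short case is ruled out
```

At the fork point the estimate is highly uncertain, whereas after $1 \times 10^5$ cycles without completion the remaining work is known exactly, so a single constant frequency suffices from that point on.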
Regarding computational complexity, observe that a relatively small number of re-estimations $k$ is needed for Interval, and that since Multi-PACE performs no re-estimations it needs a relatively large number of histogram bins $B$ [16]. With $k \ll B$, the relative complexity of the policies is $O(N) < O(kN) \ll O(BN)$.
V. CONCLUSION
In this paper, we presented several DVS policies for CMPs, and demonstrated that simple policies achieve good results compared to more complex ones, and are within approximately 35% of optimal bounds. We started by formulating an energy-performance optimization problem of an application running on a CMP and noted the complexity of the problem, which makes it impractical for implementation. As an alternative, we described several heuristic DVS policies which utilize available time-slack to save energy in a performance-aware manner: Constant, Interval, and Multi-PACE. The frequency-invariant $ET^2$ criterion was employed for comparing the policies. Except for isolated cases, all policies reach comparable results. Increasing the number of re-estimations (using Interval) brings only a marginal improvement. Multi-PACE produces results that are anywhere between Interval and Constant, depending on the distribution. Since the results are usually quite close for all policies, we conclude that the least complex policy, Constant, is usually preferred. Use of Interval is justified only for certain distributions which benefit highly from re-estimation, and should be weighed against the added complexity. Based on these findings, a scheme could be contemplated whereby the number of intervals is chosen dynamically based on certain characteristics of the distribution or on past results. Multi-PACE generally does not achieve better results than either of the other two, and thus is not preferred due to its high complexity.
Taking workpoint transitions into account may degrade the results since each transition incurs performance and energy penalties [9] . When transition costs are considered, simple policies such as Constant become even more attractive because they employ fewer transitions.
Several topics are proposed for future research. More complex task graphs may be considered. Discrete $(f, V)$ workpoints could be studied. The Interval policy may be enhanced to consider re-estimation at flexible times. Test cases based on real application traces can be employed. Applications may be modified to estimate their own remaining work.
I. INTRODUCTION
Embedded semiconductor memories play an increasingly important role in the operation of integrated circuits and systems. Since advances in memory technology tend to make memory devices more and more complicated (due to the appearance of new defect mechanisms), considerable effort has been directed toward testing such modules efficiently [1]-[4], [14]-[21]. RAMs are typically classified as bit-organized or word-organized [4].
For the testing of embedded RAMs, march algorithms outperform competitive schemes, since they result in simple, yet effective, testing scenarios [5] . A march algorithm comprises a series of march elements that perform a predetermined sequence of operations (read and/or write) in every cell (for the case of bit-organized RAMs) or word (for the case of word-organized RAMs).
Testing of RAM modules is performed both right after manufacturing and periodically in the field. During manufacturing testing, various kinds of tests are applied in order to ensure that the RAM operates normally; typical tests applied during manufacturing testing are march tests. Traditional march algorithms, e.g., [5] - [7] , start with an initial write-all-zero phase, where all the RAM cells are set to 0 in order to ensure that the final signature in the output compactor is known [5] .
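The structure of a march algorithm can be illustrated with a short Python sketch. It models a bit-organized RAM with an optional single stuck-at fault and runs March C-, a commonly cited march test chosen here purely as an example (it is not named in the text above). The class `FaultyRam`, the tuple encoding of march elements, and the fault model are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

MarchElement = Tuple[str, List[Tuple[str, int]]]


class FaultyRam:
    """Bit-organized RAM model with an optional stuck-at cell (hypothetical)."""

    def __init__(self, size: int, stuck_at: Optional[Tuple[int, int]] = None):
        self.size = size
        self.cells = [0] * size
        self.stuck_at = stuck_at  # (address, forced value) or None

    def write(self, addr: int, bit: int) -> None:
        self.cells[addr] = bit

    def read(self, addr: int) -> int:
        if self.stuck_at is not None and self.stuck_at[0] == addr:
            return self.stuck_at[1]  # the faulty cell always reads the same value
        return self.cells[addr]


# Each march element: address order plus the operations applied to every cell.
MARCH_C_MINUS: List[MarchElement] = [
    ("up", [("w", 0)]),              # initial write-all-zero phase
    ("up", [("r", 0), ("w", 1)]),
    ("up", [("r", 1), ("w", 0)]),
    ("down", [("r", 0), ("w", 1)]),
    ("down", [("r", 1), ("w", 0)]),
    ("up", [("r", 0)]),
]


def run_march(ram: FaultyRam, elements: List[MarchElement]) -> bool:
    """Return True if every read returns the expected value."""
    for direction, ops in elements:
        addresses = range(ram.size) if direction == "up" else range(ram.size - 1, -1, -1)
        for addr in addresses:
            for op, bit in ops:
                if op == "w":
                    ram.write(addr, bit)
                elif ram.read(addr) != bit:
                    return False  # mismatch: a fault has been detected
    return True


print(run_march(FaultyRam(8), MARCH_C_MINUS))
print(run_march(FaultyRam(8, stuck_at=(3, 1)), MARCH_C_MINUS))
```

The sketch compares each read against its expected value directly, whereas a BIST implementation would compact the read responses into a signature. Note that the first element overwrites every cell with 0, so the original RAM contents do not survive the test.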
Periodic testing is divided into start-up testing and testing during normal operation. Start-up testing is performed during the start-up of the system and resembles manufacturing testing. In testing during normal operation, the RAM is taken out of normal operation, tested, and then returned to operation. This kind of testing is applied to circuits where it is difficult and/or impractical to shut down the system since the contents of the RAM cannot be lost. In this kind of testing, traditional march tests cannot be applied since (due to the initial write-all-zero phase) the contents of the RAM cells before the test are lost.
In order to confront the previously mentioned problems, transparent built-in self test (BIST) was proposed by Nicolaidis [1]; in a transparent BIST algorithm, the initial write-all-zero phase is skipped; instead, a signature prediction phase is issued that precedes the normal march series. During this signature prediction phase, a signature is captured and stored. Subsequently, a sequence of carefully selected read and write operations is performed that leaves the RAM contents equal to the initial ones while achieving the same fault coverage as the corresponding traditional march algorithm; the final signature is compared to the one captured during the signature prediction phase and a decision is made as to whether a fault has occurred in the RAM or not. The concept of transparent BIST is further analyzed in Section II-B.
Yarmolik et al. [8], [9] advanced the field by proposing the concept of symmetric transparent BIST. In symmetric transparent BIST, the signature prediction phase is skipped and the march series is modified in such a way that the final signature is equal to the all-zero state, irrespective of the RAM initial contents. For response compaction of bit-organized RAMs, a single-input shift register (SISR) was utilized in [8] whose characteristic polynomial toggles between a primitive polynomial and its reciprocal during the different march elements of the march series. For the case of word-organized RAMs, it was proven in [9] that a multiple-input shift register (MISR) whose characteristic polynomial is altered in a similar fashion could serve as response compactor for symmetric transparent BIST, resulting in a predetermined (all-zero) state. The concept of symmetric transparent BIST is analyzed and exemplified in Section II-C.
The work of Yarmolik et al., although revolutionary, requires modifying existing registers (or SISRs/MISRs) in order to serve as response evaluators, and requires complicated control logic in order to toggle between the two different polynomials during the application of the march series.
It is widely accepted by the test community that the utilization of modules that typically exist in the circuit, e.g., accumulators [10] or arithmetic logic units (ALUs) [11] , for BIST test pattern generation and/or response verification possesses advantages, such as lower hardware overhead and elimination of the need for multiplexers in the circuit path; furthermore, the modules are exercised, therefore, faults existing in them can be discovered [12] .
In this paper, we propose the use of accumulator-based compaction in symmetric transparent RAM BIST (ASTRA). In modules that contain accumulators, the output of the RAM is either directly driven to the accumulator inputs or can be driven using processor instructions. It is shown that the proposed scheme imposes lower hardware overhead
