Retention registers are utilized in power gating design to hold design state during power down and to allow safe and fast system reactivation. Since a retention register consumes more power and costs more area than a non-retention register, it is desirable to minimize the use of retention registers. However, relaxing retention requirement to a minimal subset of registers can be computationally challenging. In this paper, we adopt satisfiability solving for scalable selection of registers whose retention is unnecessary and exploit input sequence constraints to increase the number of non-retention registers. Empirical results on industrial benchmarks show that our proposed methods are efficient and effective in identifying non-retention registers.
http://dx.doi.org/10.1145/2744769.2744905.
registers that are specially designed to have low leakage and can retain their values in sleep mode. By replacing all registers in the block with retention registers, the full block state can be quickly restored when the block is reactivated; this method is called full retention. However, full retention may require more power than necessary because some registers may not need retention if their values will always be updated before they are first read after power up. Therefore, it is desirable to retain only a subset of all registers. This design methodology is called partial retention.
In [15], Tian et al. reported that partial retention (19% non-retention) reduced leakage power consumption by 10% even though the new version of the design has 16% more registers. In their flow, designers need to manually inspect the design to select registers that do not need retention, which is time-consuming and error-prone. To address this problem, we present a flow to select non-retention registers automatically.
There have been several prior efforts on retention synthesis. In [7], Darbari et al. presented a case study using symbolic simulation to assist designers in implementing selective retention correctly. The main finding is that the programmer-visible state, or architectural state, needs to be retained, while other micro-architectural enhancements such as pipeline registers, TLBs and caches can be implemented using normal registers without retention. This work provides a useful guideline for selective state retention implementation, but no automated methods are proposed. In [3], Chen et al. proposed the concept and usage of multi-bit retention registers, which can reduce leakage power and area considerably. However, additional circuitry and clock cycles are needed for mode transition, and such registers are not compatible with most design methods. Greenberg et al. recently proposed methods [9, 10] for automatically classifying all the flip-flops (FFs) of a given design into two categories: essential FF (e-FF) and redundant FF (r-FF). The method in [10] performs the analysis according to read/write criteria on the synthesized gate-level netlist with Binary Decision Diagrams (BDDs) as the data structure, while [9] represents the criteria as a set of formal properties using propositional formulas and then uses common formal verification tools to drive the identification of the e-FFs. Even though the proposed approaches are formally sound, they are difficult for non-formal experts to apply because of the effort required to provide proper input constraints using BDDs. In addition, they need massively parallel processing to achieve reasonable runtime, leaving scalability a major concern. Compared with [9, 10], our proposed method verifies the power intention and retention scheme in the early phase of the design flow by symbolic simulation [4, 5], which allows us to handle not only gate-level but also RTL designs without logic synthesis.
Performing the analysis at the RTL instead of the gate level makes it easier for the designer to inspect the results. Another advantage of symbolic simulation is that the SAT instance is constructed with the state variables retained, which makes our algorithm easy to implement and can reduce the effort to analyze additional non-essential FFs introduced during logic synthesis.
In contrast to existing literature, our proposed algorithm [2] uses real test sequences that most designers are familiar with to perform the analysis, making it considerably easier to use compared with formal methods. Our algorithm utilizes high-level symbolic simulation on the RTL design and relies on the Conjunctive Normal Form (CNF) as the formula representation. It can automatically identify the set of non-retention registers during typical VLSI design flow efficiently and effectively.
The rest of the paper is organized as follows. Section 2 provides some preliminaries. Section 3 presents the problem statement and the detailed algorithm of retention synthesis. Finally, experimental results and conclusions are shown in Section 4 and Section 5, respectively.
PRELIMINARIES
As conventional notation, the symbols ¬, ∧, ∨ and ⇒ stand for the logical connectives negation, conjunction, disjunction, and implication, respectively. The cardinality of a set $S$ is denoted as $|S|$. Let $V = \{v_1, \ldots, v_k\}$ be a finite set of Boolean variables. A literal $l$ is either a Boolean variable $v_i$ or its negation $\neg v_i$. A clause $C$ is a disjunction of literals. A conjunction of clauses is in the so-called CNF. In the sequel, a clause set $C = \{C_1, \ldots, C_k\}$ shall mean the CNF formula $C_1 \wedge \ldots \wedge C_k$. A total (partial) assignment $\sigma$ over $V$ gives every (some specific) variable $v_i$ a Boolean value of either 0 or 1. A partial assignment corresponds to a cube; a total assignment corresponds to a minterm. A CNF formula is satisfiable if there exists a satisfying assignment under which the formula evaluates to 1; otherwise it is unsatisfiable.
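The definitions above can be made concrete with a minimal sketch. It assumes a DIMACS-style integer encoding of literals, which is a convention chosen here for illustration and not something the paper prescribes, and it checks satisfiability by plain enumeration rather than by a SAT solver.

```python
from itertools import product

# Assumed encoding: literal +i means v_i, literal -i means NOT v_i.
# A clause is a list of literals; a CNF formula is a list of clauses.

def evaluate(cnf, assignment):
    """Return True if the total assignment (dict var -> bool) satisfies every clause."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

def is_satisfiable(cnf, num_vars):
    """Brute-force check over all 2**num_vars total assignments (minterms)."""
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: v for i, v in enumerate(values)}
        if evaluate(cnf, assignment):
            return True
    return False

sat_example = [[1, -2], [2, 3]]   # (v1 OR NOT v2) AND (v2 OR v3)
unsat_example = [[1], [-1]]       # v1 AND NOT v1
```

Enumeration is exponential in the number of variables, so this is only useful for checking the definitions on tiny formulas.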
SAT Solving and Incremental SAT
We assume the reader's familiarity with Boolean satisfiability (SAT) solving [13, 8] and the conversion from circuit to CNF formula [16]. A more detailed exposition can be found, e.g., in [11]. Assumptions-based incremental SAT searches for an assignment that satisfies the current set of clauses under the unit assumptions $assumps = \bigwedge_{i=0}^{n} a_i$. If such an assignment exists, the assignment that satisfies all the clauses and the unit literals $a_i$ is returned. If the problem is UNSAT under the given assumptions, the subset of those assumptions used in the proof of UNSAT is returned in the form of a final conflict clause.
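The interface of assumptions-based solving can be illustrated with a small sketch. It is not how MiniSat works internally: the search is brute force, and the failed assumptions are found by a simple deletion loop instead of being read off the UNSAT proof. The function names `solve` and `failed_assumptions` are hypothetical.

```python
from itertools import product

def solve(cnf, num_vars, assumptions=()):
    """Return a model (dict var -> bool) that satisfies cnf and every assumed literal, or None."""
    for values in product([False, True], repeat=num_vars):
        model = {i + 1: v for i, v in enumerate(values)}
        if any(model[abs(a)] != (a > 0) for a in assumptions):
            continue  # this assignment contradicts a unit assumption
        if all(any(model[abs(l)] == (l > 0) for l in clause) for clause in cnf):
            return model
    return None

def failed_assumptions(cnf, num_vars, assumptions):
    """Given assumptions under which cnf is UNSAT, drop each literal that is not needed."""
    core = list(assumptions)
    for lit in list(assumptions):
        trial = [a for a in core if a != lit]
        if solve(cnf, num_vars, trial) is None:
            core = trial  # still UNSAT without lit, so lit is not responsible
    return core

# "v1 and v2 cannot both be true"; assuming v1, v2 and v3 is UNSAT because of v1 and v2.
cnf = [[-1, -2]]
core = failed_assumptions(cnf, 3, [1, 2, 3])
```

The clause set is never rebuilt between calls; only the assumption list changes, which is the property the incremental approach relies on.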
Retention Register and Partial Retention
A Retention Register (RR) in general has a control, called RET in this paper, that enables the register to retain a state. When RET is low (sample mode), the register works like a normal register (without retention). When RET is set high (hold mode), the register retains the state that it kept just before RET was set high. Retention registers are used in designs that require fast resumption of operation after wakeup to preserve states in powered-down blocks. However, a retention register occupies a larger area than a normal register and consumes power in sleep mode. The larger area also increases wire length, which can further degrade design performance.
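A behavioral sketch of the two register types may help fix the sample and hold modes described above. The classes and the `power_cycle` method are hypothetical modeling choices, with `None` standing for an unknown value after power is removed; they do not describe any particular cell library.

```python
class Register:
    """Plain register: its value becomes unknown (None) when power is removed."""

    def __init__(self):
        self.q = None

    def clock(self, d):
        self.q = d

    def power_cycle(self):
        self.q = None


class RetentionRegister(Register):
    """Register with a RET control: sample mode when RET is low, hold mode when high."""

    def __init__(self):
        super().__init__()
        self.ret = False

    def clock(self, d):
        if not self.ret:      # sample mode: behaves like a normal register
            self.q = d
        # hold mode: the input is ignored and the stored value is kept

    def power_cycle(self):
        if not self.ret:      # without RET asserted, nothing protects the value
            self.q = None


plain, retained = Register(), RetentionRegister()
plain.clock(1)
retained.clock(1)
retained.ret = True           # enter hold mode before powering down
plain.power_cycle()
retained.power_cycle()
```

After the power cycle, `plain.q` is unknown while `retained.q` still holds 1, which is the difference the partial retention analysis exploits.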
State Transition Systems
A state transition system consists of a state transition relation $T(\vec{x}, \vec{s}, \vec{y}, \vec{s}')$ and a set $I(\vec{s})$ of initial states, where $\vec{s}$, $\vec{s}'$, $\vec{x}$, and $\vec{y}$ are referred to as the current-state variables, next-state variables, input variables, and output variables, respectively. In the sequel, state sets are represented with characteristic functions. We shall not distinguish between a characteristic function and the set that it represents. For a deterministic system, as we shall assume, the transition relation $T(\vec{x}, \vec{s}, \vec{y}, \vec{s}')$ can be alternatively treated as the transition function $T : (\vec{x}, \vec{s}) \mapsto (\vec{y}, \vec{s}')$.
A time-frame expansion of the state transition system $T(\vec{x}, \vec{s}, \vec{y}, \vec{s}')$ is the time unrolling of $T$ into multiple time-indexed copies, denoted $T^t = T(\vec{x}^t, \vec{s}^t, \vec{y}^t, \vec{s}^{t+1})$, the transition relation at time $t$. In contrast, in the sequel $T^* = T(\vec{x}^*, \vec{s}^*, \vec{y}^*, \vec{s}^{\prime *})$ denotes a renamed copy of $T$ with variables $\vec{x}$, $\vec{s}$, $\vec{y}$, and $\vec{s}'$ of $T$ substituted with fresh new variables $\vec{x}^*$, $\vec{s}^*$, $\vec{y}^*$, and $\vec{s}^{\prime *}$, respectively.
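For a deterministic system, time-frame expansion amounts to applying the transition function once per frame and feeding each next state into the following frame. The sketch below does this on concrete values; the 2-bit counter is a hypothetical design used only for illustration, whereas the paper unrolls symbolically.

```python
def unroll(T, s0, inputs):
    """Apply T once per time frame; return the outputs y^0..y^(p-1) and the last next state."""
    s, outputs = s0, []
    for x in inputs:
        y, s = T(x, s)   # frame t consumes x^t and s^t, produces y^t and s^(t+1)
        outputs.append(y)
    return outputs, s

def counter(x, s):
    """Hypothetical 2-bit counter: outputs the current count, increments when x is 1."""
    return s, (s + x) % 4

outputs, final_state = unroll(counter, 0, [1, 1, 0, 1])
```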
RETENTION SYNTHESIS
The main objective of retention synthesis is to obtain the set of non-Retention Registers (non-RRs) in a power-gating design. In this section, we propose two algorithms to recognize the non-RRs: one finds an optimal solution, and the other is a heuristic that is more efficient in both memory usage and runtime. The notation described in Section 2 will continue to be used.
Problem Statement
Given a state transition system with transition relation $T(\vec{x}, \vec{s}, \vec{y}, \vec{s}')$, initial states $I(\vec{s})$, power-up sequence $\sigma(\vec{x})$, and a power specification (typically in Unified Power Format (UPF)), the retention synthesis problem determines whether non-retention registers exist and, furthermore, how to synthesize a new power specification with fewer retention registers. The power-up sequence denotes the specific assignments over the primary inputs for several time-frames and is always applied after wakeup.
Existence of Partial Retention
Given a system starting from a known initial state and a power-up sequence, and assuming that design register $r_i$ ($r^*_i$) corresponds to state variable $s^0_i$ ($s^{*0}_i$) for $i$ ranging from 1 to the total number of registers, the following proposition states the necessary and sufficient condition for the system to admit partial retention.
Proposition 1. Given a system with transition relation $T(\vec{x}, \vec{s}, \vec{y}, \vec{s}')$, initial states $I(\vec{s})$, and power-up sequence $\sigma(\vec{x})$, let $\varphi_M^{(p)}$ be the miter formula of Formula (1), where the predicate "=" asserts the bit-wise equivalence of its two argument variable vectors and $p$ denotes the length of $\sigma(\vec{x})$. Assume that the initial state variable vectors $\vec{s}^0$ and $\vec{s}^{*0}$ are bit-wise equivalent and that $\vec{s}^{*0}$ can be divided into two subsets $s_r$ and $s_n$ such that $s_r \cup s_n = \vec{s}^{*0}$ and $s_r \cap s_n = \emptyset$. Then the registers $[r^*_i \mid s^{*0}_i \in s_n]$, whose corresponding state variables belong to $s_n$, can be non-retention registers if and only if Formula (2) is unsatisfiable (UNSAT).
In the sequel, we shall call Formula (1), $\varphi_M^{(p)}$, the miter formula, which is also shown in Figure 1. In Figure 1, $T(\vec{x}, \vec{s}, \vec{y}, \vec{s}')$ denotes the full retention system with $k$ registers and initial state $I(\vec{s}^0)$, while $T(\vec{x}^*, \vec{s}^*, \vec{y}^*, \vec{s}^{\prime *})$ denotes a partial retention system with $n$ non-RRs whose initial values are not defined. The other $m$ ($= k - n$) registers have initial values that are the same as the corresponding registers in $T$. $s_n$ and $s_r$ represent the initial state variables of the non-RRs and RRs, respectively. The output of the AND gate is asserted to be false.
Proof. ($\Rightarrow$) Assume Formula (2) is satisfiable; then either $(\vec{y}^{t} = \vec{y}^{*t})$ for some $t$ or $(\vec{s}^{p-1} = \vec{s}^{*p-1})$ is violated; namely, either a primary output or the final state responds differently in the full retention system $T$ and the partial retention system $T^*$. Hence the registers $[r^*_i \mid s^{*0}_i \in s_n]$ cannot be non-RRs. ($\Leftarrow$) Assume the registers whose corresponding state variables are in $s_n$ cannot be non-RRs; then, by definition, the unknown values affect a primary output or a final state variable, so at least one of $(\vec{y}^{t} = \vec{y}^{*t})$ or $(\vec{s}^{p-1} = \vec{s}^{*p-1})$ is violated, and Formula (2) is consequently satisfied.
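The condition in Proposition 1 can be illustrated by exhaustive simulation on a toy design: a set of registers can be non-retention if no choice of their initial values changes any output or the final state under the power-up sequence. This is only a sketch of the idea; the paper poses the check as a SAT problem on a symbolic miter, and the design `toy`, its registers, and the sequence `power_up` are hypothetical.

```python
from itertools import product

def simulate(T, s0, inputs):
    """Run T over the input sequence; return (outputs, final state)."""
    s, ys = tuple(s0), []
    for x in inputs:
        y, s = T(x, s)
        ys.append(y)
    return ys, s

def can_be_non_retention(T, s0, inputs, candidates):
    """True iff no initial value of the candidate registers alters outputs or final state."""
    reference = simulate(T, s0, inputs)           # full retention system
    for values in product([0, 1], repeat=len(candidates)):
        s_star = list(s0)
        for idx, v in zip(candidates, values):    # free the candidate registers
            s_star[idx] = v
        if simulate(T, s_star, inputs) != reference:
            return False                          # a distinguishing value exists
    return True

def toy(x, s):
    """Hypothetical design with registers (mode, buf, acc) and inputs (load, din)."""
    load, din = x
    mode, buf, acc = s
    y = mode & acc
    next_buf = din if load else buf
    next_acc = acc if load else acc ^ buf
    return y, (mode, next_buf, next_acc)

power_up = [(1, 1), (0, 0), (0, 0)]   # the first cycle overwrites buf before it is read
s0 = (1, 0, 0)
```

In this toy, `buf` (index 1) is written before it is first read, so it can lose its value, whereas `mode` and `acc` influence the final state or an output and must be retained.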
Finding Non-Retention Register
To model the status of registers, additional control variables $\vec{c}^*$ are added such that the following relation holds for every legal $i$: when $c^*_i$ is equal to 1, the initial state $s^{*0}_i$ of register $r^*_i$ is not restricted; otherwise, $s^{*0}_i$ is restricted to the initial state of $r_i$ in the full retention system. In other words, register $r^*_i$ is a non-RR if and only if $c^*_i$ is equal to 1. The value of a control variable thus models the retention status of the corresponding register.
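One way to write this relation is given below. It is reconstructed from the prose description only, so its exact form and the index range $1 \ldots k$ are assumptions rather than the original formula.

```latex
\bigwedge_{i=1}^{k} \Big( \neg c^*_i \Rightarrow \big( s^{*0}_i = s^{0}_i \big) \Big)
```

With $c^*_i = 1$ the implication is vacuously true and $s^{*0}_i$ is left free, which matches the non-retention case.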
We use Formula (5), $\varphi_A(\vec{c}^*)$, which is the conjunction of the literals $lit(c^*_i)$ of all control variables, as the unit assumptions when verifying the existence of a partial retention system. $\varphi_A(\vec{c}^*)$ can be regarded as an assignment over $\vec{c}^*$ and represents a candidate register status. A system with assignment $\varphi_A(\vec{c}^*)$ can be a partial retention system if not all the control variables are equal to zero.
Formula (6) is UNSAT if and only if the given system can accept as non-RRs the set of registers $r_i$ whose corresponding control variables $c^*_i$ are positive in the unit assumptions $\varphi_A(\vec{c}^*)$, i.e., $lit(c^*_i) = 1$. All the unit clauses in $\varphi_A(\vec{c}^*)$ must be satisfied when solving instance (6). By changing the assignment of $lit(c^*_i)$, we can model and verify different choices of non-RRs without reconstructing the miter formula; therefore, the clauses learned from previous SAT solving can be reused. When (6) is UNSAT under a specific assumption, the assumptions-based conflict core identifies the unit clauses responsible for the conflict. The other variables in $\varphi_A(\vec{c}^*)$ are assumed to be positive (non-retention) in subsequent SAT solving.
The pseudo-code of the algorithm is presented in Fig. 2. Function Initialize($\Psi_c$, $\vec{c}^*$, $s_n$) constructs $\Psi_c$ by high-level symbolic simulation [4]. It also sets the initial value of $\vec{c}^*$ to all zeros and $s_n$ to the empty set. $\varphi_{cc}$ denotes the returned conflict core, which comprises a subset of the unit assumptions in $\vec{c}^*$. Function Update($\varphi_A(\vec{c}^*)$) chooses a register that has not been verified yet as a non-RR candidate and modifies $\varphi_A(\vec{c}^*)$ accordingly for incremental SAT solving. The procedure is repeated until all the registers have been verified. This algorithm takes advantage of assumptions-based incremental SAT solving, which reuses the learned clauses, and of conflict core analysis, which identifies the cause of UNSAT, to find the result effectively and efficiently.
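The candidate-by-candidate loop described above can be sketched as follows. This is a simplified reading, not a reproduction of Fig. 2: the SAT call is abstracted into an oracle `is_unsat`, the conflict-core bookkeeping is omitted, and the toy oracle with its family `MAXIMAL_OK` of acceptable register sets is hypothetical.

```python
def greedy_non_retention(num_registers, is_unsat):
    """Try each register once; keep it as non-RR if the check stays UNSAT."""
    non_rr = set()
    for i in range(num_registers):
        candidate = non_rr | {i}      # set c*_i = 1 on top of the accepted registers
        if is_unsat(candidate):       # no behavioral difference can be observed
            non_rr = candidate
    return non_rr

# Hypothetical oracle: a set is acceptable iff it is contained in one of these sets.
MAXIMAL_OK = [{0}, {1, 2}]

def toy_is_unsat(non_rr):
    return any(non_rr <= ok for ok in MAXIMAL_OK)

heuristic_result = greedy_non_retention(3, toy_is_unsat)
```

On this toy oracle the loop accepts register 0 first and then rejects registers 1 and 2, although `{1, 2}` would be a larger valid set; this illustrates why the result of such a loop depends on the visiting order and need not be maximal.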
Finding Maximum Number of Non-RRs
The above algorithm is heuristic and can find a set of non-RRs efficiently. On the basis of the result obtained from the above algorithm, we can find the largest (optimal) set of non-RRs by using cardinality constraints [14], implemented by an adder network that performs the bit-wise addition of the elements in $\vec{c}^*$ and outputs the sum $m$. Consider the cardinality constraints $\varphi_C(\vec{c}^*, m)$, which restrict the number of control variables that can be set to 1 simultaneously to $m$. The formula $F_{opt}$ of (7) can then be applied to find the optimal solution. According to Proposition 1, there are at least $m$ legal non-RRs if there exists an assignment that satisfies $\varphi_C(\vec{c}^*, m)$ and under which $\Psi_c$ is unsatisfiable. To find such assignments, we repeatedly compute satisfying assignments of (7) and add blocking clauses to $F_{opt}$, thus preventing these assignments from being enumerated again, until $F_{opt}$ becomes UNSAT. This is an instance of the All-SAT problem [12, 17], which enumerates all satisfying assignments of a propositional logic formula. Previous work [17] indicates that existing solutions to the All-SAT problem are likely to perform unnecessary work because they produce solutions that comprise pairwise disjoint cubes instead of overlapping partial assignments. Our proposed algorithm produces the blocking clauses from efficient partial assignments and thus notably reduces execution time and memory usage.
Algorithm
Fig. 3 shows the pseudo-code of our proposed algorithm for finding the maximum number of non-RRs. In the algorithm, $SS_m$ denotes the solution space that contains the different combinations of exactly $m$ registers that should be non-RRs, and $\varphi_{block}$ is the conjunction of blocking clauses. Function Initialize($\Psi_c$, $\varphi_C(\vec{c}^*, m)$, $\varphi_{block}$) is the procedure that constructs $\Psi_c$ and the cardinality constraint $\varphi_C(\vec{c}^*, m)$. Function solve($F$) returns a satisfying assignment to $F$ if one exists, or an empty set otherwise. Generalize($\varphi_{assign}$) transforms the satisfying assignment $\varphi_{assign}$ into the blocking clause $\varphi_{gen}$. In the naive transformation, $\varphi_{gen}$ is the complement of $\varphi_{assign}$: $\varphi_{gen} = \neg\varphi_{assign}$. Some techniques for efficiency enhancement are stated in Section 3.4.2. Function Update($\varphi_{block}$, $\varphi_{gen}$, $SS_m$) adds the blocking clause $\varphi_{gen}$ to $\varphi_{block}$ and excludes the solution space covered by $\varphi_{gen}$ from $SS_m$. The procedure is repeated until no assignment satisfies (7). If $SS_m$ is not empty, we confirm that at least $m$ non-RRs exist. The maximum value of $m$ for which an assignment satisfies $\varphi_C(\vec{c}^*, m)$ while $\Psi_c$ is unsatisfiable is the maximum number of non-RRs and thus the optimal solution.
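The enumeration described above can be sketched on sets of control variables. This is an interpretation under stated assumptions rather than the algorithm of Fig. 3: the SAT call is replaced by an oracle `fails` that reports whether a choice of non-RRs can be distinguished from full retention, the solution space is stored explicitly, and the toy family `MAXIMAL_OK` is hypothetical.

```python
from itertools import combinations

def generalize(failing, fails):
    """Probe each control variable and drop it while the remaining set still fails."""
    core = set(failing)
    for c in sorted(failing):
        if fails(core - {c}):
            core.discard(c)
    return frozenset(core)

def max_non_retention(k, fails):
    """Return a largest set of registers that can be non-RR, for k registers."""
    best = frozenset()
    for m in range(1, k + 1):
        # SS_m: all combinations of exactly m registers.
        solution_space = {frozenset(c) for c in combinations(range(k), m)}
        while True:
            bad = next((s for s in solution_space if fails(s)), None)
            if bad is None:
                break
            core = generalize(bad, fails)          # shorter blocking clause
            solution_space = {s for s in solution_space if not core <= s}
        if solution_space:                          # at least m non-RRs exist
            best = min(solution_space, key=sorted)
    return best

MAXIMAL_OK = [{0}, {1, 2}]   # hypothetical family of acceptable register sets

def toy_fails(non_rr):
    return not any(non_rr <= ok for ok in MAXIMAL_OK)

optimal = max_non_retention(3, toy_fails)
```

On the toy oracle the result is `{1, 2}`, one register more than a greedy pass in index order would keep. Blocking a generalized core removes every combination that contains it, so fewer enumeration steps are needed than when each failing combination is blocked individually.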
The desired assignment over c * can be obtained by solving the following formula,
Efficiency Enhancement
Since the number of satisfying assignments of a Boolean formula can be exponential with respect to the size of the formula, the above procedure suffers from memory explosion. According to [17], blocking clauses that do not contain implied variables are extremely short and very beneficial for solver performance. The SAT solving result of our instance depends on the retention status of the registers; in addition, the control variables $\vec{c}^*$ imply the other variables. Hence we only need to consider blocking clauses that comprise the control variables. Therefore, instead of the naive technique that constructs blocking clauses from minterms directly, we propose two new methods: partial and generalized. The partial technique constructs the clauses only from the control variables, while the generalized technique probes each control variable to see whether it can be removed from the blocking clause generated by the partial technique. The procedure Generalize() uses the generalized technique, which extracts the partial assignment over the control variables and complements it to construct a shorter blocking clause. An empirical comparison among these techniques is provided in Section 4.2.
Fig. 3 shows that the cardinality $m$ is increased iteratively. However, in practice the solution space $SS_m$ grows sharply as $m$ approaches $m_{mid}$, half the number of control variables. To avoid getting stuck in the for-loop because of the immense search space, which comprises an enormous number of satisfying assignments, we use a heuristic that chooses $m$ back and forth: cardinalities closer to $m_{mid}$ are chosen later. Thus the iterations with higher time complexity are not executed unless the easier ones can be completed.
Even with the above performance enhancements, the optimal algorithm is still considerably more computationally intensive than the proposed heuristic algorithm. This is because the solution space in the optimal algorithm increases exponentially with the number of design registers, which requires a very large number of SAT calls and leads to unacceptable runtime on industrial designs. From our empirical results we observe that the proposed heuristic algorithm can obtain good results within reasonable time; therefore, we suggest using the heuristic algorithm for industrial designs. A comparison between the two proposed algorithms is shown in Section 4.
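The back-and-forth choice of $m$ can be sketched as an ordering of the cardinalities by their distance from the midpoint. The exact schedule is not given in the text, so the tie-breaking rule below is an assumption; the binomial coefficients show why the middle cardinalities are the expensive ones.

```python
from math import comb

def cardinality_order(k):
    """Order m = 1..k so that values closer to k/2 come later (ties: smaller m first)."""
    mid = k / 2
    return sorted(range(1, k + 1), key=lambda m: (-abs(m - mid), m))

order = cardinality_order(6)
space_sizes = [comb(6, m) for m in order]   # number of m-register combinations
```

For six control variables the order is 6, 1, 5, 2, 4, 3, and the corresponding solution spaces contain 1, 6, 6, 15, 15 and 20 combinations.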
Scalability Enhancement
To improve the performance of our algorithm on long sequences and large designs, we use "temporal partitioning" [6]. The long sequence $\sigma(\vec{x})$ is divided into shorter intervals, which are analyzed separately. We run our algorithms on the intervals iteratively with the following criteria. First, a register requires retention if its status is retention in any interval. Second, if the register is overwritten before its status becomes retention, it does not need retention. Last, the status of the other registers is inconclusive. To determine the status of inconclusive registers, a different trace that exercises the related logic needs to be analyzed. In this manner, the complexity of symbolic simulation and SAT solving can be reduced, thus improving runtime and memory usage. Another improvement is "spatial partitioning", which divides the design into smaller blocks to handle large designs.
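The three criteria can be sketched as a merge over per-interval results. The sketch assumes one particular reading, namely that the first decisive status in time order determines the outcome; the status labels and register names are hypothetical.

```python
def merge_interval_status(per_interval):
    """Merge per-interval statuses, given in time order, into a final classification.

    Each interval maps a register name to 'retention', 'overwritten' or 'untouched'.
    """
    decided, seen = {}, set()
    for status in per_interval:
        for reg, s in status.items():
            seen.add(reg)
            if reg in decided or s == "untouched":
                continue                      # already settled, or nothing learned
            decided[reg] = "retention" if s == "retention" else "non-retention"
    return {reg: decided.get(reg, "inconclusive") for reg in sorted(seen)}

intervals = [
    {"a": "retention", "b": "untouched", "c": "untouched", "d": "overwritten"},
    {"a": "untouched", "b": "overwritten", "c": "untouched", "d": "retention"},
]
merged = merge_interval_status(intervals)
```

Register `d` is overwritten in the first interval, so its later use reads the new value and it stays non-retention, while `c` is never exercised and remains inconclusive.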
Discussions
Our analysis is based on input sequences, and the results are correct only for the analyzed sequence. If multiple power-up sequences exist, they all need to be analyzed, and the non-retention registers are the intersection of the non-retention registers identified for the different sequences. If power-up sequences may vary, then the returned analysis is a suggestion to the designer, and further verification is required. Therefore, the proposed algorithms are meant to be used as an analysis tool for designers instead of a verification tool. Compared with purely formal methods based on constraints, our proposed solution is easier to use and more easily adopted in industrial simulation-oriented verification flows. Our empirical results show that the analysis results are highly accurate even if only one sequence is analyzed, which can considerably reduce designers' efforts in identifying non-retention registers. For future work, a partial Max-SAT solver may offer advantages in finding optimal solutions. In addition, performing structural analysis to guide our selection of candidates may also be useful.
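Combining the results of several power-up sequences is a plain set intersection, as the short sketch below shows. The register names are hypothetical.

```python
def common_non_retention(per_sequence_results):
    """Keep only registers reported as non-retention for every analyzed sequence."""
    results = [set(r) for r in per_sequence_results]
    return set.intersection(*results) if results else set()

safe = common_non_retention([
    {"buf", "tmp", "pipe"},     # sequence 1
    {"buf", "pipe"},            # sequence 2
    {"pipe", "buf", "cache"},   # sequence 3
])
```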
EXPERIMENTAL RESULTS
The proposed algorithms were programmed in the C language within a commercial symbolic simulator called Insight [1]. MiniSat [8] was selected as the SAT solver. The experiments were conducted on a Linux machine with a 2 GHz Xeon processor and 24 GB of RAM. The reported runtime includes symbolic simulation. In addition to internally created and public-domain benchmarks, we also report two case studies on industrial designs from our partner.
Optimal and Heuristic Algorithms
To compare the optimal and the heuristic algorithms, we created a set of simple designs and chose some public-domain designs: i2c and ecg from OpenCores, and s13207 and s15850 from the ISCAS benchmarks. The benchmark circuits are listed in Table 1, where the numbers of registers, primary inputs and primary outputs are also shown. Note that the numbers of registers shown in the fourth column are the numbers of word-level variables, not the total bits of registers. The runtime is constrained to 30 minutes. "CNC" represents "cannot complete": in this case the optimal algorithm cannot find the best solution. The symbolic simulator we used performs certain word-level optimizations before the problem is converted to bit-level Boolean expressions. Therefore, registers whose corresponding symbols are not involved in Ψ automatically become non-RRs. Hence the total number of non-RRs can be larger than that returned by the algorithms. In the table, "NON" reports the number of non-retention registers, "RET" shows the number of retention registers, and "Runtime" shows the runtime of the algorithms.
The result shows that even though the optimal algorithm can find the maximal set of non-RRs, its runtime is considerably longer than the heuristic algorithm due to its exhaustive nature and iterative scheme. According to Table 1 , the number of non-RRs found by the heuristic method is close to the optimal solution for most cases. Moreover, runtime of the heuristic algorithm is considerably shorter than the optimal algorithm. Therefore, we suggest using the heuristic algorithm for industrial designs.
Analysis of Optimal Algorithm
We use the same set of designs in the previous experiment to compare the enhancements used in the optimal algorithms, which are naive, partial and generalized, respectively. The numbers of blocking clauses and runtime of the three techniques are shown in Table 2 . The naive technique constructs blocking clauses from minterms, while the partial technique constructs the clauses from only the control variables. The generalized technique probes each control variable to see whether it can be removed from the blocking clause generated by partial technique. Blocking clauses with fewer literals are beneficial for finding all satisfying solutions. From Table 2 , we can observe that the number of blocking clauses needed to enumerate all the satisfying solutions and runtime are greatly reduced by the partial and generalized techniques.
Case Studies
We have applied the proposed algorithms to two designs from our industrial partner. In the setup, we make all clocked registers except clock gaters non-retention register candidates. The first design is a gate-level IO block with 5885 FFs. We analyzed a power-up sequence that is 1445 cycles long and identified 1569 non-retention FFs, 4146 retention FFs, and 170 inconclusive FFs. The designer inspected the results and found that all but five non-retention FFs are correct. Of the five in question, four were confirmed after detailed analysis to be safe as non-retention, and one is a control signal that is used before the start of our analysis window of the given trace. Because that signal is no longer needed when our analysis starts, our algorithm determined it to be non-retention.
The second design is a wireless communication block in RTL with 3021 word-level variables (21817 bits) and several clock domains. We analyzed a trace that is approximately 10K cycles long in 3h31m and found 1805 non-retention variables (12819 bits), 495 retention variables (3518 bits), and 718 inconclusive variables (5480 bits). The result suggests that 59% of design FFs do not need retention, which can considerably reduce leakage power when the block is powered down.
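The reported shares can be recomputed from the counts quoted above. The snippet is a small arithmetic check that uses only the FF and bit counts stated in the text; the variable names are chosen here for illustration.

```python
io_block_ffs = {"non_retention": 1569, "retention": 4146, "inconclusive": 170}
wireless_bits = {"non_retention": 12819, "retention": 3518, "inconclusive": 5480}

def share(counts, key):
    """Percentage of the total that falls into the given category."""
    return 100.0 * counts[key] / sum(counts.values())

io_total = sum(io_block_ffs.values())              # total FFs of the IO block
wireless_total = sum(wireless_bits.values())       # total bits of the wireless block
io_share = share(io_block_ffs, "non_retention")
wireless_share = share(wireless_bits, "non_retention")
```

The category counts add up to the stated totals of 5885 FFs and 21817 bits, and the 59% figure corresponds to the non-retention share measured in bits for the second design.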
CONCLUSIONS
Using partial instead of full retention in low-power designs not only reduces static power consumption but also improves design performance due to smaller area. In this paper we proposed an efficient algorithm to automatically identify non-retention registers with a unified problem formulation. We also proposed an optimal algorithm that can find the maximal set of non-retention registers. Several All-SAT techniques are explored for obtaining the optimal solution. Our experimental results show that the heuristic solution provides high-quality results that are on a par with the optimal results, and two case studies from our partner show that our techniques can identify non-retention registers effectively and efficiently in industrial designs.
