Latch-based designs have many benefits over their flip-flop based counterparts but have limited use, partially because most RTL specifications are flop-centric and automatic conversion of FF-based to latch-based designs is challenging. Conventional conversion algorithms target master-slave latch-based designs with two non-overlapping clocks. This paper presents a novel automated design flow that converts flip-flop to 3-phase latch-based designs. The resulting circuits have the same performance as the master-slave based designs but require significantly fewer latches. Our experimental results demonstrate the potential for savings in the number of latches (21.3%), area (5.8%), and power (16.3%) on a variety of ISCAS, CEP, and CPU benchmark circuits, compared to the master-slave conversions.
INTRODUCTION
The growing use of portable/wireless electronic systems and Internet-of-Things (IoT) applications motivates the desire for smaller and more energy-efficient designs in today's very large scale integration (VLSI) circuits. One of two devices, edge-triggered flip-flops (FFs) or level-sensitive latches, is typically used for synchronization and state storage. It is well known that latch-based designs can lead to lower power and area than FF-based designs due to time borrowing, smaller cell area, and lower capacitance [1] [2] [3], particularly when process variation is considered [4]. They are also critical for architecturally-agnostic timing resilient designs [5, 6], which can remove unnecessary margins associated with PVT variations and make near-threshold computing more practical.
As an intermediate between latch and flip-flop based designs, pulsed-latch schemes have also been proposed [7, 8]. These rely on an edge-triggered pulse generator to provide a short transparency window to all latches. To minimize energy overhead, multi-bit pulsed-latch schemes have been proposed that share pulse generators among several latch cells [9]. Pulsed-latches, however, must be used carefully because they are subject to hold problems and pulse width variations that are challenging to predict, control, and mitigate (see, e.g., [10]).
A basic challenge to adopting any form of latch-based design is that most RTL specifications are designed using edge-sensitive FFs. Approaches to automatically converting an FF-based to a latch-based design are thus attractive. Most conversion flows convert the FF-based designs into pulsed-latch designs [11] or two-phase latch-based designs controlled by either master-slave clocks [12] or bundled-data asynchronous controllers [6, 13-16].

* This work was partially supported by NSF Grant #1619415 and DARPA Contract #HR001119C0070. P. A. Beerel also consults for Galois, Inc. in the area of asynchronous design.
Optimization of latch-based designs has also been given some attention in the literature. For example, [2] explores using a mix of master-slave latches and FFs/pulsed-latches. Others take advantage of time borrowing to boost performance and/or reduce area and power consumption [2, 12]. Moreover, retiming algorithms for timing-resilient latch-based designs have been developed that consider not only the number of latches required but also the amount of error-detecting logic needed [17].
Whereas two-phase designs are inherently more robust than pulsed-latch designs, we argue they can be overly restrictive and that multi-phase latch-based designs [18] can sometimes be an attractive alternative.
The key contribution of this paper is to demonstrate that an FF-based design can be automatically converted into a robust multi-phase design with fewer latches than a two-phase design. In particular, we convert an FF-based design to a 3-phase latch-based design using a novel Integer Linear Program (ILP) that minimizes the number of latches, followed by retiming to ensure no performance loss. Our experimental results show an overall average reduction in number of latches of 23% compared to the conventional master-slave designs on ISCAS89 circuits [19], CEP submodules [20], and three CPU designs (i.e., a 3-stage MIPS CPU Plasma [21], a RISC-V Rocket Core [22], and an ARM Cortex-M0 core [23]).

This paper is organized as follows. Section 2 introduces background on multi-phase latch-based designs. Section 3 describes the design constraints we adopt in our conversion algorithm and the area-performance tradeoffs they represent. Section 4 introduces our ILP-based conversion algorithm and Section 5 presents the experimental results based on a broad range of designs. Finally, some conclusions are drawn in Section 6.
(GSTC). The phases (p1, p2, ..., pk) are ordered in a global time reference: ei−1 ≤ ei; ek = Tc, where ei is the closing time of phase pi. Eij is the forward phase shift from phase pi to phase pj defined below.
Then, the worst-case setup and hold constraints for each phase are defined as follows.
Here, Hi and Si stand for the hold and setup times of the i-th latch. The shortest (longest) path delay from the j-th latch to the i-th latch is denoted as δji (∆ji) and the minimal (maximal) delay of the j-th latch is δj (∆j). dj (Dj) represents the earliest (latest) signal departure time, i.e., the amount of time after the last ej that the next data starts to propagate through the j-th latch [18]. Tc denotes the cycle time, and in this paper we assume all clock phases share the same high pulse width Tp.
LATCH-BASED DESIGNS
This paper's goal is to convert an FF-based to latch-based design minimizing the number of latches based on a reasonable set of constraints. This section explores the implicit trade-offs associated with these constraints and motivates our three-phase clocking approach.
Minimal Constraints
There are two constraints we adopt that are designed to make the application of latch-based designs easier.
C1: the original position of all FFs must be latched;
C2: neighboring latches, connected by combinational logic, must not be simultaneously transparent.
Constraint C1 is designed to make logical equivalence checking between the latch-based and FF-based designs easier. In particular, we will convert every FF to a latch and only add extra latches where necessary to meet these constraints. During logical equivalence checking the fixed latches can be viewed as FFs and the extra latches can be treated as transparent. Ensuring latches are present at the same position as the original FFs also guarantees the ability to reset the circuit in the same state [24].
Constraint C2 is designed to avoid min-delay problems. In particular, even with min delay paths equal to 0 (δi = δij = 0) the hold constraint is satisfied with zero hold times (Hi = 0).
This constraint is particularly important when considering an FF with combinational feedback. If no extra latch is added during conversion, the converted circuit would have a single latch i with combinational feedback which violates C2. This configuration is dangerous because the transparency phase of the latch must be smaller than the minimum delay of the combinational feedback δii to avoid a hold violation. More precisely, the constraint can be formalized as:
The key point is that this constraint guarantees this configuration is not allowed. In particular, any solution that satisfies this constraint will break such combinational feedback by at least two latches that have non-overlapping clocks.
A well-known but non-optimal solution to this problem is to convert every FF into two latches, a master and a slave latch, as in [2, 13], and retime the slave latches. This master-slave approach satisfies both constraints C1 and C2 but at the cost of doubling the number of sequential elements. That is, before retiming, the extra number of latches added is exactly equal to the number of FFs.
Special Case of Linear Pipelines
It is interesting to consider the special case of a linear pipeline because it has no FFs with combinational feedback that must be considered. Such a pipeline is illustrated in Figure 1(a) and its cycle time Tc is no shorter than ∆1 + ∆11 + S, where ∆1 represents the FF's clk-to-q delay, ∆11 represents the longest data-path delay, and S stands for the FF's setup time.
Such linear pipelines can be converted to a latch-based design adding no extra latches, where we clock alternating pipeline stages with alternating phases of a two-phase non-overlapping clock, as illustrated in Figure 1. This analysis highlights the fact that there is a trade-off between the number of extra latches added and the performance of the resulting circuit. To avoid this trivial solution in our formulation, we adopt a third constraint:
C3: the converted latch-based design must have the same throughput as the FF-based design assuming the combinational logic is already critical.
We can achieve a latch-based design that meets all Constraints C1-C3 in which we add exactly one extra latch stage for every other original pipeline stage using 3-phase clocking, as illustrated in Figure 1(c). Notice that, as desired, this solution has the same throughput as the original pipeline, having phases p1 and p3 open and close their respective latches at the rising edge of the FF-based clock. We rely on the p3 latches' time borrowing to properly capture near-critical combinational paths. The p2 latches inserted between the p3 and p1 latches prevent data latched by p3 from violating the hold times of the subsequent p1 latches.
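The phase pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, stages are numbered from 0, and the first stage is assumed to be clocked by p1. It shows that every other original stage receives one extra p2 latch and that no p3 latch directly drives a p1 latch.

```python
def three_phase_linear_pipeline(num_stages):
    """Return the latch sequence for a linear pipeline with num_stages FF stages.

    Each entry is (kind, phase): kind is "orig" for a latch that replaces an
    original FF (Constraint C1) and "extra" for an inserted latch.
    Assumption for this sketch: stage 0 is clocked by p1.
    """
    sequence = []
    for stage in range(num_stages):
        if stage % 2 == 0:
            # Even stages: the FF becomes a single latch clocked by p1.
            sequence.append(("orig", "p1"))
        else:
            # Odd stages: the FF becomes a p3 latch followed by an extra p2
            # latch that separates it from the next p1 latch.
            sequence.append(("orig", "p3"))
            sequence.append(("extra", "p2"))
    return sequence


def count_extra(sequence):
    """Number of latches that were added on top of the original FF positions."""
    return sum(1 for kind, _ in sequence if kind == "extra")


pipeline = three_phase_linear_pipeline(4)
phases = [phase for _, phase in pipeline]
print(phases, "extra latches:", count_extra(pipeline))
```

For a four-stage pipeline the sketch adds two extra latches, compared with the four that a master-slave conversion adds before retiming.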
Optimality
A natural question to ask is if 3-phase clocking guarantees optimality in terms of the number of required extra latches. This section proves that it is optimal for linear pipelines but does not guarantee optimality for more general non-linear pipelines.
Theorem I: At least one latch stage has to be inserted between any 3 consecutive stages of a linear pipeline.
Proof by contradiction: Assume there exist three consecutive stages of a linear pipeline for which no extra latch stage is inserted within the combinational logic between stage 1 and stage 2 or between stage 2 and stage 3.
Let time 0 represent the rising edge of the stage 1 clock. According to Constraints C2 and C3, stage 2 clock can only go high during the time window (Tp, Tc − Tp) and must go low no later than Tc.
Case 1: Assume stage 1 data is valid at time 0. Since there is no latch between stage 1 and stage 2, stage 2 clock captures data no earlier than Tc. Then stage 2 clock should be high during the time period (Tc − Tp, Tc). According to Constraints C2 and C3, stage 3 can only go low during the period (Tc + Tp, 2Tc − Tp). This means that stage 3 has to capture data before time 2Tc − Tp. Because there is no extra latch inserted between stage 2 and stage 3, stage 3 must capture the data no earlier than 2Tc. This, however, contradicts the fact that stage 3 must go low before 2Tc −Tp.
Case 2: Assume the data leaves stage 1 at time t (0 < t ≤ Tp). Then stage 2 needs to sample the data no earlier than time t + Tc. This contradicts the fact that stage 2 goes low no later than Tc.
Next, we present Figure 2, which illustrates an example in which 4-phase clocking is needed to achieve an optimal latch configuration. In particular, Figure 2(a) illustrates an original FF-based design where the combinational connections are abstracted to wires for simplicity. The optimal 3-phase clocking solution requires at least four extra latches, labeled "2" in Figure 2(b). However, 4-phase clocking yields a latch-based design that requires only three extra latches (labeled 2 and 4 in Figure 2(c)).
To the best of our knowledge, it is an open question as to whether there are optimal latch-based designs that require

Despite this example, the remainder of this paper presents a conversion algorithm that produces three-phase latch-based designs. The algorithm is thus not guaranteed to be optimal because it does not support more than three clock phases. It is also not optimal because it considers the restrictive case of adding extra latches only directly after required latches. More specifically, we rely on retiming of these extra latches to position them within the combinational logic and satisfy constraints C1-C3. The separation of these two steps can lead to non-optimal results. Extending our algorithm to support four or more phases and additional latch locations is more complex and is interesting on-going research.
CONVERSION ALGORITHM
Our conversion approach is to automatically decompose the FFs into two groups: ones that will be converted to back-to-back connected latches and ones that will be converted into a single latch. The group of FFs converted to a single latch are assigned to clock phase p1. The remaining FFs are converted to latches clocked by either p1 or p3. For this group, an additional latch clocked by p2 is inserted at each latch's output to create a back-to-back configuration. This means that, by construction, there is no direct data path from p3 to p1 latches. Min-delay related hold problems are avoided by allowing an FF to be assigned to phase p1 and converted to a single latch only if none of its fanout FFs are also assigned to p1.
Integer Linear Programming (ILP)
Each FF is treated as a node u, and FO(u) denotes the set of FFs that can be reached from FF u via only combinational logic. Every node u has two binary variables, G(u) and K(u). G(u) decides which group of latches node u is assigned to, either the back-to-back latch group (G(u) = 1) or the single-latch group (G(u) = 0). K(u) determines node u's clock phase: 1 implies u is clocked by p1 and 0 implies u is clocked by p3. All inserted latches are driven by p2. Our ILP automatically performs this assignment, minimizing the number of back-to-back latches, as follows:
Subject to:
Here PI stands for the set of all primary input ports and set V contains all nodes in the circuit. To provide consistency to the interface of the design, we assign all primary input ports (PIs) as if they were clocked by p1.
To make the ILP compatible with Gurobi [25], we convert the conditional equations into inequalities:
The first constraint forces G(u) ≥ 1 whenever K(u) = 0, which corresponds to the first condition in (3). The second constraint ensures G(u) = 1 if K(u) and the K(v) of any of its fanout nodes v are both 1, rephrasing the second condition in (3). Applying the assumption that all PIs are clocked by p1 to the second constraint above, we obtain the third inequality.
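To make the assignment problem concrete, the following Python sketch solves it by brute force on tiny examples instead of calling an ILP solver. It rests on several assumptions: the two inequalities checked in `satisfies_inequalities`, G(u) + K(u) ≥ 1 and G(u) ≥ K(u) + K(v) − 1 for every v in FO(u), are one plausible linearization consistent with the prose description and are not copied from the paper; the handling of primary inputs is omitted; and all function names and the example graphs are hypothetical.

```python
from itertools import product


def derive_groups(K, fanout):
    """Smallest G allowed by the two conditions for a fixed phase choice K."""
    G = {}
    for u in fanout:
        drives_p1 = any(K[v] == 1 for v in fanout[u])
        # Condition 1: a p3 latch (K = 0) always gets a p2 latch behind it.
        # Condition 2: a p1 latch that feeds another p1 latch also needs one.
        G[u] = 1 if K[u] == 0 or drives_p1 else 0
    return G


def satisfies_inequalities(K, G, fanout):
    """Check the assumed linear form of the two conditions."""
    for u in fanout:
        if G[u] + K[u] < 1:
            return False
        if any(G[u] < K[u] + K[v] - 1 for v in fanout[u]):
            return False
    return True


def min_back_to_back(fanout):
    """Enumerate every K and keep the assignment with the fewest G(u) = 1."""
    nodes = sorted(fanout)
    best = None
    for bits in product((0, 1), repeat=len(nodes)):
        K = dict(zip(nodes, bits))
        G = derive_groups(K, fanout)
        assert satisfies_inequalities(K, G, fanout)
        cost = sum(G.values())
        if best is None or cost < best[0]:
            best = (cost, K, G)
    return best


# Hypothetical examples: a four-stage linear pipeline and one FF with feedback.
pipeline = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
self_loop = {"x": {"x"}}
pipeline_cost, pipeline_K, pipeline_G = min_back_to_back(pipeline)
loop_cost, _, _ = min_back_to_back(self_loop)
print(pipeline_cost, loop_cost)
```

The enumeration is exponential in the number of FFs, so it is useful only for checking small cases by hand. In this sketch an FF with combinational feedback always lands in the back-to-back group, which matches the observation that such FFs limit the achievable savings.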
The Design Flow
The ILP described in the last section is the core step in a design flow that supports FF-based to 3-phase latch-based design conversion. The first step of our design flow is to run standard synchronous synthesis on the given FF-based RTL design. Here, we take care to enable clock gating to minimize the number of FFs with self-loops, which would otherwise unduly constrain the optimization problem. To be specific, the gated clock, shown in Figure 3(b), is set to be the preferred clock gating style, as compared to enabled clocks illustrated in Figure 3(a).
Using Python and TCL scripts that interface a leading commercial logic synthesis tool to the Gurobi Integer Linear Program solver [25], we then take the resulting FF-based design, identify the connections between FFs, and formulate the ILP described in Section 4.1. We run the ILP and, using the results, create the equivalent 3-phase latch-based synchronous design by defining the three-phase clocks and connecting them to their associated latches.
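As an illustration of the step that identifies connections between FFs, the sketch below computes the fanout set FO(u) for each FF by traversing combinational cells until another FF is reached. The netlist representation (a dictionary from each cell name to the cells it drives) and all names are assumptions made for this example; they do not reflect the interface of the synthesis tool used in the flow.

```python
def ff_fanout(netlist, flops):
    """Map each FF to the set of FFs reachable through combinational logic only.

    netlist: dict mapping a cell name to the list of cells it drives
             (hypothetical representation for this sketch).
    flops:   set of cell names that are FFs.
    """
    fanout = {}
    for ff in flops:
        reached = set()
        visited = set()
        stack = list(netlist.get(ff, ()))
        while stack:
            cell = stack.pop()
            if cell in flops:
                # Stop at sequential elements: they end the combinational path.
                reached.add(cell)
                continue
            if cell in visited:
                continue
            visited.add(cell)
            stack.extend(netlist.get(cell, ()))
        fanout[ff] = reached
    return fanout


# Hypothetical netlist: ff1 drives gate g1, which feeds ff2 and, via g2, ff1.
example_netlist = {"ff1": ["g1"], "g1": ["ff2", "g2"], "g2": ["ff1"], "ff2": []}
example_fanout = ff_fanout(example_netlist, {"ff1", "ff2"})
print(example_fanout)
```

Here ff1 appears in its own fanout set, which is how an FF with combinational feedback shows up in the graph handed to the ILP.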
For each latch that is clock gated, we trace the clock signal back through the clock gating logic and replace the clock with p1 or p3. In the case of latches belonging to the same clock gating register bank but driven by different clock phases, the clock gating logic is duplicated and connected to the two clock phases separately, as shown in Figure 4. We then retime the newly added latches, as described below. The last step in the design flow, left as future work in this paper, is the physical design step which includes implementation of the three-phase clock trees.
Modified Retiming
Retiming re-positions the added latches within the combinational logic, minimizing area while satisfying all latch constraints. Unfortunately, many commercial tools have limited support for retiming latches. They do, however, have well-optimized support for the retiming of FFs. Using this fact, [26] proposed to retime latches by mapping the problem to an FF-based retiming problem. Given a synthesized design with clock period Tc, they replace each FF with two FFs and retime the entire design with a faster clock constraint of half the original period (Tc/2). After splitting the combinational logic, the FFs are converted into alternating transparent low and high latches.
In this paper, instead of halving the cycle, we keep the cycle time unchanged but use back-to-back FFs, where the first FF is controlled by clk and the second clocked by clk inverted (clkbar). The group that is converted to a single latch is replaced with a single FF, also controlled by clk. The 3-phase clocks are mapped to clk and clkbar as shown in Figure 5. Phases p1 and p3 are mapped to clk and p2 is tied to clkbar. We then retime the circuit only allowing FFs tied to clkbar to move. This splits the combinational logic in the pipeline stages that require an extra latch into two, with each part being able to operate at twice the frequency (cycle time Tc/2).
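The mapping from the ILP result to the FF-based retiming proxy can be sketched as follows. The record layout, the `_p2` naming suffix, and the function name are hypothetical; the sketch only captures the rules stated above: p1 and p3 latches become fixed FFs on clk, and each inserted p2 latch becomes a movable FF on clkbar.

```python
def to_retiming_proxy(K, G):
    """Build the FF-based proxy netlist entries used for retiming.

    K[u] = 1 means node u is clocked by p1, 0 means p3.
    G[u] = 1 means node u belongs to the back-to-back group.
    """
    proxies = []
    for u in sorted(K):
        phase = "p1" if K[u] == 1 else "p3"
        # Latches at the original FF positions stay fixed during retiming.
        proxies.append({"name": u, "clock": "clk", "phase": phase, "movable": False})
        if G[u] == 1:
            # The inserted p2 latch is the only element retiming may move.
            proxies.append(
                {"name": u + "_p2", "clock": "clkbar", "phase": "p2", "movable": True}
            )
    return proxies


# Hypothetical ILP result for two nodes.
example_proxies = to_retiming_proxy({"a": 1, "b": 0}, {"a": 0, "b": 1})
movable = [p["name"] for p in example_proxies if p["movable"]]
print(movable)
```

Keeping the designated phase on each record makes it straightforward to convert every proxy FF back to a latch on its own phase once retiming is done.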
After the relocation of FFs clocked by clkbar, all FFs can be converted back to latches with their designated 3-phase 
EXPERIMENTAL RESULTS
This section quantifies the benefits of the proposed conversion algorithm, comparing the resulting 3-phase design to the original FF-based as well as traditional master-slave latch-based designs. The experiments rely on an industrial 28-nm FDSOI CMOS cell library and a range of circuits that includes ISCAS89 benchmark circuits [19], CEP submodules [20], and three CPU designs: a 3-stage MIPS Open Core Plasma [21], a RISC-V Rocket Core [22], and an ARM-M0 core [23]. We validated both master-slave and 3-phase latch-based circuits by streaming inputs to the FF-based and latch-based designs and comparing the output streams.² These gate-level simulations were also used to determine the signal activity used to measure the relative power consumption of our approach. Note, however, that because our results are post-synthesis, our analysis does not consider the power consumption of the clock trees. All experiments were run on two Intel Xeon E5-2450 v2 CPUs with 128GB of RAM.
Note that for a fair comparison, all designs are run at the same frequency and the modified work-around retiming strategy described in Section 4 is also performed on the master-slave latch-based designs. Table 1 summarizes the number of registers (FFs/latches) in the original FF-based, conventional master-slave latch-based, and 3-phase latch-based designs. The rightmost two columns show the savings of our approach in terms of the number of latches in 3-phase latch-based designs compared to the doubled number of FFs in FF-based and the number of latches in master-slave latch-based designs, respectively.
The results show that the proposed algorithm reduces the number of latches by an average of 23.4% and 21.3% compared to FF-based and master-slave latch-based designs, respectively. Notice that the 3-phase algorithm has the least overall benefit on the ISCAS89 circuits and, in particular, no benefit on s1488 and s1423. According to [27], s1488 is re-synthesized from a controller, which may suggest that our algorithm brings limited benefits to control-dominated designs that have a predominance of FFs with combinational feedback.
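The two savings figures use different baselines, which the short sketch below makes explicit. The register counts are hypothetical placeholders chosen only to show the arithmetic; they are not taken from Table 1.

```python
def savings_percent(reference, proposed):
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


# Hypothetical counts for one design (not from the paper's tables).
num_ffs = 100              # FFs in the original design
ms_latches = 190           # latches in the master-slave design after retiming
three_phase_latches = 150  # latches in the 3-phase design after retiming

# Baseline 1: twice the number of FFs in the FF-based design.
vs_ff = savings_percent(2 * num_ffs, three_phase_latches)
# Baseline 2: the latch count of the master-slave design.
vs_ms = savings_percent(ms_latches, three_phase_latches)
print(round(vs_ff, 1), round(vs_ms, 1))
```

Because retiming can change the number of master-slave latches, the two baselines generally differ, which is why the two averages reported above are not identical.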
² For ISCAS designs we used auto-generated pseudo-random input streams. For CEP and CPU designs, we used the open-source provided testbenches. In particular, Plasma was running the "pi" program, ARM-M0 was running the "hello world" program, RISC-V was running the "rv32ui-vsimple" program, and CEP designs were running the open-source provided self-check programs.

Table 2 shows the areas of the combinational logic, the sequential logic, and the total for each benchmark for the FF, master-slave, and 3-phase designs. It also shows the percentage area reductions for the 3-phase designs when compared to both the FF-based and master-slave designs. According to the table, the 3-phase designs achieve an average of 8.4% and 5.8% savings in total area compared to FF-based and master-slave latch-based designs, respectively. Notice that the three CPU benchmarks show a relatively high area reduction over master-slave designs but a relatively low area saving compared to FF-based designs. This is a result of the fact that converting FF-based to latch-based designs sometimes increases the combinational logic area depending on the results of retiming. In particular, for the CPU designs, the average area of combinational logic increases by 10.2% and 3.4% for 3-phase compared to FF-based and master-slave latch-based designs, respectively. On the other hand, the area of the combinational logic changes less in the ISCAS and CEP designs. To be specific, the combinational logic area of the ISCAS and CEP 3-phase designs is increased by 3.5% and decreased by 4.6% with respect to FF-based designs and increased by an average of 1.6% and 2.3% over master-slave latch-based designs, respectively. Note that the degree of logic area increase is clock-frequency dependent, and re-running these experiments at lower frequencies reduces this impact.
Table 3 reports the power dissipation of the resulting designs based on the specific signal activities determined by our back-annotated gate-level simulations. The 3-phase latch-based designs show an average power reduction of 40.8% compared to the FF-based designs and 16.3% compared to the master-slave latch-based designs. The table shows that the proposed approach can save up to 75% of the power consumption at the same frequency when compared to traditional FF-based designs. The improvement over master-slave latch-based designs is more consistent but not as large as the improvement over FF-based designs. In particular, the maximal power reduction is 40%, with average benefits of 12%, 26%, and 30% over the ISCAS, CEP, and CPU master-slave designs, respectively. The overall power savings drop from 41% to 16% when the comparison changes from FF to master-slave designs. This can be explained by the fact that latch-based designs often have less glitching and fewer hold buffers than their FF-based counterparts. Table 4 reports the power dissipation of the resulting designs using switching-activity based power analysis assuming a switching activity of 20% on all inputs (except reset and clocks) and registers. It shows similar savings as in the simulation-based power analysis shown in Table 3.

Table 4: Power consumption (mW) based on switching activity in the original flip-flop (FF), converted master-slave latch (M-S), and proposed 3-phase (3-P) latch-based designs

Table 5: Run-times (sec) of our experiments
In summary, our experiments suggest that while significant savings in area and power are possible with our proposed approach, the amount of savings is variable and likely depends on a combination of factors including 1) the percentage of FFs with combinational feedback, which limits the savings in the number of latches, and 2) the impact of retiming latch-based designs on the combinational logic. We should also note that these results are post-synthesis and thus do not reflect the cost of the multiple clock trees nor the savings in hold buffers, both realized during physical design.
The run-time details of the conversion algorithm are reported in Table 5. The column labeled "FF Total" shows the run-times of FF-based synthesis, the next column corresponds to the run-time of master-slave latch-based design conversion, and the last three columns report the run-times spent on solving the ILP, converting and retiming, and the total for 3-phase latch-based designs. Notice that the run-times for most designs, except for AES, are less than 18 minutes, of which at most 29 seconds is consumed by the ILP solver. This suggests that our proposed approach is computationally practical for at least moderately-sized blocks. AES has the largest number of registers (9703 FFs in the original design) and takes the longest time for conversion and retiming, i.e., 1 hr 23 min for master-slave and 3 hrs 12 min for 3-phase.
CONCLUSIONS
This paper presents an algorithm to automatically convert an FF-based design into a 3-phase latch-based design that uses an ILP to minimize the number of required latches. Our experimental synthesis results on a broad range of benchmark circuits show significant savings are possible in both area and power with practical computational run-times, particularly for pipelined circuits such as multi-stage CPUs when compared to both FF and master-slave latch-based designs.
Our future work includes quantifying these benefits post place-and-route, including capturing the cost of routing multiple clock trees and the benefits associated with higher tolerance to PVT variations and increased robustness to hold failures. In addition, we plan to quantify the advantage of this approach when applied to timing and soft-error resilient templates in which the decrease in latches also reduces the overhead of the necessary error detection logic.
