Abstract. We propose a technology mapping algorithm that takes existing structural technology-mapping algorithms based on dynamic programming [1, 3, 4] and extends them to retime pipelined circuits. If the circuit to be mapped has a tree structure, our algorithm generates an optimal solution compatible with that structure. The algorithm takes into account gate delays and capacitive loads as latches are moved across the logic. It also supports latches with embedded logic: i.e., cells that combine a D latch with a combinational gate at little extra cost in latch delay.
1. Introduction. The problem of retiming for minimal area under delay constraints using transparent latches and reasonably accurate delay models is still unsolved. Leiserson and Saxe [5] developed an optimal solution for edge-triggered D flip-flops, under the assumption that capacitive loads and gate delays are not modified when latches are moved. Ishii et al. [6] gave an optimal algorithm that also handles transparent D latches. Locklear et al. [7] extended this solution to take clock skew into account.
All the above algorithms are globally optimal, but they achieve this by compromising their local accuracy. None models load-dependent delays, even though moving latches changes node loads and thus gate delays. Furthermore, all are restricted to the use of D latches. Many cell libraries contain not only D latches but also latches with embedded logic. Such latches, whose function is of the form Q = latch(clk, f(x1, x2, ..., xn)), combine a D latch with a combinational gate at little extra cost in latch delay.
Furthermore, the application of existing retiming algorithms [5, 6, 7] after technology mapping is suboptimal. The smallest possible latch movement is a movement across a single mapped gate. But this ignores the possibility of breaking a large gate into smaller pieces and placing the latch between them to meet a delay constraint.
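The latch form Q = latch(clk, f(x1, ..., xn)) mentioned above can be illustrated with a small behavioral model. This is only a sketch: the class name `EmbeddedLatch`, the transparent-when-clk-is-high convention, and the choice of a NAND as the embedded function are assumptions made for illustration and are not taken from any particular cell library.

```python
from typing import Callable


class EmbeddedLatch:
    """Behavioral model of a level-sensitive latch with embedded logic.

    Hypothetical helper: Q follows f(x1, ..., xn) while clk is high
    (assumed polarity) and holds its last value while clk is low.
    """

    def __init__(self, f: Callable[..., bool], init: bool = False):
        self.f = f        # embedded combinational function
        self.q = init     # stored output value

    def step(self, clk: bool, *xs: bool) -> bool:
        if clk:                    # transparent phase: output follows f
            self.q = self.f(*xs)
        return self.q              # opaque phase: output is held


# A latch whose embedded function is a 2-input NAND (illustrative choice).
nand_latch = EmbeddedLatch(lambda a, b: not (a and b))
```

A single such cell stands in for a separate NAND gate followed by a D latch, which is the kind of saving the mapper can exploit when a latch is moved next to a suitable gate.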
As a consequence, existing global retiming algorithms cannot fine-tune latch placements. By contrast, our algorithm performs retiming during technology mapping, which provides more control over latch placement and leads to a more efficient use of latch-embedded logic. We extend tree mapping to networks that are trees but can contain latches at interior points. We allow these latches to be moved anywhere within a tree, but not across trees. We preserve the tree mapping algorithm's optimality properties and its linear-time complexity, and handle both edge-triggered and level-sensitive latches. This has several benefits:
• Critical paths are reduced not only by moving latches but also by embedding logic within latches.
• Areas and loads are correctly modeled as latches move within a tree.
• We retime at the level of a tree-matching primitive (a 2-input NAND gate or an inverter), which is finer than the gate level.
The rest of the paper is organized as follows.
In Section 2 we extend tree mapping to allow latches and retiming within trees. In Section 3, we deal with level-sensitive and conditional latches. We give results and conclusions in Section 4.
2. Latch mapping and retiming algorithm.
For ease of illustration, we describe our retiming algorithm in terms of a min-area tree mapper [4]. It can easily be adapted to minimize delay [1] or area under delay constraints [2].
The algorithm starts with a decomposition of the subject tree into 2-input NAND gates and inverters. We start by tabulating, at each leaf node n, the number of latches weight(n) between n and the tree root. Assume for now that we use edge-triggered flip-flops. Then any retiming which preserves weight(n) for each leaf n is legal.
We next extend our data structure. Instead of having one solution per node, we store a solution for each retiming weight per node. A node implementation has a retiming weight of w if, under this implementation, there must be exactly w latches between the output of the node and the root of the tree. At any node, only one of these solutions will be used in the final mapping.
For each leaf n, we initialize n.sol[weight(n)] with zero area. We then perform tree mapping. For each internal tree node, we compute the best solution for each retiming weight. For a given weight w, the candidate implementations include a combinational gate driven by weight-w inputs and a latch driven by a weight-(w+1) implementation, as in the example below. When we reach the root of the tree, the optimal implementation is in root.sol[0].
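The per-weight solution table described above can be sketched as a short dynamic program. This is a simplified illustration, not the mapper itself: each internal node is covered only by its own primitive gate (no multi-node pattern matching), only plain latches are considered (no embedded logic), and the names `Node`, `map_tree`, the tree topology, and all area values are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

INF = math.inf


@dataclass
class Node:
    name: str
    gate_area: float = 0.0                 # area of the primitive at this node (made-up units)
    children: List["Node"] = field(default_factory=list)
    leaf_weight: Optional[int] = None      # weight(n); set only on leaves
    sol: List[float] = field(default_factory=list)  # sol[w] = best area at retiming weight w


def map_tree(node: Node, max_w: int, latch_area: float) -> List[float]:
    """Fill node.sol[w] for w = 0..max_w, bottom-up."""
    for child in node.children:
        map_tree(child, max_w, latch_area)

    # Option 1: no latch at this node's output.
    comb = [INF] * (max_w + 1)
    if not node.children:
        comb[node.leaf_weight] = 0.0       # leaf: sol[weight(n)] starts at zero area
    else:
        for w in range(max_w + 1):
            comb[w] = node.gate_area + sum(c.sol[w] for c in node.children)

    # Option 2: a latch at the output, driven by the weight-(w+1) solution.
    sol = [INF] * (max_w + 2)
    for w in range(max_w, -1, -1):
        sol[w] = min(comb[w], latch_area + sol[w + 1])
    node.sol = sol[: max_w + 1]
    return node.sol


# Hypothetical tree: root = INV(G), G = NAND2(F, J), F = INV(a), J = INV(b),
# with one latch originally between each leaf and the root.
a = Node("a", leaf_weight=1)
b = Node("b", leaf_weight=1)
f = Node("F", gate_area=1.0, children=[a])
j = Node("J", gate_area=1.0, children=[b])
g = Node("G", gate_area=2.0, children=[f, j])
root = Node("root", gate_area=1.0, children=[g])
map_tree(root, max_w=1, latch_area=4.0)
print(root.sol[0])  # 9.0
```

With these made-up areas, the best weight-0 root solution uses a single latch downstream of the NAND (total area 9) rather than one latch on each leaf (total area 13), because two input latches merge into one when moved past the 2-input gate.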
Consider the NAND-inverter tree in Figure 1, with a single phase-1 latch at G (Figure 3).
At node G, we could implement G.sol[0] with a NOR2 gate and drive the gate with weight-0 inputs (Figure 4). By using weight-0 inputs on F and J, we assume that latches are placed before F and J. Alternatively, we could implement G with a latch, driven by weight-1 implementations (Figure 5). In this example the algorithm would examine both solutions and select the one with minimal area.
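The choice between these two candidates can be written as a single area comparison. The symbols below are assumptions introduced for illustration: A_NOR2 and A_latch denote the cell areas of the NOR2 gate and the latch, area(.) denotes the stored area of a solution, and C_1 stands for the area of the weight-1 implementation that drives the latch.

```latex
\[
\mathrm{area}(G.\mathrm{sol}[0]) \;=\;
\min\Bigl(
  \underbrace{A_{\mathrm{NOR2}} + \mathrm{area}(F.\mathrm{sol}[0]) + \mathrm{area}(J.\mathrm{sol}[0])}_{\text{NOR2 gate, weight-0 inputs}},\;
  \underbrace{A_{\mathrm{latch}} + C_1}_{\text{latch at } G,\ \text{weight-1 driver}}
\Bigr)
\]
```

The first term pays for the latches already placed before F and J inside the weight-0 input solutions. The second pays for one latch at G on top of latch-free logic below it.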
3. Complex latches.
Edge-triggered latches are the simplest case. The algorithm can also handle level-sensitive and conditional latches. We must now preserve not only the number of latches between the inputs and the root, but also their correct sequence; this avoids arbitrary swapping of latch positions. In this case, the simple array of solutions is replaced by an array where each entry represents a [...]
Cells with reconvergent embedded logic (e.g., multiplexors or JK flops) can only be used by a tree mapper at tree boundaries. Our algorithm retains this limitation. However, if a tree has a mux at its leaves, we can retime a downstream latch back toward the leaves to use a mux-embedded latch, no matter where the latching point originally was.
Note that SynFul performs area recovery during mapping; i.e., the added slack introduced by latch movement is heuristically used to reclaim combinational area. The nonoptimality of this heuristic explains the few cases where arbitrary retiming yields increased area.
In conclusion, we have improved tree mapping in several ways. By allowing latching points in the interior of trees, we allow trees to span several phases and thus to be larger. Furthermore, we allow the technology mapper to perform retiming within a tree. This retiming is optimal for a tree, given an initial decomposition of the network and ignoring reconvergence at tree leaf nodes. Finally, for many libraries, we obtain a significant advantage by using latches with a logic function embedded within them.
