We consider the problem of selecting the proper implementation of each circuit module from a cell library to minimize the propagation delay along every path from any primary input to any primary output, subject to an upper bound on the total area of the circuit. Different module implementations may have different areas and delays on the paths. We show that the latter problem is NP-hard even for directed acyclic graphs with two implementations per module and no restrictions on the overall area of the circuit. We present a novel retiming-based heuristic for determining the minimum clock period on sequential circuits. Although our heuristics can handle a bound on the total area of the circuit, emphasis is placed on the timing issue.
INTRODUCTION
The circuit implementation problem studied here is related to the technology mapping problem studied earlier by Brayton et al. [1], Keutzer [7], Pedram et al. [13], Touati et al. [16, 17], Chaudhary et al. [3], and others. These authors examine the problem of mapping a Boolean network using gates from a finite-size cell library.
However, in this paper, we [15]. Thus, our primary goal is to select module implementations so that we minimize the maximum delay of a given circuit. For the sake of simplicity, we focus on the Timing-Driven General Circuit Implementation (TDGCI) model, where the module implementations are selected without considering the area of each implementation. However, the heuristic solutions presented in this paper can be easily extended to consider a bound on the total area of the implemented circuit.
We consider the pin-dependent MIS library delay model as formulated in [3], where the arrival time at the output $g_o$ of a module $g$ is expressed as $\mathrm{arrival}(g_o, C_{g_o}) = \max_{g_i \in \mathrm{inputs}(g)} \left( \tau_{g_i,g_o} + R_{g_i,g_o} C_{g_o} + \mathrm{arrival}(g_i, C_{g_i}) \right)$, where $\tau_{g_i,g_o}$ is the intrinsic gate delay from input $g_i$ to output $g_o$ of $g$, $R_{g_i,g_o}$ is the drive resistance of $g$ corresponding to a signal transition at input $g_i$, $C_{g_o}$ is the load capacitance seen at $g_o$, and $\mathrm{arrival}(g_i, C_{g_i})$ is the arrival time at input $g_i$ corresponding to the load $C_{g_i}$ seen at that input [3]. The load capacitance $C_{g_o}$ depends on the input pin capacitances of the gates it is driving. Observe that if all pin capacitances of all module implementations are the same, we obtain a simpler delay model, the simplified TDGCI problem, where every module implementation has local delays on its various paths that do not depend on the delays on paths of other modules in the circuit. In that case, the delay along a path can be computed by simply adding the delays on the module edges on the path.
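The arrival-time recurrence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `arrival`, the convention that primary inputs arrive at time 0, and all numeric values are hypothetical and do not come from [3]. The intrinsic delay is called `tau` and the drive resistance `R`, each keyed by an (input, gate) pair.

```python
def arrival(gate, inputs, tau, R, load, memo=None):
    """Arrival time at the output of `gate` under the pin-dependent model.

    inputs: dict gate -> list of drivers (gates or primary inputs)
    tau, R: dicts keyed by (driver, gate): intrinsic delay, drive resistance
    load:   dict gate -> load capacitance seen at the gate output
    """
    if memo is None:
        memo = {}
    if gate in memo:
        return memo[gate]
    fanin = inputs.get(gate, [])
    if not fanin:
        # Assumption: primary inputs are available at time 0.
        memo[gate] = 0.0
        return 0.0
    t = max(
        tau[(i, gate)] + R[(i, gate)] * load[gate]
        + arrival(i, inputs, tau, R, load, memo)
        for i in fanin
    )
    memo[gate] = t
    return t


# Hypothetical two-gate circuit: g1 is driven by primary inputs a and b,
# g2 is driven by g1 and primary input c.
inputs = {"g1": ["a", "b"], "g2": ["g1", "c"]}
tau = {("a", "g1"): 1.0, ("b", "g1"): 2.0, ("g1", "g2"): 1.0, ("c", "g2"): 3.0}
R = {pin: 0.5 for pin in tau}
load = {"g1": 2.0, "g2": 4.0}

t_general = arrival("g2", inputs, tau, R, load)

# Simplified model: the load-dependent term is constant and can be folded
# into tau (here it is simply set to zero), so delays along a path add up.
R_zero = {pin: 0.0 for pin in tau}
t_simple = arrival("g2", inputs, tau, R_zero, load)
```

With memoization each gate is evaluated once, so the cost is linear in the number of pin-to-output edges of the circuit.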
Most of the previous work in the literature is on a simpler model, the BCI problem [2, 11, 10]. Chan [2] has shown that the simplified BCI problem is NP-hard for circuits with tree topology. Furthermore, for the simplified BCI model, Chan [2] has given a pseudo-polynomial time algorithm for trees and a heuristic for basic circuits modeled by directed acyclic graphs (dags). Later, Li et al. [11] showed that the simplified BCI problem is NP-hard. They also developed a pseudo-polynomial time algorithm that obtains optimal solutions for basic series-parallel circuits [11]. In addition, they proposed six heuristics for basic combinational circuits under the simplified BCI model, without actually proving that the BCI problem on combinational circuits is NP-hard in the strong sense [11]. We have shown that the latter problem is indeed NP-hard by reducing from the One-In-Three 3SAT problem. The reduction is given in the Appendix. The latter result was also shown recently, and independently, in [10].
In another context, the authors in [9] allow similar simplifications when they perform retiming in a sequential circuit. Retiming is a technique that repositions the existing flip-flops of a sequential circuit so that its operation is not modified and the clock period is minimized. This is equivalent to maintaining the same number of flip-flops on each cycle. Leiserson and Saxe have presented efficient retiming algorithms [9]. In fact, in this paper we modify one of the algorithms in [9].
Observe that it contains 2 internal nodes (this constraint can be removed, but it simplifies the description of the reduction) and 5 internal edges, labeled $e_k$, $1 \le k \le 5$, respectively.
Leiserson and Saxe [9] proposed algorithms for clock period minimization for circuits with both uniform and nonuniform delays on the modules. We found the proposed algorithm for circuits with nonuniform delays on the modules [9] complicated to implement.
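To make the retiming notions used here concrete, the following Python sketch shows the two standard ingredients of the framework in [9]: applying a retiming labeling r to the flip-flop counts on the edges, and computing the clock period as the longest path that crosses no flip-flop. It is a minimal illustration, not the minimization algorithm itself; the function names `retime` and `clock_period`, the edge-list representation, and the three-node example are assumptions made for this sketch.

```python
from collections import defaultdict, deque


def retime(edges, r):
    """Retimed flip-flop counts: w_r(u, v) = w(u, v) + r[v] - r[u]."""
    return [(u, v, w + r[v] - r[u]) for (u, v, w) in edges]


def clock_period(delay, edges):
    """Largest total node delay over paths using only zero-weight edges."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in delay}
    for u, v, w in edges:
        if w < 0:
            raise ValueError("illegal retiming: negative flip-flop count")
        if w == 0:
            succ[u].append(v)
            indeg[v] += 1
    # Topological sweep over the subgraph of flip-flop-free edges.
    ready = deque(v for v in delay if indeg[v] == 0)
    finish = dict(delay)
    seen = 0
    while ready:
        u = ready.popleft()
        seen += 1
        for v in succ[u]:
            finish[v] = max(finish[v], finish[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if seen != len(delay):
        raise ValueError("cycle without flip-flops")
    return max(finish.values())


# Hypothetical ring a -> b -> c -> a with two flip-flops on edge (c, a).
delay = {"a": 1, "b": 2, "c": 3}
edges = [("a", "b", 0), ("b", "c", 0), ("c", "a", 2)]
period_before = clock_period(delay, edges)

# Move one flip-flop from edge (c, a) to edge (b, c).
retimed = retime(edges, {"a": 0, "b": 0, "c": 1})
period_after = clock_period(delay, retimed)
```

In this example the number of flip-flops on the cycle is 2 both before and after retiming, which is the invariant mentioned above, while the longest flip-flop-free path shrinks.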
In this paper, we present an approach for the general model which, although it has asymptotically the same time complexity as the respective algorithm in [9], is faster in practice and much easier to implement.
The formulation of our problem requires a retiming technique with some constraints in addition to those in [9]. More precisely, we perform retiming on a graph constructed as follows: every module's input and output is represented by a node, and every internal edge is substituted by a node and C1, C2 guarantee clock period minimization in [9]. C3 guarantees that no flip-flop is placed on a prohibitive edge.
For comparison, we also implemented a variation of the previously described approach. In this version, although we create a graph using the prohibitive-edge scheme described above, we do not break any cycles. Furthermore, we perform retiming once to obtain the minimum clock period before any module changes have taken place. Then we calculate the gain of each module by performing retiming, using the algorithm in [9]. The gain of a module is the minimum clock period evaluated before any module changes minus the minimum clock period of the circuit under the alternative implementation of that module. The module with the largest gain is selected and locked. Thus, for every module change, the heuristic performs retiming once for each unchanged module. The time complexity is $O(|V|^3 |E| \log |V|)$.
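The select-and-lock loop described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `greedy_select` and `toy_period` are hypothetical names, the evaluator is passed in as a callable so that a retiming-based minimum clock period routine could be plugged in, and two details are assumptions of this sketch: gains are measured against the current best period, and the loop stops when no unlocked module has a positive gain. The toy evaluator below is a plain longest-path computation standing in for retiming.

```python
def greedy_select(modules, choice, period):
    """Repeatedly switch and lock the unlocked module with the largest gain.

    modules: list of module names
    choice:  dict module -> 0 or 1 (two implementations per module)
    period:  callable mapping a choice dict to the resulting clock period
    """
    choice = dict(choice)
    locked = set()
    best = period(choice)
    while True:
        best_gain, best_mod = 0, None
        # One evaluation per module that has not been changed yet.
        for m in modules:
            if m in locked:
                continue
            trial = dict(choice)
            trial[m] = 1 - trial[m]
            gain = best - period(trial)
            if gain > best_gain:
                best_gain, best_mod = gain, m
        if best_mod is None:
            # Assumption: stop when no remaining switch improves the period.
            return choice, best
        choice[best_mod] = 1 - choice[best_mod]
        locked.add(best_mod)
        best -= best_gain


# Hypothetical delays: (implementation 0, implementation 1) for each module.
d = {"m1": (5, 2), "m2": (4, 1), "m3": (7, 3)}


def toy_period(choice):
    # Stand-in evaluator: two paths, m1 -> m2 and m3 alone.
    return max(d["m1"][choice["m1"]] + d["m2"][choice["m2"]],
               d["m3"][choice["m3"]])


final_choice, final_period = greedy_select(
    ["m1", "m2", "m3"], {"m1": 0, "m2": 0, "m3": 0}, toy_period
)
```

Each round calls the evaluator once per unlocked module, which mirrors the statement that the heuristic performs retiming once for each unchanged module per module change.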
EXPERIMENTAL RESULTS
We implemented both our approach and the straightforward iterative improvement for the combinational circuits in C and ran them on a Sun SPARC System 4/330. We experimented on several ISCAS'85 benchmarks. Since the ISCAS'85 circuits do not include a list of possible implementations for each module, we generated these randomly by using the function rand() from the standard library. For simplicity, we considered uniform pin capacitances and the simplified model. We applied mod 10 to all generated numbers, so that the delays range from 0 to 9. We treat every cell from the library as a module. Table I gives the experimental results for our heuristic, Comb1, and the straightforward iterative improvement approach, Comb2. In Table I, "initial delay" denotes the initial delay of the circuit's longest path, "delay" the minimum delay obtained for the longest path, and "time" the time required for the particular heuristic to terminate, expressed in seconds.
When we constructed Comb1, we tried to stay as close to the gain computation as possible, expecting to get slightly worse results than those of Comb2 (since the gains were computed approximately) but faster. We observed that in 80% of the gain selections Comb1 did indeed select the best actual gain. From Table I, observe that Comb1 is not only much faster than Comb2, as expected, but also produces smaller delays. A possible explanation of this behavior is that the suboptimal gains helped escape local minima.
We implemented our heuristics for the sequential circuits in C and ran them on a Sun SPARC System 2. We experimented on several ISCAS'89 benchmarks. For simplicity, we ran our three heuristics under the assumption that all modules have uniform delays. We used the delays given by the ISCAS'89 circuit data as the delays of the first implementation. We generated the delays of the modules for the second implementation randomly using the function rand() from the standard library. In addition, we applied mod 10 to all generated numbers, so the delays range from 0 to 9. As before, we treat each cell from the library as a module. process, we were able to obtain good results by using the prohibitive-edge scheme described in Section 4. In practice, our approach was faster than the approach in [9].
