We present an allocation approach that considers the controller's effect on system delay in order to minimize the system cycle time. Most allocation algorithms and conditional resource sharing methods emphasize a minimum number of resources or minimum area. Previous works have not modeled the resulting controller's structure and its contribution to the system delay in a controller/datapath system. Our allocation method generates a controller with minimum delay on the system's critical path. The resulting system cycle time is therefore shorter than that obtained with other allocation approaches.
Introduction
Many systems are built from a datapath and a controller. The system cycle time depends on the interactions between the controller and the datapath. A datapath may impose both arrival times on controller inputs and departure times on controller outputs. Late-arriving controller inputs may be generated by complex datapath functions, such as ALU carry-out, while early-departure controller outputs may be required to control slow datapath units. If the controller is not designed taking into account arrival and departure times, it may unnecessarily put control logic on the critical timing path.
Unit binding in the allocation process affects not only the datapath configuration but also the controller structure. Our allocation approach builds the resulting controller structure at the same time, so that the final controller has minimum delay on the system critical path. In a previous paper [1], we introduced unifiability as a method for reducing controller delay during the scheduling process. However, that algorithm operated only on the controller, considering only the 0/1 values of its primary outputs. This paper shows how to choose datapath allocations that make controller signals unifiable: allocation choices determine controller unifiability, which in turn determines system cycle time.
In the following discussion, we will first review previous allocation work in Section 2. Our allocation approaches will be presented in Section 3 and Section 4. Experimental results and conclusions are in Sections 5 and 6.
Review of Allocation Approaches
Allocation approaches can be categorized as decomposition approaches, greedy constructive approaches, and iterative refinement approaches [2]. The REAL program uses lifetime analysis and a greedy left-edge algorithm for register allocation, which uses the minimum number of registers for acyclic scheduled data flow graphs (SDFG) [3]. The EMUCS system uses a global selection criterion to allocate the next element so as to minimize the number of registers, modules, and multiplexers [4]. The STAR package uses a branch-and-bound search over the subtask space and performs a constructive binding followed by an iterative refinement to minimize hardware resources [5]. The OAS synthesizer uses an integer programming model for scheduling and allocation of embedded VLSI chips [6].
These approaches minimize the number of resources and the interconnection complexity, but do not try to predict the controller structure. Therefore, the controller structure and its influence on the delay of the combined datapath and controller configuration are not well known during resource binding. In this paper, we take the interaction between datapath and controller into account, and propose an allocation approach that constructs a controller structure with minimum delay on the system critical path.
Dependency-Driven Allocation
Controller implementation may have a significant effect on the system cycle time. It may lengthen the critical path and delay the execution of datapath operation. Minimum-controller-delay allocation is an allocation method which results in a controller implementation with minimum delay on the existing critical path in the controller-datapath system, and hence a smaller system cycle time. To find an allocation method with minimum controller delay, we will consider several resource binding heuristics in subsequent subsections. Simple examples will be given first. Some experimental results based on published benchmarks will be discussed in Section 5.
Table 1: FSM-0. z0 is distinct for S1 and z1 is unifiable for S1.
Unifiability and Dependency
Unifiability uses don't-care conditions in the controller to eliminate dependencies of primary outputs on primary inputs. The concepts of minimum dependency have been applied in scheduling [1] and encoding [8]. We will first introduce some terms before we explain our allocation approach. An FSM output zj is dependent on input xk if zj is a function of xk.
If zj has no dependence on xk, zj is independent of xk. For example, in FSM-0, there are two transitions associated with S1. One transition specifies that z0 is 1 when input x0 is 0. The other transition indicates that z0 becomes 0 when input x0 is 1. Therefore z0 is distinct (not unifiable) for state S1 in this case. However, z1 is unifiable for S1, because z1 is either 1 or don't-care for the two transitions associated with S1. If we assign the don't-care as 1, then z1's value is unified to be 1.
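The distinct/unifiable classification can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the names `classify_output` and `unify`, the use of "-" for don't-care, and the row encoding are assumptions made for this sketch. The two rows encode only the S1 transitions described in the text, and placing the don't-care of z1 on the x0 = 1 transition is also an assumption.

```python
DONT_CARE = "-"  # assumed marker for a don't-care output value

def classify_output(values):
    """Return 'unifiable' if all care values agree, otherwise 'distinct'."""
    care = {v for v in values if v != DONT_CARE}
    return "unifiable" if len(care) <= 1 else "distinct"

def unify(values, default="0"):
    """Assign every entry of a unifiable output its single care value."""
    care = {v for v in values if v != DONT_CARE}
    if len(care) > 1:
        raise ValueError("output is distinct and cannot be unified")
    fill = care.pop() if care else default
    return [fill for _ in values]

# Transitions leaving S1 in FSM-0, encoded as (x0, z0, z1).
# Which transition carries the don't-care for z1 is an assumption.
s1_transitions = [("0", "1", "1"), ("1", "0", DONT_CARE)]

z0_values = [t[1] for t in s1_transitions]
z1_values = [t[2] for t in s1_transitions]

print(classify_output(z0_values))  # distinct
print(classify_output(z1_values))  # unifiable
print(unify(z1_values))            # ['1', '1']
```

Once z1 is unified, its value at S1 no longer varies with x0, which is the property the allocation method tries to preserve.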
A unifiable output's logic function can be made independent of the primary input. For instance, if we treat the symbolic present state input as another input in addition to primary input x0 and assign the don't-care in the last row of Table 1 to be 1, we can write z1's function as in (EQ 1), which is independent of x0. However, the non-unifiable output z0 will depend on primary input x0, as shown in (EQ 2).
We use PDS [8] to perform the above minimum-dependency-driven don't-care assignment and encoding, and then implement the FSM in multi-level logic using SIS [9]. (EQ 3) verifies the relationship between unifiability and dependency, where ps0 is a binary present state variable.
The properties of unifiability and minimum dependency will be used in the following discussions.
Functional-Unit and Interconnection
In this subsection, we will describe how to bind available functional units and interconnection resources to minimize the potential delay resulting from the controller implementation.
... > 0); therefore, the critical path delay is smaller.
... arriving input, the critical path delay will be reduced.
$\text{m-01} = ps_0', \quad \text{m-11} = ps_0'$ (EQ 5)
We would like to consider the controller structure and datapath binding at the same time. In many cases, conditional resource sharing does not consider controller delay and may be too greedy. It can therefore introduce an undesired control dependency and a longer critical path, as we have seen in Figure 1(b) and FSM-1. Generating a controller with unifiable primary outputs during resource binding is one of our approaches to reducing the control dependency and hence the system cycle time.
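The effect of a control dependency on timing can be illustrated with a simple model. The sketch below is an assumption made for illustration, not the delay model used in the paper: an output settles one fixed logic delay after the latest input in its support arrives. The function name `output_ready_time` and all numeric values are hypothetical.

```python
def output_ready_time(support, arrival, logic_delay):
    """Settling time of a controller output under a simple max-plus model.

    support: names of the inputs the output's logic function depends on.
    arrival: dict mapping input name -> arrival time.
    logic_delay: fixed delay of the output's logic (hypothetical).
    """
    return max(arrival[name] for name in support) + logic_delay

# Hypothetical arrival times: the present-state bit is available at the
# start of the cycle, while the comparison result c arrives late.
arrival = {"ps0": 0.0, "c": 5.0}
logic_delay = 1.0

dependent = output_ready_time({"ps0", "c"}, arrival, logic_delay)
independent = output_ready_time({"ps0"}, arrival, logic_delay)

print(dependent)    # 6.0
print(independent)  # 1.0
```

Under this model, removing the late-arriving input from an output's support is what takes the controller logic off the critical path.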
Register Allocation
In the following, we will explain how register allocation changes the controller's dependency on the critical path (see Figure 3). We first apply the left-edge algorithm with conditional resource sharing for register allocation: R0 = (w0, w1, w2), R1 = (w3), where R0 and R1 are registers. The simplified controller-register configuration is illustrated in Figure 4(a). For simplicity, we use a signal c to denote the (v0 - v1 > 0) primary input to the controller, and show only the two control signals of interest, mux-0 and mux-1.
The corresponding controller for Figure 4(a) is shown in Table 4(a). Because primary output mux-0 is distinct for state S0, mux-0 will depend on the late-arriving input c, namely (v0 - v1 > 0), as shown in (EQ 6). In addition, a three-input multiplexer is placed before R0. We now introduce another register allocation configuration to avoid these potential problems.
$\text{mux-0} = \ldots, \quad \text{mux-1} = ps_1$ (EQ 6)
In contrast to Figure 4(a), Figure 4(b) has only a two-input multiplexer in front of R0. Table 4(b) describes the resulting controller. The primary output signals are all unifiable in this case. The logic functions of mux-0 and mux-1 are shown in (EQ 7). Because the primary outputs are not dependent on the late-arriving controller input c, and only a two-input multiplexer is placed in front of R0, Figure 4(b) will have a shorter critical path than Figure 4(a).
$\text{mux-0} = ps_1, \quad \text{mux-1} = ps_1$ (EQ 7)
In this subsection, we have seen that register allocation for unifiable controller outputs helps to find a controller-datapath implementation with minimum controller delay on the critical path.
Binding Non-Unifiable Signals
From the discussion above, we know that unifiable outputs generally help to minimize the controller's delay on the critical path. Because of inherent limitations, we sometimes cannot improve the binding configuration to produce a controller with all outputs unifiable. As a heuristic, however, we generate as many unifiable outputs as possible to potentially reduce the system cycle time.
In addition to the finite state machine structure, multiplexer assignment is another factor that affects the number of unifiable outputs. We would also like to reduce the number of multiplexer inputs on the critical path to potentially improve the system performance, while keeping the total number of multiplexer inputs as small as possible to minimize the interconnection cost. Experiments show that these heuristics can improve the system performance when some controller outputs are not unifiable.
Algorithms
The goal is to generate a controller with unifiable outputs, eliminating its dependencies on late-arriving controller inputs during the allocation process. Given a scheduled control data flow graph (SCDFG), we can apply our Minimum-Controller-Delay (MCD) allocation approach to existing allocation methods. In this paper, we choose a base algorithm (BASE) as the comparison basis for MCD. BASE uses the greedy left-edge and conditional resource-sharing algorithms for register allocation. A weighted module allocation graph is built from the preferences given by register allocation and conditional resource sharing. Maximum-weight clique partitioning then solves the module allocation. Commutativity is used in interconnection binding for the point-to-point model.
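The register allocation step of BASE can be sketched as follows. This is a minimal illustration of a greedy left-edge pass extended with conditional resource sharing, not the code used in the experiments. The function names, the half-open lifetime encoding, and the `exclusive` parameter are assumptions, and the lifetimes are hypothetical values chosen so that the result matches the grouping reported in Section 3; they are not taken from Figure 3.

```python
def overlaps(a, b):
    """True if two half-open lifetimes [birth, death) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def left_edge(lifetimes, exclusive=frozenset()):
    """Greedy left-edge register allocation with conditional sharing.

    lifetimes: dict mapping variable name -> (birth, death) control steps.
    exclusive: set of frozenset({a, b}) pairs that lie on mutually
               exclusive branches and may share a register even when
               their lifetimes overlap.
    Returns a list of registers, each a list of variable names.
    """
    # Process variables in order of increasing birth time (the "left edge").
    pending = sorted(lifetimes, key=lambda v: (lifetimes[v][0], v))
    registers = []
    while pending:
        reg, rest = [], []
        for v in pending:
            compatible = all(
                not overlaps(lifetimes[v], lifetimes[u])
                or frozenset((u, v)) in exclusive
                for u in reg
            )
            (reg if compatible else rest).append(v)
        registers.append(reg)
        pending = rest
    return registers

# Hypothetical lifetimes; w1 and w2 are assumed to be on exclusive branches.
lifetimes = {"w0": (0, 1), "w1": (1, 2), "w2": (1, 2), "w3": (0, 2)}
exclusive = {frozenset(("w1", "w2"))}

registers = left_edge(lifetimes, exclusive)
print(registers)  # [['w0', 'w1', 'w2'], ['w3']]
```

The sketch minimizes the register count only. It has no notion of the controller, which is why sharing one register among w0, w1, and w2 can make a multiplexer select signal depend on the branch condition.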
In comparison with BASE, our MCD algorithm maintains the resulting controller structure during the allocation process. For an SCDFG, the binding of operation nodes relevant to a conditional branch directly influences the unifiability of the controller structure. For the conditional branch nodes in the given SCDFG, during register allocation we choose a register-sharing binding that yields unifiable controller outputs when we apply the left-edge algorithm after lifetime analysis. Similarly, we assign higher edge weights in the weighted module allocation graph if a module binding generates a unifiable controller.
Table 4: (a) FSM-2, the FSM derived from the left-edge register allocation algorithm with conditional resource sharing. The mux-0 signal will depend on the late-arriving input c, i.e., (v0 - v1 > 0). (b) FSM-2-MCD, the FSM derived from the minimum-controller-delay register allocation. The mux-0 signal will be independent of the late-arriving input c.
The unifiability of a controller output can be verified by performing an XOR operation on its care output values with respect to branch states. That is, we need only look at the 1's and 0's of a controller output's values at conditional nodes. If the XOR result is 0, then the output values must be either all 1's or all 0's in addition to don't-care values. In this case, the output is unifiable and the binding will lead to a minimum-dependency structure. On the other hand, if the XOR result is 1, the binding is not desirable.
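The XOR test can be used to rank candidate bindings, as in the following sketch. It is an illustration under stated assumptions rather than the MCD implementation: the data layout, the function names, and the candidate values are hypothetical. When a branch state has more than two care values, the sketch XORs each one against the first, which is one way to make "all 1's or all 0's" precise.

```python
def xor_check(values):
    """Return 0 if all care values (0 or 1) agree, else 1.

    None marks a don't-care entry and is ignored.
    """
    care = [v for v in values if v is not None]
    # XOR every care value against the first; a 1 signals a disagreement.
    return int(any(v ^ care[0] for v in care[1:]))

def count_distinct_entries(controller):
    """Count (output, branch state) pairs that fail the XOR test.

    controller: dict output name -> dict branch state -> list of 0/1/None,
                one list entry per transition leaving that state.
    """
    return sum(
        xor_check(values)
        for per_state in controller.values()
        for values in per_state.values()
    )

def pick_binding(candidates):
    """Return the candidate binding with the fewest distinct entries."""
    return min(candidates, key=lambda name: count_distinct_entries(candidates[name]))

# Hypothetical output values for two candidate register bindings.
candidates = {
    "left_edge": {"mux_0": {"S0": [0, 1]}, "mux_1": {"S0": [0, None]}},
    "mcd": {"mux_0": {"S0": [0, None]}, "mux_1": {"S0": [0, 0]}},
}

print(count_distinct_entries(candidates["left_edge"]))  # 1
print(count_distinct_entries(candidates["mcd"]))        # 0
print(pick_binding(candidates))                         # mcd
```

The same count could serve as the bonus term when edge weights are assigned in the module allocation graph, although the exact weighting is not specified in the text.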
When some controller outputs cannot be unified because of the inherent structure, we try to reduce the number of multiplexer inputs on the critical path and choose a proper multiplexer assignment to increase the number of unifiable outputs, potentially reducing the system cycle time. When the allocation is done, minimum-dependency-driven don't-care assignment and encoding [8] can be used to eliminate the undesired dependencies.
Experimental Results
We have tried our allocation algorithms on several benchmarks, including those from Kim [10], Maha [11] and Sehwa [12]. The schedules for these control data flow graphs are similar to those in [13]. To do the experiments, we assume that the conditional node in the fork branch and the following operation after the conditional branch are to be scheduled at the same cycle. Each conditional node contains an operation and will generate a control signal as an input to the controller. For simplicity, we assume that the conditional node operation is a comparison operator and randomly generate the arrival times of the controller inputs to reflect the fact that these signals arrive late. Select signals for multiplexers and load signals for registers are generated as the controller outputs.
The datapath part can be generated by PDL++ [14]. The controller part is in KISS format and is generated after the allocation process. These two parts are integrated by SIS [9]. The circuits are optimized by ESPRESSO [15] and delay-driven multilevel logic scripts in SIS. We use the mcnc.genlib library and delay-driven options for technology mapping. The cycle time for the whole system is measured using the library model after technology mapping.
We summarize our MCD algorithm results in Table 5, where RT state means register-transfer state, and BASE denotes the comparison base algorithm explained in Section 4. In the experiments, two adders and two subtracters are used for all cases. Greedy conditional resource sharing by BASE results in non-unifiable controller outputs in all three cases, which lengthens the critical path of the whole system. MCD, on the other hand, is able to produce a controller structure with unifiable outputs and to eliminate the controller outputs' undesired dependencies on the late-arriving inputs.
In Table 5, the cycle time comparison treats the result from BASE as a unit delay and shows the corresponding MCD cycle time. On average, the system cycle time improvement is 31%, computed as (BASE - MCD)/BASE × 100%. Similarly, we normalize the BASE area of each of the three benchmarks to one and show the MCD area accordingly. The area improvement is 24% on average. We attribute it to the minimum-dependency structure of the unifiable outputs and the resulting simplification of their logic functions.
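The improvement metric and the normalized averages it implies can be written out as follows. The symbols T and A for cycle time and area, and the bar for the average over benchmarks, are notation introduced here for illustration. The second line assumes that the reported averages are taken over the per-benchmark normalized values.

```latex
\[
\text{improvement} = \frac{T_{\mathrm{BASE}} - T_{\mathrm{MCD}}}{T_{\mathrm{BASE}}} \times 100\%
\]
\[
T_{\mathrm{BASE}} = A_{\mathrm{BASE}} = 1
\;\Rightarrow\;
\bar{T}_{\mathrm{MCD}} = 1 - 0.31 = 0.69,
\qquad
\bar{A}_{\mathrm{MCD}} = 1 - 0.24 = 0.76
\]
```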
Conclusions
Most allocation approaches minimize the number of resources. Greedy conditional resource sharing methods often result in a controller that adds delay where it interacts with the datapath part of the system. We propose an allocation method that reduces the system delay through the controller and datapath using several heuristics: unifiable controller outputs, fewer multiplexer inputs on the critical path, and proper multiplexer assignment. This method is able to build a controller structure with minimum dependency on the late-arriving inputs during the allocation process. The performance of the whole datapath and controller configuration is hence improved.
