Abstract-Nowadays, variable delay arithmetic units are used to implement the datapath of a target system in pursuit of performance improvement. However, adoption of variable delay arithmetic units requires modification of a typical synchronous control unit design methodology. The telescopic arithmetic unit based methodology is one of the representative methodologies for designing synchronous control units for variable delay datapaths. In this paper, we propose two optimization methods for it. The proposed optimization techniques are analyzed in order to show their performance improvement effects explicitly.
I. INTRODUCTION
Although synchronous system designers still assume that all the component arithmetic units of a datapath operate with their own fixed delays, variable delay arithmetic units, in short VDAUs, have started to be implemented and used in pursuit of performance improvement of target systems [1, 2, 3]. However, adoption of VDAUs requires modification of a typical synchronous control unit design methodology.
[1, 2] proposed how to automatically synthesize VDAUs, which were called telescopic arithmetic units, in short TAUs, and how to modify an original finite state machine, in short FSM, into a new FSM which can control a variable delay datapath with TAUs, respectively. Although they achieved noticeable and pioneering results, their work is not sufficiently optimized in the following two aspects. First, concurrent execution among VDAUs is not fully supported and unnecessary synchronizations are required. Second, the selection of proper clock cycles is restricted. In [3], a synchronous independent controller is built for each operation, and all the controllers are integrated into a global control unit. Although the method guarantees that the resulting control units preserve the original concurrency among operations, they may suffer from a rapid area increase as the number of operations in the system specification grows.
In this paper, we propose two optimization methods in order to ameliorate the above two problems of the previous approach [1, 2]. The core features of the proposed optimization methods are "to support concurrent executions among VDAUs completely" and "to reduce idle time of VDAUs by lowering the lower bound of the selection range of clock cycles".
II. PRELIMINARIES

A. Variable delay arithmetic units
A telescopic arithmetic unit [1, 2], in short TAU, consists of two parts, an arithmetic unit and a completion signal generator, as shown in Fig. 1. The arithmetic unit part of a TAU is exactly the same as a general synchronous arithmetic unit. The completion signal generator, which is the distinctive part of VDAUs, generates a completion signal when it decides that the computation for the input operands is over. For convenience of explanation, we define two variables, LD (Long Delay) and SD (Short Delay). LD corresponds to the worst case delay of the arithmetic unit. Note that the real computation time varies according to the input operands, although general synchronous arithmetic units are assumed to have the worst-case, fixed computation time for ease of design. Therefore, we can divide all input operands into two groups: the first group is the set of input operands requiring computation time not larger than SD, and the second group is the set of the remaining input operands. Intuitively speaking, a completion signal generator represents the set of input operands belonging to the first group. Therefore, it produces '1' for input operands which can be computed within SD, and thus we [...]

For the other case where the TAU requires SD, a state transition, Si → Si+1, is generated. In Si → Si, register enable signals are not generated because the allocated TAUs have not finished their corresponding TAU operations. In order to decide between Si → Si+1 and Si → Si, a completion signal 'C' from the TAU is included in the input set of the FSM. We call the finally modified FSM a TAUBM FSM.

¹ In [1], the completion signal '1' is generated for input operands which cannot be processed within SD. In this paper, however, we take the opposite convention for convenience.
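The behaviour of a single TAU described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class name `TelescopicUnit`, its methods, and the numeric delays in the usage note are hypothetical and do not come from [1, 2]. The clock cycle is assumed to equal SD, and the completion signal follows the convention of this paper ('1' when the operands finish within SD).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelescopicUnit:
    """Toy model of a TAU (hypothetical names, not an API from [1, 2])."""
    sd: float  # short delay SD, assumed to be the clock cycle here
    ld: float  # long delay LD, the worst-case delay of the arithmetic unit

    def completion_signal(self, operand_delay):
        """Return 1 if the operands are computed within SD, otherwise 0."""
        if not 0 <= operand_delay <= self.ld:
            raise ValueError("operand delay must lie in [0, LD]")
        return 1 if operand_delay <= self.sd else 0

    def cycles_needed(self, operand_delay):
        """One cycle when C = 1, otherwise a second stage is spent."""
        return 1 if self.completion_signal(operand_delay) == 1 else 2


def total_cycles(unit, operand_delays):
    """Clock cycles spent by a sequence of operations on one TAU."""
    return sum(unit.cycles_needed(d) for d in operand_delays)
```

For instance, with a hypothetical unit `TelescopicUnit(sd=10.0, ld=15.0)`, operands with real delays of 8, 12 and 10 time units would spend one, two and one cycles respectively.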
[...] unnecessary synchronizations with TAUs for other arithmetic units. If 'P', the ratio of input operands requiring only SD for a TAU, is large, the synchronization may not be critical. Otherwise, however, the performance degradation due to the unnecessary synchronization is not negligible. Note that 'P' is not a variable that system designers can control; its value depends entirely on the set of input operands. Therefore, in the optimization method I, we remove all the unnecessary synchronizations and reduce the adverse effects of a low 'P' by fully supporting concurrent executions among arithmetic units. The following is a simple algorithm to derive a new FSM through the optimization method I.
RecursiveGenerateAllChildStates(Cstates(i)) ; }
The function RecursiveGenerateAllChildStates(Si) in the above algorithm generates all successor states reachable from the state Si by activating all operations whose input operands and allocated arithmetic units are available. Therefore, FSMs derived by Algorithm 1 contain all the states reachable according to the delays of TAUs. For a better understanding of the concept of the optimization method I, compare the two FSMs in Fig. 2(b) and Fig. 3. They are derived by applying TAUBM and the optimization method I, respectively, to the DFG in Fig. 2(a). Here, we assume that a TAU styled multiplier is allocated to operations 2 and 3. In TAUBM, since the 2nd stage of a TAU styled multiplier is spent selectively according to input operands, we should not allocate other operations to the time interval to which the 2nd stage of the TAU is allocated. Therefore, operation 1, which is a successor of operation 0, must be delayed until operation 3 is over, although it could be started once operation 0 is over, as shown in Fig. 2(a). However, the optimization method I enables the corresponding FSM to activate operations as soon as possible because it explores all the states reachable according to the delays of TAUs. For example, in Fig. 3, operation 1 is always activated in the 2nd time step, and operations 0, 1 and 3 can be performed in 2 clock cycles irrespective of the delay of TAU operation 3 in both state transition paths S0 → S2 → S3 and S0 → S1 → S3, while they spend 2 or 3 clock cycles in TAUBM. The same holds for operations 2, 4 and 5.
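The idea of exploring every state reachable under the possible TAU delays can be sketched in Python as follows. This is not Algorithm 1 itself, whose listing is incomplete here, but a minimal illustration under assumptions: the DFG is a hypothetical four-operation graph (not the one in Fig. 2(a)), each unit executes one operation at a time, ready operations are started as soon as possible, and all names (`DFG`, `start_ready_ops`, `child_states`, `generate_all_states`, `cycle_bounds`) are introduced only for this example.

```python
from itertools import product

# Hypothetical DFG: op -> (predecessors, allocated unit, unit is a TAU?)
DFG = {
    0: ((), "add", False),
    1: ((0,), "add", False),
    2: ((), "mul", True),
    3: ((2,), "mul", True),
}
# A state is (finished ops, ops entering their 2nd TAU stage as (op, stage)).
INITIAL = (frozenset(), frozenset())


def start_ready_ops(done, running):
    """Activate every operation whose operands and unit are available."""
    running = dict(running)
    busy = {DFG[op][1] for op in running}
    for op in sorted(DFG):
        preds, unit, _ = DFG[op]
        if op in done or op in running or unit in busy:
            continue
        if all(p in done for p in preds):
            running[op] = 1  # the operation enters its first stage
            busy.add(unit)
    return running


def child_states(state):
    """Yield (completion signals, successor state) for one clock cycle."""
    done, running = state
    running = start_ready_ops(done, running)
    taus = [op for op, stage in running.items() if DFG[op][2] and stage == 1]
    for signals in product((1, 0), repeat=len(taus)):
        c = dict(zip(taus, signals))
        new_done, new_running = set(done), {}
        for op in running:
            # Fixed-delay ops and 2nd-stage TAU ops always finish this cycle.
            if c.get(op, 1) == 1:
                new_done.add(op)
            else:
                new_running[op] = 2  # C = 0: a second stage is required
        yield signals, (frozenset(new_done), frozenset(new_running.items()))


def generate_all_states(state, fsm=None):
    """Recursively collect all reachable states and their transitions."""
    fsm = {} if fsm is None else fsm
    if state in fsm:
        return fsm
    fsm[state] = []
    if len(state[0]) == len(DFG):
        return fsm  # final state: every operation is finished
    for signals, child in child_states(state):
        fsm[state].append((signals, child))
        generate_all_states(child, fsm)
    return fsm


def cycle_bounds(fsm, state):
    """Return (best, worst) number of clock cycles from state to the end."""
    if not fsm[state]:
        return 0, 0
    bounds = [cycle_bounds(fsm, child) for _, child in fsm[state]]
    return 1 + min(b[0] for b in bounds), 1 + max(b[1] for b in bounds)
```

For this toy DFG, `generate_all_states(INITIAL)` yields six states, and `cycle_bounds` reports a latency between 2 and 4 clock cycles. In every path, operation 1 starts in the second clock cycle regardless of the completion signal of operation 2, which is the kind of concurrency the optimization method I aims to preserve.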
Fig. 3. A new FSM obtained by the optimization method I for Fig. 2(a).

For performance analysis, we define the execution latency [...] (1 - P^k(i)). Here, CC, k(i) and N represent the new clock cycle selected according to the value of SD, the number of TAUs in the i-th time step, and the number of time steps, respectively. Note that LT_TAU varies according to the value of 'P'. For example, in the case of Fig. 2(b), the execution latency is (6 - 2P)·CC. In the worst case, 'P' is 0 and LT_TAU is 6·CC, but in the best case, 'P' [...]

In order to check the effects of the optimization method I in terms of performance and area, we derived FSMs for several DFG benchmarks through TAUBM and the optimization method I respectively, and [...]

Although the selection of a small clock cycle offers greater flexibility for scheduling and the corresponding controller generation [6, 7], a small new clock cycle, which is selected according to the value of SD of a TAU, does not always guarantee good performance in TAUBM, because a wrong selection of SD may make operations spend multiple time steps, counterbalancing the performance benefits of TAUs. Therefore, although the value of SD can be selected freely and the corresponding TAU can be synthesized automatically according to the selected SD [1], the selection of SD should be performed very carefully.
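The latency figure quoted earlier for Fig. 2(b), (6 - 2P)·CC, can be reproduced with a short Python sketch. The sketch rests on an assumption that is consistent with that example but is not confirmed by the text: a time step containing k(i) TAUs takes one clock cycle only when all of them complete within SD (probability P^k(i), treating the TAUs as independent) and two clock cycles otherwise. The function name `expected_latency` and the per-step TAU counts used in the usage note are hypothetical.

```python
def expected_latency(tau_counts, p, cc=1.0):
    """Expected latency of a lock-step schedule under the stated assumption.

    tau_counts[i] is k(i), the number of TAUs in the i-th time step,
    p is the ratio 'P' of operands that finish within SD,
    cc is the clock cycle CC.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must lie in [0, 1]")
    # Each step costs 1 cycle, plus 1 more unless all its TAUs finish in SD.
    return cc * sum(1 + (1 - p ** k) for k in tau_counts)
```

With four time steps of which two contain one TAU each, for example `expected_latency([1, 0, 1, 0], p)`, the result equals (6 - 2P)·CC, which gives 6·CC for P = 0.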
When we select the value of SD, we should consider the following two things. First, SD should be at least as large as the fixed delays of the other arithmetic units except TAUs. Otherwise, arithmetic units whose fixed delays are larger than SD cannot finish their operations within SD, and the 2nd stages of some TAUs will always be required. Second, 2·SD should be at least as large as LD, because TAUs spend 2 time steps at most. From the above two facts, the following selection range of SD can be derived:

Max(LD/2, Max(FD0, FD1, ..)) ≤ SD ≤ LD

Here, Max(FD0, FD1, ..) represents the maximum fixed delay among the other arithmetic units except TAUs. Note that it is not good to select SD close to LD, since (2·SD) - LD is the idle time of the corresponding TAU. Therefore, it is better to select SD near LD/2 in order to minimize the idle time. However, Max(FD0, FD1, ..) is a direct hurdle to the selection of SD near LD/2. In order to remove the hurdle, we can consider the following two approaches. The first is to adopt new arithmetic units with smaller fixed delays. The second is to replace the current non TAU styled arithmetic units with corresponding TAUs. The first approach is trivial. Therefore, in this section, we consider the second approach, and we call it the optimization method II.
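The selection range of SD and the resulting idle time can be checked numerically with a small Python sketch. The function names `sd_selection_range` and `idle_time` and all delay values in the usage note are hypothetical and chosen only for illustration.

```python
def sd_selection_range(ld, other_fixed_delays):
    """Return (lower, upper) bounds for SD of a TAU with long delay ld.

    lower = Max(LD/2, Max(FD0, FD1, ..)), upper = LD.
    """
    lower = max([ld / 2] + list(other_fixed_delays))
    if lower > ld:
        raise ValueError("no valid SD: a fixed delay exceeds LD")
    return lower, ld


def idle_time(sd, ld):
    """Idle time (2*SD) - LD of a TAU that spends two time steps."""
    if 2 * sd < ld:
        raise ValueError("2*SD must not be smaller than LD")
    return 2 * sd - ld
```

For a hypothetical TAU with LD = 20 and one other unit with a fixed delay of 15, the range is (15, 20) and the smallest idle time is 10. If the other unit had a fixed delay of 8 instead, the lower bound would drop to LD/2 = 10 and the idle time could be reduced to 0.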
We explain how to apply the optimization method II through a simple example. For the DFG in Fig. 2(a), we assume that a TAU styled multiplier 'M', whose short delay SD(M) and long delay LD(M) [...] Fig. 4(a). However, assume that we replace the current fixed delay adder with a TAU styled adder 'A', whose SD(A) and LD(A) are 10ns and 15ns respectively. Then, things change: addition operations also spend 1 or 2 clock cycles selectively according to input operands, instead of spending 2 time steps as shown in CASE 3 of Fig. 4(a). As a consequence, the selection range of SD for the TAU styled multiplier changes from Max(LD(M)/2, FD(A)) ≤ SD(M) ≤ LD(M) to Max(LD(M)/2, SD(A)) ≤ SD(M) ≤ LD(M), and thus additional minimization of idle time becomes possible with the adoption of a TAU styled adder. Fig. 4 [...] may be 10ns, and thus the SD ratio 'P' is changed. In our experiment, we assume that 'P' is reduced to '0.9·P', '0.7·P' and '0.5·P'. Under this assumption, Table II shows that the selection of a new SD under the adoption of additional kinds of TAUs can lead to performance improvement.
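The trade-off between a smaller SD and a reduced 'P' can be illustrated for a single TAU operation with a Python sketch. It is a simplification and does not reproduce Table II: the expected time of one operation is taken as P·SD + (1 - P)·2·SD, which follows from a TAU spending one time step with ratio 'P' and two time steps otherwise. The function names and the baseline values in the usage note are hypothetical; only the reduction factors 0.9, 0.7 and 0.5 come from the text.

```python
def expected_tau_time(sd, p):
    """Expected time of one TAU operation: P*SD + (1 - P)*2*SD."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("P must lie in [0, 1]")
    return p * sd + (1 - p) * 2 * sd


def relative_time(sd_old, sd_new, p, factors=(0.9, 0.7, 0.5)):
    """Ratio of new to old expected time when P drops to factor * P."""
    base = expected_tau_time(sd_old, p)
    return {f: expected_tau_time(sd_new, f * p) / base for f in factors}
```

As a usage example, `relative_time(15.0, 10.0, 0.8)` compares a hypothetical baseline SD of 15ns with a new SD of 10ns for P = 0.8. A ratio below 1 for a given factor means that the smaller SD still pays off despite the lower 'P'.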
V. CONCLUSIONS AND FUTURE WORK
In this paper, we propose two optimization methods for 
