A new approach to high level synthesis, which simultaneously addresses testability and resource utilization, is presented. We explore the relationship between hardware sharing, loops in the synthesized data-path, and partial scan overhead. Since loops make a circuit hard-to-test, a comprehensive analysis of the sources of loops in the data path, created during high level synthesis, is provided.
Introduction
The traditional goal of high level synthesis tasks, like scheduling, allocation and assignment, has been the optimization of performance under resource constraints or its dual. More recently, the list of goals was expanded to include fault tolerance and testability, the latter being the main topic of this paper.
Automatic test pattern generation (ATPG) of sequential circuits has been recognized as a difficult problem [1]. While full-scan design solves the testability problem, it can be very costly. Recently, partial scan design has gained wide acceptance, since only a subset of flip-flops (FFs) needs to be scanned to make the circuit testable.
The dependencies among the FFs of a sequential circuit are captured by an S-graph, where each node corresponds to a FF. There is a directed edge from node u to node v if there is a combinational path from FF u to FF v in the sequential circuit. It has been observed by Cheng and Agrawal [2] that sequential test generation complexity grows exponentially with the length of cycles in the S-graph, and linearly with the longest path (sequential depth) in the S-graph. These observations are used by several gate-level partial scan approaches [2, 3, 4] to select scan FFs such that all loops in the S-graph, except self-loops, are broken and the sequential depth is minimal.
Recently, high level synthesis techniques have been used to generate easily testable data paths [5, 6, 7, 8, 9]. Chen and Saab [8] used a high-level testability analysis program to identify testable structures and synthesize them to improve testability. Lee et al. [6, 7] developed a method to minimize the formation of loops in the data path by partial scan and proper register assignment.
It has been widely recognized that the implementation area in hardware-shared architectures is most often dominated by interconnect requirements [10]. In order to efficiently address interconnection cost, we use the register file model, where all registers are clustered in a number of register files. While each execution unit can send data to any register file in the general case, each register file is connected to one input of a single functional unit. A number of fully operational high level synthesis systems, such as Cathedral-II and Hyper, are based on the register file model [11, 12].
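To make the S-graph definition concrete, the following minimal Python sketch represents an S-graph as an adjacency map and checks whether it contains a loop other than a self-loop. The register names, the example graph, and the helpers `has_nontrivial_loop` and `scan` are hypothetical and serve only as an illustration; they are not taken from any of the cited tools.

```python
def has_nontrivial_loop(sgraph):
    """Return True if the S-graph has a cycle of length greater than one.

    sgraph maps each FF to the set of FFs it reaches through a purely
    combinational path. Self-loops (u -> u) are ignored.
    """
    nodes = set(sgraph)
    for targets in sgraph.values():
        nodes |= set(targets)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on DFS stack, 2 = finished

    def visit(u):
        state[u] = 1
        for v in sgraph.get(u, ()):
            if v == u:
                continue  # self-loop: tolerated
            if state[v] == 1:
                return True  # back edge closes a loop of length > 1
            if state[v] == 0 and visit(v):
                return True
        state[u] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)


def scan(sgraph, scanned):
    """Model scanning a set of FFs by removing them from the S-graph."""
    return {u: {v for v in vs if v not in scanned}
            for u, vs in sgraph.items() if u not in scanned}


# Hypothetical example: R1 -> R2 -> R3 -> R1, plus a self-loop on R2.
sgraph = {"R1": {"R2"}, "R2": {"R2", "R3"}, "R3": {"R1"}}
print(has_nontrivial_loop(sgraph))                # loop R1 -> R2 -> R3 -> R1
print(has_nontrivial_loop(scan(sgraph, {"R1"})))  # only the self-loop remains
```

In this toy graph, scanning R1 alone leaves only the self-loop on R2, which is the situation partial scan aims for.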
Since we target computation-intensive application domains, the controller states usually need only a few FFs. In our design for testability framework, we assume all the control signals to the data path to be made fully controllable by scanning the FFs of the controller.
In this paper, we concentrate on generating easily testable data paths from high-level specifications, achieving high resource utilization, and satisfying given performance constraints. We attempt to make the data path easily testable by ensuring that the synthesized data path has no loops, except self-loops. In the rest of the paper, we use loops to refer to loops of length greater than one.
Hardware sharing is a widely used methodology to achieve high resource utilization, but it may adversely affect the testability of a circuit by introducing new loops in the data path. However, when hardware sharing is exploited properly in conjunction with the partial scan methodology, improvements in testability can be achieved despite possible introduction of loops. The scan registers can be shared amongst several variables of the CDFG, to break not only the loops in the CDFG, but also the loops introduced in the data path by hardware sharing. This paper exploits hardware sharing to minimize the number of scan registers needed to synthesize a minimal-loop data path. Experimental results show that our technique can synthesize very easily testable designs, with low hardware overhead, without compromising the performance of the designs. The partial scan cost incurred by our technique is significantly less than that of a gate-level partial scan approach.
Motivation
Consider the Control Data Flow Graph (CDFG) of the 4th order IIR cascade filter shown in Figure 1(a). Assume that each operation in the CDFG takes one control cycle. The critical path is 6 control cycles long. Figure 1(a) shows one possible feasible schedule and assignment, satisfying a performance constraint of 6 control cycles, and using a minimal number of execution units. For instance, the operation +2 is scheduled in control cycle 2, and assigned to be executed in adder A1, shown in Figure 1 [...] Table 2 by the row IIR.16 Orig.
The testability of the data path can be improved using partial scan techniques at the gate level [2, 3, 4], to break all the loops of the circuit. Breaking all the loops of the S-graph in Figure 1(c) needs scanning at least 3 registers, namely LA1, LA2, and LM1, which translates to 3n FFs, where n is the wordsize. However, the associated area and performance overheads due to the large number of scan FFs can be prohibitive.
The example in Figure 1 illustrates that hard-to-test data paths can be generated if testability of the data path is not considered during high-level synthesis. Instead of postponing the task of making the design testable to the gate level, it is possible to incorporate testability as one of the design goals, besides performance and resource utilization [...] Note that this solution is also optimal with respect to the number of execution units used. The corresponding data path and the S-graph, shown in Figures 2(b) and 2(c) respectively, are significantly more amenable to sequential test generation. Notice that the S-graph still has loops; however, scanning register RA2 will break all the loops. The resultant data path, with register RA2 scanned, has no loops, and is very easy to test. A test efficiency of 100% could be achieved on the resultant data path, as evidenced by the row IIR.16 SFT in Table 2.
Source of Loops in the Data Path
We present a comprehensive analysis of the sources of loops in the data path, and its corresponding S-graph. [...]
(3) Sequential False Loops: A sequential loop in the data path is termed false when the loop cannot be sensitized under normal operating conditions. A false loop is a special case of a false path. Figure 3 illustrates the formation of a sequential false loop in the data path. Figure 3(a) shows segments of two paths in a CDFG, where operations +1 and +3 are scheduled in control step 1, and +2 and +4 are scheduled in control step 2. If operations +1 and +2 are assigned adders A1 and A2, respectively, no assignment loop is formed. Similarly, assigning operations +3 and +4 to adders A2 and A1, respectively, ensures maximum resource utilization, while avoiding the formation of any assignment loop. However, the resulting data path in Figure 3(b) reveals the formation of a sequential loop, shown in bold. To sensitize the loop, the required control signals to the multiplexors M1 and M2, c1 and c2, should be {c1 = 1, c2 = 0} (or {c1 = 0, c2 = 1}) in any two consecutive control steps. However, this necessitates execution of operation +4 followed by +2 (or +2 followed by +4), which is clearly not possible. Consequently, the sequential loop can never be sensitized under normal operating conditions, and is a false loop.
Stok previously considered the formation of false loops through resource (module) sharing [13]. His treatment of false loops was limited to combinational loops, generated in the data path when data-chaining is used, that is, when two or more data-dependent operations are scheduled in the same control step. However, even when data-chaining is not used, hardware sharing can lead to sequential false loops in the data path, as illustrated by Figure 3. Since we assume that the control signals to the data path are fully controllable, the false loops act as real loops during test generation, and contribute to the complexity of sequential ATPG in the same way as other loops.
Figure 4 shows the register files of an adder A1, used in the data path of the elliptical wave filter shown in [14]. Columns Variables and Lifetime show the left operands of each operation that was assigned to A1, and the lifetimes of the variables. Each variable has to be assigned to some register of the left register file of A1, shown in column Register Assignment. Note that multiple registers are required due to conflicts in the lifetimes of the variables. For instance, the variables b, c, e and i are all alive in the 7th control step and receive data from A1; hence they need to be assigned to four different self-loop registers. The left register file has 5 registers, {L1 ... L5}, and the right register file has 4 registers, {R1 ... R4}. The inputs of the registers are shown. For instance, register L1 has a single input coming from module A1, while register R1 has three inputs, from A1, A2 and M1.
Since there are 5 self-loop registers in the left register file of A1, and 3 self-loop registers in the right register file, a clique involving the 8 self-loop registers is formed in the corresponding S-graph.
As each register in a clique is completely connected with all the other registers of the clique, breaking all the loops of a clique of size k requires scanning k - 1 registers. This means that formation of cliques not only makes test pattern generation very hard, but also makes a partial scan solution very expensive.
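The k - 1 bound for cliques can be checked by brute force on small cases. The sketch below is only an illustration under simple assumptions: the S-graph is given as a node list and an edge list, scanning a register is modeled as deleting its node, and the helper names `is_loop_free`, `min_scan_registers` and `clique` are hypothetical.

```python
from itertools import combinations


def is_loop_free(nodes, edges):
    """True if the graph restricted to `nodes` has no cycle of length > 1."""
    remaining = set(nodes)
    while remaining:
        live = [(u, v) for (u, v) in edges
                if u != v and u in remaining and v in remaining]
        has_pred = {v for (_, v) in live}
        sources = remaining - has_pred
        if not sources:
            return False  # every remaining node has a predecessor: a loop exists
        remaining -= sources  # peel off nodes that cannot lie on any loop
    return True


def min_scan_registers(nodes, edges):
    """Smallest number of registers whose removal breaks all loops (exhaustive)."""
    nodes = list(nodes)
    for size in range(len(nodes) + 1):
        for scanned in combinations(nodes, size):
            kept = [n for n in nodes if n not in scanned]
            if is_loop_free(kept, edges):
                return size


def clique(k):
    """S-graph in which each of k registers feeds every register, itself included."""
    nodes = [f"R{i}" for i in range(k)]
    edges = [(u, v) for u in nodes for v in nodes]
    return nodes, edges


for k in range(1, 6):
    print(k, min_scan_registers(*clique(k)))
```

For the clique of 8 self-loop registers in the adder example, the same check yields 7 scan registers, which illustrates why register file cliques make partial scan expensive.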
Algorithms for Efficient Use of Partial Scan and Hardware Resources
A complex set of goals is imposed during scheduling and assignment which simultaneously addresses both hardware resource utilization and testability issues while satisfying the throughput constraints. In order to design a hardware competitive solution, it is mandatory to consider all three components of implementation cost: execution units, registers, and in particular, interconnects. Testability imposes requirements regarding all four types of loops in the data path graph, and, to a lesser extent, sequential depth. Our approach to the allocation, scheduling and assignment problems has three phases. We start with an initial allocation of execution units, targeting exclusively resource utilization. For allocation, we currently use Hyper [12]. In the second phase, all CDFG loops are broken by assigning a subset of variables (scan variables) to scan registers. Each operation which consumes at least one scan variable is assigned to an execution unit (module), and the scan variable is assigned to the associated register file. In the third phase, we simultaneously schedule and assign each operation of the CDFG using global resource utilization and testability measures.
Breaking CDFG loops with Minimal Number of Scan Registers
In this section, we discuss the problem of breaking the CDFG loops using a minimal number of scan registers. A similar problem has been addressed earlier in the case of S-graphs of gate-level circuits [2, 3], where the minimum number of scan registers required to break all cycles equals the size of the minimum feedback vertex set (MFVS) of the S-graph. Since the problem of finding the MFVS is NP-complete, several heuristics were successfully used [2, 3]. When hardware sharing is not used, a solution to the MFVS problem can be directly applied to break the CDFG loops. However, when hardware sharing is used, the variables selected to break the loops (scan variables) can share the scan registers. Consequently, the MFVS of a CDFG may not be a good solution.
The new dimension added by hardware sharing to the problem of breaking loops is illustrated by the CDFG of the IIR filter shown in Figure 1. A possible solution to the MFVS problem is the set of edges {(+, D1), (+5, D3)}. However, since the variables D1 and D3 are simultaneously alive in the first control step, they cannot share a register, thus requiring 2 scan registers to break the CDFG loops. On the other hand, if we select as scan variables (+2, +1) and (+6, +5), all CDFG loops are broken. The scan variables can now be stored in the same scan register, since their lifetimes do not overlap.
Lee, Jha and Wolf [7] proposed an approach to cut CDFG loops using a subset of boundary variables (variables which correspond to the edges with delays). Since all boundary variables are alive simultaneously, each selected variable has to be assigned to a separate scan register. To maximize the likelihood of reuse of the scan registers, they select boundary variables with short lifetimes. Later, during register assignment, they share scan registers among variables to minimize the formation of assignment loops in the data path. While [7] introduces the important idea of sharing scan registers, the technique does not exploit hardware sharing while selecting scan variables to break CDFG loops. All boundary variables are simultaneously alive (in the first control step), and therefore cannot share scan registers. In contrast, considering all variables in CDFG loops as possible candidates for scan variables greatly improves the chances of efficient sharing of scan registers. Consider the IIR filter shown in Figure 1. Limiting the choice of scan variables to boundary variables results in the use of 2 scan registers. However, considering all variables in the CDFG loops results in a solution which uses only 1 scan register, as described earlier.
Also, the length of the lifetimes of variables has only an indirect, second-order effect on hardware sharing. The necessary and sufficient conditions for two variables to share the same register are that they are not simultaneously alive and that proper interconnect for transferring the two variables is allocated.
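The lifetime condition for sharing a scan register can be illustrated with a short Python sketch. It is a simplification under stated assumptions: lifetimes are half-open intervals of control steps, the interconnect condition is not modeled, and the first-fit (left-edge style) assignment, the variable names and the lifetime values are all hypothetical rather than taken from the filter in Figure 1.

```python
def overlaps(a, b):
    """Lifetimes are (birth, death) control steps, treated as half-open [birth, death)."""
    return a[0] < b[1] and b[0] < a[1]


def assign_scan_registers(lifetimes):
    """First-fit assignment in order of birth time: returns {variable: register index}.

    Only the lifetime condition is checked; the interconnect condition
    mentioned in the text is assumed to hold.
    """
    registers = []   # registers[i] is the list of variables sharing register i
    assignment = {}
    for var in sorted(lifetimes, key=lambda v: lifetimes[v][0]):
        for idx, members in enumerate(registers):
            if not any(overlaps(lifetimes[var], lifetimes[m]) for m in members):
                members.append(var)
                assignment[var] = idx
                break
        else:
            registers.append([var])
            assignment[var] = len(registers) - 1
    return assignment


# Hypothetical lifetimes: two boundary variables alive from the first step,
# versus two internal variables whose lifetimes do not overlap.
boundary = {"D1": (0, 2), "D3": (0, 3)}
internal = {"v_a": (2, 3), "v_b": (4, 5)}

print(len(set(assign_scan_registers(boundary).values())))  # separate registers
print(len(set(assign_scan_registers(internal).values())))  # one shared register
```

The contrast between the two inputs mirrors the argument above: variables that are alive simultaneously cannot share a scan register, while variables with disjoint lifetimes can.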
The goal of our approach to break CDFG loops is to select scan variables such that the following criteria are simultaneously satisfied:
HSC1 All CDFG loops, except self-loops, are broken;
HSC2 The selected scan variables can be assigned to a minimum number of scan registers; and
HSC3 Reusability of the scan registers, to break the other loops formed during the subsequent scheduling and assignment phase, is maximized.
We refer to the above problem as the minimum hardware-shared cut (HSC) problem. We address the minimum HSC problem by using a random walk-based approach which combines probabilistic and heuristic techniques. We use two measures, the loop cutting effectiveness measure and the hardware sharing effectiveness measure, to capture the effectiveness of a variable in satisfying the three criteria of the minimum HSC problem. Details of the measures, techniques to compute them, and the algorithm which uses the measures to select the scan variables can be found in [14].
After the scan variables have been selected to cut the CDFG loops, a minimum set of scan registers is first identified to which all the scan variables can be assigned. As a second step, the scan registers are selected from as many register files of as many execution units as possible. The second step increases the chances of reusing the scan registers to assign variables to avoid the formation of loops during scheduling and assignment.
Scheduling, Assignment and Allocation for Testability and Resource Utilization
After the CDFG loops have been broken using a minimal set of scan registers, in the third and final phase, we simultaneously schedule and assign each operation of the CDFG using global testability and resource utilization measures. The aim is to produce a testable data path, avoiding the formation of the three types of loops mentioned before. However, priority is also given to using a schedule and assignment which satisfies the constraint on control steps and which maximizes resource utilization, so that the final design is not only testable, but also competitive in terms of hardware cost.
At each iteration of the algorithm, from the operations that have not yet been scheduled and assigned, an operation op_i with the least slack (ALAP - ASAP) is selected. The set of (module, control step) pairs {(M_j, C_k)} to which the operation can be assigned and in which it can be scheduled is identified. For each pair, the cost in terms of testability, resource utilization and flexibility for scheduling and assignment of subsequent operations is computed. Subsequently, a pair with the smallest cost is selected. The cost measures are described in the subsequent sections. We outline the algorithm for scheduling and assignment of operations.
schedule_and_assign()
    while there exists a node which is not scheduled and assigned {
        op_i = select_node();
        ...
    }
    assign all variables in register files
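The selection loop described in the prose can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `candidates` and `cost` callbacks, the toy operations and the toy cost are hypothetical, data dependencies are ignored, and slacks are not updated after each decision.

```python
def schedule_and_assign(ops, candidates, cost):
    """Greedy least-slack-first scheduling and assignment.

    ops: {name: (asap, alap)} control-step bounds per operation.
    candidates(op, done) -> list of feasible (module, step) pairs.
    cost(op, module, step, done) -> number; smaller is better.
    Returns {op: (module, step)}.
    """
    done = {}
    pending = dict(ops)
    while pending:
        # Select the unscheduled operation with the least slack (ALAP - ASAP).
        op = min(pending, key=lambda o: pending[o][1] - pending[o][0])
        pairs = list(candidates(op, done))
        if not pairs:
            raise ValueError(f"no feasible (module, step) pair for {op}")
        # Pick the pair with the smallest cost.
        module, step = min(pairs, key=lambda p: cost(op, p[0], p[1], done))
        done[op] = (module, step)
        del pending[op]
    return done


# Hypothetical toy instance: three additions, two adders.
ops = {"+1": (1, 1), "+2": (2, 2), "+3": (1, 2)}
modules = ["A1", "A2"]


def candidates(op, done):
    asap, alap = ops[op]
    busy = set(done.values())
    return [(m, s) for s in range(asap, alap + 1) for m in modules
            if (m, s) not in busy]


def toy_cost(op, module, step, done):
    # Stand-in for the testability, utilization and flexibility costs:
    # prefer modules that are already in use.
    return 0 if any(m == module for m, _ in done.values()) else 1


result = schedule_and_assign(ops, candidates, toy_cost)
print(result)
```

In a fuller version, the cost callback would combine the testability, resource utilization and flexibility costs discussed in the following sections.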
Testability Cost
A measure of the testability cost is given by equation 1.
$$\mathit{cost}_{test} = (\mathit{size}_{assign\_loop} + \mathit{cost}_{assign\_scan}) + \mathit{cost}_{false\_loop} + \mathit{cost}_{clique} + \mathit{cost}_{seq\_depth} \qquad (1)$$
The first component of the cost measure, (size_assign_loop + cost_assign_scan), is the cost due to formation of assignment loops, where size_assign_loop is the length of loops formed, and cost_assign_scan is the cost of using some existing or new scan registers to break the loops. The second and third components, written here as cost_false_loop and cost_clique, deal with the other two types of loops, false loop and register file clique, while cost_seq_depth measures the increase in sequential depth due to the assignment.
Cost Due to Formation of Assignment Loop
The cost due to formation of assignment loops is computed as follows. It is first checked whether assigning operation op_i to module M_j creates an assignment loop, by traversing the paths in the transitive fanin of operation op_i in the CDFG. For each loop created, an attempt is first made to break the loop using any available scan register. If successful, then cost_available_scan is added to the cost of scan registers, cost_assign_scan, depending upon whether the used scan register could have been used by some other operation in the same control steps. If the loop cannot be broken by any available scan register, and adding a new scan register is allowable by the user-specified limit of Max_scan_regs_allowed, a new scan register is used to break the loop. The cost of a new scan register, cost_new_scan, is added to cost_assign_scan. Note that if a loop is formed by assignment but is broken by using a scan register, then size_assign_loop = 0.
In case neither an available scan register can break the loop, nor a new scan register can be used, the assignment loop will be formed, and left unbroken, in the data path. To discourage assigning the operation op_i to a module M_j which leaves a loop in the data path, the size of the loop formed is added to the cost function. In the latter case, no scan register is used, and cost_assign_scan = 0.
The computation of the cost associated with the assignment of op_i to module M_j is given below.
cost_assignment_loop(op_i, M_j)
    size_assign_loop = cost_assign_scan = 0;
    if (assignment_loop_introduced(op_i, M_j))
        for each loop {
            if loop can be broken by available scan register
                cost_assign_scan += cost_available_scan;
            else if (#available_scan_regs < Max_scan_regs_allowed) {
                add new scan register to available scan registers;
                cost_assign_scan += cost_new_scan;
            } /* end if */
            else
                /* cannot use any new scan registers; allow loop to form */
                size_assign_loop += size of loop introduced;
        } /* end for */
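The pseudocode above translates almost directly into Python. The sketch below makes several assumptions that are not in the original: loops are given as lists of register names, a scan register breaks a loop if it lies on it, a new scan register is taken to be the first register of the loop, and the two cost constants are arbitrary placeholder values. The dependence of cost_available_scan on other operations in the same control steps is not modeled.

```python
def cost_assignment_loop(loops, available_scan_regs, max_scan_regs_allowed,
                         cost_available_scan=1, cost_new_scan=5):
    """Return (size_assign_loop, cost_assign_scan) for one candidate assignment.

    loops: loops created by the assignment, each a list of register names.
    available_scan_regs: scan registers allocated so far (not modified).
    The cost constants are hypothetical placeholders.
    """
    scan_regs = list(available_scan_regs)
    size_assign_loop = 0
    cost_assign_scan = 0
    for loop in loops:
        if any(reg in loop for reg in scan_regs):
            # An existing scan register already lies on this loop.
            cost_assign_scan += cost_available_scan
        elif len(scan_regs) < max_scan_regs_allowed:
            # Allocate a new scan register on the loop (here: its first register).
            scan_regs.append(loop[0])
            cost_assign_scan += cost_new_scan
        else:
            # No scan register can be used; the loop is left in the data path.
            size_assign_loop += len(loop)
    return size_assign_loop, cost_assign_scan


# Hypothetical loops created by one candidate assignment.
loops = [["R1", "R2"], ["R2", "R3"], ["R4", "R5"]]
print(cost_assignment_loop(loops, ["R2"], max_scan_regs_allowed=1))
print(cost_assignment_loop(loops, ["R2"], max_scan_regs_allowed=2))
```

With a limit of one scan register the third loop is left unbroken and contributes its length to the cost; raising the limit to two trades that penalty for the cost of a new scan register.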
The cost due to the formation of false loops and cliques is computed in a way similar to assignment loops. The increase in sequential depth due to an assignment can be computed by traversing the transitive fanins of the operation being assigned.
Resource Utilization and Flexibility Cost
The area overhead for synthesizing a testable design should be minimal, so that the new approach has a significant advantage over gate-level design for testability schemes [2, 3, 4]. We briefly outline the criteria used to achieve high resource utilization.
1. The most difficult operations for scheduling and assignment (the operations which are likely to require additional modules) are handled first, while the number of alternatives is still high.
2. Introduction of interconnects which cannot be easily reused later is avoided. Strong preference is given to local interconnect over global interconnect.
3. Registers are an important part of the implementation cost. Any new register should have a high likelihood of being reused later.
We establish two criteria to predict whether an interconnect will be local or global after the physical synthesis of the design. An interconnect from a unit to itself will remain local after placement and routing. The second criterion is based on the observation that the greatest difficulty in routing often arises due to high congestion in some areas of the chip. To avoid congestion, during the interconnect assignment and allocation phase, we try to limit the number of interconnects which originate from or go to a particular register file.
For a particular assignment and schedule choice for an operation, we assign cost, in an increasing order, to the following resources used: (1) new register, (2) new local interconnect and (3) new global interconnect. When multiple scheduling and assignment choices have the same hardware cost, we prefer a choice which introduces a new resource with higher likelihood for later reuse. Reusability of a new resource is calculated by counting how many unscheduled and nonassigned CDFG nodes can use the resource.
The flexibility cost measures to what extent a particular schedule and assignment of an operation op_i adversely affects the flexibility for scheduling and assignment of subsequent operations in the transitive fanout of op_i. It is calculated by summing up the reduction in slacks of the operations in the transitive fanout of op_i.
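As an illustration of the flexibility cost, the sketch below sums the slack reductions over the transitive fanout of an operation when it is placed in a given control step. It rests on assumptions that are not in the original: every operation takes one control step, ALAP times stay fixed, and the function name and the three-operation chain are hypothetical.

```python
def flexibility_cost(op, step, succs, asap, alap):
    """Sum of slack reductions in the transitive fanout of `op` placed in `step`.

    succs: {operation: list of direct successors in the CDFG}.
    asap, alap: current bounds per operation; slack = alap - asap.
    Assumes unit-delay operations and unchanged ALAP times.
    """
    new_asap = dict(asap)
    new_asap[op] = step
    fanout = set()
    stack = [op]
    while stack:
        u = stack.pop()
        for v in succs.get(u, ()):
            changed = new_asap[v] < new_asap[u] + 1
            if changed:
                new_asap[v] = new_asap[u] + 1  # successor starts one step later
            if changed or v not in fanout:
                fanout.add(v)
                stack.append(v)
    return sum((alap[v] - asap[v]) - (alap[v] - new_asap[v]) for v in fanout)


# Hypothetical chain a -> b -> c, each with one step of slack.
succs = {"a": ["b"], "b": ["c"]}
asap = {"a": 1, "b": 2, "c": 3}
alap = {"a": 2, "b": 3, "c": 4}

print(flexibility_cost("a", 1, succs, asap, alap))  # earliest step: no slack lost
print(flexibility_cost("a", 2, succs, asap, alap))  # later step: b and c each lose one
```

Placing an operation late therefore scores worse whenever it consumes the slack of the operations that depend on it.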
Experimental Results
To evaluate the effectiveness of the new synthesis for testability technique, we synthesized the following datapath-intensive benchmarks: all-zero FIR wave digital filter (WDF), 4th order cascade IIR filter, and 5th order elliptical wave digital filter (EWF). In the sequel, Orig refers to the original implementation using conventional high level synthesis techniques to ensure maximal throughput and minimal number of execution units. SFT refers to the implementation using the new synthesis for testability approach. Table 1 shows various parameters of the original and SFT implementations after high level synthesis. The number of execution units (EXU) and control steps (CS) needed are the same for both versions of the designs. In some cases, for example EWF, the SFT implementation needed a few more registers (Reg), multiplexers (Mux), and interconnects (Inter), indicating that testability improvement may result in a small increase in resource requirements. However, on average, the area overhead is marginal.
Table 1: Performance and Hardware Costs of the Designs
The testability of the synthesized designs was evaluated using the gate-level sequential ATPG tool HITEC [1]; the results are shown in Table 2. The numerical suffix after the name in the first column corresponds to the wordsize of the implementation. For each design, besides the rows corresponding to the original and SFT implementations, the other two rows correspond to circuits obtained from the original design [...] Table 2. The total number of faults, the number of faults aborted by HITEC, the fault coverage and test efficiency achieved, and the ATPG time taken on a SUN Sparcstation 2 are reported. Table 2 demonstrates that the SFT designs obtained by our technique were consistently more testable than the original designs obtained by traditional high level synthesis techniques. To achieve the same level of testability, the gate-level tool OPUS needed to scan a significantly larger number of FFs (GPSx) than required by the SFT designs. For instance, in the case of EWF.20, OPUS needed to scan 300 FFs to break all loops (except self-loops) and achieve the same level of testability as the SFT design, which had only 60 scan FFs.
Moreover, when OPUS was restricted to scan the same number of FFs (GPSn) as in the SFT design, the testability achieved was significantly lower than that of the SFT design. In the case of EWF.20, while the SFT design achieved 100% test efficiency in only 233 CPU seconds, GPSn could only achieve 40% in 14895 CPU seconds.
Conclusions
This paper presented a new technique to exploit hardware sharing to minimize the number of scan registers needed to synthesize a testable data path. Novel algorithms have been proposed to select a minimal number of scan registers to cut CDFG loops, and reuse the scan registers during scheduling and assignment to [...]
We thank Professor Janak Patel and Dr. Vivek Chickermane for providing us with the HITEC and OPUS tools.
