Abstract: Linear computations are the most widely used type of ASIC computations. Due to their exceptional theoretical tractability and practical importance, numerous schemes for their implementation have been proposed. The canonical schemes have been widely studied and evaluated according to a number of important criteria, including the number of operations, number of bits, area, throughput and latency, and power metrics. However, until now the testability of linear systems has not been studied. After showing that all of the most widely used implementation structures incur a significant testing-related cost, we derive a new structure which is amenable to at-speed testing with no additional hardware overhead. Furthermore, the new structure provides a high throughput, low cost, and low power implementation for an arbitrary linear computation. The key technical novelties of the paper are a novel approach to the use of transformations and the coordinated use of transformations and scheduling for producing highly testable implementations.
INTRODUCTION
Motivation
ASIC designs, and in particular DSP ASIC components, form one of the fastest growing segments of the semiconductor market. For example, while in recent years the compound annual growth of the overall semiconductor market was about 20%, the compound annual growth of the DSP ASIC market was almost 40%. At the same time, it has been realized that the cost of testing is an increasingly important part of DSP ASIC chip costs [19], sometimes as high as 40% of the overall cost. As behavioral level synthesis tools mature, a large percentage of DSP ASIC designs are done using CAD environments. However, until recently very few high level synthesis tools addressed testability. Our goal in this paper is to develop a method for VLSI ASIC realization of linear computations so that the final implementations not only have high throughput, low area, and low power, but are also highly testable. By simultaneously considering the effects of several transformations and of the scheduling and assignment tasks, we have developed an approach which enables generation of high performance, low area, and low cost circuits which are highly testable with no test hardware overhead.
0-7803-2612-1/95 $4.00 © 1995 IEEE
Linear systems are defined as systems which satisfy two axiomatic properties: homogeneity and additivity [17]. The homogeneity property states that if the response of the system to a signal x is a signal y, the response of the system to signal ax is signal ay. The additivity property states that if the response of the system to a signal x is a signal v, and to a signal y is a signal w, the response of the system to signal x + y is v + w. Linear systems are the most widely studied type of systems due to their intrinsic conceptual simplicity. More importantly, linear systems form by far the most dominant component of today's VLSI ASIC and application specific programmable market [16]. For example, portable phone DSP functions are mainly linear computations [16]. Linear systems are modeled and optimized using linear computations. Linear computations can be characterized as computations which use only additions/subtractions and multiplications with constants. Note that this definition is not restricted to traditional arithmetic algebraic structures.
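The two defining properties can be checked numerically on a small example. The sketch below is illustrative only: the filter `second_order_fir`, its coefficients, and the helper names are hypothetical and do not come from the paper. It builds a computation from additions and multiplications with constants and tests homogeneity and additivity with exact rational arithmetic.

```python
from fractions import Fraction

def second_order_fir(x, c0=Fraction(1, 2), c1=Fraction(1, 4), c2=Fraction(1, 8)):
    """Hypothetical linear computation: y[n] = c0*x[n] + c1*x[n-1] + c2*x[n-2].

    Only additions and multiplications with constants are used.
    """
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0   # samples before the stream start are zero
        x2 = x[n - 2] if n >= 2 else 0
        y.append(c0 * x[n] + c1 * x1 + c2 * x2)
    return y

def is_homogeneous(system, x, a):
    """Response to a*x equals a times the response to x."""
    return system([a * v for v in x]) == [a * v for v in system(x)]

def is_additive(system, x, y):
    """Response to x + y equals the sum of the individual responses."""
    summed = [p + q for p, q in zip(x, y)]
    return system(summed) == [p + q for p, q in zip(system(x), system(y))]

def squarer(x):
    """Counterexample: squaring each sample is not a linear computation."""
    return [v * v for v in x]

x = [Fraction(v) for v in (1, -2, 3, 5)]
y = [Fraction(v) for v in (4, 0, -1, 2)]

fir_is_linear = is_homogeneous(second_order_fir, x, Fraction(3)) and is_additive(second_order_fir, x, y)
squarer_is_linear = is_homogeneous(squarer, x, Fraction(3)) and is_additive(squarer, x, y)
```

Passing the check on particular signals does not prove linearity in general; here it only illustrates the definitions, while the squaring counterexample shows how a non-linear operation violates them.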
What is New?
The first procedure for synthesis of at-speed testable linear systems is presented. By simultaneously considering the effects of several transformations and of the scheduling and assignment tasks, we have developed an approach which enables generation of high performance, low area, and low cost circuits which are highly testable with no test hardware overhead.
PRELIMINARIES
In this section we outline all relevant assumptions and information about the targeted computational and hardware models, and the assumed testing approach.
Computational and Hardware Model
We assume a synchronous data flow computation model [12], [14], which is often used in high level synthesis for DSP applications. This model assumes a periodic computation done on a semi-infinite stream of data along the time loop. A majority of DSP, video, control, and graphics applications follow the selected computational model. The synchronous data flow computation model is also well suited for the application and CAD treatment of transformations. This is mainly because the relatively strict timing discipline imposed by the model provides a convenient basis for the evaluation of changes in the structure of the computations, and enables accurate and rapid predictions of the properties of the final implementation.
Besides the assumed computation model, the selection of the targeted hardware model also has numerous consequences. We adopt the dedicated register-file model, in which all registers are grouped in a number of register files. Each register file is connected to only one input of an execution unit, while each execution unit can send data to an arbitrary number of register files. There are several reasons behind the decision to select the register file model of architecture. First, the model dictates grouping of single read, single write register files, which enables area-efficient layout. This is the main reason that the register file model is widely used in general purpose architectures [19] as well as custom ASIC designs [23]. Next, it has been demonstrated that the dedicated register file model reduces the number of interconnects at the expense of a somewhat higher number of registers. As the technology scales, the reduction of interconnect is becoming more important, making the dedicated register file model even more attractive. Finally, the availability of accurate and computationally inexpensive area prediction models [3] is an added reason to use the register file model.
Testability Methodology
It is also important to explicitly state assumptions about the targeted testability methodology. We target testability assuming a single stuck-at fault model and gate-level sequential ATPG. It is widely accepted that the elimination of all sequential loops, besides self-loops, is most often sufficient to achieve high testability. We target the elimination of all non-trivial sequential loops only in the data-path, assuming full scan of the control logic. For the majority of numerically-intensive designs, the data-path completely dominates the area and FF requirements of the design [23], [3]. An effective DFT technique to make a circuit testable is partial scan, where a set of flip-flops (FFs) is scanned so that all non-trivial loops in the circuit are broken [4], [13]. We assume employment of the partial scan DFT methodology, and consider the number of partial scan FFs needed to make the circuit highly testable as the measure of testability overhead.
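The overhead measure used above, the number of scanned FFs needed to break all non-trivial loops, can be illustrated on a small register graph. The sketch below is an assumption-laden illustration and not the selection algorithm of [4] or [13]: the graph, the register names, and the greedy "highest fan-in plus fan-out" heuristic are all hypothetical. Self-loops are tolerated, and a scanned register is treated as removed from the graph.

```python
def find_nontrivial_cycle(graph, scanned):
    """Return registers forming a loop of length >= 2, or None.

    graph maps each register to the registers it feeds. Self-loops are
    ignored, and scanned registers are treated as cut points.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {reg: WHITE for reg in graph}
    stack = []

    def dfs(u):
        color[u] = GREY
        stack.append(u)
        for v in sorted(graph[u]):
            if v == u or v in scanned:
                continue                      # self-loop or already broken
            if color[v] == GREY:
                return stack[stack.index(v):]  # back edge closes a loop
            if color[v] == WHITE:
                cycle = dfs(v)
                if cycle:
                    return cycle
        stack.pop()
        color[u] = BLACK
        return None

    for start in graph:
        if start not in scanned and color[start] == WHITE:
            cycle = dfs(start)
            if cycle:
                return cycle
    return None

def select_scan_registers(graph):
    """Greedy heuristic (hypothetical): scan the busiest register of each remaining loop."""
    def degree(reg):
        fan_out = len(graph[reg])
        fan_in = sum(reg in successors for successors in graph.values())
        return fan_in + fan_out

    scanned = set()
    while True:
        cycle = find_nontrivial_cycle(graph, scanned)
        if cycle is None:
            return scanned
        scanned.add(max(cycle, key=degree))

# Hypothetical data-path: R1 -> R2 -> R3 -> R1 is a non-trivial loop,
# while the self-loops on R1 and R4 are acceptable.
registers = {"R1": ["R1", "R2"], "R2": ["R3"], "R3": ["R1"], "R4": ["R4"]}
scan_set = select_scan_registers(registers)
```

Finding a minimum set of registers that breaks all loops is a hard combinatorial problem in general, so a greedy choice such as this one only gives an upper bound on the overhead.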
In the rest of the paper, we will restrict our attention to the synchronous data flow model of computation and sequential gate-level ATPG of the datapath. It is important to emphasize that all the transformations presented in this paper can be directly explored in other popular computational and hardware models, as long as the elimination of non-trivial sequential loops is a dominant measure of testability.
Related work is traced along two lines of research: high level synthesis techniques for testability and transformations. The mandatory tasks during high level synthesis include allocation, scheduling, and assignment [15], [Wal91], all of which have been shown to have significant impact on the testability of the synthesized designs [9]. Existing high level synthesis for testability techniques can be broadly classified according to the testing methodology targeted: BIST, gate-level sequential ATPG, or hierarchical test pattern generation. Four most notable efforts which target BIST have been reported from Case Western Reserve University [7].
Several research groups have developed high level synthesis systems which target sequential ATPG testability. These systems synthesize data paths without loops by using proper scheduling and assignment, and use scan registers to break data-path loops [8]. A datapath synthesized from a behavioral specification may contain several types of loops, e.g., CDFG, assignment, and sequential false loops [8]. A CDFG loop is formed in the datapath when there exists a cycle consisting of data-dependency edges in the CDFG. The other types of loops are introduced in the data path during behavioral synthesis, specifically due to hardware sharing. For instance, when operations along a CDFG path from operation u to operation v are assigned to n separate modules, with u and v assigned to the same module, an assignment loop of length n is created in the datapath. A comprehensive analysis of the formation of loops in circuits synthesized by high level synthesis techniques is presented in [8].
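The formation of an assignment loop can be made concrete with a small sketch. Everything named here is hypothetical (the operations `u`, `a`, `v`, the modules `ADD1`, `ADD2`, `MUL1`, and both helper functions); the sketch only assumes that every CDFG data edge between two operations induces a connection between the modules they are assigned to.

```python
from collections import deque

def module_graph(cdfg_edges, module_of):
    """Map each CDFG data edge (src_op, dst_op) onto an edge between modules."""
    graph = {module: set() for module in module_of.values()}
    for src, dst in cdfg_edges:
        graph[module_of[src]].add(module_of[dst])
    return graph

def shortest_loop_through(graph, start):
    """Number of modules on the shortest loop through start, or None if there is none."""
    queue = deque((nxt, 1) for nxt in graph[start])
    seen = set()
    while queue:
        node, dist = queue.popleft()
        if node == start:
            return dist
        if node in seen:
            continue
        seen.add(node)
        queue.extend((nxt, dist + 1) for nxt in graph[node])
    return None

# Acyclic CDFG path u -> a -> v (hypothetical operations).
cdfg_edges = [("u", "a"), ("a", "v")]

# Sharing one adder between u and v closes a loop over n = 2 modules.
shared = module_graph(cdfg_edges, {"u": "ADD1", "a": "MUL1", "v": "ADD1"})
loop_with_sharing = shortest_loop_through(shared, "ADD1")

# Giving v its own adder keeps the datapath loop-free.
separate = module_graph(cdfg_edges, {"u": "ADD1", "a": "MUL1", "v": "ADD2"})
loop_without_sharing = shortest_loop_through(separate, "ADD1")
```

The CDFG itself contains no cycle in this example; the loop appears only in the datapath, as a side effect of the assignment.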
Other work addressing testability during high level design is related to the use of architectural information to guide test pattern generation [25], [5], [6].
Transformations are alterations in the structure of a computation so that a particular objective is achieved, while the initially specified functional and timing dependencies are preserved. Recently, a new transformation technique was developed which increases the complexity of the behavioral description while reducing the structural complexity of the resulting datapath [9]. Application of the new transformation technique, hot-potato, to reduce the partial scan overhead for generating easily testable data paths was demonstrated [9]. Recently, Potkonjak et al. demonstrated the high effectiveness of a transformation-based approach for simultaneous optimization of testability and area under throughput constraints [22].
Table 1 shows several design metrics of six different structures of the eighth order IIR Avenhaus bandpass filter [1]. The filter is a widely used benchmark in the DSP and high level synthesis literature. The table compares various parameters of the resulting implementations: number of registers, area, power, and word-length required for a given frequency response, for a sampling period of 580 ns. None of the different filter implementations were originally testable. Table 1 shows the number of scan flip-flops (FFs) required by BETS [8], a behavioral test synthesis system, to make the corresponding implementations 100% testable. BETS provides algorithms which support testability optimization during resource allocation, scheduling, and assignment. BETS simultaneously targets both resource utilization (i.e., area for given timing constraints) and testability. In all cases a significant test-hardware overhead (scan FFs) is induced. The analysis of the six computational structures indicates that this test hardware overhead is significant and unavoidable, due to the structure of control-data flow graph (CDFG) loops and directed paths in the CDFG which have to result in assignment or false loops in the corresponding circuits [8].
The level of unfolding, k, is dictated mainly by the throughput requirements. k has to be high enough not just to satisfy the throughput requirements, but also to ensure that each adder can be assigned at least one addition which depends only on primary inputs. Note that as the level of unfolding increases the number of primary inputs increases, while the number of states remains unchanged. Therefore, by using unfolding, one can always create as many additions as required which depend only on the primary inputs. Probably the most important relevant observation is that all newly created primary inputs can be assigned to the same physical pins due to hardware sharing.
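The effect of unfolding on inputs and states can be sketched in state-space terms. This is an illustration under stated assumptions rather than the CDFG-level procedure of the paper: the single-input form s[n+1] = A s[n] + B x[n], the matrices, and all function names are hypothetical. Unfolding k times yields one iteration that consumes k input samples while the state vector keeps its original size.

```python
def mat_vec(M, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_mul(P, Q):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def step(A, B, state, x):
    """One original iteration: s[n+1] = A s[n] + B x[n]."""
    return [a + b * x for a, b in zip(mat_vec(A, state), B)]

def unfold(A, B, k):
    """Return (A^k, coeffs) with s[n+k] = A^k s[n] + sum_i coeffs[i] * x[n+i].

    coeffs[i] equals A^(k-1-i) B, so the unfolded iteration has k inputs.
    """
    size = len(A)
    power = [[int(i == j) for j in range(size)] for i in range(size)]  # identity
    coeffs = []
    for _ in range(k):
        coeffs.append(mat_vec(power, B))   # A^j B for j = 0 .. k-1
        power = mat_mul(A, power)
    coeffs.reverse()
    return power, coeffs

def unfolded_step(A_k, coeffs, state, xs):
    """One unfolded iteration consuming k input samples at once."""
    result = mat_vec(A_k, state)
    for coeff, x in zip(coeffs, xs):
        result = [r + c * x for r, c in zip(result, coeff)]
    return result

# Hypothetical second order system and input samples.
A = [[0, 1], [-2, 3]]
B = [0, 1]
state = [1, 2]
xs = [1, 0, -1]

sequential = state
for x in xs:
    sequential = step(A, B, sequential, x)

A_k, coeffs = unfold(A, B, k=len(xs))
batched = unfolded_step(A_k, coeffs, state, xs)
```

In the unfolded form each product of a coefficient vector with an input sample depends only on primary inputs, which is the kind of operation the text requires to be available for every adder.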
TRANSFORMING AND SCHEDULING LINEAR COMPUTATIONS FOR TESTABILITY
The initial and final CDFG (after the application of the procedure for making linear systems testable) of the second order single input single output linear system are shown in Figures 1 and 2, respectively. Figure 1 shows the second order IIR direct form filter. Figure 2 shows the same filter after unfolding and algebraic restructuring. Note that the variables denoted by R1 and R2 are assigned to fully controllable registers. Using hardware sharing, those registers are used to store the state variables and therefore provide controllability to all registers in the design. We proved that after a sufficient level of unfolding all nodes in the hardware graph can be made fully controllable and observable by using hardware sharing and connections from primary inputs and to primary outputs. We have applied the new approach to the filter structures shown in Table 1. In each case, the resultant CDFG had an implementation which is 100% testable without the use of scan (no test area overhead), and hence can be tested at-speed. For example, after the application of the new approach to the parallel 8th order Avenhaus filter, the resulting CDFG has an implementation with an area of 9.08 mm² and power consumption of 8.24 nJ/sample. The design is 100% testable with no use of scan.
Figure 2: The second order recursive linear system from Figure 1, after the application of the synthesis procedure for high throughput, low cost, low power, and high testability implementation.
SISO LINEAR COMPUTATIONS
In the previous section, we presented a method which provides a 100% at-speed testable solution for an arbitrary linear computation with no hardware overhead. This solution can be improved even further when the synthesis target is a SISO LTI computation. This is because reducing the required bitwidth and the number of execution and memory units shrinks the final implementation, and with it both the number of faults and the number of test vectors. To reduce the bitwidth requirements we use a preprocessing step where an arbitrary SISO LTI system is first transformed to the corresponding parallel structure. Figure 4 shows an example of such a structure, an 8th order IIR parallel filter.
Subsequently, the procedure described in the previous section is applied. The parallel structure has two advantageous properties. Firstly, it is numerically stable and therefore requires a short bitwidth. Secondly, when unfolding is applied it results in significantly lower hardware requirements, because there is no interaction between computations done in parallel branches of the structure, regardless of the level of unfolding used.
CONCLUSION
We introduced a procedure which transforms an arbitrary linear computation into a form which is highly testable. In some sense the procedure provides an ultimate solution for testing ASIC implementations of linear designs, since the resulting implementation can be tested at-speed with no test hardware overhead while satisfying arbitrarily high throughput requirements. For the important special case of SISO linear computations, additional optimization steps provide a further level of optimization while preserving all advantages of the generic synthesis procedure. Experimental results support the theoretical analysis and optimization algorithms.
