Abstract: In a fundamental paradigm shift in system design, entire systems are being built on a single chip, using multiple embedded cores. Though the newest system design methodology has several advantages in terms of time-to-market and system cost, testing such core-based systems is difficult, mainly due to the problem of justifying test sequences at the inputs of a core embedded deep in the circuit and propagating test responses from the core outputs. In this paper, we first present a design for testability technique for testing such core-based systems. In this scheme, untestable cores are first made testable using hierarchical testability analysis techniques. If necessary, additional testability hardware is added to the cores to make them transparent so that they can propagate test data without information loss. This testability and transparency technique is currently applicable to cores of the following types: application-specific integrated circuits, application-specific programmable processors, and application-specific instruction processors. Other core types can be made testable and transparent using traditional techniques. The testable and transparent cores can then be integrated together with some system-level testability hardware to ensure justification of precomputed test sequences of each core from system primary inputs to the core inputs and propagation of test responses from core outputs to system primary outputs. Justification and propagation of test sequences are done at the system level by extending and suitably modifying the symbolic hierarchical testability analysis method that has been successfully applied to register-transfer level circuits. Since the testability analysis method is symbolic, the system test generation method is independent of the bit-width of the cores. The system-level test set is obtained as a byproduct of the testability analysis and insertion method without further search.
The test methodology was applied to six example systems. Besides the proposed test method, the two methods that are currently used in the industry were also evaluated: 1) FScan-BScan, where each core is full-scanned, and system test is performed using boundary scan and 2) FScan-TBus, where each core is full-scanned, and system test is performed using a test bus. The experiments show that the proposed scheme has significantly lower area overhead, delay overhead, and test application time compared to FScan-BScan and FScan-TBus, without any compromise in the system fault coverage.
I. INTRODUCTION
Spurred by technology leading to the availability of a million gates per chip, and the aggressive need for shorter product cycles and reduced system costs, system-level integration is evolving as a new paradigm in system design, allowing an entire system to be built on a single chip. According to Dataquest, system-on-a-chip application-specific integrated circuits (ASIC's) will grow from a market size of $1.1 billion in 1996 (10% of the ASIC market) to $14 billion by the year 2000, representing 60% of the ASIC market. A major key to the success of the new system design methodology lies in the development and use of functional blocks called cores (or intellectual property). A wide range of cores, including processor, microcontroller, digital signal processor, interface, multimedia, and communications/networking cores, are available and being used today in system ASIC's [1]-[3]. Although standard cell libraries have been used in chip design for quite some time, a core is typically much larger and more complex than a standard cell.
Cores can be either soft, firm, or hard [4] . A soft core is a synthesizable high-level or behavioral description that lacks full implementation details. A firm core is also synthesizable, but is structurally and topologically optimized for performance and size through floorplanning (it does not include routing). A hard core is a fully implemented circuit complete with layout. A core-based ASIC is typically composed of a number of large core modules as well as some user-defined logic modules connected together by glue logic.
Typically, hard and firm cores come with precomputed test sets from the core provider. For soft cores, such test sets can be obtained by the user. However, once these cores are connected together in a system by the user, it becomes very difficult to justify these precomputed test sets to the inputs of the cores from system primary inputs (PI's) and propagate test responses to system primary outputs (PO's). A way of getting around this problem is to use extra design for testability (DFT) hardware to facilitate testing [5] .
One existing DFT method, referred to as FScan-BScan, utilizes a combination of full scan and boundary scan [6] . In this method, each core is made testable by full scan while system-level testability is obtained by separating each core using boundary scan, which gives direct controllability (observability) of the inputs (outputs) of the core. Even though this technique gives a very high fault coverage, it suffers from some serious drawbacks of large area and delay overheads, and enormous test application time.
Another existing DFT method, referred to as FScan-TBus, utilizes a combination of full scan and test bus. In this method, each core is made testable by using full scan. An added test bus runs from the PI's of the system to its PO's and uses a series of multiplexers to isolate each embedded core during testing to provide system-level testability. In this case as well, the area and delay overheads can potentially be quite large. In addition, the test bus architecture is unable to test the interconnect that exists between cores. Built-in self-test (BIST) can also be used to test embedded cores. However, in addition to the attendant area and delay overheads, it is difficult to analyze the fault coverage of BIST, especially the pseudorandom kind, when the cores' logic models are not available for fault simulation.
In [7] , a DFT method is described to test macro blocks inside a circuit with a combination of different test techniques. It relies heavily on full/partial scan and boundary scan whose disadvantages have been stated earlier. Though functional information of modules is sometimes used to reduce test overhead by utilizing the concept of module transparency, the techniques for introducing transparency are ad hoc.
In this paper, we present a DFT and test generation method for testing core-based systems-on-a-chip. This method is based on the justification and propagation of precomputed test sets [8] and takes advantage of the hierarchical testability analysis (HTA) technique presented in [9] - [13] .
First, if a core is not highly testable, we make it highly testable by adding a small amount of DFT hardware based on the HTA technique. This approach results in a test set with high fault coverage for each individual core. However, the main contribution of this paper is not in the testing of individual cores, but in the system-level DFT and test generation techniques used for testing the embedded cores. For system-level testability, it is frequently necessary to provide the transparency property to all the input-output (I/O) port pairs of each core, which allows test data to be propagated through it without information loss. This property allows an embedded core to receive its precomputed test sequence, independent of which vectors constitute the test sequence. The method is independent of whether the number of outputs of the core is smaller than, equal to, or larger than the number of inputs.
A system-on-a-chip consisting of an interconnection of testable and transparent cores can be made testable with the addition of a small amount of system-level DFT hardware. To ensure that the internal state of a core does not change between the arrival of two consecutive test vectors from its test set, we freeze the state of the core during the intermediate cycles with an external clock enable signal (such a signal may already be available for low power operation). A modified HTA technique is used to generate the system-level test set and the test schedule.
The testing solution consists of two parts. The first part consists of core-level DFT and test generation to make each core testable and transparent and generate a precomputed test set for the core. This task is to be performed by the core provider in the case of hard and firm cores and the user in the case of soft cores. In this paper, we have provided a solution using HTA which can be used by the core provider to tackle the above problem. However, the core provider may use other DFT techniques as well. The second part consists of system-level DFT and test generation which are to be performed by the user. For this part, all we need is the precomputed test sequence of each core and the number of clock cycles required for transparency through various I/O port pairs of the core.
The test methodology was applied to six example systems. The average area overhead for the system-level DFT hardware was only 5.9%, and the delay overhead was negligible. The test efficiency for all the systems was quite high and greater than 99.5%. Besides the proposed test method, the two methods that are currently used in the industry were also evaluated: 1) FScan-BScan and 2) FScan-TBus. The experiments show that the proposed scheme has significantly lower area overhead, delay overhead, and test application time than FScan-BScan and FScan-TBus without any compromise on the system fault coverage.
The paper is organized as follows. In Section II, we state the motivation behind this work and summarize its novel contributions as well as limitations. In Section III, we describe our method with the help of an example core-based system. We formalize the algorithms and methods for testing a core-based system in Section IV. We give the experimental results in Section V and conclusions in Section VI.
II. MOTIVATION
The major issues in system-on-a-chip testing are: 1) creating adequate and portable core internal tests, 2) providing core accessibility, and 3) creating an integrated test and its control mechanism for the overall system [5] .
In this paper, we use a hierarchical testing and DFT method for the purpose of internal test set generation and transparency of some cores [9]-[13]. This method is more efficient in terms of DFT overheads than existing test solutions like scan and BIST. However, it is only applicable to cores of the following types: ASIC's, application-specific programmable processors (ASPP's), and application-specific instruction processors (ASIP's). It is not applicable to cores like digital signal processors, microprocessors, and memories. Another limitation of this method is that it can only handle limited amounts of control-data mixing in the ASIC, ASPP, and ASIP type of cores. Cores that cannot be made testable by the methods given in [9]-[13], or made transparent by the extensions of these methods presented here, can be made testable and transparent using traditional techniques. Note that core-level testability and transparency is not the main focus of this paper. There are various ways of providing peripheral accessibility to embedded cores: parallel direct access (test bus), serial scan access (boundary scan), or functional access [5]. The first two require high overheads. Hence, in our proposed work, we have pursued the third approach. The functional access property allows test data to be propagated through the core without information loss. This requires very little extra logic beyond what is needed to make the core testable. It is based on the concept of finding system-level functional test paths (F-paths) [19].
The main contribution of this paper is a novel system-level DFT and test generation method based on F-paths and transparent cores. It is also shown how an integrated test and control mechanism can be obtained to perform system-level testing with much lower overheads than existing schemes. For our proposed system-level test approach to succeed, the core internal test can be based on any test method, such as BIST, automatic test pattern generation (ATPG), scan, or functional testing.
III. EXAMPLE: A CORE-BASED SYSTEM
Let us consider the core-based system shown in Fig. 1. It consists of an 8-bit central processing unit (CPU) core which is modeled after the Intel 8085 microprocessor, but in fact can be any 8-bit CPU. It is connected to a read-only memory (ROM) core and a random-access memory (RAM) core through address and data buses. The wave digital filter (WDF) is an ASIC core which executes the wave digital filter algorithm used in digital signal processing. It is a data-flow-intensive circuit. The greatest common divisor (GCD) core [14] is a control-flow-intensive circuit that calculates the greatest common divisor of two numbers. The other two modules are programmable data paths. ASPP4 [13] is an ASPP that can execute the behaviors of Paulin and Tseng, two well-known digital signal processing benchmarks. There is an external mode input to switch between behaviors. SIMPLECPU [15] is an ASIP (see [16] for details of ASIP synthesis). For the first part of our testing scheme (core-level), we assume that the register-transfer level (RTL) netlist of each core is available to us. We have purposefully selected the four cores in the system from four different categories: data-flow-intensive ASIC, control-flow-intensive ASIC, ASPP, and ASIP, in order to completely illustrate our test approach. All the circuits have 8-bit data paths. Our testability analysis, DFT, and test generation techniques are independent of bit-width. Thus, choosing cores with larger bit-widths would have no adverse effect on these parameters for our scheme, whereas an increase in bit-width can severely impact these parameters for conventional techniques. Since providing testability and transparency to general-purpose CPU's and memories is beyond the scope of this work, the description that follows will mainly concentrate on the area inside the dashed region in Fig. 1. However, we will discuss some general methods that can handle the CPU and memory as well.
Note that the connectivity among the cores in the example contains loops, which is one of the major reasons for low testability. In fact, the fault coverage of the original system, obtained by running a sequential ATPG tool on it, was only about 10% after two CPU days. However, after applying our proposed method, 99.6% fault coverage could be obtained.
A. DFT for Making Individual Cores Testable
The first task in the testing process is to make sure each individual core is testable and a precomputed test set is available for each core which, if applied to the core, will result in a very high fault coverage for the core. Any existing DFT and test generation techniques can be used to make the core testable and generate the test set. In practice, the core provider is usually responsible for providing a test set for each core. In our work, we have used the HTA technique [9] - [13] to achieve this aim.
Consider WDF first. The initial RTL circuit is very poorly testable. On running HITEC [17] , an efficient gate-level sequential test generation tool, on the gate-level implementation obtained from the original circuit, only 40% fault coverage was obtained after one CPU day. To increase the testability of the circuit, we increase the controllability and observability of certain points in the RTL circuit by adding test multiplexers to the data path, mostly to off-critical paths. To do this, we extract the data and control flow from the RTL data path and controller. We perform HTA of this extracted data/control flow to locate the test points where multiplexers have to be added [11] , [12] . The test architecture for this circuit is shown in Fig. 2 . This architecture needs five test multiplexers within the data path and three extra bits a, b, and c for observing the controller and data path outputs during testing. These control signals are all placed in a test configuration register (TCR) which is directly controllable from the PI's. A reset signal is assumed to be present for the controller register, as is true for most real-life circuits. This signal is also fed to TCR. The registers in the data path are not assumed to be resettable. One extra test input is required to test the circuit. The shaded parts of the architecture represent the testability overhead. The area, delay and power overheads for the HTA-based DFT scheme for WDF are 6.2%, 1.3%, and 4.9%, respectively.
Note that in the above case, the controller and data path are tested separately. The controller test set is obtained by running an ATPG tool on the logic-level controller netlist. This test set is fed to the controller directly from the core inputs and response observed at the core outputs in the test mode by suitably setting the values of signals a, b, and c.
In the case of GCD, there are three status signals fed by the data path to the controller. This circuit does not need any test multiplexers in the data path, and needs a test architecture which is slightly modified from the previous case to tackle the status signals. Fig. 3 shows the final testable circuit [12]. TCR consists of only two control bits in this case. The area, delay, and power overheads in this case are 5.2%, 2.7%, and 3.2%, respectively. The average area, delay, and power overheads of HTA-based DFT of ASIC's are roughly 3%, 1%, and 4%, respectively [12].
The programmable data paths are tackled in a different way. We exploit the programmability of the circuits and insert some test microcode into the control ROM to facilitate testing of hard-to-test modules. However, in some unavoidable cases, we might still have to insert some test multiplexers [13]. In the case of ASPP4 and SIMPLECPU, no test multiplexer is needed to test the data path. However, the test architecture does need some extra multiplexers. The test architecture used for testing SIMPLECPU is shown in Fig. 4. The test architecture for ASPP4 is similar. ASPP4 (SIMPLECPU) needs eight (161) extra lines of microcode for complete testability. Each of the circuits needs a test input for testing. TCR for ASPP4 is 3 bits wide and that of SIMPLECPU is 4 bits wide. The area and delay overheads for the HTA-based scheme for ASPP4 are 2.5% and 0.1%, respectively. For SIMPLECPU the corresponding overheads are 6.5% and 1.5%. The average area and delay overheads for the HTA-based scheme for ASPP's and ASIP's are 3.1% and 0.4% [13].
Tackling Control-Data Mixing: Though all the above examples have separate controller and data paths, the HTA method can handle some control/data chaining where the data and control are intertwined at certain places in the circuit. In this kind of circuit, an initial preprocessing step is performed that separates the control flow from the data flow of the circuit. First, the controller state table is extracted from the controller netlist. If there is no part in the circuit clearly marked as the controller, then each data path control signal like multiplexer select, tri-state control, register load, arithmetic-logic unit (ALU) select, etc., is examined and traced back to the source. These parts of the circuit are designated as the controller. If these signals are combined through logic gates in the data path, then these logic gates are also marked as the controller circuitry. If the part of the circuit thus demarcated as the controller consists of no more than ten flip-flops (FF's) (as is usually the case), then controller state extraction can be performed on this small section of the circuit. After this, the controller state table is conceptually modified with the chained control/data signals and subsequently pruned to remove don't care I/O combinations. This modified controller state table and the RTL data path netlist are then used to extract test control/data flow graphs on which HTA-based justification and propagation can work. Further details of this procedure are given in [18]. Note that the above process is able to separate out limited amounts of control and data. However, if the data path and controller are completely mixed with no scope of separation, then this method is not applicable.
B. Definitions
Before we go into the details, we need to define a few terms.
The general controllability $C_g$ of a node is the ability to control it to any arbitrary value from the system PI's. The observability $O_v$ of a node is the ability to observe any arbitrary value at this node at a system PO. Similarly, we can define $C_1$ (controllability to one), $C_0$ (controllability to zero), and $C_{a1}$ (controllability to the all-ones vector). The verifiability $V$ of a node is the ability to verify its value by either controlling or observing it. The controllability, observability, and verifiability are all Boolean parameters, i.e., they only take the values of one or zero, depending on whether the node has the corresponding ability or not [9]. We extend these definitions by adding another field to these parameters which designates the clock cycle when the particular property of a node is desired. Hence, $C_g(2)$ of a node means that we need to control that node to an arbitrary value in cycle 2 [13].
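The cycle-tagged parameters defined above can be pictured as small records. The following is a minimal sketch only: the class name `Requirement`, the string tags, and the reading of verifiability as "controllable or observable" are illustrative assumptions, not the data structures used in [9] or [13].

```python
from dataclasses import dataclass

# Hypothetical tags for the Boolean testability parameters named in the text.
KINDS = {"Cg", "C0", "C1", "Ca1", "Ov", "V"}


@dataclass(frozen=True)
class Requirement:
    """A testability property desired at a node in a given clock cycle."""
    kind: str   # one of KINDS
    node: str   # node name in the RTL circuit (illustrative)
    cycle: int  # clock cycle in which the property is desired

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown testability parameter: {self.kind}")

    def __str__(self):
        return f"{self.kind}({self.cycle}) of {self.node}"


def verifiability(controllable: bool, observable: bool) -> bool:
    # Assumed reading of the definition: a node is verifiable if its value
    # can be checked either by controlling it or by observing it.
    return controllable or observable


req = Requirement("Cg", "n1", 2)
print(req)  # the node n1 must be controllable to an arbitrary value in cycle 2
```

Keeping the cycle as an explicit field is what later lets a justification step move a requirement from one node to another while shifting it by a known number of cycles.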
C. DFT for Making Cores Transparent
We next have to find ways to justify test vectors of the precomputed test sequence of a core from the system PI's to the core inputs embedded in the design and then propagate the test responses from the core outputs to the system PO's. This means that all the cores used in the design need to be transparent, i.e., a vector at each output of a core can be justified from any input or combination of inputs of that core, and any vector at each input of the core can be propagated to an output or combination of outputs of that core. Thus, it is sufficient to find a set of F-paths between each input and output of the core. An F-path is a functional path from a node in an RTL circuit to another node through which test data might flow [19] . Though this can be trivially obtained by providing a direct path from each core input to each core output with the help of multiplexers, the overhead incurred in such naive solutions is much higher than our method. It is true that all cores may not be required to propagate test data between each input and output depending on how they are located in the system. However, if the core provider provides the transparency mechanism, he/she does not know beforehand in what type of circuits and topologies the core will be used. Hence, in general, providing transparency between each I/O port pair becomes important.
While we have used HTA to provide testability to cores before, we have not used it to provide transparency. We next illustrate how we can extend HTA to provide transparency as well.
1) Making Data Flow Intensive Cores Transparent:
To illustrate our method of making data-flow-intensive cores transparent, let us consider WDF, a wave digital filter which is a data-flow-intensive design. It has a single 8-bit input port and a single 8-bit output port.
Given a vector v at the input of the circuit, we want to ensure that we can obtain v at the output after a few cycles (the smaller the number of cycles, the better, as otherwise test application time is adversely affected). In order to provide transparency in this fashion, we first extract the test control/data flow (TCDF) of WDF from its RTL circuit. The extraction method used here is the same as the one given in [12]. In this method, first we extract the control signals for each cycle from the controller. Then we use these signals to determine the data flow in the data path on a cycle-by-cycle basis. Variables are born when registers are loaded. Thereafter, multiplexer select signals are analyzed to find out the operations in each cycle. We also keep track of the binding information between modules and operations, and registers and variables. Once this is done, we perform an HTA of the TCDF and try to symbolically justify a $C_g(\tau)$ of an output variable (i.e., a variable observable at a core output), where $\tau$ is a symbolic cycle, to a $C_g$ of an input variable (i.e., a variable controllable from a core input) in an earlier cycle. If this justification and propagation becomes difficult, we multiplex certain nodes of the circuit to constants to ensure proper justification through modules. The algorithm for adding test multiplexers is the same as the one given in [11] and [12]. It identifies controllability and observability bottlenecks in the TCDF symbolically using a badness count. It then adds suitable test multiplexers to these nodes. Fig. 5 shows the solution that we obtain for WDF. We find that if we multiplex one of the inputs of a multiplier (MUL2) with constant one and use this value in certain cycles when the multiplication operations are being executed, we can propagate any data at In-port to Out-port in 15 cycles (i.e., a vector applied at In-port in cycle $\tau - 15$ appears at Out-port in cycle $\tau$). The select input of this particular multiplexer is made a PI with respect to the core so that it can be used during system testing.
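To see why multiplexing a multiplier input to constant one helps, consider a toy 8-bit multiplier stage. This is an illustrative sketch only: the function `mul_stage`, the coefficient value 6, and the single-stage structure are assumptions and do not model the actual WDF data path of Fig. 5.

```python
MASK = 0xFF  # 8-bit data path, as in the example system


def mul_stage(data: int, coeff: int, test_sel: bool) -> int:
    """Toy multiplier with a test multiplexer on its coefficient input.

    When test_sel is asserted, the coefficient operand is replaced by the
    constant one, so the stage passes its data operand through unchanged.
    """
    operand = 1 if test_sel else coeff
    return (data * operand) & MASK


# In test mode every 8-bit vector survives the stage unchanged.
transparent = all(mul_stage(v, 6, True) == v for v in range(256))

# In functional mode with an (assumed) coefficient of 6, distinct inputs
# collide at the output, so test data would be lost.
distinct_outputs = len({mul_stage(v, 6, False) for v in range(256)})

print(transparent, distinct_outputs)
```

The point of the sketch is the information-loss argument: a stage is usable for transparency only if the mapping from the propagated input to the output is one-to-one in the chosen mode.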
2) Making Control Flow Intensive Cores Transparent:
To illustrate our method of making control-flow-intensive cores transparent, let us consider GCD, a control-flow-intensive circuit. The RTL circuit of GCD, along with its controller, is shown in Fig. 6. As before, we perform TCDF extraction from the RTL circuit. Note that control-flow-intensive circuits need to be tackled differently than data-flow-intensive circuits as there is no global TCDF here, but the extracted TCDF depends on the sequence of states in the controller which is dictated from outside during testing by controlling the controller status inputs. After TCDF extraction, we try to justify $C_g(\tau)$ of the output variable, where $\tau$ is a symbolic cycle. Since there are two inputs, transparency can be obtained from either one. However, obtaining transparency through only one input is not sufficient because the system topology that the core may be used in is not known beforehand. Hence, we need to provide transparency for all possible I/O pairs. This condition, though quite stringent, cannot be avoided in certain cores due to the inherently flexible nature of a core-based system. However, in certain systems, this condition may be relaxed after doing an analysis of the I/O ports. For example, in a CPU core it may not ever be necessary to propagate test data from a coprocessor input to a direct memory access (DMA) controller output.
In the case of GCD, we cannot relax the requirement. Hence, we have to run a modified HTA on the TCDF, where, in addition to the requirement of $C_g(\tau)$ at the output variable, we also check if an input variable controllable from the desired input is also in the propagation path. The propagation path from y (a variable controllable from input In-port Y) to out (a variable observable from output port Out-port) is shown in Fig. 7. It consists of four cycles. Similarly, there is another path from input In-port X to Out-port that consists of five cycles (the obvious solution of three cycles is not possible because the control signals from the controller do not allow such a data flow).
In order to obtain these propagation paths for any input vector, the control flow of the controller needs to be dictated. This means that we have to control the inputs to the controller from an external source. This can be done by making the t inputs of multiplexer $M_s$ in Fig. 3 external to the core. Signal $S_1$ in Fig. 3 also needs to be an external input to act as the select signal of $M_s$. Hence, TCR of GCD now has only one bit. The status register in the figure is removed as it is no longer required.
3) Making Programmable Data Paths Transparent: To illustrate our method of making programmable data paths transparent, let us consider ASPP4 and SIMPLECPU, both of which make use of a microprogrammed controller instead of a hardwired controller as in the case of the circuits described earlier. ASPP4 has three input ports and two output ports. Hence, six I/O pairs are possible. The control signals are obtained from the microcode in a control ROM. To obtain data propagation from an input to an output port, some extra transparency microcode routines are inserted into the control ROM. Fig. 8 shows the RTL circuit of ASPP4. Consider data propagation from In-port 1 to Out-port 1. This can be obtained by loading REG6 in cycle one from In-port 1 and simultaneously loading REG1 with constant zero. In cycle two, REG1 can be loaded with the output of ADD1 which adds the contents of REG6 and REG1. Thus, if two lines of microcode are inserted into the control ROM, specifying all the control signals needed to carry out the above operations, transparency from In-port 1 to Out-port 1 can be obtained in two cycles. Formally, these microcode routines may be obtained with the help of a structural connectivity graph (SCG) [13]. Fig. 9 gives the SCG of ASPP4. In the SCG, all multiplexers are abstracted out as edges since we can dictate any desired flow through a multiplexer tree by the test microcode. The I/O ports, constants, registers, and modules are all represented in the graph. Modules have two different sets of edges coming into them, corresponding to their left and right ports. In this graph, the register nodes are replicated to keep the diagram simple.
Here, node i represents REGi. The core input registers (registers directly connected to core input ports) are encircled at the module inputs. The core output registers (registers directly connected to core output ports) are encircled with shaded circles at the module outputs. The constant registers (registers to which constants may be loaded) are encased in a square.
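The two-cycle transparency routine from In-port 1 to Out-port 1 described above can be checked with a small register-level simulation. This is a sketch under stated assumptions: the microinstruction encoding (a dictionary of register-load sources) and the function name `run_transparency_routine` are hypothetical, only REG1, REG6, and ADD1 are modeled, and Out-port 1 is taken to be driven by REG1.

```python
MASK = 0xFF  # 8-bit data path

# Hypothetical encoding of the two inserted microcode lines: for each cycle,
# which source is loaded into which register.
ROUTINE = [
    {"REG6": "IN1", "REG1": "ZERO"},  # cycle 1: REG6 <- In-port 1, REG1 <- 0
    {"REG1": "ADD1"},                 # cycle 2: REG1 <- REG6 + REG1
]


def run_transparency_routine(in_port_1: int) -> int:
    # Registers start in an arbitrary state; the routine must not depend on it.
    regs = {"REG1": 0xAB, "REG6": 0xCD}
    for micro_op in ROUTINE:
        # Evaluate all sources from the current register values first, so the
        # loads within one cycle happen simultaneously.
        sources = {
            "IN1": in_port_1 & MASK,
            "ZERO": 0,
            "ADD1": (regs["REG6"] + regs["REG1"]) & MASK,
        }
        for reg, src in micro_op.items():
            regs[reg] = sources[src]
    return regs["REG1"]  # assumed to drive Out-port 1


# Every 8-bit vector applied at In-port 1 reappears at Out-port 1.
print(all(run_transparency_routine(v) == v for v in range(256)))
```

Adding zero plays the same role here as multiplying by one in the WDF case: the module is steered into an identity function so that no information is lost.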
From the SCG, we compute controllability and observability numbers for each register. We call these C-numbers and O-numbers, respectively. The C-number of a register gives an approximate idea as to how many cycles are required to control its value to a symbolic vector from an input port, given complete freedom to choose the control signals of the RTL circuit. Similarly, the O-number of a register gives an approximate idea as to how many cycles are required to observe the symbolic value of a register at an output port, given the above freedom. These C- and O-numbers are calculated using a reachability metric on the SCG and then used to guide the search for the test/transparency microcode. To symbolically justify and propagate vectors in the RTL circuit, we use another modification of HTA, which we renamed RTL testability analysis (RTA) [13]. For example, in the above case, we start with the initial objective of $C_g$ of Out-port 1, which is equivalent to $C_g(\tau)$ of REG1, where $\tau$ is a symbolic cycle. This is justified upwards through module ADD1 to $C_g(\tau - 1)$ of REG6 and $C_0(\tau - 1)$ of REG1 (this is one of many choices and backtracking may be required in the search). $C_g(\tau - 1)$ of REG6 is equivalent to $C_g(\tau - 1)$ of In-port 1 and is, therefore, readily satisfied. $C_0(\tau - 1)$ of REG1 is also readily satisfied because constant zero is available at the input of REG1. Finally, setting the value of $\tau$ to two, we can calculate all the symbolic cycle values. Now, there is a one-to-one mapping between operations in the data path and the control signals required to perform them. Thus, a transparency microcode routine can be obtained. Using the above approach, fourteen additional lines of microcode can be shown to be needed for the six I/O transparency routines required for ASPP4 to provide transparency for the six I/O pairs (three inputs, two outputs). Once these routines are in place in the control ROM, we can use the branch inhibit signal in the testable core (see Fig. 4) to step the program counter in the controller to point to the required transparency routine in the ROM. This routine can be put in a loop to enable it to continue propagating data from an input to an output during the system testing phase. The branch inhibit signal is made a PI with respect to ASPP4 for the system test. Hence, TCR of ASPP4 now has one bit less.
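The text does not spell out the reachability metric behind the C- and O-numbers, so the following sketch simply assumes a breadth-first hop count on a tiny, hypothetical connectivity graph (it is not the SCG of Fig. 9). Modules are folded into register-to-register edges, and the node names `In1`, `Out1`, and the edge list are illustrative.

```python
from collections import deque


def hop_counts(adj, sources):
    """Breadth-first hop count from any of the given source nodes."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def reverse(adj):
    """Reverse every edge, so observability can reuse the same search."""
    rev = {}
    for u, succs in adj.items():
        for v in succs:
            rev.setdefault(v, []).append(u)
    return rev


# Hypothetical graph: In1 feeds REG6, REG6 feeds REG1 (through an adder),
# and REG1 feeds Out1.
scg = {"In1": ["REG6"], "REG6": ["REG1"], "REG1": ["Out1"]}

c_numbers = hop_counts(scg, ["In1"])            # distance from an input port
o_numbers = hop_counts(reverse(scg), ["Out1"])  # distance to an output port

print(c_numbers["REG6"], c_numbers["REG1"])
print(o_numbers["REG1"], o_numbers["REG6"])
```

Under this assumed metric, registers with small C-numbers are attractive starting points when justifying an objective, and registers with small O-numbers are attractive targets when propagating a response, which is how such numbers could guide the microcode search.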
The RTL circuit of SIMPLECPU and its SCG are shown in Fig. 10 . The techniques mentioned above can be used to generate a transparency routine for this circuit as well. The transparency routine first loads a zero into one of the registers of the register file using the output of the multiplier in the ALU. Then it can propagate any vector at In-port to the Out-port through the adder of the ALU by setting the right input of the adder to zero with the help of the above register in the register file. Hence, just two extra lines of microcode are required to achieve transparency in SIMPLECPU. Once the routine is executed, SIMPLECPU can output data from its In-port to Out-port in zero cycles through the combinational path. We again make the branch inhibit signal a PI of SIMPLECPU and reduce the number of bits in its TCR by one.
The testable and transparent cores that are finally obtained are shown as blocks in Fig. 11 . In some processor cores, test microcode insertion in the control ROM may not be possible. In such cases, we need to use processor instructions to achieve transparency. Small transparency programs can be placed as interrupt service routines in the processor main memory. When we need transparency through the processor, we can interrupt it to jump to the correct transparency routine. For example, in Intel 8085, a series of read interrupt mask (RIM) instructions can be used to transfer data from the serial input port to the data bus.
D. Testing the Core-Based System
Once the cores are made testable and transparent, the next step is to make the complete core-based system testable. This may require some system-level DFT hardware. A system-level extension of the HTA technique can be used to identify the locations at which the extra hardware is inserted as well as to derive the system-level test set. The system test architecture for our example is shown in Fig. 12 . In this architecture, all required test and transparency inputs of each core are included in a global system test configuration register (STCR). STCR gets its inputs from the data bus of the CPU through output ports. By suitably loading STCR during the testing phase, individual cores can be tested while other cores are made transparent to propagate the test data. Thus, the CPU, by outputting data to the added ports 4 and 5, does the global test scheduling of the system. Finally, one more problem remains to be tackled. All the cores in the system are sequential circuits themselves. Hence, a precomputed test set of a core consists of test sequences rather than independent test vectors. This means that test vectors need to be applied to the inputs of a core in the order given. If the state of the sequential circuit changes between two consecutive vectors, the test set may no longer be valid. However, in a core-based system, if we have to propagate test data through other cores to the inputs of the core under test, it may not always be possible to feed the desired test sequence to it in consecutive cycles. For example, consider the input of GCD fed by WDF. WDF loads and propagates an input vector every 15 cycles. Hence, it is impossible to feed consecutive test vectors to GCD in consecutive cycles. The state of GCD should not change in between the two test vectors. 
To ensure this, we qualify the clock of each core with an external enable signal which we use to freeze the state of a core in the cycles when test vectors are being propagated to its input and when test responses are being propagated from its output. It will be clear later that we only need to clock a single core at a particular cycle during system testing. Hence, these enable signals may be decoded to reduce the extra pin overhead. In general, if there are n cores in a system, we will need $\lceil \log_2 n \rceil + 1$ extra input pins for testing. The one extra pin is required to switch the system between the normal mode and the test mode. The decoder and the three extra pins needed in the system are not shown in Fig. 12 to keep it simple.
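The pin count above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `extra_test_pins` is hypothetical, and the core count of four used in the usage note assumes that the clock-qualified cores of the example system are GCD, WDF, ASPP4, and SIMPLECPU.

```python
def extra_test_pins(num_cores: int) -> int:
    """Return the number of extra input pins needed for system test.

    The clock-enable signals are decoded, so selecting one of `num_cores`
    cores needs ceil(log2(num_cores)) decoder inputs. One more pin switches
    the system between the normal mode and the test mode.
    """
    if num_cores < 1:
        raise ValueError("a system needs at least one core")
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1,
    # and avoids floating-point rounding.
    decoder_inputs = (num_cores - 1).bit_length()
    return decoder_inputs + 1
```

For four cores, `extra_test_pins(4)` evaluates to 3, which agrees with the three extra pins quoted for the example system.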
Let us next see how HTA can be extended to the system level and applied to this circuit. Consider the core ASPP4. In order to test this core, the initial objectives are $C_g(\tau)$ at a, b, and c and $O_v(\tau+1)$ at nodes d and e. $C_g(\tau)$ at a is propagated through GCD in the transparency mode to $C_g(\tau-5)$ of g (because it takes five cycles to propagate data from In-port X to Out-port in GCD) or $C_g(\tau-4)$ of d. Since d is an output of ASPP4 and data cannot be propagated through the core under test, we drop the objective at d and pursue the objective at g. $C_g(\tau-5)$ of g can be propagated to $C_g(\tau-20)$ at k through a transparent WDF. $C_g(\tau-20)$ at k is propagated to $C_g(\tau-21)$ at PI-port 1, where we utilize a simple CPU program to input data from Port 1 and output data to Port 3. $C_g(\tau-21)$ at PI-port 1 is trivially achievable.
$C_g(\tau)$ at b is propagated to $C_g(\tau)$ at j through SIMPLECPU. This is trivially achievable from PI-port 2. $C_g(\tau)$ at c is trivial due to its direct connection to PI-port 3. $O_v(\tau+1)$ at e is trivial because of its connection to PO-port 2. $O_v(\tau+1)$ at d is propagated to $O_v(\tau+4)$ at a through GCD. However, a is an input to ASPP4 and should not be used to propagate the test response. Hence, we have to add the system-level DFT multiplexer M and control it through a bit in STCR. Thus, $O_v(\tau+2)$ of d is achievable through M. We next make $\tau - 21 = 1$, or $\tau = 22$, and calculate all the symbolic cycle values accordingly. Thus, a symbolic justification and propagation path is obtained for ASPP4 which we refer to as the test environment. From the test environment, a test schedule for testing ASPP4 can now be obtained.
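The last step, turning symbolic cycle offsets into concrete cycle numbers, can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the helper `resolve_cycles` and the dictionary keys are hypothetical, and the offsets are simply the ones read off the ASPP4 walk-through. The earliest objective is pinned to cycle 1, which fixes the symbolic cycle and hence every other objective.

```python
def resolve_cycles(offsets, first_cycle=1):
    """Bind the symbolic cycle so that the earliest objective lands on first_cycle.

    offsets maps an objective name to its offset relative to the symbolic
    cycle (for example -21 for an objective at "symbolic cycle minus 21").
    Returns the resolved symbolic cycle and the concrete cycle per objective.
    """
    earliest = min(offsets.values())
    tau = first_cycle - earliest
    return tau, {name: tau + off for name, off in offsets.items()}


# Offsets taken from the ASPP4 example (names are illustrative labels only).
aspp4_offsets = {
    "Cg at a": 0,
    "Cg at g": -5,
    "Cg at k": -20,
    "Cg at PI-port 1": -21,
    "Cg at b": 0,
    "Cg at c": 0,
    "Ov at e": 1,
    "Ov at d via M": 2,
}

tau, cycles = resolve_cycles(aspp4_offsets)
```

With these offsets the symbolic cycle resolves to 22, the vector for input a must enter PI-port 1 in cycle 1, and the response at e is observed in cycle 23. The same helper applied to offsets 0 and -1 gives a symbolic cycle of 2.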
The testing is done as follows. Suppose we need vectors x, y, and z at some cycle at a, b, and c, respectively. First, we assert the branch inhibit signal of SIMPLECPU and clock it until its program counter points to its transparency routine. Once the routine is executed, any data at the input of SIMPLECPU is available at its output. Then we freeze the clocks of all the cores. We input x at PI-port 1 and use a small CPU program to transfer it to Port 3 (for Intel 8085, this can be done by an IN Port 1 followed by an OUT Port 3). Once the data is ready at Port 3, we clock WDF for 15 cycles while feeding it the required transparency signals by loading STCR with the correct bit patterns. The loading of STCR is done using the CPU which executes a similar program as above. After 15 cycles, x is available at the input of GCD. We next freeze the clock of WDF and clock GCD for five cycles while again loading STCR with the bit patterns necessary for transferring data from In-port X to Out-port of GCD. We simultaneously feed y at PI-port 2 and z at PI-port 3. Thus, after 21 clock cycles, ASPP4 has the necessary vectors at its inputs. We ensure that STCR feeds it the correct test signals and clock ASPP4 once. Then we observe the output at e through PO-port 2. Finally, we toggle the select bit of M in STCR to observe the response at d through PO-port 2.
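The clocking pattern of this schedule can be written down as a per-cycle timeline, as in the sketch below. It is an illustration under stated assumptions: the function `expand_schedule` is hypothetical, and the CPU transfer from Port 1 to Port 3 is counted as a single symbolic cycle, following the testability analysis above rather than the real instruction timing of any particular CPU.

```python
def expand_schedule(steps):
    """Expand (unit, cycle_count) steps into one entry per clock cycle.

    Each entry names the only unit whose clock is enabled in that cycle;
    every other core is frozen by its clock-enable signal.
    """
    timeline = []
    for unit, count in steps:
        if count < 1:
            raise ValueError("each step must last at least one cycle")
        timeline.extend([unit] * count)
    return timeline


# Schedule for applying one vector to ASPP4 (cycle counts from the text).
aspp4_steps = [
    ("CPU", 1),     # move x from PI-port 1 to Port 3 (assumed one symbolic cycle)
    ("WDF", 15),    # propagate x through the transparent WDF
    ("GCD", 5),     # propagate x from In-port X to Out-port of GCD
    ("ASPP4", 1),   # apply the vector to the core under test
]

timeline = expand_schedule(aspp4_steps)
cycles_before_test = timeline.index("ASPP4")
```

Here `cycles_before_test` is 21, matching the 21 clock cycles after which ASPP4 has the necessary vectors at its inputs, and each position of `timeline` names exactly one clocked unit.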
The above process needs to be repeated for every test vector in the test set of ASPP4. The other cores are tested similarly. If it is necessary to propagate more than one vector through a core during the same test environment to more than one output port, then it has to be done in a sequential manner. If there is an output register at a port, then the vector can be frozen there using DFT while a second vector is propagated to the second output port. If there are no registers, then additional latches might be needed to hold the vectors. All this needs to be done because there is no guarantee that test data can be simultaneously propagated to two or more outputs of a single core.
E. The Final Test Strategy
Having explained the strategy of testing individual cores in an example system, we give a brief overview of our global test strategy. When the system is powered up, the CPU executes the program in the ROM-basic I/O system (BIOS). Our strategy is to put a testing routine in the ROM-BIOS and make the CPU jump to it with the help of an interrupt. However, first of all, the ROM-BIOS needs to be tested itself. This can be done through BIST of the ROM-BIOS. As the ROM-BIOS is small, and efficient BIST strategies for ROM's exist, the overheads are small. The testing routine first tests the RAM using the CPU. For this, popular RAM testing algorithms, like Marching 1's and 0's [20] , may be used. The test responses of the RAM may be observed at PO-port 1 using the CPU OUT instruction. Alternatively, BIST schemes for regular structures like RAM's are very efficient and popular. Thus, in the system-level test, all memory cores can be tested using BIST only. Once the RAM is tested, a specific test schedule, like the one for ASPP4 explained above, may be loaded into the RAM at another interrupt location. The loading can be done by a program loop placed at the end of the RAM testing routine. The loop gets data from PI-port 1 and stores them in the RAM. This test schedule is encased in a loop so that it can iterate as many times as the number of test vectors in the test set of the core under test. When the core testing routine is completely loaded, we interrupt the CPU again so that it can now execute the core testing program. At the end of the core testing program, the control can jump back to the ROM so that a fresh testing program for testing another core may be loaded. This loading of test schedule routines is similar to the boot-strapping technique used in computers. The testing routine burned into the ROM-BIOS will remain as an overhead. For an Intel 8085, it amounts to around 100 bytes for our example system.
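To make the RAM testing step concrete, the following sketch runs a simple march test on a software model of a bit-oriented RAM. Several things here are assumptions and not taken from the paper: the sketch uses MATS+ (write 0 everywhere; ascending read 0, write 1; descending read 1, write 0) as a stand-in for the Marching 1's and 0's algorithm of [20], and the `FaultyRam` class with its stuck-at injection is a hypothetical helper for demonstration only.

```python
class FaultyRam:
    """Bit-oriented RAM model with optional stuck-at cells (hypothetical helper)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        # Maps an address to the value that cell is stuck at.
        self.stuck_at = dict(stuck_at or {})

    def __len__(self):
        return len(self.cells)

    def write(self, addr, bit):
        # A stuck-at cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def mats_plus(ram):
    """Run the MATS+ march test; return True if no fault is detected."""
    n = len(ram)
    for addr in range(n):              # any order: write 0
        ram.write(addr, 0)
    for addr in range(n):              # ascending: read 0, write 1
        if ram.read(addr) != 0:
            return False
        ram.write(addr, 1)
    for addr in reversed(range(n)):    # descending: read 1, write 0
        if ram.read(addr) != 1:
            return False
        ram.write(addr, 0)
    return True
```

A fault-free model passes, while a model such as `FaultyRam(8, stuck_at={3: 1})` fails in the ascending element. In the scheme described above, the pass/fail result would be driven to PO-port 1 by a CPU OUT instruction.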
To test the CPU itself, a combination of BIST, scan and functional testing is usually used. The memories in the CPU, like the register files and the cache, can be tested using BIST. The controller inputs and outputs in the CPU can be scanned in and out to test the CPU controller, which is usually a complicated structure. After this, some test programs may be loaded into the RAM to test the ALU's and address generation logic using precomputed test sets. The pipeline protocols and cache control protocols can be tested using functional tests that exercise all the modes. Even though we cannot test CPU cores with our method, a CPU is used as an aid-in-test. It is usually straightforward to derive transparency paths through a CPU core using a couple of CPU instructions that read in data from one input port and dump it at another output port.
IV. THE TESTING PROCEDURE
In this section, we briefly formalize all the algorithms that we use for testing a core-based system. Due to the large number of complicated procedures and routines used in the complete system-on-a-chip testing scheme, it is not possible to provide complete details of all procedures in this paper. For further details on some of these procedures, one should refer to the various references, as indicated.
A. Testing Individual Cores
Fig. 13 shows the top-level pseudocode that we use to make the cores testable in the system. It takes as input the RTL circuit of a core, along with a module library that contains the precomputed test sets of each of the modules used in the RTL circuits. As explained earlier, different types of cores are tested in different ways. If it is a data-flow-intensive circuit, it runs procedure extract_TCDF to extract the TCDF graph executed in the RTL circuit. This procedure is explained in [12]. It then runs the design for hierarchical testability (DFHT) algorithm on the extracted data flow graph (DFG). DFHT places test multiplexers at certain points in the circuit and then does an HTA on the testable circuit to obtain a test set for the circuit using the module test sets. It then modifies the circuit using the test architecture so that the test sets may be applied correctly. The test architecture is as shown in Fig. 2. The DFHT algorithm is explained in [11].
If the core is a control-flow-intensive circuit, then it runs the procedure extract_TCDF on the circuit to get one particular TCDF that may be executed in the circuit. It uses this control-data flow graph (CDFG) to test as many modules as possible. This procedure is repeated to generate new TCDF graphs each time until all modules have been tested using them. Finally, it modifies the circuit with the test architecture, as shown in Fig. 3. Note that for a control-flow-intensive circuit, a single CDFG may not suffice, as a large number of CDFG's may be executed in the circuit depending on the sequence of states in the controller.
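The repeat-until-covered loop for control-flow-intensive cores can be sketched as below. This is only a schematic illustration: the real extract_TCDF procedure is described in [12] and is not reproduced here, so each extracted CDFG is modeled simply as a name plus the set of modules it can test, and all module and graph names are hypothetical.

```python
def cover_modules(modules, extracted_cdfgs):
    """Use successive CDFGs until every module has been tested.

    modules: iterable of module names in the RTL circuit.
    extracted_cdfgs: sequence of (cdfg_name, modules_testable_with_it),
        in the order in which they would be extracted.
    Returns the CDFGs actually used and any modules left untested.
    """
    untested = set(modules)
    used = []
    for name, testable in extracted_cdfgs:
        if not untested:
            break                       # all modules already covered
        newly_tested = untested & set(testable)
        if newly_tested:                # keep only CDFGs that add coverage
            used.append(name)
            untested -= newly_tested
    return used, untested


used, untested = cover_modules(
    ["ADD1", "MUL1", "SUB1"],
    [("cdfg_a", {"ADD1"}), ("cdfg_b", {"ADD1"}), ("cdfg_c", {"MUL1", "SUB1"})],
)
```

In this toy run `cdfg_b` adds no new coverage and is skipped, so two CDFGs suffice. Any module remaining in `untested` would be a candidate for additional core-level DFT.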
In case of programmable data paths, it first constructs an SCG of the data path. Then it applies procedure RTA to the SCG and the circuit to derive a set of test microcode routines for the hierarchically untestable modules. Finally, extract_TCDF and DFHT are run on this modified circuit as above. These procedures are explained in [13]. After this, the circuits are modified with the test architecture as shown in Fig. 4. Note that testability in individual cores can possibly be achieved by other DFT techniques, and our proposed system test strategy can be applied to cores made testable using any DFT technique.
B. Achieving Transparency in Individual Cores
Fig. 14 shows the pseudocode of the method used for achieving transparency in each type of core. In this case also, the different types of cores are tackled in different ways. For a data-flow-intensive circuit, first the data flow executed in the circuit is extracted with the help of procedure extract_TCDF. Then for each I/O pair of the circuit, the initial objective is set to $C_g(\tau)$ of $o_1$, where $o_1$ is a variable observable at O. Then, transparency DFHT (TDFHT) is executed on the DFG until a path from $i_1$ (a variable controllable from I) to $o_1$ is found. This TDFHT routine is just another version of the DFHT routine so that it can tackle this special case. Transparency multiplexers, which multiplex certain points of the circuits with constant values, are placed in the circuit by the TDFHT procedure, if necessary. Finally, the inputs to these multiplexers are made PI's of the core. This is done in the modify_circuit routine.
In case of a control-flow-intensive circuit, the method is similar. However, since only a partial CDFG is extracted each time, care must be taken to ensure that variables mapped to the (I/O) pair in question exist in the extracted CDFG. Otherwise, a path can never be found. Once such a CDFG is found, again TDFHT is applied as above. The status inputs of the controller are multiplexed with a set of t_inputs which are made PI's of the core (see Fig. 3 ). This is done to ensure that the control flow of the controller can be controlled as desired. The select inputs of the extra transparency multiplexers also become PI's.
For programmable data paths, the SCG is formed and for each I/O pair, the initial objectives are set as before. Then transparency RTA (TRTA) is executed on the circuit to obtain a path from input I to output O. TRTA is a slightly modified version of RTA. This leads to a transparency microcode routine that is placed in the microcode ROM of the core. Finally, after all the microcode routines are placed in the control ROM, the branch inhibit signal is made a PI of the core.
C. System-Level Test Generation and DFT
Fig. 15 shows our system-level HTA and DFT placement algorithm. The inputs to the procedure are the core-based system with the connectivity specified among the different cores, the transparency mechanism for each core, and the precomputed test sets for the cores. The procedure first introduces the system-level DFT. It qualifies the clock of each core with an enable signal and ties these signals to the outputs of a decoder. The inputs of the decoder are made external PI's. It introduces a global system test pin which overrides the enable signals during normal operation. It then puts all the test and transparency inputs of each core into STCR. It connects this register to the data bus of the CPU with the help of output ports. All this is done in procedure Apply_system_test_architecture.
Next, for each core in the system, the procedure finds out the system-level test schedule in a sequential manner. The set_initial_objectives procedure sets the initial objectives to $C_g(\tau)$ of each of the core inputs and $O_v(\tau+1)$ of each of the core outputs [the objective may be $O_v(\tau)$ if a combinational path exists from the input to the output]. The procedure system-level hierarchical testability analysis (SHTA) does justification and propagation of these objectives through other cores using the transparency mechanism provided in the cores. It uses a branch-and-bound algorithm. In case of conflicts, it backtracks to explore other paths. If all paths are exhausted without a solution, a system-level test multiplexer is added at a suitable point to aid the process and its select signal is placed in STCR. The place where a test multiplexer needs to be placed is based on a badness counter which is attached to each node in the system. Whenever justification or propagation fails at a node, its badness counter is incremented. Finally, the node with the worst badness counter is multiplexed to a system PI or PO with a system test multiplexer.
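The role of the badness counters can be illustrated with a small path search. The sketch below is a simplification under stated assumptions: it uses plain depth-first backtracking instead of the branch-and-bound search of SHTA, it ignores cycle counts, and the function names, the graph encoding, and the node set are hypothetical. Nodes that belong to the core under test are treated as blocked, so a path may not pass through them.

```python
from collections import Counter


def search(node, neighbors, terminals, blocked, badness, on_path=()):
    """Find a path from node to any terminal (a system PI or PO).

    neighbors maps a node to the nodes reachable through one transparent
    core. When every option at a node fails, its badness counter is
    incremented and the search backtracks.
    """
    if node in terminals:
        return [node]
    for nxt in neighbors.get(node, []):
        if nxt in blocked or nxt in on_path:
            continue  # conflict: skip this option and try the next one
        path = search(nxt, neighbors, terminals, blocked, badness,
                      on_path + (node,))
        if path is not None:
            return [node] + path
    badness[node] += 1
    return None


def worst_node(badness):
    """Return the node with the highest badness count, or None if no failures."""
    return max(badness, key=badness.get) if badness else None


# Propagation of the ASPP4 responses: e reaches PO-port 2 directly, while d
# only leads to a (through GCD), which is an input of the core under test.
neighbors = {"d": ["a"], "e": ["PO-port 2"]}
badness = Counter()
path_e = search("e", neighbors, {"PO-port 2"}, {"a", "b", "c"}, badness)
path_d = search("d", neighbors, {"PO-port 2"}, {"a", "b", "c"}, badness)
mux_location = worst_node(badness)
```

In this toy instance the search for d fails, d accumulates the only badness count, and `mux_location` is `"d"`, which corresponds to adding multiplexer M to observe d at PO-port 2.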
Once a set of justification and propagation paths is found for a core, it is returned as a test environment. Procedure generate_schedule uses this test environment to generate a test schedule for the core which is a program that is to be loaded into the RAM to test the core. This procedure is quite straightforward as a one-to-one mapping exists between the test environment and the test schedule program. This was explained in Section II-D. Finally, procedure generate_testset uses this schedule and the precomputed test set of the core to derive the system-level test set. This routine just takes a core's test vector, breaks it according to the core inputs, and plugs appropriate subvectors into system PI ports at different cycles using the test schedule. It also determines the clock enable inputs to be applied when the core is to be tested using the test scheduling routine.
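The mapping performed by generate_testset can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names, the 8-bit input widths, and the example bit string are assumptions, while the port names and latencies follow the ASPP4 test environment (21 cycles from PI-port 1 to input a, and no delay for inputs b and c).

```python
def split_vector(bits, widths):
    """Break a core test vector (bit string) into one subvector per core input."""
    parts, pos = {}, 0
    for name, width in widths.items():
        parts[name] = bits[pos:pos + width]
        pos += width
    if pos != len(bits):
        raise ValueError("vector length does not match the input widths")
    return parts


def system_events(bits, widths, paths, apply_cycle):
    """Place each subvector on its system PI port at the right cycle.

    paths maps a core input to (pi_port, latency_in_cycles); the subvector
    must enter the port latency cycles before the core is clocked.
    Returns (cycle, pi_port, subvector) tuples sorted by cycle.
    """
    events = []
    for core_input, sub in split_vector(bits, widths).items():
        port, latency = paths[core_input]
        events.append((apply_cycle - latency, port, sub))
    return sorted(events)


widths = {"a": 8, "b": 8, "c": 8}          # assumed 8-bit inputs
paths = {"a": ("PI-port 1", 21), "b": ("PI-port 2", 0), "c": ("PI-port 3", 0)}
events = system_events("101010101111000000001111", widths, paths, apply_cycle=22)
```

For this vector the subvector for a is driven at PI-port 1 in cycle 1, and the subvectors for b and c are driven at PI-port 2 and PI-port 3 in cycle 22. Repeating the call for every vector of the core's precomputed test set yields the system-level test set.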
V. EXPERIMENTAL RESULTS
The experimental results for the various schemes were obtained by a mixture of manual and automated techniques using various university-level tools. The testing and DFT of RTL cores using HTA was implemented in C++. The program outputs the core-level test sequence as a by-product. The transparency mechanisms were obtained by minor modifications to the above HTA routines. Another program was implemented that takes as input the RTL system-on-a-chip connectivity, the transparency mechanism through each core, and the core-level test sets. It does symbolic system-level justification and propagation and recommends test insertion at different places in the system. It also transforms each core-level test set to a system-level test set. The testable and transparent cores were synthesized to layout using SIS [29] and the Octools package from the University of California, Berkeley. A series of scripts were used to perform this synthesis. Finally, all these layouts were connected together manually along with the recommended system-level DFT, and a block place-and-route was done using the tools Puppy and Mosaico from the University of California, Berkeley. This allowed the system-level overheads to be computed. The part of the system-on-a-chip excluding the memory and CPU was synthesized at the logic level and fault simulated with the generated system-level test set using PROOFS [30].
For obtaining the overheads in the other existing methods, we used a full scan and boundary scan insertion tool at the logic level for each core and synthesized each core to layout using Octools. Then we manually connected the test bus and the boundary scan chains before doing a system-level place-and-route using Puppy and Mosaico. Fault coverage for the different systems was obtained by running HITEC [17] and PROOFS at the logic level.
We tested the proposed algorithms on six example core-based systems. The first example, System 1, is the one depicted in Section II. The second example, System 2, is another 8-bit system consisting of a CPU core connected to a memory as in the first system. Additionally, there are cores for discrete cosine transform named DCT_Lee [11] , a bar-code reader named Barcode [21] , and an ASPP called ASPP3 [13] which can execute the behaviors of Elliptic (the fifth-order elliptic wave filter), Dist and Chemical (IIR filters used in the industry). System 3 is an embedded system taken from [22] . It is used to scan barcodes from objects. System 4 is another example system from [22] consisting of a graphics processor core [23] , a GCD core [24] , and an X25 protocol core [25] . System 5 consists of an ASPP called ASPP1 [13] , a differential equation solver called Paulin [26] and a control-flow-intensive circuit called Test1 taken from [27] along with a CPU and memory. System 6 is another core-based system consisting of the circuits Tseng [26] , ASPP5 [13] , and an ASIP which is a multiplexer-based implementation of the architecture TMS32010 from Texas Instruments and taken from [28] . All the experimental results exclude the CPU and memory for reasons stated earlier.
The final test architecture of all the systems requires three external test pins. Table I compares the area overheads of the different schemes used in testing of core-based systems. The area numbers are actual layout numbers in $\lambda^2$ ($\lambda$ is the process parameter), which are generated after placement and routing. Columns 2 and 3 show the number of flip-flops and the number of literals in the original system. These columns give an idea of the size of the systems used. In Column 4, the original area of the system is given where even the individual cores are not modified using DFT. In Column 5, the area of the system is given after each core has been modified for testability using the HTA scheme given in Section II-A. Note that this area does not include the system-level DFT hardware. The overhead incurred is shown in Column 6. The area and the incurred overhead when the test bus scheme is used, after the individual cores are made testable by HTA, are given in Columns 7 and 8. Columns 9 and 10 give the corresponding numbers when the test bus is used, but the individual cores are made testable using full scan. When full scan is coupled with boundary scan to test the system, the area and the overhead incurred are as shown in Columns 11 and 12. Finally, in Columns 13 and 14, we show the area and incurred overhead when the system as well as the cores are tested using our HTA scheme. It can be seen from the results that the full scan and test bus combination incurs the worst area overhead while our scheme incurs the lowest area overhead. Note that the area overheads consist of two components: 1) the initial area overhead incurred to make individual cores testable and 2) the system-level DFT overhead which includes the overheads required to make cores transparent and the system DFT circuits added on the chip outside the cores. Though the average area overhead for our scheme is about 13.3% calculated on the basis of the original circuit, the system-level DFT overhead is only about 5.9%.
Since the core providers provide some form of testability for the cores, e.g., full scan, it is the latter system-level testability overhead that is of primary concern. Table II shows the corresponding delay numbers. Delay numbers signify the clock period of the system in ns and are obtained through SIS after technology mapping. The columns in Table II are similar to the ones in Table I with area numbers being replaced by delay numbers. Again, our scheme does better than the other schemes in terms of delay overheads. Though the average delay overhead for our scheme calculated on the basis of the original circuit is 2%, there is almost no delay overhead due to the system-level DFT. In Table III, testability results for the systems are presented. In Column 2 of the table, the fault coverage obtained by running HITEC, an efficient gate-level sequential test generation tool, on the original system is given, i.e., in this system there is no core-level or system-level DFT present. In Column 3, the corresponding test application time is given in cycles. As expected, the fault coverage of the original system is very poor. In Columns 4 and 5, the corresponding fault coverage and test application time are shown for the system in which each core has been modified by HTA, but system-level DFT has not been used. Again, HITEC has been used to generate the results. This time the fault coverage is slightly better, but still very poor, reiterating the fact that a core-based system may be poorly testable even if individual cores are highly testable. The fault coverage and test application time obtained by using the test bus on the cores made testable by HTA are given in Columns 6 and 7. Here the fault coverage is very good as all cores are tested individually with their precomputed test sets. When individual cores are fully scanned and the test bus is used, the fault coverage is very good but the test application time suffers. This can be observed from Columns 8 and 9.
The fault coverage for the test bus scheme is slightly lower than other schemes as the interconnect faults are not tested. On the other hand, if boundary scan is used along with full scan, the test application time is prohibitive, though fault coverage is still very good. This is observed from Columns 10 and 11. In our scheme of system and core-level HTA (shown in Columns 12 and 13), the fault coverage is as good as the other schemes. Note that in case of System 3 and System 4, the fault coverage is slightly lower because the circuits are relatively small, and because of the presence of some untestable and redundant faults. However, the test efficiency for these two circuits is close to 100%. The fault coverage is obtained by fault simulating our system-level test set on the testable circuit using PROOFS. The test application time is an order of magnitude better than the full scan/boundary scan technique, but worse than the HTA-test bus scheme. This is because the test vectors need to be propagated through other cores which consumes cycles. Some more core-level DFT may be necessary to minimize the test application time. For example, the input port of a core can be directly connected to an output port with the help of a test multiplexer during the propagation mode. However, this will lead to a trade off between the overhead and test application time.
VI. CONCLUSIONS
In this paper, we presented an effective and practical technique for testing systems composed of predesigned embedded cores. The method first makes each individual core highly testable using some core-level DFT hardware and precomputes a test set for each core. Then it does some further analysis on each core and adds more DFT hardware, if necessary, to make it transparent so that test data can be propagated through it during system test without information loss. Finally, the system is modified with some global test hardware, and a test schedule is found for testing each embedded core with its precomputed test set. This system-level testing strategy is independent of the way in which the individual cores are tested. On applying the test strategy to some example core-based systems, we found that the average area overhead for the system-level DFT (5.9%) is much lower than existing system-level DFT techniques for core-based systems. The average delay overhead is negligible. The test efficiency obtained on the part of the circuit which we could accurately fault simulate is very high (>99.5%). The test application time is much lower than what would be required by current industrial techniques of employing cores with full scan and either test bus or boundary scan for system-level DFT.
Future work involves obtaining extensions of the hierarchical testing technique to allow it to make cores other than ASIC's, ASPP's, and ASIP's testable and transparent.
On Comparing Functional Fault Coverage and Defect Coverage for Memory Testing

Von-Kyoung Kim and Tom Chen
Abstract-The manufacturing of high-quality and reliable semiconductor memories is very important. Many memory testing algorithms have been proposed to improve the quality of semiconductor memories by screening out different memory functional faults. However, the relationships between memory functional fault types and the types of defects which cause the functional faults are not well understood. Therefore, the effectiveness of memory testing algorithms based on the functional fault models cannot be realistically determined. This paper evaluates the effectiveness of the memory testing algorithms based on the defect coverage by comparing the defect coverage of known memory testing algorithms and the functional fault coverage of the same testing algorithms using the same defect statistics. The experimental results show that the differences among the defect coverage of the 11 memory testing algorithms other than checkerboard and sliding diagonal tests were not as significant as previously believed using memory functional fault coverage as the coverage metric.
Index Terms-Defect coverage, defect coverage estimation, effectiveness of memory tests, fault coverage, memory testing, memory testing algorithms.
I. INTRODUCTION
The rapid development in memory technology demands new memory testing algorithms not only to reduce memory testing costs but also to enhance the defect detection capability, and therefore, the quality of the product. Abadir et al. [1] and van de Goor [2] introduced various memory testing algorithms, including zero-one (also known as memory scan or MSCAN) [3], checkerboard, walking 1/0, algorithmic test sequence (ATS) [4], modified ATS (MATS) [5], MATS+ [5], and MATS++ [2]. It is commonly believed that these algorithms are designed to detect simple functional faults such as stuck-at faults. Other memory testing algorithms, including March C-, March X, and GALPAT, were also proposed [2] to detect more complex functional faults such as coupling faults. Dekker et al. [6] proposed a new march testing algorithm, 9N, which detects stuck-at, stuck-open, transition, coupling, and data retention faults. Another march algorithm, 10N, was proposed by Oberle et al. [7].
Veenstra et al. [8] evaluated the fault coverage performances of various functional memory testing algorithms. Van de Goor [9] investigated the effectiveness of the march tests over the previous testing algorithms in terms of fault coverage. Riedel et al. [10] measured functional fault coverages of several memory testing algorithms by generating and evaluating test patterns for a $3 \times 3$ memory array. All of the previous studies measuring the effectiveness of the memory testing algorithms used the functional fault coverage as the measure for test effectiveness. The functional fault coverage has certain advantages in that the functional fault models are developed to make them easier to handle and understand. Although many existing memory testing algorithms claim good fault detection capabilities, their ability to detect physical defects may not be accurately reflected
Manuscript received January 23, 1998; revised July 1, 1999. This work was supported in part by Hewlett Packard Laboratories. This paper was recommended by Associate Editor S. Reddy.
V.-K. Kim is with Sun Microelectronics, Palo Alto, CA 94303 USA (email: kvk@eng.sun.com).
T. Chen is with the Department of Electrical Engineering, Colorado State University, Fort Collins, CO 80523 USA (e-mail: chen@engr.colostate.edu).
Publisher Item Identifier S 0278-0070(99)09465-8.
0278-0070/99$10.00 © 1999 IEEE
