At-speed testing of high-speed circuits is becoming increasingly difficult with external testers.
Figure 1 shows the cost of silicon manufacturing vs. the cost of testing as projected by the 1997 SIA [1] and 1999 ITRS [2] roadmaps. The top curve shows the fab capital per transistor cost reduction (Moore's law). The bottom curve shows the test capital per transistor (Moore's law for test). From the 1997 SIA roadmap it was clear that unless fundamental changes to test are made, it might cost more to test the chip than to manufacture it in the future [2]. Figure 1 also shows the historical trend in the test paradigms. The high cost of manually developed functional tests and the difficulty of translating embedded component tests to the chip boundary, where the automatic test equipment (ATE) interface exists, are making these tests infeasible even for very high volume products. On the other hand, even if automatically developed structural tests are available, applying them with ATEs poses challenges because tester performance is increasing at a slower rate than device speed. This translates into an increasing yield loss due to external testing, since guardbanding to cover tester errors results in the loss of more and more good chips. In addition, high-speed testers are very costly.
[Figure 1 labels. Axis: Cost of Silicon Mfg and Test. Test paradigms: structural testing (scan, ATPG); built-in self-test (embedded HW tester); embedded SW-based self-test (embedded SW tester).]
Figure 1: Fab vs. Test Capital.
The 1999 ITRS roadmap regards built-in self-test (BIST) and design-for-testability (DfT) as possible solutions for changing the direction of the bottom curve in Figure 1.
BIST solutions eliminate the need for high-speed testers and offer the ability to apply and analyze at-speed test signals on chip with greater accuracy than that of the tester.
Existing BIST techniques belong to the class of structural BIST. Structural BIST, such as scan-based BIST techniques [3] [4] [5], offers good test quality but requires the addition of dedicated test circuitry (such as full-scan, LFSRs for pattern generation, MISRs for data analysis and test controllers). Therefore, it incurs non-trivial area, performance and design time overhead.
Moreover, structural BIST applies non-functional, high-switching random patterns and thus, causes much higher power consumption than normal system operations. Also, to apply at-speed tests to detect timing related faults, existing structural BIST needs to resolve various complex timing issues related to multiple clock domains, multiple frequencies and test clock skews that are unique in the test mode.
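As a minimal software model of the two structural BIST blocks named above, the sketch below implements a Fibonacci-style LFSR pattern generator and a MISR-style response compactor. The function names, the 4-bit width and the tap positions are assumptions chosen for illustration; they are not taken from the cited BIST schemes.

```python
def lfsr_patterns(seed, taps, width, count):
    """Yield `count` successive states of a Fibonacci LFSR.

    `taps` lists the state bits that are XORed to form the feedback bit.
    """
    mask = (1 << width) - 1
    state = seed & mask
    for _ in range(count):
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & mask


def misr_signature(responses, taps, width):
    """Fold a stream of response words into a single signature word."""
    mask = (1 << width) - 1
    signature = 0
    for response in responses:
        feedback = 0
        for t in taps:
            feedback ^= (signature >> t) & 1
        # Shift with feedback, then XOR the parallel response inputs.
        signature = (((signature << 1) | feedback) ^ response) & mask
    return signature
```

With taps (3, 2) on a 4-bit register, the generator cycles through all 15 non-zero states before repeating, which is why such structures are attractive as compact on-chip pattern sources.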
A new embedded software-based self-testing paradigm [6] [7] [8] has the potential to alleviate the problems due to the use of external testers as well as the embedded hardware tester problems described above. In this testing strategy, it is assumed that programmable cores on the SoC (such as processor, DSP, and FPGA cores) are first self-tested by running an automatically synthesized test program which can achieve high fault coverage. Next, the programmable core is used as a pattern generator and response analyzer to test on-chip buses, interfaces between cores or even other cores including digital, mixed-signal and analog components. This solution is sometimes also referred to as functional self-testing.
The concept of embedded software-based self-testing is illustrated in Figure 2 using a bus-based SoC. The IP cores in the SoC are connected to a Peripheral Component Interconnect (PCI) [9] bus via the virtual component interface (VCI) [10]. The VCI acts as a standard communication interface between the IP core and the on-chip bus. First, the microprocessor tests itself by executing a set of instructions. Next, the processor can be used for testing the bus as well as other non-programmable IP cores in the SoC. In order to support the self-testing methodology, the IP core has a test wrapper around it. The test wrapper contains test support logic needed to control shifting of the scan chain, buffers to store scan data and support at-speed test, etc. In this example, the on-chip bus is a shared bus and the arbiter controls access to the bus.
There are several advantages of the embedded software-based self-test approach. First, it allows reuse of programmable resources on SoCs for test purposes. In other words, this strategy views testing as an application of the programmable components in the SoC and thus minimizes the addition of dedicated test circuitry for design-for-testability or self-test.
Second, in addition to eliminating the need for costly high-speed testers, it can also reduce the yield loss due to tester accuracy problems. Self-testing offers the ability to apply and analyze at-speed test signals on chip with accuracy greater than that available with the tester.
Third, while the hardware-based self-test must be applied in the non-functional BIST mode, software-based self-test can be applied in the normal operational mode of the design, i.e., the tests are applied by executing instruction sequences as in regular system operations. This eliminates the problems created by application of non-functional patterns that can result in excessive power consumption when hardware BIST is used.
Also, functional self-test can alleviate the over-testing and yield loss problems due to the application of non-functional patterns during structural delay testing (through at-speed scan or BIST). Experiments have shown that many structurally testable delay faults in microprocessors can never be sensitized in the functional mode of the circuit [7]. This is because no functionally applicable vector sequence can excite these delay faults and propagate the fault effects into destination outputs/flip-flops at-speed. Defects that cause only such faults do not affect circuit performance, so testing for them is unnecessary. However, if the circuit is tested by applying non-functional patterns, these defects could be detected and the chip could be identified as faulty, resulting in yield loss.
Software-based fault localization tools are on the high-priority list according to the 1999 ITRS roadmap [2]. In addition to self-testing, functional information can also be used to guide diagnostic self-test program synthesis.
Testing of analog circuits has been a costly process because of the limited access to the analog parts and the testers required to perform functional testing. The situation has become worse due to the trend of integrating various digital and analog cores onto the SoC, in which testing the analog parts becomes the bottleneck of production testing. Using DSP-based approaches for self-testing of on-chip ADC/DAC and analog components is a promising direction toward alleviating these problems and reducing the test cost for such components [37].
In this paper, we discuss embedded software-based self-testing and self-diagnosis methods for core-based SoC designs. We start by describing processor self-test methods targeting stuck-at faults and delay faults. We also give a brief description of a processor self-diagnosis method. Next, we continue with a discussion of methods for self-testing of buses and global interconnects as well as other non-programmable IP cores on the SoC. We also describe instruction-level DfT methods based on insertion of test instructions to increase the fault coverage and reduce the test application time and test program size. Finally, we briefly summarize DSP-based self-test for analog/mixed-signal components.
Embedded Processor Self-Testing
While logic BIST may perform well on industrial application specific integrated circuits (ASICs), it is less feasible on microprocessors. This is because the design changes required for making a microprocessor BIST-ready (e.g., immune to problems such as bus contentions even when pseudorandom test patterns are applied) may come with unacceptable cost, such as substantial manual effort and significant performance degradation. In addition, microprocessors are especially random pattern-resistant. Due to the timing-critical nature of microprocessors, test points may not be acceptable as a solution to this problem, as they could degrade circuit performance by introducing extra logic on critical paths. Deterministic BIST, on the other hand, may lead to unacceptable area overhead, as the size of the on-chip hardware for encoding deterministic test patterns depends on the circuit testability [11].
A number of approaches have been proposed to generate functional tests for microprocessors [12] [13] [14] [15] [16]. Some propose to apply the tests with external testers [12], while others allow the processors to test themselves with self-test programs [13] [14] [15] [16]. A common characteristic of the approaches in [13] [14] is the application of randomized instructions to the processor under test.
However, although processors are more amenable to random-instruction tests than to random-pattern tests, it is difficult to target structural faults by applying random instructions at the processor level.
Approaches in [15] [16] use structural ATPG to generate tests for stuck-at faults in the processor.
They use the RTL information of the processor to form a set of RTL-module equations that can realize the generated gate-level test. Solving the set of equations specifies the required instruction sequences and operands. All of the above approaches target only stuck-at faults and the methods cannot be easily generalized for delay faults.
Unlike hardware-based self-testing, software-based testing is non-intrusive since it applies tests in the normal operational mode of the circuit. Moreover, software instructions have the ability of guiding the test patterns through a complex processor, avoiding the blockage of the test data due to non-functional control signals as in the case of hardware-based logic BIST.
Embedded software-based self-test methods for processors have been proposed in [6] [7] [8] .
These methods consist of two steps: the test preparation step and the self-testing step. The test preparation step involves generation of realizable tests for components of the processor. Realizable tests are those that can be delivered using instructions. Therefore, to avoid producing undeliverable test patterns, the tests are generated under the constraints imposed by the processor instruction set.
The tests can then be either stored or generated on-chip, depending on which method is more efficient for a particular case. A low-speed tester can be used to load the self-test signatures or the predetermined tests to the processor memory prior to the application of tests. Note that the inability to apply all possible input patterns to a microprocessor component does not necessarily map to low fault coverage. If a fault can only be detected by test patterns outside the allowed input space, by definition, the fault is redundant in the normal operational mode of the processor. Thus, there is no need to test for this fault.
The self-testing step, illustrated in Figure 3, involves the application of the realizable tests using a software tester. The software tester can also compress the responses into self-test signatures that can then be stored into the memory. The signatures can later be unloaded and analyzed by an external tester. The assumption here is that the processor memory has been tested with standard techniques such as memory BIST before the application of the test and the memory is assumed to be fault-free. In the following, we describe in more detail the embedded software-based self-test method for testing stuck-at [6] and path delay faults [7] [8] in microprocessors using their instruction set.
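The self-testing step can be pictured with a small software model: tests sit in memory, each one is applied through an ordinary instruction, and the responses are folded into a signature that is written back to memory for later unloading. The sketch below is an illustration only. The rotate-and-XOR compaction, the memory layout and the 8-bit adder standing in for an instruction are all assumptions, not the scheme used in [6] [7] [8].

```python
def rotate_xor(signature, response, width=16):
    """Fold one response into the signature: rotate left by one, then XOR."""
    mask = (1 << width) - 1
    rotated = ((signature << 1) | (signature >> (width - 1))) & mask
    return rotated ^ response


def run_self_test(memory, execute, tests_at="tests", signature_at="signature"):
    """Apply every stored operand pair through `execute`; store the signature."""
    signature = 0
    for a, b in memory[tests_at]:
        signature = rotate_xor(signature, execute(a, b))
    memory[signature_at] = signature  # later unloaded by the external tester
    return signature


def good_add(a, b):
    """Fault-free 8-bit ADD, standing in for one processor instruction."""
    return (a + b) & 0xFF


def faulty_add(a, b):
    """Same ADD with result bit 0 stuck-at-0."""
    return good_add(a, b) & 0xFE
```

Running `run_self_test` once with `good_add` gives the reference signature; a device whose signature differs from it is rejected.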
Stuck-at Fault Testing
The software-based self-test method proposed by Chen and Dey [6] targets structural faults in a processor core using a divide-and-conquer approach. First, it determines the structural test needs for sub-components in the processor (e.g., ALU, program counter) that are much less complex than the full processor, and hence more amenable to random pattern testing. Next, the component tests are either stored or generated on-chip and then, at the processor level, delivered to their target components using predetermined instruction sequences.
To make sure that the test patterns generated for a sub-component-under-test can be delivered by instructions, the test preparation step precedes the self-test step.
Test preparation. To derive the realizable component tests (i.e., tests deliverable by instructions), the instruction-imposed constraints have to be first derived for each component. These constraints can be divided into input and output constraints. The input constraints define the input space of the component allowed by instructions. They describe the correlation among the inputs to the component and can be expressed in the form of Boolean equations. The output constraints define the subset of component outputs observable by instructions. To obtain a close prediction of fault coverage in component-level fault simulation, errors propagating to component outputs that are unobservable at processor-level are regarded as unobserved. Also, the constraints imposed by the processor instruction set can be divided into those that can be specified in a single time frame (spatial constraints) and those that span over several time frames (temporal constraints). Temporal constraints are used to account for the loss of fault coverage due to fault aliasing, in the cases where the application of one test pattern involves multiple passes through a fault inside the component.
If component tests are generated by automatic test pattern generation (ATPG), the spatial constraints can be specified during test generation with the aid of the ATPG tool. Alternatively, they can be specified with virtual constraint circuits as proposed in [17] . Similarly, temporal constraints can be modeled with sequential virtual constraint circuits. Different from the case of ATPG, if random tests are used for components, random patterns can only be used on independent inputs.
Component-level fault simulation is used for evaluating the preliminary fault coverage of these tests.
The final fault coverage can be evaluated with processor-level fault simulation once the entire self-test program is constructed. Although component tests are generated only for the subset of components that are easily accessible through instructions (e.g., ALU, program counter), other components such as the instruction decoder are expected to be tested extensively during the application of the self-test program. Using manually extracted constraints, the above scheme has been applied to a simple Parwan processor [20]. The results have demonstrated the feasibility and effectiveness of the software-based self-test method by generating a high-coverage test program for the simple processor.
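The role of spatial input constraints in test preparation can be illustrated with a small sketch. Random values are drawn only for the independent inputs (the data operands), while the correlated control inputs are taken from the combinations the instruction set allows. The opcode table below is hypothetical and serves only to show the mechanism; real constraints would be extracted from the processor under test.

```python
import random

# Hypothetical constraint: control bits (s1, s0) and carry-in are correlated,
# because each opcode of this illustrative ALU fixes all three at once.
ALLOWED_CONTROL = {
    "ADD": (0, 0, 0),  # (s1, s0, cin)
    "SUB": (0, 1, 1),
    "AND": (1, 0, 0),
}


def is_realizable(pattern):
    """True if the pattern's control bits can be produced by some instruction."""
    control = (pattern["s1"], pattern["s0"], pattern["cin"])
    return control in ALLOWED_CONTROL.values()


def constrained_random_patterns(count, width=8, seed=0):
    """Yield component patterns that are random only on independent inputs."""
    rng = random.Random(seed)
    for _ in range(count):
        s1, s0, cin = ALLOWED_CONTROL[rng.choice(sorted(ALLOWED_CONTROL))]
        yield {
            "a": rng.getrandbits(width),  # independent data input
            "b": rng.getrandbits(width),  # independent data input
            "s1": s1, "s0": s0, "cin": cin,
        }
```

A pattern produced by an unconstrained ATPG run can be screened with `is_realizable` before any effort is spent on mapping it to instructions.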
Delay Testing
Ensuring that the designs meet performance specifications requires application of delay tests. These tests need to be applied at-speed and require two-vector patterns to activate and propagate the fault effects. A software-based self-test method aiming at delay faults in processor cores has been proposed by Lai et al. [7] [8]. As in the case of stuck-at faults, not all delay faults in the microprocessor can be tested in the functional mode. This is simply because no instruction sequence can produce the desired test sequence which can sensitize the path and capture the fault effect into the destination output/flip-flop at-speed. A fault is said to be functionally testable if there exists a functional test for that fault. Otherwise, the fault is functionally untestable. To illustrate functionally untestable faults, consider the microprocessor datapath in Figure 4 (Parwan processor [20]). It contains an 8-bit ALU, an accumulator (AC) and an instruction register (IR). The data inputs of the ALU, A7-A0 and B7-B0, are connected to the internal data bus and the accumulator, respectively. The control inputs of the ALU are S2-S0. The values on S2-S0 instruct the ALU to perform the desired arithmetic/logic operation. The outputs of the ALU are connected to the inputs of AC and the inputs of IR. It can be shown that for all possible instruction sequences, whenever a rising transition occurs on signal S1 at the beginning of a clock cycle, AC and IR can never be enabled at the end of the same cycle. Therefore, paths that start at S1 and end at the inputs of IR or AC are functionally untestable, since delay effects on them can never be captured by IR or AC immediately after the vector pair has been applied. The goal of the test preparation step is to identify functionally testable faults and synthesize tests for them.
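The S1 argument above amounts to checking every pair of consecutive clock cycles that any instruction sequence can produce. The sketch below performs that check on a toy machine. The per-cycle control table is hypothetical (it is not the Parwan microcode) and only illustrates how a launch transition and a capture enable must coincide for a path to be functionally testable.

```python
# Hypothetical per-cycle control words for a toy three-instruction machine.
# Each cycle is (S1, load_AC, load_IR); every instruction starts with a fetch
# cycle that loads IR.
CYCLES = {
    "LDA": [(0, 0, 1), (0, 1, 0)],
    "AND": [(0, 0, 1), (1, 0, 0), (1, 1, 0)],
    "JMP": [(0, 0, 1), (0, 0, 0)],
}
S1, LOAD_AC, LOAD_IR = 0, 1, 2


def consecutive_cycle_pairs(cycles):
    """Enumerate every (previous, current) cycle pair a program can produce."""
    for seq in cycles.values():
        yield from zip(seq, seq[1:])        # inside one instruction
    for first in cycles.values():
        for second in cycles.values():
            yield first[-1], second[0]      # across an instruction boundary


def functionally_testable(cycles, rising, capture):
    """True if some cycle launches the S1 transition and enables `capture`."""
    before, after = (0, 1) if rising else (1, 0)
    return any(
        prev[S1] == before and cur[S1] == after and cur[capture] == 1
        for prev, cur in consecutive_cycle_pairs(cycles)
    )
```

In this toy table a rising S1 is never accompanied by an enabled AC or IR in the same cycle, so paths launched by it are functionally untestable, whereas a falling S1 coincides with an IR load and remains a valid target.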
Test preparation. The flow of the test program synthesis for self-test of path delay faults in a microprocessor using its instructions is shown in Figure 5. Given the instruction set architecture and the micro-architecture of the processor core, the spatial and temporal constraints between and at the registers and control signals are first extracted. Next, a path classification algorithm, extended from [21] [22], implicitly enumerates and examines all paths and path segments. If a path cannot be sensitized with the imposed extracted constraints, the path is functionally untestable and is thus eliminated from the fault universe. This helps reduce the computational effort of the subsequent test generation process. As the experimental results in Table 2 show, a high percentage of the paths are functionally untestable [7]. The results are given for the Parwan processor [20] and the DLX processor [23] for non-robustly (NR) testable and functionally sensitizable (FS) faults [24]. Neither of these microprocessors is pipelined.
[Figure 5: test program synthesis flow. Inputs: instruction set architecture, micro-architecture and netlist. Steps: constraint extraction, path classification, constrained structural ATPG, test program synthesis. Output: test program.]
Next, a subset of long paths among the functionally testable paths is selected as targets for test generation. A gate-level ATPG for path delay faults is extended to incorporate the extracted constraints into the test generation process, and it is used to generate test vectors for each target path delay fault. If the test is successfully generated, it not only sensitizes the path but also meets the extracted constraints. Therefore, it is most likely to be deliverable by instructions (if the complete set of constraints has been extracted, delivery by instructions could be guaranteed). In the test program synthesis process that follows, the test vectors specifying the bit values at internal flip-flops are first mapped back to word-level values in registers and values at control signals. These mapped value requirements are then justified at the instruction level. Finally, a pre-defined propagating routine is used to propagate the fault effects captured in the registers/flip-flops of the path delay fault to the memory.
This routine compresses the contents of some or all registers in the processor, generates a signature and stores it in memory. The procedure is repeated until all target faults have been processed. The test program is generated off-line and will be used to test the microprocessor at-speed.
Embedded Processor Self-Diagnosis
In addition to enabling at-speed self-test with low-cost testers, software-based self-test eliminates the use of scan chains and the associated test overhead, making itself an attractive solution for testing high-end microprocessors. The elimination of scan chains, on the other hand, poses a significant challenge for fault diagnosis. Though deterministic methods for generating diagnostic tests are available for combinational circuits [25] 
Self-Testing of Buses and Global Interconnects
In SoC designs a large amount of core-to-core communications must be realized with long interconnects. As gate delay continues to decrease, the performance of interconnect is becoming increasingly important in achieving a high overall performance. However, due to the increase of cross-coupling capacitance and mutual inductance, signals on neighboring wires may interfere with each other, causing excessive delay or loss of signal integrity. While many techniques have been proposed to reduce crosstalk, due to the limited design margin and unpredictable process variations, the crosstalk must also be addressed in manufacturing testing.
Due to its timing nature, testing for crosstalk effects needs to be conducted at the operational speed of the circuit-under-test. However, at-speed testing of GHz systems requires prohibitively costly high-speed testers. Moreover, with external testing, hardware access mechanisms are required for applying tests to interconnects deeply embedded in the system. This may lead to unacceptable area or performance overhead.
A BIST technique in which an SoC tests its own interconnects for crosstalk defects using on-chip hardware pattern generators and error detectors has been proposed in [28]. Although the area overhead may be amortized for large systems, for small systems the relative area overhead may be unacceptable. Moreover, hardware-based self-test approaches, such as the one in [28], may cause over-testing and yield loss, as not all test patterns generated in the test mode are valid in the normal operational mode of the system.
The problem of testing system-level interconnects in embedded processor-based SoCs, which are the most dominant type of SoCs, has been addressed in [29] [30]. In such SoCs, most of the system-level interconnects, such as the on-chip buses, are accessible to the embedded processor core(s). The proposed methodology is software-based and it enables an embedded processor core in the SoC to test for crosstalk effects in these interconnects by executing a software program. The strategy is to let the processor execute a self-test program with which the test vector pairs can be applied to the appropriate bus in the normal functional mode of the system. In the presence of crosstalk-induced glitch or delay effects, the second vector in the vector pair becomes distorted at the receiver end of the bus. The processor can then store this error effect to the memory as a test response, which can be later unloaded by an external tester for off-chip analysis.
Compared with external testing, self-testing is a more feasible solution for at-speed crosstalk testing, as it does not impose any performance requirement on the external tester. Also, compared to the hardware-based self-test approaches for crosstalk testing, software-based self-test can be applied in the normal operational mode of the processor. Therefore, no extra hardware is needed. In addition, only the test patterns valid in the normal operational mode of the processor are applied.
Thus, the system will not be over-tested. In a core-based SoC, the address, data and control buses are the main types of global interconnects with which the embedded processors communicate with memory and other cores of the SoC via memory-mapped I/O. Chen et al. [29] concentrate on testing the data and address buses in a processor-based SoC. The crosstalk effects on the interconnects are modeled using the maximal aggressor (MA) fault model.
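The section does not spell out the MA fault model, so the sketch below relies on the way it is commonly summarized, which should be read as an assumption: for each victim wire, all other wires act as aggressors switching together, giving four faults per victim (positive glitch g_p, negative glitch g_n, rising delay d_r, falling delay d_f) and hence 4N vector pairs for an N-wire bus. The function name and bit ordering (bit i is wire i) are illustrative.

```python
def ma_tests(width):
    """Return {(victim, fault): (v1, v2)} for a bus of `width` wires.

    Assumed MA definitions:
      g_p: victim stays 0, aggressors rise
      g_n: victim stays 1, aggressors fall
      d_r: victim rises,   aggressors fall
      d_f: victim falls,   aggressors rise
    """
    mask = (1 << width) - 1
    tests = {}
    for victim in range(width):
        vbit = 1 << victim
        others = mask & ~vbit
        tests[(victim, "g_p")] = (0, others)
        tests[(victim, "g_n")] = (mask, vbit)
        tests[(victim, "d_r")] = (others, vbit)
        tests[(victim, "d_f")] = (vbit, others)
    return tests
```

For a 4-wire bus this yields 16 vector pairs; for example, the d_f test for wire 3 is the pair (1000, 0111).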
Testing Data Bus.
For a bi-directional bus such as data bus, crosstalk effects vary as the bus is driven from different directions. Thus crosstalk tests need to be conducted in both directions [28] .
However, to apply a vector pair (v1, v2) in a particular bus direction, the direction of v1 is irrelevant. Only v2 needs to be applied in the specified direction. This is because the signal transition triggering the crosstalk effect takes place only when v2 is being applied to the bus.
To apply a test vector pair (v1, v2) for the data bus from an SoC core to the CPU, the CPU first exchanges data v1 with the core. The direction of data exchange is irrelevant. For example, if the core is the memory, the CPU may either read v1 from the memory or write v1 to the memory. The CPU then requests data v2 from the core (a memory-read if the core is memory). Upon the arrival of v2, the CPU writes v2 to memory for later analysis.
To apply a test vector pair (v1, v2) to the data bus from the CPU to an SoC core, the CPU first exchanges data v1 with the core. Then, the CPU sends data v2 to the core (a memory-write if the core is memory). If the core is memory, v2 can be directly stored to an appropriate address for later analysis. Otherwise, the CPU must execute additional instructions to retrieve v2 from the core and store it to memory.
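The two data bus procedures above can be sketched with a toy bus model in which a defect distorts v2 only for one specific vector pair and one direction. The class, function names and memory addresses are hypothetical, and the write of the result to memory is not routed through the bus model, which is a simplification.

```python
class DataBus:
    """Toy bidirectional bus; `defects` maps (v1, v2, direction) to what arrives."""

    def __init__(self, defects=None):
        self.defects = defects or {}
        self.previous = 0

    def transfer(self, value, direction):
        seen = self.defects.get((self.previous, value, direction), value)
        self.previous = value  # the driver still holds the intended value
        return seen


def core_to_cpu_test(bus, memory, v1, v2, scratch=0, source=1, result=2):
    """Apply (v1, v2) with v2 travelling from the memory core to the CPU."""
    memory[source] = v2                               # preloaded before the test
    memory[scratch] = bus.transfer(v1, "cpu->core")   # direction of v1 is irrelevant
    received = bus.transfer(memory[source], "core->cpu")
    memory[result] = received                         # kept for later analysis
    return received == v2


def cpu_to_core_test(bus, memory, v1, v2, scratch=0, result=2):
    """Apply (v1, v2) with v2 travelling from the CPU to the memory core."""
    memory[scratch] = bus.transfer(v1, "cpu->core")
    memory[result] = bus.transfer(v2, "cpu->core")    # memory stores what it sees
    return memory[result] == v2
```

A defect that only shows up when the bus is driven from the core side is caught by the first routine and missed by the second, which is why both directions have to be exercised.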
Testing Address Bus.
To apply a test vector pair (v1, v2) to the address bus, which is a unidirectional bus from the CPU to an SoC core, the CPU requests data from two addresses (v1 and v2) in consecutive cycles. In the case of a non-memory core, since the CPU addresses the core via memory-mapped I/O, v2 must be the address corresponding to the core. If v2 is distorted by crosstalk, the CPU would be receiving data from a wrong address, v2', which may be a physical memory address or an address corresponding to a different core. By keeping different data at v2 and v2' (i.e., mem[v2] ≠ mem[v2']), the CPU is able to observe the error and store it to memory for analysis. Figure 7 illustrates this process. For example, in the case where the CPU is communicating with a memory core, to apply test (0001, 1110) to the address bus from the CPU to the memory core, the CPU first reads data from address 0001. The CPU then reads data from address 1110. The feasibility of this method has been demonstrated by applying it to test the interconnects of a processor-memory system. The defect coverage was evaluated using a system-level crosstalk defect simulation method.
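A minimal sketch of the address bus test, assuming a toy bus model in which a defect turns v2 into v2' for one specific vector pair. The class and function names are hypothetical. The second memory image in the usage note shows why the condition mem[v2] ≠ mem[v2'] matters.

```python
class AddressBus:
    """Toy unidirectional bus; `defects` maps (v1, v2) to the distorted address."""

    def __init__(self, defects=None):
        self.defects = defects or {}
        self.previous = 0

    def drive(self, address):
        seen = self.defects.get((self.previous, address), address)
        self.previous = address  # the CPU still drives the intended address
        return seen


def address_bus_test(bus, memory, v1, v2):
    """Read v1 then v2 in consecutive accesses; compare the data seen for v2."""
    def read(address):
        return memory[bus.drive(address)]

    read(v1)
    observed = read(v2)
    return observed == memory[v2]  # expected data is known when the test is built
```

With distinct data at every address, a defect that maps 1110 to 1100 after 0001 is observed as wrong data. If both locations held the same value, the same defect would go unnoticed.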
Functionally Maximal Aggressor Tests. Even though the MA tests have been proven to cover all physical defects related to crosstalk between interconnects, Lai et al. [30] observe that many of them can never occur during normal system operation due to constraints imposed by the system. Therefore, testing buses using MA tests might screen out chips that are functionally correct under any pattern produced under normal system operation. Instead, Functionally Maximal Aggressor (FMA) tests, which meet the system constraints and can be delivered in the functional mode, are proposed [30]. These tests provide complete coverage of all crosstalk-induced logical and delay faults that can cause errors during the functional mode. The flow of the self-test strategy for testing crosstalk-induced faults on on-chip buses is shown in Figure 8. Given the timing diagrams of all bus operations, the spatial and temporal constraints imposed on the buses can be extracted. These constraints are imposed by the functionality of the bus protocol or by the processor core. Next, the FMA tests are generated. They represent vectors that are applicable in the functional mode and can enable the maximal number of aggressors to a victim wire.
[Figure 8: self-test flow for crosstalk-induced faults on on-chip buses. Labels: timing diagrams of all bus operations; FMA test generation; test program synthesis.]
A covering relationship between vectors extracted from the timing diagrams of the bus commands is used during the FMA test generation process. For example, consider the three vector pairs, P1, P2 and P3, in Figure 9. All three patterns target a d_f fault on wire Y3. Pattern P1 is the best since it is an MA test. Pattern P2 is a better test than pattern P3 since the falling transition on Y2 in P3 weakens the combined strength of the aggressors on the victim wire Y3. In contrast, the constant value (i.e., 00 or 11) on Y2 in pattern P2 does not increase or decrease the combined strength of the aggressors. Therefore, any crosstalk defect that can be detected by P3 will also be detected by P2. In other words, pattern P1 covers P2 and P2 covers P3. If pattern P1 cannot be applied due to functional constraints and we need to choose between P2 and P3, pattern P2 is the better choice. The same covering relationship was previously used in [32] to evaluate the crosstalk fault coverage achieved by any given test set. Each of the FMA tests can be directly mapped into an instruction sequence. However, this simple mapping could lead to a lengthy program that might dramatically increase the test application time. To reduce the length of the test program, the FMA tests can first be compacted [30]. Since the resulting FMA tests are highly regular, the test program can be synthesized algorithmically by a software routine.
The synthesized test program is highly modularized and very small. Experimental results have shown that a test program as small as 3K bytes can detect all crosstalk defects on the bus from the processor core to the target core.
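The covering relationship among P1, P2 and P3 described earlier can be sketched as a per-wire comparison. The encoding below is an assumption made for illustration: vectors are bit strings ordered Y1 to Y4, the victim is Y3, and the target is taken to be a falling-delay fault, so a rising transition on another wire acts as an aggressor, a constant value is neutral, and a falling transition weakens the effect. The concrete patterns are invented, since Figure 9 is not reproduced here.

```python
def wire_transitions(v1, v2):
    """Per-wire transition: +1 rising, -1 falling, 0 constant."""
    return [int(b) - int(a) for a, b in zip(v1, v2)]


def covers(p, q, victim, aggressor_dir):
    """True if vector pair `p` is at least as strong a test as `q`.

    `aggressor_dir` is +1 if rising aggressors excite the fault, -1 if falling.
    """
    tp, tq = wire_transitions(*p), wire_transitions(*q)
    # Both pairs must excite the victim wire in exactly the same way.
    if tp[victim] != tq[victim] or p[0][victim] != q[0][victim]:
        return False
    return all(
        tp[i] * aggressor_dir >= tq[i] * aggressor_dir
        for i in range(len(tp)) if i != victim
    )


# Illustrative patterns on wires Y1..Y4 (string index 0..3), victim Y3.
P1 = ("0010", "1101")  # all other wires rise: an MA test
P2 = ("0010", "1001")  # Y2 held constant
P3 = ("0110", "1001")  # Y2 falls, helping the victim
```

Here `covers(P1, P2, 2, +1)` and `covers(P2, P3, 2, +1)` both hold while the reverse comparisons do not, matching the ordering given in the text.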
Next, the synthesized test program is applied to the bus from the processor core, and the input buffers of the destination core capture the responses at the other end of the bus. Such responses need to be read back by the processor core to determine whether or not any faults on the bus occurred. However, because the input buffers of a non-memory core cannot be read by the processor core, a DfT scheme is suggested to allow direct observability of the input buffers by the processor core. The DfT circuitry consists of bypass logic added to each I/O core to improve its testability.
With the DfT support on the target I/O core, the test generation procedure first synthesizes instructions to set the target core to the bypass mode and then it continues with synthesizing instructions for the FMA tests. The test generation procedure does not depend on the functionality of the target core.
Self-Testing of Other Non-Programmable IP-Cores
Testing non-programmable cores on an SoC is a complex problem with many unresolved issues [33] . Industry initiatives such as the IEEE P1500 Working Group [34] provide some solutions for IP core testing. However, they do not address the requirements of at-speed testing.
A self-testing approach for non-programmable cores on an SoC has been proposed in [33] . In this approach, a test program running on the embedded processor delivers test patterns to other IP cores in the SoC at-speed. The test patterns can be generated on the processor itself or fetched from an external ATE and stored in on-chip memory. This alleviates the need for dedicated test circuitry for pattern generation and response analysis. The approach is scalable to large size IP cores whose structural netlists are available. Since the pattern delivery is done at the SoC operational speed, it supports delay test. A test wrapper (shown in Figure 2 ) is inserted around each core to support pattern delivery. It contains test support logic needed to control shifting of the scan chain, buffers to store scan data and support at-speed test, etc.
The test flow based on the embedded software self-testing methodology is illustrated in Figure 10. It offers tremendous flexibility in the type of tests that can be applied to the IP cores as well as in the quality of the test pattern set, without entailing significant hardware overhead. Again, the flow is divided into a pre-processing phase and a testing phase. In the pre-processing phase, a test wrapper is automatically inserted around the IP core under test. The test wrapper is configured to meet the specific testing needs of the IP core. The IP core is then fault simulated with different sets of patterns. Weighted random patterns generated with multiple weight sets or using multiple capture cycles [5] after each scan sequence are used in [33] .
These patterns can achieve desired fault coverage with shorter test length as compared to pseudorandom testing. Also, since they are software-generated they do not incur hardware overhead as in the case of weighted random testing in hardware BIST. Next, a high-level test program is generated.
This program synchronizes the software pattern generation, start of the test, application of the test and analysis of the test response. The program can also synchronize testing multiple cores in parallel. The test program is then compiled to generate a processor specific binary code.
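The software pattern generation step can be illustrated with a minimal sketch, assuming each scan bit is drawn independently and is 1 with a probability given by its weight. The function names, the weight values and the seeding scheme are hypothetical and are not taken from [33] .

```python
import random


def weighted_random_patterns(weights, count, seed=0):
    """Return `count` patterns; bit i of each pattern is 1 with probability weights[i]."""
    rng = random.Random(seed)  # seeded so the same test set can be regenerated on-chip
    return [[1 if rng.random() < w else 0 for w in weights] for _ in range(count)]


def multi_weight_patterns(weight_sets, count_per_set, seed=0):
    """Concatenate the patterns produced by several weight sets (assumed scheme)."""
    patterns = []
    for k, weights in enumerate(weight_sets):
        patterns.extend(weighted_random_patterns(weights, count_per_set, seed + k))
    return patterns


# Hypothetical example: two weight sets for a 4-bit scan chain.
example = multi_weight_patterns([[0.9, 0.9, 0.1, 0.5], [0.1, 0.5, 0.9, 0.9]], 8)
```

Because the weights are ordinary program data, changing or adding a weight set only changes the test program, which is the reason no hardware overhead is incurred.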
In the test phase, the test program is run on the processor core to test various IP cores. A test packet is sent to the IP core test wrapper informing it about the test application scheme (e.g., single or multiple capture cycle). Data packets are then sent to load the scan buffers and the PI/PO buffers.
The test wrapper applies the required number of scan shifts and captures the test response for the programmed number of functional cycles. The results of the test are stored in the PI/PO buffers and the scan buffers and from there they are read out by the processor core.
Testing delay faults requires application of vectors at-speed. The embedded software-based self-test methodology in [33] allows application of patterns to an IP core at-speed between scan boundaries using the multiple capture test scheme [5] , which produces better results than pseudorandom testing of delay faults.
Instruction Level DFT/Test Instructions
Self-testing for manufacturing defects in an SoC by running test programs on a programmable core has several potential benefits, including at-speed testing, low DfT overhead due to the elimination of dedicated test circuitry, and better power and thermal management during testing. However, such a self-test strategy might require a lengthy test program and might not achieve sufficiently high fault coverage. These problems can be alleviated by applying a DfT methodology based on adding test instructions to an on-chip programmable core such as a microprocessor core. This methodology is called instruction-level DfT.
Instruction-level DfT inserts test circuitry in the form of test instructions and should be a less intrusive approach as compared to the gate-level DfT techniques which attempt to create a separate test mode somewhat orthogonal to the functional mode. If the test instructions are carefully designed such that their micro-instructions reuse the datapath for the functional instructions and do not require any new datapath, the overhead, which only occurs in the controller, should be relatively low. This methodology is also more attractive for applying at-speed tests and for power/thermal management during test, as compared to the existing logic BIST approaches.
Instruction-level DfT methods have been proposed in [13] [36] . The approach in [13] adds instructions to control the exceptions such as microprocessor interrupts and reset. With the new instructions, the test program can achieve fault coverage between 87% and 90% for stuck-at faults.
However, this approach cannot achieve a higher coverage because the test program is synthesized based on a random approach and it is not able to effectively control or observe some internal registers that have low testability.
The DfT methodology proposed in [36] systematically adds test instructions to an on-chip processor core to improve the self-testability of a processor core, reduce the size of the self-test program as well as reduce its run time (i.e., reduce the test application time). To decide which instructions to add, the testability of the processor is analyzed first. If a register in the processor is identified as hard-to-access, a test instruction allowing direct accessing of the register is added. The testability of a register can be determined based on the availability of data movement instructions between registers and memory. A register is said to be fully controllable if there exists a sequence of data movement instructions that can move the desired data from memory to the register. Similarly, a register is said to be fully observable if there exists a sequence of data movement instructions to propagate the register data to memory. Given the micro-architecture of a processor core, it is possible to identify the fully controllable and fully observable registers. For the registers that are not fully controllable/observable, new instructions can be added to improve their accessibility.
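The controllability and observability definitions above amount to graph reachability. The following is a minimal sketch, assuming each data movement instruction is modeled as a (source, destination) pair and memory is a single node named "MEM". The register names and moves in the example are hypothetical, and the model ignores side effects and data transformations that a real micro-architecture analysis would have to consider.

```python
from collections import deque


def reachable(start, edges):
    """Breadth-first search: all nodes reachable from `start` along `edges`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def classify_registers(registers, moves, memory="MEM"):
    """Map each register to (fully_controllable, fully_observable)."""
    forward, backward = {}, {}
    for src, dst in moves:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    controllable = reachable(memory, forward)   # data can flow memory -> register
    observable = reachable(memory, backward)    # data can flow register -> memory
    return {r: (r in controllable, r in observable) for r in registers}


# Hypothetical processor: ACC loads/stores, MAR is write-only, SR is read-only.
moves = [("MEM", "ACC"), ("ACC", "MEM"), ("ACC", "MAR"), ("SR", "ACC")]
result = classify_registers(["ACC", "MAR", "SR"], moves)
hard_to_access = [r for r, (c, o) in result.items() if not (c and o)]
```

In this example MAR is controllable but not observable and SR is observable but not controllable, so both would be candidates for a new test instruction.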
In addition to these test instructions, test instructions can also be added to optimize the test program size and run time. This is based on the observation that in the synthesized self-test program some code segments (called hot segments) appear repeatedly. Therefore, the addition of a few test instructions can reduce the size of hot segments. Test instructions can be added to speed up the process of preparing the test vectors by the processor core, retrieving the responses from the on-chip core under test and analyzing the responses (by the processor core).
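One simple way to look for hot segments is to count repeated fixed-length opcode sequences in the synthesized program. The sketch below is an illustration under that assumption only: how [36] actually identifies hot segments is not described here, and the function names, the opcode-only matching and the savings estimate are hypothetical.

```python
from collections import Counter


def hot_segments(program, length, min_count=2):
    """Return (segment, occurrences) for opcode windows seen at least `min_count` times.

    `program` is a list of opcode strings; overlapping windows are counted.
    """
    windows = (tuple(program[i:i + length]) for i in range(len(program) - length + 1))
    counts = Counter(windows)
    return [(seg, n) for seg, n in counts.most_common() if n >= min_count]


def instructions_saved(segment, occurrences):
    """Instructions removed if each occurrence is replaced by one test instruction."""
    return occurrences * (len(segment) - 1)


# Hypothetical self-test program fragment.
program = ["lda", "add", "sta", "lda", "add", "sta", "jmp"]
found = hot_segments(program, 3)
```

Here the sequence lda, add, sta occurs twice, so replacing it with a single test instruction would shorten the program by four instructions, provided the occurrences do not overlap.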
When adding new instructions, the existing hardware should be "reused" as much as possible, i.e., to reduce the area overhead, adding extra buses or extra registers to implement new instructions should be avoided. In fact, in most cases, a new instruction can be added by introducing new control signals to the datapath without adding extra hardware.
Adding test instructions to the programmable core does not improve the testability of other nonprogrammable cores on the SoC. Therefore, instruction-level DfT cannot increase the fault coverage of the non-programmable cores. However, the test programs for testing the non-programmable cores can be optimized by adding new instructions. In other words, the same set of test instructions added for self-testing the programmable cores can be used to reduce the size and run time of the test programs for testing other non-programmable cores.
For pipelined designs, instructions can be added to control the registers buried deeply in the pipeline that are hard to control.
Processor   Program length   Run time   Test generation time   Area overhead
PARWAN      -34%             -39%       -36%                   4.7%
DLX         -15%             -21%       -51%                   1.6%
Table 4: Results for testing processors.
The experimental results on two processors (Parwan [20] and DLX [23] ) are given in Table 4 .
These results are obtained after adding four test instructions to each of the processors. As can be seen, test instructions can significantly reduce the program size and program run time with a reasonable area overhead.
Self-Testing of On-Chip ADC/DAC and Analog Components Using DSP-Based Approaches
For mixed-signal systems that integrate both analog and digital functional blocks onto the same chip, testing of analog/mixed-signal parts has become the bottleneck during production testing.
Because most analog/mixed-signal circuits are functionally tested, analog/mixed-signal testing needs expensive automatic test equipment (ATE) for analog stimulus generation and response acquisition.
With the advent of the CMOS technology, DSP-based BIST becomes a viable solution for analog/mixed-signal systems as the required signal processing to make the pass/fail decision can be realized in the digital domain with digital resources.
An efficient BIST architecture for testing on-chip analog and mixed-signal components has been proposed in [37] . It employs the delta-sigma modulation technique for both stimulus generation [38] and response analysis. Figure 11 illustrates this delta-sigma modulation-based BIST architecture. A software delta-sigma modulator converts the desired signal to a one-bit digital stream. The digital 1's and 0's are then converted to two discrete analog levels by a one-bit DAC followed by a low-pass filter that removes the out-of-band high-frequency modulation noise, and thus restores the original waveform. In practice, one extracts a segment from the delta-sigma output bit stream that contains an integer number of signal periods. The extracted pattern is stored in on-chip memory, and periodically applied to the low-resolution DAC and low-pass filter to generate the desired stimulus.
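The stimulus generation path can be sketched in software as follows. This is a minimal illustration, assuming a first-order modulator and using a circular moving average as a stand-in for the analog low-pass filter. The modulator order, the filter, and all function names and parameter values are assumptions, not details taken from [37] or [38] .

```python
import math


def sine_segment(amplitude, periods, length):
    """One segment holding an integer number of sine periods (|amplitude| <= 1)."""
    return [amplitude * math.sin(2.0 * math.pi * periods * n / length)
            for n in range(length)]


def delta_sigma_bits(samples):
    """First-order delta-sigma modulator: samples in [-1, 1] -> list of 0/1 bits."""
    integrator = 0.0
    feedback = 0.0  # previous output level (+1 or -1), 0 before the first sample
    bits = []
    for x in samples:
        integrator += x - feedback          # accumulate the quantization error
        bit = 1 if integrator >= 0.0 else 0  # one-bit quantizer
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits


def one_bit_dac(bits, low=-1.0, high=1.0):
    """Map each bit to one of two discrete levels."""
    return [high if b else low for b in bits]


def circular_moving_average(values, window):
    """Crude low-pass filter; wraps around because the pattern repeats periodically."""
    n = len(values)
    half = window // 2
    return [sum(values[(i + k - half) % n] for k in range(window)) / window
            for i in range(n)]


# Hypothetical parameters: 4 periods in a 4096-bit pattern, 64-tap filter.
stimulus = sine_segment(0.5, 4, 4096)
pattern = delta_sigma_bits(stimulus)          # this is what would be stored on-chip
restored = circular_moving_average(one_bit_dac(pattern), 64)
```

Choosing a segment with an integer number of periods keeps the waveform continuous when the stored pattern is replayed, which is why the filter in the sketch wraps around the end of the pattern.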
Similarly, for response analysis, a 1-bit Σ−∆ modulator can be inserted to convert the analog DUT output response into a 1-bit stream which is then analyzed by digital signal processing (DSP) operations performed by on-chip DSP/microprocessor cores.
Note that the software part of this technique, i.e., software Σ−∆ modulator and the response analyzer, can be performed by on-chip DSP/microprocessor cores, if abundant on-chip digital programmable resources are available (as indicated in Figure 11 ), or by external digital equipment.
Conclusions
Embedded software-based self-testing has the potential to alleviate many of the problems of current external tester-based and hardware BIST testing techniques for SoCs. In this paper, we give a summary of the recently proposed techniques for self-testing and self-diagnosis of systems-on-chip. One of the main tasks in applying these techniques is extracting the functional constraints in the process of test program synthesis, i.e., deriving tests that can be delivered by processor instructions. Future research in this area must address the problem of automating the constraint extraction process in order to make the proposed solutions feasible for general processors. The software-based self-testing paradigm can be further generalized for analog/mixed-signal components using DSP-based testing techniques, Σ−∆ modulation principles and some low-cost analog/mixed-signal DfT.
