Abstract
Introduction
This paper describes our research and experimental results in testing complex circuits designed according to the IEEE 1500 standard [9]. Testing of complex System on Chip (SoC) circuits has to overcome the following challenges: to reach better test quality than can be obtained by applying a pseudorandom test set, to reduce the tester memory requirements, to reduce the amount of data transferred to and from the tested chip, to keep the test time short, and to keep the hardware overhead acceptably low.
Built-in pseudorandom or weighted random testing can solve the problem of memory for storing deterministic test patterns, but random resistant faults still remain and have to be tested from an ATE with deterministic patterns. Mixed-mode testing uses built-in pseudorandom pattern generators, which usually generate the first several thousand test patterns (10k+ patterns); deterministic patterns are applied after the pseudorandom testing phase in order to test the random resistant faults. The deterministic patterns can be compressed; the decompression is usually done in the same automaton that was used for generating the pseudorandom test sequence, and the seeds are stored in an ATE. Linear feedback shift register (LFSR) reseeding methods [13] assume that a large portion of bits in the test patterns are unspecified. The on-chip LFSR is loaded with seeds that guarantee that the bit sequence generated by the LFSR matches the deterministic patterns at the specified positions. The number of bits stored in the tester memory is relatively small, but the total number of clock cycles needed for testing may be high. The random part of a mixed-mode test is time and energy consuming.
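The seed computation behind LFSR reseeding can be illustrated as solving a small linear system over GF(2): every bit produced by the LFSR is an XOR of seed bits, so each specified bit of a test cube contributes one equation, and unspecified (X) bits contribute none. The following Python sketch is only an illustration under stated assumptions. The function names `lfsr_stream` and `solve_seed`, the 8-bit register width and the tap positions are hypothetical and are not taken from [13].

```python
def lfsr_stream(seed, n, taps, length):
    """Concrete Fibonacci-style LFSR: returns `length` output bits for an n-bit seed."""
    state = [(seed >> i) & 1 for i in range(n)]
    out = []
    for _ in range(length):
        out.append(state[-1])              # the last cell is the serial output
        fb = 0
        for t in taps:                     # feedback is the XOR of the tapped cells
            fb ^= state[t]
        state = [fb] + state[:-1]
    return out


def output_masks(n, taps, length):
    """Symbolic run: each output bit as a bitmask over the seed bits it XORs together."""
    state = [1 << i for i in range(n)]
    masks = []
    for _ in range(length):
        masks.append(state[-1])
        fb = 0
        for t in taps:
            fb ^= state[t]
        state = [fb] + state[:-1]
    return masks


def solve_seed(cube, n, taps):
    """Find a seed whose stream matches `cube` ('0', '1', 'X'); None if no seed exists."""
    masks = output_masks(n, taps, len(cube))
    basis = {}                             # pivot bit -> (mask, right-hand side)
    for mask, c in zip(masks, cube):
        if c == "X":
            continue                       # unspecified bits impose no equation
        rhs = int(c)
        while mask:
            p = mask.bit_length() - 1
            if p not in basis:
                basis[p] = (mask, rhs)
                break
            bm, br = basis[p]
            mask ^= bm
            rhs ^= br
        else:
            if rhs:                        # equation reduced to 0 = 1: inconsistent
                return None
    seed = 0
    for p in sorted(basis):                # free seed bits stay 0; pivots solved bottom-up
        mask, rhs = basis[p]
        lower = mask & ~(1 << p)
        bit = rhs ^ (bin(lower & seed).count("1") & 1)
        seed |= bit << p
    return seed
```

In this sketch a cube with only a few specified bits usually leaves many seed bits free, which is the property that reseeding exploits. A cube with more independent specified bits than the register width cannot be encoded in a single seed.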
The usefulness of a test compression method is influenced not only by the compression ratio but also by the complexity of the decompressing automaton and by the computational complexity of the algorithm for finding the compressed test sequence. To keep the decompressing hardware minimal, test patterns can be compressed by overlapping the patterns that are serially shifted into scan chains (SC). If the SCs do not contain internal flip-flops that can be rewritten by the test responses, the patterns can be decompressed during the test session with no additional hardware by simply performing one or more SC shifts. This approach was first described in [6] and later in [24]. The test pattern (TP) compression uses an algorithm for finding contiguous and consecutive maximally overlapping scan chain vectors for the actual scan chain vector. These vectors are checked to see whether they match one or more of the TPs that were previously generated and compacted and that have not yet been employed in the scan chain sequence. In [19] we presented an improved algorithm, which speeds up the computation of overlapped patterns by searching for the successors of the all-zero seed only, and which improves the compression efficiency by fault simulation performed while the overlapped patterns are being found. The fault simulation enables the algorithm to delete the patterns corresponding to the already covered faults. This algorithm uses test vectors with don't care bits instead of the compacted ATPG test set, which enables us to combine TP compaction and compression in a way that is well suited to decompression in a scan chain or in parallel SCs. The algorithm was implemented in the COMPAS (COMpressed test PAttern Sequencer) software tool. COMPAS is intended for preparing test sequences for cores under test (CUT) that are designed according to the IEEE 1500 Std. [9].
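Decompression by shifting can be illustrated with a short Python sketch. It assumes a single scan chain that is reset to all zeros, with each new bit entering on the right; the function names `covers` and `applied_patterns` and the shift direction are assumptions made for the illustration and do not describe the COMPAS implementation.

```python
def covers(window, pattern):
    """True if the scan chain content satisfies every care bit of the pattern."""
    return all(p == "X" or p == w for w, p in zip(window, pattern))


def applied_patterns(sequence, patterns, chain_len):
    """Shift `sequence` into a zeroed chain; report the first shift at which each pattern appears."""
    chain = "0" * chain_len
    hits = {}
    for step, bit in enumerate(sequence, 1):
        chain = chain[1:] + bit            # one scan shift: oldest bit leaves, new bit enters
        for p in patterns:
            if p not in hits and covers(chain, p):
                hits[p] = step
    return hits
```

For the patterns `1X01` and `X011` and a 4-bit chain, `applied_patterns("1101", ["1X01", "X011"], 4)` reports that `X011` is present after 2 shifts and `1X01` after 4 shifts, so 4 stored bits replace the 8 bits of the two separate patterns.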
A test session can be controlled by an embedded BIST controller. As the RAM size is limited, the test set has to be as small as possible. Further testing speed improvement can be obtained by minimizing the amount of data transferred between the processor and the tested cores. For this reason it is worthwhile to send the compressed data from the processor to decoders that are placed close to the tested cores and to let the decoders decode the patterns independently of the processor activity. Another problem arises when using cores whose SCs contain internal flip-flops: if we have to guarantee that test patterns are not corrupted by CUT responses and, simultaneously, that all test responses are captured, we have to scan in and scan out the whole test pattern after each system clock application. The RESPIN (Reusing Scan Chains for Test Decompression) test architecture [7] addresses both pattern decompression and the reduction of data traffic between the tester and the CUT. This architecture reuses scan chains of different cores for updating the scan chain content of the tested core (Fig. 1). The RESPIN architecture temporarily divides the circuit into the core under test (CUT) and the embedded tester core (ETC). The data transfer mechanism between the tester and the ETC can be denoted as a narrow Test Access Mechanism (TAM), as the demanded transfer capacity is low. The TAM between the ETC and the CUT is wide, as the data transfer is done in parallel and at a higher clock frequency. The CUT and the ETC have several parallel scan chains. The ETC chains are concatenated into a serial scan chain. They remain connected in this mode throughout the whole test session. A feedback tap connects the output of the last ETC chain with the first bit input through a multiplexer. According to the multiplexer control input, the ETC can either load a bit from the tester or shift the scan chain circularly. The parallel chains of the CUT are connected to the parallel ETC chain outputs.
Fig. 1. ETC and CUT in the RESPIN architecture
This TP updating mechanism guarantees that the patterns, which are shifted through a CUT SC during several test steps, are not mixed with the CUT responses. When the ETC loads a bit from the tester, the CUT is in the capture mode, and when the ETC shifts its scan chain circularly, the CUT is in the shift mode. The CUT and ETC modes are controlled by WSC signals from the corresponding wrappers.
An additional multiple-input signature register (MISR) connected to the SC outputs can be used for capturing all the test responses. The conditions for effective testing are: the ETC has at least as many chains as the CUT; the CUT chains are not longer than the corresponding ETC chains; and the number of scan cells of the CUT and the total number of ETC scan cells incremented by one have no common divisor. If it is not possible to find an ETC core that fulfils these conditions, more than one ETC can be used, and the wide TAM structure will interconnect two or more ETCs with the CUT. Possible solutions for connecting the ETC scan chains are presented in [10].
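The three conditions listed above can be written as a simple check. The Python sketch below is an illustration only: the function name `respin_conditions_hold` is hypothetical, chains are described by their lengths, and CUT chain i is assumed to be paired with ETC chain i, which is one possible reading of "corresponding".

```python
from math import gcd


def respin_conditions_hold(cut_chains, etc_chains):
    """Check the three stated conditions for a CUT/ETC pair given scan chain lengths."""
    # 1. The ETC has at least as many chains as the CUT.
    if len(etc_chains) < len(cut_chains):
        return False
    # 2. No CUT chain is longer than the ETC chain it is paired with (pairing by index assumed).
    if any(c > e for c, e in zip(cut_chains, etc_chains)):
        return False
    # 3. The number of CUT scan cells and (number of ETC scan cells + 1) share no common divisor.
    return gcd(sum(cut_chains), sum(etc_chains) + 1) == 1
```

For example, a CUT with chains of lengths 3 and 2 and an ETC with two chains of length 4 passes the check, since 5 and 9 are coprime. A CUT with two chains of length 3 fails with the same ETC, because 6 and 9 share the divisor 3.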
A number of TAM architectures have been proposed and published. The basic solutions are the multiplexing architecture and the distributed architecture. Various combinations of those architectures may co-exist. The architecture called TestBus [11] was developed as a combination of the multiplexing and distributed architectures. Another possible architecture is TestRail [16], which tries to combine the strengths of both the test bus and the boundary scan test. In this paper we propose to use partial reconfiguration for switching the diagnostic busses instead of the classical architectures.
In the case of FPGAs, the final functionality of the circuit depends on the configuration bitstream loaded into the device from external memory. Recent FPGA circuits are dynamically reconfigurable at runtime. These dynamically reconfigurable FPGA circuits (D_FPGA) are able to change the behaviour of one part of the circuit while the rest remains fully operational, without changes and without interruption. The currently known dynamically reconfigurable devices use either the "partial configuration" or the "multiple-context configuration memory" technique. As the Atmel FPGAs can efficiently perform fine-grained reconfiguration, we decided to use them for an implementation of the self-testable SoC (System on Chip) design. The test system uses the RESPIN architecture, which is based on the IEEE 1500 standard. The partial reconfiguration is used for the connections among the ETCs, the CUT and the feedback multiplexer. The main advantage of the proposed solution is that all the reconfiguration bitstreams are stored inside the chip. Therefore the reconfiguration process can be controlled by the embedded processor, and the only communication between the tested SoC and the external test supervisor is a request to execute the test and a check of the test results.
Test Pattern Compaction and Compression
The self-testing system uses compressed patterns that are prepared offline by the COMPAS algorithm [19]. This algorithm was sped up in order to handle larger cores in acceptable CPU time. In this section we describe the main principles of the improved COMPAS version. The algorithm minimizes the length of the overlapped test sequence in which every TP that detects at least one fault undetected by the remaining patterns appears at least once in the SC. The test sequence has to be decompressed in the ETC cores so that each pattern can be applied in the CUT with several parallel cores. At the beginning, the Test Pattern List (TPL) prepared by the ATPG is used together with the corresponding Undetected Fault List (UFL). Three-state test vectors with values 0, 1 and X (X means don't care) have to be generated for the CUT. An ATPG tool that can generate non-compacted test patterns has to be used, and a test pattern has to be generated for each fault. In this way we can distinguish which pattern belongs to which fault.
The main loop of the algorithm is described in Fig. 2. Let us suppose that the SCs are reset before testing, which means that the all-zero pattern is used as the first one. The fault coverage of this pattern is simulated, the detected faults are deleted from the UFL, and the test patterns corresponding to the detected faults are deleted from the TPL. Then the algorithm tries to compact the test set by overlapping the remaining patterns with the actual vector. The algorithm determines whether logic value 0 or logic value 1 is the better choice for the next leftmost chain bit. To do this, the algorithm finds the positions of all patterns in which the actual chain bits maximally overlap the pattern and for which the actual bit to be introduced into the scan chain does not have a don't care value. Simultaneously the algorithm determines for how many future clock cycles of the SC it is not necessary to recalculate the position of the pattern. This information is used for skipping pattern recalculation when groups of don't care bits are present in the patterns. Skipping the recalculation of pattern position and usefulness is enabled by using a concatenated list of pattern pointers. The pointers indicate the position in the SC where the pattern can eventually influence the SC content. Chaining the pointers reduces the complexity of enumerating their absolute position in the SC. After finding the position, the algorithm has to compute the usefulness of the treated pattern. The usefulness is characterized by the number of don't care bits and the possibility of overlapping the pattern. It is computed according to the formula given in [19]. An example of positioning the vectors is given in Table 1. Vector 1 overlaps the SC in two bits, but it has a don't care bit at the actual position. This don't care bit is followed by 2 other don't care bits. This means that this vector will not be evaluated for the next two algorithm cycles.
The usefulness criterion prefers patterns that have a high number of care bits and, simultaneously, a maximum number of care bits overlapped with the scan chain. This way of setting the actual bit guarantees that a maximum number of the most useful patterns can be encoded. When searching for the most useful pattern, the algorithm checks whether the exercised pattern matches the bits that will have to be generated in future clock cycles because of some previously selected patterns. These bits are stored in a Future Array (FA) together with their effectiveness and pattern identification numbers. If some position of the FA is reserved for a logic value that clashes with the exercised pattern bit value, the algorithm compares the usefulness of both patterns and the winner is used in the future content of the FA; the other bit is deleted from the FA but the corresponding pattern is kept in the TPL. The fault simulation is performed, and the faults and patterns that correspond to the covered faults are removed from the lists. If there are no remaining faults in the UFL, the algorithm is finished.
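The bit-by-bit construction of the overlapped sequence can be illustrated with a strongly simplified Python sketch. It is not the COMPAS algorithm: it omits fault simulation, the Future Array and the pointer lists, and it replaces the usefulness formula of [19] with an assumed stand-in that ranks patterns first by their overlap with the chain and then by their number of care bits. All function names are hypothetical, and new bits are assumed to enter the chain on the right.

```python
def covers(window, pattern):
    """True if the chain content satisfies every care bit of the pattern."""
    return all(p == "X" or p == w for w, p in zip(window, pattern))


def max_overlap(chain, pattern):
    """Largest k such that the last k chain bits are compatible with the first k pattern bits."""
    for k in range(min(len(chain), len(pattern)), 0, -1):
        if covers(chain[-k:], pattern[:k]):
            return k
    return 0


def compress(patterns):
    """Greedy sketch: returns the bit string to shift into a zeroed chain (equal-length patterns)."""
    length = len(patterns[0])
    chain = "0" * length                   # scan chain is assumed reset to all zeros
    todo = [p for p in patterns if not covers(chain, p)]
    while todo:
        # Stand-in usefulness: (overlap with the chain, number of care bits).
        scored = [(max_overlap(chain, p), sum(c != "X" for c in p), p) for p in todo]
        k, _, leader = max(scored)
        bit = leader[k]                    # the next bit the most useful pattern needs
        if bit == "X":
            # The leader does not care, so let the other patterns vote on the bit.
            votes = [p[o] for o, _, p in scored if p[o] != "X"]
            bit = "1" if votes.count("1") > votes.count("0") else "0"
        chain += bit
        todo = [p for p in todo if not covers(chain[-length:], p)]
    return chain[length:]
```

Following the pattern with the largest overlap guarantees progress in this sketch, because that pattern's overlap grows by one with every shifted bit until it is fully present in the chain and removed from the list. In the real flow, fault simulation after each step would remove further patterns and shorten the sequence.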
The COMPAS system can be run remotely on the web site [10]. In the current version of COMPAS we use test patterns generated by the Atalanta ATPG tool [14], and the Hope tool [15] is used for fault simulation. After accelerating the algorithm by using concatenated pointer lists, which make it possible to omit recalculations of patterns with don't care bits (for larger circuits more than 99 % of test pattern bits are don't care bits), the CPU time for pattern compression is proportional to the circuit complexity.
For the core of the benchmark circuit s38417 the CPU time was 435 s on a PC with an Intel Pentium IV, 2.8 GHz. This is illustrated in the graphs in Fig. 3, where we compare the CPU time for different ITC and ISCAS benchmark circuits [4]. The first and the last graph demonstrate that the CPU time is approximately linearly dependent both on the number of circuit gates and on the number of generated bits. The fault simulation time (the second graph) of one test vector (corresponding to one generated bit) also scales approximately linearly with the circuit size. These comparisons demonstrate that the total CPU time spent on test pattern compression grows acceptably slowly with growing circuit size.
The memory requirements of COMPAS are low, as only two pointers are stored for each test pattern, and after a fault is detected the corresponding pattern is removed from the memory. The requirements decrease while the algorithm runs. Considering a 32-bit memory word, the maximum consumed memory for a circuit with one million faults is 11.5 MB. COMPAS runs both on Unix and Windows. Table 2 shows the resulting CPU time (given in seconds) of finding the compressed test sequences and the test sequence lengths for ITC99 benchmark circuits. The last column of Table 2 gives the CPU time spent on generating 1000 SC bits (in seconds). We can see that this time is not directly dependent on the circuit size and the test sequence length, but it depends on the internal structure of the circuit. Fig. 3 shows several CPU time comparisons for different circuits. In Table 3 we compare the numbers of stored bits for the largest ISCAS circuits for some well-known test pattern compression methods and for the proposed algorithm. This comparison could not be done for more complex benchmark circuits, as the results of other methods were not available. The second column gives the test data volume for ATPG vectors that were compacted only [3].
The next columns show the number of stored bits for statistical coding of the test patterns from the previous column [1], a combination of statistical coding and LFSR reseeding [12], parallel/serial scan chains [20], frequency directed (FDR) codes [5], the EDT method [22], the RESPIN++ architecture given in [23], and COMPAS [19]. The total number of clock cycles for each method is given by the number of applied pseudorandom patterns, the number of deterministic patterns and the length of the longest parallel scan chain of the CUT. The test time depends directly on this number.
We have compared the obtained COMPAS test time with other mixed-mode testing methods and we have found that COMPAS provides substantially lower numbers of clock cycles than other methods while the numbers of necessarily stored bits are comparable. 
Experimental SoC self-testing system
The experimental test system was built on the FPSLIC™ AT94K40AL circuit. It is a dynamically reconfigurable programmable SoC, which integrates an Atmel SRAM-based FPGA and an 8-bit AVR processor. The communication between these main parts is provided by an 8-bit bus and 16 internal interrupts (see datasheet). Besides these two main parts, a 36 kB SRAM, UART and JTAG interfaces, a watchdog timer, two counters, etc. are placed on the chip. The user interface allows loading the AVR control program and the initial bitstream from a PC to the RAM. The on-chip AVR also manages the reconfiguration process. The data stored in the FPGAD register are used for programming the FPGA configuration SRAM cells according to the address in the FPGAX/Y/Z registers. The content of the FPGAX/Y/Z/D registers is set by the AVR (see Fig. 4).
Table 3 (column headings): Circuit name, MinTest [3], Stat. Coding [1], Reseeding [12], Illinois Scan [20], FDR Codes [5]
The FPSLIC circuit is connected to a PC through the JTAG interface. The user is able to program both main parts of the IC: the program for the AVR processor and/or the static content of the FPGA. Testing with the RESPIN architecture requires the circuit cores to be reconfigured several times during the test. Each core in the SoC is surrounded by a structure called a wrapper [17]. The wrapper connects the core with its defined surroundings either in the functional mode or in the test mode. The TAM takes care of the on-chip test pattern transport. The TAM together with the wrappers forms the infrastructure for access to the individual cores, which allows all cores to be tested. An obvious TAM architecture uses multiplexers for reconfiguring the connections of the ETC and CUT diagnostic data paths. Every CUT chain input should allow connection with every ETC scan output, and every CUT scan output should be connected with the dedicated input terminal of the sink. Every ETC chain output should allow connection with the input of the first ETC chain through the additional multiplexer, which connects the feedback tap. The multiplexer is controlled by the instruction register of the TAM, which is handled by the Wrapper Serial Control (WSC) signal. In the case of multiple embedded cores, the multiplexers have to switch between the corresponding test modes. This approach leads to a high area cost for connecting and multiplexing all core terminals, and the control circuitry for these multiplexers grows substantially with a growing number of cores on the chip.
Partial FPGA reconfiguration seems to be an efficient way to form a low area-cost TAM for an SoC design with multiple embedded cores. The FPGA consists of a number of generic cells called LUTs. In our system a LUT is used for connecting the test core terminal and a LUT of the TAM. With this arrangement two LUTs are needed to form one wire interconnection between a 1-bit core test input and output terminal in the FPGA. Two TAM architectures were designed, the TAM architecture with multiplexed access and the partially reconfigurable TAM architecture; the resulting hardware overheads of the TAMs are compared in Fig. 6. We considered two cases. In both cases the TAM using the dynamic reconfiguration feature of the FPGA has much less hardware overhead than the multiplexer based TAM. The test system uses an 8-bit AVR processor, a dynamically reconfigurable FPGA, and an SRAM memory accessible both from the processor and from the FPGA. In the FPGA we programmed the wrapped cores, the MISR, the controller and a detached area for the TAM. The AVR processor was used for data processing, for exchanging data with the hardware controller and for partial reconfiguration of the TAM before the start of the core test. Test patterns together with TAM configurations were stored in the embedded SRAM. The processor controls the test scheduling and communicates with the hardware controller. The RAM is used for storing the compressed test sequence. For each test pattern the processor gives the controller a command to run the test cycle independently of the processor. This arrangement enables the hardware controller and the processor to work concurrently and speeds up the test. During the test cycle the AVR transports one test bit from the memory to the port tdi and informs the controller about the availability and suitability of the test data (Fig. 1).
At the end of the test session, the processor shifts data through the port tdo from the MISR, where the responses were accumulated, and compares the resulting signature with the reference one from the RAM. After the first CUT test is finished, the TAM is partially reconfigured, the next core is assigned as the CUT, and it is tested through a newly reconfigured ETC. As the granularity of the configurable blocks of the FPGA is relatively fine, only a small part of the configuration memory has to be replaced by new content (denoted by gray color in Fig. 4). Three ISCAS benchmark circuits (S298, S382 and S444) were used as cores in the experiment. The test system hardware including these cores occupied 34% of the AT94K40 resources. With three S1423 cores in the SoC, 73% of the AT94K40 FPGA resources were used. Reconfiguration takes several thousand processor clock cycles. The number of clock cycles depends on the design to be reconfigured. In our case the reconfiguration time is less than 1 ms with a 4 MHz processor clock. The circuit has 36 Kbytes of available RAM. The size of one reconfiguration bitstream used in the test system was 2 Kbytes. The test time depends on the longest parallel chain and on the number of bits of the compressed test. In our case the test time is about 0.3 ms for the best possible clock frequency of the FPGA (40 MHz).
Conclusion
The proposed system uses highly compressed test patterns; to our knowledge the compression ratio is better than for other comparable methods. The compression consists of test pattern overlapping. The overlapped pattern sequence can be obtained with the COMPAS software tool. For the use of the compressed test sequence in a multiple scan chain system, the sequence is reordered so that it is correctly decompressed within the RESPIN architecture. We have solved the problem of long CPU time for enumerating the compressed test sequence by multiple usage of the test bit usability evaluation during the process of finding the test sequence and by skipping pattern recalculation when groups of don't care bits are present in the patterns. This was enabled by using a concatenated list of pattern pointers. We have verified that the proposed system is applicable on an SoC. We have placed the system together with simple functional cores on the AT94K circuit. The test system uses the dynamic and partial reconfiguration feature of the embedded FPGA. This is advantageous because it saves the FPGA resources devoted to switching the TAM busses. All the test hardware with the reconfigurable TAM can remain on the original FPSLIC circuit, and only the wrapped cores will be placed on an additional FPGA or ASIC. Another possibility is to implement the system on large Xilinx FPGA circuits with an embedded processor and RAM block. This will be the future work of our team.
