Abstract: Various techniques can reduce the test time and cost of chip development; some do so by reducing test data volume through compression technologies such as XOR-based decompressors. In the presence of an XOR decompressor, however, the delivery of acceptable (encodable) test patterns might not be possible. The Align-Encode technique, which manipulates the distribution of care bits in the test pattern, can increase the number of encodable test patterns. Yet the sequential algorithm, which computes the delay values for a test pattern given the XOR decompressor specifications, faces major drawbacks when applied to large test patterns. In this paper, we propose a parallel version of the Align-Encode algorithm designed to work on a distributed memory architecture. It exploits the nature of the problem to achieve significant improvements in runtime and in the number of encodable test patterns produced, and, as a result, in test data compression.
Introduction
Screening out defective chips requires testing every single chip after manufacturing, which introduces an additional costly step into the product development cycle. This step not only strains stringent time-to-market deadlines but also increases IC development costs. Throughout this process, external test equipment applies a set of pre-generated test vectors to the chip being tested and collects the chip responses. Comparing the chip responses against the pre-computed, expected ones helps judge whether the chip is defective.
The logic cone structure of typical designs leads to don't-care bits (x's) in test vectors, paving the way for test data compression techniques. This redundancy innate in test data can be exploited by compressing the test vectors. The compressed stimulus is stored on the tester, transmitted to the chip being tested, and expanded on-chip. In a typical scan architecture that supports compression, a decompressor expands a few scan-in channels into a larger number of scan chains. While test data volume and test time are thus reduced, the underlying structure of the stimulus decompressor determines the encodability of a test pattern. In the case of combinational decompressors, for instance, the test vector fragment to be delivered into a scan slice is analyzed to judge whether the care bits of the fragment can be obtained. The Align-Encode technique [1, 2] is based on the horizontal move of stimulus fragments inserted into scan chains. Such a capability can be attained by inserting controllable delay elements on the scan-in path of scan chains. Inserting a delay element on selected chains effectively shifts the stimulus of the corresponding chains, offering an alternative distribution that may be encodable. The work in [1, 2] has demonstrated the beneficial application of the horizontal move of stimulus fragments to boost the encodability of fan-out decompressors.
In this paper, we extend our preliminary work in [3] by proposing a distributed implementation of the Align-Encode algorithm, in order to further improve the effectiveness of the XOR decompressor used in conjunction with it. We also illustrate the application of Artificial Intelligence (AI) techniques in solving this challenging problem. As a result, we attain significant runtime improvements in finding the delay configuration that makes a pattern encodable. These runtime improvements also improve the quality of results: a more efficient, parallel exploration of the search tree enables the distributed approach to increase the number of encodable patterns, some of which remain unencodable under the runtime limitation of the serial algorithm. Consequently, the distributed algorithm enhances the compression levels of the accompanying decompressor.
Conceptually, stimulus manipulation techniques (Align-Encode being the only one in the literature) can be utilized in conjunction with any combinational decompressor, in order to boost the encoding capability of the decompressor. Various decompressors were outlined in [4]. These techniques include fan-out decompressors [5, 6, 7], XOR decompressors [8, 9], multiplexer/fan-out decompressors [10], and switch-based decompressors [11]. Other solutions include LFSR re-seeding [12, 13, 14], a combination of single input shift registers, clock gating logic and an XOR network [15], scan tree architectures [16, 17], and sequential decompressors [18]. To compensate for the defect/fault coverage losses due to unencodable patterns, these techniques either employ an additional compression-free phase [5, 6], or they utilize test generation as well [8, 18], searching for alternative encodable test vectors for the missed faults.
The Align-Encode Algorithm: Parallel Version
The problem amounts to searching a huge state space in which no dependency exists between different solutions; that is, we can divide the search space into equally sized parts and process these parts independently. Furthermore, the position of the solution dramatically affects the time required to find it. For instance, if the first solution lies in the fourth quarter of the search tree, a sequential algorithm must exhaust the first 75% of the search space before it starts on the fourth quarter. Therefore, being able to initiate the search anywhere in the search tree would enhance performance. In our tests, the sequential algorithm tends to be impractical for large test patterns (≥64x64).
A distributed memory architecture [19] is used: each instance of the application runs on a separate node, so some form of interprocess (and inter-machine) communication has to take place to exchange data. For this purpose, message passing is used to send messages between nodes, and the implementation uses MPI (Message Passing Interface) as the standard for exchanging data messages. The algorithm adopts the Master-Slave model and is divided into two major modules:
• The controller node's (Master) module: This part of the application is executed on the controller node which is responsible for preparing input data, distributing the work (partitioning the search space) over working nodes and gathering the results. Synchronizing working nodes is also one of the controller's tasks.
• The working node's (Slave) module: This part of the code receives input from the controller, searches the given part of the search space sequentially, reports the result back to the controller, and waits for further tasks.
The controller node is dedicated to managing the working nodes: synchronizing them, distributing work, and gathering results. After reading the test pattern, the controller performs a limited depth-first traversal of the search tree up to the (k-1)th level, building a bit sequence from the bits that represent the branch chosen at each step (0 for the left branch, 1 for the right branch). When the deepest level is reached, the controller sends the test pattern data and the currently constructed bit sequence prefix to the corresponding working node, which initiates the processing, as in Figure 1. The following pseudo-code describes the parallel XOR Align-Encode algorithm:
We compared the test data volumes of the architecture without versus with Align-Encode in a two-phase test application process. We provide percentage reductions in the test data volume delivered by the decompressor alone and by the decompressor along with Align-Encode, both with respect to the base case. The test data volume of the base case is computed as the product of the number of scan chains (chains), the number of test patterns (T), which is 250, and the scan depth (depth_long). The test data volume of the first phase without Align-Encode is computed as the product of the number of scan-in channels (V), the number of encodable patterns (Torg), and the scan depth (depth_short). The test data volume of the first phase with Align-Encode is computed similarly, except that the penalty incurred due to delay information and to one additional cycle per pattern is also included for the patterns that became encodable due to Align-Encode (TAE).
The second phase test data volume is computed identically for both cases, as the product of the number of scan chains (chains), the number of unencodable test patterns (T - Torg without Align-Encode, T - Torg - TAE with Align-Encode), and the scan depth (depth_short). Tables 1 and 2 summarize the test data volume reductions for our test cases with (AE) and without (NO_AE) Align-Encode. Tables 3-5 summarize the differences in test data volume reduction between the decompressor alone and the decompressor together with Align-Encode.
We are interested in finding the reduction in test data volume without versus with Align-Encode, and then the difference between the two reductions, in order to analyze how this difference changes as the number of nodes is increased. To find the reduction in test data volume we use the following formula:
To visualize the effect of changing different factors on test data volume, we plot in Figure 3 the difference in test data volume reduction between Align-Encode and no Align-Encode versus the number of processing nodes (n) for R=0.90 and 64x64 test patterns.
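The volume accounting described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' formula: the function and parameter names are hypothetical, the Align-Encode penalty is modeled by assuming that the additional cycle costs one bit per scan-in channel and that the delay information costs `delay_bits` bits per pattern, the reduction is assumed to be the percentage saved relative to the base case, and the numbers in the usage lines are illustrative only, not measured results.

```python
def base_volume(chains, patterns, depth_long):
    """Base case: chains x T x depth_long, as described in the text."""
    return chains * patterns * depth_long


def two_phase_volume(chains, channels, patterns, t_org, depth_short,
                     t_ae=0, delay_bits=0):
    """Total volume of the two-phase test application process.

    t_org      -- patterns encodable by the decompressor alone (Torg)
    t_ae       -- patterns that became encodable due to Align-Encode (TAE)
    delay_bits -- ASSUMED per-pattern cost of the delay information
    """
    # Phase 1: encodable patterns are delivered through the V scan-in channels.
    phase1 = channels * t_org * depth_short
    # ASSUMPTION: each Align-Encode pattern pays one extra cycle on every
    # channel plus delay_bits bits of delay information.
    phase1 += t_ae * (channels * (depth_short + 1) + delay_bits)
    # Phase 2: unencodable patterns bypass compression and use all chains.
    phase2 = chains * (patterns - t_org - t_ae) * depth_short
    return phase1 + phase2


def reduction_percent(volume, base):
    """ASSUMED definition: percentage saved with respect to the base case."""
    return 100.0 * (base - volume) / base


# Illustrative values only (not taken from the tables of this paper).
base = base_volume(chains=64, patterns=250, depth_long=64)
no_ae = two_phase_volume(64, 16, 250, t_org=150, depth_short=64)
with_ae = two_phase_volume(64, 16, 250, t_org=150, depth_short=64,
                           t_ae=50, delay_bits=64)
difference = reduction_percent(with_ae, base) - reduction_percent(no_ae, base)
print(reduction_percent(no_ae, base), reduction_percent(with_ae, base), difference)
```

Setting `t_ae=0` recovers the decompressor-only case, so the same function yields both quantities whose difference is plotted against the number of nodes.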
Conclusion & Future Work
In this paper, an important feature of the problem, the ability to exploit parallelism at the data level, has been observed and exploited to improve the performance of the original algorithm. Both the sequential and parallel algorithms have been tested on a variety of test cases. Significant speedup has been gained when applying the parallel implementation to relatively hard problems, where deep search must be performed and pruning is less frequent. For such hard problems, increasing the number of nodes improved both the execution time and the number of solved test patterns. As the problem becomes easier, the sequential implementation tends to perform best, since the execution times of the sequential and parallel versions are similar except for the communication time added in the parallel implementation.
When a node returns a result of "No solution", it remains idle until the next test case is provided. One solution to this possible inefficiency is to apply load balancing. With load balancing, a busy working node can pass part of its work to an idle node, either directly or through the controller node. Once a node becomes idle, it sends a "need work" request. Based on certain criteria, a busy node is chosen and part of its work is passed to the idle one. With this enhancement, maximum node utilization would be achieved.
The current version of the algorithm restricts the number of nodes to a power of two. Our intention is to generalize the algorithm to work with any number of nodes n. The idea relies on making the work distribution occur in two levels. Another direction is to improve the performance of the algorithm on large-scale configurations, such as 128-chain or 256-chain test patterns, using heuristic knowledge obtained during the process of distributing delay bits among the bits of the test patterns.
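The power-of-two restriction appears to follow from the prefix-based work distribution described for the controller node: a binary search tree cut at a fixed level yields a power-of-two number of bit-sequence prefixes, one per working node. The following Python sketch illustrates that partitioning under stated assumptions. It is not the authors' implementation: it uses no MPI, the names `prefixes_for_nodes`, `search_subtree`, and `is_solution` are hypothetical, and the exhaustive enumeration stands in for the pruned depth-first search of the actual algorithm.

```python
from itertools import product


def prefixes_for_nodes(num_nodes):
    """Return one bit-sequence prefix per working node.

    ASSUMPTION: the tree is cut at depth log2(num_nodes), which is why
    num_nodes must be a power of two in this sketch.
    """
    depth = num_nodes.bit_length() - 1
    if num_nodes < 1 or (1 << depth) != num_nodes:
        raise ValueError("num_nodes must be a power of two")
    # '0' selects the left branch, '1' the right branch at each level.
    return ["".join(bits) for bits in product("01", repeat=depth)]


def search_subtree(prefix, total_depth, is_solution):
    """Sequentially search the subtree rooted at `prefix`.

    `is_solution` is a hypothetical callback standing in for the
    encodability check; no pruning is performed here.
    """
    remaining = total_depth - len(prefix)
    for tail in product("01", repeat=remaining):
        candidate = prefix + "".join(tail)
        if is_solution(candidate):
            return candidate
    return None  # corresponds to a "No solution" report


# Toy run: the only solution lies in the fourth quarter of a depth-4 tree,
# so only the worker holding prefix "11" finds it.
target = "1101"
results = {p: search_subtree(p, 4, lambda s: s == target)
           for p in prefixes_for_nodes(4)}
print(results)
```

In this toy run, three of the four prefixes report no solution, which is the idle-node situation that the load balancing idea above is meant to address.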
