Taking advantage of multicore architectures can provide significant improvement for many design automation problems. However, the parallelization procedure introduces challenges, such as workload duplication, limited search space exploration, and race contention among different threads. In this article, we propose a parallel framework for automatic test pattern generation using shared memory multicore systems that supports test generation (TG) for both single-detect and multiple-detect fault models. The framework follows a two-epoch approach, with each epoch focusing on a different category of faults, during which test seed generation is followed by compatibility merging. Various optimization techniques are incorporated in each epoch, designed to achieve higher speedup for the overall TG procedure without significantly affecting the test set size. A cluster-based approach is also presented, extending the proposed framework to consider multiple-detect fault models without affecting its efficiency. The obtained experimental results demonstrate increased speedup rates compared with the state-of-the-art multicore-based tools while, at the same time, the test inflation problem is restrained. For the multiple-detect extension, these properties are maintained despite the increased workload and the additional constraint of retaining the number of detections for each fault while merging.
I. INTRODUCTION
TECHNOLOGY shrinking in the integrated circuit manufacturing process has allowed the implementation of multiple processing units (cores) on a single chip, together with large amounts of on-chip memory. These developments offer extensive processing power that can be used in various computationally intensive problems, including popular electronic design automation processes. However, the distributed fashion of this processing power calls for the development of parallel methodologies that scale well as the number of cores per chip is expected to increase from a few dozen to hundreds.
Automatic test pattern generation (ATPG), a well-known NP-hard problem, becomes more demanding as devices under test become larger and more complicated and as emerging defects require new fault models of higher complexity. While previously proposed procedures are very effective, see [1] and [2], among many others, they are inherently nonparallel and, thus, cannot rely on automatic parallelization using sophisticated compilers. Proper problem decomposition, workload distribution, and final test set recomposition are essential to guarantee the quality of the results while maintaining fault coverage and other test set characteristics such as test size. Since, typically, each computing core does not consider the entire search space, parallel approaches tend to choose locally optimal solutions, resulting in an increased test set size [3], known as the test inflation problem.
Parallel ATPG was studied before the on-chip multicore era, by either applying bit-level parallelism or distributing ATPG components among multiple processing units that are not physically on the same chip [3], [4]. These approaches were designed to avoid/minimize communication overhead and were constrained by the machine's word size. In current on-chip multicore architectures with shared memory, on-chip communication is much faster, significantly reducing the cost of intercore communication. Furthermore, a high level of memory coherency is guaranteed, and the number of available cores keeps increasing. These new developments and trends motivate the investigation of parallel ATPG approaches capable of achieving speedup scalability as the number of on-chip cores increases, while overcoming new challenges, such as shared memory contention and efficient workload distribution among parallel threads.
ATPG parallelization for on-chip multicore environments exploits a variety, and often a mixture, of parallelism dimensions, such as fault parallelism, structural (circuit) parallelism, and algorithmic (including search-space) parallelism. Moreover, the goal of utilizing parallelism often varies. For example, Czutro et al. [5] exploit algorithmic parallelism via SAT (Boolean satisfiability problem) solver parallelism for maximizing fault coverage with limited speedup with respect to the corresponding serial process. Similarly, Liao et al. [6] apply bit-level parallelism to generate multiple test patterns concurrently that meet different quality metrics to achieve higher physical-aware coverage. Static fault parallelism is explicitly considered in [7] using a master-slave architecture to reduce interprocess communication; it achieves sublinear speedup for up to eight cores but suffers from increased test set sizes (test inflation). Furthermore, recent applications of test generation (TG) algorithms in security and reliability of integrated circuits employ parallel approaches. For example, the recent work of [8] proposes a side-channel-aware parallel TG approach which aims to statistically increase hardware Trojan sensitivity.
Parallelization speedup rates and test set inflation for shared memory architectural models are investigated in [9]-[13]. Shared memory is utilized in [9] as a low-latency, high-capacity communication means to leverage synchronization and communication of the process with the goal of minimizing test inflation. Fault coverage is maintained while pattern count is reduced, but at the expense of linearly scalable speedup. The work in [10] proposes a low-communication circular pipeline parallel ATPG procedure, which emulates the deterministic execution of a serial ATPG in order to be able to reproduce the same test set every time the parallel algorithm is executed. This also leads to limitations in speedup scalability. The series of works in [11]-[13] target both parallelization speedup and test inflation minimization strategies, incorporated in state-of-the-art commercial tools. In particular, Cai et al. [11] achieve speedup by applying dynamic fault partitioning and depth-first-search (DFS)-based compaction. The work in [12] extends [11] for distributed multicore hybrid architectures, while the work in [13] incorporates a copy-on-write technique for private data protection in order to reduce memory locking when the same part of the memory is used concurrently by more than one core. Similar to the above-mentioned approaches, the work proposed here is targeted toward achieving a high degree of speedup as the number of available cores increases and, at the same time, limiting test set inflation. A more elaborate comparison with the results of the works of [9]-[13] can be found in Section V-A.
Parallel approaches have also been proposed targeting graphics processing unit (GPU)-based architectures. In contrast to the fault simulation problem, where the GPU model can be effective because the problem's concurrent nature directly suits the single-instruction multiple-data approach of GPUs [14], [15], ATPG parallel threads often require interthread communication in order to achieve high speedup rates and avoid test inflation. This communication can be effectively facilitated by shared memory, which is, however, very limited in GPU-based architectures. Existing approaches for ATPG using GPUs/GPGPUs suffer from limited speedups or high test inflation rates [16]-[18]. For example, the recent method in [18] reports overall speedups (with respect to a serial ATPG approach) in the range of 0.71× to 40.7× when using a GPU with 2880 processing cores.
In this article, we propose a parallel ATPG methodology, for shared-memory multicore systems, geared toward high speedup and test inflation containment. The methodology takes advantage of fast and low cost shared memory communication, inherent in the underlying architecture, in order to coordinate the main steps of the ATPG to avoid redundant work. The approach dynamically allocates workload, while minimizing memory contention caused by multiple cores (threads) accessing shared data. A TG flow is proposed in which hard-to-detect faults are targeted first, followed by a parallel fault simulation-based merging process to maximize fault coverage. This process employs a series of newly proposed parallelization heuristics to explicitly avoid simultaneous consideration of the same faults by two or more cores, in order to minimize extra work and thread idle time. Any remaining undetected faults are targeted during the following phase, in a similar manner.
The proposed parallel approach is applicable to fault models of linear size with respect to the circuit size, where the faults can be enumerated (if needed), and is demonstrated in this article using the well-known single stuck-at fault model. Furthermore, we extend the approach to n-detect models that require n different tests per fault in order to increase the defect coverage of a test set [19]-[21], at the expense of an increase in the test set size. In this case, the scalability and/or the quality of the proposed partitioning-based parallel approach for single-detect fault models may be impacted, as the fault list partitioning does not result in mutually exclusive sublists. To address this problem, we extend the proposed parallel ATPG methodology to a cluster-based approach for n-detect TG. Specifically, the generation of the different detections for the same fault is systematically assigned to different processing cores, and test merging for different faults is performed in a restricted manner in order to avoid merging multiple detections for the same fault. Moreover, the tests generated for the same fault are optimized to differ substantially from one another, as it is well known that this can increase the defect coverage of the overall test set [22]-[24]. To the best of our knowledge, this is the first parallel ATPG approach to explicitly target high-quality n-detect test set generation. The obtained experimental results demonstrate the effectiveness of the proposed approach in maintaining the scalability of the ATPG process and provide comparisons with relevant recent work.
The rest of this article is organized as follows. Section II presents a high-level description of the proposed parallel ATPG while Section III focuses on particular parallel optimizations used to reduce the test inflation problem and favor speedup. Section IV describes the new challenges for n-detect test sets and proposes a cluster-based approach for parallel ATPG. Section V presents and discusses the experimental results, and Section VI concludes this article.
II. PROPOSED HIGH-LEVEL ATPG FRAMEWORK
Typically, a parallelization procedure consists of three basic steps: 1) decomposition (domain or functional); 2) parallel execution; and 3) final result assembly.
Step 2) can result in a significant compromise of the quality of the obtained results and, at the same time, not offer the expected speedup. An efficient parallel algorithm should effectively overcome challenges, such as memory contention and imbalanced workload distribution. The proposed ATPG method appropriately designs all three steps to ensure that these challenges are treated efficiently. Specifically, two conceptual approaches are adopted: 1) problem partitioning to avoid executing the same work concurrently in different cores and 2) fine-grained granularity of each step to provide a dynamic distribution of work. This section presents the TG flow of the proposed methodology, which is based on these two concepts; various parallel optimization heuristics based on these concepts are discussed in Section III.
The proposed methodology relies on an initial test-per-fault step, for a limited number of faults, to obtain an initial seed test set over which the algorithm evolves. The many degrees of freedom allowed in a test seed by our single fault ATPG process provide the desired granularity that allows mutually exclusive distribution of work in the different cores. However, this distribution may result in a large amount of unnecessary work when not taking advantage of fault dropping. Fault dropping plays a critical role anyway, as it can significantly affect test set size. In parallel TG, the inefficient dropping of faults can also restrict speedup, although the main process for identifying faults to be dropped (fault simulation) can be implemented very efficiently in parallel environments [14], [15]. A fair tradeoff between high granularity and fault dropping consideration is to develop a methodology based on distinct test epochs, one targeting hard-to-detect faults and the following one targeting the remaining undetected faults. Fig. 1 presents the high-level description of the proposed methodology. First, the circuit netlist is analyzed to obtain a collapsed fault list F for the underlying fault model M. Subsequently, the fault list is sorted in a DFS order (based on the fault locations in the netlist) in an attempt to implicitly group faults with structural similarities in F. This fault locality property of the input fault list benefits fault dropping after F is partitioned to the available cores. The next step identifies hard-to-detect faults to be targeted by the first test epoch (Epoch I) of the methodology. We use random test pattern generation, which is a simple, quick, and acceptable way to classify faults; however, other more sophisticated methods can be incorporated, such as [25] and [26].
Hard-to-detect faults are identified using a multiple detection approach where the 10% (set by experimental exploration) of the faults in F with the fewest detections are considered as hard and used as the input
III. PARALLELIZATION METHODOLOGY AND OPTIMIZATIONS
This section describes in detail how the TG process is partitioned and discusses the decisions taken to address the main parallelization challenges. Section III-A describes the major steps undertaken during a test epoch, discussing dynamic fault partitioning and core synchronization, while Section III-B describes a number of optimizations proposed to overcome parallelization issues.
A. Test-Epoch Parallelization
During the first step (seed-based TG in Fig. 2), each available core performs test seed generation (TG with maximal don't care bits) for the next undetected fault f i in the list using a PODEM-based process optimized to identify tests with a large number of unspecified bits (proven to be beneficial for a plethora of applications, e.g., test set compaction [24] and low power testing [27]). The order of the selection of the next fault(s) is not important here, as the partitioning is designed to work in an independent manner and produce standalone results. The system shared memory holds the updated fault list (faults not yet targeted) and, therefore, duplication of work is avoided as each core works on a distinct fault. For each test seed t i generated, parallel fault simulation is performed and all faults detected (including those in F − F H) are stored in a list d i. Faults in d i are not immediately dropped as this information is used during the following step. This first step terminates when all faults in F H have been targeted. T PF contains the test seeds and D PF contains the corresponding fault simulation results, both of which are kept in the shared memory (Fig. 2, between the two steps).
Algorithm 1 Dynamic Merging for Core k
The next step is invoked (dynamic test merging in Fig. 2) in order to merge compatible test seeds and reduce the size of T PF. Each core selects the test seed from T PF with the largest detection list d i and marks it (as the core's primary seed) so that other cores cannot select it. A detailed description of this selection is given in Section III-B. This merging step is dynamic due to the efficient communication of the merged tests through the shared memory. Thus, in each iteration, the number of candidate test seeds for merging is reduced at a fast rate.
Algorithm 1 outlines the merging process undertaken by each core while the shared memory accommodates information about faults detected and test seeds discarded. The inputs to this merging process are the test seeds generated in the previous step (T PF), their corresponding detected faults (D PF), and the fault list F H. This process is similar for the two epochs of the methodology; hence, without loss of generality, we describe the method here for Epoch I. Each core considers a distinct subset of the test seed set T PF in order to avoid utilizing the same seed for merging in more than one core. Core k (out of the m available) considers only (|T PF|/m) seeds for primary selection in the range k × (|T PF|/m), . . . , (k + 1) × (|T PF|/m) − 1, denoted here as T PF (k : k + 1). Once a seed t i in this range is selected (line 02) as a primary seed to be merged, all the detected faults are immediately dropped from the globally maintained fault list F H (line 03), and the seed is removed from the given seed set T PF (line 04). Then, the Hamming distance between each of the remaining seeds in the considered range [i.e., T PF (k : k + 1)] and t i is calculated and saved in list P i (lines 06 and 07). Observe that, in subsequent iterations, this range changes for the secondary seeds (line 05) so that all seeds in T PF are considered for merging with the primary t i. Then, the seed with the minimum Hamming distance is selected (line 08) and merged (line 11) with t i. The merged seed is removed from T PF (line 12), and its detected faults are dropped from further consideration (line 13). When no more seeds in the range can be merged with t i, the algorithm continues to the next range of (|T PF|/m) seeds (lines 09 and 10). Lines 14-24 are invoked when all the secondary seeds have been considered and, hence, no more merging is possible, in order to identify detections of faults not in F H (i.e., in F − F H).
Lines 14-24 are skipped in Epoch II since all the remaining faults are placed in the given sublist F R. First, a bit that still has an unspecified value in t i is randomly selected (line 14), fixed to 0 (line 15), and simulated for faults not in F H (line 16). Then, in lines 17 and 18, bit fixing and fault simulation are repeated for value 1. Based on which bit fixing detects more faults, t i is updated accordingly (lines 19 and 20 for 0, lines 22 and 23 for 1), and a list of faults F C accumulates all the coincidentally detected faults (line 21 for 0 and line 24 for 1). All these faults will be dropped from the global fault list F at the end of this process (line 26). Finally, the obtained test t i is inserted in a test set for core k (line 25) that contains tests to be placed in the output test set of the epoch, i.e., $T_H = \bigcup_{k=0}^{m-1} T^{k}_{PF}$, the union of the per-core test sets T k PF. Fig. 3 presents the dynamic test merging procedure with an example. First, t 3 is selected since it detects the most faults (11) among the seeds in T PF (top-left table in Fig. 3). Next, the Hamming distances between t 3 and all other seeds in T PF are calculated as the sum of the bitwise distances per bit pair indicated in column 3 of the bottom tables. For example, the Hamming distance between t 3 and t 5 is 1 + 0 + 0 + 1 + ∞ + 0 + ∞ + 1 + 0 + 0 = ∞, indicating that no merging between them is possible. The Hamming distances between t 3 and all other seeds in T PF are saved in P i (shown under the P i column of the first iteration in the top table). These values indicate that the best seed to be merged with t 3 is either t 1 or t 2, with the former selected. Merging of t 3 and t 1 is shown in the T PF column under the first iteration (changed bits are shown in red) and follows the rules in the bottom tables of Fig. 3. During the second iteration, the Hamming distances are recalculated in P i. Observe that t 2 and t 6 are now incompatible with t 3 after its merging with t 1. Seed t 4 is selected to be merged with the current seed.
The resulting seed has no compatibility with the remaining seeds, and hence, no further merging is possible. During Epoch I, bit fixing together with fault simulation follows the process shown here to detect coincidental faults. When all unspecified bits have been fixed and fault simulated, the test is advanced to the output test set T k PF , and all the corresponding faults detected are dropped from the global fault list.
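The distance-driven merging of the example above can be illustrated with a minimal sketch. It assumes seeds are strings over '0', '1', and 'X', and that the per-bit rules of Fig. 3 (not reproduced in the text) assign 0 to identical symbols, 1 to a specified bit paired with an X, and ∞ to conflicting specified bits; the function names are hypothetical and the loop omits the range partitioning, fault dropping, and bit fixing of Algorithm 1.

```python
import math

def bit_distance(a, b):
    """Per-bit distance between two seed symbols ('0', '1', 'X')."""
    if a == b:
        return 0          # identical symbols, including X/X (assumed rule)
    if a == "X" or b == "X":
        return 1          # merging would specify one more bit (assumed rule)
    return math.inf       # 0 against 1: the seeds conflict

def hamming(s, t):
    """Sum of per-bit distances; infinite when the seeds are incompatible."""
    return sum(bit_distance(a, b) for a, b in zip(s, t))

def merge(s, t):
    """Merge two compatible seeds: a specified bit overrides an X."""
    assert hamming(s, t) != math.inf
    return "".join(b if a == "X" else a for a, b in zip(s, t))

def merge_greedily(primary, candidates):
    """Repeatedly merge the closest compatible candidate into the primary seed."""
    remaining = list(candidates)
    while remaining:
        best = min(remaining, key=lambda t: hamming(primary, t))
        if hamming(primary, best) == math.inf:
            break         # nothing left that is compatible with the primary
        primary = merge(primary, best)
        remaining.remove(best)
    return primary, remaining
```

Note that distances are recomputed against the updated primary seed in every iteration, mirroring the recalculation of P i in the second iteration of the example.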
B. Parallelization Optimizations
1) Detection-Based Primary Test Selection: In the merging step of Algorithm 1, test selection is very important for the efficient evolution of merging since it sets the constraints for the subsequent merging iterations and fault simulation. Practice in ATPG suggests that early fault dropping plays a more important role than having fewer constraints (more unspecified bits) in the test seed. For this reason, the primary test t i during dynamic merging (merging seed) is selected based on its number of detected faults in d i. Recall that the fault simulation process performed at the end of the first step of the test epoch (Fig. 2) does not drop faults; instead, it is used as a metric for this selection during the second step. Tests to be merged (with the primary test) are then selected based on their Hamming distance to the primary test. In the fairly common case where more than one test has the same Hamming distance to the primary test, their fault detection metric (number of coincidentally detected faults) is used to decide which test will be merged. This optimization greatly assists in dynamic workload balancing and minimization of unnecessary work since early fault dropping reduces the faults for which explicit TG is needed.
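The two selection criteria can be sketched as follows. This is an illustrative reading of the rule, assuming seeds are strings over '0', '1', and 'X' and that `detections[i]` is the set of faults detected by seed i (the list d i); all function names are hypothetical, and the per-bit costs (1 for a specified bit against X, ∞ for a conflict) are assumed.

```python
import math

def hamming(s, t):
    """Distance between two seeds; infinite if any specified bits conflict."""
    total = 0
    for a, b in zip(s, t):
        if a == b:
            continue
        if "X" in (a, b):
            total += 1
        else:
            return math.inf
    return total

def pick_primary(seeds, detections):
    """Index of the seed that detects the most faults."""
    return max(range(len(seeds)), key=lambda i: len(detections[i]))

def pick_secondary(primary_idx, seeds, detections):
    """Closest compatible seed; ties broken by the larger detection count."""
    best = None
    for j, t in enumerate(seeds):
        if j == primary_idx:
            continue
        d = hamming(seeds[primary_idx], t)
        if d == math.inf:
            continue                       # incompatible, cannot be merged
        key = (d, -len(detections[j]))     # smaller distance, then more detections
        if best is None or key < best[0]:
            best = (key, j)
    return None if best is None else best[1]
```

In the test data below, two candidates are at the same distance from the primary seed, so the one with the larger detection list is preferred.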
2) Balanced Workload Distribution: Distribution of workload to the available cores can significantly impact the speedup of a parallel methodology. The TG and fault simulation processes have unpredictable execution times due to the nature of the problems and fault dropping. The core idle time is minimized by dynamically selecting: 1) the next fault to be targeted in seed-based TG in each epoch (Fig. 2); 2) the next test to be used as primary in test merging (Fig. 2); and 3) the tests to be merged with the primary test seed (Algorithm 1). Since the fault list and the test seeds are stored in shared memory, they are easily accessible by all cores, which allows each core to determine promptly which workload to select at each step and in each optimization mechanism of the approach.
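The dynamic selection of the next fault can be pictured as workers pulling from a shared list, so that a core that finishes early simply takes more work. The sketch below illustrates only this scheduling pattern; `run_workers` and the `generate_seed` callback are hypothetical names, and Python threads are used for brevity even though they do not provide true parallel execution of CPU-bound code.

```python
import threading

def run_workers(fault_list, num_cores, generate_seed):
    """Each worker repeatedly takes the next untargeted fault from a shared list."""
    lock = threading.Lock()
    pending = list(fault_list)     # shared list of faults not yet targeted
    results = {}

    def worker():
        while True:
            with lock:             # short critical section: claim one fault
                if not pending:
                    return
                fault = pending.pop(0)
            seed = generate_seed(fault)   # long-running work happens outside the lock
            with lock:
                results[fault] = seed

    threads = [threading.Thread(target=worker) for _ in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because a fault is removed from the shared list before it is processed, no two workers ever target the same fault, regardless of how long each individual generation takes.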
3) Scalable Parallel Fault Simulation: Fault simulation is used in many cases in the proposed methodology and, thus, its performance significantly affects the overall performance. Specifically, fault simulation is used on two separate occasions in each epoch, as follows.
1) To find the number of detected faults per test seed without fault dropping. This information is given as input to the merging procedure in order to avoid unnecessary simulations after each merging.
2) At the end of the merging step in order to detect as many coincidental faults as possible and, hence, minimize the test set size.
In case 1), the fault simulation is performed after test seeds have been generated for all faults. Since generation in the various cores is executed independently, the cores finish this step at different times, resulting in idle cores. These cores can be utilized for fault simulation in a parallel fashion. The challenge here is that the number of idle cores changes (increases) as more cores finish and, hence, the simulation should utilize them as well. To take full advantage of the idle cores, we have fully incorporated the highly scalable parallel fault simulation of [28]. This fault simulator has been shown to provide linear speedup as the number of cores increases and can be dynamically adjusted to the number of available cores.
In case 2), the fault simulation should proceed within one core since, after the merging step, the obtained test is simulated to identify coincidental fault detections (see Fig. 2). Recall that in this step, the test seeds are dynamically acquired by cores from the shared fault list and merged with other seeds until no more merging is possible. Hence, fault simulation cannot run in parallel to get maximum benefit. Nevertheless, the fault simulations exploit bit-parallel simulation principles presented in [28], where many faults are simulated with a single circuit traversal. This results in a considerable speedup of this step.
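The idea of simulating many faults in one circuit traversal can be illustrated with a generic bit-parallel sketch. This is not the simulator of [28]; it assumes a small hypothetical circuit y = (a AND b) OR c with internal line n = a AND b, and packs one faulty machine per bit of an integer word, with bit 0 reserved for the fault-free machine.

```python
def simulate(pattern, faults):
    """Return the stuck-at faults detected at output y of y = (a AND b) OR c.

    pattern: dict mapping inputs 'a', 'b', 'c' to 0 or 1.
    faults:  list of (line, stuck_value) with line in {'a', 'b', 'c', 'n', 'y'}.
    """
    width = len(faults) + 1            # bit 0 is the fault-free machine
    ones = (1 << width) - 1

    def inject(line, word):
        # Force the stuck value on `line` in the machine that carries that fault.
        for j, (fault_line, stuck) in enumerate(faults, start=1):
            if fault_line == line:
                word = word | (1 << j) if stuck else word & ~(1 << j)
        return word & ones

    val = {}
    for name in ("a", "b", "c"):
        val[name] = inject(name, ones if pattern[name] else 0)
    val["n"] = inject("n", val["a"] & val["b"])   # one AND evaluates all machines
    val["y"] = inject("y", val["n"] | val["c"])   # one OR evaluates all machines

    good = ones if (val["y"] & 1) else 0          # broadcast the fault-free output
    diff = val["y"] ^ good                        # set bits mark detected faults
    return [f for j, f in enumerate(faults, start=1) if (diff >> j) & 1]
```

Each gate is evaluated once with word-wide logic operations, so the cost of the traversal is independent of the number of faults that fit in a word.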
4) Test Set Private Consideration: The search for the best candidate tests to be merged involves high interaction of each core with the shared memory. Specifically, selecting the primary test and computing the pairwise compatibilities with the remaining tests in T PF inherently involves memory contention since all cores are searching T PF. This issue is addressed by dynamically partitioning T PF into m private subsets (m being the number of available cores), one for each core. Each core can only select tests from its own private subset of T PF (and the corresponding D PF), which can be safely moved to its own private cache. This implicitly minimizes concurrent memory access requests from different cores that can result in inefficient memory utilization due to memory contention. Moreover, it implicitly minimizes duplication of work as each core considers a distinct part of T PF. When a core finishes the merging process within its private part of T PF, it is allowed to work on the entire set in order to ensure workload balancing. At this point, concurrent memory accesses can occur; however, their impact is minimal as the bulk of the merging process has already occurred during the private consideration and, hence, the size of T PF is by this point significantly reduced.
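The private subsets correspond to the index ranges k × (|T PF|/m), . . . , (k + 1) × (|T PF|/m) − 1 used by Algorithm 1. A minimal sketch of this partitioning is given below; the function name is hypothetical, and assigning the leftover seeds to the last core when m does not divide |T PF| is an assumption, since the text does not specify how the remainder is handled.

```python
def private_range(k, num_seeds, m):
    """Indices of T_PF that core k (0-based, out of m) treats as private."""
    size = num_seeds // m
    start = k * size
    # Assumption: the last core also takes the remainder of the division.
    end = num_seeds if k == m - 1 else (k + 1) * size
    return range(start, end)
```

The ranges are disjoint and together cover every seed index, so no seed can be chosen as a primary seed by two cores during the private phase.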
5) Test Provisional Marking: During compatibility merging, the list P i, which holds pairwise compatibilities between tests, requires updating after each merging. This updating is highly demanding in processing resources as it is of cubic complexity in the worst case. To avoid this issue, the proposed methodology calculates and ranks compatibilities only once for each test t i. If a test t j is selected to be merged with t i, it is provisionally marked in T PF so that it is not merged in another core, explicitly avoiding imposing unnecessary constraints in another thread that performs merging. If compatibility between t i and a test t j in P i is invalidated by a previous merging, merging between t i and t j is not completed and the provisional marking is cleared. Otherwise, provisional marking indicates permanent discarding of t j from T PF.
6) Shared Memory Contention Avoidance: Access to the shared memory must be efficient and well-targeted in order to avoid memory contention. The proposed method accesses the shared resources judiciously in the following ways.
1) During the seed-based TG and fault simulation phases, shared memory access is avoided since cores work independently on a test seed basis. However, during the dynamic merging phase, shared memory access can affect the efficiency and the quality of the TG method. Appropriate bookkeeping with the shared fault list F ensures that no test is simulated twice for the same fault and that faults will be dropped immediately after they are detected.
2) Due to the dynamic nature of the proposed method, the best candidate test seeds would be attractive to many cores. During the merging phase, cores initially work independently on their own private space for the seeds assigned to them (T PF), and updating of the shared memory is only done at the end (test set private consideration).
3) During the dynamic merging phase, pairwise compatibilities between test seeds are calculated once per test seed for T PF (stored in P i). In this case, shared memory access is not necessary since all of T PF belongs to the private memory space of the cores. Shared memory is only accessed toward the end of processing P i when very few tests are left to be considered.
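The provisional marking of item 5) can be sketched as a small three-state protocol. The class and function names below are hypothetical, seeds are assumed to be strings over '0', '1', and 'X', and a single lock stands in for whatever synchronization primitive an actual implementation would use.

```python
import threading

class SeedPool:
    """Shared seed set in which a seed can be free, provisionally marked, or discarded."""
    FREE, PROVISIONAL, DISCARDED = range(3)

    def __init__(self, seeds):
        self.seeds = list(seeds)
        self.state = [self.FREE] * len(self.seeds)
        self._lock = threading.Lock()

    def try_mark(self, j):
        """Claim seed j for merging; fails if another core already holds it."""
        with self._lock:
            if self.state[j] != self.FREE:
                return False
            self.state[j] = self.PROVISIONAL
            return True

    def resolve(self, j, merged):
        """Discard seed j permanently after a merge, otherwise clear the mark."""
        with self._lock:
            self.state[j] = self.DISCARDED if merged else self.FREE

def compatible(s, t):
    return all(a == b or "X" in (a, b) for a, b in zip(s, t))

def attempt_merge(pool, primary, j):
    """Merge seed j into `primary` if it is still free and still compatible."""
    if not pool.try_mark(j):
        return primary
    ok = compatible(primary, pool.seeds[j])   # earlier merges may have invalidated it
    if ok:
        primary = "".join(b if a == "X" else a
                          for a, b in zip(primary, pool.seeds[j]))
    pool.resolve(j, ok)
    return primary
```

The compatibility check is repeated only for the one seed about to be merged, which avoids recomputing the whole list P i after every merge.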
IV. n-DETECT PARALLELIZATION METHODOLOGY
The proposed methodology can be easily extended to consider any linear fault model. In this section, we extend the parallel framework to generate test sets with multiple detections for each fault, known as n-detect test sets.
The main challenge here is to ensure the n-detect property. The challenge applies to both the seed generation and the seed merging processes. Test seed generation should guarantee the generation of n different seeds per fault in the given fault list (F H in Epoch I and F R in Epoch II). Furthermore, in order to ensure the high quality of the resulting test set, test seeds should differ significantly since this was shown to detect more defects [16], [21]-[23]. In addition, the merging process should be constrained in order to guarantee that the seeds generated for the same fault are not merged into the same final test, which would reduce the number of detections for that fault. In Section IV-A, we propose a TG technique that satisfies these requirements and can be incorporated in the framework presented in Section II. Section IV-B describes a technique for partitioning test seeds that ensures the n-detect property and maintains speedup increase as the number of cores increases. The steps of the two-epoch methodology proposed in Sections II and III remain the same.
A. Multiple Test Seed Generation
The PODEM-based TG of Section III-A is extended to produce multiple (n) test seeds per considered fault. The method rolls back in the decision tree of the algorithm altering taken decisions to ensure n incompatible test seeds are generated. Procedure 2 shows the four-step process invoked for each fault.
Hence, following the TG of the first test seed for a fault [step 1)], the process looks back at the various decisions made during this initial seed generation [step 2)]. A decision here refers to alternative circuit path segments that the algorithm selects in order to generate the seeds. For example, in the circuit of Fig. 4, TG for fault f x Sa0 is presented. In order to enforce value 1 to f x (activate fault), we need at least a 1 at any of the OR gate inputs (i.e., a, b, and c). For the first seed, the algorithm assigns value 1 at input c and justifies it with a backward traversal on the circuit. For the second seed, the algorithm takes the alternative decision assigning 1 at line a instead of c. Similar decisions are taken for the propagation of the fault effect to a primary output. The closer the decision to be altered is to the fault site, the higher the difference will be between the seeds. In step 2), a decision tree is progressively constructed to record all the taken decisions and their alternatives. Fig. 5 shows the decision tree for the activation of the Sa0 fault at line f x. The different paths of the tree indicate the different decision combinations for the primary test seed and for the alternative (n) seeds for the specific fault under investigation.
Algorithm 2 Multiple Test Seed Generation
In step 3), each generated seed is checked for compatibility with previously generated seeds (compatible seeds have no conflicting bits). For each pair of compatible seeds, only one is kept, namely the one with fewer specified bits, which allows more room for merging. Discarding compatible seeds ensures the n-detect property of the final test set throughout the merging process of the methodology. Specifically, it prevents the generation of the same final test two or more times, which would reduce the number of detections for some faults below n.
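The compatibility filter of step 3) can be sketched as follows, assuming seeds are strings over '0', '1', and 'X'. The function names are hypothetical, and keeping the existing seed when both have the same number of specified bits is an assumption, since the text does not state a tie rule.

```python
def compatible(s, t):
    """Two seeds are compatible when no position holds conflicting specified bits."""
    return all(a == b or "X" in (a, b) for a, b in zip(s, t))

def specified(s):
    """Number of specified (non-X) bits in a seed."""
    return sum(c != "X" for c in s)

def add_seed(kept, new):
    """Add `new` to the seeds kept for one fault, keeping them pairwise incompatible."""
    rivals = [s for s in kept if compatible(s, new)]
    if any(specified(s) <= specified(new) for s in rivals):
        return kept                    # a rival with no more specified bits is kept instead
    return [s for s in kept if s not in rivals] + [new]
```

Applied to the seeds of the worked example in this section, the second generated seed 100XXXXX is discarded because it is compatible with XXXXX1X1 and has more specified bits, whereas XXX1X001 is kept because it conflicts with XXXXX1X1 in one bit.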
Step 4) of Procedure 2 is invoked only when all the decision combinations for a fault have been exhausted and only when fewer than n seeds have been generated. In this step, the seeds with the fewer specified bits are modified by specifying bits to conflicting values, to derive two or more incompatible seeds. The output of this process is n different sets of seeds T PF 1 , T PF 2 , . . . , T PF n each containing one seed generated per fault together with the corresponding number of faults detected by each seed saved in D PF 1 , D PF 2 , . . . , D PF n . These sets consist of the input to the following part of the methodology.
We explain this modified seed generation with a comprehensive example summarized in Fig. 6. Consider again the circuit of Fig. 4 and assume seed generation for fault f_x stuck-at-0. The values inside parentheses denote the controllability values for 0 and 1, respectively [29]. The generation algorithm selects the line with the smallest controllability metric for logic value 1, i.e., line c, to receive value 1. In order to justify c = 1, the algorithm makes another decision, i.e., i_6 = 1, which directly implies i_8 = 1. In the decision tree of Fig. 5, all direct implications of each decision are shown in a dashed outline next to the decision node. Hence, the seed XXXXX1X1 is generated by taking the leftmost path of the tree in Fig. 5 [step 1) of Procedure 2]. This step is shown in row 1 of Fig. 6. Next, the decision closest to the fault is altered, and the input with the next smallest controllability is selected, i.e., a = 1. This decision's direct implications (i_1 = 1, i_2 = i_3 = 0) generate seed 100XXXXX [step 2) of Procedure 2], which, however, is discarded since it is compatible with the previous one [step 3) of Procedure 2], as shown in row 2 of Fig. 6. In the same way, the next decision is taken, and a new seed XXX1X001 (seed #2) is generated, which is not discarded as it contains a bit conflicting with seed #1 (row 3 in Fig. 6). The process goes on until n different seeds are generated or until all decision combinations have been examined. For the specific example and for n = 5, steps 1)-3) have produced only two different seeds and, hence, step 4) is necessary. After selecting the seed with the fewest specified bits, seed #1 is replaced by two other seeds, one setting its seventh bit (chosen randomly) to value 0 and one setting the same bit to value 1 (indicated with green and red in column 5 of Fig. 6). This process is repeated two more times to generate two more tests (columns 7 and 9 in Fig. 6), i.e., until five different seeds are generated.
Step 4) of Procedure 2 is guaranteed to produce distinct seeds, since the previous steps allow no compatible seeds to reach step 4) and the bit-fixing process ensures conflicting bits in the obtained seeds. Each of these seeds is placed in a different seed set T_PF_i.
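The bit-fixing idea of step 4) can be sketched as follows, again assuming seeds are strings over {0, 1, X}. The helper names and the use of a seeded random generator are illustrative assumptions rather than part of Procedure 2. Each split replaces one seed by two refinements that differ in the fixed bit, so a pairwise-incompatible input remains pairwise incompatible.

```python
import random


def compatible(a: str, b: str) -> bool:
    """True when the two seeds have no conflicting specified bits."""
    return all(x == y or "X" in (x, y) for x, y in zip(a, b))


def specified_bits(seed: str) -> int:
    return sum(ch != "X" for ch in seed)


def split_seed(seed: str, bit: int) -> tuple[str, str]:
    """Fix one unspecified position to 0 and to 1, giving two conflicting seeds."""
    if seed[bit] != "X":
        raise ValueError("only an unspecified bit can be fixed")
    return seed[:bit] + "0" + seed[bit + 1:], seed[:bit] + "1" + seed[bit + 1:]


def expand_to_n(seeds: list[str], n: int, rng: random.Random) -> list[str]:
    """Split the seed with the fewest specified bits until n seeds exist."""
    seeds = list(seeds)
    while len(seeds) < n:
        candidates = [s for s in seeds if "X" in s]
        if not candidates:  # nothing left to split (assumed fallback)
            break
        target = min(candidates, key=specified_bits)
        free = [i for i, ch in enumerate(target) if ch == "X"]
        seeds.remove(target)
        seeds.extend(split_seed(target, rng.choice(free)))
    return seeds


if __name__ == "__main__":
    print(split_seed("XXXXX1X1", 6))  # seventh bit, as in the example of Fig. 6
    print(expand_to_n(["XXXXX1X1", "XXX1X001"], 5, random.Random(0)))
```

Starting from the two seeds of the running example, three splits are needed to reach n = 5, which agrees with the one initial replacement plus two repetitions described for Fig. 6. The particular bits chosen depend on the random generator and will generally differ from those in the figure.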
The proposed multiple-seed generation process drastically simplifies the merging process since it needs to consider far fewer constraints among seeds. The procedure is also efficient since the n different seeds are generated in an incremental TG process, which is much faster than n independent seed generations. For this reason, the generation of multiple seeds for the same fault is not executed in parallel in the proposed method.
B. Clustered Dynamic Seed Merging
When n seeds have been generated per modeled fault, the methodology proceeds to the dynamic merging phase (as in Fig. 2). Since the n-detect property is preserved by the seed generation process (at least one conflicting bit between seeds of the same fault), the main challenge in this step is to ensure that the merging will not affect the speedup scalability of the methodology. Leaving the merging process identical to the one presented in Algorithm 1 reduces the obtained speedup by 10%-25%. This observation comes from experiments that are not presented here due to space limitations.
To mitigate the speedup reduction, the parallel framework has been extended to a cluster-based approach where the available cores are partitioned into n clusters. Cluster i explicitly targets dynamic merging for a subset of seeds (i.e., T_PF_i) obtained from the seed generation of Section IV-A. Fig. 7 presents the main components of the cluster-based dynamic merging procedure. Inside each cluster, the dynamic merging procedure is identical to that described in Algorithm 1; yet, the number of calculations is significantly smaller than in the nonclustered merging, as there are fewer seeds to compare pairwise. While each cluster operates on its own T_PF_i, fault dropping is performed via the global fault list located in the shared memory. Hence, any fault detection due to seed merging or identified during the subsequent fault simulation (lines 13, 21, and 24 of Algorithm 1) updates the global fault list, reducing the number of desired detections by one per obtained test. When this number becomes zero, the corresponding fault is completely dropped from further consideration. This n-detect aware global fault dropping ensures that merging will not continue to consider seeds corresponding to dropped faults.
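The n-detect aware global fault dropping can be pictured with a small sketch of a shared counter table. The class and method names are hypothetical, and a single lock is used here only for simplicity; the synchronization actually used in the implementation is not described in the text.

```python
import threading


class GlobalFaultList:
    """Shared table of remaining desired detections per fault (illustrative)."""

    def __init__(self, faults, n: int):
        self._remaining = {fault: n for fault in faults}
        self._lock = threading.Lock()  # assumption: one coarse lock

    def record_detection(self, fault) -> bool:
        """Count one detecting test; return True if this call dropped the fault."""
        with self._lock:
            left = self._remaining.get(fault)
            if left is None:  # already dropped by some cluster
                return False
            if left == 1:
                del self._remaining[fault]
                return True
            self._remaining[fault] = left - 1
            return False

    def is_active(self, fault) -> bool:
        """A cluster would skip seeds whose fault is no longer active."""
        with self._lock:
            return fault in self._remaining


if __name__ == "__main__":
    faults = GlobalFaultList(["f_x", "f_y"], n=5)
    workers = [
        threading.Thread(target=faults.record_detection, args=("f_x",))
        for _ in range(5)  # one detection reported by each of five clusters
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(faults.is_active("f_x"), faults.is_active("f_y"))
```

Because all clusters decrement the same counter, a fault that has reached its n detections in any combination of clusters is no longer considered by the remaining ones.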
The cluster-based dynamic merging produces n independent test sets with complete coverage of the given faults. These n sets are combined into a unified test set by eliminating duplicate tests to provide the final n-detect test set, as shown in the rightmost part of Fig. 7. This elimination does not affect the n-detect property of the obtained set since, by construction, the methodology prevents the generation of the same test two or more times when explicitly targeting the same fault. In other words, if two tests are identical (each coming from a separate cluster), they have been generated by merging seeds corresponding to different faults. This holds because seeds for the same fault are generated to contain at least one conflicting bit.
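The final combination step amounts to an order-preserving union of the per-cluster test sets. The sketch below assumes tests are fully specified bit strings; the function name and the example vectors are hypothetical.

```python
def unify_test_sets(cluster_test_sets):
    """Concatenate the per-cluster test sets, keeping the first copy of each test."""
    seen = set()
    unified = []
    for tests in cluster_test_sets:
        for test in tests:
            if test not in seen:
                seen.add(test)
                unified.append(test)
    return unified


if __name__ == "__main__":
    cluster_1 = ["01010011", "11000101"]
    cluster_2 = ["11000101", "00111100"]  # shares one test with cluster_1
    print(unify_test_sets([cluster_1, cluster_2]))
```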
V. EXPERIMENTAL RESULTS
The proposed method was implemented in C++ and run on a 20-core (×2 threads) Intel Xeon CPU E5-2670v2 with 98 GB of RAM, running Linux. The OpenMP parallel programming framework was used for parallelization. We present results for the larger full-scan versions of the circuits in the IWLS'05 benchmark suite. The method can be applied to any linear fault model; here, we present results for the stuck-at fault model. Table I presents the obtained speedup and test set sizes (as increase %) of the proposed parallel ATPG method compared with a serial version of the algorithm. The speedup measure allows for the evaluation of the scalability of the approach under different execution setups, and for a fair comparison with other works considering the same architecture but with different characteristics, such as CPU clock and total memory. Results from experimental setups with 8, 12, 16, 20, and 40 cores are reported. Forty cores are obtained by enabling hyperthreading in the 20-physical-core system used. After the circuit name and the number of inputs (columns 1 and 2), the size of the circuit and the number of faults in the collapsed fault list (columns 3 and 4) are presented. Columns 5 and 6 report the number of aborted faults (indicating the achieved fault coverage) and the test set size obtained by a serial version of the proposed methodology, respectively. The number of aborted faults in the multicore execution setups is always smaller than the one reported in column 5 (hence, fault coverage is at least as high) and is not reported here due to space limitations. Columns 7, 9, 11, 13, and 15 list the test set size increase as a percentage of the one obtained by the serial execution, and columns 8, 10, 12, 14, and 16 report the speedup achieved when 8, 12, 16, 20, and 40 cores are employed.
TABLE I SPEEDUP AND TEST SET INCREASE RESULTS FOR THE PROPOSED METHOD USING 8, 12, 16, 20, AND 40 CORES
Fig. 8. Speedup comparison with [9] and [13] for an eight-core setup.
A. Single Detection Parallel Test Generation
The obtained results demonstrate an almost linear speedup increase while, at the same time, the test set size increase is very limited for most of the circuits. In the worst case, the test set increase is no more than ∼20%, whereas in the average case, it is only 9.10% for 40 cores. Circuit s35932 is an exception (with a 25% increase) due to the very small test set of the serial version, with 20 tests, which becomes 25 tests for 40 cores. The proposed method also exhibits a small memory increase, an objective often targeted by parallel methods. The last row of Table I summarizes the average memory increase among all circuits. For the 20-core run, the required memory is only 3.27× that of the serial version, while the 40-core run requires 3.77× on average. These numbers indicate that the memory increase does not grow proportionally with the number of cores; rather, it grows at a decreasing rate as the number of cores increases. This is attributed to the dynamic manner in which the proposed algorithm performs fault dropping, relieving the following steps of unnecessary calculations.
The rest of this section compares the proposed methodology with the most relevant and recent parallel approaches considering the shared memory multicore architecture model, such as [9]-[11] and [13]. Where available, results are compared directly for the common benchmarks. Moreover, results on additional benchmarks for each technique are listed, and average trends for each methodology are analyzed. For the proposed methodology, the larger (in terms of number of nodes) circuits (b18, b19, and Ethernet) are listed on top of the common ones. The number of execution cores in each case was determined by the results reported in these works. Columns 2-7 of Table II provide a comparison with [10] for the setup of 12 cores. Similarly, columns 8-16 of Table II compare the proposed methodology with the approaches of [11] and [13], for the 16-core setup reported in these works. A "-" indicates no available results for the corresponding work. Average trends are reported in the last row of Table II. The proposed methodology reports higher speedup on all circuits, with an average of 10.2× compared with the average speedup of 7.67× reported in [10]. At the same time, the proposed methodology reports a lower test set increase than [10] in all circuits except s35932. For the largest common circuits reported (b15 and b17), the proposed methodology reports a negligible test set increase (less than 1%) and considerable speedup (>11×), while [10] reports speedup around 8× and a much higher test set increase (18% and 15%, respectively). For completeness, column 7 reports the memory increase factor for the proposed method, although this is not reported in [10]. For the works in [11] and [13], the comparison is performed solely on average trends, as none of the industrial circuits D1-D7 is available to us.
Clearly, the proposed methodology exhibits higher average speedup than both approaches (13.25× versus 8.54× and 9.76×) with a small additional overhead in average test set size increase, on the order of ∼1% (5.97% versus 4.85% and 4.91%). The memory increase factor of the proposed method is on average smaller than that of [11] and slightly higher than that of [13], which explicitly targets memory utilization.
An eight-core system is considered in [9] and compared with an implementation of [13]. Fig. 8 compares the speedup of the proposed methodology with that of [9] and [13], as reported in [9]. The proposed methodology achieves, on average, ∼7× speedup, outperforming both existing methods for the four common circuits. This is achieved mainly due to the optimizations (discussed in Section III), which minimize redundant work by immediately updating the fault status. A comparison regarding the test set size is not possible with [9], as absolute values for test set sizes and the corresponding fault coverage are not provided.
B. Parallel n-Detect Test Generation
A comprehensive picture of the linear performance of the proposed n-detect TG extension for n = 5 is presented in Figs. 9 and 10. The scalability of the technique is shown in Fig. 9 compared with a serial execution of the algorithm. The achieved speedup is reported on the y-axis for the different numbers of cores utilized (shown on the x-axis). For all circuits examined, the proposed methodology scales linearly as the number of cores used for its computation increases. Moreover, observe that for larger circuits the obtained speedup is higher due to the increased workload, which allows a smaller percentage of core idle time. As with the single-detect case, the results imply that the proposed method will continue to scale well as the number of cores is increased. In fact, while the speedup trends are similar to those of the single-detect case, in absolute values the speedup is higher in n-detect and, in some cases, close to the theoretical maximum (number of utilized cores). For example, two of the largest circuits considered, b18 and Ethernet, achieve ∼19× speedup for 20 cores, ∼28× speedup for 30 cores, and ∼39× speedup when 40 cores are utilized. The corresponding numbers for single detect are ∼15×, ∼22×, and ∼29×. This is mainly attributed to the cluster-based approach, which dramatically reduces the number of pairwise condition checks during merging. The checks are restricted to the cluster to which the seed to be merged was assigned. This possibility cannot be exploited in a serial approach in a straightforward manner without a significant effect on either the performance or the final test set size.
Fig. 9. Scalability of the proposed parallel n-detect TG with respect to a serial execution, for n = 5.
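A rough count illustrates why restricting checks to a cluster helps. The sketch below assumes, purely for illustration, that merging compares every pair of seeds once and that no fault dropping occurs; the actual number of checks in the dynamic merging procedure depends on fault dropping and is not given in the text. With F faults and n seeds per fault, the unclustered case has nF seeds in one pool, while the clustered case has n pools of F seeds each.

```python
from math import comb


def unclustered_pairs(num_faults: int, n: int) -> int:
    """All-pairs checks when the n * F seeds share one pool (assumed model)."""
    return comb(n * num_faults, 2)


def clustered_pairs(num_faults: int, n: int) -> int:
    """All-pairs checks when each of the n clusters holds F seeds (assumed model)."""
    return n * comb(num_faults, 2)


if __name__ == "__main__":
    faults, n = 1000, 5  # hypothetical sizes, not taken from the benchmarks
    single_pool = unclustered_pairs(faults, n)
    per_cluster = clustered_pairs(faults, n)
    print(single_pool, per_cluster, single_pool / per_cluster)
```

Under these assumptions the number of candidate pairs drops by a factor of roughly n, since (nF)(nF - 1)/2 divided by nF(F - 1)/2 equals (nF - 1)/(F - 1).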
Similar to the results presented in Section V-A, the parallelization procedure affects the final test set size. Moreover, for the n-detect approach, due to the cluster-based merging, the solution is obtained from a reduced search space. Hence, it is expected, in many cases, to provide a suboptimal solution that returns tests detecting fewer faults. When fewer faults are detected per test, it is inevitable that the final test set will be larger (test set inflation). Fig. 10 shows the test set size increase as the number of cores used is increased for the examined benchmark circuits. The increase is reported as a percentage of the test set size obtained by a serial execution of the same algorithm. Contrary to this expectation, the test inflation for the n-detect method is not significant. In particular, for the five-detect TG, the test inflation is on average 5% for 20 cores, 7.5% for 30 cores, and 9.5% for 40 cores, whereas the corresponding numbers for the single-detect case are 7%, 8.5%, and 9%. Table III provides the exact data for the test set size and speedup achieved by the proposed parallel n-detect TG method for n = 5 when 20, 30, and 40 cores are employed. All test sets provide 100% five-detect fault detection efficiency, i.e., all modeled faults are detected at least five times. The test set sizes are presented as % increase over the serial version of the same algorithm (shown in column 2). Five-detect fault coverage is achieved due to the multiple test seed generation procedure, which enforces the generation of five different test seeds. Columns 3-5 show the % test set size increase for 20, 30, and 40 cores, respectively. Columns 6-8 report the achieved speedup when 20, 30, and 40 cores are utilized. The obtained results demonstrate that the linear speedup increase and the small test set size increase achieved in the single-detect method are maintained.
TABLE III TEST SET SIZE, TEST SET INCREASE %, AND SPEEDUP FOR FIVE-DETECT PARALLEL ATPG METHOD
The achieved speedup when 20, 30, and 40 cores are employed is on the average higher in the n-detect method. Specifically, it is 17.72×, 26.06×, and 34.92× for 20, 30, and 40 cores compared with 14.82×, 19.96×, and 25.16×, respectively, for the single-detect method.
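The average speedups quoted above can be put on a common scale by dividing each by the number of cores used, which gives the usual parallel efficiency. The short sketch below performs this arithmetic on the reported averages; the function and constant names are illustrative, and the 40-core figures are taken relative to 40 logical cores even though they are obtained through hyperthreading on 20 physical cores.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return speedup / cores


CORES = (20, 30, 40)
FIVE_DETECT_SPEEDUP = (17.72, 26.06, 34.92)    # averages reported for n = 5
SINGLE_DETECT_SPEEDUP = (14.82, 19.96, 25.16)  # averages reported for single detect


def efficiencies(speedups, cores=CORES):
    """Efficiency per core count, rounded to three decimals."""
    return [round(parallel_efficiency(s, c), 3) for s, c in zip(speedups, cores)]


if __name__ == "__main__":
    print("five-detect  :", efficiencies(FIVE_DETECT_SPEEDUP))
    print("single-detect:", efficiencies(SINGLE_DETECT_SPEEDUP))
```

On these averages, the five-detect efficiency stays between roughly 0.87 and 0.89 across the three core counts, whereas the single-detect efficiency falls from about 0.74 at 20 cores to about 0.63 at 40 cores.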
VI. CONCLUSION
In this article, we have proposed a parallel test pattern generation methodology for shared-memory multicore environments. A number of newly proposed heuristics attempt to avoid assigning the same workload to multiple cores, while the distribution of work among the available resources aims to minimize the core idle time. The methodology has also been extended to generate multiple-detect (n-detect) test sets, previously proven to provide higher defect coverage. A new technique for the efficient generation of test seeds, followed by a cluster-based dynamic merging procedure, has been presented. Experimental results have demonstrated high speedup rates that keep increasing as the number of available cores increases. The test set size increase has been shown to be limited and comparable to that of other state-of-the-art parallel approaches. For the n-detect method, the results have retained all the good properties of the basic methodology, to an even more beneficial extent.
