Abstract-Current and future multicore architectures can significantly accelerate the performance of test automation procedures depending on the underlying architecture and the scalability of their algorithms. This paper proposes a new parallel methodology targeting the fault simulation problem, for shared memory multicore systems, that maintains scalability with the increase of the number of cores. The method is based on a simple single thread process that allows focusing on the optimization of the parallelization process in different dimensions. Additionally, a number of optimizations are incorporated in the approach to control fault dropping and to avoid unnecessary work. The reported experimental results, for both random and deterministic test sets, demonstrate the scalability of the method. As the number of cores increases, the reported speed-up increases proportionally, where comparable recent methods report saturation or even reduction of the obtained speed-up.
on-chip memory can be exploited to achieve significant performance improvement of complex and computationally demanding algorithms. However, high processing power and large amounts of memory alone do not guarantee effortless, efficient scalability of previously proposed methodologies. Scalable solutions require revisiting the problem in order to redesign and/or restructure the respective solutions for state-of-the-art architectures. This can be achieved by explicitly targeting effective partitioning of the problem and workload, high utilization of the available on-chip cores, and minimization of duplicate work.
Fault simulation is a fundamental process in test automation that targets the calculation of the fault coverage of a given test set. It is used either as a standalone tool or as part of algorithms developed for relevant problems (e.g., test generation, fault diagnosis, and techniques for fault tolerant design) [9], [10]. As a result, any performance acceleration of the fault simulation process will also contribute to the performance of many existing tools. Multicore architectures offer room for improvement for most of the established serial design and test automation methodologies; however, this improvement depends on the underlying architecture (number of cores and memory availability) as well as on the ability of the proposed approach to scale with respect to the underlying architecture.
Improving the performance of fault simulation via parallelization has been investigated before, either as a stand-alone methodology or as a part of other tools. Early works focus mainly on workload partitioning assisted by message-passing communication in off-chip multiprocessor systems. One of the first attempts was presented in [11], where a hardware accelerator named MARS with 15 processing elements was configured in a pipeline. Its proposed concurrent fault simulation algorithm required a large amount of memory, which made it impractical to use for large designs. In [12], a dynamic, 2-D parallel technique was proposed that extended bit-parallelism of faults to patterns, in order to address large communication delays regarding the faults dropped. The goal was to utilize multiple execution in different processing units by performing multiple fault propagation at the same time. The work of [13] proposed two gate-parallel algorithms for the Connection Machine that employed 16 processing elements and an elaborate routing network where a message could be sent in 12 routing steps. Amin and Vinnakota [14] proposed a fault-disjoint partitioning of the workload in multiple processors (up to 32) based on a static analysis of the distribution of activity in fault simulation. This method performed independent pattern simulations without requiring any communication between processors. The performance of all of the above approaches was limited by the high message delivery delays imposed by the distributed interchip communication model. As a result, communication was intentionally kept minimal and, in essence, the processing elements were operating independently on a part of the entire problem solution space, with communication occurring rarely or even just for the overall solution composition.
Solutions designed for vector processors [13]-[17], as well as solutions designed for distributed architectures, exploit parallelism across three different dimensions: data, algorithmic, and structural. Initial parallelization attempts relied on static partitioning of the fault list and/or the test set to compensate for the high communication cost [3], [4].
The development of multi/many on-chip core architectures necessitates revisiting the most effective (yet computationally intensive) techniques in order to adjust them to these new architectures. Graphics processing units (GPUs) have been exploited for the specific problem of fault simulation, where the existence of many small processing units is utilized to create multiple threads in a single instruction multiple data (SIMD) fashion, offering substantial speed-ups. The work of [2] presented an event-driven logic simulator accelerated by a GPU. When considering gate level netlists, this method achieved an average 3× speed-up over traditional serial simulators. Kochte et al. [18] proposed an algorithm to map many of the optimizations of serial techniques for fault simulation to the SIMD paradigm, achieving a speed-up of 16× with respect to the best serial approaches. Schneider et al. [19], [20] proposed timing-aware fault simulation of small delay faults using a data-parallel approach on a GPU with 2880 processing units. The evaluation of the methods was carried out using 10,240 random input stimuli and showed significant improvement in run-time. Li et al. [21] exploited the GPU's parallel capabilities in order to calculate the n-detect fault coverage of a given test with a single traversal of the circuit's netlist. The reported results on a GPU with 128 stream processors showed time reduction by a factor of 25× when compared to a commercial tool. In [22], a fault simulation process is used to accelerate the calculation of the fault table, which tabulates all fault detections by each test pattern considered. The algorithm matches pattern parallel simulation with the SIMD-based architecture of a GPU, achieving 15× speed-up over the execution of a state-of-the-art serial tool. Li and Hsiao [23] proposed a method to exploit parallelism in three different dimensions, i.e., algorithm, model, and data. At the same time, it minimized communication between the host processor and the GPU by utilizing the device's own memory as much as possible.
On the other hand, recent general-purpose shared-memory on-chip multiprocessors offer a fast, asynchronous, and high capacity memory, in contrast to the SIMD paradigm of GPUs, which can reduce the intercore communication overhead. A small number of such approaches have been proposed for the highly related problem of test pattern generation [1], [8], [24].
For the specific problem of the parallelization of fault simulation, only the recent work of [7] considers general-purpose shared-memory multiprocessors (as we do in this paper). The proposed technique exploits parallelism both using multiple pattern reasoning and compiled computing model distribution. Moreover, it focuses on utilizing powerful techniques for critical path tracing [25] to accelerate fault propagation. However, because the core process of [7] is highly optimized, the exploitation of parallelization techniques is limited. For this reason and in contrast to the method we propose in this paper, [7] fails to maintain speed-up gains when the number of processing cores utilized increases.
In this paper, we consider systems with multiple identical cores (homogeneous), each of which possesses at least one level of private cache and at least one level of shared memory. The proposed methodology consists of three different phases, each distributing the remaining workload using a different rationale, yet all targeting maintenance of the achieved speed-up. Initially, a preprocessing step identifies groups of faults based on the likelihood of being detected by the same test. In the first phase, faults and tests are distributed to the available processing cores in a static way that enables balanced distribution of workload across the cores. Each core independently simulates its group of tests only for the faults assigned to it, followed by a global fault list update that drops detected faults from further consideration. In the second phase, the tests preassigned to a core are simulated for all the remaining undetected faults, immediately broadcasting dropped faults among cores via the shared memory hierarchy. This simulation step utilizes appropriate data structures to explicitly avoid considering the same fault in multiple cores in parallel. At the same time, it makes sure that faults remaining undetected will be simulated for all tests exactly once. After all test patterns assigned to a core are simulated for all faults, new tests, along with their corresponding not yet simulated faults, are acquired from other cores in order to ensure workload balancing. This final third phase of the methodology helps decongest busier cores and avoids having idle cores, hence maintaining similar runtimes among all the cores.
Besides the parallelism across cores described above, the methodology extensively employs bit-level parallelism. Specifically, the microprocessor's data word of width w is utilized in conjunction with bitwise operations in order to simulate w − 1 faults at the same time (the remaining bit is used for the fault-free circuit). Hence, the simulation traversal is accelerated by a factor of w − 1 [9]. Bit-parallelism is exploited in two directions.
1) Multiple faults are simulated using bit-parallel fault simulation.
2) Multiple tests are simulated using bit-parallel true value simulation.
As confirmed by relevant experimentation, the proposed parallelization approach provides high speed-up rates that increase monotonically with the number of available cores.
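As a minimal sketch of bit-parallel fault simulation, the Python fragment below packs the fault-free circuit into bit 0 of a word and one stuck-at fault into each of the remaining w − 1 bits. The toy circuit y = (a AND b) OR c, the word width W = 8, and the helper names `broadcast`, `inject`, and `simulate` are assumptions made purely for illustration; they are not taken from the paper.

```python
W = 8                      # assumed word width for illustration (32 or 64 in practice)
MASK = (1 << W) - 1

def broadcast(bit):
    """Replicate a single logic value across all W bit positions."""
    return MASK if bit else 0

def inject(value, line, faults):
    """Force bit k+1 of `value` for every fault k located on `line`."""
    for k, (fault_line, stuck_at) in enumerate(faults):
        if fault_line == line:
            bit = 1 << (k + 1)           # bit 0 is reserved for the fault-free circuit
            value = (value | bit) if stuck_at else (value & ~bit)
    return value & MASK

def simulate(test, faults):
    """Simulate y = (a AND b) OR c for one test and up to W-1 stuck-at faults."""
    assert len(faults) <= W - 1
    a = inject(broadcast(test["a"]), "a", faults)
    b = inject(broadcast(test["b"]), "b", faults)
    c = inject(broadcast(test["c"]), "c", faults)
    n1 = inject(a & b, "n1", faults)     # one bitwise AND evaluates all copies at once
    y = inject(n1 | c, "y", faults)
    good = broadcast(y & 1)              # fault-free output replicated to all bits
    diff = (y ^ good) & MASK             # a set bit marks a faulty copy that differs
    return [f for k, f in enumerate(faults) if (diff >> (k + 1)) & 1]
```

A single traversal of the circuit therefore evaluates the fault-free copy and all packed faulty copies together, which is where the factor of w − 1 comes from.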
The rest of this paper is organized as follows. Section II reviews the fault simulation process to identify challenges and guidelines for efficient parallelization for the considered architecture. Section III describes in detail the proposed parallel approach. Section IV discusses the experimental results and provides comparisons with relevant state-of-the-art methods. Section V concludes this paper.
II. MOTIVATION AND CONSIDERATIONS FOR PARALLELIZATION
Fault simulation is the problem of identifying the percentage of modeled faults detected when applying a given set of test patterns at the inputs of an integrated circuit. This percentage is known as the fault coverage of the circuit under test considering a fault list (F) generated for a specific fault model. The set of input patterns to be tested, known as the test set (T), can be generated using a deterministic or a random test generation process [9]. In this paper, we consider the specific problem of parallelizing fault simulation under a shared-memory multicore system so that the obtained parallelization speed-up, when compared to a serial version of fault simulation, is maximized.
Parallelization of the fault simulation process is not a straightforward task as it relies, among other things, on appropriate bookkeeping to avoid duplication of work as well as on well-scheduled utilization of the available cores. Automatic parallelization tools and sophisticated compilers cannot provide the desirable speed-up since the problem is not inherently parallelizable. For computationally intensive tasks, problem partitioning is of significant importance. One approach is to rely on standard parallelization tools that partition the problem under consideration following generic partitioning rules. However, this results in locally optimal solutions due to the limited consideration of the problem's entire solution space. This can severely affect the effectiveness of the proposed solution, especially for architectures with a large number of processing cores. Thus, the methodology proposed here explicitly addresses problem decomposition as well as the final result assembly. At the same time, it proposes solutions for two critical processes used to resolve the challenges in partitioning the problem, i.e., fault dropping and workload balancing. This section discusses attributes of these challenges and their effect on the efficiency of the entire parallel fault simulation process.
A. Efficient Fault Dropping
Fault dropping can considerably affect the efficiency of a fault simulation method. For a complete fault simulation method, each test from a given test set (T) must be simulated for every fault in the considered fault list (F). Hence, the well-known worst case complexity of a serial method is O(|T| · (|F| + 1) · N), where N is the number of nodes in the netlist, implying a circuit traversal for each fault-test combination. Detected faults are discarded (dropped) from F and are not taken into further consideration by any of the subsequent tests and, as a result, the actual runtime falls considerably below this bound. However, for parallel fault simulation solutions the effectiveness of fault dropping is not effortlessly maintained. Inefficient parallel solutions can result in two or more tests being simulated for the same fault more than once, increasing in this way the overall workload. Moreover, as opposed to serial solutions, where detected faults are immediately dropped from further consideration, fault dropping in parallel solutions might also be delayed (memory locking and coherency), or tests/faults could be reassigned among cores, disrupting the bookkeeping. When two (or more) tests detect the same fault(s), the potential benefit from fault dropping is small. In this sense, simulating the same fault in parallel should be avoided to the extent possible.
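The effect of fault dropping on the serial workload can be illustrated with a short sketch. The function below counts test-fault simulations when each fault is dropped at its first detection; `detects(t, f)` is a hypothetical oracle standing in for one circuit traversal, and the fault-free traversal per test (the "+ 1" term in the bound above) is deliberately left out of the count.

```python
def serial_fault_simulation(tests, faults, detects):
    """Simulate tests in order and drop each fault at its first detection.

    `detects(t, f)` is a hypothetical oracle standing in for one circuit
    traversal of test t under fault f.
    """
    remaining = list(faults)
    simulations = 0
    for t in tests:
        simulations += len(remaining)          # one traversal per surviving fault
        remaining = [f for f in remaining if not detects(t, f)]
    coverage = 1 - len(remaining) / len(faults)
    return simulations, coverage
```

Without dropping, the count would always be |T| · |F|; with dropping, it shrinks as soon as early tests detect many faults.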
B. Effective Partitioning
In general, parallelization methods follow a three-step approach where the given problem is: 1) appropriately partitioned (problem decomposition); 2) solved in parallel for each partition (parallel execution); and 3) reassembled to form a final solution (solution recomposition). Appropriate bookkeeping for the decomposition and recomposition steps, along with well-scheduled utilization of available cores and shared memory during parallel execution, is very important for the efficiency of the parallel method. Problem decomposition, either by partitioning of faults or by partitioning of tests (or both faults and tests), and distribution among available cores are very popular parallelization methods [6], [26]. Even though static partitioning can contribute to speed-up gain, it is not adequate to provide a scalable solution because of the overhead imposed by the constraint that no test should be simulated for the same fault more than once.
We explain this with the illustration of Fig. 1. The fault simulation search space is represented as a table, where rows correspond to tests to be simulated and columns to the faults considered. Fig. 1(a) shows how it is explored by a typical serial fault simulation procedure that simulates tests one by one for all faults in a predefined order (i.e., t 1 , t 2 , . . . , t 9 ). A complete fault simulation should examine all cells in the table. An x mark indicates that the corresponding test detects the corresponding fault (e.g., t 1 detects f 2 and f 6 ); any additional simulations (and possible detections) of an already detected fault are considered redundant work for a traditional fault simulation process, e.g., where only the calculation of fault coverage is required. Hence, a fault detected by a test is dropped from further consideration (shaded cells). In Fig. 1(b), the faults are distributed across the available cores (fault partitioning) and simulated for all the tests. Thus, each core (represented with a different color) simulates all the tests for a fraction of all faults. While this approach is complete and the total number of cells examined is the same as in Fig. 1(a), it does not scale as expected. Since the percentage of dropped faults is different for the different cores, some cores terminate faster than others. For example, Core 1 becomes idle after 11 simulations in total (nonshaded cells), Core 2 after 12, and Core 3 after 15 simulations. Hence, the total execution time is limited by the slowest core (lowest percentage of dropped faults), as the idle cores' processing power is not utilized. In the example of Fig. 1(a), the total execution time is 38 time units (white cells), assuming a uniform one time unit execution for each test-fault pair examined.
A scalable parallelization with three cores should provide [...]
These additional detections occur since multiple tests are concurrently simulated for the same faults (e.g., f 6 is detected by t 1 , t 4 , and t 9 in all three cores). The total number of simulations is 60 instead of 38 in Fig. 1(a), significantly affecting the expected speed-up. For this example, the execution time is 21 time units (execution time of Cores 1 and 2), resulting in a 1.81× speed-up instead of the 3× expected. Even for the case where each core broadcasts the identified detections (i.e., performs fault dropping across cores via the shared memory), experimentation showed considerable overhead due to undesired situations such as race conditions, memory locks, and synchronization.
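The two static schemes discussed above can be compared with a simple cost model. The sketch below counts test-fault simulations per core under fault partitioning and under test partitioning without cross-core fault dropping, assuming one time unit per simulation; the helper names `run`, `chunks`, and `compare`, and the `detected_by` dictionary (test to set of detected faults), are hypothetical and do not reproduce the data of Fig. 1.

```python
def run(tests, faults, detected_by):
    """Count test-fault simulations on one core with local fault dropping."""
    remaining, count = set(faults), 0
    for t in tests:
        count += len(remaining)
        remaining -= detected_by.get(t, set())
    return count

def chunks(items, n):
    """Split items into n contiguous, near-equal slices."""
    size, extra = divmod(len(items), n)
    out, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        out.append(items[start:end])
        start = end
    return out

def compare(tests, faults, detected_by, n):
    """Return serial cost and per-core costs for fault and test partitioning."""
    serial = run(tests, faults, detected_by)
    by_fault = [run(tests, part, detected_by) for part in chunks(faults, n)]
    by_test = [run(part, faults, detected_by) for part in chunks(tests, n)]
    return serial, by_fault, by_test
```

In this model the per-core costs of fault partitioning always sum to the serial cost but may be unevenly spread, whereas test partitioning can inflate the total because each core drops faults only locally. The achieved speed-up is the serial cost divided by the largest per-core cost.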
A hybrid approach combining the two approaches to explicitly avoid potentially unnecessary work is shown in Fig. 1(d). Each core is assigned a subset of the fault list and a subset of the test patterns (test and fault partitioning) at one time. This process is iterated until the entire search space is examined. The faults detected at each core are dropped at the end of each iteration. This hybrid partitioning allows for fault dropping to be communicated more frequently and for better workload balancing. For these reasons it has been adopted in the proposed method. More details are provided in Section III.
C. Superfluous Work Avoidance
The efficiency of parallel solutions is very sensitive to the uncertainty introduced by fault dropping, shared memory access time, race conditions, synchronization, and the shared memory coherence mechanism. Parallel methodologies inherently suffer from increased total workload with respect to serial approaches. The limited view of the entire problem in each core can result in processing that is not necessary because concurrent processing nullifies its contribution to the final solution. Detecting a fault once is sufficient for the examined problem, and any further consideration of detected faults in any core is considered unnecessary processing. In this paper, simulating a fault concurrently in two or more cores for different tests is referred to as superfluous workload. For example, in the serial approach of Fig. 1(a) the shaded cells denote workload that need not be executed due to fault dropping. For the parallel approach of Fig. 1(c), where the same fault is considered in more than one core at the same time, many of these superfluous simulations are executed, increasing the overall executed workload. Superfluous workload is the main reason why test partitioning has a larger impact on the speed-up than fault partitioning. The proposed work incorporates a number of techniques to minimize superfluous workload.
D. Shared Memory Utilization
Parallel approaches relying on shared memory are by nature highly dependent on the memory structure. Memory can be effectively used as the main means of intercommunication between processing cores; hence, any inefficient usage or inappropriate architecture can severely affect the methodology's performance. Fault simulation is not an exception, since each core must communicate its detections as soon as possible to allow for effective fault dropping. Shared memory access requires memory locking and synchronization in order to avoid race conditions and ensure memory coherency. The idle processing times imposed by memory locking and synchronization introduce delays, which can increase the overall runtime. This is the case, for example, when the number of faults dropped at one time is large (i.e., high fault dropping rate). Communicating dropped faults very frequently (e.g., on every detection) can increase the memory accesses and the corresponding idle time considerably. This increase has an even larger impact on memory hierarchies that combine shared (main memory) and local memories (on-core caches), like the one considered here, where a large number of writes from local to shared memory (and vice versa) are performed.
To address this issue, the proposed methodology proceeds in three phases, each utilizing the shared memory structure in a different way. First, the faults are partitioned into fault sublists and each core simulates its own partition of tests independently for its private sublist. Faults dropped need only be communicated at the end of this phase, resulting in a single shared memory locking that limits core idle time. After a high percentage of (easy-to-detect) faults is dropped, this mutually exclusive approach is no longer effective. Static partitioning of the remaining hard-to-detect faults results in simulation times that differ greatly among cores. This can lead to unbalanced workload, which in turn results in underutilization of several cores that may become idle. Thus, a dynamic approach follows in which a small number of faults are dynamically distributed to each core at a time. These faults are simulated in parallel by each core (via a bit-parallel simulation process) and the shared fault list is updated immediately upon simulation. This second phase does not explicitly avoid race conditions as the first one does. However, it takes advantage of the different execution times among cores to ensure they access the memory at distinct times; yet its dynamic nature does not result in idle cores, unless all the tests assigned to a core have been examined. A last phase is invoked to utilize the idle cores' processing power, which may result in more frequent race conditions (both for faults and tests). To minimize these as much as possible, this third phase makes use of a special data structure called the test and fault map (TFM), where a record is maintained for each test that contains all the undetected faults not yet simulated for this test. Each record can be accessed separately, i.e., without the need of locking the entire map.
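One possible way to realize a map whose records can be locked individually is sketched below. The class name `TFM` and its methods `take`, `drop`, and `pending` are hypothetical, and the use of one `threading.Lock` per record is an assumption about how "accessed separately" could be implemented; the paper's actual data layout is not reproduced here.

```python
import threading

class TFM:
    """Per-test record of undetected faults not yet simulated for that test."""

    def __init__(self, tests, faults):
        self._records = {t: set(faults) for t in tests}
        self._locks = {t: threading.Lock() for t in tests}   # one lock per record

    def take(self, test, w):
        """Atomically remove and return up to w pending faults of `test`."""
        with self._locks[test]:
            record = self._records[test]
            batch = sorted(record)[:w]
            record.difference_update(batch)
            return batch

    def drop(self, fault):
        """Remove a detected fault from every record, one record lock at a time."""
        for test, lock in self._locks.items():
            with lock:
                self._records[test].discard(fault)

    def pending(self, test):
        """Number of faults still to be simulated for `test`."""
        with self._locks[test]:
            return len(self._records[test])
```

Because a core only holds the lock of the record it is reading or updating, two cores working on different tests never block each other, which is the property the text attributes to the TFM.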
III. PROPOSED PARALLEL FAULT SIMULATION APPROACH
The proposed parallel fault simulation method takes into account all challenges discussed in Section II. This section describes in detail the proposed method (Section III-A) and presents optimizations incorporated to provide a scalable and well balanced parallel solution (Section III-B). Necessary notation is given in Table I . 
A. Proposed Three-Phase Methodology
The proposed method consists of three phases, each one utilizing the available cores and the shared memory in a different manner (Fig. 2) . Initially, a preprocessing step is invoked to favor assigning to the same core, faults with higher likelihood of being detected by the same test. This step is based on constructing fault sublists by a depth-first-search (DFS) traversal of the circuit's netlist, followed by a bin-packing heuristic in order to create equal sized sublists to be allocated to cores. This sorted-fault partitioning is used to evenly distribute workload among cores and, at the same time, provides a better starting point for the first phase of the method than a nonsorted approach. The latter has been confirmed by extensive experimentation, the corresponding setup of which has been presented in [27] together with relevant results. Chatterjee et al. [27] also described alternative sorted-fault partitioning approaches such as breadth-first-search-based, reverse order, and random selection. The proposed preprocessing approach was selected due to its simplicity and its minimal impact on the total runtime. Keeping its runtime small is highly important for this step as it is the only part of the proposed method that does not run in parallel (its output is an input to problem decomposition). Experimentation has shown that alternative partitioning approaches can affect the overall speedup by 4%-5%, i.e., reduce it by ∼1×.
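The preprocessing step can be pictured with the following sketch, which orders lines by a depth-first traversal from the primary outputs and then cuts the resulting stuck-at fault list into n near-equal contiguous sublists. The traversal direction, the two-faults-per-line model, and the contiguous cut (used here as a simple stand-in for the bin-packing heuristic) are all assumptions; `dfs_order` and `partition_faults` are hypothetical names.

```python
def dfs_order(fanin, outputs):
    """Return circuit lines in depth-first order, starting from each primary output."""
    order, seen = [], set()

    def visit(line):
        if line in seen:
            return
        seen.add(line)
        order.append(line)
        for source in fanin.get(line, []):
            visit(source)

    for out in outputs:
        visit(out)
    return order

def partition_faults(fanin, outputs, n):
    """Build a DFS-sorted stuck-at fault list and split it into n near-equal sublists."""
    faults = [(line, v) for line in dfs_order(fanin, outputs) for v in (0, 1)]
    size, extra = divmod(len(faults), n)
    sublists, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        sublists.append(faults[start:end])
        start = end
    return sublists
```

Faults on structurally adjacent lines end up in the same sublist, which is the intuition behind grouping faults that are likely to be detected by the same test.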
After the initial preprocessing step, three parallel (independent, dynamic collaborative, and workload balancing) phases take place (Fig. 2). Initially, at the Independent Phase, tests and faults are equally and statically partitioned among the available cores following a superfluous workload avoidance approach and targeting a large number of (mostly easy-to-detect) faults. In this phase cores are working on mutually exclusive search spaces, as seen in Fig. 1(d), and are expected to have similar runtimes since they are assigned an equal workload. A high percentage of easy-to-detect faults are efficiently dropped per core, while the shared fault list is only updated at the end of the phase, avoiding the excessive shared memory access that the large number of dropped faults would otherwise cause. Despite the obvious benefit of the workload distribution and redundant work avoidance, static partitioning may still result in imbalanced executions among the various cores due to the possibility of different fault dropping ratios among the cores. Hence the proposed methodology goes beyond this static approach by supplementing it with two dynamic phases that are triggered using specific metrics.
In the following Dynamic Collaborative Phase, each core simulates the same tests as in the previous phase. The remaining faults are dynamically allocated to all available cores, which requires appropriate bookkeeping in order to simulate every test for all undetected faults and also to avoid concurrent simulation of the same fault. Faults that are under processing are marked and revisited later, exploiting the high-speed shared memory for intercore communication in order to avoid superfluous work. The final Workload Balancing Phase redistributes the remaining workload across cores by moving tests from busy cores to others that have already finished processing their initially assigned tests (idle cores). Appropriate bookkeeping using a TFM guarantees that a test is simulated for each fault at most once. The following sections describe in detail each one of these phases while Table II summarizes how each one of the parallelization challenges presented in Section II is addressed in each phase.

Algorithm 1 (executed at each core i)
    ...
02. T w = select w tests ∈ T i
03. bit-parallel-simulation(T w , fault-free)
    ...
    if ( ...
11. break;
12. lock-shared(F)
13. F = F \ F di
14. unlock-shared(F)
15. return
1) Independent Phase: |T|/n tests and |F|/n faults are statically assigned to each core. The first |T|/n tests from T are assigned to Core 1 (T 1 ), the following |T|/n to Core 2 (T 2 ), and so on. Following the same rationale, |F|/n faults are assigned to each core (F 1 to Core 1, F 2 to Core 2, etc.). Fig. 1(d) shows this initial allocation, resulting in equal initial workload among cores. Detected faults are stored in local (per core) fault lists (F di ), avoiding the shared memory contention that can be caused when the global fault list F in the shared memory is updated very frequently by many cores. This is often the case in this phase, where a large number of easy-to-detect faults are detected. Hence, each core is independently working on faults from its local fault list (F i ), avoiding communication and time-consuming synchronization with other cores. When each core terminates, the shared memory is updated by dropping all detected faults from the global fault list F. Algorithm 1 shows the basic steps of the proposed approach executed at each core i. The main goal of this Independent Phase is to quickly detect the vast majority of the easy-to-detect faults. For this we introduce the fault dropping rate as the lower acceptable bound of fault detections per test. This rate is monitored, and a low value is an indication that the majority of the easy-to-detect faults have been detected and, hence, that this phase should be terminated. More details on this can be found in Section III-B.
First, all tests allocated to core i are simulated to provide the fault-free responses (lines 01-03). This is done in a bit-parallel manner with w tests simulated at the same time; w is the data word size of the processor. Each one of the w bits is assigned a signal value corresponding to a different test starting from the primary inputs. The simulation traverses the circuit by applying bitwise logic operations at each line indicated by the respective logic gate. Interested readers are referred to [9] for further details on bit-parallel (fault) simulation. Next, each test in T i is simulated for all faults assigned to the core (F i ) in groups of w, again in a bit-parallel fashion (lines 05-07). The faults detected from this simulation (F det ) are removed from F i (line 08) and accumulated in the local list of detected faults (F di ) (line 09). This iterative approach terminates when any of the termination conditions occurs: 1) all tests have been considered for simulation (line 05) or 2) the fault dropping rate is lower than a predefined threshold, referred to as DropRate (lines 10 and 11). Finally, all accumulated faults detected are dropped from further consideration from the shared fault list F (line 13). This updating requires locking of the shared memory to ensure coherency (lines 12 and 14). Fig. 3 illustrates the execution of the proposed method using an example, assuming a system with four cores (n = 4) and with w = 3, |F| = 48 faults, and |T| = 24 tests. Hence, faults and tests are equally distributed to the available cores, i.e., |F i | = 12 faults and |T i | = 6 tests as illustrated in Fig. 3(a) . Detected faults are dropped locally per core and only at the end of the independent phase are dropped from the global shared fault list F [gray shaded cells of Fig. 3(b) ] to avoid excessive shared memory reads/writes. This write-back scheme is possible since each core has its own search space and, thus, broadcasting detections earlier is meaningless. 
When the first core finishes processing its tests or the fault dropping rate for a test is too low (here DropRate = 1), the independent phase terminates. In this example Core 2 finishes processing for tests t 7 to t 12 (indicated in green) while Core 1 has one (t 6 ), Core 3 has two (t 17 and t 18 ), and Core 4 has three (t 22 , t 23 , and t 24 ) tests that are not fully simulated yet. These tests will be considered in the two subsequent phases.
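The per-core behavior of the independent phase can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 1: bit-parallel simulation is replaced by a hypothetical oracle `detects(t, f)`, the global rule that the phase ends when the first core finishes is omitted, and n is assumed to divide both |T| and |F|. Python threads are used only to mirror the structure; they do not provide real multicore speed-up for this kind of workload.

```python
import threading

def independent_phase(core_tests, core_faults, detects, shared_faults, lock, drop_rate):
    """Run one core's independent phase and return the tests it left unfinished."""
    local = list(core_faults)              # F_i: private fault sublist
    detected = set()                       # F_di: local list of detected faults
    done = 0
    for t in core_tests:
        hits = {f for f in local if detects(t, f)}
        local = [f for f in local if f not in hits]
        detected |= hits
        done += 1
        if len(hits) < drop_rate:          # too few detections: leave this phase
            break
    with lock:                             # single shared-list update per core
        shared_faults -= detected
    return core_tests[done:]

def run_independent(tests, faults, detects, n, drop_rate=1):
    """Statically split tests and faults over n worker threads."""
    shared, lock = set(faults), threading.Lock()
    tp, fp = len(tests) // n, len(faults) // n     # assumes n divides both sizes
    leftovers = [None] * n

    def worker(i):
        leftovers[i] = independent_phase(
            tests[i * tp:(i + 1) * tp], faults[i * fp:(i + 1) * fp],
            detects, shared, lock, drop_rate)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared, leftovers
```

Each worker touches the shared fault list exactly once, under the lock, which mirrors the write-back scheme described above.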
2) Dynamic Collaborative Phase: During this phase, core i considers for simulation the same |T i | tests that were statically allocated to it during the Independent Phase. Faults, however, are distributed to the cores dynamically by examining the global shared list of undetected faults F. The main principles used during this phase are: 1) each core is allocated w faults from F at a time and 2) no two cores simulate the same fault at the same time, even for different tests, in order to avoid superfluous work. The latter condition requires global bookkeeping that marks a fault as "under processing" when it is allocated to a core and prevents other cores from claiming it. This bookkeeping is realized using local queues per core (Q i ). For some test t j , a core goes through F in a circular modulus manner searching for w undetected faults that are not under processing by any other core. Undetected faults that are currently under processing by other cores are stored in Q i so that core i can consider them later, only if this is necessary: if a fault in Q i has been processed by some core j and still remains undetected, core i will process it; if the fault is detected by core j, it is dropped from the global fault list F. This procedure continues until all tests in T i are simulated for all undetected faults in F, repeatedly revisiting Q i until its size becomes smaller than w. This termination condition allows core i to continue with its other tests, for which many faults remain to be simulated.

Algorithm 2: Dynamic Collaborative Phase (partial listing)
    ...
12. mark faults of F w as under processing in F
13. if (|F w | = w)
14.     ...
    ...
18. unlock-shared(F)
19. ...
    ... while ...
22. if f k ∈ F AND not under processing
23.     add-list(...)
    ...
26. repeat steps 13-19
27. if ...
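The claiming rule described above (take up to w undetected faults that nobody else is processing, and remember the busy ones in Q i ) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' C++/OpenMP code: the function name `claim_faults`, the use of Python sets for the "undetected" and "under processing" marks, and the single `threading.Lock` standing in for lock-shared/unlock-shared are all hypothetical choices.

```python
import threading
from collections import deque


def claim_faults(fault_list, undetected, busy, start, w, skip_queue, lock):
    """One claiming step of a core for its current test.

    fault_list : ordered list of all fault names (the layout of F)
    undetected : set of faults still present in F
    busy       : set of faults marked "under processing" by any core
    start      : index where the circular scan begins
    skip_queue : this core's fault-skip queue (Q_i)
    Returns (claimed faults F_w, index where the scan stopped).
    """
    claimed, pos, scanned = [], start, 0
    n = len(fault_list)
    with lock:  # stands in for lock-shared(F) ... unlock-shared(F)
        while scanned < n and len(claimed) < w:
            fault = fault_list[pos]
            if fault in undetected:
                if fault in busy:
                    skip_queue.append(fault)  # revisit later if still undetected
                else:
                    busy.add(fault)           # mark as under processing
                    claimed.append(fault)
            pos = (pos + 1) % n               # circular traversal of F
            scanned += 1
    return claimed, pos


# Illustrative data only: f2 is already dropped, f3 is held by another core.
faults = [f"f{i}" for i in range(1, 9)]
undetected = set(faults) - {"f2"}
busy = {"f3"}
q2 = deque()
f_w, next_pos = claim_faults(faults, undetected, busy, 0, 3, q2, threading.Lock())
```

With these made-up inputs the core claims f1, f4, and f5, while f3 ends up in its fault-skip queue; a complete implementation would additionally release the marks after simulation and drop detected faults from F.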
Algorithm 2 presents the pseudocode of the dynamic collaborative phase. The procedure begins by initializing a TFM, which serves as input to the following phase. The TFM maintains a record per test that keeps all the faults that have not been simulated for the test. After the end of the independent phase, each core creates the records for the tests allocated to it (i.e., those in T i ) by inserting all faults in F in the test's record TFM(t j ) (lines 02 and 03). If a test has been processed in the independent phase, all the faults already simulated for it (i.e., those in F i ) are excluded from TFM(t j ) (lines 04 and 05). Then, each test in T i is simulated for all faults in F, in groups of w, and the TFM is updated accordingly. Faults to be simulated are placed in the local list F w and marked as under processing (lines 10-12). Faults marked as under processing by other cores are placed in the local fault-skip queue Q i for later consideration (lines 08 and 09). The outcome of the simulation (line 14) is used to update the shared fault list F (lines 15-18). This process is repeated until all faults in F have been either simulated or queued. Faults in Q i are then revisited for simulation (if not already dropped from F by some other core) in lines 20-26. If Q i becomes empty, the test has been simulated for all faults and, thus, can be removed from any further consideration (lines 27 and 28). If the fault-skip queue has size smaller than w, the faults it contains are not simulated and are saved in the TFM (line 30).

As an example, consider Core 2, which has been allocated tests t 7 to t 12 [Fig. 3(a)]; hence, T 2 = {t 7 , t 8 , t 9 , t 10 , t 11 , t 12 }. Starting at the fault immediately following the last fault simulated by Core 2 in the independent phase, i.e., fault f 25 , Core 2 looks for w undetected faults from F which are not under processing. Dropped faults are shown as gray shaded cells in F. These no longer exist in F; they are shown here for completeness. Fault cells under processing are shown in green, pink, blue, and yellow, indicating the core that processes them, i.e., Cores 1-4, respectively. For this example, let w = 3. Therefore, during the first iteration of the dynamic collaborative phase, Core 2 acquires three faults, F w = {f 25 , ...}. Faults found to be under processing by other cores [Fig. 3(c)] are placed in the fault-skip queue Q 2 . Faults collected in Q 2 are considered after Core 2 completes the circular traversal of F. Additional details on the implementation, as well as on the completeness of the fault space coverage using this technique, are given in Section III-B.

Algorithm 3: Workload Balancing Phase (partial listing)
    ...
02. for each not idle core j
03.     if ...
    ...
08. F w = Select w faults from F i
09. ...
    ...
12. unlock-shared(F)
13. ...
3) Workload Balancing Phase: In this phase, the restriction that each test is allocated to a specific core is relaxed. Targeting full utilization of the available processing power, tests from busy cores are moved to idle cores together with corresponding faults not yet simulated. For this purpose, the TFM constructed in a distributive manner by all cores during the previous phase and stored in the shared memory is utilized. Avoidance of superfluous work is not explicitly enforced since in this stage only very few faults remain undetected. The main purpose here is to conclude as fast as possible if these faults are not detected by the given test set or they have (possibly) only one detection that has not been examined yet. Covering all possible combinations will provide the actual fault coverage of the test set under evaluation.

Fig. 4. Construction of the TFM in Core 1 for t 2 during the dynamic collaborative phase and its utilization in Core 2 during the workload balancing phase.
Algorithm 3 summarizes the basic steps of the Workload Balancing Phase. The process is activated when the workload assigned to a core in the dynamic collaborative phase has been executed (line 01), i.e., when all tests in its test list (T i ) have been retired. The procedure acquires tests from busy cores by searching TFM (lines 02 and 03). The test selected (t k ) is removed from the test list T j of the busy core (line 04) and added to the test list T i of the idle core (line 05). Then, all the faults not yet simulated for the acquired test and not dropped by other tests during the previous two phases are allocated to the idle core (lines 06 and 07). This information is obtained from the globally maintained TFM that keeps a record for the faults not yet simulated per test (see Section III-B). The acquired test is simulated for the remaining faults and the shared fault list is updated with the new detections (lines 08-12). A test is retired when it has been simulated for all undetected faults (lines 14 and 15), and the phase terminates when the entire workload has been examined (all cores become idle).
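The test hand-over described above can be illustrated with a short Python sketch. It is a simplified, single-threaded model under assumptions of our own: the function name `acquire_test`, the dictionary representation of the test lists and of TFM, and the rule that only tests with a saved TFM record may be handed over are hypothetical and do not reproduce the authors' implementation.

```python
def acquire_test(idle_core, test_lists, tfm, undetected):
    """Move one pending test from a busy core to idle_core.

    test_lists : dict core -> list of tests still owned by that core (T_i)
    tfm        : dict test -> set of faults not yet simulated for that test
    undetected : set of faults still in the shared list F
    Returns (test, faults to simulate), or None if no test can be acquired.
    """
    for core, tests in test_lists.items():
        if core == idle_core:
            continue
        for test in tests:
            if test in tfm:                          # a saved record exists
                tests.remove(test)                   # leave the busy core's list
                test_lists[idle_core].append(test)   # join the idle core's list
                # Keep only faults that no other test has detected meanwhile.
                remaining = tfm.pop(test) & undetected
                return test, remaining
    return None


# Illustrative data only: f45 was detected by another test in the meantime.
test_lists = {1: ["t2", "t3"], 2: []}
tfm = {"t2": {"f25", "f18", "f45", "f47"}}
undetected = {"f18", "f25", "f47", "f50"}
acquired = acquire_test(2, test_lists, tfm, undetected)
```

In this made-up scenario Core 2 takes over t 2 from Core 1 and only needs to simulate it for f 18 , f 25 , and f 47 ; t 3 has no saved record and therefore stays with Core 1.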
B. Parallelization Optimizations
This section provides details on three optimizations incorporated in the methodology presented in Section III-A that play an important role in achieving high scalability, beyond the generic techniques used to ensure efficient utilization of the shared memory.
1) Fault Dropping Rate Monitoring:
As discussed in Section III-A, the main target of the Independent Phase is to cover easy-to-detect faults as early as possible, in order to reduce the overall workload. To achieve this with high utilization of the available processing power, faults and tests are equally distributed to cores in a mutually exclusive manner. As discussed in Section II, the difference in the fault dropping rate between cores can lead to idle cores and, hence, to underutilization of the available processing power. The fault dropping rate is defined as the number of dropped faults over the number of simulated faults per test, averaged over all tests examined by the same core i at a particular point in time. When this ratio drops significantly, this is an indication that the majority of easy-to-detect faults have been detected by some test and that the static distribution of workload followed in the independent phase will eventually introduce idle cores. To avoid this, each core monitors the fault dropping rate and, when it is significantly reduced, the independent phase is terminated in order to allow for the redistribution of the faults among the cores in a dynamic fashion.
In order to find a suitable value for this rate (threshold), we have examined how the speed-up is affected by the following four parameters: |F|, |T|, |F i |, and |T i |. Hence, we ran the method using as the threshold weighted values of each one of them, with weights varying from (1/n) up to n, where n is the number of cores considered. These experiments have shown that the speed-up is affected by the changes of |F i | in a directly proportional manner and by the changes in |T i | and |T| in an inversely proportional manner. The changes in the weight of |F| do not affect the obtained speed-up. Next, we combined the values found for the three former parameters and performed a similar exploration. As a result, we have set the fault dropping rate threshold to |F i |/(|T i | · |F| · 2), which simplifies to 1/(2 · |T|) when |F i | = (|F|/n) and |T i | = (|T|/n). Hence, the threshold for the fault dropping rate is inversely proportional to the test set size.
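The following Python sketch works through the threshold arithmetic and the monitoring rule described above. It is a hedged illustration only: the function names `drop_rate` and `threshold`, the use of exact fractions, and all numeric values are assumptions made for the example, not data from the experiments.

```python
from fractions import Fraction


def drop_rate(dropped_per_test, simulated_per_test):
    """Average over a core's tests so far of (dropped faults / simulated faults)."""
    ratios = [Fraction(d, s)
              for d, s in zip(dropped_per_test, simulated_per_test) if s > 0]
    return sum(ratios) / len(ratios) if ratios else Fraction(0)


def threshold(f_i, t_i, f):
    """Fault dropping rate threshold |F_i| / (|T_i| * |F| * 2)."""
    return Fraction(f_i, t_i * f * 2)


# Hypothetical sizes: n = 4 cores, |F| = 1000 faults, |T| = 24 tests,
# evenly partitioned so that |F_i| = 250 and |T_i| = 6.
n, F, T = 4, 1000, 24
limit = threshold(F // n, T // n, F)
simplified = Fraction(1, 2 * T)          # the closed form 1 / (2 * |T|)

# A core early in the independent phase versus one that has nearly stopped
# dropping faults (made-up counts).
early = drop_rate([50, 10, 0], [250, 200, 190])
late = drop_rate([2, 0, 0], [250, 248, 248])
leave_independent_phase = late < limit
```

Here `limit` and `simplified` are both 1/48, which matches the simplification stated in the text; the "early" core is still above the threshold, while the "late" core falls below it and would trigger the end of the independent phase.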
2) Fault-Skip Queues: While in the independent phase the mutually exclusive nature of the partitioning guarantees that no superfluous work will be executed, this is not the case for the dynamic collaborative phase. Although tests are not shared among cores and, thus, no two cores can simulate exactly the same fault-test combination, superfluous work occurs when two (or more) tests are simulated for the same fault in two (or more) cores in parallel (explained in Section II-B). In order to avoid this problem, faults under processing are marked and not considered by other cores. Any fault found to be under processing during the dynamic collaborative phase is inserted in a fault-skip queue maintained per core (Q i ) in order to be considered after the traversal of the shared fault list F for a test. Faults remaining in the fault-skip queue after this phase terminates are stored along with the corresponding test in TFM.

Fig. 4 presents an example of the usage of the fault-skip queues. Let Core 1 be in the dynamic collaborative phase. Core 1 simulates t 2 for all the faults not simulated during the independent phase, by traversing F. Faults indicated in pink are not simulated for t 2 since they have been marked by some other core(s) as under processing and, hence, are placed in Q 1 . Specifically, f 25 , f 18 , and f 14 are inserted in Q 1 during iteration 1; f 29 , f 36 , and f 42 during iteration 2; and f 45 and f 47 during iteration 3. In iteration 4, after Core 1 has performed a full traversal of F, the faults in Q 1 are revisited for t 2 ; those that are still under processing (f 25 among them) skip simulation and are inserted back in Q 1 . After the end of iteration 5, not enough faults (fewer than w, here 2) can be found to be simulated for t 2 . Faults remaining in Q 1 are saved in the shared TFM [Fig. 4 (right)], indicated in blue. This way Core 1 can proceed with its other tests, for which many faults remain to be simulated. The skipped simulations of t 2 (for faults f 25 , f 18 , f 45 , and f 47 ) will be considered in the workload balancing phase. In Fig. 4, Core 2 enters the workload balancing phase after it becomes idle, acquires these faults from the record corresponding to test t 2 in TFM, and simulates them for t 2 without performing any further condition checks.
3) Shared Tests and Faults Map:
A shared TFM is employed for the appropriate bookkeeping during the simulation evolution to ensure that all test-fault combinations are considered. It consists of lists of faults, one for each test, that keep the faults not yet simulated by the test. For example, in Fig. 4 , t 8 has been simulated for all faults except f 16 , f 21 , and f 26 . TFM is initiated after the termination of the independent phase by recording all nonretired tests together with their corresponding faults not yet simulated. During the dynamic collaborative phase each list is updated when the corresponding test is considered for simulation. As shown in the previous section, this updating is in practice carried out by replacing a test's list in TFM with the fault-skip queue formed at the core executing the test's simulations. For the example of t 2 , simulated in Core 1 in Fig. 4 , the fault-skip queue saved in TFM is shown in blue. Shaded tests in TFM (Fig. 4) indicate tests for which all the simulations have finished, while tests without a specific list (e.g., t 3 ) indicate tests which are currently under processing by some core in the dynamic collaborative phase. A core becoming idle because all its tests have been simulated for all faults (Core 2 in Fig. 4 ) enters the workload balancing phase simulating tests that were originally assigned to other cores by considering the corresponding lists from TFM. For this example, Core 2 simulates t 2 (initially assigned to Core 1) and in this manner workload is reallocated from Core 1 to Core 2.
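A minimal Python sketch of the TFM life cycle described above is given here: creation after the independent phase, replacement of a record by the leftover fault-skip queue, and retirement of a test. The helper names `init_tfm` and `save_skip_queue`, the dictionary-of-sets layout, and the fault sets used in the example are assumptions for illustration and are not taken from the authors' code or from Fig. 4.

```python
def init_tfm(tests, undetected, simulated):
    """Create one TFM record per non-retired test.

    undetected : set of faults still in F after the independent phase
    simulated  : dict test -> set of faults already simulated for that test
    """
    tfm = {}
    for test in tests:
        pending = undetected - simulated.get(test, set())
        if pending:                 # tests with nothing pending are retired
            tfm[test] = pending
    return tfm


def save_skip_queue(tfm, test, skip_queue, undetected):
    """Replace the test's record with the still-undetected faults left in Q_i."""
    leftover = {fault for fault in skip_queue if fault in undetected}
    if leftover:
        tfm[test] = leftover
    else:
        tfm.pop(test, None)         # simulated for all faults: retire the test


# Illustrative fault sets only.
undetected = {"f14", "f16", "f18", "f21", "f25", "f26"}
simulated = {"t2": {"f14", "f16"}, "t8": {"f14", "f18", "f25"}}
tfm = init_tfm(["t2", "t8"], undetected, simulated)
save_skip_queue(tfm, "t2", ["f25", "f18"], undetected)
```

After these calls the record of t 8 holds the three faults it has not been simulated for, and the record of t 2 has been replaced by the two faults left in the skip queue; calling `save_skip_queue` with an empty queue would retire the test.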
IV. EXPERIMENTAL RESULTS
The method was implemented in C++ with the OpenMP parallelization framework and run on a 20-core 2.5-GHz Intel Xeon CPU E5-2670 v2 Linux machine with 98 GB of RAM and hyperthreading enabled (two threads per core, 40 logical cores). The basis for the simulation is an in-house event-driven DFS-based fault simulation tool. The larger full-scan versions of the circuits in the IWLS'05 benchmark suite are considered under the stuck-at fault model. Even though the stuck-at fault model is used, any linear fault model can be applied to the proposed parallelization method.

Table III presents the speed-up obtained using 24 logical cores, over a serial fault simulation process (single-core run), for a workload of 10 000 random tests. Only 24 of the 40 available logical cores are used, to allow a fair comparison with the work of [7]. For the same reason, the speed-up metric is used, as it normalizes the comparison between approaches that consider similar architectures with nonidentical characteristics such as memory size and CPU clock frequency. Column 1 lists the circuit name, followed by the number of circuit lines (column 2) and the number of stuck-at faults (column 3) in the collapsed fault list considered for each circuit. Columns 4 and 5 report the speed-up achieved when only one dimension of parallelism is applied: column 4 when only fault list partitioning is applied (fault parallelism), and column 5 when only test set partitioning is applied (test parallelism). In fault parallelism, the test set is shared among the cores, while faults are evenly distributed to the available cores. This option suffers from considerable overhead due to high idle core times resulting from imbalanced fault dropping among cores, as well as overhead for the recalculation of the fault-free test responses. In the test parallelism approach, the fault list is shared among the cores. While this approach has less overhead, it still produces some superfluous work when two (or more) tests are concurrently simulated at different cores for the same fault.

Fig. 5. Scalability of the proposed method using a randomly generated test set.
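The two single-dimension schemes compared above differ only in which list is split among the cores. The Python sketch below shows one possible even, mutually exclusive split; the helper name `partition` and the choice of contiguous chunks are assumptions for illustration (the text only requires an even, mutually exclusive distribution).

```python
def partition(items, n_cores):
    """Split items into n_cores contiguous, disjoint, near-equal chunks."""
    base, extra = divmod(len(items), n_cores)
    chunks, start = [], 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)  # spread the remainder
        chunks.append(items[start:start + size])
        start += size
    return chunks


tests = [f"t{i}" for i in range(1, 25)]    # 24 tests, as in the running example
faults = [f"f{i}" for i in range(1, 11)]   # 10 hypothetical faults

# Fault parallelism: every core sees all tests, faults are split.
fault_parallel = [(tests, chunk) for chunk in partition(faults, 4)]
# Test parallelism: every core sees all faults, tests are split.
test_parallel = [(chunk, faults) for chunk in partition(tests, 4)]
```

With 24 tests and four cores, the second chunk is t 7 to t 12 , which agrees with the allocation T 2 used in the running example; splitting both lists at once corresponds to the mutually exclusive distribution of the Independent Phase.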
Column 6 reports the speed-up achieved by the proposed method without the optimizations presented in Section III-B, i.e., with the usage of TFM, the fault-skip queues, and fault dropping rate monitoring disabled (reported in [28]). Disabling the fault dropping monitoring heuristic extends the Independent Phase beyond the point where it effectively identifies easy-to-detect faults. Column 7 reports the speed-up achieved when all optimizations are enabled, which on average offer an additional speed-up of ∼2×. This improvement is significant since there is a theoretical upper limit to the possible speed-up, imposed by the number of available processing cores (24 in this case). Hence, the impact of the proposed optimizations is more obvious when the speed-up obtained by the method is significantly lower than 24×. For example, consider circuit s13207.1. The speed-up without the optimizations is 15.21×, much lower than the theoretically possible 24×. By enabling the optimizations, the speed-up is increased to 20.69× (22.83% improvement). Obviously, proportional improvement is not possible when the speed-up achieved by the method (without the optimizations) is close to 24×, for example in the case of circuit b19_1, where only 1.21% improvement is possible beyond the 23.29× achieved. We report the additional speed-up obtained by the proposed optimizations as a percentage of the maximum theoretical speed-up (24×) for each circuit in column 8. Column 10 lists the overall CPU time of the proposed approach (including the optimization heuristics), which is not greater than a few seconds for the larger circuits.
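The percentage quoted for s13207.1 can be reproduced with a few lines of Python. The function name `gain_vs_ideal` is a hypothetical helper; the formula (the speed-up gained, divided by the theoretical maximum of 24×) is inferred from the description of column 8 and is consistent with the quoted figure.

```python
def gain_vs_ideal(speedup_base, speedup_opt, cores):
    """Additional speed-up as a percentage of the theoretical maximum (= cores)."""
    return 100.0 * (speedup_opt - speedup_base) / cores


# s13207.1: 15.21x without and 20.69x with the optimizations, on 24 logical cores.
gain = round(gain_vs_ideal(15.21, 20.69, 24), 2)   # 22.83
```

The result, 22.83, matches the improvement reported in the text for this circuit.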
The main observations here are as follows.
1) Test parallelism is more effective than fault parallelism in maintaining high speed-up, yet the combination of the two, as proposed for the Independent Phase, provides considerably improved speed-ups.
2) The proposed optimizations are necessary to bring the speed-up closer to the optimal (24×), and in no case does their inclusion have a negative impact on the speed-up of the methodology (i.e., the overhead is less than the gain).
3) The obtained speed-up implies that the proposed method will continue to provide scalable speed-ups as the number of cores increases in systems having architectures similar to the one considered here.
Compared to the most recent related work of [7] (reported in column 9 of Table III), the proposed work achieves considerably higher speed-up on all circuits, especially on larger circuits where the additional workload allows more room for parallelization efficiency. For example, for the larger circuits reported in [7], systemcaes and mem_ctrl, the reported speed-up is 3.17× and 4.28×, whereas in the proposed approach it is 22.8× and 23.13×, respectively.
A comprehensive picture of the superiority of the proposed method in terms of scalability is given in Figs. 5 and 6. Scalability is reported (y-axis) for different numbers of cores (x-axis). Fig. 5 shows the results obtained for 10 000 random patterns simulated by the proposed method, while Fig. 6 shows the results for 10 000 random patterns as reported in [7]. For all the cases examined, the proposed methodology continues to scale at the same rate for a larger number of cores, whereas the method of [7] exhibits significant speed-up saturation. This observation implies that the scalability of the proposed method will continue at a similar rate as the number of cores is increased. This is partially supported by the design of the proposed methodology, where a simple basic simulation process is used in each core which, in turn, provides a high level of freedom during parallelization. This is a different direction from the relevant work of [7], where the basic process is aggressively optimized, limiting the exploitation of parallelization techniques.

Fig. 6. Scalability as reported in [7] using a randomly generated test set.

Table IV presents the speed-up obtained by the proposed approach (including the optimizations), using again 24 logical cores, when a deterministic test set is simulated in different orders. Columns 1-3 list the circuit name, the number of nodes, and the number of faults in the circuit, respectively. Column 4 reports the size of the test set considered, and column 5 provides an indication of the maximum amount of workload to be carried out by the simulation process, obtained by multiplying the number of faults by the number of tests considered. The latter is by no means an accurate measure, as it only partially takes into account the complexity of the circuit; yet, cases with larger workload (as defined here) provide more solid conclusions for the evaluation of the proposed methodology, as they represent more realistic examples.
Columns 6-8 show the speed-up achieved following three different orders of the test sets which are then distributed to the cores as described in Section III. In column 6, the order by which the test generator tool has produced the test set is followed while in column 7 the reverse order is used. In column 8, a random test order is used. By comparing the three techniques we conclude that the test order is of little or no importance to the scalability of the approach. This is in contrast to many existing serial fault simulation approaches where the test simulation order can affect significantly the fault dropping rate and, hence, the overall CPU time. In the proposed parallel approach, however, test order does not seem to factor in, mainly due to the proposed fault dropping monitoring and dynamic workload balancing techniques. Observe that higher speed-ups are achieved for larger workloads because these cases take full benefit from the efficient utilization of the processing power.
The final part of our experimentation evaluates the impact of workload balancing in the proposed methodology. As discussed in Sections II and III, workload balancing is crucial in achieving high and scalable speed-ups. Fig. 7 shows how workload is distributed among the processing cores for four indicative benchmarks. Similar behavior was observed in all the benchmarks considered. The y-axis shows overall CPU time and the x-axis shows the different cores. In the plots on the left side, the workload balancing phase has been disabled, while in the plots on the right side it was enabled. Blue bars show the execution time of the Independent Phase, red bars correspond to the Dynamic Collaborative Phase, and green bars to the Workload Balancing Phase.
During the first phase, cores work independently on their private partitions, detecting a large percentage of easy-to-detect faults. Very often the dynamic collaborative phase (red bars) is the most time-consuming phase, since the remaining (not easy-to-detect) faults are systematically targeted in a dynamic manner. Redundant faults play an important role in the CPU time of the entire simulation process since they need to be simulated for all tests. When no workload balancing phase is applied, a large number of cores remain idle while other cores have a large amount of workload left. The horizontal lines in the plots indicate when the first core terminates when no load balancing is applied. For example, for circuit b17, core 18 remains idle for 28.5% of the execution time; yet when the workload balancing phase is enabled, the maximum idle time is only 0.5% of the execution time. Dynamically redistributing the workload as proposed alleviates this problem and results in smaller overall CPU times for the entire process.
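Idle-time percentages such as those quoted above can be derived from per-core finish times, as in the following Python sketch. The function name `idle_percent` and the finish times are invented for illustration (they are chosen so that one core idles for 28.5% and another for 0.5% of the run) and are not measurements from the experiments.

```python
def idle_percent(finish_times):
    """Share of the total run each core spends idle after finishing its work."""
    total = max(finish_times)  # the run ends when the last core finishes
    return [100.0 * (total - t) / total for t in finish_times]


# Hypothetical finish times (arbitrary units) for four cores.
idle = idle_percent([200, 143, 199, 200])
```

A core that finishes at 143 out of 200 time units idles for 28.5% of the run, whereas one that finishes at 199 idles for only 0.5%.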
V. CONCLUSION
We present a parallel fault simulation method designed to obtain high scalability in shared-memory-based systems with a large number of homogeneous cores. The method uses a simple bit-parallel serial step and appropriately accommodates the workload in the available cores in three phases, each following a different rationale. An initial static independent phase evenly partitions tests and faults among cores, targeting fast identification of easy-to-detect faults. A persistent, yet dynamic phase follows, targeting the detection of hard-to-detect faults, where the fault list is shared and traversed by each core in a circular modulus manner. Finally, a workload balancing phase ensures minimal idle time until the completion of the entire process. The experimental results demonstrate that the proposed approach provides high speed-up and, most importantly, that it scales with the number of available cores, ensuring further performance improvement as the number of processing cores increases beyond the state of the art.
