Abstract-In this paper, we propose a lock-free architecture to accelerate logic gate circuit simulation using SIMD multi-core machines. We evaluate its performance on different test circuits simulated on the Intel Xeon Phi and two other machines. We compare this software/hardware combination with reported performances of GPU and other multi-core simulation platforms. We also compare the lock-free architecture with a leading commercial simulator running on the same Intel hardware.
I. INTRODUCTION
Basing our work on a low-cost SIMD multi-core machine, we describe experiments which use a lock-free architecture to accelerate logic simulation. These experiments verify that the proposed data structures allow SIMD acceleration, particularly on machines with gather instructions (Section VIII-A). Furthermore, on sufficiently large circuits, we can achieve substantial performance gains from multi-core parallelism (Section VIII-B). Moreover, the experimental results show that a simulator using this approach outperforms an existing commercial simulator on a standard workstation (Section VIII-D). Finally, these experiments show that the performance on a cheap Xeon Phi card is competitive with results reported elsewhere on much more expensive supercomputers (Section VIII-F).
In clocked synchronous circuits, the utility of a simulator depends upon the maximal size of circuit that can be simulated as well as on its simulation time. Let n be the number of gates in the circuit and t the time in seconds to simulate the circuit for one clock cycle. The raw performance is then n/t.
The performance of a logic gate simulation system is generally expressed in terms of e, the number of gate evaluations per second. During the simulation, since not every gate will undergo a transition in each clock cycle of the simulated machine, it is usually the case that e < n/t. The difference between these two quantities is exploited by event-based simulation.
When simulation is performed on a single-processor machine, the method is known to be effective. Event-based simulation depends on the maintenance of queues, and parallelisation would lead to potential lock contention for those queues. Message passing models of parallelism have queues for messages built into their basic communication mechanism. As such, message passing parallelism has been successfully used to accelerate event-based simulation.
Massively parallel discrete-event execution on several thousands of processors has been extended to simulating very large numbers of circuits in detailed hardware simulations of microprocessors [1], [2].
In contrast to dynamically scheduling the behaviour of the model during the simulation, static scheduling of all the computations would be well suited for massively parallel machines. This technique is called oblivious simulation. In this paper, we describe a lock-free architecture that allows contention free parallelism on low-cost SIMD multi-core accelerator boards.
II. RELATED WORK
The initial research work on logic simulation started around 1980, when the concepts of oblivious (cycle-based) and event-based simulation were first addressed [3], [4], [5]. Research on parallel simulation algorithms bloomed around the same time, targeting both platforms with distributed memory [6], [7] and multiprocessors with shared memory [8].
There are various existing studies on parallelizing logic simulation targeting various platforms such as supercomputers and workstations with accelerators such as GPUs.
Parallelisation of a simulation algorithm targeting SIMD (Single Instruction Multiple Data) hardware was first carried out in [9]. That compiled-code logic simulation targeting GPUs did not achieve ideal performance. Because of the communication overhead between CPU and GPU, and because data transfers between host and device were not optimized, the CPU outperformed the GPU. Moreover, the authors did not use the general-purpose parallel programming model CUDA.
In contrast to our work, Chatterjee, Deorio, and Bertacco [10] , [2] , [11] , Sen, Aksanli, and Bozkurt [12] , and [13] , [14] use circuit partitioning algorithms to achieve fast simulation.
Chatterjee et al. propose both oblivious and event-based simulation algorithms. Although their oblivious simulator [10] is simple and statically optimizable, the size of the circuits that it can simulate is limited by the size of the local shared memory as well as by the number of multiprocessors on the GPU. In their event-based simulator [2], they partition the design into macro-gates. During the simulation, one or more macro-gates are assigned to a multiprocessor. The number of concurrent thread blocks in a multiprocessor determines the number of macro-gates that are simulated together.
The hybrid simulator (GPU event-based simulator) by Chatterjee et al. [11] uses event-based simulation at a coarse granularity and oblivious simulation within each coarse-grain group. The limited amount of shared memory on the GPU, which is shared among the threads in a block, puts a constraint on the size of these macro-gates.
Similar to our work, Chatterjee et al. use a lookup table for gate evaluation, whereas Sen et al. use an AIG (And-Inverter Graph) representation in which all the logic gates in the circuit are AND gates. An AIG is a representation used for Boolean function manipulation. The technique has been widely used in technology mapping, logic synthesis, and verification [15], [16], [17], [18], [19].
There have been other research works on parallel logic simulation targeting platforms other than GPUs. Gonsiorowski, Carothers, and Tropper [1] study parallel logic gate simulation on supercomputers. They mainly focus on parallel simulation of an OpenSPARC T2 crossbar. Gonsiorowski et al. use the ROSS [20] Parallel Discrete Event Simulation (PDES) kernel, a framework that is built on Jefferson's Time Warp [21].
Similar to our work, Gonsiorowski et al. use a gate-level netlist with basic Boolean gates and also consider a unit-delay model (each gate has one clock cycle delay).
There are existing papers on the acceleration of digital circuit simulation using GPUs [22] , [13] , [23] , [24] , [25] , [26] and multicores [1] , [27] . In our work we use the Intel Xeon Phi [28] to run digital logic simulation.
III. MIC ARCHITECTURE (INTEL XEON PHI COPROCESSORS)
Intel makes an alternative technology to GPUs: the Many Integrated Core (MIC) Architecture. The first version is the 22 nm Knights Corner chip, sold under the name Xeon Phi. The Intel Xeon Phi coprocessors are SMPs that plug into the host (an Intel Xeon processor) via PCI Express. The Xeon Phi is configured as a daughter card which runs an independent copy of Linux. It relies on the host motherboard for power and communication. This makes the Xeon Phi physically similar to GPUs.
Xeon Phi cores are based on the Pentium architecture. The chip has 57 to 61 cores clocked at around 1 GHz. There are 4 hardware threads per core, which results in roughly 240 logical cores. Every core has 512-bit wide vector registers in addition to the standard x86 registers.
Interconnection among cores is based on a ring network model that allows the L2 caches for each core to be accessible by all others. In all, a total coherent cache of over 30MB is available. A Xeon Phi has from 6 to 16 GB of GDDR5 RAM giving around 170 GB/s bandwidth. Each core has its own 32KB L1 cache only accessible locally.
The Xeon Phi is equipped with a new set of instructions, the Intel Initial Many-Core Instructions (IMCI), which are supported by the Vector Processing Unit (VPU) within each core. Each VPU supports 512-bit SIMD vectors.
An advantage of using MIC rather than GPUs is that the same code written for a multicore CPU can run on the Xeon Phi coprocessors. This contrasts with the way CPU application code needs alterations in algorithm and syntax when ported to a GPU using CUDA. An application written for the MIC architecture using the Intel C or Vector Pascal compilers runs unaltered not only on the Xeon Phi, but also on computers with standard Intel processors [29].
IV. SIMD REQUIREMENTS
SIMD seems initially attractive for simulation since it would allow a single instruction to perform a logical operation on 512 bits' worth of data. But this tantalizing vision faces two serious obstacles:
1) All of the bits in the 512-bit word must perform the same operation: AND, OR, etc. This seems to imply that a SIMD simulator would have to segregate logic gates into blocks of ANDs, ORs, etc.
2) If one represents simulated logic signals using packed single bits in a CPU register, one needs a means of redistributing the outputs of the previous logic layer into the appropriate packed bit representation. Whilst this can readily be coded using shift and OR instructions, this would impose a serialization that would offset the gains from SIMD parallelisation.
No machine currently supports an instruction that loads a packed bit format from a vector of other bit addresses. A weaker approximation to this type of instruction does exist. The Xeon Phi vgather instruction loads a SIMD register rx with 16 doublewords such that rx[i] is loaded from mem[ry[i]], where ry is another SIMD register and i is in the range 0..15.
Although a single instruction operates on 16 logic signals instead of 512, that is still 15 more than can be handled without SIMD. With 60 cores on the chip, there is a potential simulation parallelism level of 16 × 60 = 960.
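The gather semantics and the parallelism estimate above can be illustrated in scalar form. The following is a minimal sketch and not the hardware instruction itself; the function name gather16 and the sample memory contents are assumptions made purely for illustration.

```python
LANES = 16   # doublewords per 512-bit register (512 / 32)
CORES = 60   # core count used in the estimate in the text

def gather16(mem, ry):
    """Scalar model of a 16-lane gather: rx[i] = mem[ry[i]] for i in 0..15."""
    assert len(ry) == LANES
    return [mem[ry[i]] for i in range(LANES)]

mem = [v * 10 for v in range(100)]                      # hypothetical memory image
ry = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]   # arbitrary index vector
rx = gather16(mem, ry)

# Signals handled per step if every core issues one 16-lane operation.
parallelism = LANES * CORES
```

Each lane reads from an independent address, which is what lets gates with unrelated inputs share one vector operation.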
Coming up with a data structure that both allows use of the vgather instruction and ensures good cache locality is the main challenge. Note that the technique is applicable to machines other than the Xeon Phi, as similar gather instructions are available in AVX-512.
V. DATA STRUCTURE

A. Levels of logic
We restrict ourselves to simulating synchronous state machines and make the following further simplifying assumptions:
• All gates are two-input: NOT is represented by a NAND with duplicate inputs, 3-input ANDs are made up of pairs of 2-input ones, etc.
• All two-input gates have the same gate delay, t.
Working backwards from the rising edge of the system clock, the state latches can be affected either by external inputs that feed them directly or by logic gates. No change to an input to a logic gate occurring after a time −t can affect an input to a latch. It follows that there can be no dependencies in the last t of the machine cycle within the set of signals that feed the latches, or between the gates that generate those signals. Thus all of these gates can in principle be simulated in parallel. Call this set of signals level N. Clearly we can, by induction, apply the same argument to the signals which feed these gates, which we will call level N − 1. Given a netlist, we levelise it as follows:
Step 1. Form the set of all signals feeding the latches or outputs.
Step 2. Push the gates whose outputs generate this set onto a stack.
Step 3. Form the set of all signals feeding the set of gates on the top of the stack.
Step 4. If this set is empty, go to step 5; otherwise go to step 2.
Step 5. Set n=0.
Step 6. Pop the stack and label all gates with level n.
Step 7. If the stack is empty, terminate; otherwise set n=n+1 and go to step 6.
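The levelisation steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not necessarily the implementation used in the simulator: the netlist is modelled as a hypothetical dictionary from each gate's output signal to its two input signals, the combinational logic is assumed to be acyclic, the backward walk is assumed to stop when no gate drives the current signal set, and a gate reached at several depths is assumed to keep the label from its first pop so that it is always evaluated before the gates it feeds.

```python
def levelise(gates, sinks):
    """gates: dict mapping a gate's output signal to its (input0, input1) signals.
    sinks: signals feeding the latches or outputs.
    Returns a dict mapping each gate's output signal to its level."""
    stack = []
    signals = set(sinks)                                  # Step 1
    while True:
        layer = {s for s in signals if s in gates}        # Step 2: gates generating the set
        if not layer:                                     # Step 4 (assumed stop condition)
            break
        stack.append(layer)
        signals = {i for g in layer for i in gates[g]}    # Step 3: signals feeding that layer
    level = {}
    n = 0                                                 # Step 5
    while stack:                                          # Steps 6 and 7
        for g in stack.pop():
            level.setdefault(g, n)                        # assumption: first label wins
        n += 1
    return level

# Hypothetical netlist: a, b, c are primary inputs; g3 feeds a latch.
netlist = {"g1": ("a", "b"), "g2": ("g1", "c"), "g3": ("g1", "g2")}
levels = levelise(netlist, {"g3"})
```

In this example g1 is reachable at two depths, and keeping its first label places it below both g2 and g3, which consume its output.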
By levelisation (Fig. 1) , we can perform parallel independent calculations of whole levels and only force synchronization at the end of each level's simulation [30] , [10] , [9] .
VI. SIMD BASED SIMULATOR ARCHITECTURE
In order to benefit from SIMD architectures, the same operation should be applied to a large number of data elements. Different logic gates perform different Boolean functions. However, all can be represented as truth-table lookups. We can thus perform AND, OR, NAND, etc. in parallel using SIMD instructions which read an aggregate lookup table of size 16x4, which in turn holds truth tables for all binary logic gates. A simplified version of the lookup table is as below:
To keep the lookup table simple, we have used only basic two-input logic gates. Larger circuits are broken down to the level of two-input logic gates.
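One way to realise a 16x4 aggregate lookup table is sketched here. The row encoding is an assumption chosen for illustration and is not necessarily the encoding used in the simulator: row f holds the truth table of the two-input Boolean function whose outputs for the input pairs 00, 01, 10, 11 are bits 0 to 3 of f.

```python
def build_lut():
    """16 rows (one per two-input Boolean function) by 4 columns (one per input pair).
    Column index is (a << 1) | b."""
    return [[(f >> k) & 1 for k in range(4)] for f in range(16)]

# Assumed gate type codes under this encoding.
AND, OR, XOR, NAND = 0b1000, 0b1110, 0b0110, 0b0111

def eval_gate(lut, gate_type, a, b):
    """Every gate type is evaluated by the same indexing operation."""
    return lut[gate_type][(a << 1) | b]

lut = build_lut()
nand_outputs = [eval_gate(lut, NAND, a, b) for a in (0, 1) for b in (0, 1)]
```

Because evaluation is a single table read regardless of gate type, gates of different types can share one vector operation.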
To allow efficient parallel access, we represent the circuit as 4 contiguous vectors. The first three vectors hold the data related to the structure of the circuit: comp holds the logic types, and inp0 and inp1 identify the two inputs to each logic gate. The final vector, called state, holds the time-varying information of the simulation. In other words, this final array contains the current state values of all the signals. To update the state of the flip-flops, a separate set of arrays is used, containing the locations of the input and output signals of each flip-flop in the state array. Fig. 3 shows an example of the array data structure used for the given circuit (Fig. 2). The circuit netlist contains all the information related to the gate levels, inputs, outputs, etc. The state array holds the signal values from each level close to each other to ensure data locality. Level 0 in the state array contains all the values of the input signals to the circuit.
Each location i in arrays inp0 and inp1, contains the ID number of the input signals to the logic gate in location i in array comp, where its type is stored. The current state value of its output signal is stored at index i in state array.
During each clock cycle, the simulation function is called once for each level up to the maximum depth of the circuit, to simulate the logic gates of the same level all together 1. Listing 2, line 7, shows how the value of the current state signals is calculated.
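The per-level evaluation over the comp, inp0, inp1 and state arrays can be sketched in scalar form as follows. This is an illustrative sketch, not the listing referred to in the text: the names level_bounds, ff_in and ff_out, the lookup table encoding, and the toggle circuit are all assumptions introduced for the example.

```python
XOR = 0b0110  # assumed encoding: bit k of the code is the output for input pair k

def build_lut():
    # 16 rows (one per two-input Boolean function) by 4 columns (one per input pair)
    return [[(f >> k) & 1 for k in range(4)] for f in range(16)]

def simulate_cycle(comp, inp0, inp1, state, level_bounds, ff_in, ff_out, lut):
    """Advance the circuit by one clock cycle, updating state in place."""
    for start, end in level_bounds:          # levels in increasing depth
        for i in range(start, end):          # gates within a level are independent
            k = (state[inp0[i]] << 1) | state[inp1[i]]
            state[i] = lut[comp[i]][k]
    latched = [state[s] for s in ff_in]      # sample all flip-flop inputs first
    for dst, value in zip(ff_out, latched):  # then drive the flip-flop outputs
        state[dst] = value

# Hypothetical toggle circuit: signal 0 is a constant 1, signal 1 is the
# flip-flop output q, and signal 2 is the gate XOR(q, 1) feeding the flip-flop.
comp = [0, 0, XOR]
inp0 = [0, 0, 1]
inp1 = [0, 0, 0]
state = [1, 0, 0]
lut = build_lut()
q_trace = []
for _ in range(4):
    simulate_cycle(comp, inp0, inp1, state, [(2, 3)], [2], [1], lut)
    q_trace.append(state[1])
```

The inner loop is the part that a vectorizing compiler or explicit gather instructions would process 16 gates at a time, since no iteration reads a value written by another iteration of the same level.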
The use of a lookup table in this calculation, and the way the data is stored in the arrays, allow this calculation to run in SIMD. The Intel Xeon Phi has 128 512-bit SIMD registers on each of its cores (32 per thread). Components of the same path depth are simulated as 512-bit chunks of data (Fig. 4). In other words, the loads, stores, and calculations are all done in SIMD on 512 bits of data at the same time (Fig. 4).
Given an array of size N, on an Intel Xeon Phi with 240 threads, each hardware thread processes N/240 elements of the array. On top of this, vectorization allows 16 simultaneous calculations. So each arithmetic unit only has to do N/(240 × 16) = N/3840 calculations.
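The division of an array among threads and vector lanes can be made concrete with a short sketch. The function names and the ceiling-division chunking are assumptions for illustration; the actual scheduling is performed by the compiler and runtime.

```python
THREADS = 240   # hardware threads on the Xeon Phi, as in the text
LANES = 16      # 32-bit lanes per 512-bit vector

def thread_slice(n, tid, threads=THREADS):
    """Half-open range [lo, hi) of array elements owned by thread tid."""
    chunk = -(-n // threads)          # ceiling division
    lo = min(tid * chunk, n)
    hi = min(lo + chunk, n)
    return lo, hi

def vector_ops(lo, hi, lanes=LANES):
    """Number of vector operations needed to cover elements lo..hi-1."""
    return -(-(hi - lo) // lanes)

n = 3_840_000                          # hypothetical array size
lo, hi = thread_slice(n, 0)
ops = vector_ops(lo, hi)               # equals n / 3840 when n divides evenly
```

When a level holds fewer elements than 240 × 16, some threads receive an empty range, which is one way to see why small circuits gain little from multi-threading.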
During the simulation, logic gates of the same level are divided among threads. The amount of work per thread depends on the shape of the circuit (the distribution of logic gates per level), so it varies from level to level. Within a level, each thread performs calculations on an equal share of the data, which ensures work balancing. Fig. 5 shows the 
VII. EXPERIMENTAL DATA
A. Test sets
To evaluate our proposed parallel SIMD circuit simulator, we have used test circuits from the IWLS benchmark suite, in addition to synthetic circuits.
We took the benchmarks available in BLIF (Berkeley Logic Interchange Format). We used a parser to flatten each circuit and generate the netlist array.
Due to the absence of large benchmark designs (for confidentiality concerns in industry), in addition to the IWLS benchmark suite, we generate synthetic circuits to be used as a benchmark suite for the experiments. The construction algorithm we used is that published in [32], which has been well validated in that the circuits it builds are representative of real circuits.
The process of generating the test circuits and the simulation itself is shown in Fig. 6 .
B. Experimental Setup and Benchmark
We used an Intel Xeon and a Xeon Phi coprocessor as our primary platforms to evaluate the performance of our parallel SIMD simulator.
To assess the performance, the circuit simulator was run on different architectures for a varying number of cores over different sizes of circuits. The Intel Xeon Phi 5110S coprocessor with 60 cores, each operating at 1.053 GHz, an Intel Xeon E5-2620 processor operating at 2 GHz, and an Intel Core i7-2630QM were used. Table I shows the detailed specifications of the architectures.
Throughout the experiments, we focus on the total physical elapsed time. We also used other metrics such as event rate. We used a counter to measure the number of gate transitions. Regardless of how many cores are used, this metric gives us the number of events that can be computed per second. For comparison, we also used a commercial simulator (Xilinx) run on our Intel i7 machine.
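The relation between the transition counter, the event rate, and the raw rate n/t can be sketched as follows. This is an illustrative sketch only: the function names and the sample numbers are assumptions, and the counter in the simulator may be implemented differently.

```python
def count_transitions(old_state, new_state):
    """Number of signals whose value changed between two snapshots of state."""
    return sum(1 for o, n in zip(old_state, new_state) if o != n)

def event_rate(total_transitions, elapsed_seconds):
    """Gate transitions (events) computed per second of wall-clock time."""
    return total_transitions / elapsed_seconds

def raw_rate(n_gates, n_cycles, elapsed_seconds):
    """Gates visited per second by an oblivious simulator (n/t per cycle)."""
    return n_gates * n_cycles / elapsed_seconds

changed = count_transitions([0, 1, 1, 0], [1, 1, 0, 0])   # two signals changed

# Hypothetical run: 1,000,000 gates, 1000 cycles, 2 s elapsed, 25% activity.
e = event_rate(250_000_000, 2.0)
r = raw_rate(1_000_000, 1000, 2.0)
```

Since only some gates change in a given cycle, the event rate stays below the raw rate, as noted in Section I.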
VIII. RESULTS
In order to validate the effectiveness of our SIMD simulator, we ran the simulator for 1000 clock cycles on two main platforms: Intel Xeon and Intel Xeon Phi.
A. SIMD acceleration
In order to show the effect of vectorization on the performance of the sequential simulator, we ran the simulator on one core with and without SIMD.
With or without vectorization, the sequential simulator runs faster on the Xeon than on the Xeon Phi. However, the purpose of this experiment was only to show how much improvement SIMD vectorization gives.
SIMD acceleration increases the speed by up to 10 times on the Intel Xeon Phi and by up to 4 times on the Intel Xeon. We can see that the acceleration falls to 2 for large circuits (Fig. 7). This reduction is due to the size of the circuit relative to the size of the L2 cache.
When the circuit does not fit into the cache, the performance degrades. Note that each logic gate occupies 4 integers (Fig. 3) . With 512K L2 cache per core, the maximum size of the circuit that fits the cache is about 32k.
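The 32k figure follows from simple arithmetic, sketched here under the assumption that each of the four per-gate integers is a 32-bit (4-byte) value and that the whole 512 KB cache is available to the circuit data.

```python
BYTES_PER_INT = 4          # assumption: 32-bit integers
INTS_PER_GATE = 4          # one entry each in comp, inp0, inp1 and state
L2_BYTES = 512 * 1024      # 512 KB of L2 cache per core

bytes_per_gate = INTS_PER_GATE * BYTES_PER_INT     # 16 bytes
gates_in_l2 = L2_BYTES // bytes_per_gate           # 32768, i.e. about 32k gates
```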
The vector length is 512 bits on the Xeon Phi and 256 bits on the Xeon. The expected potential speedup from enabling SIMD acceleration is therefore 16 on the Intel Xeon Phi and 8 on the Intel Xeon, meaning that 16 logic gates on the Xeon Phi and 8 logic gates on the Xeon can be evaluated at the same time. However, the achieved vectorization speedup on one core was 10 on the Xeon Phi and 4 on the Xeon. This is unsurprising, since Amdahl's law predicts that the vectorization speedup would normally be less than the vector length.
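Amdahl's law can be inverted to estimate what fraction of the single-core run time would have to be vectorized to produce the observed speedups. The following is an illustrative back-calculation rather than a measurement: it assumes the vectorized part scales perfectly with the vector width and ignores memory effects.

```python
def amdahl_speedup(p, v):
    """Speedup when a fraction p of the work is accelerated by a factor v."""
    return 1.0 / ((1.0 - p) + p / v)

def vector_fraction(speedup, v):
    """Fraction p that yields the given speedup at vector width v (inverse of the above)."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / v)

p_phi = vector_fraction(10, 16)    # Xeon Phi: speedup 10 with 16 lanes
p_xeon = vector_fraction(4, 8)     # Xeon: speedup 4 with 8 lanes
```

Under these assumptions roughly 96% of the Xeon Phi run time and roughly 86% of the Xeon run time would be in vectorized code, which illustrates how a small scalar remainder keeps the speedup well below the vector length.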
B. Multi-core acceleration
The log/log plots in Fig. 8 show the effect of multi-core parallelism.
On the Intel Xeon Phi, as we increase the number of threads (from 1 to 240), we clearly see improvements on larger circuits. From a circuit size of 3 million logic gates upwards, the use of 240 threads shows some speedup. The largest synthetic circuit used in these experiments has around 160 million logic gates. For this circuit size, we achieved a speedup of 10 using 240 threads on the Intel Xeon Phi in comparison to the baseline (the sequential version on the Intel Xeon). When using multiple threads, not all the resources (hyperthreads) are always available and free to do the simulation. As a circuit grows larger, there will be memory contention (when cores try to access a part of memory that is not currently accessible), leading to hyperthread stalls. Although smaller circuits fit into cache, they benefit less from multi-threading. We hypothesize that there is not enough work to keep the cores busy and that task dispatch overhead degrades the performance.
C. Circuit Connectivity
The results reported so far have used synthetic circuits with random interconnect, random both in terms of which previous layer of logic an input comes from, and in terms of which logic gate within a layer is used. The circuits are thus maximally disordered to provide a worst case.
The same experiments were applied to another set of less random synthetic circuits. In these, the inputs to each logic gate are derived from the immediately previous layer. Randomness is thus significantly reduced, and with the reduction of randomness, we should expect greater locality of access. We call this set the semi-random set.
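The difference between the two interconnect styles can be sketched as follows. This is not the construction algorithm of [32]; it is a simplified, hypothetical generator that only illustrates where each gate is allowed to draw its inputs from. The function name, parameters and level sizes are assumptions.

```python
import random

def synth_inputs(level_sizes, semi_random, seed=0):
    """Return (inp0, inp1, starts) for a layered synthetic circuit.
    Level 0 holds primary inputs, which have no drivers (entries set to -1).
    starts[k] is the index of the first signal of level k."""
    rng = random.Random(seed)
    starts = [0]
    for size in level_sizes:
        starts.append(starts[-1] + size)
    inp0 = [-1] * level_sizes[0]
    inp1 = [-1] * level_sizes[0]
    for lvl in range(1, len(level_sizes)):
        lo = starts[lvl - 1] if semi_random else 0   # semi-random: previous level only
        hi = starts[lvl]                             # exclusive: first signal of this level
        for _ in range(level_sizes[lvl]):
            inp0.append(rng.randrange(lo, hi))
            inp1.append(rng.randrange(lo, hi))
    return inp0, inp1, starts

sizes = [8, 8, 8, 8]                                 # hypothetical level sizes
r0, r1, starts = synth_inputs(sizes, semi_random=False)
s0, s1, _ = synth_inputs(sizes, semi_random=True)
```

In the semi-random case every index read by a level falls inside one contiguous block of the state array, which is the locality property discussed above.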
The observed speedup peaks at 460 as opposed to 300 for maximally random circuits. This shows the effect of data dependency and data locality on performance. The increased performance is due to the change in the read/write pattern. Most likely, fewer gather instruction cycles are needed to collect the required data during the simulation.
Fig. 9. Semi-random synthetic circuits with inputs from the previous level only compared with random circuits, both traces using SIMD + multi-cores. Speedup compared to one core non-SIMD on the same circuit on Intel Xeon Phi. (Axes: Number of Logic Gates, Vectorization Performance; traces: random circuits, semi-random circuits.)
D. Comparison with a commercial workstation simulator
In this part of the experiments, we used the commercial simulator only on standard Intel chips, so this comparison did not use the Xeon Phi. The commercial simulator could only take circuits of small size. The circuits are of a size for which our simulator works best with only SIMD and multicore parallelism. The circuits are small and there is not enough work to keep the cores busy and hide the latency. As we increase the number of threads, the overhead due to thread creation worsens the performance. However, even using one thread on the same machine, our SIMD simulator is much faster than the commercial simulator (Fig. 10).
E. Comparison with simulations on GPUs
There are prior papers on logic circuit simulation on GPUs, though the results reported in the literature are for comparatively small circuits.
In [12] , the authors use partitioning and replication in conjunction with levelization in order to handle the problem that the GPUs provide a small amount of shared memory.
It is possible to directly compare the performance of our data structure with the results they report for only two of their circuits, since working BLIF representations of the others are not available. Table II shows that even when our data structure is run on one core of a standard Intel i7, the performance substantially exceeds the results reported in [12] when we use the metric of nanoseconds per gate simulation.
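The nanoseconds-per-gate metric can be computed as sketched below. The sketch assumes the metric is the elapsed time divided by the number of gate simulations, taken here as gates times simulated clock cycles; the function name and the sample figures are hypothetical and are not values from Table II.

```python
def ns_per_gate(elapsed_seconds, n_gates, n_cycles):
    """Average wall-clock nanoseconds spent per gate per simulated clock cycle."""
    return elapsed_seconds * 1e9 / (n_gates * n_cycles)

# Hypothetical run: 1,000,000 gates simulated for 1000 cycles in 2 s.
example = ns_per_gate(2.0, 1_000_000, 1000)
```

Normalising by circuit size and cycle count is what allows results obtained on different circuits and run lengths to be placed side by side.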
It is also worth noting that the mentioned paper reports results only on comparatively small circuits, well under a million gates. So the applicability of their technique to large circuits is unclear.
Yuxuan et al. [13] introduced a strategy to extract and partition the circuit in order to compile it to GPUs. They presented a comparison between the Intel Core Duo T2400 processor at 1.8 GHz and the NVIDIA GTX 465.
They achieved gate cycle times (Table III) comparable to the peak performance of MIC (Fig. 11). This reflects the lower task dispatch cost in CUDA relative to the Xeon Phi. The Xeon Phi achieves its best performance on large circuits, where the task dispatch cost can be amortized.
Chatterjee et al. report simulation on an NVIDIA 8800GT GPU with 14 multiprocessors. Since there is no overlap of test circuits, in order to compare the performance of our simulator with the GCS simulator [11], we compare our speedup relative to our sequential commercial simulator mentioned in Section VII-B with the speedup that Chatterjee et al. [11] report relative to a commercial simulator. Their GCS simulator outperforms their commercial simulator by between 4 and 44 times, with an average speedup of 13. Our simulator, running with SIMD parallelism on one core of the Intel i7, outperforms our commercial simulator by an average factor of 356.
1) Conclusions relative to GPUs:
GPUs can achieve gate cycle per second rates comparable to those of the Xeon Phi. But this has only been demonstrated on GPUs for relatively small circuits. Although it is not explained in the literature why small circuits have been used in GPU experiments, we hypothesize that the relatively small local memory on GPUs motivates experimenters to select problems that are easier to map to the local memory. It is clear from our results that the Xeon Phi can be extended to circuits of around 100 million gates.
The ring architecture, in which memory accesses by each core are satisfied in the priority: local cache, cache of other cores on ring, GDDR ram, seems very effective. It obviates the need for the programmer to schedule transfers to local memory whilst still giving very good performance even with maximally random circuits.
F. Comparison with simulation on the IBM Blue Gene
Gonsiorowski et al. [1] used a discrete event simulation framework that allows simulations to be run in parallel, called ROSS (Rensselaer Optimistic Simulation System), a modular Time Warp system. The paper reports the performance of this framework executing parallel event-based simulation (based on the Time Warp protocol) using a message passing interface on Blue Gene/L.
1) Blue Gene/L Architecture: The experiments were done on two machines (IBM Blue Gene/L, and Intel X5650). The Blue Gene/L has up to 1024 cores, each performing at 700 MHz clock rate.
To evaluate the simulation performance, we compare the number of gate transitions per second between our simulator and [1] .
We compare the event metric for our largest circuit (with over 160 million gates) with that for their 216 million gate circuit. On Blue Gene/L with 1024 cores, they achieved an event rate of 116 million events per second. Our simulator achieved an event rate of 141 million events per second (Fig. 12). Table IV compares some of the characteristics of the Intel Xeon Phi and Blue Gene/L in terms of the price per rack and the size, in addition to the number of available cores. Table V compares the event rate data taken from [1] with the event rate measured in our simulator. We achieve better performance on many fewer cores at much lower cost. The MIC clock speed is slightly higher than that of the Blue Gene, but the main gain comes from the ability of our data structure to handle both SIMD and multi-core parallelism with low synchronization overhead.
IX. CONCLUSION
In this paper, we proposed a lock-free architecture for accelerating logic gate simulation that allows targeting a low cost SIMD multi-core machine.
We applied a data structure to state-of-the-art Intel Xeon Phi technology. This data structure minimizes the synchronization overhead while maximizing the opportunity for SIMD and parallel operations. The combination of this data structure and the Xeon Phi chip is a cost-effective solution for simulation acceleration.
We have shown that this combination is far faster than, and can handle much bigger circuits than, a widely used commercial simulator running on a workstation. We have shown that the Xeon Phi is competitive with simulation on GPUs and allows the handling of much larger circuits than have been reported for GPU simulation. We also presented results which show that it gives comparable simulation performance to the IBM Blue Gene supercomputer at very much lower cost.
In future publications, we will address the portability of our simulator to other programming languages and to parallel machines by different manufacturers.
