Abstract
Introduction
The increasing size and complexity of designs, together with quality assurance and reliability requirements and relentless time-to-market pressures, have rendered functional verification and performance evaluation the major bottleneck in hardware design, calling for ever faster verification and greater coverage. Two main approaches have been followed to address this issue, namely testbench automation and speeding up the simulation per se, thus allowing an increase in coverage through improved run times. Traditionally, it has been switch and logic level models whose demands in terms of time and memory made their simulation on conventional von Neumann machines extremely time consuming. However, the increasing complexity of architectural designs has also dramatically increased the requirements of higher level digital simulation (e.g. Register Transfer level), placing it firmly among the highly computation-intensive applications, with requirements that far exceed the capabilities of conventional sequential von Neumann computer systems. As the complexity of designs has increased, long execution times have made simulation a major and growing bottleneck in the VLSI design process. Since 1995, simulation needs have grown a thousandfold while simulator speed has only increased fiftyfold [8].
One approach to speeding up simulation is to exploit the inherent parallelism of digital systems and employ parallel and distributed simulation techniques, whereby gates, functional blocks, etc. are modelled as "Logical Processes" (LPs) which may be executed on different processors [9]. Distributed simulation has emerged as a particularly promising and viable approach to alleviating the simulation bottleneck in VLSI design, and over the past ten years it has received considerable attention from researchers working with mainstream Hardware Description Languages such as VHDL and Verilog [3, 15]. Distributed simulation techniques, albeit ad hoc, are progressively finding their place in innovative commercial hardware design environments (e.g. the VCK by Avery Design Systems, where SimCluster, an ad hoc distributed simulation engine, is used to leverage mainstream Verilog simulations).
Another important recent development in VLSI design has been a resurgence of interest in asynchronous design techniques, due to the significant potential benefits that the elimination of global synchronisation may offer in areas such as clock distribution, power consumption, performance and modularity [10].
A number of asynchronous processors have been developed, including NSR and Fred at the University of Utah, STRiP at Stanford University, FAM and TITAC at Tokyo University and Tokyo Institute of Technology respectively, Hades at the University of Hertfordshire, Sun's Counterflow pipeline processor, Sharp's Data-Driven Media Processor, CalTech's processors and Lutonium, the series of asynchronous implementations of the ARM RISC processor (AMULET1, AMULET2e, AMULET3i and SPA) developed by the AMULET group at the University of Manchester [1], and SAMIPS [21, 24, 25], a synthesisable asynchronous MIPS processor.
Synchronous VLSI modelling and simulation techniques are proving unsuitable for the asynchronous design style, and therefore the last decade has witnessed intense research activity aimed at developing notations and techniques appropriate for modelling and simulating asynchronous systems. I-Nets, Petri Nets, Signal Transition Graphs, CCS and in particular the concurrent process algebra Communicating Sequential Processes (CSP) are some of the tools and formalisms that have been employed in asynchronous logic design [17].
This paper presents PARBREEZE, a framework for the distributed simulation of asynchronous hardware. The framework targets the behavioural simulation of asynchronous hardware developed within Balsa, a CSP-based synthesis environment developed at the University of Manchester, UK. The rest of the paper is organised as follows: section 2 provides an overview of the Balsa system and the associated handshake circuits; section 3 discusses the role of simulation in asynchronous hardware design; section 4 describes the architecture of PARBREEZE; section 5 discusses performance results; and section 6 summarises the paper and identifies areas for further work.
Balsa and Handshake Circuits
Balsa [6] is both an asynchronous hardware synthesis framework and a CSP-based language for describing asynchronous systems. It has been demonstrated by synthesising the DMA controller of Amulet3i, SPA [16] (an AMULET core for smartcard applications), and SAMIPS. Figure 1 shows an overview of the Balsa system.
Handshake Circuits
Descriptions of designs (.balsa files) are translated by Balsa-c into implementations in a syntax-directed fashion, with language constructs being mapped into networks of parameterised instances of handshake components (.breeze file), each of which has a concrete gate level implementation [4]. Balsa handshake circuits are very similar to those introduced in the Philips Tangram system [7].
A number of tools are available to process the breeze handshake files [6]. Balsa-netlist automatically generates CAD native netlist files, which can then be fed into commercial CAD tools that further synthesise the netlist into a fabricable layout.
Balsa has approximately fifty handshake components in total, most of them inherited from Tangram. Each component has a unique name, symbol and definition, and several implementations (based on different technologies). In a handshake circuit, components communicate via point-to-point channels to exchange control information and, optionally, data. During a transaction, the initiator component requests the transfer of information (by issuing a request signal) and the target component responds (with an acknowledgement). Channels are connected to components via ports, which may be either active (connected at the initiator's side) or passive (connected at the target's side). Depending on the direction of the data flow, a channel is classified as a push channel (from the initiator to the target) or a pull channel (from the target to the initiator). Sync channels are used for synchronisation and do not carry any data.
In the handshake circuit graph of Figure 5, each node represents a handshake component. The activation port starts the operation of the repeater (BrzLoop), which initiates a handshake with the sequencer (BrzSequence). The sequencer first triggers a handshake with the fetch component (BrzFetch) on the left, which causes the data to be moved from the channel i to the variable element (BrzVariable). Then a handshake is issued by the sequencer to the right fetch, causing the data to be read from the variable element and assigned to the channel o. Once these operations finish, the sequencer completes its handshake with the repeater, which starts the cycle again.
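The channel classification and the loop/sequence/fetch cycle described above can be illustrated with a minimal Python sketch. It is not Balsa or breeze-sim code: the names `ChannelKind`, `data_flow` and `run_buffer` are hypothetical, and the handshakes are reduced to trace entries rather than request/acknowledge signals.

```python
from enum import Enum


class ChannelKind(Enum):
    SYNC = "sync"   # synchronisation only, carries no data
    PUSH = "push"   # data travels from the initiator to the target
    PULL = "pull"   # data travels from the target to the initiator


def data_flow(kind):
    """Return the (sender, receiver) roles of the data, or None for a sync channel."""
    if kind is ChannelKind.PUSH:
        return ("initiator", "target")
    if kind is ChannelKind.PULL:
        return ("target", "initiator")
    return None


def run_buffer(inputs):
    """Replay the repeater/sequencer/fetch/variable cycle once per input value."""
    trace, outputs = [], []
    variable = None
    for value in inputs:                       # BrzLoop: repeat the cycle
        trace.append("loop->sequence")         # repeater activates the sequencer
        trace.append("sequence->fetch_left")   # first handshake of the sequencer
        variable = value                       # left fetch: channel i -> variable
        trace.append("sequence->fetch_right")  # second handshake of the sequencer
        outputs.append(variable)               # right fetch: variable -> channel o
        trace.append("sequence ack loop")      # sequencer completes its handshake
    return outputs, trace
```

Calling `run_buffer([3, 7])` returns the same values on the output side together with a trace of eight handshake steps, four per cycle, in the order given in the text.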
Levels of Simulation
Three levels of simulation are supported in Balsa, namely behavioural, gate-level and post-layout simulation (figure 1). The latter two, lower-level simulations are carried out by the native simulators of the commercial CAD tools supported by Balsa. At the behavioural level, discrete-event simulation is used to simulate the network of handshake components. Two sequential simulators have been developed for this level, LARD [2] and breeze-sim [12].
Simulation in Asynchronous Hardware Design
Functional verification and performance evaluation are more complex tasks in asynchronous systems than in their synchronous counterparts. In the latter, benchmark execution times are easy to interpret based on the number of clock cycles and the existence of a critical path. Delays in the critical path can determine the clock period, while non-critical path delays have no effect on the performance of the system. In contrast, the temporal behaviour of asynchronous systems is more difficult to understand and interpret, as delay inter-dependencies are more complex. Delays in one module may often be masked by occasional longer delays in another module, while the accumulation of delays through a chain reaction in a non-deterministic concurrent environment may have a chaotic effect on system performance. The need to evaluate the asynchronous architecture for different sub-system delays further complicates the process, rendering simulation speed a crucial element [19].
For instance, the slow performance of the LARD system often made it quicker to synthesise directly into a concrete realisation and then use the native CAD environment for simulation. As a result, functional simulation of the Amulet3i processor was severely constrained by the speed of the simulation. Consequently, the design was frozen prematurely in order to meet tape-out deadlines. A faster simulation environment would have allowed the design space to be explored more thoroughly [5] .
An effort to improve the performance of the sequential simulation for Balsa has yielded impressive results, with the breeze-sim simulator achieving a speedup factor of more than 19000 compared to LARD [12].
However, as asynchronous design techniques find their place in the mainstream VLSI industry and asynchronous designs become increasingly complex, simulation speed will remain a crucial problem and, as in synchronous hardware design, distributed simulation will provide the only viable solution. The distributed simulation effort for Balsa has targeted the handshake circuit level. Simulation at lower levels (switch, gate) would require a complete (expensive) CAD suite and a (specific) technology file and its associated libraries to be installed, merely to investigate design alternatives at the architecture level. A fast distributed simulation environment tightly coupled to Balsa is more sensible from the designer's viewpoint, avoiding the need to negotiate complex general purpose commercial CAD frameworks.
A decentralised event-driven approach based on the Logical Process Paradigm has been adopted for the development of the Balsa distributed simulator (the PARBREEZE kernel), while MPI is used for interprocess communication. The breeze file is parsed and partitioned so that different subsets of handshake components are assigned to different Logical Processes (LPs). Finding an optimal partition for a given circuit graph is an NP-complete problem and various heuristics have been developed to address this problem [11, 20]. Figure 6 depicts an example configuration of the PARBREEZE kernel.
Partitioning the handshake graph into different LPs generates a set of cut edges, as depicted in figure 7. A cut edge corresponds to a handshake channel which connects components in different LPs. To avoid modifying the implementation of the Balsa handshake channels, a new category of Network Handshake Components (NHCs) has been defined to facilitate interprocess communication. NHCs are automatically inserted in an existing handshake circuit at the cut edges. Figure 9 shows the use of network components when sync, push and pull channels are cut. As an indicative example, when a sync channel connecting an active port A to a passive port P is chosen by a partitioning algorithm to be a cut edge, a pair of NHCs, BrzNetCompActSync and BrzNetCompPassSync, is introduced. After the cut we obtain two new sync channels: the first connects the passive port P with the active port of BrzNetCompActSync and the second connects the active port A with the passive port of BrzNetCompPassSync.
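The insertion of network components at cut edges can be sketched in Python as follows. This is an illustration under stated assumptions, not the PARBREEZE implementation: the function `insert_nhcs` and its data layout are hypothetical, and only the BrzNetCompActSync and BrzNetCompPassSync names come from the text (the Push and Pull variants are formed by analogy).

```python
def insert_nhcs(channels, lp_of):
    """Split every channel that crosses an LP boundary into two local channels.

    channels: list of (kind, active_component, passive_component),
              with kind in {"sync", "push", "pull"}.
    lp_of:    dict mapping each component name to its LP index.
    Returns (local_channels, remote_links).
    """
    local_channels, remote_links = [], []
    for idx, (kind, active, passive) in enumerate(channels):
        if lp_of[active] == lp_of[passive]:
            # Not a cut edge: the channel stays as it is inside one LP.
            local_channels.append((kind, active, passive, lp_of[active]))
            continue
        suffix = kind.capitalize()
        # Assumed naming scheme; '#idx' only keeps the instances distinct.
        net_pass = f"BrzNetCompPass{suffix}#{idx}"  # passive port faces A
        net_act = f"BrzNetCompAct{suffix}#{idx}"    # active port faces P
        local_channels.append((kind, active, net_pass, lp_of[active]))
        local_channels.append((kind, net_act, passive, lp_of[passive]))
        # The NHC pair exchanges messages between the two LPs.
        remote_links.append((net_pass, lp_of[active], net_act, lp_of[passive]))
    return local_channels, remote_links
```

For a sync channel from A (in LP 0) to P (in LP 1), the sketch yields one channel from A to the passive network component in LP 0, one channel from the active network component to P in LP 1, and a single remote link between the pair, mirroring the sync example in the text.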
The Network Handshake Components

Event Driven Scheduler
Each LP in PARBREEZE is built around a typical event driven scheduler, as depicted in figure 10. The scheduler extracts events from a chronologically ordered event queue and processes them by invoking the model of the corresponding handshake component. Two message queues are also utilised, to send MPI messages to and receive MPI messages from remote LPs respectively. Before processing the next event, the scheduler examines whether there are remote messages waiting in the Incoming Event queue, and inserts them all in the internal event queue. This may naturally result in causality errors; however, as we have shown in previous work [18], such errors can be ignored. The time stamps of the incoming events are all set to the current value of the internal clock of the LP before they are inserted in the event queue.
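The scheduling policy described above can be sketched as a small Python class. This is a simplified illustration, not PARBREEZE code: the class and method names are hypothetical, and processing an event only records it instead of invoking a handshake component model.

```python
import heapq
from collections import deque
from itertools import count


class LogicalProcess:
    """Event driven scheduler of a single LP (illustrative sketch)."""

    def __init__(self):
        self.clock = 0            # internal simulation clock of the LP
        self._queue = []          # chronologically ordered event queue (heap)
        self._seq = count()       # tie-breaker keeps insertion order stable
        self.incoming = deque()   # events received from remote LPs
        self.processed = []       # (time, event) pairs, for inspection

    def schedule(self, time, event):
        heapq.heappush(self._queue, (time, next(self._seq), event))

    def step(self):
        """Drain the incoming queue, then process the earliest event."""
        while self.incoming:
            # Remote events are re-stamped with the current local clock.
            self.schedule(self.clock, self.incoming.popleft())
        if not self._queue:
            return False
        time, _, event = heapq.heappop(self._queue)
        self.clock = time
        self.processed.append((time, event))  # stand-in for the component model
        return True
```

Because incoming events are stamped with the current clock value, they are processed ahead of every local event scheduled for a later time, whatever time stamp the sender had attached to them.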
The interaction with remote LPs is dealt with by a listener POSIX thread which runs in parallel with the main scheduler thread (figure 11). The listener receives MPI messages (using the non-blocking MPI_Iprobe followed by MPI_Recv) and inserts them in the Incoming Event queue. In the absence of incoming messages, the listener turns its attention to the outgoing queue, sending all pending messages. Synchronisation between the two threads is achieved by means of semaphores, using pthread_mutex_lock.
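The listener behaviour described above can be sketched in Python with the standard `threading` module. This is an illustration only: `FakeTransport` is a hypothetical stand-in for MPI (its `poll` method plays the role of a probe followed by a receive), a `threading.Lock` stands in for the pthread mutex, and the short sleep when idle is a design choice of the sketch rather than something stated in the text.

```python
import threading
import time
from collections import deque


class FakeTransport:
    """Hypothetical stand-in for MPI; poll() mimics a probe followed by a receive."""

    def __init__(self, inbox=()):
        self.inbox = deque(inbox)
        self.sent = []

    def poll(self):
        return self.inbox.popleft() if self.inbox else None

    def send(self, message):
        self.sent.append(message)


def listener_pass(transport, incoming, outgoing, lock):
    """One listener iteration: receive if possible, otherwise flush outgoing."""
    message = transport.poll()
    with lock:  # the lock guards both queues shared with the scheduler thread
        if message is not None:
            incoming.append(message)
            return "received"
        pending = list(outgoing)
        outgoing.clear()
    for item in pending:  # send outside the lock so the scheduler is not blocked
        transport.send(item)
    return "sent" if pending else "idle"


def run_listener(transport, incoming, outgoing, lock, stop):
    """Thread body: repeat listener passes until the scheduler sets `stop`."""
    while not stop.is_set():
        if listener_pass(transport, incoming, outgoing, lock) == "idle":
            time.sleep(0.001)  # avoid spinning when there is nothing to do
```

The scheduler thread would take the same lock when it drains the incoming queue or appends to the outgoing queue, and would start `run_listener` in a `threading.Thread` with a `threading.Event` used as the stop flag.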
Experiments and Results
A series of experiments have been performed in order to evaluate the performance of PARBREEZE. As a testbed, the SAMIPS asynchronous processor has been used. SAMIPS is built around a five-stage pipeline datapath (figure 12), namely Instruction Fetch (IF), Decode/Register File Read (ID), Execution or Address Calculation (EX), Memory Access (MEM) and Register Write-back (WB). The datapath includes an ALU, a Shifter, a Multiplier/Divider, an Address Adder, and a PC incrementer. SAMIPS has been modelled as a hierarchy of concurrent processes with approximately 900 lines of Balsa code, with the corresponding handshake circuit consisting of approximately 2300 handshake components and 3600 channels. SAMIPS executes MIPS machine language instructions produced by a MIPS cross-compiler and loaded during the initialisation phase as 32-bit quantities in hexadecimal format from an image file. The well established Dhrystone benchmark has been used for the experiments [23] . The execution platform is a cluster machine with dual-processor Intel Xeon 3GHz nodes and 2 GBytes of memory, interconnected via a Myrinet switch.
Partitioning Strategies
Two main partitioning strategies have been used: manual partitioning and automatic partitioning with the well established graph partitioning system METIS [13]. The manual partitioning follows the pipeline stages of SAMIPS, using from two up to five LPs.
METIS partitions a graph following a multilevel recursive bisection algorithm [14] , where the vertices of the graph represent the set of the handshake components while the edges are the communication channels.
Four different partitioning strategies supported by METIS have been used (the first three require the execution of the simulation once, to collect the necessary information). Figure 13 shows the performance achieved by PARBREEZE for the different partitioning algorithms. The results show that the distribution of PARBREEZE can achieve a maximum speedup of 1.4 (26.8% reduction in execution time, see table 1) and that the choice of the partitioning strategy can have a significant impact on the execution efficiency of the distributed simulator.
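The two figures quoted above are related by simple arithmetic, which the following Python sketch makes explicit. It assumes that both the speedup and the reduction are measured against the same baseline run; the function names are hypothetical.

```python
def speedup(t_baseline, t_distributed):
    """Speedup of the distributed run over the baseline run."""
    return t_baseline / t_distributed


def reduction_to_speedup(reduction):
    """Convert a fractional reduction in execution time into a speedup."""
    return 1.0 / (1.0 - reduction)


def speedup_to_reduction(s):
    """Convert a speedup into the fractional reduction in execution time."""
    return 1.0 - 1.0 / s
```

A 26.8% reduction corresponds to a speedup of 1 / (1 - 0.268), roughly 1.37, which rounds to the quoted 1.4.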
Execution Efficiency and Analysis
Understanding the factors that affect the performance of the simulator and the relationship amongst these factors is crucial for the selection of the appropriate partitioning algorithm.
As a first step in this endeavour, we have quantified the available parallelism in the model that can potentially be exploited. We define the degree of parallelism in the model as the average number of concurrent events in the system that can be executed at the same time step. The degree of parallelism of SAMIPS is presented in figure 14.
Figure 15 shows how well the partitioning strategies have exploited the available parallelism. The parallelism balance factor is defined as the (average) ratio of the maximum number of concurrent events in a particular time step assigned to a single LP (and therefore executed sequentially) over the number of events that the optimal, even distribution would have assigned to each LP.
Figure 16 illustrates the quality of the workload balancing in terms of the workload balance factor, namely the number of events executed by the busiest LP divided by the optimum number of events. The latter is defined as the total number of events in the whole graph divided by the number of LPs being used.
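The three metrics defined above can be computed from an event trace as in the following Python sketch. The trace format, a list of (time step, LP) pairs with one entry per executed event, is an assumption made for illustration, as are the function names.

```python
from collections import Counter, defaultdict


def degree_of_parallelism(events):
    """Average number of events per time step."""
    per_step = Counter(step for step, _ in events)
    return len(events) / len(per_step)


def workload_balance_factor(events, n_lps):
    """Events executed by the busiest LP over the ideal, even share."""
    per_lp = Counter(lp for _, lp in events)
    return max(per_lp.values()) / (len(events) / n_lps)


def parallelism_balance_factor(events, n_lps):
    """Average over time steps of (busiest LP's events) / (even share)."""
    per_step = defaultdict(Counter)
    for step, lp in events:
        per_step[step][lp] += 1
    ratios = [
        max(counts.values()) / (sum(counts.values()) / n_lps)
        for counts in per_step.values()
    ]
    return sum(ratios) / len(ratios)
```

With these definitions a value of 1.0 for either balance factor indicates a perfectly even distribution, while a value equal to the number of LPs means that a single LP executes everything.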
The results in figures 15 and 16 indicate that the partitioning strategies that achieve the best exploitation of parallelism and the best workload balance are those that include the weighted vertices as one of their parameters, while the worst is the one that tries to minimise communication. Figure 17 shows the total communication cost, in terms of the total number of MPI messages in the overall system. Clearly, as the number of LPs increases, so does the total number of MPI messages that need to be exchanged between them. As expected, the partitioning that results in the lowest communication cost is the one that minimises the edge cuts. Figure 18 shows the average idle times over all LPs for the different partitioning algorithms. The partitioning strategy which results in the lowest idle times is the one which minimises communication (which is also the one that achieves the maximum speedup in figure 13). This is explained by the fact that, since there is a reduced number of messages in the system, the LPs spend less time waiting for remote messages.
From figure 13 it is clear that the reduction in communication yields the highest speedup even though the workload is not well balanced. This is mainly due to the fact that the simulation model is a communication bound system, rendering the communication overhead (figure 20) the determining factor of the performance. Indeed, the cost of send-
Speedup and MPI Overhead
Despite the significant speedup achieved by the distribution of PARBREEZE, PARBREEZE has not managed to beat the performance of the sequential breeze-sim simulator. This is mainly due to the computation overhead that MPICH-GM has introduced in the system, which has increased the execution time of the simulator on one node by 106% (figure 13). To investigate the cause of this computation overhead, a different MPI implementation, MPICH, has been tested. The results achieved are illustrated in figure 19. Clearly, as PARBREEZE is distributed onto more nodes, the execution times increase. This should be expected, as MPICH runs over gigabit Ethernet and its communication efficiency is very poor compared to MPICH-GM (figure 20). Hence the communication cost becomes the dominant factor of the simulation, as explained in the previous section. However, the computation overhead introduced by MPICH in PARBREEZE on one node is almost zero, compared to the 106% introduced by MPICH-GM. This additional overhead of MPICH-GM can be attributed to the way it performs memory (malloc) and thread management. Indeed, a simple program which invoked malloc() took 5.3 secs to execute on a single node when compiled with gcc and MPICH, while with MPICH-GM it took 8.0 secs. Compiling the program using the -pthread parameter for POSIX threads raised the
Summary and Future Work
Asynchronous logic is progressively finding its place in mainstream VLSI design, not least in the development of GALS (Globally Asynchronous Locally Synchronous) systems. As a result, there will be an increasing demand for appropriate, efficient simulation techniques. This paper has presented a framework for the distributed simulation of asynchronous handshake circuits generated by the Balsa system. This work has shown that significant speedup can be achieved by the utilisation of distributed simulation. Our investigation has identified both the partitioning algorithm and the efficiency of the communication software (MPI) employed as significant factors for the performance of the simulator. Further work will investigate the cause of the computation overhead of MPICH-GM and will perform a more detailed analysis of the performance of the simulation, using additional benchmarks (such as the SPA processor) and partitioning algorithms (such as the JOSTLE system [22]). Synchronisation issues will also be investigated.
