This paper addresses the design of complex arbitration modules, like those required in SoC communication systems. Clock-less, delay-insensitive arbiters are studied with a view to making the design of future GALS or GALA SoCs easier and more practical. The paper focuses on high-level modeling and delay-insensitive implementations of fixed- and dynamic-priority arbiters. Pre-layout simulations show that arbiters able to process several hundred million requests per second can be designed using the 0.18 µm CMOS process of STMicroelectronics.
Introduction
One of the critical components of a SoC is the communication system, commonly called the on-chip bus. Such an on-chip communication system has to be very flexible in order to interface in-house and external virtual components, while providing high bandwidth, low latency, low power, arbitration mechanisms and routing capabilities.
In a SoC the on-chip bus connects the components to each other and dynamically allocates a path from one block to another. Several blocks running concurrently may need to access the same resource, which leads to contention. In this case, an arbiter is needed to resolve the conflicts and to ensure that only one block accesses the resource at a time. The choice is made according to the priorities assigned to each request.
Several arbitration algorithms were proposed in the past to solve the problem of accessing a unique resource from an arbitrary number of blocks. These algorithms can be classified according to the characteristics of their corresponding hardware implementation. To mention a few, arbitration structures can be distributed or centralized, can be linear like daisy-chain arbiters, or ring-based like token-ring and round-robin arbiters.
Most on-chip communication systems and the arbitration modules they include are today designed with synchronous circuits. In this paper, delay-insensitive asynchronous arbiters are considered, to be part of future on-chip busses of GALA (globally asynchronous locally asynchronous) or GALS (globally asynchronous locally synchronous) SoCs.
From the SoC design perspective, delay-insensitive arbiters have the main advantage of being one hundred percent reliable (enough time is always given to resolve metastability).
Today, the reliability of on-chip communication systems is becoming a major issue, since increasing transaction rates drastically reduce the so-called Mean Time Between Failures (MTBF) that characterizes clocked synchronizers.
As far as power consumption is concerned, such event-driven communication/arbitration structures have a minimal electrical activity. Indeed, unlike clocked circuits, power consumption of delay-insensitive asynchronous arbiters is proportional to access rates [6].
Furthermore, delay insensitivity enables the design of fast "long distance" communication busses [5].
Last but not least, such delay-insensitive communication systems are fully autonomous blocks, which can easily be reused in complex SoC architectures as soft, firm or hard virtual components, hence decreasing design time and complexity.
Based on these motivations, the paper contributes to two fundamental issues: arbitration algorithms are modeled using a high-level description language called CHP (Communicating Hardware Processes), and the delay-insensitive arbiter architectures are derived from these CHP specifications.
The paper is organized as follows. Section 2 describes the global design flow from algorithm modeling to circuit synthesis and electrical simulation. Sections 3 and 4 present both the modeling and the circuit design of fixed-priority and dynamic-priority arbiters, respectively. Section 5 reports pre-layout simulation results. The last section concludes this work and mentions the main prospects.
Design flow
An overview of the global design methodology used to synthesize asynchronous circuits is presented in figure 1. The flow starts from a high-level model written in the CHP language (Communicating Hardware Processes). The CHP language, initially proposed by Martin [2], is naturally adopted in this work because i) it includes the nondeterministic choice structures required to model arbitration, and ii) it is very well suited to modeling and synthesizing delay-insensitive circuits [2] [3]. In [4], some new features were added to the CHP language to satisfy simulation requirements.
Delay-insensitive gate-level implementations are derived from the CHP specifications following a method that is not yet fully automated. The schematics are nevertheless obtained through a formal procedure, based on the methodology presented in [2], that is beyond the scope of this paper and will be detailed in future communications.
The design flow used is similar to the one presented in [4]. Gate-level implementations of CHP programs are all described using VHDL gate netlists that are verified by back-annotatable logic simulations with timing. Two kinds of libraries are targeted: a standard-cell library and a specific asynchronous-cell library including Muller gates. For each asynchronous cell, a VHDL functional view (including timing information), a schematic view and a layout view were developed. Then, circuit netlists are imported into the Cadence™ framework for electrical simulations, placement and routing. The technology used for this study is the 0.18 µm CMOS process from STMicroelectronics.
All implementations use a four-phase handshaking protocol. Single-rail input channels are used for the fixed-priority arbiter, whereas dual-rail input channels are used for the dynamic-priority arbiter. The request signal associated with the single-rail channel R is denoted R_i. Signals R_i_0 and R_i_1 denote the two rails of the dual-rail input channel R. Signal R_o denotes the acknowledge signal of input channel R. Finally, the shared resource is assumed to be accessed through a single-rail channel S.
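The signalling conventions above can be illustrated with a small behavioural sketch in Python. It is not taken from the paper's netlists: the function names are hypothetical, and the mapping of rail R_i_0 to priority value 0 and rail R_i_1 to priority value 1 is an assumption (the usual one-hot dual-rail convention), as is the return-to-zero ordering of the four phases.

```python
def decode_dual_rail(rail0, rail1):
    """Decode one dual-rail channel (R_i_0, R_i_1).

    Assumed encoding: (0, 0) = no request, (1, 0) = request carrying value 0,
    (0, 1) = request carrying value 1. Both rails high is not a valid code.
    """
    if rail0 and rail1:
        raise ValueError("both rails high: invalid dual-rail code")
    if rail0:
        return 0
    if rail1:
        return 1
    return None  # channel idle


def four_phase_cycle(value):
    """List the (R_i_0, R_i_1, R_o) states of one four-phase handshake."""
    r0, r1 = (1, 0) if value == 0 else (0, 1)
    return [
        (r0, r1, 0),  # phase 1: sender raises exactly one rail (request)
        (r0, r1, 1),  # phase 2: receiver raises the acknowledge R_o
        (0, 0, 1),    # phase 3: sender returns both rails to zero
        (0, 0, 0),    # phase 4: receiver lowers the acknowledge
    ]
```

For a single-rail channel the same four phases apply, with the single request wire R_i in place of the two rails.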
In the schematics, gates marked with a C are Muller C-elements. An R (respectively S) is added when the gate output needs to be set to zero (respectively one) at reset. A "+" is added to specify asymmetric behavior: an input marked with a + only contributes to driving the output to one.
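As a reading aid for the schematics, the gate conventions above can be modeled behaviourally in a few lines of Python. This is a sketch, not one of the library cells: the class name and method signature are hypothetical, and at least one ordinary (unmarked) input is assumed.

```python
class CElement:
    """Behavioural model of a Muller C-element with optional '+' inputs."""

    def __init__(self, reset_value=0):
        # Gates marked R reset to 0, gates marked S reset to 1.
        self.out = reset_value

    def update(self, inputs, plus_inputs=()):
        """Apply new input levels and return the resulting output."""
        if all(inputs) and all(plus_inputs):
            # Rising transition: every input, ordinary or '+', must be 1.
            self.out = 1
        elif not any(inputs):
            # Falling transition: only ordinary inputs matter; '+' inputs
            # take no part in driving the output to zero.
            self.out = 0
        # In every other case the output keeps its previous value.
        return self.out
```

With no '+' inputs the model reduces to the symmetric C-element: the output changes only when all inputs agree.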
Fixed-priority arbiter
Fixed-priority arbiters have to choose among input requests with predefined hardware-coded priority values. This section considers several arbitration algorithms: a daisy chain, a particular binary-tree structure and a recent priority-arbiter architecture proposed in [1].
Fixed-priority arbiter modeling

Daisy-chain arbiter
The CHP program of figure 2 describes a 4-input daisy-chain arbiter implementing a sequential priority scheme. The corresponding design is shown in figure 5. The circuit is by default in sleep mode, and wakes up each time an activity is detected on at least one of the input requests. The probe CHP operator (denoted #) is used to watch activity on inputs and trigger processing (figure 2, line 1). Priority is simply modeled by sequentially testing the requests. The highest-priority input R3 is analyzed first. The guard #R3 in figure 2 (line 2) tests whether it is active. When it is, the shared resource S is attributed to R3 by writing to S using the CHP write statement S!. Concurrently ("," CHP notation), request R3 is granted using the CHP read statement denoted R3?. The channel R3 is also tested for inactivity by the guard ¬#R3 in figure 2 (line 3). In this case, the next channel R2 is analyzed the same way, and so on for R1 and R0.
Note that the stability of guards #R3 and ¬#R3 is not guaranteed [2]. In fact, the level of request signal R3 may change while it is being evaluated. In this case, a nondeterministic choice denoted "@@" is used in CHP. At the hardware level, it corresponds to using a synchronizer in charge of resolving the metastability that may occur while deciding whether R3 is zero or one.
Figure 2. Daisy-chain arbiter CHP model
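The sequential priority scheme of the daisy chain can be mirrored by a purely functional Python sketch. It only captures which input is granted; handshaking, concurrency and metastability are ignored, and the function name and list-based interface are hypothetical.

```python
def daisy_chain_arbiter(requests):
    """Return the index of the granted input, or None if the arbiter sleeps.

    requests[i] is truthy when input Ri is requesting; the highest index
    has the highest priority, as in the 4-input example (R3 first).
    """
    if not any(requests):
        return None  # no activity probed: the arbiter stays in sleep mode
    # Test the inputs one after the other, from highest priority down to R1.
    for i in range(len(requests) - 1, 0, -1):
        if requests[i]:
            return i
    # Nothing above R0 was active, so R0 must be the input that woke us up.
    return 0
```

The loop makes the cost of the scheme visible: a request on R0 is only reached after every higher-priority input has been tested.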
Binary-tree arbiter
Another interesting arbitration structure, a binary-tree arbiter, is described in figure 3, and the corresponding gate-level implementation is presented in figure 6. The program clearly shows the tree structure of the selection process.
Figure 3. Tree arbiter CHP model
Here again, line one of the program is used to sense input activities. Part one and part two of the program (respectively labeled 1 and 2 in figure 3) concurrently resolve the R3/R2 and R1/R0 contentions in the first stage. Priority of R3 over R2 (respectively R1 over R0) is modeled by a two-stage linear structure that first checks R3 (respectively R1) and then R2 (respectively R0). Then, the CHP sequential operator ";" is used to specify that the concurrent parts of the first stage have to complete before stage 2 can proceed (label 3 in figure 3). In this last stage, deterministic choices ("@" notation) are used since variables s0 and s1 are stable. According to the values of s0 and s1, the winning input is granted and the shared resource is concurrently accessed.
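The two-stage selection can be sketched functionally in Python as follows. The encoding of s0 and s1 (here, the index of the winning input or None) and the rule that the R3/R2 pair wins over the R1/R0 pair in the second stage are assumptions made for illustration; the names are hypothetical.

```python
def pair_winner(high, low, requests):
    """First stage: check the higher-priority input of a pair, then the other."""
    if requests[high]:
        return high
    if requests[low]:
        return low
    return None  # neither input of the pair is requesting


def tree_arbiter(requests):
    """Return the granted input of a 4-input binary-tree arbiter, or None."""
    if not any(requests):
        return None  # no activity sensed
    # Stage 1: both pairs are resolved independently (concurrently in hardware).
    s1 = pair_winner(3, 2, requests)
    s0 = pair_winner(1, 0, requests)
    # Stage 2: s0 and s1 are now stable, so the choice is deterministic.
    return s1 if s1 is not None else s0
```

Unlike the daisy chain, every input is at most two tests away from a decision, which is consistent with the more balanced best and worst cases.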
Parallel-request-sampler priority-arbiter
Recently, Bystrov, Kinniment and Yakovlev introduced in [1] an enhanced version of priority arbiters that decouples request-signal sampling (synchronizer module) from contention resolution (priority module). This structure outperforms previously proposed structures in terms of complexity and latency. Moreover, unlike the daisy-chain and tree arbiters, this new arbiter structure is strongly modular, and priorities can be modified by redesigning only the priority module. However, in [1] the design of this priority arbiter family was very intuitive and performed by hand. Figure 4 gives a formal CHP specification of a 4-way parallel-request-sampler fixed-priority arbiter (request-sampler FPA) like those proposed in [1]. Line 1 is used to sense request activities. Then, four identical subparts model the concurrent sampling of the request signals (labeled synchronizer 3 to 0). Variables s0 through s3 are used to store the samples, which are then exploited by the fixed-priority module to determine which input has to be elected (labeled fixed-priority module).
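The decoupling of sampling from priority resolution can be illustrated with the following Python sketch, in which the priority module is an interchangeable function. This is an illustration of the modularity argument only, not the CHP program of figure 4; all names, and the default priority order R3 > R2 > R1 > R0, are assumptions.

```python
def sample_requests(requests):
    """Synchronizer module: take one snapshot (s0..s3) of all request lines."""
    return tuple(bool(r) for r in requests)


def fixed_priority(samples, order=(3, 2, 1, 0)):
    """Priority module: elect the first active sample in the given order."""
    for i in order:
        if samples[i]:
            return i
    return None


def request_sampler_fpa(requests, priority_module=fixed_priority):
    """Return the elected input, or None when no request is active."""
    if not any(requests):
        return None  # nothing sensed: stay idle
    samples = sample_requests(requests)  # all inputs sampled in one step
    return priority_module(samples)      # decision taken on stable samples
```

Changing the priority policy only requires passing a different `priority_module`; the sampling stage is untouched.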
Fixed-priority arbiter design
Daisy-chain arbiter
The structure of the 4-way daisy-chain arbiter is described in figure 5. It is composed of three blocks. The loop control block is in charge of reactivating the arbiter after an input request has been served and the shared resource accessed. This block includes the trigger block, which senses input activities to keep the arbiter quiet as long as no request occurs. The second block consists of a cascade of three synchronizers [2] that sequentially sample the input requests. The third block, the fixed-priority module, determines the input request to grant according to a priority order. The highest-priority input R3 is sampled first and served if active. Otherwise, input R2 is analyzed the same way, and likewise for R1. Note that no synchronizer is needed for R0: if no other input request is valid, input R0 is necessarily the one that activated the arbiter. The duplicator module controls the shared resource S.
Binary-tree arbiter
The binary-tree arbiter structure is presented in figure 6. Two pairs of cascaded synchronizers concurrently sample and analyze the input requests according to a priority order: R3 has priority over R2 (respectively R1 over R0). Then a delay-insensitive logic block implements a deterministic comparison in order to select the input to serve.
Dynamic-priority arbiter
In a dynamic-priority arbiter, each input channel carries requests as well as priority values. The dynamic-priority arbiter performs priority comparisons of active requests and grants the one with the highest value. By convention, the input having the highest index is selected in case of identical priority levels.
Given the performance of the enhanced priority arbiters presented in [1], the parallel-request-sampler structure is implemented in this section with a dynamic-priority resolution algorithm. The time overhead needed to resolve dynamic contention is compared in Section 5 with the fixed-priority version presented in figure 4.
Figure 9 presents the program of a 4-way parallel-request-sampler dynamic-priority arbiter (request-sampler DPA). Each input Ri is a dual-rail channel, encoding the request signal together with a priority value ranging from 0 (lower) to 1 (higher). Only two priority levels are considered for the sake of simplicity. This can of course easily be extended and adapted to the requirements of real applications.
The first stage is very similar to that of the fixed-priority version of the arbiter (subparts labeled synchronizer 3 to 0). The priority module is much more complicated because it has to perform priority-level comparisons of competing inputs. The program proposed in figure 9 describes a two-stage comparator structure, but any other structure could be modeled, leading to different tradeoffs in terms of complexity, speed and power. The first stage concurrently analyzes pairs of inputs and, if necessary, concurrently performs priority comparisons. Variable v1_0 indicates that no request occurred (value 0), that request R0 occurred (value 1) or that request R1 occurred (value 2). The request with the highest priority is passed to the following stage. Variable v3_2 plays the same role for inputs R3 and R2. Variables v1_0 and v3_2 cannot both be zero since priority comparators are triggered by request activity. If only one request comes out of stage one, the corresponding input is immediately granted (label "1st stage acknowledgment" in figure 9). If two requests come out of stage one, then another comparison has to take place.
A multiplexer identifies the two requests in competition and sends their priority values to the second-stage comparator (subpart labeled "2nd stage: 2-way priority comparator" in figure 9). Finally, the outcome of the comparison, together with the pair of inputs in contention, is used to determine which input is the winner (subpart labeled "2nd stage acknowledgment" in figure 9). Figure 10 presents the architecture of the 4-way request-sampler DPA specified by the CHP program of figure 9.
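The two-stage resolution described above can be sketched functionally in Python. The sketch assumes that an idle input is represented by None and a requesting input by its priority value (0 or 1), that v3_2 is encoded like v1_0 (0 = no request, 1 = R2, 2 = R3), and that ties in both stages go to the highest index, following the convention stated earlier. Function names are hypothetical.

```python
def stage_one(prio_low, prio_high):
    """2-way request analyzer and priority comparator.

    Returns 0 (no request), 1 (lower-index input passes) or
    2 (higher-index input passes).
    """
    if prio_low is None and prio_high is None:
        return 0
    if prio_high is None:
        return 1
    if prio_low is None:
        return 2
    # Both inputs request: compare priorities, highest index wins ties.
    return 2 if prio_high >= prio_low else 1


def dynamic_priority_arbiter(inputs):
    """inputs[i] is None (idle) or the priority carried by request Ri."""
    if all(p is None for p in inputs):
        return None
    v1_0 = stage_one(inputs[0], inputs[1])
    v3_2 = stage_one(inputs[2], inputs[3])
    cand_low = v1_0 - 1 if v1_0 else None   # index 0 or 1
    cand_high = v3_2 + 1 if v3_2 else None  # index 2 or 3
    # First-stage acknowledgment: a single survivor is granted at once.
    if cand_high is None:
        return cand_low
    if cand_low is None:
        return cand_high
    # Second stage: compare the two survivors, highest index wins ties.
    return cand_high if inputs[cand_high] >= inputs[cand_low] else cand_low
```

For instance, with R1 requesting at priority 1 and R2 at priority 0, each passes its own first-stage comparator and the second stage elects R1.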
Dynamic-priority arbiter modeling
Figure 9. Request-sampler DPA CHP model (excerpt)
*[ #(R0 ∨ R1 ∨ R2 ∨ R3)   -- loop control
   → [ [#R3 …
Dynamic-priority arbiter design
The two 2-way request analyzer & priority comparator blocks concurrently compare two pairs of requests. Each of them delivers the request holding the priority. Details of the gate-level structures are given in figure 11. It can be seen that the 2-way priority comparator is only activated when both input requests are active. The first acknowledgment stage detects whether a unique request is active at the output of the two 2-way request analyzers and priority comparators. In that favorable case, the corresponding input is immediately granted and the shared resource accessed, which avoids unnecessary power consumption and latency.
The logic implementation is described in figure 12. The upper part implements the favorable cases in which a unique request passed through the first stage of the priority module. When more than one request passed through, the two multiplexers shown at the bottom of figure 12 propagate the priority levels to the final two-way priority comparator included in the second stage. The outcome of the comparison is used by the second acknowledgment stage to grant the selected input request (figure 13 gives the gate-level implementation of the "second acknowledgment" stage).
Simulation results
Table 1 summarizes latency and cycle time values obtained by electrical simulations of the gate-level netlists. The arbiters were simulated with two input sets to measure best-case and worst-case latencies.
For the daisy-chain arbiter, the best-case latency is obtained when R3 requests, and the worst case when R0 is the only one of the four inputs that requests. For the tree arbiter, the best case corresponds to R1 active only and the worst case to R2 and R0 active. The parallel-request-sampler FPA best case occurs when R0 is the unique active request. The worst case occurs when only R1 is active. The dynamic-priority arbiter best-case and worst-case latencies were explained in Section 4.
Fixed-priority arbiters
The daisy-chain arbiter has the lowest best-case latency of all the structures. However, the significant decrease in performance in the worst case confirms that the sequential sampling of the input requests strongly penalizes this structure. Furthermore, including wire delays in the simulations would heavily increase the latency. Such a linear architecture is not sufficient to cover the range of real applications, in which the number of peripherals is higher and in which the relative frequencies of the requesters have to be considered.
The binary-tree and the parallel-request-sampler arbiters present more balanced structures, which exhibit similar delays for the best and worst cases. When the number of inputs n increases, the number of synchronizers increases linearly, whereas the complexity of the fixed-priority logic grows as O(log2 n).
The advantages of the request-sampler FPA over the tree arbiter are: a simpler fixed-priority logic block and a one-level stage request sampler.
Dynamic-priority arbiter
The dynamic-priority arbiter can sustain a request rate of about 405 MHz in the best case and about 220 MHz in the worst case. The modularity of the dynamic-priority arbiter enables an easy reconfiguration of the priority algorithm. Several priority levels offer a high degree of freedom for system-level designs.
Optimizations
Using more aggressive optimizations based on fast handshaking components [7] would increase the speed. Another way to reduce the arbiter latency is to properly assign the arbiter inputs to requesters according to their respective priorities and request frequencies. For example, highest-priority requests should be connected to lowest-latency arbiter inputs in order to answer them as quickly as possible. Another strategy would be to connect lowest-latency arbiter inputs to highest-frequency requests in order to optimize average speed and power consumption simultaneously.
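The second strategy (fastest inputs to the most frequent requesters) can be sketched as a simple greedy pairing in Python. The function names, the dictionary-based interface and any numbers fed to it are hypothetical; contention between simultaneous requests is ignored, so the figure of merit is only a rate-weighted average of per-input latencies.

```python
def assign_inputs(request_rates, input_latencies):
    """Pair the most frequent requesters with the lowest-latency inputs.

    request_rates:   {requester name: requests per second}
    input_latencies: {arbiter input name: latency of that input}
    Returns {requester name: arbiter input name}.
    """
    by_rate = sorted(request_rates, key=request_rates.get, reverse=True)
    by_latency = sorted(input_latencies, key=input_latencies.get)
    return dict(zip(by_rate, by_latency))


def mean_latency(assignment, request_rates, input_latencies):
    """Rate-weighted average latency seen by the requesters."""
    total_rate = sum(request_rates[r] for r in assignment)
    weighted = sum(
        request_rates[r] * input_latencies[assignment[r]] for r in assignment
    )
    return weighted / total_rate
```

Sorting one list in decreasing order and the other in increasing order minimizes the weighted sum of products, which is why the greedy pairing is sufficient for this simplified cost.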
Conclusion
In this paper, it is shown that arbiters can be modeled using the high-level language CHP and that their corresponding delay-insensitive implementations can be derived from these models. Pre-layout electrical simulations report that delay-insensitive priority arbiters processing several hundred million requests per second can be designed using an up-to-date CMOS technology.
This work shows that it is today possible to cleanly and formally model and design delay-insensitive arbiter modules that are reliable, modular and fast. It also defines the fundamentals of an automated synthesis process devoted to arbiters. Finally, it constitutes an enabling factor for asynchronous technology to be increasingly adopted in the design of SoCs.
Future work will focus on the automation of the synthesis process and on the improvement of arbiter architectures and circuit performance. "n to p" fixed- or dynamic-priority routers will also be investigated to address the design of complex on-chip routing systems.
