Introduction
Although asynchronous circuits can outperform synchronous ICs in many application domains such as security and automotive [11], the design of integrated circuits remains essentially limited to synchronous chips. One reason explains this: the EDA industry has not proposed any CAD suite that provides a useful and tractable design framework for asynchronous circuits. However, some academic tools have been developed or are under development [1, 2, 3]. Among them, TAST [3] is dedicated to the design of micropipeline (µP) and Quasi Delay Insensitive (QDI) circuits [11]. Its main characteristic is that it targets a standard cell approach. Unfortunately, typical libraries (dedicated to synchronous circuit design) rarely include basic asynchronous primitives such as C-elements. Consequently, a designer of QDI asynchronous ICs who adopts a standard cell approach must implement the required Boolean functions using AO222 gates [1, 9]. This results in suboptimal physical implementations, as illustrated in figure (1), which gives evidence of the power and area savings that can be obtained by developing a library dedicated to the design of asynchronous circuits.
Within this context, we developed TAL_130nm (TIMA Asynchronous Library), a standard cell library dedicated to the design of QDI asynchronous circuits. This paper introduces the methods we used and the choices we made to design TAL. It is organized as follows. Section II introduces the structural specificities of QDI gates and the sizing strategy adopted for the library. In section III, we deduce from the first order delay models of both static and ratioed CMOS structures two sizing criteria that reduce the area cost of any QDI gate while maintaining its throughput. Finally, section IV reports the performance of the gates designed following our sizing strategy and compares them to gates implemented using basic AO222 gates borrowed from a standard synchronous library.
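To make the AO222 workaround concrete, the sketch below gives a zero-delay functional model of a 2-input Muller C-element built from a single AO222 gate whose output is fed back onto two of its inputs, so that out = a·b + a·out + b·out (the majority of a, b and the previous output). This is a standard construction, shown here only for illustration; the function names are hypothetical, and the model ignores timing, hazards and the transistor-level cost that motivates a dedicated library.

```python
def ao222(a1: int, a2: int, b1: int, b2: int, c1: int, c2: int) -> int:
    """AO222: OR of three 2-input ANDs."""
    return (a1 & a2) | (b1 & b2) | (c1 & c2)


def c_element_ao222(a: int, b: int, prev: int) -> int:
    """2-input C-element from one AO222 with output feedback:
    out = a.b + a.prev + b.prev (majority of a, b and the previous output)."""
    return ao222(a, b, a, prev, b, prev)


def run(seq, init: int = 0) -> list:
    """Apply a sequence of (a, b) input pairs and record the output after each one."""
    out = init
    trace = []
    for a, b in seq:
        out = c_element_ao222(a, b, out)
        trace.append(out)
    return trace


if __name__ == "__main__":
    # Output rises only when both inputs are high, falls only when both are low.
    print(run([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]))  # [0, 0, 1, 1, 0]
```

The output changes only when both inputs agree and otherwise holds its previous value, which is exactly the state-holding behaviour a native C-element cell provides directly.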
NB: the notations used throughout the paper are defined in table 1.
QDI Element Specificities and Library Sizing Strategy

QDI Element Specificities
Depending on the desired robustness to process, voltage and temperature variations, handshake technology offers a large variety of asynchronous circuit styles and communication protocols. Our aim here is not to give an exhaustive list of all possible alternatives, but to introduce the main specificities of the primitives required to design 4-phase QDI circuits. For such circuits, a data transfer through a channel starts with the emission of a request signal encoded into the data and ends with the emission of an acknowledge signal. During this time interval, whose duration is a priori unknown, the incoming data must be held in order to guarantee the quasi-delay-insensitivity property. This implies the intensive use of logic gates that include a state-holding element (usually a latch) or a feedback loop. As we target a CMOS implementation, it follows that most of the required primitives are composite or complex positive gates. Indeed, they can be decomposed into one or more simple dynamic logic gates and a state-holding element. Fig. 1 gives possible decompositions of a 3-input Muller gate and of a COR222 gate, both widely used to implement basic functions such as "And", "Or" and "Xor" in multi-rail design styles.
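The decomposition of a composite gate into a logic stage plus a state-holding element can be sketched functionally. The model below, a minimal sketch under stated assumptions, treats a 3-input Muller gate as a set condition (the series N array conducts when all inputs are high), a reset condition (the series P array conducts when all inputs are low) and a hypothetical `StateHolder` that keeps its value otherwise. The internal inverted node and the output inverter of a real CMOS realization are abstracted away, and the class and function names are illustrative only.

```python
class StateHolder:
    """Hypothetical model of a state-holding element (e.g. a latch or keeper)."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def update(self, set_: bool, reset: bool) -> int:
        # In a well-formed composite gate, set and reset never fire together.
        assert not (set_ and reset), "set and reset conditions overlap"
        if set_:
            self.value = 1
        elif reset:
            self.value = 0
        # Otherwise the previous value is held.
        return self.value


def muller3_set(a: int, b: int, c: int) -> bool:
    # Condition under which the series N array conducts: all inputs high.
    return bool(a and b and c)


def muller3_reset(a: int, b: int, c: int) -> bool:
    # Condition under which the series P array conducts: all inputs low.
    return not (a or b or c)


class Muller3:
    """3-input Muller gate = set/reset logic stage + state-holding element."""

    def __init__(self) -> None:
        self.holder = StateHolder(0)

    def step(self, a: int, b: int, c: int) -> int:
        return self.holder.update(muller3_set(a, b, c), muller3_reset(a, b, c))


if __name__ == "__main__":
    gate = Muller3()
    # Inputs arrive one by one (active phase), then return to zero.
    seq = [(1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1), (0, 0, 1), (0, 0, 0)]
    print([gate.step(*v) for v in seq])  # [0, 0, 1, 1, 1, 0]
```

The trace shows why the state-holding element is needed: between the set and reset conditions, neither array conducts, and the output must keep the value reached during the active phase until every input has returned to zero.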
Library Sizing Strategy
Due to their composite structure, different sizing strategies can be applied to the library cells. The one we adopted is based on the following five design rules:
1. Balance, at first order, the amplitudes of the currents flowing through the N and P arrays in order to balance the active and return-to-zero (RTZ) phases.
2. Design at least drives X0, X1, X2 and X4 for each function in order to accommodate a large range of loads (many gates have been designed in drives X0, X1, X2, X4, X8 and X12).
3. Design each drive so that, independently of the logic function, its output driver has the same current capability as the equivalent inverter. For example, the last stage of the logic decomposition of the 3-input Muller gate (M3) of drive Xj is sized to deliver the same switching current as the inverter of drive Xj.
4. Minimize the area by designing each cell to accommodate both light and heavy loads in two functional stages. This means that only the last two stages of the COR222 decomposition are sized to accommodate the output load, the preceding stage being designed for minimum area cost. This is equivalent to targeting implementations with low input capacitance values. Such a strategy should allow weak drives to be used as often as possible without significantly compromising speed.
5. Avoid, whenever possible, logic decompositions in which the state-holding element drives the output node. In figure (1f), the placement of the output inverter and the
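The rules that balance the N and P currents and match each output stage to the equivalent inverter can be illustrated with a first-order model in which a stack of k identical series transistors of width W behaves like a single transistor of width W/k. The sketch below is a minimal illustration under that assumption; the mobility values `MU_N` and `MU_P`, the unit widths and all function names are hypothetical, and the model ignores second-order effects such as velocity saturation, so it is not the sizing procedure actually used for TAL.

```python
from dataclasses import dataclass

# Assumed relative carrier mobilities (hypothetical, process-dependent values).
MU_N = 2.0
MU_P = 1.0


@dataclass
class StageSize:
    wn: float  # width of each NMOS transistor in the N array
    wp: float  # width of each PMOS transistor in the P array


def stack_current(width: float, depth: int, mobility: float) -> float:
    """First-order current of `depth` identical series transistors (L normalized to 1):
    the stack behaves like a single transistor of width width / depth."""
    return mobility * width / depth


def inverter_size(wn_inv: float) -> StageSize:
    """Reference inverter of a given drive, with N and P currents balanced."""
    return StageSize(wn=wn_inv, wp=wn_inv * MU_N / MU_P)


def stage_size(kn: int, kp: int, ref: StageSize) -> StageSize:
    """Size a stage whose N (P) array is a stack of kn (kp) series transistors
    so that each array delivers the same current as the reference inverter.
    Because the reference is balanced, the resulting stage is balanced too."""
    return StageSize(wn=kn * ref.wn, wp=kp * ref.wp)


if __name__ == "__main__":
    inv = inverter_size(1.0)        # hypothetical unit-width inverter
    stage = stage_size(3, 3, inv)   # hypothetical stage with 3-transistor stacks
    print(stage)                    # StageSize(wn=3.0, wp=6.0)
    print(stack_current(stage.wn, 3, MU_N), stack_current(stage.wp, 3, MU_P))  # 2.0 2.0
```

Scaling widths by stack depth is what makes such stages costly in area, which is one motivation for sizing only the last two stages for the load and keeping the preceding stage at minimum size.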
