ABSTRACT Ongoing miniaturization in VLSI circuits continues to pose challenges to the synchronous design style conventionally used in microprocessors. These include the distribution of the clock in the GHz range, robustness to delay variations, reduction in electromagnetic interference, and energy conservation, to name a few. Asynchronous logic has been known for its ability to address these challenges by means of closed-loop handshake protocols instead of a global clock signal. Because of these advantages, there have been numerous attempts at building general and special purpose asynchronous microprocessors during the last three decades. Still, the number of commercially available asynchronous processors is small, mainly due to insufficient electronic design automation tool support, an ambiguous design flow and testing mechanisms for asynchronous logic and, most importantly, the absence of a forum in which to look for relevant works explaining the design steps and tools for such microprocessors. This paper is intended to bridge this gap by 1) reviewing the design principles of asynchronous logic, including classification, signaling conventions, and pipelining approaches; 2) presenting the complete design flow and available electronic design automation tools; 3) developing an encyclopedia of the various general and special purpose microprocessors proposed so far; and 4) presenting an evaluation of those works in terms of area on the die and performance metrics. This paper will also serve as a guideline for asynchronous microprocessor design and implementation in all phases from specification to tape-out.
I. INTRODUCTION
While the reduction in feature sizes has allowed digital circuits to operate at increased clock rates, synchronous designs face certain challenges that are difficult to overcome in the deep submicron era [1]. These include chip-wide clock distribution and susceptibility to delay variations. The former may be addressed by means of a balanced clock tree with sufficiently low skew; however, the strong clock drivers still pose a threat to energy requirements [2]. Asynchronous logic, which relies upon closed-loop handshakes for communication between components, naturally eliminates the need for a clock, and at the same time provides an inherent ability to adapt to uncertainties and even dynamic changes of timing parameters. Lower power dissipation, reduced electromagnetic emission, higher operating speed, and better modularity are among the other traits associated with asynchronous logic designs [3], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei.
In spite of the advantages that asynchronous logic enjoys over its synchronous counterpart, it never flourished, and failed to catch the industry's attention. The primary reason behind this predicament is insufficiently mature electronic design automation (EDA) tool support [5]. For the same reason, the principles of asynchronous logic, the existing asynchronous systems (be they in industry or in academia), and their design flow along with the EDA support are usually misunderstood and more often overlooked. The same is the case with asynchronous microprocessors, which have been developed during the past three decades using various design flows and tools, but have neither managed to be among the processors of eminence nor defined a standard design flow for asynchronous systems.
With ongoing miniaturization, and the consequently growing number of functional units on a single chip, asynchronous logic is once again receiving the attention of the research community [6]. We believe there is a need to comprehensively present the principles of asynchronous logic, its standard design flow and the available EDA support, followed by a thorough evaluation of the various general and special purpose microprocessors existing in the literature. This is the main contribution of this work: while it serves as an encyclopedia of asynchronous principles and microprocessors, it is intended to give direction for specifying, modeling, synthesizing, and implementing all classes of asynchronous circuits and systems, and to present a quantitative evaluation of existing asynchronous microprocessors.
The rest of the manuscript is organized as follows: We begin by presenting principles of the asynchronous logic in Sect. II. Sect. III-A details the design flow and EDA support for asynchronous circuits and systems. In Sect. III-B, we present an overview of the existing processors, and their quantitative evaluation. We conclude the manuscript in Sect. IV.
II. FUNDAMENTALS OF ASYNCHRONOUS LOGIC
In what follows, we briefly review the fundamental principles of asynchronous logic, which are essential for understanding and designing asynchronous circuits and systems.
A. DATA AND CONTROL PATHS
The data path is the part of a circuit that is responsible for performing operations, such as arithmetic and logic, on data. The control circuit, on the other hand, maintains the operation sequence of the data path and controls the timing.
Two asynchronous circuits are connected in such a way that their data paths are directly connected to each other, while their control paths are connected to each other by means of a pair of signals, known as request and acknowledge − together called the control signals. The latter indicate validity and safe reception of data between sender and receiver on the data path respectively. The instances at which the control signals are asserted lead to a distinction between two delay models of asynchronous logic; namely bounded and unbounded. In the former, the control signals are automatically asserted once a presumed delay, which is usually long enough for the corresponding operation on the data path to complete, has elapsed. On the other hand, in unbounded delay models, additional steps need to be taken to know for sure the data validity and their safe reception. Data and control paths are usually separately synthesized to gate level netlists, because each of them requires a different set of methodologies and tools. 
B. HANDSHAKING CONCEPT IN ASYNCHRONOUS DESIGN
A channel is a point-to-point, unidirectional communication link that connects two asynchronous circuits. Usually a channel comprises three signals: request, data and acknowledge; in some cases the request signal may be encoded into the data bus. The sender places some data on the data bus and indicates their validity to the receiver using a control signal. The receiver, on the other hand, receives the data, consumes them, and indicates its availability for receiving the subsequent data item using the other control signal. This request-acknowledge activity, which transfers one data item, is termed a handshake. The communication may be initiated by the sender, in which case the channel is known as a push channel, whereas in a pull channel the receiver initiates the communication by asserting the request signal. Fig. 1 depicts each type of channel with its block diagram and corresponding waveform.
C. CLASSIFICATION OF ASYNCHRONOUS CIRCUITS
Asynchronous circuits can be classified according to their delay models; the usual three classes are discussed next.
1) DELAY INSENSITIVE CIRCUITS
The class that is most robust against variations of process, voltage and temperature (PVT) is called delay insensitive (DI) [7], since it assumes arbitrary, but finite, wire and gate delays [4], [8], [9]. The receiver, in such circuits, is bound to properly acknowledge every transition by the sender, which means the next transition is allowed only when the previous data are correctly accepted and/or consumed. However, the number of asynchronous circuits that may be made DI is very small [10]. A DI circuit is illustrated in fig. 2(a): the acknowledge signal is asserted once each of the two receivers has issued its own acknowledgment, indicating its availability for accepting the next data item. The black box is responsible for joining the two acknowledgments (waiting for all the receivers to respond), since delay1 ≠ delay2 and hence their acknowledgments may arrive at different times as well.
VOLUME 7, 2019
2) QUASI DELAY INSENSITIVE CIRCUITS
DI circuits with isochronic forks are said to fall in quasi delay insensitive (QDI) class. This class of asynchronous circuits compromises the delay insensitivity property, by assuming that in isochronic forks, all the receivers receive the signal at the same time, as presented in fig. 2(b) . That is, the input delay of each receiver is identical, so only one acknowledgment from any of them ensures the completeness to the sender. Isochronic forks, if not carefully implemented, may cause a hazardous effect in the circuits [11] .
3) SPEED INDEPENDENT CIRCUITS
Speed Independent (SI) class assumes arbitrary, but finite, gate delays, and zero wire delays. This class is similar to synchronous style: it assumes that before the req signal is asserted, the data have to be stable at the receiver side; similar to synchronous approach, where the clock edge occurs sufficiently later than the data becoming valid and stable. So to achieve this, in asynchronous environment, there has to be an appropriate delay, by means of a buffer or inverter chain, in the req path. However, these circuits, by doing this, lose their robustness against PVT variations.
D. SIGNALING CONVENTIONS
In asynchronous designs, the local controller, instead of a global clock, governs the data movement on a channel [12] . The control signals follow some predefined pattern for accurate operation, where the latter is specifically known as signaling; this is discussed next.
1) 4-PHASE SIGNALING
To complete one handshake cycle, or to exchange one message, the 4-phase signaling protocol uses four transitions, two by each of the req and ack signals [13]. The waveforms illustrated in fig. 1 are examples of the 4-phase bundled data protocol. As may be seen in the waveform, a transition to the high level indicates a valid event, while the transition to zero is the phase that resets the communication, giving this scheme another name: the Return-to-Zero (RTZ) protocol.
2) 2-PHASE SIGNALING
In 2-phase signaling, a transition of the request signal from zero to one, as well as from one to zero, indicates validity of the data, as illustrated in fig. 3. Since no resetting phase is involved, this type of signaling is also termed NonReturn to Zero (NRZ). Naturally, this type of signaling leads to faster circuits, besides being more energy efficient due to the fewer transitions required per data transfer.
In comparison to 2-phase signaling, the advantages that 4-phase signaling enjoys include increased robustness to delay variations, since the RTZ phase provides sufficient safety margin, and the circuits are relatively simpler to design. For instance, a level controlled latch can be directly driven by using control signals of the 4-phase protocol: one level switches it to opaque, while, the other makes it transparent. The 2-phase signaling, on the other hand, requires some additional logic to make the latch functional as required.
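The difference in transition counts between the two conventions can be made concrete with a small sketch. It lists the control-signal events on a push channel for a given number of transfers; the function names and the (signal, level) event format are illustrative assumptions, not taken from any tool discussed in this paper.

```python
def four_phase_events(n_items):
    """Control transitions for n_items transfers under 4-phase (RTZ) signaling."""
    events = []
    for _ in range(n_items):
        # req rises, ack rises, then both return to zero before the next item
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


def two_phase_events(n_items):
    """Control transitions for n_items transfers under 2-phase (NRZ) signaling."""
    events = []
    req = ack = 0
    for _ in range(n_items):
        req ^= 1                      # any req transition announces valid data
        events.append(("req", req))
        ack ^= 1                      # any ack transition confirms reception
        events.append(("ack", ack))
    return events


if __name__ == "__main__":
    for n in (1, 2, 8):
        print(n, len(four_phase_events(n)), len(two_phase_events(n)))
```

Under these assumptions the 4-phase channel needs four control transitions per item and the 2-phase channel two. Note that two consecutive 2-phase transfers produce the same event sequence as a single 4-phase transfer; only the interpretation of the falling edges differs.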
E. DATA REPRESENTATION
The purpose of communication is to transfer meaningful information in the form of data. The predefined representation of data on which the two parties agree is called encoding. Generally, in asynchronous logic, the data may be encoded in one of two ways: 1) single rail encoding, where one bit of data takes one line, and 2) M-of-N encoding, in which a single bit of data takes multiple lines. Next, we discuss these two encoding schemes.
1) SINGLE RAIL ENCODING
In single rail encoding as mentioned above, each wire carries one bit of data [14] . The control signals use separate rails, and are said to be bundled with the data signals − hence the name bundled data encoding. In bundled data encoding, the control signals may adopt either of the two signaling conventions discussed above. In that case, the suitable prefix, 4-phase or 2-phase (whichever adopted), is placed before name of the encoding scheme. Because of their simplicity, these schemes are widely used, where the cost, in terms of area on the die, is approximately the same as synchronous equivalents [15] .
2) M-OF-N ENCODING
This type of encoding is used within the DI class of asynchronous circuits, where N wires carry log2(N) bits of information (N is a power of 2), and there is an explicit wire to carry the acknowledgment [12]. The dual rail encoding [16] is a special case of 1-of-N encoding, with N = 2. Each bit is encoded using two rails: true and false. Level 0 is represented by logic '1' on the false rail, while '1' on the true rail is used to represent level 1. A '0' simultaneously on both rails indicates 'no valid data'. The two rails are mutually exclusive, so at a time, only one is allowed to make a transition. The new data validity at the receiver side is detected by transition, since no explicit request signal is available. The completion detection circuit performs this task. An example of dual rail encoding, and completion detection logic for each type of protocol, are illustrated in fig. 4 and 5 respectively.
Note in the completion detection mechanism, it is important to detect the RTZ phase on all lines as well, which cannot be handled by an AND gate. The reason behind this deficiency in AND gates is the fact that a low on only one input will cause a low on the output. Therefore, a component that waits for all the inputs to go low before it could deassert its output, should replace the AND gate. Muller C-element (MC) [17] is one such component, which has been the primitive for asynchronous logic since its inception.
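The argument above can be checked with a small behavioural sketch. It assumes one common arrangement, which may differ from the circuit in fig. 5: an OR gate per dual-rail bit produces a per-bit validity signal, and these signals are combined either by an AND gate or by a C-element. The class and function names are illustrative.

```python
class CElement:
    """Behavioural model of a Muller C-element with any number of inputs."""

    def __init__(self, state=0):
        self.state = state

    def update(self, *inputs):
        if all(v == 1 for v in inputs):
            self.state = 1          # all inputs high: output rises
        elif all(v == 0 for v in inputs):
            self.state = 0          # all inputs low: output falls
        return self.state           # otherwise the previous output is held


def dual_rail_encode(bit):
    """Return (true_rail, false_rail) for one data bit."""
    return (1, 0) if bit else (0, 1)


SPACER = (0, 0)                     # both rails low: no valid data


def bit_valid(pair):
    """Per-bit validity: OR of the two rails."""
    true_rail, false_rail = pair
    return true_rail | false_rail


def trace(words):
    """Return (AND output, C-element output) after each word is applied."""
    c = CElement()
    rows = []
    for word in words:
        valids = [bit_valid(p) for p in word]
        rows.append((int(all(valids)), c.update(*valids)))
    return rows


one, zero = dual_rail_encode(1), dual_rail_encode(0)
sequence = [
    [SPACER, SPACER],   # empty
    [one, SPACER],      # first bit has arrived
    [one, zero],        # full codeword present
    [SPACER, zero],     # first bit has returned to zero
    [SPACER, SPACER],   # RTZ phase complete
]

if __name__ == "__main__":
    for and_out, c_out in trace(sequence):
        print(and_out, c_out)
```

In the fourth step the AND output has already dropped although one bit is still valid, whereas the C-element keeps its output high until every bit has returned to zero.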
Within the 1-of-N encoding class, one-hot encoding represents n-bit data by 2^n lines. For n = 2, it differs from the dual rail codes in that it uses a unique 4-bit code to represent the 2-bit data, whereas the dual rail codes encode each bit using two lines. This difference is presented in Table 1. Although the area overhead for the two equivalents remains the same, the fewer transitions in the 1-of-N encoding make it more energy efficient, and therefore the preferred method. With 1-of-2 encoding, the 4-phase protocol is known as Null Convention Logic (NCL) [18], [19], where the RTZ phase is called a spacer or an empty word, used to separate two code words. The other difference is the usage of majority or threshold gates [20] for completion detection, in comparison to the MCs used in dual rail codes.
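A short sketch makes the wire and transition counts for 2-bit data explicit. The rail ordering and the mapping of values to one-hot wires are assumptions made for illustration and need not match Table 1.

```python
def dual_rail(value, n_bits):
    """Dual-rail codeword: (true, false) rail pair per bit, MSB first."""
    rails = []
    for i in reversed(range(n_bits)):
        bit = (value >> i) & 1
        rails += [bit, 1 - bit]
    return rails


def one_hot(value, n_bits):
    """1-of-N codeword with N = 2**n_bits; wire `value` is the one driven high."""
    return [1 if i == value else 0 for i in range(2 ** n_bits)]


def rtz_transitions(codeword):
    """Wire transitions per 4-phase transfer: each high wire rises, then falls."""
    return 2 * sum(codeword)


if __name__ == "__main__":
    for v in range(4):
        d, o = dual_rail(v, 2), one_hot(v, 2)
        print(v, d, rtz_transitions(d), o, rtz_transitions(o))
```

Both encodings use four wires for two bits, but under these assumptions a dual-rail transfer toggles two wires up and down (four transitions) while a 1-of-4 transfer toggles only one (two transitions).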
Level Encoded Dual Rail (LEDR) is another important dual rail encoding scheme [21] . In these codes, the data bit is the first bit in the codeword, followed by a 1-bit phase, which keeps alternating between odd and even for each codeword. Therefore, the two consecutive codewords are always different in their phases, making it possible to distinguish identical data items without the need of having a spacer in between − resulting in more energy efficient codes.
F. ASYNCHRONOUS PIPELINE IMPLEMENTATIONS
Efficient asynchronous circuits are usually built as pipelines, which increase the overall throughput by distributing the task among several function units operating in parallel on different data values. There are several types of asynchronous pipelines: micropipelines [22], mousetrap [23], GasP [24], QDI [25], [26], asP* [27], wave [28], [29], surfing [30], and RAMP [31]; all of them, though, have a common Muller pipeline as their backbone. The Muller pipeline is a simple arrangement of MCs, such that each of them forms a single stage. The output of each stage (MC) serves two purposes: 1) it becomes the input req to the successor stage, and 2) it is sent as the inverted ack to the predecessor stage. The first stage receives its input req from the sender, and generates ack in return. Similarly, the last stage generates the output req to the receiver, and receives the ack in return. The Muller pipeline is a mechanism that relays handshakes [4]. The pipeline is said to be empty when all the MCs are initialized to zero. At this point in time, the left environment (also called sender or producer) can initiate the handshake by asserting req. While this transition ripples through the pipeline to the right environment (also called receiver or consumer), due to the symmetry, each stage sends the acknowledge to the previous stage. Now, in case the producer is faster than the consumer, it may deassert its req, which traverses the pipeline up to the last stage and gets blocked there, waiting for the receiver to consume the token by asserting the ack. Sooner or later, all the stages may get blocked because of the slow receiver. A fully filled pipeline has an interesting characteristic: alternating stages always store opposite values. Singh and Nowick [23] made use of this feature to build the mousetrap pipeline.
FIGURE 6. 4-phase bundled data pipeline.
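The filling behaviour described above can be reproduced with a minimal behavioural sketch, assuming ideal C-elements, an eager 4-phase sender that toggles req as soon as its acknowledge matches it, and a receiver that never acknowledges. The sweep-based update order is a simulation convenience and does not model real gate delays; all names are illustrative.

```python
def c_element(a, b, prev):
    """Muller C-element: follow the inputs when they agree, otherwise hold."""
    return a if a == b else prev


def fill_pipeline(n_stages, ack_in=0, max_sweeps=100):
    """Drive a Muller pipeline until no signal changes; return (req, stages).

    ack_in is the acknowledge from the receiver, held constant here to
    model a consumer that is slower than the producer.
    """
    stages = [0] * n_stages          # empty pipeline: every MC holds zero
    req = 0
    for _ in range(max_sweeps):
        before = (req, list(stages))
        req = 1 - stages[0]          # sender's ack is the first stage's output
        for i in range(n_stages):
            left = req if i == 0 else stages[i - 1]
            right = ack_in if i == n_stages - 1 else stages[i + 1]
            # the successor's output enters the C-element inverted
            stages[i] = c_element(left, 1 - right, stages[i])
        if (req, stages) == before:  # fixed point: every stage is blocked
            break
    return req, stages


if __name__ == "__main__":
    print(fill_pipeline(4))
```

With four stages and a stalled receiver, the sketch settles with the stages holding 0, 1, 0, 1 and the sender's next request pending, which matches the alternating pattern of a full pipeline.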
The 4-phase bundled data pipeline with datapath is illustrated in fig. 6 . In a completely filled pipeline, one can observe that only half of the pipeline stages store data, since each pair of successive MCs hold alternating logic. This pipeline configuration is just like a Master-Slave setup in synchronous designs [4] .
The 2-phase bundled data pipeline, also known as micropipelines, was proposed by Sutherland [22] . As may be observed in fig. 7 , the control path is identical to the Muller pipeline, with a slightly different signal interpretation, which makes it follow the 2-phase handshaking.
III. ASYNCHRONOUS PROCESSORS, LANGUAGES AND DESIGN TOOLS
A. TOOLS AND LANGUAGES

1) TANGRAM
Tangram [32], [33] is a tool based on a dedicated, CSP-based VLSI programming language, in which the compiler contains a handshake circuit translation rule for each Tangram program. At the next stage, the transparent silicon compiler, or handshake circuit compiler, performs two tasks: component substitution and layout generation. In the former, the handshake components are implemented using a standard cell library; in the layout generation phase, commercial CAD tools are used. Some asynchronous chips programmed in Tangram are described in [34]-[38].
2) CHP: COMMUNICATING HARDWARE PROCESSES
CHP [39] is a programming language for fine grain distributed computation. Usually, a CHP program consists of the parallel configuration of several concurrent processes, where inside each process, the code is mostly sequential. These processes do not share variables; the latter are local to each process, but they may be passed to other processes as communication channel messages. The procedures and functions are also used as local variables. Integer (int), boolean (bool) and symbol are three generic variable types. For structuring data, two mechanisms named array and record are used, where the latter may contain many variables, each having its own type.
The process graph in CHP is made with a set of processes as vertices, and communication channels as edges. Initially, a process is declared and then instantiated. A process may be of two types: a meta-process or a simple process, where the former contains a number of sub-processes, and label meta identifies this type. On the other hand, label chp identifies a simple process.
Synthesizing a QDI system comprises a few steps. At first, the system is described as a sequential CHP program, which is decomposed into fine-grain CHP processes. This step is known as process decomposition, after which the CHP program is transformed into a handshake expansion (HSE). Finally, the HSE code is transformed into a production rule set (PRS). CHPsim locates and detects deadlocks, estimates performance, and debugs the system, as well as provides syntactic and runtime checks; its main and most interesting function is co-simulation. Many projects have been synthesized with CHP, including [40]-[45].
3) BALSA
Balsa [46], [47] is both a language for describing asynchronous hardware systems and an asynchronous circuit synthesis system that generates gate-level netlists from descriptions in the Balsa high-level language. The Balsa design flow, shown in fig. 8, demonstrates the overall working.
Balsa contains a number of tools from which some of the important ones are listed below.
• balsa-c: the Balsa language compiler; its output is the intermediate language Breeze.
• balsa-netlist: from a Breeze description it produces an appropriate netlist of the target CAD/technology framework.
• breeze2ps: postscript file of the handshake circuit graph is produced by this tool.
• breeze-cost: circuit area cost estimation tool.
• balsa-md: makefiles generating tool.
• balsa-mgr: for balsa-md, a graphical front-end with project management facilities.
• balsa-make-test: automatically generates a test harness for a Balsa description.
• breeze-sim: at the handshake component level the preferred simulator.
• breeze-sim-control: a graphical front-end for the simulation and visualization environment.
Balsa adopts a syntax-directed translation method to yield communicating handshake components, where the compilation approach is transparent and similar to that of the Philips Tangram system [32]. The set of ≈45 handshake components is listed in [47]; the components are connected by channels, over which the communications take place.
4) ASYNCHRONOUS CIRCUIT COMPILER
Asynchronous Circuit Compiler (ACC) is the first fully automated synthesis tool for asynchronous and delay-insensitive circuits. It is used in the Tiempo [48] asynchronous circuit design flow [49], as shown in fig. 9. The input of ACC is a description written at the Transaction Level Modeling (TLM) level in the SystemVerilog [50] description language. Such a format allows integration of the Tiempo clockless technology into verification platforms such as Mentor Graphics Questa™, Cadence NCsim™ and Synopsys VCS™. ACC produces a gate-level netlist in the Verilog description language. In addition to standard cell libraries, ACC uses Tiempo asynchronous cells for circuit mapping. The generated gate-level netlist can be placed and routed using standard back-end tools and simulated using standard electrical simulation tools. As ACC is interoperable and compliant with standard design flows, it can be integrated with any tool based on industry standards. As an example, the TAM16 [51] IP core is designed using the Tiempo fully asynchronous and delay-insensitive technology.
5) PETRIFY
Petrify [52], [53] is a tool for the synthesis of Petri Nets (PN) and asynchronous circuits. It manipulates concurrent specifications and optimizes asynchronous control circuits. From the originally described PN or Signal Transition Graph (STG), Petrify generates a simpler, bisimilar PN or STG (fig. 10). Furthermore, it transforms a specification using token flow analysis of the initial PN, which in turn yields a Transition System (TS). In the initial TS, transitions with the same label are counted as one event. To obtain a pure, unique, free and non-redundant PN, the TS must be transformed and its transitions re-labeled. Using a design gate library, Petrify generates an asynchronous controller netlist while the input-output behavior remains unchanged. When asynchronous circuits are synthesized, it performs state assignment by solving the complete state coding problem, and generates speed independent circuits [54].
6) OTHER TOOLS
In the literature, there are a number of other similar tools including Teak [55] , Occam [56] , LARD [57] , DESI [58] , VSTGL [59] , Workcraft [60] , VERISYN [61] , Pipefitter [62] , CHAINworks [63] and TiDE [64].
B. PROCESSORS
Processors that depend on a global clock to regulate processing are synchronous. The global clock in such processors may become problematic, in particular when the processing environment is more complex. The main issues faced in a synchronous processor are the clock skew and the clock circuit itself. The latter can dissipate an enormous amount of power because it is always running. An alternative choice among the research community is to consider asynchronous designs. In an asynchronous design, each functional unit communicates with the others by using a local clock or, more technically, handshaking. Such a design choice delivers simplified interfacing and average-case performance, as compared to the worst-case performance of a synchronous design, where the clock period must be larger than the delay of the slowest component. Asynchronous processors are also efficient in power dissipation, as only the required part of the circuit is active. In this section, different asynchronous processor designs are explored.
1) CALTECH ASYNCHRONOUS MICROPROCESSOR
Caltech Asynchronous Microprocessor (CAM) [40] is a 16-bit RISC type architecture with 16 general purpose registers, an ALU, four buses, and two adders. The two adders are used for memory address calculation and program counter calculation, respectively. The CAM uses 4-phase handshaking protocols with dual rail data transfer. The estimated performance of the CAM processor was approximately 15 MIPS at 7 V when realized in a 2 µm scalable CMOS (SCMOS) process at room temperature, and 30 MIPS at 12 V when the chip is cooled with liquid nitrogen. The performance was estimated to be 26 MIPS at 10 V and 105 mA when realized in HP 1.6 µm SCMOS. The processor is realized using a Harvard architecture and its chip consists of 2000 transistors.
2) FULLY ASYNCHRONOUS MICROPROCESSOR
Fully Asynchronous Microprocessor (FAM) [65] is a 32-bit RISC-like architecture with a 4-stage pipeline. Its data path includes a 32-bit ALU, 32 registers (each 32 bits wide), a 32-bit barrel shifter, a multiplier and an adder. The 4-stage pipeline uses the ALU and register file and comprises instruction fetch, memory access, instruction decode and instruction execute. The design consists of two types of blocks: computation blocks and interconnection blocks. A computation block includes an adder, shifter and register, while an interconnection block includes combinational logic, a pipeline register and a data latch. The instruction set of the FAM microprocessor has 18 instructions. The processor uses a 4-phase handshaking protocol with dual rail data transfer and is based on CMOS technology with approximately 71000 transistors. Its design is based on Differential Cascode Voltage Switch Logic (DCVSL) for completion detection of combinational logic, with a measured performance of 300 MIPS in 0.5 µm CMOS.
3) NON-SYNCHRONOUS RISC PROCESSOR
Non-Synchronous RISC (NSR) [66] is a 16-bit load/store architecture with 16 general purpose registers and a 5-stage pipeline. The 5-stage pipeline includes units for instruction fetch, instruction decode, execution, memory interface and register file. In NSR, stalling caused by a slower instruction is hidden by adding self-timed FIFO queues between concurrent units. Each block accepts data from other blocks for processing and sends the result by means of FIFO queues. An instruction that does not need a particular pipeline stage is never passed through that stage. For example, if an instruction does not use the memory, it is never passed through the memory interface pipeline stage. The self-timed concurrent blocks in the NSR design communicate using a 2-phase bundled data protocol. The prototype NSR uses seven Actel Field Programmable Gate Arrays (FPGAs). To test an individual unit of the NSR processor, a switch blocks the unit's requests, hiding its request and acknowledge signals from the other units. The best-case performance of NSR is estimated to be 1.3 MIPS.
4) COUNTERFLOW PIPELINE PROCESSOR ARCHITECTURE
The Counterflow Pipeline Processor (CFPP) [67] architecture (fig. 11) is realized using the SPARC instruction set and is based on the idea that, within a pipeline, instructions flow in one direction and their results in the other. The CFPP has a multi-stage pipeline design in which the program counter is at the bottom while the register file is placed at the top of the pipeline. An instruction flows up for execution and stalls when the upper pipeline stage cannot accept a new instruction. An instruction may also stall if it reaches the execution stage while the upper stage holds a stalled instruction. Such a situation may be avoided if there are gaps of arbitrary size in the pipeline, which leave some stages empty (without any instruction).
An instruction includes an opcode and source and destination bindings, as shown in fig. 11. Each binding contains three fields: a register name, a validity bit and a value. When an instruction reaches the execution stage, the new data value is loaded into a destination binding and marked valid after the execution. Similarly, when an instruction reaches the top pipeline stage, the data value from the destination binding is loaded into the specified location of the register file. Afterwards, the destination binding flows downward in the pipeline as the result for subsequent instructions.
There are two bindings in a result pipeline. If a subsequent instruction needs a source binding whose register name matches the register name of a result binding, it garners the value from the result pipeline into the instruction pipeline, just like bypassing or forwarding. On the other hand, if the destination binding of an executed instruction matches a result binding, then the result binding garners the value from the destination binding. For any register that matches, the source bindings of the new instruction are updated with the most recent values. Another situation arises when an instruction that is yet to execute has a destination binding matching a result binding: the result binding is killed, as it is not valid for later instructions. Together, the rules just described guarantee correct result bindings for instructions. Having multiple result bindings in different stages of the pipeline at the same time is similar to register renaming.
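The matching rules described above can be summarized in a short sketch of what happens when one instruction and one result binding meet in a stage. The data layout and the function name `meet` are hypothetical, chosen for illustration only, and the sketch ignores stalls, traps and the pipeline movement itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Binding:
    reg: str                        # register name
    value: Optional[int] = None
    valid: bool = False


@dataclass
class Instruction:
    sources: List[Binding]
    dest: Binding
    executed: bool = False


def meet(instr, result):
    """Apply the matching rules; return the surviving result binding or None."""
    # Forwarding: a matching source binding takes the value from the result.
    for src in instr.sources:
        if src.reg == result.reg and result.valid:
            src.value, src.valid = result.value, True
    if instr.dest.reg == result.reg:
        if instr.executed:
            # Executed instruction: the result binding takes the newer value.
            result.value, result.valid = instr.dest.value, True
        else:
            # Not yet executed: the old result is stale for later instructions.
            return None
    return result


if __name__ == "__main__":
    add = Instruction(sources=[Binding("r1")], dest=Binding("r2"))
    print(meet(add, Binding("r1", 7, True)), add.sources[0])
```

In this simplified model a killed result is represented by `None`; a real stage would instead mark the binding invalid and let it continue down the result pipeline.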
Issues like traps and incorrect branch predictions are resolved effectively by the architecture. A trap caused by any instruction at any stage generates a trap-result that is bound to the result pipeline (not to a destination binding). The instruction that causes the trap is marked invalid; it may proceed to the next pipeline stage but has no effect on the register file or the result pipeline. The trap-result is interpreted by the stage responsible for program counter control, which changes the program counter to a suitable trap handler. Incorrect branch predictions are handled similarly to traps, with the program counter control restarting execution from the proper path.
In the CFPP architecture, non-identical stages perform different processing: one stage, for example, performs multiplication while another performs addition. The architecture may use sidings, which execute instructions with long computation delays. For example, when a multiplication instruction reaches multi-launch, it stalls until all required operands become valid. The multiplication is then shifted to the multiplier siding, and the execution result is recovered in the later multi-recover stage. Other sidings, such as adder and memory sidings, are also available in this architecture.
Non-identical stages weaken the performance of the architecture: if a store instruction, for example, depends on a multiplication instruction, then the compiler must reschedule independent instructions between them. This way, when the multiplication instruction propagates to the eighth stage, the store instruction will have propagated to the fourth stage. When the multiplication stores its result in the 10th stage, there will be five empty stages between multi-recover and the memory launch. The result of the multiplication enters the pipeline and propagates through five stages to reach the awaiting store instruction, which affects the throughput. The architecture also has a long pipeline, which requires an excessive amount of area. This version was not implemented in hardware.
FIGURE 12. AMULET1 organization: reprinted from [68].
5) AMULET1
AMULET1 [68] is an asynchronous version of the ARM processor and is object-code compatible with the 32-bit ARM6 processor. Its functional units are the address interface, the register bank (with 30 general purpose registers, each 32 bits wide), the execution unit and the data interface (fig. 12). The functional units work concurrently and independently. To avoid control hazards, a coloring mechanism is used [69]. In this mechanism, a color bit represents the state of the processor, and the same color bit is attached to each instruction fetched at that moment. Whenever the instruction and processor color bits mismatch, the instruction is discarded. The processor color bit changes on the termination of the instruction stream. The architecture uses a register locking mechanism, a 2-phase single-rail protocol for communication and a bounded-delay timing model, and operates in fundamental mode.
The AMULET1 processor was fabricated in two CMOS processes: the 1µm process at ES2 gives a performance of 20.5K Dhrystones (@ 5V and 152mW, 77MIPS/W), while the 0.7µm process at GEC Plessey Semiconductor gives 40K Dhrystones @ 5V [69]. It does not match the performance of its synchronous counterpart ARM6, but it shows a clear path towards asynchronous implementation.
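The coloring mechanism described above can be illustrated with a minimal behavioral sketch. The class and method names (`ColorProcessor`, `fetch`, `end_stream`, `accept`) are hypothetical and chosen for illustration only; the sketch models the color comparison, not the AMULET1 circuitry.

```python
from dataclasses import dataclass


@dataclass
class Fetched:
    """A fetched instruction tagged with the color in force at fetch time."""
    name: str
    color: int


class ColorProcessor:
    """Hypothetical model: one color bit for the processor state."""

    def __init__(self):
        self.color = 0

    def fetch(self, name):
        # Each fetched instruction inherits the current processor color.
        return Fetched(name, self.color)

    def end_stream(self):
        # The instruction stream terminates (e.g. a taken branch): flip the color.
        self.color ^= 1

    def accept(self, instr):
        # Instructions whose color mismatches the processor color are discarded.
        return instr.color == self.color


cpu = ColorProcessor()
branch = cpu.fetch("branch")
wrong_path = [cpu.fetch("add"), cpu.fetch("sub")]  # prefetched sequentially

executed = []
if cpu.accept(branch):
    executed.append(branch.name)
    cpu.end_stream()  # the branch is taken, so the old stream ends

executed += [i.name for i in wrong_path if cpu.accept(i)]  # both are discarded

target = cpu.fetch("target")  # fetched with the new color
if cpu.accept(target):
    executed.append(target.name)
```

In this sketch the two sequentially prefetched instructions are dropped without any explicit flush signal, because their color no longer matches that of the processor.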
6) TITAC: DESIGN OF A QUASI-DELAY-INSENSITIVE MICROPROCESSOR
TITAC [70] is a non-pipelined asynchronous implementation of an 8-bit Von Neumann microprocessor. The processor is organized as a control section and a data path section. The control section contains two independent controllers (controller 1 and controller 2): the first is hard-wired, while the second is microprogrammed and controls off-chip storage. Either controller can be selected to control the data path section. TITAC is based on the quasi-delay-insensitive timing model and uses a 4-phase communication protocol. It is realized in 1µm CMOS using ≈22068 transistors, with an estimated performance of 11.2MIPS and 1.8MIPS for controller 1 and controller 2, respectively.
7) THE GALLIUM ARSENIDE ASYNCHRONOUS MICROPROCESSOR
The Gallium Arsenide (GaAs) Asynchronous Microprocessor [71], with a 16-bit RISC pipeline architecture, is a modified implementation of the Caltech Asynchronous Microprocessor in GaAs technology. The processor data path includes a program counter, 16 general purpose registers, an ALU and a memory unit for executing load/store operations. All data path gates, completion detection circuits and buffers, except NAND gates, use Direct Coupled-FET Logic (DCFL). The measured performance of the processor is 50MIPS/W.
8) FRED ARCHITECTURE
Fred [72] is a self-timed decoupled pipeline computer architecture based on micro-pipelining and roughly based on NSR (III-B3). It uses most of the Motorola 88100 instruction set. The Fred organization includes a dispatch unit, a register file (32 general purpose registers) and an execution unit, as shown in fig. 13. The dispatch unit is the main control unit: it controls the program counter and instruction fetch, and sends instructions to the other functional units. It issues instructions after their data dependencies are satisfied and handles data hazards by resolving register destination conflicts. The execution unit has five functional units (arithmetic, logic, control, memory and branch units), and a distributor is responsible for sending each instruction to the appropriate unit. The result of each functional unit is written back to the register file directly or via a register (R1) queue. In the Fred architecture, direct forwarding of results to other functional units is not allowed because of its complexity.
Many decoupled independent processes of the Fred architecture are connected via FIFO queues (dedicated paths) of arbitrary length, so that several instructions can be processed at a time.
As each pipeline stage passes data by communicating locally with its neighboring stage, no extra control circuitry is needed to add pipeline stages. The Fred prototype is described in the hardware description language VHDL and is fully functional. For performance measurement, different benchmark programs were run on Fred, and the average measured performance was 149.67 MIPS.
9) HADES ARCHITECTURE
Hatfield Asynchronous DESign (HADES) [73], [74] is a superscalar RISC-type processor with a Harvard architecture. HADES is a step towards the design of an asynchronous superscalar processor. It includes four pipeline stages, namely instruction fetch (fetches in groups), instruction decode (twofold operation), execution (independent functional units) and write-back. In the write-back stage, two register files, integer and boolean, are used. The condition generated by an integer comparison is stored in the boolean register file for resolving branches. The processor addresses read-after-write and write-after-write hazards using a register locking mechanism and decoupled operand forwarding; to resolve such hazards, each functional unit in the execution stage has a separate forwarding register. The architecture can issue single or multiple instructions: instructions are issued in order but may complete out of order. Furthermore, it uses a 4-phase protocol for communication. A formal specification language, Communicating Sequential Processes (CSP) [75], is used for designing the baseline of the HADES processor, in which all concurrent processes communicate asynchronously. The specification language CSP and the description language VHDL allow the designer to check the correctness of the design and simulate it easily.
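Register locking, which several of the surveyed processors use against data hazards, can be sketched behaviorally as follows. This is a simplified model under stated assumptions: a per-register counter of pending writes stands in for whatever structure a given design actually uses, and the names `LockedRegisterFile`, `lock`, `can_read` and `write_back` are hypothetical.

```python
class LockedRegisterFile:
    """Hypothetical model of register locking with per-register write counters."""

    def __init__(self, size):
        self.values = [0] * size
        self.pending = [0] * size  # outstanding writes per register (assumption)

    def lock(self, rd):
        # At issue: mark the destination register as awaiting a result.
        self.pending[rd] += 1

    def can_read(self, rs):
        # A read must stall while any write to the register is outstanding.
        return self.pending[rs] == 0

    def write_back(self, rd, value):
        # The result arrives, possibly out of order: store it and release a lock.
        self.values[rd] = value
        self.pending[rd] -= 1


rf = LockedRegisterFile(8)
rf.lock(3)                        # a long-latency instruction targets r3
stalled = not rf.can_read(3)      # a dependent read of r3 has to wait
independent = rf.can_read(1)      # reads of other registers proceed
rf.write_back(3, 42)
released = rf.can_read(3)
```

Counting pending writes also makes a second write to the same register visible to readers, which is one simple way to keep write-after-write ordering observable in a model that allows out-of-order completion.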
10) ASYNCHRONOUS PROCESSOR BASED ON PETRI NETS
The processor in [76] is based on Holton's [77] simple 3-bit synchronous processor design; the asynchronous version employs the same instruction set as the synchronous implementation. It performs the operations load general register (LdGR), load accumulator (LdAcc), arithmetic operation (Arth) and store. In the first stage, labeled Petri nets are produced (as shown in fig. 14), where places are represented by circles and transitions by bars. The abstract labeled Petri net includes two places and two transitions for the instruction fetch and execute modes. An instruction or word fetch results in a program counter (PC) increment, while the instruction execute transition expands into a complex structure. An instruction can be one or two words wide: on completion of a one-word instruction, the processor executes the next instruction, while on completion of a two-word instruction, the first (instruction) word remains in the instruction register because the second word fetched contains data.
Instruction execution completes as follows: the load instruction is decomposed into decoding the instruction (LdAcc) and fetching the second word (Accdta). Arithmetic instruction execution is completed by using the ALU and latching the result into the accumulator. A store instruction is completed after the memory address register write (loading the address) and the memory write.
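The abstract net with two places and two transitions can be sketched with a minimal token-game simulator. The place and transition names (`fetch_mode`, `execute_mode`, `fetch`, `execute`) are assumptions made for illustration and are not taken from [76].

```python
class PetriNet:
    """Minimal place/transition net with unit arc weights."""

    def __init__(self, marking, transitions):
        self.marking = dict(marking)    # place -> token count
        self.transitions = transitions  # name -> (input places, output places)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking[p] >= 1 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] += 1


# Two places and two transitions: the token alternates between the two modes.
net = PetriNet(
    marking={"fetch_mode": 1, "execute_mode": 0},
    transitions={
        "fetch": (["fetch_mode"], ["execute_mode"]),
        "execute": (["execute_mode"], ["fetch_mode"]),
    },
)

pc = 0
for _ in range(3):
    net.fire("fetch")    # an instruction fetch increments the PC
    pc += 1
    net.fire("execute")  # the execute transition returns the token
```

The refinement steps described in the text correspond to replacing the single `execute` transition with a more detailed sub-net while keeping the same token discipline.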
After the labeled Petri nets are complete, the designer derives the temporal relations between the transitions. The analysis shows that the increment of the PC is concurrent with all execution transitions. The design was improved (from version 1) by decoupling the ALU, as arithmetic instructions do not require data from memory. In this refinement, the acknowledge is sent to the Memory Address Register (MAR) after decoding. In this improved version (version 2), the ALU and Accres (latching the result to the accumulator) are concurrent with MARr and Memr (memory read), which reduces the execution time of arithmetic operations. In this later version, modules are decoupled further because of low concurrency. Furthermore, the instruction register is concurrent with the ALU and Accres, while instruction decode works concurrently with MARr and Memr. This, however, introduces a deadlock and a stall signal. With further refinement (version 3), the deadlock was removed by adding a new register for storing the fetched word and allowing the MAR to accept requests from this transition. The stall signal allows a new PC value into the MAR only when the previous instruction starts to decode.
FIGURE 15. Pipelined processor model with two latches: reprinted from [76].
Version 3 of the processor was improved further after analyzing the temporal relations of its modules. An extra latch was introduced to decouple the memory and instruction registers, so that the instruction register is now concurrent with MARr and Memr (version 4 of the processor). This version includes a 4-stage micro-pipeline with extra feedback, as shown in fig. 15. The performance of the designs was measured using UltraSAN. In version 4, the PC cycle takes 109.0ns and the execution of an arithmetic operation takes 100.0ns. In the second stage, the labeled Petri nets are translated to asynchronous circuits, using a translation method inspired by Patil's work [78].
11) Amulet2e: AN ASYNCHRONOUS EMBEDDED CONTROLLER
Amulet2e [79] is an asynchronous embedded controller built around an asynchronous ARM core. It comprises the asynchronous ARM core, 4Kbyte of RAM/cache and a memory interface to external memory. For performance gain, amulet1 was modified to amulet2, where the main change in the execution pipeline is a reduction of pipeline stages, as shown in fig. 16. In amulet2, stalls due to register locking are resolved with a forwarding mechanism that uses the last result register (LRR) and last loaded value (LLV) techniques to bypass the register read, as shown in fig. 17. To predict branches, amulet2 introduced a Branch Target Cache, as shown in fig. 18. Branch prediction is another performance edge compared to amulet1, in which issues raised by sequentially pre-fetching instructions from the program counter are corrected by the execution pipeline.
In amulet2, HALT is introduced for power efficiency; the system resumes full performance on an interrupt. The range of chip select lines, address bus and bidirectional data bus in the memory interface makes amulet2e more convenient than amulet1. Furthermore, amulet2e uses a 4-phase bundled-data protocol for communication. When all performance features are turned on, the measured performance is 42MIPS (Dhrystone 2.1 benchmark). On the downside, amulet2e has only been used as a research prototype and is not suitable for commercial use.
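The idea of bypassing the register read through a last result register can be sketched as follows. This is a simplified behavioral model, not the amulet2 circuit: it keeps only a single LRR entry, omits the LLV path, and uses hypothetical names (`ForwardingRegisterFile`, `alu_result`, `read`).

```python
class ForwardingRegisterFile:
    """Hypothetical model: one last-result register in front of the register bank."""

    def __init__(self, size=16):
        self.regs = [0] * size
        self.lrr = None  # (destination, value) of the most recent ALU result

    def alu_result(self, rd, value):
        # The result is captured in the LRR before it reaches the register bank.
        self.lrr = (rd, value)

    def write_back(self):
        # Later, the captured result is committed to the register bank.
        if self.lrr is not None:
            rd, value = self.lrr
            self.regs[rd] = value

    def read(self, rs):
        # If the operand is the last result, bypass the (still locked) register.
        if self.lrr is not None and self.lrr[0] == rs:
            return self.lrr[1], "forwarded"
        return self.regs[rs], "register"


rf = ForwardingRegisterFile()
rf.alu_result(2, 42)           # r2 is produced but not yet written back
from_lrr = rf.read(2)          # the dependent instruction reads it from the LRR
from_bank = rf.read(1)         # unrelated operands come from the register bank
rf.write_back()
```

A dependent instruction that immediately follows its producer therefore does not have to wait for the write-back to release the register lock.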
12) ASYNCHRONOUS MIPS R3000 MICROPROCESSOR
The asynchronous version of the MIPS R3000 microprocessor is known as MiniMIPS [41]. MiniMIPS has a 32-bit RISC CPU, a memory management unit and two 4Kbyte on-chip caches (an instruction cache and a data cache). It contains 32 general purpose registers (each 32 bits wide), two special purpose registers for division and multiplication, and a program counter. The pipeline structure of MiniMIPS consists of a fetch loop and an execution pipeline. The fetch loop has the program counter, fetch and decode units. The execution pipeline, on the other hand, includes the execution, register and write-back units. All execution units (e.g., adder and multiplier) are parallel and work concurrently. This means that the multiplier's result is written directly to the register and is not passed through any other execution unit or the write-back unit.
The MIPS R3000 microprocessor is a 3-stage pipeline architecture, whereas MiniMIPS is very carefully pipelined to gain performance. The main design goals are to achieve high throughput without sacrificing the low-power advantage of asynchronous design, and to address the architectural issues missed in CAM: precise exceptions, branch delay slots, branch prediction, register bypassing and caches.
MiniMIPS operates in two modes: user and kernel mode. The design uses a 4-phase handshaking protocol (dual rail or 1-of-N code) and the quasi-delay-insensitive timing model. The measured performance of MiniMIPS is 180MIPS (@ 4W, 3.3V and 25 °C).
13) TITAC-2
TITAC-2 [80], an asynchronous implementation of MIPS R2000, is a 32-bit microprocessor based on the scalable-delay-insensitive model. It has a modified version of the instruction set that includes multiply/divide, branch delay slot and privileged instructions. As the instruction encodings of TITAC-2 and MIPS R2000 differ, they are not object-code compatible. They are, however, similar in pipeline stages (both have five), precise exception handling, external interrupts, memory protection and on-chip cache. The pipeline stages are instruction fetch, instruction decode, execution, memory access and write-back.
TITAC-2 introduced a new timing model, the scalable-delay-insensitive (SDI) model. In short, the circuit is expected to function even when a delay becomes up to K times larger than the estimated delay, where K is the maximum variation ratio. The SDI model is faster than the delay-insensitive (DI) or quasi-delay-insensitive models. It is used for subsystems, while the global interconnection uses the DI model. Using the Dhrystone V2.1 benchmark, the measured performance of TITAC-2 is 52.3VAXMIPS (@ 2.11W and 3.3V).
14) ASYNMPU
ASYNMPU [81], a fully asynchronous CISC microprocessor, is the first implementation of a CISC microprocessor that is pin-to-pin compatible with the Intel 8/16-bit microprocessor. Its functional units include pre-fetch, decode, control and execute units and three ports (one read and two write ports), and its register bank has 26 registers. The execute unit includes a bus interface, an arithmetic logic unit, a mov unit and a miscellaneous unit for handling microinstructions. As ASYNMPU is pin-to-pin compatible with the Intel 8/16-bit microprocessor, the bus interface unit makes it possible to interface an external synchronous system with the asynchronous processor. In a Von Neumann architecture, the control unit (reading and writing data) and the pre-fetch unit (fetching instructions) may access the bus interface at the same time, resulting in a metastable state. In ASYNMPU, the bus interface unit has an arbiter [82], [83] block to avoid such a metastable state.
The design of ASYNMPU addresses the complex features of a CISC microprocessor using asynchronous processing techniques. Among its major features are instruction set compatibility, a controller-sequencer and variable instruction length handling. ASYNMPU uses a 2-phase bundled-data handshake protocol, and its instruction size varies from 1 to 6 bytes. Its performance is equivalent to that of an Intel 8/16-bit microprocessor using a 33MHz clock, and the calculated average power dissipation is 110mW. The bus interface unit of ASYNMPU uses 11mW in the busy state, whereas it uses 0.73mW in the idle state: a distinctive feature of asynchronous implementation.
15) ECSTAC
Event Controlled Systems Temporally specified Asynchronous CPU (ECSTAC) [84], [85] is a fast asynchronous microprocessor based on the event controlled system design methodology. It is a linear-pipeline Harvard architecture and uses a RISC-like ISA with an 8-bit data path and a 24-bit address path. This mismatch results in a performance trade-off. The ECSTAC architecture includes a program counter, instruction and data caches, an instruction decode FIFO, operand fetch, ACS (24-bit adder, comparator, and stack processing), an ALU, an order unit, registers and a scoreboard. The instruction decode unit is heavily pipelined to accommodate the data received from the instruction cache. After the instruction decode stage, the operand fetch stage fetches the operands required from the registers. Its output is combined with the immediate value (if any) and sent to the ACS unit, which performs the 24-bit address offset addition. The ACS unit also checks the jump condition: if it is true, a signal is sent to all preceding units to invalidate the instructions within them because of the branch.
The instruction decode unit includes a stack pointer that provides the address for reading from and writing to data memory. The order unit maintains the instruction order of the ACS and records the unit (ALU or data memory) used by each instruction. Furthermore, the order unit multiplexes the data buses of the ALU and data memory to the register bank (sixteen 8-bit general purpose registers and a flag register), and returns events from the register bank to the ALU and data memory. The scoreboard scheme, on the other hand, is used to prevent data hazards. It is based on transition signaling operating in fundamental mode. The processor gives a peak performance of 28 MIPS when fabricated in the ES2 0.7µm DLM CMOS process.
16) TinyRISC TR4101 MICROPROCESSOR CORE
ARISC [86] is an asynchronous implementation of the TR4101 embedded microprocessor core based on the Harvard architecture. Its pipeline structure includes fetch, decode, flush, register read, register write, issue and execution units. The PC register holds the address of a token (a 32-bit instruction word), which the fetch unit uses to provide the token to the decode unit. After the instruction is decoded, it is sent to the flush unit, which checks the branch condition. If the branch condition is true, the instruction pipeline is flushed. The issue unit issues each instruction to the relevant execution unit when its operands are ready and issues the new PC value for the PC-ALU.
An opaque latch controller is used at the input of each parallel execution unit for faster and more power-efficient instruction execution. The register locking mechanism ensures correct register writes and reads, which avoids data hazards. The ARISC microprocessor operates in MIPS-II/MIPS16 modes, where all units, except the fetch and decode units, operate independently. The data path is designed and verified using the Verilog hardware description language and the Synopsys synthesis tool, respectively. The speed-independent control logic is designed partly by hand and partly using the Petrify tool. The design was simulated (and can be used) in three different configurations: 1) separate instruction and data caches, 2) a shared cache, and 3) a synchronized bus interface and a synchronous shared memory module @83MHz. The 4-phase bundled-data protocol and normally opaque latch controllers (where needed) are used for communication. High MIPS are achieved with the asynchronous configuration: configuration 1) gives a performance of 74MIPS and 635MIPS/W (Stanford benchmark) and 123MIPS (peak benchmark) with V_dd = 3.3V.
17) ASPRO-216
ASPRO [87] is a standard-cell QDI 16-bit RISC asynchronous microprocessor based on A. Martin's method and designed specifically for embedded applications. It is a scalar processor in which instructions issue in order and complete out of order. The fetch-decode loop includes the PC-unit, which sends addresses to the program memory interface. The program memory interface fetches instructions either from on-chip memory or from external memory. The external and program memories are 48K and 16K words, respectively, and the instruction words are 24 bits wide. The external and program memories, together with the instruction decoder, work in the fetch-decode loop. The decoder sends information to the data-path loop and an acknowledge to the PC-unit. At this point, if a branch or an unconditional branch is taken, the PC-unit sends the target address; otherwise, the incremented address is sent to the fetch-decode loop.
The data-path loop includes the register file (16 general purpose registers), the bus interface and the processing units. The processing units include an ALU, a branch unit, a load/store unit and a custom unit (for future enhancement). The ALU has "Min" and "Max" instructions that are used for image processing, bit reversing used in FFT computation, and "Slt" for carry and overflow testing. The architecture also has data memories (64Kbytes, byte/word addressable), in which a 256-word area is reserved for peripherals (accessed with dedicated instructions). The ASPRO design was completed using a standard cell library of a 0.25µm 5-metal-layer CMOS technology with automatically generated RAM, and gives a peak performance of 140MIPS [42].
18) 80C51 MICROCONTROLLER
The 80C51 microcontroller in [36] is an asynchronous implementation of an 8-bit CISC-type microcontroller aimed at power efficiency. The asynchronous 80C51 is fully bit and timing compatible with the synchronous 80C51. It is based on latches with latch-enable signals as well as demand-driven peripherals. The design is described in the Tangram description language [88]. A standard-cell gate-level netlist is obtained from the Tangram description via an intermediate handshake-circuit level. The handshake circuits use a 4-phase bundled-data protocol. The design is processed in 0.5µm 3-metal-layer CMOS and contains a data RAM of 256bytes and a program ROM of 16Kb. Gate-level simulation of the asynchronous 80C51 yields 2.10MIPS (943MIPS/W) when memories are excluded and worst-case conditions are assumed.
19) AMULET3
AMULET3 [89], the successor of AMULET1 and AMULET2e, is a 100 MIPS asynchronous embedded processor based on the Harvard architecture. It is a viable asynchronous processor for commercial use, as it supports version 4T of the ARM architecture and the 16-bit Thumb compressed instruction set (see footnote 2; for more detail on Thumb, see [90]). The AMULET3 processor has six pipeline stages: instruction prefetch, instruction decode, execute, data memory reference, reorder buffer, and register result write-back, as shown in fig. 19. The branch prediction unit in the instruction prefetch stage supports Thumb code. The decode and register read stages include logic, such as ARM and Thumb decode, and mechanisms such as register read and forwarding. Thumb instructions are handled either by decoding critical control signals directly with the Thumb instruction decoder, or by first converting them to equivalent ARM instructions and then using the ARM instruction decoder. The register read and forwarding stage looks up each operand in the register file and searches the reorder buffer if the operand is not available. The forwarding process stalls until the value becomes valid. The AMULET3 register file has three read ports.
The execution stage has an adder with a carry arbitration scheme, a multiplier (which computes a 32x32 product in approximately 20ns) and a shifter. The program status register logically belongs to the execution stage. The reorder buffer stage stores results from the execution pipeline stage and the data memory interface. The results in the reorder buffer are written back to the register file in order and may also be used for forwarding. AMULET3 uses a 4-phase communication protocol, achieves higher performance than its predecessors and operates at up to 120MIPS (Dhrystone 2.1).
20) A8051
A8051v1 [91] is a novel asynchronous pipeline architecture for a CISC-type embedded controller and is compatible with the Intel 8051. It proposes an optimized instruction execution scheme that skips redundant and bubble states and uses only the required stages. The A8051v1 is a multi-looping pipeline architecture and handles the variable-length instructions (1 to 3 bytes) of a CISC-type machine. It has a 5-stage pipeline that includes instruction fetch (pre-decode with a branch predictor unit), instruction decode, operand fetch, microinstruction execution and write-back. The instruction decode unit checks data dependencies on previous instructions. The microcontroller uses a 4-phase handshake protocol, dual-rail encoding and the delay-insensitive timing model. It was realized in 0.35µm CMOS technology, and the performance measured by the designers was 75.5MIPS with the 5-stage pipeline. Without the pipeline, the A8051v1 delivers a performance of 35.8MIPS.
Footnote 2: The effect on the processor is the same as that of a 32-bit ARM instruction.
21) THE LUTONIUM MICROCONTROLLER
Lutonium [92] is an asynchronous implementation of the 8-bit CISC-type 8051 microcontroller optimized for low $ET^2$, where E is the average energy per instruction and T is the cycle time. Based on the Harvard architecture, the 8051 microcontroller supports 255 variable-length instructions, each one to three bytes long. The Lutonium architecture, as shown in fig. 20, includes fetch/IMem, decode, execute, branch and register units. Fetch/IMem includes the fetch unit, the program memory (holding up to 8kB of code) and a switch box. Two bytes can be fetched from program memory at a time, and the switch box routes the instruction to the decode unit. Optimization of the fetch loop gives an average throughput of 1.37bytes/cycle.
The decode unit is decomposed into several processes (control0 and control1), which consume dynamic energy only as required. Control0 decodes the first byte of the instruction (the opcode), and the next two bytes (if any) are decoded by control1. The decoded bytes are sent to the appropriate execution unit. Lutonium stops all switching activity in deep sleep mode, in which only counters operate, and wakes up from it without any delay. Lutonium uses a 4-phase handshake protocol and the quasi-delay-insensitive timing model. The performance of the Lutonium prototype implemented in the TSMC SCN018 0.18µm CMOS process by MOSIS at 1.8V was estimated at 200MIPS (1800MIPS/W).
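The reported figures can be related to the $ET^2$ metric with a short calculation. The sketch below assumes that one instruction completes per cycle, so that the cycle time T is the inverse of the instruction throughput; this is a simplification made here for illustration, and the function names are hypothetical.

```python
def energy_per_instruction(mips_per_watt):
    """Joules per instruction: MIPS/W equals millions of instructions per joule."""
    return 1.0 / (mips_per_watt * 1e6)


def cycle_time(mips):
    """Seconds per instruction, assuming one instruction per cycle."""
    return 1.0 / (mips * 1e6)


def et2(mips, mips_per_watt):
    """Energy-delay-squared product E * T^2 in joule * second^2."""
    return energy_per_instruction(mips_per_watt) * cycle_time(mips) ** 2


# Figures quoted in the text for the Lutonium estimate at 1.8V.
lutonium_e = energy_per_instruction(1800)   # about 556 pJ per instruction
lutonium_t = cycle_time(200)                # 5 ns per instruction
lutonium_et2 = et2(200, 1800)
```

The same helpers apply to any entry in this survey that reports both MIPS and MIPS/W, which makes entries reported in different units easier to compare.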
22) MODELLING SAMIPS
SAMIPS [93], [94], a synthesizable asynchronous MIPS processor, is based on the MIPS application architecture. The main purpose of the design was to serve as a test case in an integrated formal verification and distributed simulation environment [95]. The SAMIPS pipeline consists of five stages: instruction fetch, instruction decode, execution, memory, and write-back. In the instruction decode stage, the register read/write operation is performed. The instruction decode unit, based on asynchronous design, checks the six MSBs and five LSBs (in R-type only) to generate control signals bundled with data. The processor handles data hazards using a forwarding mechanism based on history information recorded in the DHdetection unit inside the register bank.
The execution unit includes an ALU (without multiplication and division operations), a functional unit for branch tests, a shifter, a color matching mechanism and a ForWarding mechanism (FW) unit. For control hazards, SAMIPS uses the coloring mechanism first used in AMULET1 [69]. In this mechanism, one bit represents the state of the processor as well as of the instructions fetched at a particular moment. When the instruction and processor color bits mismatch, the instruction is discarded. The processor color bit changes on the termination of the instruction stream. The SAMIPS model is described in Balsa, a CSP-based asynchronous hardware description language and synthesis tool, and LARD [57] is used for behavioral simulation.
23) SENSOR NETWORK ASYNCHRONOUS PROCESSOR
Sensor Network Asynchronous Processor/Low Energy (SNAP/LE) [96] is an ultra-low-power processor for sensor networks. It is a 16-bit data-driven RISC core based on the ISA of SNAP [97] (MIPS architecture) and optimized for data monitoring in sensor networks. It has an extremely low-power idle state and a very fast wake-up response. The target sensor node remains idle most of the time, which makes the asynchronous technique the best choice for processors in such nodes. The asynchronous processor design has hardware support for commonly occurring sensor network operations. The low power consumption of the processor maximizes the network lifetime. The SNAP core includes an event queue, instruction fetch, decode and execute units, buses, a register file, message FIFOs and memories (two on-chip 4KB memory banks).
The execution units include an adder, a logic unit, a load-store unit (for memory), a timer unit (for timer coprocessor interfacing), a jump/branch unit, a linear-feedback shift register and a shifter. Two buses are used, a fast one and a slow one; the execution units commonly used in sensor network operations are placed on the fast bus. The SNAP architecture is completely event-driven and remains idle until an external event arrives in the event queue. After an event has been handled, the DONE instruction halts the processor until the next event appears in the event queue. A 4-phase protocol and the quasi-delay-insensitive timing model were adopted for the asynchronous circuits. The processor shows a performance of 200MIPS @ 1.8V while consuming ≈ 218pJ/instruction in the 0.18µm TSMC process.
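The event-driven execution model can be sketched as a simple loop over an event queue. The handler table, the event names and the instruction strings other than DONE are hypothetical; the sketch only illustrates that the processor does work when an event arrives and goes idle again at DONE.

```python
from collections import deque


def run_event_driven(events, handlers):
    """Process queued events; each handler runs until its DONE instruction."""
    queue = deque(events)
    trace = []
    while queue:                      # an empty queue means the core stays idle
        event = queue.popleft()       # wake up on the next event
        for instruction in handlers[event]:
            if instruction == "DONE":
                break                 # halt until another event is available
            trace.append((event, instruction))
    return trace


# Hypothetical handlers for two event sources of a sensor node.
handlers = {
    "timer": ["read_sensor", "store_sample", "DONE"],
    "radio": ["forward_message", "DONE"],
}

trace = run_event_driven(["timer", "radio"], handlers)
```

Between events nothing in this loop executes, which mirrors the absence of switching activity in an idle asynchronous core.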
24) BITSNAP
BitSNAP [98] applies dynamic significance compression to a low-energy sensor network asynchronous processor based on the SNAP ISA [97], [96]. It uses a bit-serial data path with dynamic significance compression to achieve low energy consumption. The processor is proposed as a logical extension of the SNAP/LE [96] processor. On the bit-serial data stream, BitSNAP employs a dynamic adaptive compression known as length-adaptive data (LAD), making it a length-adaptive data path processor. For each word, only the delimiter bit and the bits prior to it are sent instead of the full 16-bit word, which reduces switching activity. The architecture of BitSNAP is similar to that of SNAP/LE with some modifications: dynamic significance compression and conversion of the bit-parallel data path to a bit-serial one. All execution units, the register file and all data path bus split and merge units are bit-serial units operating on LAD digits. The memories, ShareI (which shares the input word with the decode unit or data path) and the fetch unit are bit-parallel circuits. The BitSNAP processor uses special hardware for interfacing bit-parallel data to bit-serial units and LAD data to bit-parallel units. In a 0.18µm CMOS process, the expected speed of BitSNAP reported by the designers was 6 and 54MIPS, while consuming 17pJ/ins at 0.6V and 152pJ/ins at 1.8V, respectively.
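The effect of sending only the significant bits followed by a delimiter can be illustrated with a small sketch. The encoding below is an assumption made for illustration (unsigned words, least significant bit first, a symbolic delimiter `"D"`, zero extension on the receiving side); it is not the LAD digit encoding defined in [98].

```python
WIDTH = 16  # word width of the data path


def compress(word):
    """Emit the significant bits of an unsigned word, LSB first, then a delimiter."""
    if not 0 <= word < (1 << WIDTH):
        raise ValueError("word does not fit in the data path width")
    digits = []
    while word:
        digits.append(word & 1)
        word >>= 1
    digits.append("D")  # delimiter: all higher-order bits are implied zeros
    return digits


def expand(digits):
    """Rebuild the full-width word by zero-extending past the delimiter."""
    value = 0
    for position, digit in enumerate(digits):
        if digit == "D":
            break
        value |= digit << position
    return value


stream = compress(5)            # 0b101 needs three bits plus the delimiter
restored = expand(stream)
saved = WIDTH - len(stream)     # digits not transmitted for this word
```

Small values, which are common in sensor data, occupy only a few digits on the serial data path in this scheme, while a full-width value costs one extra digit for the delimiter.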
25) HT80C51
The HT80C51 microcontroller [99], an asynchronous implementation of the 80C51, is a commercial product by Handshake Solutions (Philips). It is functionally compatible with the instruction set and peripherals of the 80C51 and adds some unique features. These features include extremely low power, low electromagnetic emission, low supply-current peaks, zero standby power with immediate wake-up, and asynchronous and optional synchronous modes. The single-process HT80C51 executes an instruction in sequential phases: fetch, decode, read, execute and write. The structure of the microcontroller is based on the Harvard architecture, and the instructions are of variable length (one, two or three bytes). The high-level programming language Haste, from the (technology-independent) Handshake Technology design flow, was used for designing the microcontroller and its peripherals. The non-pipelined asynchronous HT80C51 was realized in 0.14µm CMOS with a transistor count of 30820 and a worst-case performance of 8.9 MIPS @ 1.8V, 0.7mW [100].
26) ASYNCHRONOUS 8051 MICROCONTROLLER CORE
A8051 [101] is an asynchronous implementation of the Intel 8051 microcontroller for low-voltage and low-energy applications (hearing aids). The designers used a number of techniques to achieve low power dissipation. To minimize system activity, they used a two-stage asynchronous pipeline consisting of instruction fetch and decode-execute stages that operate independently. The design does not use a predictive approach and includes indirect RAM access. Using a partial decoding algorithm, the most significant nibble of the instruction is decoded to identify the type of operation, and the least significant nibble identifies the addressing mode. As is common in microcontrollers, A8051 comprises registers, latches and decoder-based memory, which together contribute to a larger area. To reduce the area of the proposed A8051 and enhance its performance, the designers proposed a methodology for interfacing the asynchronous system with synchronous IP memory blocks (RAM and ROM) [102].
The proposed 8-bit asynchronous microcontroller contains a 4K×8 ROM as instruction memory and a 128×8 RAM as data memory in a Harvard architecture. The design was completed using Balsa as the electronic design automation tool. The microcontroller was fabricated within the dual-core microcontroller system DC8051 [103], [104], which has two modes of operation: the synchronous mode is based on the Synopsys DW8051 IP core, whereas the asynchronous mode is based on A8051. The cores share 1kbyte ROM and 128byte RAM as well as 1kbyte external RAM. The DC8051 system was implemented in 130nm CMOS, and the measured performance of A8051 on the Dhrystone v2.1 benchmark was reported as 7.4MIPS, consuming 349 pJ/I.
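The partial decoding step can be sketched as a split of the opcode byte into its two nibbles. The function name is hypothetical, and how the two nibbles are mapped to operations and addressing modes inside A8051 is not specified here; the two sample opcodes are standard 8051 encodings of ADD.

```python
def partial_decode(opcode):
    """Split an 8-bit opcode into (operation nibble, addressing-mode nibble)."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must be a single byte")
    operation = (opcode >> 4) & 0xF   # most significant nibble: type of operation
    mode = opcode & 0xF               # least significant nibble: addressing mode
    return operation, mode


# ADD A,#data (0x24) and ADD A,R0 (0x28) share the operation nibble
# and differ only in the addressing-mode nibble.
add_immediate = partial_decode(0x24)
add_register = partial_decode(0x28)
```

Decoding the two nibbles separately lets the decode-execute stage select the operation and the operand access path independently of each other.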
27) VORTEX PROCESSOR
The Vortex [105] processor is a superscalar asynchronous processor design. The Vortex CPU supports a 32-bit integer data path and executes up to nine instructions per cycle. The Vortex architecture prototype is shown in fig. 21. It includes a dispatcher (instruction decoder and control signal generator), a crossbar (input/output router) and functional units. All the parallel functional units communicate through the central crossbar instead of a register file. Each instruction consists of two parts: a prefix and a body. The prefix is used by the dispatcher to choose the destination of the instruction: a specific functional unit or the crossbar. The asynchronous low-level circuitry is based on the "integrated pipelining" templates [25]. The processor was fabricated as part of Testchip2, realized in the 0.15µm G process by TSMC.
28) ARM996HS PROCESSOR
ARM996HS [106] , the first licensable clockless processor core, is a 32-bit RISC asynchronous processor core with a Harvard architecture. The processor is fully compatible with the ARMv5TE ISA and debug architecture and supports the 16-bit Thumb instruction set. The ARM996HS processor has a 5-stage integer pipeline that includes fetch, decode, execute, memory and write-back stages. All these stages are connected to a pipeline control unit. It has a fast 32-bit multiply-accumulate (MAC) block, a divide coprocessor and tightly coupled memory. The memory protection unit and the non-maskable interrupt provision are used for security enhancement.
Low electromagnetic emission, ultra-low power and high robustness were the guiding design principles, and all were successfully achieved. Code compiled for the ARM9E CPU family can run on ARM996HS. The processor was implemented using handshake technology with the Timeless Design Environment (TiDE) design flow, which is based on the Haste high-level design entry language (formerly known as Tangram). The TiDE design flow is a frontend to synchronous EDA tools. The Tiempo handshake interface is used to adapt to changes in environmental conditions such as supply current, voltage and temperature. The performance measured under worst-case conditions (1.08V, 125 °C) was 54 Dhrystone MIPS (DMIPS), and under nominal conditions (1.2V, 25 °C) it was 83 DMIPS. These figures are based on post-layout netlist simulation using the Artisan Sage-X library for the 0.13µm TSMC process.
29) TAM16 MICROCONTROLLER
TAM16 [51] is a 16-bit clockless microcontroller IP core by Tiempo. It has a complete and power-efficient instruction set along with an adapted software development kit for ultra-low-power applications. The software development kit includes a linker, an assembler, and an instruction set debugger and simulator. The instruction set can be customized easily to make it binary compatible with customers' other microcontrollers. The embedded peripherals in TAM16 are two memory interfaces, one UART, three cascadable timers, a 16-bit Programmable Input/Output (PIO), an interrupt controller and boot configuration pins.
With Tiempo technology, the IP is designed for ultra-low noise, ultra-low power consumption and ultra-low EMI. These features make TAM16, a fully asynchronous and DI processor, robust against fault injection. It is a commercial product for ultra-low-power embedded electronic chips. TAM16, for example, is used in RFID tags, sensor networks, smart cards, e-metering devices and chips requiring low electromagnetic emission, such as those used in the medical, aeronautics and automotive industries. TAM16 is available as a placed-and-routed, silicon-proven Verilog netlist. It was designed and fabricated as a test chip in 130nm CMOS technology. It achieves 7.1 and 15.5MIPS at 0.7V and 1.2V, respectively. The core consumes 33.4 and 49µA/MIPS at 0.7V and 1.2V, respectively.
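The TAM16 figures can be combined into supply current, power and energy per instruction. The following is a back-of-the-envelope sketch, not a result reported in [51]: it assumes that the quoted µA/MIPS value holds at the quoted MIPS rating for the same supply voltage, and that power is simply supply voltage times current. The function name is illustrative.

```python
def core_figures(mips, ua_per_mips, volts):
    """Return (current in uA, power in uW, energy per instruction in pJ).

    ASSUMPTION: the uA/MIPS figure applies at the given MIPS rating and
    power equals supply voltage times supply current.
    """
    current_ua = mips * ua_per_mips      # uA
    power_uw = current_ua * volts        # uA * V = uW
    # (uA/MIPS) * V = uW per MIPS = 1e-6 W / 1e6 instr/s = pJ per instruction
    energy_pj = ua_per_mips * volts
    return current_ua, power_uw, energy_pj


if __name__ == "__main__":
    for mips, ua, v in ((7.1, 33.4, 0.7), (15.5, 49.0, 1.2)):
        i, p, e = core_figures(mips, ua, v)
        print("%.1fV: %.1f uA, %.1f uW, %.2f pJ/instruction" % (v, i, p, e))
```

Under these assumptions, the core draws about 237 µA (roughly 166 µW, 23.4 pJ per instruction) at 0.7V and about 760 µA (roughly 911 µW, 58.8 pJ per instruction) at 1.2V.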
30) AsynRISC
AsynRISC [107] is an asynchronous pipelined processor with an instruction set similar to that of the MIPS R2000. It has five pipeline stages, as shown in fig. 22 : instruction fetch (IF), instruction decode and register fetch, instruction execution or memory address calculation, memory access, and register write back. Control hazards are resolved using two one-bit registers, instcolor (in the instruction fetch stage) and syscolor (in the execution stage). Because all control transfers take place in the execution stage, every new instruction in the IF stage is tagged with the color bit of the instcolor register before it proceeds to the next stages. The execution stage later compares this color bit with the syscolor register. If the color bits match, the instruction is executed normally; otherwise, it is discarded. Data hazards are resolved by adding two extra fields to every general-purpose register. A pending bit indicates whether the register is up to date or waiting for new contents. A two-bit field, known as the pending instruction index, records the instruction that produces the new contents of the register. At the IF stage, a 2-bit instruction index register assigns an index to every new instruction. Because at most four instructions reside in the datapath, two bits are enough for this register. The processor was designed in the Balsa asynchronous hardware description language and verified with the Balsa simulation system. Performance was measured by executing a program of 500 dynamic instructions, which took 21256799 time units to execute.
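The color-bit scheme for control hazards can be sketched as a small behavioral model. The text does not state when the two registers change value, so the toggling policy below (both bits flip when a taken control transfer executes) is an assumption made for illustration, as are all class and function names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instr:
    name: str
    taken_branch: bool = False  # True if this instruction redirects control flow
    color: int = 0              # color bit attached in the IF stage


class ColorBitPipeline:
    def __init__(self) -> None:
        self.instcolor = 0  # one-bit register in the instruction fetch stage
        self.syscolor = 0   # one-bit register in the execution stage

    def fetch(self, instr: Instr) -> Instr:
        """Tag a newly fetched instruction with the current instcolor."""
        instr.color = self.instcolor
        return instr

    def execute(self, instr: Instr) -> bool:
        """Return True if the instruction executes, False if it is discarded."""
        if instr.color != self.syscolor:
            return False  # fetched on the wrong path: discard
        if instr.taken_branch:
            # ASSUMPTION: a taken control transfer flips both color registers,
            # so instructions already in flight no longer match syscolor.
            self.syscolor ^= 1
            self.instcolor ^= 1
        return True


def run_example() -> Tuple[List[str], List[str]]:
    pipe = ColorBitPipeline()
    # Four instructions are fetched before the branch reaches execution.
    in_flight = [
        pipe.fetch(Instr("add")),
        pipe.fetch(Instr("beq", taken_branch=True)),
        pipe.fetch(Instr("wrong1")),
        pipe.fetch(Instr("wrong2")),
    ]
    executed: List[str] = []
    discarded: List[str] = []
    for instr in in_flight:
        (executed if pipe.execute(instr) else discarded).append(instr.name)
    # The branch target is fetched after the redirect and carries the new color.
    target = pipe.fetch(Instr("target"))
    (executed if pipe.execute(target) else discarded).append(target.name)
    return executed, discarded
```

In this model, the two instructions fetched behind the taken branch carry the old color and are dropped in the execution stage, while the branch target carries the new color and executes normally.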
31) A8051v2
A8051v2 [108] , the second version of the asynchronous 8051 [91] , is a low-power implementation employing an adaptive pipeline structure. Although the system architecture and instruction execution scheme differ in many respects, the instruction set architecture of the design is fully compatible with the Intel 8051 [109] . The major changes include additional features for multi-cycle instructions: multi-looping control, branch prediction (for unconditional branches) and single threading (in the execution stage). It was realized using an adaptive micropipeline that skips and combines pipeline stages to gain power efficiency and performance. The stage-skipping and stage-combining mechanisms are controlled by two extra inputs bundled with the latched data: EL_N for the latch controller and EC_N for the pipeline stage. The EC_N signal decides whether or not to skip the operation of the Nth pipeline stage, and the EL_N signal determines whether or not the Nth latch is transparent. A8051v2 was simulated with the NanoSim tool and mapped to Hynix 0.35 µm CMOS technology with a nominal voltage of 3.3V. At 3.3V, A8051v2 achieves 84.2 MIPS (2316MIPS/W), measured by executing the Dhrystone V2.1 benchmark.
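The effect of stage skipping and stage combining can be sketched with a purely behavioral model. This is not the circuit of [108]: the per-stage flags `skip_stage` and `latch_transparent` stand in for the EC_N and EL_N signals, their polarity is an assumption, and the count of latch events is only a rough proxy for switching activity.

```python
from typing import Callable, Sequence, Tuple

Stage = Callable[[int], int]


def run_adaptive_pipeline(
    value: int,
    stages: Sequence[Stage],
    skip_stage: Sequence[bool],         # stands in for EC_N (polarity assumed)
    latch_transparent: Sequence[bool],  # stands in for EL_N (polarity assumed)
) -> Tuple[int, int]:
    """Pass one data item through the pipeline; return (result, latch events)."""
    if not (len(stages) == len(skip_stage) == len(latch_transparent)):
        raise ValueError("one flag of each kind is needed per stage")
    latch_events = 0
    for n, stage in enumerate(stages):
        if not skip_stage[n]:
            value = stage(value)        # stage N performs its operation
        if not latch_transparent[n]:
            latch_events += 1           # latch N captures the data
        # A transparent latch lets the data flow on, merging N with N+1.
    return value, latch_events


if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    print(run_adaptive_pipeline(5, stages, [False] * 3, [False] * 3))
    print(run_adaptive_pipeline(5, stages, [False, True, False],
                                [False, True, False]))
```

In the second call, the middle stage is skipped and its latch is transparent, so the data item passes through fewer latches than in the fully active pipeline.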
32) PA8051
PA8051 [110] is a pipelined asynchronous 8051 soft-core microcontroller implemented in the Balsa hardware description language. Its design consists of a 5-stage pipeline, as shown in fig. 23 : instruction fetch, instruction decode, operand fetch, execution and write back. The design also includes a memory unit that is not part of the pipeline. The instruction fetch stage contains a ROM interface, a fetch controller and two buffers (serving as an instruction cache for the program memory).
The memory unit provides the memory interface to RAM-READ-ARBITOR, where the arbiter [111] is used to read/write data and to read instructions for the fetch and instruction decode units. The memory exchange with the write back unit is likewise arbitrated by MEM-INTERFACE, as shown in fig. 23 . The PA8051 microcontroller uses the 4-phase bundled-data communication protocol to reduce area cost. The design is described in Balsa, a CSP-based asynchronous HDL, and synthesized into a Xilinx netlist. Synthesis was completed with Xilinx ISE for the target device, a Spartan-IIE 300 ft256 FPGA.
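The 4-phase bundled-data protocol can be sketched as a sequence of four signal events per transfer. The model below is a generic return-to-zero handshake, not the PA8051 implementation: it assumes the data is valid before the request rises, and the class and signal names are illustrative.

```python
from typing import List, Optional


class BundledDataChannel:
    """One request wire, one acknowledge wire and a bundle of data wires."""

    def __init__(self) -> None:
        self.req = 0
        self.ack = 0
        self.data: Optional[int] = None
        self.trace: List[str] = []

    def transfer(self, value: int) -> int:
        """Move one value from sender to receiver with a 4-phase handshake."""
        assert self.req == 0 and self.ack == 0, "channel must start idle"
        # Phase 1: sender places the data, then raises the request.
        self.data = value
        self.req = 1
        self.trace.append("req+")
        # Phase 2: receiver latches the data, then raises the acknowledge.
        latched = self.data
        self.ack = 1
        self.trace.append("ack+")
        # Phase 3: sender withdraws the request.
        self.req = 0
        self.trace.append("req-")
        # Phase 4: receiver withdraws the acknowledge; channel is idle again.
        self.ack = 0
        self.trace.append("ack-")
        return latched


if __name__ == "__main__":
    channel = BundledDataChannel()
    print([channel.transfer(v) for v in (0x12, 0x34)], channel.trace)
```

Only two control wires accompany the single-rail data bundle regardless of its width, which is why bundled data tends to cost less area than encodings that use two wires per bit.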
33) NCTUAC18
NCTUAC18 [112] is a quasi-delay-insensitive microprocessor core implementation for microcontrollers. It is an 8-bit asynchronous microprocessor core with the PIC18 instruction set. The 4-stage pipeline of NCTUAC18 includes instruction fetch, instruction decode, operand fetch and execution/write back stages. The instruction decode stage includes the instruction decode block, branch control block, stall control and NPC control. The instruction decode block generates the control signals for the whole processor and checks whether or not the instruction is a conditional branch. If it is a conditional branch, the instruction decode block requests the branch control block to take over; otherwise, it requests NPC (for the next PC value). On a conditional branch instruction, the stall control generates a request signal to NPC to generate the same PC value so that the instruction is retrieved again. The NPC control is responsible for the correct generation of the PC value. In NCTUAC18, the execution and write back stages are combined into one stage.
FIGURE 23. PA8051 architecture overview: reprinted from [110] .
The design uses a Muller pipeline with the 4-phase protocol, dual-rail encoding and a quasi-delay-insensitive timing model. It was verified using ModelSim 6.0, and its gate-level design was synthesized using Altera Quartus II software for the target FPGA, an Altera Cyclone EP1C20F400C8. The maximum path delay in the instruction decode stage, the critical stage of the design, was ≈455ns. The designers acknowledged that the design handles branch instructions inefficiently and that the quasi-delay-insensitive model makes the circuit design difficult. Later, NCTUAC18S [113] was introduced with a new stall mechanism that handles branch instructions effectively. Furthermore, an acknowledge wire was added between the instruction decode and write back stages. The write back stage uses this wire to generate an acknowledgement on completion of the previous instruction. With this modification, a branch instruction is stalled in the instruction decode stage until the acknowledgement is received. NCTUAC18S was implemented with a dual-rail Muller pipeline and a 5-stage pipeline with separate execution and write back stages. In the modified design, it is possible to write and read data at the same time. The design was modeled at gate level in Verilog, verified with ModelSim 6.0 and synthesized using Design Compiler with the TSMC 0.13µm process library.
34) DRAP
The Dynamically Reconfigurable Asynchronous Processor (DRAP) [114] , [115] is a processor based on a novel clocked architecture called RICA [116] . The main design goal was an architecture that meets the demands of high-throughput mobile applications for energy efficiency and programmability in high-level languages. The DRAP processor consists of a heterogeneous array of coarse-grain asynchronous cells implemented using a reconfigurable data-path architecture. An abstract and comprehensive view of the architecture is presented in [114] and [117] . The design is based on interconnected operational cells, each of which performs a limited set of operations such as logic operations, addition and multiplication. The interconnection design for the sample array is based on the island-style structure found in standard FPGAs [118] . Configurable routing switches are placed around the operational cells to allow each cell to interface with its four nearest neighbors. In addition, the routing switches support the handshake signals and perform conditional acknowledge synchronization using the technique presented in [119] . The sample array in DRAP contains 400 18-bit asynchronous operational cells, as listed in Table 2 . These cells are interconnected using multiplexer-based switches.
Different blocks of instructions are executed by changing the configuration of the operational cells and interconnects, in a manner similar to a CPU. For general-purpose applications, an adequate mixture of cells is selected manually, while for a specific application other, more specific cells can be selected. Integrated circuits and the interconnect switches are controlled by configuration bits stored in the program memory. A total of 9260 configuration bits are required for the reconfigurable core with the selected types of operational cells and interconnects. The program and data memories are connected to each other through special cells of the core. The 4-phase single-rail handshake protocol was adopted for the design of the operational cells. A network of programmable switches provides the interconnection for data-path creation.
The sample array was realized in UMC 0.13µm technology and compared with the following architectures: Custom RICA 400 [116] (0.13µm), ASIC (0.13µm), ARM7TDMI-S [120] (0.13µm) and the TI C64x 8-way VLIW [121] . The benchmarks used to evaluate the design are bilinear demosaicing [122] , an 8K-point radix-2 1-D FFT [123] and a 2-D DCT [124] . For each benchmark, the power consumption of each design was calculated at the same throughput. The power and area figures of the Custom RICA 400, the sample DRAP and the ASIC design were obtained from post-layout simulation with PrimePower (from Synopsys). The figures for the ARM7TDMI-S core and the TI C64x are provided in [120] and [125] , respectively. All figures are measured at 1.2V, and the energy figures cover only the data path, without memory. Comparison results are listed in Table 3 .
35) ASYNCHRONOUS NEURAL SIGNAL PROCESSOR
The asynchronous neural signal processor [126] is a 0.25V, 460nW processor with an inherent leakage-suppression design for spike sorting. Spike sorting is performed in three steps in this processor: spike detection, alignment and feature extraction. The spike-sorting algorithm employed in this design exhibits the most suitable power-density characteristics for real-time wireless neural signal processing [127] . The processor receives 8-bit digital data from a neural signal acquisition front-end running at 20kHz. The synchronous-asynchronous interface converts the synchronous input into 4-phase dual-rail data. All modules communicate using the 4-phase dual-rail handshaking protocol. The block diagram of the asynchronous neural signal processor is shown in fig. 24 . Both synchronous and asynchronous versions of the processor were realized in 65nm CMOS for performance comparison. The asynchronous prototype shows a 2.3x reduction in power.
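The conversion of an 8-bit sample into 4-phase dual-rail data can be sketched as follows. This is a generic illustration of dual-rail encoding with a return-to-zero spacer, not the interface circuit of [126]. The rail ordering (true rail first) and all function names are assumptions.

```python
from typing import Iterable, Iterator, List, Tuple

Rail = Tuple[int, int]  # (true_rail, false_rail); this ordering is an assumption


def encode_dual_rail(value: int, width: int = 8) -> List[Rail]:
    """Encode each bit on two wires: 1 -> (1, 0), 0 -> (0, 1). LSB first."""
    return [(1, 0) if (value >> i) & 1 else (0, 1) for i in range(width)]


def spacer(width: int = 8) -> List[Rail]:
    """All rails low: the empty codeword that separates two data words."""
    return [(0, 0)] * width


def is_valid(word: List[Rail]) -> bool:
    """Completion detection: every bit has exactly one rail high."""
    return all(t ^ f for t, f in word)


def is_empty(word: List[Rail]) -> bool:
    return all(t == 0 and f == 0 for t, f in word)


def decode_dual_rail(word: List[Rail]) -> int:
    if not is_valid(word):
        raise ValueError("word is not a complete dual-rail codeword")
    return sum(t << i for i, (t, _f) in enumerate(word))


def four_phase_stream(samples: Iterable[int], width: int = 8) -> Iterator[List[Rail]]:
    """Emit each sample as a valid codeword followed by a spacer."""
    for sample in samples:
        yield encode_dual_rail(sample, width)
        yield spacer(width)


if __name__ == "__main__":
    for word in four_phase_stream([0xA5]):
        print(word)
```

Because validity is carried by the data wires themselves, the receiver can detect the arrival of a complete word without a separate request wire or a matched delay.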
36) uaMIPS
The Micro-Watt Asynchronous MIPS (uaMIPS) [128] , a sub-threshold ultra-low-power processor, is an asynchronous implementation of a conventional 8-bit, 5-stage synchronous MIPS processor. Because it was designed for benchmarking purposes, its instruction and data memories are built from flip-flops to simplify the design. Using a pipeline-oriented de-synchronization tool [129] , the synchronous version was converted to a 4-phase bundled-data asynchronous version. Asynchronous elements unavailable in the standard cell library were inserted manually using different techniques. The asynchronous design flow is shown in fig. 25 . A Quasi-Delay-Insensitive (QDI) implementation was created using SystemVerilog CSP and the Proteus backend flow [130] . QDI is not an attractive approach to ultra-low-power design because of its performance/power ratio. The proposed uaMIPS asynchronous (bundled-data) processor in 28nm HKMG with HVT (V_T = 0.6V) shows better power efficiency, as shown in fig. 26 .
FIGURE 26. Comparison between uaMIPS and other ultra-low-power processors. Reprinted from [128] .
37) ASYNCHRONOUS MSP430
The asynchronous MSP430 [131] is a low-power, relative-timing-based asynchronous MSP430 microprocessor. It is an asynchronous implementation of the openMSP430 [132] 16-bit processor with a RISC-type ISA. The MSP430 design has two directly connected finite state machines: decode and execute. In the asynchronous implementation, the data path is nearly identical to that of its parent design. A new conjunctive stateful communication method is employed between the asynchronous finite state machines.
The MSP430 microprocessor uses the 4-phase bundled-data protocol with the relative timing methodology described in [131] . Both the asynchronous and synchronous designs were created in the same computer-aided design (CAD) tool and synthesized for the same IBM 65nm 10SF node with the same EDA tools and scripts. A comparison shows that the synchronous design consumes 5% more area than the asynchronous design. The asynchronous design is on average 33% slower, but consumes less than one-tenth the power and one-seventh the energy per operation of the synchronous design. These figures were obtained by executing different benchmark programs. Furthermore, the asynchronous implementation of openMSP430 shows an improvement in power dissipation.
C. DISCUSSION
Starting with Martin's work [40] , we have surveyed a number of asynchronous microprocessors at an abstract level and compiled their summaries into one document. During this work, we observed that most designers implemented an asynchronous equivalent of an available synchronous benchmark processor, such as MIPS or ARM, and that they adopted various specification methods and tools. The following observations may be considered the conclusions of this work.
1) Most of the proposed asynchronous microprocessors are pipelined architectures.
2) Among the pipelined processors, most of the designers used different numbers of pipeline stages to resolve data and control hazards. Some proposed novel schemes, claiming that the synchronous methods were not directly applicable to asynchronous logic, mainly because of its distributed control.
3) From Table 4 , one can observe that the number of pipeline stages, the technology and the voltage directly affect performance.
4) As is evident from Table 4 , AMULET3 [89] , Lutonium [92] and SNAP/LE [96] showed a better performance-to-power ratio than the others.
IV. CONCLUSION
We have elaborated on asynchronous logic design principles and on the electronic design automation tools available for specifying, modeling, synthesizing and implementing asynchronous circuits and systems. The main objective of this work, besides collecting most of the contributions towards designing asynchronous microprocessors, is to define the asynchronous design flow and summarize the available tools, which, to the best of our knowledge, have been misunderstood or mostly overlooked. We have presented an encyclopedia of the general- and special-purpose asynchronous microprocessors developed so far, irrespective of their classification, signaling mechanisms, architectures and process. We have also presented a thorough evaluation of those processors in terms of performance and area utilization.
