Asynchronous programmable logic for reconfigurable network on chip architectures by Yu, J
  
Asynchronous Programmable Logic for 
Reconfigurable Network on Chip Architectures 
 
 A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy 
 
 
Jing Yu 
B.Sc. (Hons) Computer Science and Technology, Hong Kong Baptist University, China 
M. Eng. Electronic Engineering, RMIT University, Australia 
 
 
 
School of Engineering 
College of Science Engineering and Health 
RMIT University 
 
October 2019 
  
 
 
Declaration 
 
I certify that except where due acknowledgement has been made, the work is that of the author alone; 
the work has not been submitted previously, in whole or in part, to qualify for any other academic 
award; the content of the thesis is the result of work which has been carried out since the official 
commencement date of the approved research program; any editorial work, paid or unpaid, carried out 
by a third party is acknowledged; and, ethics procedures and guidelines have been followed.  
Jing Yu 
October 09, 2019 
  
P a g e  | III 
 
 
 
 
 
 
 
 
 
“Cogito ergo sum” 
René Descartes 
  
Abstract 
Since their invention, reconfigurable systems have evolved from a simple logic replacement function 
to an important technology that enables hardware to be flexibly tuned, reshaped and altered at will to 
optimally suit the design purpose. Reconfigurable architectures such as field programmable gate 
arrays (FPGA) have become popular due to their low non-recurring engineering costs and faster time-
to-market and are now found in diverse applications from large server farms to embedded and 
portable systems. FPGAs have more recently emerged as reconfigurable co-processor platforms 
tightly coupled to microprocessors within integrated system on chip architectures. However, it is 
becoming clear that issues such as extreme process variability in deep sub-micron transistors will 
impact the ability to achieve timing closure in high-performance synchronous FPGAs in the same way 
as it affects conventional ASIC architectures. This is particularly true for low-voltage, low power 
embedded and portable devices. 
This has motivated renewed research into asynchronous approaches, which eliminate the distributed 
clock and replace it with a handshake scheme that automatically adapts to changes in circuit 
behaviour due to the combined effects of process variability, supply voltage and temperature. As one 
of a number of alternative asynchronous approaches, Null Convention Logic (NCL) is a symbolically 
complete, quasi-delay insensitive logic system that is inherently self-determined, locally autonomous 
and self-synchronising. Thus, NCL circuit design is largely insensitive to the delay of each individual 
logic element and can be set up to be “correct-by-construction”. In contrast to conventional Boolean 
gates, NCL gates are designed as majority (threshold) logic with state-holding (hysteresis) behaviour. 
However, NCL circuits cannot be easily mapped onto conventional FPGA systems which contain no 
appropriate delay-insensitive functions. Therefore, this thesis proposes and analyses a novel NCL-
based reconfigurable logic block that is intended to form one component of an asynchronous 
reconfigurable system on a chip. 
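The state-holding (hysteresis) behaviour of an NCL threshold gate mentioned above can be illustrated with a minimal behavioural sketch. This is not the circuit proposed in this thesis. The class name `ThresholdGate`, its fields and the `step` method are hypothetical names chosen for illustration, and the weight list is an assumed way of modelling weighted gates such as TH34w2 (threshold 3, four inputs, weight 2 on the first input). The output is asserted once the weighted count of asserted inputs reaches the threshold, is cleared only when every input has returned to NULL, and otherwise holds its previous value.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdGate:
    """Behavioural model of an NCL THmn gate with hysteresis (illustrative only)."""
    threshold: int          # m: weighted count needed to assert the output
    weights: Sequence[int]  # one weight per input; all 1 for a plain THmn gate
    output: int = 0         # state-holding output, initially NULL (0)

    def step(self, inputs: Sequence[int]) -> int:
        if len(inputs) != len(self.weights):
            raise ValueError("one input value is required per weight")
        weighted_sum = sum(w for w, x in zip(self.weights, inputs) if x)
        if weighted_sum >= self.threshold:
            self.output = 1  # set: threshold reached
        elif not any(inputs):
            self.output = 0  # reset: every input has returned to NULL
        # otherwise hold the previous output (hysteresis)
        return self.output


# TH23 gate: asserts with 2 of 3 inputs, then holds until all inputs are 0.
th23 = ThresholdGate(threshold=2, weights=[1, 1, 1])
trace = [th23.step(v) for v in ([1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0])]
print(trace)  # [0, 1, 1, 0]
```

The third input pattern shows the hysteresis: only one input remains asserted, which is below the threshold, yet the output stays at 1 until all inputs are de-asserted.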
A dual-rail NCL reconfigurable logic block (NRLB) has been designed and simulated using a 
commercial 28nm FDSOI CMOS process. The Boolean logic equations representing the 27 
fundamental dual-rail NCL gates were first decomposed and common terms identified. A look-up 
table was then created that allowed these logic terms to be re-combined under the control of a 
configuration memory (shift register) so that the equations for each of the fundamental dual-rail NCL 
gates could be created as required. The table is “fracturable” in that it can also be partitioned and set 
up as a pair of 2-input NCL registers. While the basic area and latency of the dual-rail 2-output NRLB 
are comparable with previous single-rail asynchronous reconfigurable systems, it is demonstrated that 
the mapping process is more productive and results in much lower gate resource usage 
than comparable asynchronous architectures. Furthermore, the NRLB is more flexible and 
better suited to implementing the fundamental NCL threshold gate functions. The cell also exhibits a static 
power consumption figure similar to that of a commercial Intel Stratix-V FPGA device. 
A customised CAD flow is introduced to provide a specific CAD toolset for this new NCL-based 
reconfigurable architecture. The flow is based on the open-source Verilog-to-Routing (VTR) CAD 
toolset, with its architecture files tuned to match the 28nm Stratix V FPGA characteristics, augmented 
with a simple Verilog to dual-rail NCL conversion process. The VTR CAD flow can produce various 
statistics, such as area and timing analyses, to support comparison between the conventional FPGA 
system and the NCL-based reconfigurable system.  
Due to the size expansion from single-rail to dual-rail logic operation and the sub-optimal mapping 
optimisation in VTR, the average size of an asynchronous benchmark mapped onto the NCL-based 
reconfigurable system is typically in the order of nine times larger than the equivalent synchronous 
design mapped onto conventional FPGA devices. However, using a set of standard benchmarks, the 
NCL-based reconfigurable system shows an improvement in input to output latency of 51% on 
average compared to the synchronous case.  
Finally, an existing Virtual Channel Network-on-Chip (NoC) Router was implemented in both a 
synchronous (FPGA) style and an asynchronous style using the CAD flow, and its performance was 
analysed and compared. This router system is intended to provide an improved interconnection 
solution to current FPGA systems, and therefore overall latency is an important consideration. The 
specific NoC architecture was selected as its five sub-modules have different circuit sizes and 
algorithmic complexities, and it is, therefore, able to demonstrate a range of area and latency 
behaviours. By tuning the various parameters controlling their size and depth on three of the sub-
modules (input module, VC allocator and switch allocator), a range of sub-modules of varying sizes 
and complexity was derived. A best-case latency reduction of 20% to 30% was found for the 
NRLB compared to the synchronous case, with the percentage improvement increasing with the 
module complexity. The final two modules (output module and crossbar) exhibit small gate delays 
compared to the others as well as a small number of pipeline stages. As a result, they show a similar 
or even worse latency performance on the NRLB architecture. In general, all five sub-modules show 
the expected area ratios, from 8x to 12x larger than the synchronous case. Finally, a complete NoC 
router was implemented on the NCL-based reconfigurable system as an example of a medium to large 
scale hardware design. In this case, a 50% latency improvement was observed, similar to the previous 
benchmark tests. These experiments indicate that the latency benefit of this asynchronous 
reconfigurable approach, when compared to synchronous FPGA systems, increases with the size of the 
designs and the complexity of the algorithms.  
  
Acknowledgement 
Undertaking a PhD has been a genuinely adventurous and fruitful journey. I have learned not only the 
methodology of research but also how to keep my motivation on this personal journey. I could not have 
come this far without all the people I have encountered in these few years, and they have been a part of my 
motivation. 
My supervisor, A/Prof. Paul Beckett, is the most influential person who has encouraged me and guided 
me to solve all the problems in my journey. I can still remember what he said: “I am not always 
right, you can argue it with your thoughts.” As researchers, we always doubt the world, and we keep 
finding the truth, which is our original intention. Our weekly “brainstorm” meeting brought me back 
on track when I got lost. 
I would also like to thank my associate supervisor Dr Glenn Matthews. He provided me with all the 
resources I needed to finish my research and gave me much advice when I worked as a tutor. I 
am grateful for the time management advice from Dr Samuel Ippolito as he has taught me how to 
balance research and teaching. 
I was not the only person who was interested in asynchronous designs and NCL. The people who 
were in the same research group helped me enormously. I am thankful for Matthew Kim and Dr 
Prashant Prabhakar Dabholkar’s NCL knowledge and help in setting up the UNCLE tool, Andrew 
Przybylski for his work on VTR and Yosys, Nguyen Le Huy for the Cadence environment setup, 
Kashfia Haque, Renuka Sovani, Duc Nguyen, Long Tran and Mingchen Gu for their companionship. 
A special thanks to my university for providing all the supporting resources; the library and the 
School of Engineering office are also very supportive and helpful. 
Finally, a big thank you to my family, especially my wife Ashley, who was there from the start of my 
PhD journey and has supported me throughout to the end. She was the one who motivated 
me and encouraged me never to give up.  
Contents 
 
Abstract 
Acknowledgement 
List of Figures 
List of Tables 
List of Acronyms 
1 Introduction 
1.1 The Field Programmable Gate Array 
1.2 Asynchronous Logic 
1.3 Network-on-Chip 
1.4 Research Approach & Methodology 
1.5 Contributions and Outcomes 
1.6 Publications 
1.7 Outline 
2 Literature Survey 
2.1 Asynchronous Logic Systems 
2.1.1 Null Convention Logic 
2.1.2 NCL Throughput Behaviour 
2.1.3 Unified NCL Environment and Data-Driven Register 
2.2 Reconfigurable Asynchronous Design 
2.2.1 Asynchronous Designs on FPGA 
2.2.2 Asynchronous Reconfigurable Systems 
2.2.3 FPGA CAD Flows 
2.3 Network on Chip on Reconfigurable Systems 
2.3.1 Network on Chip Router 
2.3.2 Hard vs. Soft Router on FPGA 
2.3.3 Asynchronous Network on Chip 
2.4 Summary 
3 Dual-Rail NCL Reconfigurable Logic Block 
3.1 NCL Look-Up-Table 
3.2 Programming Bits and Shift Register 
3.3 NCL Reconfigurable Logic Block Layout Area 
3.4 NCL Reconfigurable Logic Block Delay and Power Analysis 
3.5 NCL Full Adder with N-output Look-Up-Table 
3.6 NCL Register and Multiplier 
3.7 NCL Threshold Gate Verilog-AMS modules 
3.8 Summary 
4 NCL CAD Flow 
4.1 Experimental CAD Flow 
4.2 Synchronous VTR CAD Flow 
4.3 Asynchronous VTR CAD Flow 
4.4 Synchronous vs. Asynchronous Benchmark Comparison 
4.5 Summary 
5 Network on Chip Router 
5.1 Experimental Setup 
5.2 Input Module 
5.3 Output Module 
5.4 Crossbar Module 
5.5 Virtual Channel Allocator 
5.6 Switch Allocator 
5.7 Summary 
6 Conclusions 
6.1 Future Work 
Appendix A 
Appendix B 
References 
 
  
List of Figures 
 
Figure 1 The Asynchronous Register in the NCL Circuit [24] 
Figure 2 Research Flow 
Figure 3 NCL THmn Threshold Gate 
Figure 4 Two Different Completion Methods in NCL Handshaking Behaviour [58] 
Figure 5 UNCLE Synthesis Flow 
Figure 6 Reset-to-Null Half-latch Structure [38] 
Figure 7 Data-driven Half-latch 
Figure 8 Three Half-latch Structure/ Data-driven Register [38] 
Figure 9 Processing Element Circuit [67] 
Figure 10 NCL-based 16-address LUT (portion) [76] 
Figure 11 NCL 3-Input Threshold Gate 
Figure 12 TH23 Gate Realised on an 8×4 Programmable Gate Macro Block [81] 
Figure 13 Typical VTR CAD Flow [36] 
Figure 14 Router Block Diagram [28] 
Figure 15 Power Breakdown for Intel Teraflop Research Chip [92] 
Figure 16 NoC 2D Mesh Architecture [89] 
Figure 17 Partial NCL Gate Circuit 
Figure 18 Simplified LUT Organisation 
Figure 19 Example NCL Full Adder Circuit 
Figure 20 Dual-rail LUT Layout [100] 
Figure 21 NCL Look-Up-Table Schematic 
Figure 22 PMOS Pull-up Network “Go-to-Null” 
Figure 23 Spectre simulation of pull-up network (TH34w2) 
Figure 24 NMOS Pull-down Network “Go-to-Data” 
Figure 25 Spectre simulation of pull-down network (TH34w2) 
Figure 26 NCL LUT Semi-static Output 
Figure 27 Complete Detection Signal Generation 
Figure 28 Shift Register Configuration Memory (portion) and Master-Slave Flip-Flop 
Figure 29 TH34w2 Programming Bits 
Figure 30 Lookup Table and its Shift Register Layout 
Figure 31 Minimum Width Transistor Area [103] 
Figure 32 Spectre Simulation of NRLB (TH34w2) 
Figure 33 Reconfigurable NCL LE with extra embedded registration [51] 
Figure 34 Optimised Dual-rail NCL Full Adder 
Figure 35 N-output TH34w2 Schematic 
Figure 36 NCL Full Adder with 8x2 LUT Simulation Waveform 
Figure 37 4-Bit Dual-rail Full-word Register 
Figure 38 Bit-wise NCL Register 
Figure 39 Non-pipelined 4x4 Multiplier Schematic (8x8 NRLB) 
Figure 40 7-Stage 4×4 Multiplier with Bit-wise Completion 
Figure 41 TH34w2 Netlist in UNCLE Library 
Figure 42 TH34w2 Analog Description in VAMS Module Block 
Figure 43 NCL Full Adder Simulation with Vams Modules 
Figure 44 Stratix V Devices ALM High Level Block Diagram [105] 
Figure 45 Customised VTR CAD Flow (Synchronous and Asynchronous) 
Figure 46 Area of Input Modules with Various Flit Data Width 
Figure 47 Area of Input Modules with Various Buffer Depth 
Figure 48 Area of Input Modules with Various Numbers of Ports 
Figure 49 Area of Input Modules with Various Numbers of Virtual Channels 
Figure 50 Latency of Input Modules with Various Flit Data Width 
Figure 51 Latency of Input Modules with Various Buffer Depth 
Figure 52 Latency of Input Modules with Various Numbers of Ports 
Figure 53 Latency of Input Modules with Various Numbers of Virtual Channels 
Figure 54 Area of Output Modules with Various Flit Data Width 
Figure 55 Area of Output Modules with Various Numbers of Ports 
Figure 56 Area of Output Modules with Various Numbers of Virtual Channels 
Figure 57 Latency of Output Modules with Various Flit Data Width 
Figure 58 Latency of Output Modules with Various Numbers of Ports 
Figure 59 Latency of Output Modules with Various Numbers of Virtual Channels 
Figure 60 Area of Crossbars with Various Flit Data Width 
Figure 61 Area of Crossbars with Various Numbers of Ports 
Figure 62 Latency of Crossbars with Various Flit Data Width 
Figure 63 Latency of Crossbars with Various Numbers of Ports 
Figure 64 Area of VC Allocators with Various Numbers of Ports 
Figure 65 Area of VC Allocators with Various Number of Virtual Channels 
Figure 66 Latency of VC Allocators with Various Number of Ports 
Figure 67 Latency of VC Allocators with Various Number of Virtual Channels 
Figure 68 Area of Switch Allocators with Various Number of Ports 
Figure 69 Area of Switch Allocators with Various Number of Virtual Channels 
Figure 70 Latency of Switch Allocators with Various Number of Ports 
Figure 71 Latency of Switch Allocators with Various Number of Virtual Channels 
List of Tables 
 
Table 1 Equivalent Boolean Functions of a 4-Variable NCL Macros 
Table 2 Single-Rail Boolean Gates to Dual-Rail NCL Threshold Gates Conversion 
Table 3 16-Master/16-Slave system: Performance Results [27] 
Table 4 4-Master/16-Slave burst- and width-adaptation system: Performance Results [27] 
Table 5 Comparison of Different FPGA-Based NoCs by Stratix III [29] 
Table 6 Dual Rail 4-Variable NCL Macros 
Table 7 Propagation Delays for Reconfigurable Block vs. Input Transitions (ps) 
Table 8 Scaled Propagation Delay Comparison 
Table 9 Average Propagation Delay Comparison based on the Number of Active Input Ports 
Table 10 Static Power Consumption on Input Transition on TH34w2 
Table 11 Transistors Numbers in N-output LUT Architecture 
Table 12 Time of Programming Phase on Different Numbers of Output Ports 
Table 13 Propagation Delay based on Input to Output Transition of Bit-wise NCL Register 
Table 14 Unsigned 4×4 Bits NCL Multipliers mapped onto Different Reconfigurable Systems 
Table 15 VTR Heterogeneous Benchmarks CPD Ratio (%) Chart 
Table 16 I/O Pads Parameters 
Table 17 Interconnection Parameters 
Table 18 Configurable Logic Block Parameters 
Table 19 NCL Reconfigurable Block Architecture Description File 
Table 20 Benchmark Area Comparisons – Synchronous vs. Asynchronous 
Table 21 Benchmark Resource Usage 
Table 22 Benchmarks Latency 
Table 23 Baseline and Range of Router Parameters 
 
 
List of Acronyms 
Acronym Explanation 
ABP Adaptive Backpressure  
ACF Autocorrelation function 
ALM Adaptive Logic Module 
ASIC Application-Specific Integrated Circuit 
ASSP Application Specific Standard Product  
CAD Computer-aided design 
CLB Configurable logic block 
CPD Critical path delay  
CPLD Complex Programmable Logic Devices 
DFF D-flip-flop  
DFM Design for manufacture  
DIMS Delay Insensitive Minterm Synthesis  
DSOP Disjoint Sum-of-Product  
EDA Electronic Design Automation 
FPGA Field-Programmable Gate Array 
GALS Globally Asynchronous Locally Synchronous  
GPU Graphics processing units 
LE Logic Element 
LUT Look-up-table 
MWTU Minimum Width Transistor Unit 
NCL Null Convention Logic 
NoC Network-on-Chip 
NRE Non-recurring engineering  
NRLB NCL-based reconfigurable logic block 
PAD Pixel Array Detector  
PE Processing Element 
PGMB Programmable Gate Macro Block 
QDI Quasi delay-insensitive 
RAM Random-access memory 
SOI Silicon-on-insulator 
SOM Sum-of-Minterm 
TAST TIMA Asynchronous Synthesis Tool 
UDP User Defined Primitive  
UNCLE Unified NCL Environment  
VC Virtual Channel  
VTR Verilog-to-Routing 
  
Introduction 
When the field-programmable gate array (FPGA) was first invented, it was mostly used as a simple 
logic replacement, as an advance on existing complex programmable logic devices (CPLD). The 
ability to alter the hardware organisation after fabrication meant that a circuit could be flexibly 
altered at will and tuned to suit the purpose of a given design. In this way, FPGAs filled a gap between 
hardware and software, and their flexibility and low cost compared to application-specific integrated 
circuit (ASIC) implementations overcame their noticeable size and performance disadvantages. 
Compared to an ASIC design flow, where each specific design requires a dedicated ASIC, reusable 
reconfigurable systems like FPGAs could implement a design simply by being correctly programmed 
(configured). Their two key benefits [1], lower non-recurring engineering (NRE) costs and faster 
time-to-market, were generally welcomed by circuit designers, and FPGAs are now found in diverse 
applications from large server farms to embedded and portable systems. FPGAs have more recently 
emerged as reconfigurable co-processor platforms closely coupled to microprocessors within complex 
integrated system on chip architectures (e.g., [2], [3]). 
In [4], Lyke et al. suggest two broad aspects that will influence current and future reconfigurable 
systems. The first of these, designing for reconfigurability, focuses on the organisation of the flexible 
fabrics themselves, their logic, switching and configuration mechanisms. On the other hand, designing 
with reconfigurability implies the extensive use of EDA tools to embed user designs within 
reconfigurable fabrics and to adapt their organisation to best exploit the flexible characteristics of the 
reconfigurable platform. This thesis represents a contribution to both aspects.  
For example, it is becoming clear that issues such as extreme process variability in deep sub-micron 
transistors will impact the ability to achieve timing closure in high-performance synchronous FPGAs 
in the same way as it affects conventional ASIC architectures. This is particularly true for low-voltage, 
low power embedded and portable devices. Conventional FPGA architectures exhibit the behaviour 
and limitations common to all synchronous systems, so designing asynchronous systems for 
reconfigurability, such that asynchronous capability is embedded into the reconfigurable fabric, may 
allow the emergence of novel capabilities. In addition, there are important differences in the 
behaviour of asynchronous systems that mean they are not simply an equivalent alternative 
implementation strategy to synchronous. Systems such as flexible interconnection networks (e.g., 
network on chip) can benefit from the low latency characteristics of asynchronous logic but only if 
they are designed with reconfigurability in mind and optimised to take full advantage of it.  
1.1 The Field Programmable Gate Array 
For many years, the semiconductor industry has been able to simply exploit ever smaller transistor 
size to improve computational performance and electrical characteristics [5]. However, this is unlikely 
to continue. While faster and smaller transistors support much higher clock frequencies, they can also 
result in higher power consumption and more complex circuit design [6, 7]. The same situation 
applies to large FPGA hardware devices. 
The FPGA was first introduced in 1984 by Xilinx and became one of the most popular 
implementation platforms for digital designs [1]. Since its invention, the FPGA has gone through two 
further significant phases [8]. Firstly, during an “age of expansion”, the FPGA kept its advantage through 
process technology scaling. However, it has been argued that by the end of 1999, FPGAs had eventually 
become too large and lost their critical capacity advantage. A second, so-called “accumulation” stage 
emerged where the target was to reduce their overall area, power and design effort. The “Platform 
FPGA” was introduced with high-performance fixed blocks (multipliers, RAM blocks, etc.) and the 
FPGA became a system on chip design platform. However, these fixed functions have less flexibility, 
and the inefficiencies imposed by the unused blocks caused an explosion in the range of device 
options brought to market. The result was often much higher cost and lower speed on FPGA 
implementations.  
Since 2010, the main challenges for FPGAs have been their increasing design cost and time, which are 
closing the gap with ASIC design. Furthermore, technologies such as multicore processors, graphics 
processing units (GPU) and software-programmable application specific standard products (ASSPs) 
are starting to provide pre-engineered systems with software-like programmability that simplify the 
mapping problems and are becoming perhaps more attractive to the market than conventional FPGA 
devices. 
Three main factors determine whether an FPGA system achieves high performance [1]: 
• The transistor-level design; 
• The quality of the CAD tools used to map circuits into the device; 
• The flexibility of the FPGA architecture. 
It is currently the case that most conventional FPGAs are synchronous, and their tools have been 
optimised over many years to suit the mapping of synchronous digital circuits. Against this 
background, the challenges between the interrelated issues of power and performance in conventional 
clocked processor systems have begun to emerge. Techniques such as power gating, which switch off 
unused functionality, are becoming common in synchronous designs [9]. However, these various 
techniques are, at best, ad hoc solutions that impact performance and can significantly increase 
hardware and software complexity. It is becoming clear that power consumption can be managed only 
by careful application of on-chip processing parallelism [10, 11] and by reducing the impact of the 
global clock. 
To address these problems, a number of researchers have attempted to improve 
the power-performance trade-offs in FPGA architectures using globally asynchronous locally 
synchronous (GALS) techniques, which localise the clock distribution issues to a small part of the device [12]. 
Moreover, a small number of purely asynchronous reconfigurable designs have been introduced in 
recent years [13-16]. With the clock removed, an asynchronous FPGA may potentially exhibit a lower 
power consumption and offer a much more straightforward way to achieve high data throughput [17, 
18]. Recently, companies such as Fulcrum® Microsystems and Achronix® have developed 
asynchronous FPGA-like systems with varying degrees of success. It also appears that Intel® 
Corporation is considering the development of asynchronous reconfigurable logic [19]. At the time of 
writing, Achronix appears to be achieving some measure of commercial success, so it is reasonable to 
believe that asynchronous reconfigurable systems will have a continued impact on reconfigurable 
systems into the future.  
1.2 Asynchronous Logic 
Unlike synchronous design flows, all asynchronous designs use some form of “request and 
acknowledge” handshaking protocol for sequence control and data communication between their logic 
blocks [20]. There are two primary categories of asynchronous circuit models in use, the Bounded-
delay model and the Delay-insensitive model. The micro-pipeline technique [21], for example, uses a 
bounded-delay model which assumes that delays in both gates and wires are bounded. On the other 
hand, delay-insensitive circuits such as Null Convention Logic (NCL) make no assumptions about the 
delays in their logic or interconnects beyond a requirement that local wire intersection is isochronic 
(signals within essential components that fanout to two or more destinations exhibit approximately 
equal delays). Rather than relying on careful clock management, the delay of the individual 
components in an asynchronous logic circuit determines the rate of operation. Further, asynchronous 
systems tend to exhibit better control of the timing of data-dependent operations such as buffering, 
arbitration and flow control compared to synchronous designs [22].  
NCL, first developed by Karl Fant [23], is a symbolically complete quasi delay-insensitive (QDI) 
logic system. As opposed to conventional Boolean systems, which use ‘true’ or ‘false’ values for their 
logical expressions, NCL is based on the values data (i.e. a logical true/false ‘data0’ or ‘data1’) and 
null (meaning “not data”). It is the monotonic switching between the data and null states that controls 
the data flow in NCL circuits and the basic NCL gates exhibit majority logic or ‘threshold’ behaviour, 
represented as an ‘m-of-n’ gate. That is to say, an n-input NCL gate will assert its output when ‘m’ 
inputs have transitioned from their null to data states (data0 or data1). The gates also exhibit hysteresis 
in that an output transition back to null will only occur when all ‘n’ inputs have similarly returned to 
null. 
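The threshold and hysteresis behaviour described above can be illustrated with a minimal behavioural sketch. It is not a model of any circuit from this thesis: the class name `ThresholdGate` and the use of 0 for null and 1 for an asserted rail are assumptions made only for illustration.

```python
class ThresholdGate:
    """Behavioural sketch of an m-of-n NCL threshold gate with hysteresis."""

    def __init__(self, m, n):
        if not 1 <= m <= n:
            raise ValueError("threshold must satisfy 1 <= m <= n")
        self.m = m
        self.n = n
        self.output = 0  # assumed encoding: 0 = null, 1 = asserted

    def evaluate(self, inputs):
        """Apply one set of input values and return the (possibly held) output."""
        if len(inputs) != self.n:
            raise ValueError("expected %d inputs" % self.n)
        asserted = sum(1 for bit in inputs if bit)
        if asserted >= self.m:
            self.output = 1   # threshold reached: assert the output
        elif asserted == 0:
            self.output = 0   # every input is null: return to null
        # In all other cases the previous output is held (hysteresis).
        return self.output


# A 2-of-3 gate driven through a data wavefront followed by a null wavefront.
gate = ThresholdGate(m=2, n=3)
wavefronts = ([0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0])
trace = [gate.evaluate(values) for values in wavefronts]
```

In this run the output is asserted once two inputs carry data, stays asserted while one input has already returned to null, and is released only when all three inputs are null.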
While the communication and interaction between registers in a synchronous circuit are controlled by 
a global clock, register transfers in NCL circuits are controlled by local completeness 
acknowledgement signals. As shown in Figure 1, an asynchronous register is a simple NCL circuit 
with feedback hysteresis that detects either input data completeness or null state. It stores the data 
state when it meets a complete set of data input, and the null state is asserted if all inputs are null. In 
response to the assertion of data or null, the register receives and sends acknowledgement signals to 
communicate with the next and previous combinational circuits. During a data state, a “Request-for-
Null” signal is delivered to the previous stage while a “Request-for-Data” signal is delivered to a 
preceding gate during the null state. Similarly, these same request/acknowledge signals are received 
by the register from the next (successor) stage. 
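The request/acknowledge behaviour of a single register bit can be sketched as follows. This is a simplified behavioural model under stated assumptions rather than the circuit of Figure 1: each rail is assumed to pass through a 2-of-2 threshold gate together with the request from the next stage, and the encoding request-for-data = 1, request-for-null = 0 is also an assumption. The names `DualRailRegister`, `RFD` and `RFN` are hypothetical.

```python
RFD, RFN = 1, 0  # assumed encoding: request-for-data = 1, request-for-null = 0


def th22(a, b, previous):
    """2-of-2 threshold gate with hysteresis."""
    if a and b:
        return 1
    if not a and not b:
        return 0
    return previous  # inputs disagree: hold the previous output


class DualRailRegister:
    """One dual-rail register bit; state is (rail0, rail1), (0, 0) meaning null."""

    def __init__(self):
        self.rail0 = 0
        self.rail1 = 0

    def step(self, in0, in1, ki):
        """Apply input rails and the request ki received from the next stage."""
        self.rail0 = th22(in0, ki, self.rail0)
        self.rail1 = th22(in1, ki, self.rail1)
        # Completion detection: while data is held, ask the previous stage
        # for null; while null is held, ask it for data.
        ko = RFN if (self.rail0 or self.rail1) else RFD
        return (self.rail0, self.rail1), ko


reg = DualRailRegister()
history = [
    reg.step(0, 1, RFD),  # data1 arrives and the next stage wants data: stored
    reg.step(0, 0, RFD),  # null arrives too early: data is held
    reg.step(0, 0, RFN),  # next stage now requests null: register clears
    reg.step(1, 0, RFN),  # data0 arrives too early: null is held
]
```

The sketch shows why no clock is needed: a wavefront only passes when the successor stage has requested it, and the register's own request to its predecessor flips as soon as the wavefront has been captured.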
 
Figure 1 The Asynchronous Register in the NCL Circuit [24] 
As an asynchronous logic, NCL tends to exhibit low latency characteristics [25], which can benefit 
the interconnection between components of an asynchronous system. Network-on-chip (NoC) 
architectures have been proposed as a data routing system intended to provide an improved 
interconnection solution to current FPGA systems. As low latency is one of the critical factors of the 
NoC architecture, this thesis proposes and analyses a reconfigurable NoC built around a 
reconfigurable asynchronous array. 
1.3 Network-on-Chip 
NoC has been widely applied to meet the communication demands of large-scale chip multi-
processors [26]. In this scenario, the distributed computers (“cores”) can send messages (“packets”) to 
any of the other cores on the same chip. NoC has been shown to be highly scalable, reliable, and 
modular, something that FPGA architectures can exploit to achieve better interconnect 
performance, lower data transport latency and lower energy consumption [27]. NCL shares some of 
these basic characteristics. NoC architectures employ a number of routers to provide on-chip point-to-
point connections for packet transmission [28] that build up in a modular fashion into a higher-level 
protocol. When applied to synchronous FPGA devices, the interconnect delay depends on the number 
of clock cycles a packet takes to transmit from the starting point to the destination point. Previous 
research has indicated that optimisation, compilation and partial reconfiguration steps on an FPGA are 
more straightforward using an NoC interface. Due to its modular design, a newly configured module 
can simply replace an old one in the network instead of re-routing each net on the FPGA [27, 29].  
To build an efficient NoC architecture, it will be necessary to examine three primary design issues [28, 
30]: 
• Network Topology: encompasses the placement and connectivity of the NoC routers. It 
decides how the packets travel across the network by defining various channels and 
connection patterns. The choice of topology affects the performance, cost, and 
scalability of the network. Two-dimensional topologies such as 2D-Mesh, Ring, Torus and Butterfly 
Fat-Tree are regularly used for commercial network purposes. 
• Routing Algorithm: decides the correct direction to transfer the packet from its source to the 
destination endpoint. Routing algorithms can affect delay, power consumption and 
congestion conditions in the network. 
• Flow control: controls a packet’s travel from each router to the next in the correct order. The 
flow control process typically employs Virtual Channel (VC) techniques to solve the problem 
of route blockages. In addition, an output schedule will be applied to determine the priority 
order of packets to reach their output ports. The buffer of a router can easily affect the 
implementation cost and power consumption in the NoC. 
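To make the interaction between topology and routing algorithm concrete, the following sketch applies dimension-order (XY) routing to a 2D mesh. XY routing is used here only as a widely known deterministic example; it is not claimed to be the algorithm evaluated in this thesis, and the port names and coordinate convention are assumptions.

```python
def xy_route(current, destination):
    """Return the output port for one hop of dimension-order (XY) routing.

    Nodes are (x, y) mesh coordinates. The packet first travels along X
    until the column matches, then along Y.
    """
    cx, cy = current
    dx, dy = destination
    if cx < dx:
        return "EAST"
    if cx > dx:
        return "WEST"
    if cy < dy:
        return "NORTH"
    if cy > dy:
        return "SOUTH"
    return "LOCAL"  # packet has arrived: deliver to the attached core


def route_path(source, destination):
    """List every router visited between source and destination."""
    moves = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    path = [source]
    node = source
    while True:
        port = xy_route(node, destination)
        if port == "LOCAL":
            return path
        step_x, step_y = moves[port]
        node = (node[0] + step_x, node[1] + step_y)
        path.append(node)


path = route_path((0, 0), (2, 1))
hops = len(path) - 1
```

In a synchronous implementation the hop count multiplied by the per-router pipeline depth gives the packet latency in clock cycles, whereas in an asynchronous router the latency follows from the actual gate and wire delays along the path.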
Taken together, these issues determine the performance, cost and efficiency of an NoC system. However, 
the essential component that carries a packet through the network is the NoC router. It is the 
performance of the NoC router that primarily affects the overall data throughput, resource usage and 
energy cost, which are the critical factors in the implementation cost of an on-chip network. The 
power consumption, data throughput and dataflow control algorithm of the NoC router are the key 
drivers in the development of future NoC organisations [31].  
The FPGA-based NoC is intended to improve interconnection solutions for current FPGA systems but 
faces many challenges, ranging from highly abstract software-related issues, through system topology, to 
physical-level implementation [31, 32]. The router architecture, at the physical level, is the critical 
part of improving NoC performance. Soft NoCs and hard NoCs are frequently discussed as each 
has its own advantages and disadvantages. The performance of soft NoCs relies on the underlying 
reconfigurable system architecture, which means that advancements in the reconfigurable system can 
bring improvements to the reconfigurable NoC architecture. 
1.4 Research Approach & Methodology 
This research is based on the observation that traditional synchronous reconfigurable systems are now 
hitting a bottleneck imposed by high data throughput demands and strict energy consumption limits. 
Existing studies suggest that an obvious way to address the problems of clocked systems such as 
conventional FPGAs is to eliminate the clock, even if the removal of the synchronous 
assumption makes the design process significantly harder. 
Furthermore, NoC architectures have been proposed to provide a more flexible and scalable on-chip 
communication alternative for many-core systems on chip. When compared with the traditional bus-
based architecture, there is no doubt that NoC offers a better IP block integration solution for high-end 
designs [33]. The development of conventional FPGA brings challenges to the programmable 
interconnections as well. The hypothesis here is that a reconfigurable NoC using delay insensitive 
asynchronous logic techniques will have the advantage of low latency and straightforward 
implementation.  
To explore these assumptions, the following questions and the related experimental methods are 
developed to illustrate the approach of this research. 
Research Questions: 
Q1.  Can a purely asynchronous reconfigurable system implementation exhibit latency and 
resource usage comparable to conventional synchronous FPGA implementations at an 
equivalent technology node? 
Q2. How will the area and delay of an NoC architecture implemented on an NCL-based asynchronous 
reconfigurable system compare with those of the same architecture on a conventional 
synchronous FPGA? 
As shown in the first part of the research flow (Figure 2), a new asynchronous reconfigurable system 
has been built to validate and compare with previous asynchronous FPGA proposals. The NCL-based 
reconfigurable logic block (NRLB) schematic, as a component of the reconfigurable system, was 
developed and evaluated using the Cadence® Virtuoso® Layout Suite with a commercial 28nm FDSOI 
CMOS process technology [34] and models provided by the foundry. Key benefits of the 
SOI MOSFET are negligible body effect and reduced parasitic drain capacitance when compared to a 
bulk device [35]. Average propagation delay results derived from the transistor level simulations were 
used to create analog/mixed-signal models of the NRLB cell, and the Cadence® Spectre® AMS 
Simulator was employed to simulate the reconfigurable NCL gates. Gate propagation delay results 
were also used later in the architecture description file required by the Verilog-to-Routing (VTR) flow 
[36] used to measure the performance of typical applications such as the NoC. 
The Intel® Stratix V was chosen as a commercial benchmark as it uses a similar 28nm process¹ [37]. 
Typical performance data drawn from the data sheets were used to set up the comparative tests against 
the NRLB asynchronous reconfigurable architecture. To compare the purely asynchronous 
reconfigurable system with the conventional synchronous FPGA system, the VTR CAD flow [36] was 
modified to measure both latency and resource usage characteristics of the reconfigurable systems, as 
shown in the lower part of Figure 2. As can be seen in the diagram, synchronous digital circuits were 
analysed using an unmodified VTR CAD Flow.  
To extend this to asynchronous designs, an NCL VTR CAD flow was developed, which can 
implement the final place and route process on the asynchronous reconfigurable cells. The standard 
CAD Flow is made up of three general stages: (1) Elaboration & Synthesis, (2) Logic Optimisation & 
Technology Mapping and (3) Packing and Placement, Routing & Timing Analysis. The VTR flow 
variants were applied at stages (1) and (2). The synthesis tool UNCLE [38] replaced ODIN II [39] to 
perform the NCL gate-level netlist synthesis. The logic synthesis tool Yosys [40] was included as it is 
able to generate design netlists derived from technology mappers other than ABC [41] and is 
compatible with the requirements for mapping onto the NCL-based asynchronous reconfigurable 
system. After these two tool variants were included, VPR [42, 43] packs the netlist into the logic 
blocks within the specific FPGA architectures, places and then routes the circuit. VPR can analyse the 
implementation and produce various statistics such as the minimum number of tracks per channel 
required to successfully route, the total wire length, circuit speed, area and power [36]. 
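The relationship between the two flows can be summarised as a small data structure. This is only an illustrative summary of the description above, not a configuration file accepted by VTR. Placing Yosys as the only stage-2 tool of the NCL flow is an assumption of this sketch, and the names `SYNCHRONOUS_FLOW`, `NCL_FLOW` and `modified_stages` are hypothetical.

```python
# The three general stages of the CAD flow, as listed in the text.
STAGES = (
    "Elaboration & Synthesis",
    "Logic Optimisation & Technology Mapping",
    "Packing, Placement, Routing & Timing Analysis",
)

# Tool used at each stage of the unmodified (synchronous) VTR flow.
SYNCHRONOUS_FLOW = {
    STAGES[0]: "ODIN II",
    STAGES[1]: "ABC",
    STAGES[2]: "VPR",
}

# Tool used at each stage of the NCL variant.
# Assumption: Yosys stands in for ABC at stage 2.
NCL_FLOW = {
    STAGES[0]: "UNCLE",
    STAGES[1]: "Yosys",
    STAGES[2]: "VPR",
}


def modified_stages(reference, variant):
    """Return the stages, in flow order, whose tool differs between two flows."""
    return [stage for stage in STAGES if reference[stage] != variant[stage]]


changed = modified_stages(SYNCHRONOUS_FLOW, NCL_FLOW)
```

Only the first two stages differ, which reflects the point made above: the back end (VPR) is shared, so packing, placement and routing results for the two systems are produced by the same tool.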
                                                     
¹ Although details of the actual process are difficult to obtain, it appears to be a 28nm bulk process rather than the FDSOI process used in 
this work. It is assumed that the characteristics are sufficiently similar to allow general comparisons to be made. 
Figure 2 Research Flow 
NoC allows high design flexibility with low infrastructure cost [44] and shares these same 
characteristics with reconfigurable systems in general. The VC router microarchitecture described in 
[28] was employed in these experiments as its five sub-modules have different circuit sizes and 
algorithmic complexities, and thus are able to demonstrate a range of area and latency behaviours. 
The experiments focused on a comparison between the characteristics of this NoC router when 
implemented on both the NCL-based asynchronous reconfigurable system and a conventional 
synchronous FPGA system based on the Stratix V. Finally, conclusions are drawn about 
the critical factors in the development of future asynchronous reconfigurable NoC routers. 
1.5 Contributions and Outcomes 
The following specific results have arisen from the work in this thesis. 
1. Dual-rail NCL NRLB built using a 28nm FDSOI CMOS process technology. 
a. The NRLB can implement any of the 27 fundamental NCL threshold gates and the NCL register; 
b. The area of the NRLB is 522.43 µm². It has an average propagation delay of 160 ps during 
the threshold gate logic phase and an average static power consumption of 24 µW; 
c. The default NRLB has eight inputs and eight outputs, and is treated as a 4-bit dual-rail 
I/O logic element. A modified NRLB with 2 data outputs was found to be better 
suited to the netlists generated by the UNCLE synthesis tool; 
d. Fewer resources are consumed by asynchronous designs mapped onto the NRLB 
system than by the same designs mapped onto previous asynchronous reconfigurable 
proposals. 
2. A customised CAD flow has been created to model NCL designs on the asynchronous 
reconfigurable system (NRLB). Experiments were carried out on both a synchronous 
commercial FPGA (Stratix V) and the asynchronous reconfigurable system (NRLB) to 
demonstrate the effectiveness of the flow. 
a. NCL-based benchmarks are 8x to 10x larger than the equivalent synchronous cases. A higher 
proportion of registers and buffer memory in a design leads to a higher area ratio; 
b. The data latency (from input to output) of NCL-based benchmarks improves by 30% to 50% 
when compared with the synchronous cases. 
3. The NCL-based NoC router shares the characteristics of the basic benchmarks. It 
achieves 50% lower latency than the synchronous case with an area overhead of about a 
factor of ten. 
a. Due to the delay-insensitive characteristic, the three sub-modules that place the greatest 
emphasis on routing logic (input module, VC allocator and switch allocator) achieve 
lower latency in the asynchronous cases. The input module incurs a 
large area penalty as it includes the input buffer memory; 
b. Small circuits with simpler routing algorithms, like the output module and crossbar, are 
not good samples for evaluating the difference between synchronous and asynchronous 
reconfigurable systems. This is because, in small circuits with fewer gate delays and 
fewer pipeline stages, the physical wire delay is significantly larger than the logic 
element delay; 
c. The current study implies a positive correlation between the complexity of the algorithms 
and the latency improvement offered by the asynchronous reconfigurable system. 
1.6 Publications 
Results on the dual-rail NCL look-up table (LUT) have been reported in the following publication: 
Jing Yu and Paul Beckett. 2014. “A dual-rail LUT for reconfigurable logic using null convention 
logic”. In Proceedings of the 24th edition of the great lakes symposium on VLSI (GLSVLSI'14). ACM, 
New York, NY, USA, 261-266. DOI: https://doi.org/10.1145/2591513.2591589 
  
1.7 Outline 
Following this motivating chapter, the remainder of this thesis proceeds as follows. 
Chapter 2 focuses on the background of Null Convention Logic and previous research on NCL. 
Recent reconfigurable NCL proposals and the CAD tools for modelling on FPGA are discussed in this 
chapter as well. Finally, NoC architectures are discussed, highlighting their potential benefits to 
FPGA systems. 
A dual-rail NCL reconfigurable logic block, the main component of the complete asynchronous 
reconfigurable system, is proposed and analysed in Chapter 3. The NRLB can perform the function of 
any of the 27 fundamental NCL threshold gates or an NCL register. Some small circuits such as an NCL-based 
full adder and multiplier are built to illustrate and evaluate the NRLB performance. 
Chapter 4 details the methodology and experimental setup for the NCL-based asynchronous 
reconfigurable system. The performance of both the NRLB asynchronous 
reconfigurable system and a conventional synchronous FPGA device is analysed based on the same 
process technology.  
Chapter 5 analyses the advantages and disadvantages of a soft asynchronous NoC router organisation 
mapped onto the asynchronous reconfigurable system. 
Finally, Chapter 6 concludes the thesis, discusses the questions raised in Chapter 1, and identifies 
potential further work arising from the research.
Literature Survey 
 
 
 
 
 
 
 
 
 
It is becoming clear that the pace of improvement in semiconductor fabrication is slowing as device 
geometries reach their ultimate physical limits. Further, even though large system-on-chip and 
synchronous FPGA devices have the advantage of short time-to-market and small setup costs, it is 
becoming more challenging to achieve high performance in conventional clocked processing systems. 
Reconfigurable asynchronous techniques may be one potential solution to this challenge that comes 
with additional benefits such as overall robustness, modularity and average-case performance 
properties [45]. Although there is a lack of asynchronous reconfigurable devices and well-built design 
tools in the industry, a few researchers [13, 14, 46-51] have attempted to develop novel asynchronous 
reconfigurable logic devices, and have tried to find a straightforward approach to balancing the 
resource demands and the complexity of FPGA designs. This chapter reviews the current state of the 
art in asynchronous and reconfigurable systems, then introduces the NoC application chosen as a case 
study in this work. 
  
2.1 Asynchronous Logic Systems 
Many studies have attempted to eliminate the global clock driving signal in FPGA designs. A number 
have adopted synthesis methods based on asynchronous Muller C-elements [52] to express the Boolean control 
logic, coupled with dual-rail four-phase protocols to represent handshake behaviour between circuits. 
However, structures that combine Boolean logic and C-elements become very large, and their power 
consumption can be significantly adversely affected [53]. As identified by Fant [24], this approach 
does not result in a theoretically complete solution as it still relies on Boolean logic. To address this, 
Fant has devised a coherent logic system that is complete and sufficiently expressive to implement 
entire systems. 
2.1.1 Null Convention Logic 
NCL is a symbolically complete logic system that implicitly and completely expresses logic processes. 
By comparison, Boolean logic is not complete as it requires the inclusion of an independent time 
variable (i.e., a clock) that has to be very carefully coordinated with the logic part of the expression, in 
order to entirely and adequately express an operation. NCL adds the control value null (i.e., not data) 
to the Boolean set to create a symbolically complete and delay-insensitive three-value logic system. A 
gate will only assert its output data when a complete set of (valid) data values are presented at its 
input, thereby enforcing a “completeness of input” criterion. A critical difference between NCL 
circuits and other delay-insensitive methods is that all NCL gates exhibit built-in hysteresis or state-
holding behaviour, instead of incorporating one single type of state element, i.e., the Muller C-
element [54]. 
 
Figure 3 NCL THmn Threshold Gate 
The primary circuit element of NCL is the THmn threshold gate (Figure 3) [55], where 1≤ m ≤ n. 
These gates have a total of n inputs, and at least m of n inputs must be asserted for the output to 
respond; otherwise, a null value remains asserted. This is the threshold behaviour in NCL. Hysteresis 
ensures that the gate will not return to null until all inputs are themselves null. Four-input threshold 
gates with different values of m (i.e., THm4 gates) can generate all Boolean functions of four or fewer 
variables, so the complete logic coverage can be established using only 24 fundamental gate functions. 
Table 1 illustrates a subset of NCL gates based on this idea.  
It can be seen that for a given ‘n’-input threshold gate, additional terms are added to the function for 
each increase in ‘m’. The weighting term, ‘w’ (1 < w ≤ m), means that one of the inputs (by default the first 
input in order) counts as ‘w’ units towards the threshold rather than one. For example, the TH34w2 function 
describes a 4-input gate with a threshold of 3 in which input A has a “weight” of 2, so A together with any 
one other input reaches the threshold, and the equivalent Boolean logic function becomes A.B+A.C+A.D+B.C.D. 
NCL gates are considerably more complex than their Boolean counterparts and offer a higher level of 
functionality to the logic decomposition process. In Table 1, it can be seen that the last three listed macro 
functions (THand0, THxor0, TH24comp) could be readily derived from combinations of the remaining 24. 
In practice, however, they are implemented as standard gates for completeness, and they can potentially further 
reduce the logic complexity. As a result, a total of 27 fundamental NCL gates are commonly used 
in NCL design. 
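The way a gate name determines its equivalent Boolean function can be checked mechanically. The sketch below derives the set function (ignoring hysteresis) of a weighted threshold gate by enumerating the minimal input combinations whose weights reach the threshold. The helper names `parse_gate` and `set_terms` are hypothetical, and the three macro gates (THand0, THxor0, TH24comp) are deliberately not handled because their names do not follow the THmnw pattern.

```python
from itertools import combinations


def parse_gate(name):
    """Split a name such as 'TH34w2' into (threshold m, list of input weights)."""
    body, _, weight_digits = name[2:].partition("w")
    m, n = int(body[0]), int(body[1])
    # Listed weights apply to the first inputs in order; the rest weigh 1.
    weights = [int(d) for d in weight_digits] + [1] * (n - len(weight_digits))
    return m, weights


def set_terms(name):
    """Return the sum-of-products set function, e.g. 'A.B+A.C+A.D+B.C.D'."""
    m, weights = parse_gate(name)
    letters = "ABCD"[:len(weights)]
    minimal = []
    # Visit combinations from smallest to largest so that any combination
    # containing an already accepted term can be skipped as redundant.
    for size in range(1, len(weights) + 1):
        for combo in combinations(range(len(weights)), size):
            reaches = sum(weights[i] for i in combo) >= m
            covered = any(set(term) <= set(combo) for term in minimal)
            if reaches and not covered:
                minimal.append(combo)
    return "+".join(".".join(letters[i] for i in term) for term in minimal)


th34w2 = set_terms("TH34w2")
th54w322 = set_terms("TH54w322")
```

For TH34w2 this reproduces the function quoted in the text, and the same procedure can be used to cross-check the other weighted entries of Table 1.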
2.1.2 NCL Throughput Behaviour 
The management of data communication and interaction between components is critical in a 
delay-insensitive circuit. Pipelining is most commonly used in NCL design to obtain 
high-speed data paths with fewer gate delays [56, 57]. An asynchronous register (see Figure 1) is 
an essential component for controlling data throughput completeness in a pipelined circuit. 
  
Table 1 Equivalent Boolean Functions of 4-Variable NCL Macros 
THmn Gate    Boolean Function 
TH12         Z = A + B 
TH22         Z = A.B 
TH13         Z = A + B + C 
TH23         Z = A.B + A.C + B.C 
TH33         Z = A.B.C 
TH23w2       Z = A + B.C 
TH33w2       Z = A.B + A.C 
TH14         Z = A + B + C + D 
TH24         Z = A.B + A.C + A.D + B.C + B.D + C.D 
TH34         Z = A.B.C + A.B.D + A.C.D + B.C.D 
TH44         Z = A.B.C.D 
TH24w2       Z = A + B.C + B.D + C.D 
TH34w2       Z = A.B + A.C + A.D + B.C.D 
TH44w2       Z = A.B.C + A.B.D + A.C.D 
TH34w3       Z = A + B.C.D 
TH44w3       Z = A.B + A.C + A.D 
TH24w22      Z = A + B + C.D 
TH34w22      Z = A.B + A.C + A.D + B.C + B.D 
TH44w22      Z = A.B + A.C.D + B.C.D 
TH54w22      Z = A.B.C + A.B.D 
TH34w32      Z = A + B.C + B.D 
TH54w32      Z = A.B + A.C.D 
TH44w322     Z = A.B + A.C + A.D + B.C 
TH54w322     Z = A.B + A.C + B.C.D 
THxor0       Z = A.B + C.D 
THand0       Z = A.B + A.D + B.C 
TH24comp     Z = A.C + A.D + B.C + B.D 
  
Either full-word or bit-wise completion methods are used in NCL pipelining with NCL registers [58]. 
Full-word completion registers (Figure 4.a) combine all feedback signals in the same stage and then 
send the combined signal to the registers in the previous stage to control the data flow. In contrast, 
bit-wise completion registers (Figure 4.b) send only the feedback signal in the same data flow to the 
previous registers. Smith et al. [58] argue that bit-wise completion has a higher data throughput than 
full-word completion, as the former reduces the completion logic delay from two gate delays to only 
one. Kim et al. [57] analysed the data throughput in 2-D pipelined circuits with bit-wise 
completion, which was found to be 160% faster than the equivalent non-pipelined designs 
and 60% faster than the cases with 1-D pipelining. However, the area penalty is high: 
the 2-D pipelined circuits are nearly 4x larger than the non-pipelined designs. Embedded registration 
can also be used to merge the asynchronous register into the NCL combinational logic to increase 
throughput and decrease latency and area [58].  
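The difference between the two completion methods can be sketched by computing which acknowledgement each register in the previous stage receives. The model below is a simplification under stated assumptions: every bit is assumed to form an independent data path (bit i of one stage feeds only bit i of the next), and the full-word combiner is modelled as a single n-of-n threshold gate with hysteresis. The function names are hypothetical.

```python
def full_word_ack(ko_bits, previous):
    """Combine the per-bit acknowledgements of one stage into a single signal.

    Modelled as an n-of-n threshold gate with hysteresis: the combined
    value changes only once every bit agrees, otherwise it is held.
    """
    if all(ko_bits):
        return 1
    if not any(ko_bits):
        return 0
    return previous


def acks_to_previous_stage(ko_bits, mode, previous=0):
    """Return the acknowledgement seen by each register of the previous stage."""
    if mode == "full-word":
        combined = full_word_ack(ko_bits, previous)
        return [combined] * len(ko_bits)   # every register sees the same signal
    if mode == "bit-wise":
        return list(ko_bits)               # each register sees only its own bit
    raise ValueError("mode must be 'full-word' or 'bit-wise'")


# Three of four bits have already completed; bit 1 is still in flight.
partial = [1, 0, 1, 1]
full_word_view = acks_to_previous_stage(partial, "full-word", previous=0)
bit_wise_view = acks_to_previous_stage(partial, "bit-wise")
```

With full-word completion no register in the previous stage may advance until the slowest bit finishes, while with bit-wise completion the three completed bit paths are released immediately, at the cost of less synchronisation across the word.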
2.1.3 Unified NCL Environment and Data-Driven Register 
As conventional synchronous circuits have long dominated the industrial market, most commercial 
EDA tools have been developed for synchronous designs only [59, 60]. From their very first 
introduction, NCL-based designs have been mostly implemented as full-custom designs. More 
recently, some design flows [61-63] have been proposed to synthesise an RTL behavioural 
description into an NCL-based structural description, building on the use of commercial synchronous 
EDA tools. The open-source synthesis tool, the Unified NCL Environment (UNCLE) [38], developed 
by Mississippi State University in 2011, was one of the first.² UNCLE synthesises a Verilog RTL 
netlist into an NCL netlist and converts from single-rail to dual-rail at the same time, based on the 27 
fundamental NCL threshold gates in Table 1. It provides an easy and straightforward route to NCL 
design synthesis. Furthermore, most of the UNCLE scripts are written in Python®, 
so they can be modified to suit our specific asynchronous reconfigurable architectural requirements 
and circumstances. 
                                                     
² All the executable files and the manual can be downloaded from https://sites.google.com/site/asynctools/. 
 
Figure 4 Two Different Completion Methods in NCL Handshaking Behaviour [58]: a) Full-word Completion; b) Bit-wise Completion 
 
Figure 5 UNCLE Synthesis Flow 
The overall UNCLE Synthesis Flow (Figure 5) can be divided into two parts. The first part on the left 
is the synthesis process from the behavioural Verilog RTL files to the Verilog gate-level structural 
netlist, which is performed by a commercial synthesis tool (Cadence® Genus™). This structural netlist 
is constructed using single-rail Boolean logic gates which can be expanded into dual-rail logic. The 
second (UNCLE) part on the right is implemented using Python® scripts. A built-in NCL threshold 
gate library is used by the mapping process to convert single-rail Boolean logic gates to dual-rail NCL 
threshold gates, as illustrated by the examples in Table 2. The final Verilog dual-rail NCL netlist is 
then derived after the acknowledgement network is added along with some optional optimisation steps.  
As discussed in Section 2.1.2, NCL naturally exhibits linear pipeline behaviour. UNCLE uses either 
data-driven or control-driven design styles in its acknowledgement network generation to convert 
sequential flip-flops into asynchronous registers. This study will focus on the data-driven style, which 
performs better on linear pipelines. The Reset-to-Null half latch (drlatn), Set-to-Data1 half 
latch (drlats) and Reset-to-Data0 half latch (drlatr) are the dual-rail data-driven registers used by 
UNCLE to control data transfer. As shown in Figure 6, the Cr gate is a C-element with asynchronous reset-to-0. 
It can be replaced by a TH22r, which is an NCL TH22 gate with the reset-to-0 function added. Similarly, 
the C-element with asynchronous set-to-1 can be replaced by a TH22s gate (with set-to-1 function), 
which can be used in the drlats and drlatr designs. 
Table 2 Single-Rail Boolean Gates to Dual-Rail NCL Threshold Gates Conversion
Boolean Gate           NCL Threshold Gate
and2 (a, b, y)         TH22 (.y(t_y), .a(t_a), .b(t_b));
                       THand0 (.y(f_y), .a(f_a), .b(f_b), .c(t_a), .d(t_b));
or2 (a, b, y)          TH22 (.y(f_y), .a(f_a), .b(f_b));
                       THand0 (.y(t_y), .a(t_a), .b(t_b), .c(f_a), .d(f_b));
mux2 (a, b, s, y)      TH33w2 (.y(n1), .a(f_s), .b(t_a), .c(f_a));
                       TH33w2 (.y(n2), .a(t_s), .b(t_b), .c(f_b));
                       THxor0 (.y(t_y), .a(t_a), .b(n2), .c(t_b), .d(n1));
                       THxor0 (.y(f_y), .a(f_a), .b(n2), .c(n1), .d(f_b));
xor2 (a, b, y)         TH24comp (.y(f_y), .a(t_a), .b(f_b), .c(t_b), .d(f_a));
                       TH24comp (.y(t_y), .a(t_a), .b(t_b), .c(f_b), .d(f_a));
fa (a, b, ci, s, co)   TH23 (.y(f_co), .a(f_ci), .b(f_a), .c(f_b));
                       TH23 (.y(t_co), .a(t_ci), .b(t_a), .c(t_b));
                       TH34w2 (.y(f_s), .a(t_co), .b(f_ci), .c(f_a), .d(f_b));
                       TH34w2 (.y(t_s), .a(f_co), .b(t_ci), .c(t_a), .d(t_b));
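As an illustration of the first row of Table 2, the following Python sketch models the dual-rail and2 mapping behaviourally. It assumes the usual NCL conventions, which are not spelled out in the table itself: DATA1 is encoded as (t, f) = (1, 0), DATA0 as (0, 1) and NULL as (0, 0); a threshold gate sets when its set function is true, resets only when all inputs are 0, and otherwise holds; and THand0 has the set function AB + BC + AD. The class and function names are hypothetical.

```python
class ThresholdGate:
    """NCL threshold gate with hysteresis: set on set_fn, reset on all-zero inputs."""

    def __init__(self, set_fn):
        self.set_fn = set_fn
        self.z = 0

    def step(self, *inputs):
        if self.set_fn(*inputs):
            self.z = 1
        elif not any(inputs):
            self.z = 0
        return self.z  # otherwise the previous value is held


def th22():
    return ThresholdGate(lambda a, b: bool(a and b))


def thand0():
    # Assumed set function of THand0: AB + BC + AD
    return ThresholdGate(lambda a, b, c, d: bool((a and b) or (b and c) or (a and d)))


class DualRailAnd2:
    """and2 (a, b, y) mapped as in Table 2: TH22 drives t_y, THand0 drives f_y."""

    def __init__(self):
        self.t_gate = th22()
        self.f_gate = thand0()

    def step(self, a, b):
        (t_a, f_a), (t_b, f_b) = a, b
        t_y = self.t_gate.step(t_a, t_b)
        f_y = self.f_gate.step(f_a, f_b, t_a, t_b)
        return (t_y, f_y)


NULL, DATA0, DATA1 = (0, 0), (0, 1), (1, 0)


def encode(bit):
    return DATA1 if bit else DATA0


def evaluate(a_bit, b_bit):
    """Apply a NULL wavefront, then a DATA wavefront, then NULL again."""
    gate = DualRailAnd2()
    before = gate.step(NULL, NULL)
    result = gate.step(encode(a_bit), encode(b_bit))
    after = gate.step(NULL, NULL)
    return before, result, after
```

For every input combination the DATA wavefront yields the dual-rail encoding of a AND b, and the following NULL wavefront returns both rails to 0.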
 
 
Figure 6 Reset-to-Null Half-latch Structure [38] 
During the data-driven mapping process, each drlatn half latch can perform a clocked D-latch 
function which is generated in the single-rail gate-level netlist, as shown in Figure 7. The D-latch 
behavioural Verilog statement with a clock signal is transformed into the dual-rail half latch drlatn, 
and the clock signal is dropped during the mapping process. 
 
 
Figure 7 Data-driven Half-latch  
Figure 8 illustrates the three half-latch structure with a Reset-to-Data0 function, which is one
of the standard UNCLE implementation structures. The structure can replace a clocked D-flip-flop
(DFF) with an asynchronous low-true reset in the single-rail gate-level netlist. The register with the
Set-to-Data1 function employs a drlats gate in place of the drlatr gate in the middle of the three
half-latch structure. To generate the acknowledgement handshaking network in the UNCLE synthesis flow,
DFFs or D-latches and the top-level clock signal are compulsory: they must be present in the RTL
modules. The simple scripts used in the UNCLE tool have considerable difficulty translating purely
combinatorial logic.
UNCLE provides a method to synthesise clock-based behavioural Verilog RTL to a Verilog NCL gate-level
netlist. Most UNCLE scripts are written in Python®, so they can be adapted to meet various design
requirements and circumstances, as will be discussed in the following chapters. Besides the
synthesis tool, effective “place and route” tools and a reconfigurable asynchronous architecture will
be discussed and analysed later in the thesis.
/* D-latch behavioural Verilog description (listing from Figure 7) */
always @* begin
   if (clk == 1) q <= d;
end
 
Figure 8 Three Half-latch Structure/ Data-driven Register [38] 
2.2 Reconfigurable Asynchronous Design 
A number of asynchronous reconfigurable designs have been developed previously based on different 
asynchronous techniques such as bundled data protocols (e.g., STACC [45]), phased logic [49, 64] 
and NCL [50]. A significant problem with the bundled data approach is the requirement to match the
delays between the data and control paths. This makes it similar to the synchronous style and thus
requires extensive worst-case timing analysis. An asynchronous pipelined memory compiler (AMC), which
uses a combination of QDI control and a bundled-data datapath, was built to improve the average-case
performance of SRAM memory access [65]. QDI control can achieve average-case performance instead of
the worst-case performance exhibited by the bundled-data model.
In [66], Hromalik used both “asynchronisation” and “synchronisation” to derive the FPGA pixel array 
detector (PAD). The results showed that the asynchronous autocorrelation function (ACF) came with 
a lower time delay than the synchronous reconfigurable ACF, but the non-deterministic delay chain 
makes it unsuited for long time ranges (100ns – 1ms) over large numbers of pixels.  
/* DFF behavioural Verilog description (listing from Figure 8) */
always @ (negedge reset or posedge clk)
begin
   if (reset == 0) q <= 0;
   else q <= d;
end
 
2.2.1 Asynchronous Designs on FPGA 
Mocho [16] attempted to apply four different dual-rail design styles in programmable devices: (1)
delay-insensitive minterm synthesis (DIMS, which is based on the C-element), (2) NCL, (3) derivation
from combinational circuits (using dual-rail encoding) and (4) behavioural description with strong
indication (i.e., the behavioural description includes the desired characteristics). NCL showed
the best results on FPGA, mainly due to the characteristics of its threshold logic. The asynchronous
circuits could not compete with the equivalent synchronous circuits in size or performance.
However, Mocho believes asynchronous designs can “trick” the synthesis tools into doing an
acceptable job of converting VHDL code into asynchronous networks, and so can be an affordable
solution for prototyping asynchronous circuits.
 
Figure 9 Processing Element Circuit [67] 
Chen and his team attempted to implement an asynchronous energy-efficient CNN accelerator on the
Xilinx VC707 FPGA in [67]. Each computational core in the accelerator is constructed as a 5×5 array
of processing elements (PEs) connected together by an asynchronous mesh network. The so-called
“Click Elements” generate a pseudo-clock sequence from the handshake protocol (Figure 9) that drives
the synchronous registers in the FPGA, resulting in asynchronous pipelining behaviour. The study
reported 84% lower dynamic power for the asynchronous computing core than for a synchronous
core. Another similar approach is described in [68]. The energy-efficient
asynchronous accelerator Domino, which supports graph analytics applications, was implemented on 
an advanced Xilinx Virtex-7 board and was shown to consume around 5x lower power than GraphChi 
[69] implemented on an Intel Core i7 CPU. 
The TIMA asynchronous synthesis tools (TAST) methodology [70] has successfully implemented 
asynchronous circuits onto LUT-based FPGA devices targeting Xilinx® X4000 and Altera® APEX 
families. Lemberski [71] introduced a Sum-of-Minterm (SOM) method that can simplify the circuit 
complexity when the single-rail circuits are transformed into dual-rail QDI networks and then mapped 
onto the LUT-based FPGA. The Disjoint Sum-of-Product (DSOP) [72, 73] and ABC [74] synthesis 
functions were employed to implement asynchronous designs on FPGA systems in Lemberski’s work. 
While the work was ultimately successful in minimising the area of asynchronous networks when
mapped to conventional FPGA platforms, it could still be argued that the synchronous nature of
conventional FPGA devices makes this an inherently sub-optimal approach.
An asynchronous 8-bit processor mapped onto a commercial FPGA device was described by Herrera
[75], in which a delay-insensitive asynchronous circuit was implemented with a 4-phase
handshaking protocol. All processor modules work sequentially, and the data-loop is controlled by the
sequencing unit, which implements a modified Muller [76] pipeline. This processor was functional and
was designed in a simple way that encourages the exploration of asynchronous circuits. However, FPGA
architectures are optimised to implement synchronous circuits, which is the main disadvantage for
asynchronous circuits: the asynchronous processor used an enormous amount of resources and had
somewhat longer critical path delays than the synchronous case.
Kim [77] employs a similar approach to map NCL-based asynchronous circuits onto commercial FPGAs.
This study illustrated five different methods: (1) Verilog behavioural model: uses the Verilog
hardware description language to directly describe the behaviour of each NCL threshold gate in terms
of a gate module library; (2) Verilog LUT model: uses Verilog to create dedicated NCL-based LUT
functions; (3) Verilog UDP model: sets up the NCL threshold gates using Verilog user defined
primitives (UDP); (4) Verilog Boolean gate model: uses a Boolean structural netlist to describe the
target NCL design and (5) Schematic design model: comprises an initialisation stage, an input logic
stage and a hysteresis function in a basic schematic design style. Methods (1), (3) and (4) have no
limitations and can be implemented on any FPGA system or, indeed, any ASIC synthesisable from Verilog.
On the other hand, methods (2) and (5) rely on the structure of a specific FPGA or on the synthesis
design tool. Based on these approaches, an NCL-based full-adder occupies at least 44 LUTs on Xilinx-G
FPGA and a slightly smaller number (37 ALUTs) on an Altera Stratix-4 FPGA. As it takes 131 cores on
the Actel ProASIC3E FPGA to implement a full adder, that device is not as practical as the other two.
Kim has also concluded that the synthesis results of NCL-based circuits are much larger than those of
functionally equivalent synchronous circuits. These results are not directly comparable with purely
synchronous FPGA designs or even with other purely asynchronous reconfigurable designs.
It can be seen that each of these approaches targets existing synchronous FPGA systems that contain
no specific delay-insensitive functions, and it is clear that asynchronous implementations are
non-optimal on conventional FPGAs. Thus, there would be an advantage in providing specific support for
asynchronous systems.
2.2.2 Asynchronous Reconfigurable Systems 
The reconfigurable NCL system described in [48] is based on a logic element (LE) that can 
implement a small range of threshold gates (eight of the possible 27 fundamental threshold gates) plus 
their inverted functions and with reset and registered outputs. Each LE consists of 32 transistors and 
the cell layouts were demonstrated in a 0.5µm 2-layer metal silicon-on-insulator process. The range of 
threshold gates is insufficient to implement even a medium-sized circuit generated by the UNCLE 
toolset. Thus, it is not practical for use in real designs. 
The reconfigurable LE described in [78] uses a LUT organisation that is identical to that of a
conventional FPGA. A tree of pass transistors is used as a multiplexer to connect the gate inputs. The
minterms of a particular NCL function are enabled by configuring the corresponding configuration
register group with 14 latches. The LUT output then drives a single pull-down transistor that forms
the equivalent of the pull-down network shown in Figure 10. Since the pull-up/pull-down function
inverts the multiplexer output, the source input from the latches is inverted. The overall logic block
can generate the full 27 fundamental NCL gate functions, including resettable and inverting variations.
Interconnection is not discussed in that study, and the author recognises that a latch is required to
configure each wire connection when using single-wire routing, as is the case for standard
synchronous FPGAs. Moreover, a fundamental difference between the interconnect structures of
synchronous and asynchronous FPGAs is the asynchronous “forking” mechanism, which requires
delay-insensitive interconnects to either activate the sender or to use acknowledgement signals [79].
 
Figure 10 NCL-based 16-address LUT (portion) [76] 
NCL gates can be implemented in either a fully static or semi-static style. The latter uses a weak
feedback inverter to achieve latching behaviour (Figure 11), so only the set (Go-to-Data) and reset
(Go-to-Null) equations need to be explicitly implemented. Overall, static NCL gates can operate at
lower supply voltages than their semi-static counterparts and are suitable when speed and area are not
very critical [58]. The so-called “weak keeper” in the semi-static gate maintains static operation by
supplying charge lost through leakage at the output node of the array and is typically implemented
with long transistors (e.g., L:W in excess of 5:1), so that they are easily over-driven by the
“stronger” pull-up or pull-down circuits.
The diode-connected semi-static NCL gate was first proposed in [80]. It uses two minimum-sized
diode-connected transistors to limit the current and weaken the feedback inverter, as shown in
Figure 11. This style of semi-static NCL gate exhibits greater noise susceptibility, smaller area and
slightly more power consumption than its corresponding static gate. It can also be seen from Figure 11
that both PMOS pull-up circuits (Go-to-Null) are the same for all NCL logic functions. Thus, the design
problem here reduces to the organisation of the reconfigurable NMOS pull-down network.
                 
 
[Figure 11 schematic labels: inputs A, B, C; output Z; NMOS pull-down network (“Go to DATA”); “Go to NULL” pull-up; weak keeper]
Figure 11 NCL 3-Input Threshold Gate
a) Standard Semi-Static; b) Diode-connected Semi-Static.
A memristor LUT-based asynchronous nanowire reconfigurable crossbar architecture, one of the
NCL-based reconfigurable architecture studies, is proposed in [81]. The programmable gate macro
block (PGMB) is the primitive logic block, which can implement all 27 fundamental NCL threshold
gates with an 8×4 design. The 8×4 design, with all 4 inputs active, is the most practical when
compared to the 2×4 and 4×4 designs, which relate to 2 and 3 inputs. Moreover, the PGMB is
non-volatile by virtue of its memristor-crossbar structure. In the example shown in Figure 12, the 8×4
PGMB is programmed as a TH23 gate.
To perform the hysteresis function for state-holding behaviour in the NCL threshold gate, a Z* feedback
signal is connected to the output port to control the row Demux and Mux. When Z* = 1, the output
is asserted and can only be de-asserted if A+B+C = 0. In contrast, when Z* = 0, the output is asserted
only when at least 2 of the 3 inputs are logic ‘1’. Each dot represents a memristor crosspoint. Grey
dots are programmed as logic ‘0’, and black dots as logic ‘1’. The Boolean equation of the TH23 gate
expressed on the PGMB is: Z = A.B + A.C + B.C + Z*.(A+B+C). The PGMB is claimed to be easily
programmable at the crosspoints and to have high defect-tolerance. As an asynchronous reconfigurable
design, it might conceivably exhibit low power with good manufacturability, although this has not
yet been directly demonstrated. The number of I/O ports has a direct impact on the capacity of the
LUT, something that will be discussed in a later chapter.
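The TH23 equation above can be checked exhaustively against the behavioural description of the gate (set when at least two inputs are 1, reset only when all inputs are 0, hold otherwise). The short Python sketch below does this; the function names are hypothetical and the check is only a software illustration, not a model of the memristor crossbar.

```python
from itertools import product


def th23_equation(a, b, c, z_prev):
    """Z = A.B + A.C + B.C + Z*.(A + B + C), with z_prev standing for Z*."""
    return int((a and b) or (a and c) or (b and c) or (z_prev and (a or b or c)))


def th23_behaviour(a, b, c, z_prev):
    """Threshold-with-hysteresis description of a TH23 gate."""
    total = a + b + c
    if total >= 2:
        return 1       # threshold reached: output asserted
    if total == 0:
        return 0       # all inputs NULL: output de-asserted
    return z_prev      # exactly one input high: previous output is held


# Compare the two descriptions over all 16 combinations of inputs and state.
mismatches = [
    combo
    for combo in product((0, 1), repeat=4)
    if th23_equation(*combo) != th23_behaviour(*combo)
]
```

An empty mismatches list confirms that the Boolean equation and the hysteresis description agree for every input and feedback combination.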
 
Figure 12 TH23 Gate Realised on an 8×4 Programmable Gate Macro Block [81] 
As systems based on fine-grained LUTs, such as Montage [46], have to synthesise the required
state behaviour using configured feedback paths, all the studies mentioned above implicitly rely on
the quality of the “placement and routing” tools to ensure all delays are correctly matched. However,
the current challenge is the absence of suitable processes for mapping designs onto asynchronous
reconfigurable systems. It is clear that CAD toolsets and the corresponding asynchronous
reconfigurable architectures need to be re-evaluated for this domain.
2.2.3 FPGA CAD Flows 
As in [77, 81], some researchers have used commercial CAD tools that support specific devices (e.g.,
Altera [37], Xilinx [82]) to derive programmable circuit behaviours. But that, in itself, may not be
enough to answer such questions as “What is the performance of this novel architecture?” or “How can
the algorithms at different stages of the CAD flow be enhanced?”. To address these questions, VTR, an
open-source CAD research framework for modelling hypothetical FPGA architectures, is
introduced in [36]. The VTR CAD flow is modifiable and easy to start with, which enables further
research in the area of reconfigurable system design.
In contrast to tools such as the Intel® Quartus® FPGA Design Software, which is a commercial-ready
compact development tool including elaboration, synthesis, optimisation and timing analysis, VTR
represents a CAD flow that models and targets hypothetical devices. The VTR flow (Figure 13)
supports synchronous designs with multiple clocks in both timing analysis and optimisation. Several
companion tools are involved in the VTR flow. The first is the “Elaboration & Synthesis”
tool, Odin II [39], which elaborates and synthesises the Verilog RTL into a netlist of primitives.
“Logic Optimisation & Technology Mapping” of the netlist is completed by ABC [41], which
optimises the soft logic in the netlist and maps it to LUTs of the appropriate size and flip-flops.
Finally, “Packing, Placement, Routing & Timing Analysis” are done by VPR [36]. VPR implements the
placement and routing using a description file of the target FPGA architecture. It also analyses the
quality of the result (i.e., timing analysis, area estimation and power estimation). For
asynchronous logic design, Odin II and ABC, which are specifically designed for clocked systems,
will be replaced by UNCLE (Section 2.1.3) and Yosys [40]. Yosys is a Verilog HDL synthesis tool
that performs synthesis jobs by combining existing algorithms using synthesis scripts. It also
supports mapping to cell libraries and writing netlist files in BLIF format. This modification is
specific to the novel asynchronous reconfigurable architecture and will be discussed in the following
chapters.
 
Figure 13 Typical VTR CAD Flow [36] 
2.3 Network on Chip on Reconfigurable Systems 
The interconnection network within an FPGA device plays a significant role in the control of the
system data flow. Power consumption and latency of interconnections are important considerations,
which could be improved by employing an appropriate interconnect method and asynchronous logic
techniques. It has been previously shown that, compared to traditional interconnect methods, NoC is
highly scalable, reliable and modular when applied to reconfigurable architectures. NoC also
typically exhibits much better interconnect performance on FPGAs, with low transport latency and
lower energy consumption. For example, the white paper from Intel® [27] illustrates two scenarios. As
shown in Table 3 and Table 4 (re-drawn from [27]), the NoC with a 3 cycle network latency almost
doubles the Fmax of the traditional interconnect (254 MHz against 131 MHz) for about 11% more
resource usage, and with a 4 cycle network latency it reaches about 2.4 times the Fmax (314 MHz) at
the cost of roughly twice the ALM resource usage.
Table 3 16-Master/16-Slave system: Performance Results [27]
INTERCONNECT IMPLEMENTATION                        FMAX (MHZ)    RESOURCE USAGE (ALMS)
Traditional interconnect                           131           12766
Platform Designer NoC, fully combinational         161           13999
Platform Designer NoC, 1 cycle network latency     225           11260
Platform Designer NoC, 2 cycle network latency     243           12761
Platform Designer NoC, 3 cycle network latency     254           14206
Platform Designer NoC, 4 cycle network latency     314           26782

Table 4 4-Master/16-Slave burst- and width-adaptation system: Performance Results [27]
INTERCONNECT IMPLEMENTATION                        FMAX (MHZ)    RESOURCE USAGE (ALMS)
Traditional interconnect                           123           11658
Platform Designer NoC, fully combinational         125           9655
Platform Designer NoC, 1 cycle network latency     150           9423
Platform Designer NoC, 2 cycle network latency     164           9847
Platform Designer NoC, 3 cycle network latency     154           13156
Platform Designer NoC, 4 cycle network latency     171           16925
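The relative gains in Table 3 and Table 4 can be worked out directly from the listed figures. The Python sketch below does this arithmetic; the dictionary keys and the function name compare are hypothetical labels chosen for the example, and the numbers are copied from the two tables.

```python
# configuration -> (Fmax in MHz, resource usage in ALMs), copied from Table 3
TABLE_3 = {
    "traditional": (131, 12766),
    "noc_combinational": (161, 13999),
    "noc_1_cycle": (225, 11260),
    "noc_2_cycle": (243, 12761),
    "noc_3_cycle": (254, 14206),
    "noc_4_cycle": (314, 26782),
}

# Same layout, copied from Table 4
TABLE_4 = {
    "traditional": (123, 11658),
    "noc_combinational": (125, 9655),
    "noc_1_cycle": (150, 9423),
    "noc_2_cycle": (164, 9847),
    "noc_3_cycle": (154, 13156),
    "noc_4_cycle": (171, 16925),
}


def compare(table, config, baseline="traditional"):
    """Return (Fmax ratio, ALM ratio) of a configuration relative to the baseline."""
    fmax, alms = table[config]
    base_fmax, base_alms = table[baseline]
    return fmax / base_fmax, alms / base_alms


for label, table in (("Table 3", TABLE_3), ("Table 4", TABLE_4)):
    for config in table:
        speed, area = compare(table, config)
        print(f"{label} {config:18s} Fmax x{speed:.2f}  ALMs x{area:.2f}")
```

With these figures, the 3 cycle NoC in Table 3 runs at about 1.94 times the baseline Fmax for about 1.11 times the ALMs, while the 4 cycle NoC reaches about 2.40 times the Fmax for about 2.10 times the ALMs.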
 
Besides improved performance, the NoC architectures offer a number of other benefits to FPGA-
based implementations. Designers are able to focus on the transport layer design rather than changing 
the transaction layer to optimise the design. Application customisation need only work at the network 
layer to support packet transport. NoC supports multiple topologies in the same network and can 
make the development of new features much more manageable by applying the new features to the 
required layer rather than to the whole design. 
2.3.1 Network on Chip Router 
The primary component that carries a packet through the on-chip network is the NoC router. Thus, the 
characteristics of the NoC router mainly define the data throughput behaviour, transistor resource 
usage and energy consumption of interconnection. 
 
Figure 14 Router Block Diagram [28] 
Figure 14 shows the microarchitecture of a typical NoC router. At the input ports of this architecture, 
FIFO (first in first out) buffers hold the packets and send requests to the VC allocator and switch 
allocator. Once they receive permission to move forward, the data packets travel to their various 
destination output ports via the crossbar. Multiple virtual channels (VCs) are created in the network to 
avoid deadlocks and to allow a flexible allocation [83]. It can be seen from Figure 14 that the router 
microarchitecture comprises five sub-modules: input module, virtual-channel allocator, switch 
allocator, crossbar and output module. 
The NoC router employs four steps to route each packet from the input port to the selected output port
in the correct order.
1) Router computation: When a packet arrives at its input port, the router applies a routing
algorithm to determine the suitable output port and candidate output VCs. Each packet is divided
into fixed-size parts known as flits; only the head flit is applied to the algorithm, and the
remaining flits of the packet inherit their destination from the head flit.
2) VC allocation: In this stage, an exclusive output VC is assigned to the head flit of the
packet, which ensures that it will arrive at its selected output port. The remaining flits of the
packet inherit this allocation from the head flit.
3) Switch allocation: A crossbar timetable is created at this stage to ensure there is no
contention between different packets allocated to the same output port. The time slot is
specified at the router’s input port.
4) Switch traversal: In the last step, the packet travels through the crossbar module in the next
cycle. It arrives at the selected output port and continues towards the next router or the
destination element. An efficient crossbar design is important to maintain high performance
through the router.
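The four steps above can be sketched as a small behavioural model in Python. This is a simplified illustration under stated assumptions rather than the router of [28]: the names Flit, Router and route_table are hypothetical, routing is a plain table lookup, arbitration is fixed-priority in input-port order, and all four steps are collapsed into one call per cycle.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flit:
    packet_id: int
    is_head: bool = False
    is_tail: bool = False
    dest: Optional[int] = None  # only the head flit carries a destination


class Router:
    def __init__(self, route_table, num_vcs=2):
        self.route_table = route_table  # hypothetical lookup: destination -> output port
        self.num_vcs = num_vcs
        self.allocation = {}            # packet_id -> (output port, output VC)
        self.busy_vcs = set()           # (output port, VC) pairs currently owned

    def _allocate_vc(self, out_port):
        for vc in range(self.num_vcs):
            if (out_port, vc) not in self.busy_vcs:
                self.busy_vcs.add((out_port, vc))
                return vc
        return None  # no free VC: the head flit has to wait

    def cycle(self, input_fifos):
        """Process the flit at the front of each input FIFO for one cycle."""
        claimed = set()   # output ports already granted in this cycle
        traversed = []
        for fifo in input_fifos.values():
            if not fifo:
                continue
            flit = fifo[0]
            if flit.packet_id not in self.allocation:
                out_port = self.route_table[flit.dest]       # 1) route computation
                vc = self._allocate_vc(out_port)             # 2) VC allocation
                if vc is None:
                    continue
                self.allocation[flit.packet_id] = (out_port, vc)
            out_port, vc = self.allocation[flit.packet_id]   # body flits inherit
            if out_port in claimed:                          # 3) switch allocation
                continue
            claimed.add(out_port)
            fifo.popleft()
            traversed.append((out_port, vc, flit.packet_id)) # 4) switch traversal
            if flit.is_tail:
                self.busy_vcs.discard((out_port, vc))        # tail releases the VC
                del self.allocation[flit.packet_id]
        return traversed


# Two 2-flit packets compete for the same output port "E".
fifos = {
    "N": deque([Flit(0, is_head=True, dest=3), Flit(0, is_tail=True)]),
    "W": deque([Flit(1, is_head=True, dest=3), Flit(1, is_tail=True)]),
}
router = Router(route_table={3: "E"})
schedule = [router.cycle(fifos) for _ in range(4)]
```

In this example both packets obtain their own output VC, but only one flit per cycle is granted the crossbar path to port "E", so the second packet traverses after the first has finished.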
Much prior research has focused on enhancing the NoC router microarchitecture to reduce latency and 
power consumption [28, 84-89]. For example, Becker [28] indicated that router input buffer 
management is essential, and sharing buffer space among virtual channels by adaptive backpressure 
(ABP) can improve the network performance and reduce the cost. The open-source state-of-the-art 
VC router architecture from Becker’s dissertation, a full-featured router, is claimed to exhibit low 
latency behaviour on an FPGA. CONNECT, a single pipeline stage, tightly-coupled architecture built 
by Papamichael [90] is an FPGA-tuned router that offers an efficient approach for building NoCs 
within FPGA systems. Compared to Becker’s state-of-the-art VC router, it uses only a quarter of the 
Xilinx FPGA resources, and it is highly configurable. On the other hand, the state-of-the-art VC 
Router achieves a 60% higher frequency when implemented on an FPGA system [29]. Monemi [85] 
has introduced a two-clock-cycle pipelined wormhole virtual channel NoC router microarchitecture 
that combines router computation, VC allocation and switch allocation in the first pipeline
stage and switch traversal in the second stage. The data throughput of Monemi’s router architecture
was shown to be 20% to 35% higher than that of the CONNECT architecture.
The state-of-the-art VC Router is more appropriate on FPGA as it can efficiently traverse many 
network nodes with its low latency behaviour. Furthermore, the state-of-the-art VC router has a 
suitable circuit size for the further comparison between the synchronous and the asynchronous 
reconfigurable system. For this reason, it will be used later as the case study in this dissertation.  
2.3.2 Hard vs. Soft Router on FPGA 
When it comes to applying NoC in FPGA applications, there is always an argument about whether to
use a hard or a soft NoC router design. In [29], the size of the ASIC (“hard”) NoC router is around
2.2% of the overall Stratix III FPGA chip area, which is significantly smaller than the 43% of the
FPGA area needed for an equivalent reconfigurable (“soft”) implementation. Abdelfattah has also
identified that the ASIC NoC router on an FPGA can achieve the full speed of 943 MHz with hard-wire
connections, which is another massive benefit of the hard NoC compared to the reconfigurable case.
Table 5 Comparison of Different FPGA-Based NoCs by Stratix III [29]
(64-node, 32-bit flit data width, 2 VCs)
Routers        Links    Frequency (MHz)    Area (mm²)    Bandwidth (GB/s)    Area per BW (mm²/TBps)
Soft 64-NoC    Soft     167                269           54.4                4960
Hard 64-NoC    Soft     730                14.1          238                 59.4
Hard 64-NoC    Hard     943                11.3          307                 36.8
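The last column of Table 5 follows from the area and bandwidth columns, which the Python sketch below recomputes as a sanity check. The tuple layout and the helper name area_per_tbps are hypothetical; the numbers are copied from Table 5, and small differences from the listed values are assumed to come from rounding in the source.

```python
# (routers, links) -> (area in mm^2, bandwidth in GB/s, listed area per BW in mm^2/TBps)
TABLE_5 = {
    ("soft", "soft"): (269.0, 54.4, 4960.0),
    ("hard", "soft"): (14.1, 238.0, 59.4),
    ("hard", "hard"): (11.3, 307.0, 36.8),
}


def area_per_tbps(area_mm2, bandwidth_gbps):
    """Area per unit bandwidth in mm^2/TBps (1 TBps = 1000 GB/s)."""
    return area_mm2 / (bandwidth_gbps / 1000.0)


recomputed = {
    key: area_per_tbps(area, bw) for key, (area, bw, _listed) in TABLE_5.items()
}

# Ratio between the listed soft and fully hard implementations.
soft_vs_hard = TABLE_5[("soft", "soft")][2] / TABLE_5[("hard", "hard")][2]
```

The recomputed values agree with the listed column to within about 1%, and the listed figures put the soft NoC at roughly 135 times the area per unit bandwidth of the fully hard NoC.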
 
Although an ASIC NoC router on FPGA has a much better data throughput performance, we cannot 
overlook the reconfigurability provided by the soft NoC router. As the reconfigurable router is 
implemented by the IP blocks on FPGA without any architectural changes, the reconfigurable 
interconnect solution is an exclusive advantage that ASIC NoC routers with hardwire connection 
cannot offer. The soft NoC is scalable and can be optimised after the silicon design has been finalised. 
Many different topologies have been suggested for the soft interconnection between routers. In
addition, the hard NoC router can be area inefficient, because the cost of unused hard blocks must
always be included in the area of any network. Clearly, this is not the case with the soft NoC router,
as unused resources are simply not implemented in the network design [91].
2.3.3 Asynchronous Network on Chip 
To improve the data throughput of an NoC router, many researchers have contributed to the
development of effective router architectures, particularly through improvements to traditional
topologies and the application of soft or hard routers.
Hoskote’s study [92] indicated that around 28% of the power on the Intel Teraflop Research Chip is
consumed by the on-chip network (router + links), as shown in Figure 15, which means that the NoC
router architecture presents an opportunity to achieve lower power consumption. Further, clock
power takes up 33% of the total router power consumption, which is the most significant part of the
chart. It is clear that large clock buffer trees cause high power consumption, which leads us to the
question, “if the clock is eliminated, does a purely clock-less NoC router perform better on
a clock-less system?”
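Combining the two percentages quoted above gives a rough estimate of how much of the total chip power is spent on clocking inside the on-chip network. The Python lines below perform this multiplication; the variable names are hypothetical, and the result assumes that the 33% clock share applies to the whole 28% router + links figure.

```python
network_share_of_chip = 0.28   # router + links, as a fraction of chip power [92]
clock_share_of_router = 0.33   # clock power, as a fraction of router power [92]

# Fraction of total chip power attributable to clocking within the on-chip network.
router_clock_share_of_chip = network_share_of_chip * clock_share_of_router
print(f"{router_clock_share_of_chip:.1%} of chip power")
```

Under that assumption, roughly 9% of the chip power is the clocking of the on-chip network, which is the portion a clock-less router could in principle remove.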
 
Figure 15 Power Breakdown for Intel Teraflop Research Chip [92] 
[Figure 15 chart labels: 10-port RF 4%; IMEM + DMEM 21%; Clock Distribution 11%; Dual FPMACs 36%; Router + Links 28% (Clocking 9%, Links 5%, Other 14%)]
There are few research publications on asynchronous NoC to date. This section will discuss the
findings of previous asynchronous NoC studies.
Chain, which uses delay-insensitive data encoding combined with a return-to-zero signalling protocol,
was introduced by Bainbridge [93]. The design is based on delay-insensitive logic built from Muller
C-elements. With 0.35-micron VLSI technology, Chain delivers a single link throughput of around
700 Mbps, and its throughput exceeds 1 Gbps with 0.18-micron CMOS technology.
Bainbridge mentioned that Chain offers better throughput than a similar synchronous
interconnect operating at frequencies over 100 MHz, and that the latency improves with a broader group
of links. Also, Chain can connect with synchronous IP blocks in large-scale SoCs without facing the
problems of synchronous interconnect. Chain successfully avoids clock skew problems and
eliminates the need for post-layout timing adjustments.
The QDI timing model of [94] was used by Fulcrum Microsystems [95] to build Nexus, a GALS
interconnect solution. It features a 16-port, 36-bit asynchronous crossbar in a 130 nm process
technology. The system can achieve 1.35 GHz at 1.2 V, with an overall size of around 5 mm². The
latency is around 2 ns, and the energy consumption is 10.4 pJ per bit. Andrew suggests that Nexus is a
solution to the growing problems of global clock distribution and timing variability, and that the
system supports an easier migration from synchronous to asynchronous multi-core interconnections.
 
Figure 16 NoC 2D Mesh Architecture [89] 
In [89], Muller C-element gates and four-phase handshaking signals are applied in the asynchronous
NoC router design. The router is designed with five I/O ports (North, East, Local, South and West) for
use in a 2-D mesh-connected topology network (Figure 16). Each router is connected to four
others and a processing logic element. The asynchronous design is based on the Speed Independent
State Transition Graph model. This asynchronous router was synthesised using Synopsys Design
Compiler with a 0.35 µm standard cell library to compare it with a synchronous router of the same
functionality. Although the asynchronous router takes up a larger silicon area (4881 µm × 4881 µm)
than the synchronous case (4200 µm × 4200 µm) in a 5×5 2D mesh NoC, it offers a significantly
better data rate (80 Mflits/s) than its corresponding synchronous design (66 Mflits/s).
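The size of the trade-off reported in [89] can be quantified from the quoted figures. The Python lines below compute the relative data-rate gain and the relative area increase; the variable names are hypothetical, and because only the ratio of the side lengths is used, the result does not depend on the length unit.

```python
async_rate, sync_rate = 80.0, 66.0       # data rate in Mflits/s
async_side, sync_side = 4881.0, 4200.0   # side length of the square layout

rate_gain = async_rate / sync_rate - 1.0               # relative data-rate improvement
area_increase = (async_side / sync_side) ** 2 - 1.0    # relative silicon-area increase

print(f"data rate +{rate_gain:.0%}, area +{area_increase:.0%}")
```

On these figures the asynchronous router delivers about 21% more flits per second for about 35% more silicon area.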
Yaghini [87] compared an asynchronous router implemented in the CSP-Verilog language with a similar
synchronous version implemented in VHDL. The asynchronous NoC router architecture
employs a four-phase handshake protocol, and it is constructed using three sub-modules: the input
buffer, the crossbar switch and the routing unit. The synchronous NoC router architecture comprises
three similar sub-modules and uses a state machine to control the flit transfer timing. The main
difference between the two designs is the additional handshaking links, which make the
asynchronous router circuit more complex. Apart from that, the synchronous and asynchronous router
designs are very similar. Yaghini indicated that the synchronous router design consumed slightly less
dynamic power than the asynchronous case in the data flow component circuits (buffer and switch), as
more transitions occur when a flit passes through a component in the asynchronous scheme. However,
the control flow (routing unit) in the asynchronous router exhibits lower power consumption than in
the synchronous router design.
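The extra control activity of a four-phase handshake can be seen in a minimal Python model of a single link. This is a generic sketch of the four-phase (return-to-zero) protocol, not the router of [87]: the class FourPhaseLink and the signal names req and ack are hypothetical, and the receiver is assumed to respond immediately.

```python
class FourPhaseLink:
    """One sender/receiver pair using a four-phase (return-to-zero) handshake."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None
        self.received = []
        self.transitions = 0   # number of control-signal edges on req and ack

    def _drive(self, signal, value):
        if getattr(self, signal) != value:
            setattr(self, signal, value)
            self.transitions += 1

    def send(self, flit):
        if self.req or self.ack:
            raise RuntimeError("link must be idle before a new transfer")
        self.data = flit
        self._drive("req", 1)              # phase 1: sender signals valid data
        self.received.append(self.data)    # receiver latches the flit
        self._drive("ack", 1)              # phase 2: receiver acknowledges
        self._drive("req", 0)              # phase 3: sender returns to zero
        self._drive("ack", 0)              # phase 4: receiver returns to zero


link = FourPhaseLink()
for flit in ("head", "body", "tail"):
    link.send(flit)
```

Each flit costs four control transitions (two on req and two on ack) in addition to the data activity, which is consistent with the observation that more transitions occur per flit in the asynchronous scheme.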
Each study above shows that the asynchronous NoC router design can result in lower power
consumption and low latency in the on-chip system through handshaking behaviour. However, most
studies [87, 89, 93, 95-97] rely on synchronous design environments or devices, which are not purely
asynchronous. Furthermore, some asynchronous designs simply share the same structures as
synchronous designs, which makes them a simple conversion from synchronous to asynchronous. These
limitations arise because there are not many well-built synthesis tools or suitable clock-less
reconfigurable systems.
2.4 Summary 
This chapter has discussed the theory and tool flow for NCL. NCL exhibits quasi delay-insensitive
behaviour, which does not require assumptions about the delays of wires and gates. This can simplify
the design effort for asynchronous systems, which are otherwise typically difficult and time consuming
to design. Most clocked systems can be re-designed into an asynchronous NCL version by applying the
27 fundamental NCL threshold gates and NCL registers. UNCLE is one of the few existing synthesis
tools that can achieve this for NCL logic systems. The structural netlist of NCL gates generated by
UNCLE can be passed into the VTR CAD flow for “place and route” on
asynchronous reconfigurable systems. The NCL-based reconfigurable system would bring better
delay and power performance when compared to similar synchronous FPGA designs. As there has
been a lack of NCL reconfigurable systems on the market, the 45nm process dual-rail NCL LUT was
introduced, inspired by previous studies. Unlike the single-rail NCL LUT, the dual-rail NCL
LUT can maintain isochronic forks much more easily during the technology mapping process in
asynchronous reconfigurable environments. The VTR CAD flow is one of the options for modelling both
synchronous and asynchronous reconfigurable system designs, as variants of the VTR flow are
possible with other asynchronous tools. NoC was invented to be highly scalable, reliable and modular,
and it has better interconnect performance on FPGAs than the traditional interconnect methods, with
low transport latency and lower energy consumption. However, the synchronous NoC faces the
growing problems of global clock distribution and timing variability. The asynchronous NoC
may bring better performance to the reconfigurable system. A reconfigurable NoC router would,
therefore, be a good candidate to analyse and evaluate asynchronous reconfigurable systems.
Dual-Rail NCL Reconfigurable Logic Block 
 
 
 
As discussed in Section 2.1, asynchronous techniques such as NCL can tolerate wide variability 
arising from both manufacturing (dopant levels, line roughness, mask alignment errors, etc.) and 
environment (voltage, temperature, etc.). The intrinsically uncorrelated switching noise in 
asynchronous systems also tends to generate less electromagnetic interference. Lastly, as NCL is 
delay-insensitive and essentially correct-by-construction, it can be more straightforward to achieve a 
system using pre-built IP blocks, which can be connected in a “plug-and-play” manner, eliminating 
the problems related to synchronous timing closure. While some of these advantages are shared by 
asynchronous techniques in general, NCL systems exhibit specific behaviour that is self-determined, 
locally autonomous, self-synchronising, delay-insensitive and inherently fault-detecting.
Typical synchronous reconfigurable systems, such as FPGAs, tend not to be well-suited to 
asynchronous techniques, due to their fine-grained synchronous logic topologies. This chapter 
proposes and analyses a novel reconfigurable logic block that is intended to form the basis of an 
asynchronous reconfigurable system that directly supports NCL. The logic block can be configured to 
represent any of the 27 fundamental functions that have been defined for 2 to 4 inputs in NCL, and it 
is implemented in an advanced 28nm FDSOI CMOS process. 
  
3.1 NCL Look-Up-Table 
In a similar way to [78], the reconfigurable LUT network proposed in this thesis is based on the 
decomposition of the NCL functions. Nonetheless, instead of simply decomposing the function into 
its minterms, we start by recognising the fact that there are only 15 combinations of terms used in the 
27 NCL functions defined for four input values of both Z1 and Z0 (i.e., A, B, C, D, A.B, A.C, A.D, 
B.C, B.D, C.D, A.B.C, A.B.D, A.C.D, B.C.D, A.B.C.D). In this way, it is unnecessary to develop the
inverse of the input literals ($\overline{A}$, $\overline{B}$, $\overline{C}$, $\overline{D}$), the LUT is simplified, and the decoder tree can be
made more regular. Figure 17 shows the basic concept. The configuration memory lines (ConfABCD)
enable a particular term in the function. For example, the partial circuit of Figure 17 shows how the
two terms (A.B, A.C.D) can be configured together to form the Z function of the TH54w32 gate in
Table 1. As can be seen, the configuration memory simply enables a path to ground for that term in
the given function to form $\overline{A.B + A.C.D}$, which is then inverted by the output latch.
 
[Figure 17 labels: inputs A, B, C, D; configuration lines ConfAB and ConfACD; PMOS pull-up network; output latch; output Z; connection to more cells]

Figure 17 Partial NCL Gate Circuit
In [98], we begin with a simplified example of the single-rail approach to demonstrate this idea. These 
layouts were performed using the Cadence Virtuoso® design system based on a generic 45nm 6-metal, 
single-polysilicon bulk CMOS process. 
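
The term-enable principle of Figure 17 can be illustrated with a short behavioural sketch in Python. This is not the transistor-level circuit: the class name `ConfigurableNclGate`, its `step` method and the keyword-argument interface are hypothetical names introduced only for illustration, and the state-holding behaviour is modelled simply as "set when any enabled term is true, reset when all four inputs are low".

```python
# The 15 product terms used by the 27 NCL functions (no inverted literals needed).
TERMS = ["A", "B", "C", "D", "A.B", "A.C", "A.D", "B.C", "B.D", "C.D",
         "A.B.C", "A.B.D", "A.C.D", "B.C.D", "A.B.C.D"]


class ConfigurableNclGate:
    """Behavioural model (assumption): the output sets when an enabled term
    is true and resets only when all four inputs are low (hysteresis)."""

    def __init__(self, enabled_terms):
        unknown = set(enabled_terms) - set(TERMS)
        if unknown:
            raise ValueError(f"not one of the 15 NCL terms: {sorted(unknown)}")
        # Each enabled term plays the role of one configured path to ground.
        self.enabled = [term.split(".") for term in enabled_terms]
        self.z = 0

    def step(self, A=0, B=0, C=0, D=0):
        values = {"A": A, "B": B, "C": C, "D": D}
        if any(all(values[name] for name in term) for term in self.enabled):
            self.z = 1          # an enabled term conducts; the latch inverts to 1
        elif not any(values.values()):
            self.z = 0          # all inputs NULL: the pull-up network resets Z
        return self.z           # otherwise the previous value is held


# TH54w32 configured from the two terms named in the text: A.B and A.C.D.
th54w32 = ConfigurableNclGate(["A.B", "A.C.D"])
trace = [th54w32.step(A=1, C=1),
         th54w32.step(A=1, C=1, D=1),
         th54w32.step(A=1),
         th54w32.step()]
print(trace)
```

The third call shows the hold behaviour: once set, the output stays asserted until every input has returned to NULL.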
Logic array cells of this type have always tended to exhibit very regular layouts and for that reason 
were amongst the first structures to be algorithmically generated. However, design for manufacture 
(DFM) considerations at deep-submicron feature sizes mandate a number of additional layout rules 
that force even greater regularity. Put simply, even small deviations from regular repetitions of 
particular features will result in manufacturing difficulties. For example, a “classical” method of 
creating regular logic array structures (PLAs) of this type is to run (vertical) lines of metal and 
provide “stubs” of polysilicon where it is necessary to create a pass-gate. However, this layout 
technique is not suitable at or below 45nm and would require additional “dummy” stubs of polysilicon 
and/or metal 1 (M1) to be inserted to fill the empty space on that layer in order to correct the resulting 
spatial modulation and to simplify the following oxide growth and planarization stages in manufacture. 
 
[Figure 18 labels: inputs A1, B1, C1, D1 with VDD tie-high lines between them; configuration pin; PMOS pull-up network (x2); output latches; outputs Z1 and Z0; NACK; connection to more cells]

Figure 18 Simplified LUT Organisation
As shown in Figure 18, this work chose to set up each term as an identical four transistor path and 
directly tie the unused transistors to VDD (i.e., turning them on permanently). In practice, this is 
achieved by inserting additional lines pulled up to VDD (using an appropriate “tie-high” gate from the 
library) between each input line (A, B, C, D at the top of the simplified LUT) and choosing the 
appropriate connection to the gate stub. This technique has the added advantage that all paths to
ground become identical, which makes it easy to adjust the width/length ratio of the LUT transistors
to optimise its propagation delay. At the same time, it makes it easier to control the signal skew
(i.e. the variation in propagation delay) between the various outputs, as required by the isochronic
timing requirement of NCL. The technique has resulted in signal skews of less than 50ps for this
specific 45nm process. This certainly represents a lower bound on signal skew, which is likely to be
higher and more variable in a fully commercial process. Adding configuration registers to the LUT
brings its overall size to approximately 21µm × 28µm (≅ 590 µm²).
The requirements for maintaining isochronic forks in NCL place significant constraints on the place 
and route process in both FPGA and ASIC environments. One key reason why asynchronous logic 
has rarely been implemented successfully on standard FPGA parts is the widely variable skew caused 
by divergent routing paths, and the consequent need to identify and constrain these individual paths. 
Applying these timing constraints within a conventional FPGA environment represents a significant 
amount of work and is not always entirely successful [99]. 
The simplified organisation shown in Figure 18 can be extended to handle the full dual-rail signals by 
including the remaining input signals (A0, B0, C0, D0) and connecting these to the corresponding 
output line (Z0, in this case). However, it is not necessary for the dual-rail implementation to be 
constrained to be a simple doubling of the single-rail case. Instead, we may take advantage of the 
extra logic available to extend the functionality of the block.  
For example, Figure 1 (see page 5 in Chapter 1) illustrates two critical functions required by NCL 
systems: asynchronous “registers” and completion detection logic (NACK). These form the basis of a 
handshaking mechanism that controls dataflow through the NCL paths. NCL registers are formed 
from an array of TH22 gates, eight TH22 gates in this case, where one input is derived from the 
completion signal from following NCL logic. Implementing these with an array of single-rail 
reconfigurable gates would result in an extremely inefficient implementation. One option would be to 
include these specific register structures as part of the LUT cell, to be enabled or disabled as required. 
We have taken an alternative approach that enhances the connectivity of the LUT structure so that
individual terms (A, B, C, D) can be combined with the completion detection (NACK) signal to create
these register structures within a single LUT. The NACK ($\overline{\mathrm{acknowledge}}$) signal can be implemented
as a 4-of-8 threshold gate with inversion. The objective is to detect when all active input rails are
delivering data and to send this completion signal to the previous stage. Similarly, the asynchronous
registers for this stage are controlled by the completion detection signal from the following stage.
As NCL signals are defined in terms of three values (two data and null), multi-bit rails are required to 
transfer each signal. In this case, this work assumes that a single NCL signal is transferred on a 2-bit 
rail, comprising a ‘0’ and a ‘1’ line (e.g., Z1, Z0 and Cout1, Cout0 of the example in Figure 19). A 
null is defined when the signals are both zero.  
 
Figure 19 Example NCL Full Adder Circuit 
As shown in Figure 19, two TH23 gates and two TH34w2 gates can assert outputs Z and Cout with 2-
bit rails. The Boolean logic equations are:  
Equation 1 NCL full adder equivalent Boolean equations 
 
Cout1 = X1.Y1 + X1.Cin1 + Y1.Cin1  (TH23)
Cout0 = X0.Y0 + X0.Cin0 + Y0.Cin0  (TH23)
Z1 = Cout0.X1 + Cout0.Y1 + Cout0.Cin1 + X1.Y1.Cin1  (TH34w2)
Z0 = Cout1.X0 + Cout1.Y0 + Cout1.Cin0 + X0.Y0.Cin0  (TH34w2)
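
The behaviour described by Equation 1 can be explored with a small Python model. It is a sketch under stated assumptions: each threshold gate is modelled as "set when the weighted input sum reaches the threshold, reset only when all inputs are low", gate delays are ignored, and the names `ThresholdGate`, `NclFullAdder`, `encode` and `decode` are hypothetical helpers introduced here rather than part of any tool mentioned in this thesis.

```python
from itertools import product


class ThresholdGate:
    """THmn gate with hysteresis (behavioural assumption, no timing)."""

    def __init__(self, threshold, weights):
        self.threshold = threshold
        self.weights = weights
        self.out = 0

    def step(self, *inputs):
        if sum(w * x for w, x in zip(self.weights, inputs)) >= self.threshold:
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out  # otherwise hold the previous value


def encode(bit):
    """Map None/0/1 to (rail1, rail0); None stands for NULL."""
    if bit is None:
        return (0, 0)
    return (1, 0) if bit else (0, 1)


def decode(rail1, rail0):
    if rail1 and rail0:
        raise ValueError("both rails asserted: illegal dual-rail state")
    if not rail1 and not rail0:
        return None
    return 1 if rail1 else 0


class NclFullAdder:
    """Two TH23 gates for Cout and two TH34w2 gates for Z, as in Equation 1."""

    def __init__(self):
        self.cout1 = ThresholdGate(2, (1, 1, 1))
        self.cout0 = ThresholdGate(2, (1, 1, 1))
        self.z1 = ThresholdGate(3, (2, 1, 1, 1))  # first input carries weight 2
        self.z0 = ThresholdGate(3, (2, 1, 1, 1))

    def step(self, x, y, cin):
        (x1, x0), (y1, y0), (c1, c0) = encode(x), encode(y), encode(cin)
        cout1 = self.cout1.step(x1, y1, c1)
        cout0 = self.cout0.step(x0, y0, c0)
        z1 = self.z1.step(cout0, x1, y1, c1)
        z0 = self.z0.step(cout1, x0, y0, c0)
        return decode(z1, z0), decode(cout1, cout0)


adder = NclFullAdder()
results = {}
for x, y, cin in product((0, 1), repeat=3):
    # Every DATA wavefront is preceded by a NULL wavefront.
    assert adder.step(None, None, None) == (None, None)
    results[(x, y, cin)] = adder.step(x, y, cin)  # (sum, carry)
print(results)
```

Alternating NULL and DATA wavefronts in the loop mirrors the handshaking cycle; the model checks only the logic function, not the skew or delay properties discussed elsewhere in this chapter.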
 
As both rails can never be asserted simultaneously (Cout0 = $\overline{\mathrm{Cout1}}$, Z0 = $\overline{\mathrm{Z1}}$), Table 1 (see page
18 in Chapter 2) can be redesigned into Table 6 as dual-rail NCL function macros.
Table 6 Dual Rail 4-Variable NCL Macros
THmn Gate | Boolean Function Z1 | Boolean Function Z0
TH12 | A1+B1 | A0.B0
TH22 | A1.B1 | A0+B0
TH13 | A1+B1+C1 | A0.B0.C0
TH23 | A1.B1+A1.C1+B1.C1 | A0.B0+A0.C0+B0.C0
TH33 | A1.B1.C1 | A0+B0+C0
TH23w2 | A1+B1.C1 | A0.B0+A0.C0
TH33w2 | A1.B1+A1.C1 | A0+B0.C0
TH14 | A1+B1+C1+D1 | A0.B0.C0.D0
TH24 | A1.B1+A1.C1+A1.D1+B1.C1+B1.D1+C1.D1 | A0.B0.C0+A0.B0.D0+A0.C0.D0+B0.C0.D0
TH34 | A1.B1.C1+A1.B1.D1+A1.C1.D1+B1.C1.D1 | A0.B0+A0.C0+A0.D0+B0.C0+B0.D0+C0.D0
TH44 | A1.B1.C1.D1 | A0+B0+C0+D0
TH24w2 | A1+B1.C1+B1.D1+C1.D1 | A0.B0.C0+A0.B0.D0+A0.C0.D0
TH34w2 | A1.B1+A1.C1+A1.D1+B1.C1.D1 | A0.B0+A0.C0+A0.D0+B0.C0.D0
TH44w2 | A1.B1.C1+A1.B1.D1+A1.C1.D1 | A0+B0.C0+B0.D0+C0.D0
TH34w3 | A1+B1.C1.D1 | A0.B0+A0.C0+A0.D0
TH44w3 | A1.B1+A1.C1+A1.D1 | A0+B0.C0.D0
TH24w22 | A1+B1+C1.D1 | A0.B0.C0+A0.B0.D0
TH34w22 | A1.B1+A1.C1+A1.D1+B1.C1+B1.D1 | A0.B0+A0.C0.D0+B0.C0.D0
TH44w22 | A1.B1+A1.C1.D1+B1.C1.D1 | A0.B0+A0.C0+A0.D0+B0.C0+B0.D0
TH54w22 | A1.B1.C1+A1.B1.D1 | A0+B0+C0.D0
TH34w32 | A1+B1.C1+B1.D1 | A0.B0+A0.C0.D0
TH54w32 | A1.B1+A1.C1.D1 | A0+B0.C0+B0.D0
TH44w322 | A1.B1+A1.C1+A1.D1+B1.C1 | A0.B0+A0.C0+B0.C0.D0
TH54w322 | A1.B1+A1.C1+B1.C1.D1 | A0.B0+A0.C0+A0.D0+B0.C0
THxor0 | A1.B1+C1.D1 | A0.C0+A0.D0+B0.C0+B0.D0
THand0 | A1.B1+A1.D1+B1.C1 | A0.B0+A0.C0+B0.D0
TH24comp | A1.C1+A1.D1+B1.C1+B1.D1 | A0.B0+C0.D0
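
The complementary relationship between the Z1 and Z0 columns of Table 6 can be checked mechanically. The Python sketch below transcribes the table as sum-of-products strings and tests, for every valid DATA input (where each 0-rail is the inverse of its 1-rail), that exactly one of Z1 and Z0 is asserted. The helper names `evaluate` and `complementary` are illustrative only, and the check covers the combinational functions, not the hysteresis of the gates.

```python
from itertools import product

# (gate, Z1, Z0) transcribed from Table 6; '.' is AND, '+' is OR.
TABLE6 = [
    ("TH12", "A1+B1", "A0.B0"),
    ("TH22", "A1.B1", "A0+B0"),
    ("TH13", "A1+B1+C1", "A0.B0.C0"),
    ("TH23", "A1.B1+A1.C1+B1.C1", "A0.B0+A0.C0+B0.C0"),
    ("TH33", "A1.B1.C1", "A0+B0+C0"),
    ("TH23w2", "A1+B1.C1", "A0.B0+A0.C0"),
    ("TH33w2", "A1.B1+A1.C1", "A0+B0.C0"),
    ("TH14", "A1+B1+C1+D1", "A0.B0.C0.D0"),
    ("TH24", "A1.B1+A1.C1+A1.D1+B1.C1+B1.D1+C1.D1",
     "A0.B0.C0+A0.B0.D0+A0.C0.D0+B0.C0.D0"),
    ("TH34", "A1.B1.C1+A1.B1.D1+A1.C1.D1+B1.C1.D1",
     "A0.B0+A0.C0+A0.D0+B0.C0+B0.D0+C0.D0"),
    ("TH44", "A1.B1.C1.D1", "A0+B0+C0+D0"),
    ("TH24w2", "A1+B1.C1+B1.D1+C1.D1", "A0.B0.C0+A0.B0.D0+A0.C0.D0"),
    ("TH34w2", "A1.B1+A1.C1+A1.D1+B1.C1.D1", "A0.B0+A0.C0+A0.D0+B0.C0.D0"),
    ("TH44w2", "A1.B1.C1+A1.B1.D1+A1.C1.D1", "A0+B0.C0+B0.D0+C0.D0"),
    ("TH34w3", "A1+B1.C1.D1", "A0.B0+A0.C0+A0.D0"),
    ("TH44w3", "A1.B1+A1.C1+A1.D1", "A0+B0.C0.D0"),
    ("TH24w22", "A1+B1+C1.D1", "A0.B0.C0+A0.B0.D0"),
    ("TH34w22", "A1.B1+A1.C1+A1.D1+B1.C1+B1.D1", "A0.B0+A0.C0.D0+B0.C0.D0"),
    ("TH44w22", "A1.B1+A1.C1.D1+B1.C1.D1", "A0.B0+A0.C0+A0.D0+B0.C0+B0.D0"),
    ("TH54w22", "A1.B1.C1+A1.B1.D1", "A0+B0+C0.D0"),
    ("TH34w32", "A1+B1.C1+B1.D1", "A0.B0+A0.C0.D0"),
    ("TH54w32", "A1.B1+A1.C1.D1", "A0+B0.C0+B0.D0"),
    ("TH44w322", "A1.B1+A1.C1+A1.D1+B1.C1", "A0.B0+A0.C0+B0.C0.D0"),
    ("TH54w322", "A1.B1+A1.C1+B1.C1.D1", "A0.B0+A0.C0+A0.D0+B0.C0"),
    ("THxor0", "A1.B1+C1.D1", "A0.C0+A0.D0+B0.C0+B0.D0"),
    ("THand0", "A1.B1+A1.D1+B1.C1", "A0.B0+A0.C0+B0.D0"),
    ("TH24comp", "A1.C1+A1.D1+B1.C1+B1.D1", "A0.B0+C0.D0"),
]


def evaluate(expr, rails):
    """Evaluate a sum-of-products string on a dict of rail values."""
    return int(any(all(rails[lit] for lit in term.split("."))
                   for term in expr.split("+")))


def complementary(z1_expr, z0_expr):
    """True if exactly one of Z1/Z0 is asserted for every DATA input."""
    for bits in product((0, 1), repeat=4):
        rails = {}
        for name, bit in zip("ABCD", bits):
            rails[name + "1"] = bit
            rails[name + "0"] = 1 - bit
        if evaluate(z1_expr, rails) + evaluate(z0_expr, rails) != 1:
            return False
    return True


mismatches = [gate for gate, z1, z0 in TABLE6 if not complementary(z1, z0)]
print(len(TABLE6), "gates checked; mismatches:", mismatches)
```

A check of this kind is a convenient guard against transcription slips when the macro table is re-entered into a synthesis library.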
 
Figure 20 Dual-rail LUT Layout [100] 
Figure 20 illustrates a trial layout of the complete LUT, which was developed using a generic 45nm
process³. This layout shows a number of the critical characteristics of the LUT cell, including its
overall regularity and the fact that the configuration switches occupy more than half of the total LUT
area. The overall dimensions of the block are approximately 28µm × 10.5µm. The average static
power for the LUT, based on the 45nm planar CMOS process, was in the region of 900nW.
                                                     
3 This work has been reported in [100].  
 
Figure 21 NCL Look-Up-Table Schematic 
These experiments were then repeated using a more recent, 28nm FDSOI CMOS process. The results 
for this process are reported in the remainder of the thesis. At the same time, the LUT block (Figure 
21) was further improved with a better completion detection circuit. The I/O interface encompasses 
four sets of dual-rail inputs and four sets of dual-rail outputs, with a single acknowledge signal input 
and one feedback signal output. 
 
Figure 22 PMOS Pull-up Network “Go-to-Null” 
The “Go-to-Null” sub-module (Figure 22) is built using 32 regular P-type MOSFET transistors
(w=500n, l=30n). Each term comprises four PMOS transistor switches, controlled by the input lines,
to maintain isochronic forks. The output is asserted as null only when all
connected PMOS switches are ‘on’, i.e. when the corresponding inputs
are logical low (not data), so that the node is pulled up to the supply voltage. The intermediate
terms that control outputs Z0 and Z1 are shown on the
Spectre simulation waveform of Figure 23 (see net89, net90, net96, net0392 below). The pull-up
network characteristics can be seen in the figure, as the voltage on each can be driven high only if all
four inputs are low.
 
Figure 23 Spectre simulation of pull-up network (TH34w2) 
In Table 1 (see page 18 in Chapter 2), only 15 element terms (A, B, C, D, A.B, A.C, A.D,
B.C, B.D, C.D, A.B.C, A.B.D, A.C.D, B.C.D, A.B.C.D) are used to form the 27
different functional NCL threshold gates; these expand to 30 element terms in the dual-rail
circuit (Table 6). In Figure 24, the NMOS pull-down network shows that each term comprises
four NMOS transistor switches. The specific switch in use is controlled by one of the inputs and the
remaining unused switches are tied to VDD (i.e., turned on permanently). This array of 120 NMOS
transistors creates the “Go-to-Data” module. There are also 120 regular N-type MOSFETs (w=300n, l=30n),
known as configuration switches, which are controlled by a dynamic latch array with a second static
“shadow” latch that maintains the configuration state; this will be introduced in the next section. The
selected terms are then forwarded to the selected output port through the corresponding
configuration switches. The signals of terms A1, B1, C1, D1, …, A1.B1.C1.D1 (data1) can arrive at
output ports Z1, Z3, Z5, and Z7. The signals of terms A0, B0, C0, D0, …, A0.B0.C0.D0 (data0) arrive at
the output ports Z0, Z2, Z4, and Z6. Terms A0B0 and A1B1, which are configured to be active in the
TH34w2 gate, can be seen on the Spectre simulation waveform of Figure 25. The pull-down network
characteristics are presented in this figure as well.
 
Figure 24 NMOS Pull-down Network “Go-to-Data” 
 
Figure 25 Spectre simulation of pull-down network (TH34w2) 
In the output interface, in addition to the eight data output ports that represent the NCL combinational
gate outputs, an additional feedback output delivers the completion detection signal. All outputs are
state-holding as a result of a semi-static implementation with a “diode-connected weak inverter”, as
shown in Figure 26.
 
 
Figure 26 NCL LUT Semi-static Output 
The completion detection module is designed to detect the condition of all output ports for the
feedback function of the embedded register. In Figure 27, the completion detection module is
divided into two parts, each with four sub-modules to monitor each pair of dual-rail outputs. In Figure
27a, a group of 12 regular PMOS transistors (w=500n, l=30n) is organised as a parallel circuit to
generate the “Request-for-Null” (logical low) signal, which is triggered when any output is asserted as data.
In contrast, Figure 27b shows a group of 12 regular NMOS transistors (w=600n, l=30n) that generate
the “Request-for-Data” (logical high) signal when the active outputs are all asserted as null. Unused
detection sub-modules are permanently ‘on’, shorted by the rf switches under the control of the
configuration latch array.
 
a) PMOS Request-for-NULL 
 
b) NMOS Request-for-DATA 
Figure 27 Completion Detection Signal Generation
 
3.2 Programming Bits and Shift Register 
To configure a reconfigurable system, a LUT mask is required to store the programming values and
configure the LUT output. Smith designed the LUT with 14 latches to be programmed by a Dp[14:1]
configuration code [78]. The Shift Register [101], as shown in Figure 28, contains 130 master-slave
flip-flops that connect to their corresponding 130 configuration switches on the NCL LUT. The NCL
LUT implements the required NCL threshold gate according to the configuration memory stored in
the shift register. A packet of all 130 configuration bits is generated as a serial code in the initial block
and shifted in under the control of a clock. The shift register operates during the pre-programming
(configuration) phase to hold the configuration bits. After this time, all selected configuration
switches are turned on, and the LUT is then configured to be one of the 27 threshold gates in Table 6
(see page 46).
 
Figure 28 Shift Register Configuration Memory (portion) and Master-Slave Flip-Flop 
SER_IN [129:0] is defined as a bus that delivers the 130-bit configuration packet during the configuration
phase. SER_IN [1:0] control the ack signal configuration switches, and SER_IN [9:2] control the unused
sub-modules in the completion detection module. Based on the setting of the ackin signal switches and
completion detection switches, the LUT can be configured as a logic element with or without
embedded registration. SER_IN [129:10] are the threshold gate programming bits that select the logic term
outputs. These 120 configuration bits can be divided into 15 groups that match the 15 logic
terms, and each group controls one of the dual-rail term elements; for example, SER_IN[17:10]
are the signals that configure the A1 and A0 term outputs.
For example, the TH34w2 gate can be represented by the programming bits shown in Figure 29.

Figure 29 TH34w2 Programming Bits.
For the SER_IN code, logic 1 turns the related switches ‘on’ and logic 0 turns them ‘off’.
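
The field layout of the 130-bit SER_IN word described above can be written down as a small Python sketch. Only the three field boundaries ([1:0], [9:2], [129:10]) and the position of the A group ([17:10]) come from the text. The ordering of the remaining 14 term groups (assumed here to follow the order in which the terms are listed in this chapter) and the helper name `term_field` are assumptions made for illustration; the meaning of the individual bits inside each 8-bit group is not modelled.

```python
# The 15 logic terms, in the order they are listed in the text (assumed order
# of the groups within SER_IN[129:10]; only the A group is confirmed).
TERMS = ["A", "B", "C", "D", "A.B", "A.C", "A.D", "B.C", "B.D", "C.D",
         "A.B.C", "A.B.D", "A.C.D", "B.C.D", "A.B.C.D"]

ACK_FIELD = (1, 0)          # SER_IN[1:0]: ack signal configuration switches
COMPLETION_FIELD = (9, 2)   # SER_IN[9:2]: completion detection sub-modules
TERM_BASE = 10              # first bit of the threshold gate programming bits
BITS_PER_TERM = 8           # one group per term, covering both rails
TOTAL_BITS = TERM_BASE + BITS_PER_TERM * len(TERMS)


def term_field(term):
    """Return (msb, lsb) of the SER_IN group assumed to control `term`."""
    index = TERMS.index(term)
    lsb = TERM_BASE + BITS_PER_TERM * index
    return lsb + BITS_PER_TERM - 1, lsb


print(TOTAL_BITS, term_field("A"), term_field("A.B.C.D"))
```

Under these assumptions the 15 groups of 8 bits account for the 120 term bits, which together with the 2 ack bits and 8 completion detection bits give the 130 bits held in the shift register.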
 
3.3 NCL Reconfigurable Logic Block Layout Area 
The LUT block has 316 transistors, and the shift register has 1040 transistors. In [98], we
developed a complete NRLB layout, including both the LUT block and the shift register, using Cadence®
Virtuoso® with GPDK045 process technology. The total area of the NRLB layout in the 45nm CMOS
process technology is 26.1955µm × 43.356µm ≈ 1135.71 µm². From [102], the area scaling factor
from 45nm to 28nm is 0.46, so the NRLB area is estimated to be 522.43 µm² when scaled to the 28nm
CMOS process.
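
The area arithmetic of this section can be reproduced with a few lines of Python. The sketch uses only numbers quoted in this section: the layout dimensions, the 0.46 scaling factor from [102], and the 45nm minimum width transistor area of 29.952 × 10⁻³ µm² that is derived from Figure 31 later in the section. The variable names are illustrative.

```python
WIDTH_UM, HEIGHT_UM = 26.1955, 43.356   # NRLB layout dimensions at 45nm
AREA_45NM_UM2 = 1135.71                 # area as quoted in the text
SCALE_45_TO_28 = 0.46                   # area scaling factor from [102]
MIN_WIDTH_TRANSISTOR_UM2 = 29.952e-3    # 45nm minimum width transistor area

layout_area = WIDTH_UM * HEIGHT_UM                  # product of the two sides
area_28nm = AREA_45NM_UM2 * SCALE_45_TO_28          # estimated 28nm area
mwtu = AREA_45NM_UM2 / MIN_WIDTH_TRANSISTOR_UM2     # minimum width transistor units

print(f"layout area  : {layout_area:.2f} um^2")
print(f"28nm estimate: {area_28nm:.2f} um^2")
print(f"MWTU count   : {mwtu:.1f}")
```

The product of the two sides agrees with the quoted 1135.71 µm² to within rounding, and the MWTU quotient truncates to the 37917 units used in the rest of the thesis.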
 
Figure 30 Lookup Table and its Shift Register Layout 
 
Figure 31 Minimum Width Transistor Area [103] 
From Figure 31, the minimum width transistor area can be expressed as
Equation 2 The minimum width transistor area equation

$$16\lambda \times 13\lambda = 208\lambda^2$$

In the 45nm CMOS process, the minimum transistor width is 120nm. Therefore, the minimum width
transistor area of the 45nm CMOS process technology is 29.952 × 10⁻³ µm².
In conclusion, the area of the NRLB is

$$1135.71\ \mu\mathrm{m}^2 \div (29.952 \times 10^{-3}\ \mu\mathrm{m}^2) \approx 37917\ \mathrm{MWTUs}$$

In the open-source academic CAD tool VPR (Versatile Place and Route) [36], all logic block areas are
described in minimum width transistor units (MWTU). This area unit will be employed as the
default unit for architectural area evaluation in the rest of the thesis.
3.4 NCL Reconfigurable Logic Block Delay and Power Analysis
The schematic design of the NRLB has been used to evaluate the representative propagation delay and
static power consumption figures for the block. In this section, the NRLB is configured to be a
TH34w2 gate, which is one example of the configured cell from Section 3.2. It is simulated by the
Cadence® Spectre® AMS simulator with models of a 28 nm FDSOI CMOS process. The supply voltage is
set to 1V, the typical value for this process. The configuration phase, which takes around 260ns, is not 
shown in the waveform (Figure 32). The SER_IN configuration code is derived from Figure 29. At 
the end of this phase, each configuration value is held by its corresponding master-slave flip-flop. 
CLOCK3 triggers the connection latches between the shift register and the LUT. All programming 
bits are delivered simultaneously to their respective configuration switches on the LUT to select the 
correct terms. At this point, this NRLB has been set up to implement the dual-rail TH34w2 gate.

Z1 = A1.B1+A1.C1+A1.D1+B1.C1.D1
Z0 = A0.B0+A0.C0+A0.D0+B0.C0.D0

At the 271.5ns point on the waveform, both A and B are asserted as data1 (A1 is high, A0 is low, B1 is
high, B0 is low), and both C and D are asserted as data0. As the term A.B = 1, the output Z is asserted
as data1 (Z1 = 1, Z0 = 0). At the same time, the feedback is pulled to logical low to become
“Request-for-Null”. Slightly later, at 272.5ns, all signals from the input ports are pulled down to 0 (null), and
feedback becomes logical high to represent “Request-for-Data” until the next “Go-to-Data” event
happens.
In Table 7, all propagation delays for the NRLB are presented in groups of active input transitions. The
symbol “↑” indicates the propagation delay from the input ports to the output ports during the output
voltage pull-up (0 to 1) transition, and “↓” indicates the corresponding pull-down (1 to 0) propagation
delay, as shown in Figure 32. The first group of single input transitions shows that each term
has an almost identical “Go-to-Data” transition time and “Go-to-Null” transition time. Thus, the total
cycle times (“Average ↑” + “Average ↓”) are close to each other as well. The small differences are
most likely caused by the specific position of the active transistor in the array causing small variations
in the way that the intermediate nodes charge and discharge. This hypothesis is reinforced by the
observation that transitions on the D input have the smallest impact on the propagation delay. The D
input drives the transistor that is the closest one to the output port. In contrast, the transistor connected
to input A is the furthermost one, and the transition times of A vary more widely in response to the
different combinations of terms.
 
Figure 32 Spectre Simulation of NRLB (TH34w2) 
The second and the third groups have two and three non-simultaneous active input transitions, with the
same average total cycle time of 206ps, average ↑ transition time of 88ps and average ↓ transition
time of 118ps. The last group of four active input transitions has mid-range transition times and
cycle time. Overall, the NRLB has an average “Go-to-Data” time of 90ps, an average “Go-to-Null”
time of 119ps and an average cycle time of 209ps.
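
The group averages quoted above can be reproduced from the “Average” and “Total Cycle” columns of Table 7. The Python sketch below copies those three columns for each row and averages them per group and overall; the dictionary layout and the helper name `group_average` are illustrative choices, and results are rounded to the nearest picosecond.

```python
from statistics import mean

# Active inputs -> (average rise, average fall, total cycle), in ps, from Table 7.
ROWS = {
    "A": (95, 127, 222), "B": (95, 125, 220), "C": (95, 124, 219), "D": (93, 114, 207),
    "AB": (93, 127, 220), "BC": (92, 125, 217), "CD": (90, 116, 206),
    "AC": (87, 120, 207), "BD": (85, 112, 197), "AD": (82, 108, 190),
    "ABC": (92, 126, 218), "BCD": (90, 119, 209), "ACD": (86, 115, 201),
    "ABD": (83, 112, 195),
    "ABCD": (90, 121, 211),
}


def group_average(active_inputs):
    """Average (rise, fall, cycle) over rows with the given number of active inputs."""
    rows = [values for name, values in ROWS.items() if len(name) == active_inputs]
    return tuple(round(mean(column)) for column in zip(*rows))


overall = tuple(round(mean(column)) for column in zip(*ROWS.values()))
for n in (1, 2, 3, 4):
    print(n, group_average(n))
print("overall", overall)
```

Keeping the table values in one structure makes it easy to recompute the averages if individual delays are re-simulated.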
Table 7 Propagation Delays for Reconfigurable Block vs. Input Transitions (ps)
Active inputs | A↑ | A↓ | B↑ | B↓ | C↑ | C↓ | D↑ | D↓ | Average ↑ | Average ↓ | Total Cycle | Ave. Delay
A | 95 | 127 | - | - | - | - | - | - | 95 | 127 | 222 | 111
B | - | - | 95 | 125 | - | - | - | - | 95 | 125 | 220 | 110
C | - | - | - | - | 95 | 124 | - | - | 95 | 124 | 219 | 109
D | - | - | - | - | - | - | 93 | 114 | 93 | 114 | 207 | 104

AB | 92 | 127 | 95 | 128 | - | - | - | - | 93 | 127 | 220 | 110
BC | - | - | 89 | 125 | 94 | 125 | - | - | 92 | 125 | 217 | 108
CD | - | - | - | - | 87 | 120 | 93 | 113 | 90 | 116 | 206 | 103
AC | 81 | 114 | - | - | 94 | 126 | - | - | 87 | 120 | 207 | 104
BD | - | - | 79 | 112 | - | - | 92 | 113 | 85 | 112 | 197 | 99
AD | 72 | 102 | - | - | - | - | 92 | 113 | 82 | 108 | 190 | 95

ABC | 91 | 126 | 90 | 125 | 96 | 127 | - | - | 92 | 126 | 218 | 109
BCD | - | - | 87 | 122 | 88 | 121 | 94 | 114 | 90 | 119 | 209 | 105
ACD | 79 | 112 | - | - | 87 | 120 | 94 | 114 | 86 | 115 | 201 | 101
ABD | 76 | 112 | 80 | 110 | - | - | 93 | 114 | 83 | 112 | 195 | 98

ABCD | 88 | 124 | 85 | 122 | 90 | 123 | 95 | 116 | 90 | 121 | 211 | 105
↑ indicates 0 to 1 transition
↓ indicates 1 to 0 transition
 
Table 8 and Table 9 compare the propagation delay results from the NRLB and the other two
reconfigurable NCL LEs from Smith’s work [51]. Table 8 shows the comparisons between “0 → 1”
and “1 → 0” propagation delays in three different NCL reconfigurable systems. Table 9 presents the
average propagation delays as they vary with the number of active inputs. Both NCL LEs were
simulated using a 1.8V, 180-nm TSMC CMOS process. As a result, it is not possible to compare them
directly with the NRLB without estimating the effect on delay of scaling from 180-nm to 28-nm. In
this case, the scaling equations from [102] have been employed to estimate the delay values that both
of Smith’s LEs would exhibit in a 28nm process technology. There are two versions of the
scaling factors related to these Predictive Technology Models [104], high performance (HP) and low
power (LP). The FDSOI 28nm technology is developed from a low-power process cmos32lp library⁴,
so the LP scaling factors are more appropriate in this case.
 
Figure 33 Reconfigurable NCL LE with extra embedded registration [51] 
It is also worth noting two issues. Firstly, these estimates are only approximate and will depend 
somewhat on the specific process used in [51], the details of which are unknown. Secondly, these 
figures are for the unloaded case, i.e., for an individual cell in isolation. Both the power and delay 
figures will increase by a small percentage in a real FPGA architecture loaded by the routing network. 
On the other hand, this is the same set up as other reported reconfigurable cells, so will allow 
comparisons with prior work. 
                                                     
4 The 28FDSOI geometry was developed from an IBM 32nm bulk process. The minimum drawn transistor length is 30nm, which is then 
optically scaled to 28nm during fabrication. 
The equations used to scale an example propagation delay value of 282ps from 1.8V in 180nm to 1V
in 28nm are shown as follows:
Equation 3 Propagation delay scale equations

$$\mathrm{DelayFactor}_{32} = -325.9 \times (1)^3 + 1374 \times (1)^2 - 1922 \times (1) + 913.2 = 39.3$$
$$\mathrm{DelayFactor}_{180} = 97.09 \times (1.8)^2 - 356.7 \times (1.8) + 406.5 = 79.0116$$
$$\mathrm{Delay}_{32} = 39.3 / 79.0116 \times 282 = 140\ \mathrm{ps}$$

Keeping in mind the limitations of the scaling pointed out above, the propagation
delay results indicate that the NRLB is approximately 24% faster than Smith’s LE without embedded
registration and up to 36% faster than the case with embedded registration. When programmed to be a
4-input NCL threshold gate, the propagation delay of the NRLB exhibits the best performance and is
around 38% faster than that of the corresponding LE with embedded registration. Furthermore, the NRLB
supports dual-rail IO and includes a fixed embedded registration, which has the potential to be more
area-effective than Smith’s LE. When comparing the block diagram of Smith’s LE with embedded
registration shown in Figure 33 and the NCL LUT proposed in this thesis (shown in Figure 21), it can
be seen that the latter merges all of the separate features appearing in Smith’s design (i.e., registration,
pull-up/pull-down logic and output inversion logic), which leads to a lower overall propagation delay.
Table 8 Scaled Propagation Delay Comparison
(all delays in ps; NRLB = NCL Reconfigurable Logic Block, LE = NCL Logic Element [51], ER = embedded registration)
Input | NRLB 0 → 1 | NRLB 1 → 0 | LE without ER 0 → 1 | LE without ER 1 → 0 | LE with ER 0 → 1 | LE with ER 1 → 0
A1 | 88 | 124 | 97 | 173 | 110 | 206
B1 | 85 | 122 | 99 | 177 | 113 | 216
C1 | 90 | 123 | 104 | 180 | 119 | 224
D1 | 95 | 116 | 107 | 183 | 121 | 230
Ave. Delay: 105 (NRLB), 140 (LE without ER), 168 (LE with ER)
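
The scaling of Equation 3 can be reproduced numerically. The Python sketch below uses exactly the polynomial coefficients shown in Equation 3; the function names are illustrative, and the polynomials are taken as given in the text rather than re-derived from [102].

```python
def delay_factor_32nm_lp(vdd):
    """Low-power 32nm delay factor polynomial, coefficients as in Equation 3."""
    return -325.9 * vdd**3 + 1374 * vdd**2 - 1922 * vdd + 913.2


def delay_factor_180nm(vdd):
    """180nm delay factor polynomial, coefficients as in Equation 3."""
    return 97.09 * vdd**2 - 356.7 * vdd + 406.5


def scale_delay(delay_ps, vdd_from=1.8, vdd_to=1.0):
    """Scale a delay measured at 180nm/vdd_from to the 32nm LP model at vdd_to."""
    return delay_factor_32nm_lp(vdd_to) / delay_factor_180nm(vdd_from) * delay_ps


scaled = scale_delay(282)
print(f"{delay_factor_32nm_lp(1.0):.1f} {delay_factor_180nm(1.8):.4f} {scaled:.1f}")
```

Wrapping the two polynomials in functions makes it straightforward to apply the same scaling to every delay reported for the LEs in [51].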
 
Table 9 Average Propagation Delay Comparison based on the Number of Active Input Ports
# Active inputs | NCL Reconfigurable Logic Block (ps) | NCL Logic Element without ER (ps) [51] | NCL Logic Element with ER (ps) [51]
1 | 108 | 135 | 159
2 | 103 | 136 | 162
3 | 103 | 138 | 165
4 | 105 | 140 | 168
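
One way to arrive at figures close to the quoted 24%, 36% and 38% is to compare the columns of Table 9. The Python sketch below averages each column over the four rows and also compares the 4-input row directly; whether the percentages in the text were computed in exactly this way is an assumption, and the helper name `percent_faster` is illustrative.

```python
from statistics import mean

# Rows of Table 9: active inputs -> (NRLB, LE without ER, LE with ER), in ps.
TABLE9 = {1: (108, 135, 159), 2: (103, 136, 162), 3: (103, 138, 165), 4: (105, 140, 168)}


def percent_faster(nrlb_ps, other_ps):
    """Relative delay reduction of the NRLB with respect to another block."""
    return 100.0 * (1.0 - nrlb_ps / other_ps)


nrlb_avg, no_er_avg, er_avg = (mean(column) for column in zip(*TABLE9.values()))
vs_no_er = percent_faster(nrlb_avg, no_er_avg)
vs_er = percent_faster(nrlb_avg, er_avg)
four_input_vs_er = percent_faster(TABLE9[4][0], TABLE9[4][2])

print(f"{vs_no_er:.1f}% {vs_er:.1f}% {four_input_vs_er:.1f}%")
```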
 
Table 10 Static Power Consumption on Input Transition on TH34w2 
A B C D P (µW) 
1 0 0 0 21.8 
0 1 0 0 27.4 
0 0 1 0 27.4 
0 0 0 1 27.6 
1 1 0 0 21.4 
1 0 1 0 21.9 
1 0 0 1 21.9 
0 1 1 0 21.8 
0 1 0 1 21.8 
0 0 1 1 21.5 
1 1 1 0 27.0 
1 1 0 1 27.6 
1 0 1 1 27.5 
0 1 1 1 21.8 
1 1 1 1 25.4 
 
Table 10 shows the power consumption of the NRLB when implementing TH34w2 over the complete
Null → Data → Null cycle. Again, each pull-up signal is connected to a 1V supply. While the average
power consumption of the NRLB is around 24µW, it can be seen to vary depending on the input data.
Note that this is ~130% higher than the 10.5µW observed in the original experiments on the cell
performed using the generic 45nm process technology. As mentioned above, the final in-circuit
dynamic power consumption will increase significantly depending on its post-mapping output load.
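
The average quoted above follows directly from the fifteen entries of Table 10. A minimal Python check, with the values copied in table order and illustrative variable names, is:

```python
from statistics import mean

# Power values from Table 10 (µW), in the order the rows are listed.
POWER_UW = [21.8, 27.4, 27.4, 27.6, 21.4, 21.9, 21.9, 21.8,
            21.8, 21.5, 27.0, 27.6, 27.5, 21.8, 25.4]
REFERENCE_45NM_UW = 10.5   # figure reported for the generic 45nm experiments

average_uw = mean(POWER_UW)
increase_percent = 100.0 * (average_uw - REFERENCE_45NM_UW) / REFERENCE_45NM_UW
print(f"{average_uw:.2f} uW, {increase_percent:.0f}% above the 45nm figure")
```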
 
3.5 NCL Full Adder with N-output Look-Up-Table 
Previous sections have discussed the default NRLB, which comprises an 8-output LUT and a shift
register. In general, this offers the most flexibility to implement the broadest range of NCL-based
designs. However, in some simple circuits, only one or two groups of dual-rail outputs are
required, and the remaining output ports are therefore wasted. Taking the optimised dual-rail NCL full adder
design in [24] as an example: if the 8-output NRLB is employed to implement the circuit, 2 gates and
3532 transistors (Figure 34) are utilised. The full adder circuit implemented on the NRLB is more than 2x larger
than the one implemented on Lamb’s LE [48] (10 gates, 1420 transistors), and more than 3x larger than the one
implemented on Smith’s LE [51] (6 gates, 1014 transistors). Therefore, it can be seen that an 8-output
NRLB organisation is not the most appropriate in this scenario.
 
Figure 34 Optimised Dual-rail NCL Full Adder 
The choice of the number of output ports in the NCL LUT architecture represents a critical trade-off 
between flexibility and configuration performance. The ultimate choice will depend to a large degree 
on this balance between the configuration performance, its area and the ability of the design tools to 
efficiently map circuits to a particular organisation such that the cells are optimally utilized. This is 
the same dilemma facing all FPGA architects. 
Table 11 Transistors Numbers in N-output LUT Architecture 
 Number of Transistors 
Output Ports 8 4 2 
LUT 336 232 180 
Shift Register 1430 726 374 
NRLB (LUT + Shift Register) 1766 958 554 
Full Adder 3532 1916 1108 
 
Reducing the number of output ports involves removing the relevant lines along with their
configuration control switches and master-slave flip-flops, as shown in Figure 35. Table 11 indicates
the number of transistors required to construct different designs depending on the number of LUT
outputs N (2 ≤ N ≤ 8) and shows that the 2-output NRLB has 69% fewer transistors than the full
8-output NRLB (554 versus 1766). With the reduction of output ports, fewer master-slave flip-flops are required
for the configuration bits, resulting in a progressively shorter configuration phase, as shown in Table
12. At 68ns, the 2-output NRLB has the
shortest configuration time, while the 4-output and 8-output organisations are increasingly slower,
almost doubling each time.
Table 12 Time of Programming Phase on Different Numbers of Output Ports 
Number of Output Ports 8 4 2 
Time of Programming Phase (ns) 260 132 68 
 
The disadvantage here is that the 2-output NRLB has less flexibility than the 8-output
organisation. On the other hand, it takes nearly 70% fewer transistors and requires only about 25% of the
configuration time. As an example, the full adder circuit mapped onto the 2-output NRLB consumes only
2 gates (1108 transistors) and is smaller than the reconfigurable LE in [48] (10 gates, 1420 transistors). In
addition, it has a similar area to Smith’s LE with extra embedded registration (6 gates, 1014
transistors).
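
The figures in Table 11 and Table 12 are consistent with a simple model of the configuration memory, sketched below in Python. The constants are inferences from the reported numbers rather than values stated in the text: 2 ns per configuration bit (260 ns for 130 bits), 11 transistors per master-slave flip-flop (1430 transistors for 130 flip-flops), and, per pair of dual-rail outputs, 30 term bits plus 2 completion detection bits. The function names are hypothetical.

```python
TERMS_PER_RAIL = 15              # element terms per rail
ACK_BITS = 2                     # SER_IN[1:0]
COMPLETION_BITS_PER_PAIR = 2     # inferred: 8 bits for 4 output pairs
NS_PER_BIT = 2                   # inferred: 260 ns / 130 bits
TRANSISTORS_PER_FLIP_FLOP = 11   # inferred: 1430 / 130 (Table 11)


def config_bits(output_ports):
    """Configuration bits for an N-output NRLB (N/2 dual-rail output pairs)."""
    pairs = output_ports // 2
    term_bits = 2 * TERMS_PER_RAIL * pairs
    return ACK_BITS + COMPLETION_BITS_PER_PAIR * pairs + term_bits


def config_time_ns(output_ports):
    return NS_PER_BIT * config_bits(output_ports)


def shift_register_transistors(output_ports):
    return TRANSISTORS_PER_FLIP_FLOP * config_bits(output_ports)


for n in (8, 4, 2):
    print(n, config_bits(n), config_time_ns(n), shift_register_transistors(n))
```

Because the model reproduces both the shift register transistor counts and the programming times for all three organisations, it offers a quick way to estimate the cost of other output counts, subject to the inferred constants above.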
The full adders have similar worst-case delays of one and two gate delays for outputs Co and S when 
implemented on both the 2-output NRLB and Smith’s LE with embedded registration. Although the 
reduction of output ports reduces transistor usage, there is virtually no change to the 
input-to-output propagation delay of the NRLB. The full adder implemented on the 2-output NRLB 
system exhibits 164 ps for the one-gate delay and 265 ps for the two-gate delay, as shown in the 
waveforms of Figure 36.  
In summary, although the NRLB has similar area and worst-case delay performance to Smith’s 
single-rail LE, it has been shown that fewer blocks are required on the NRLB to implement small 
circuits such as the NCL-based dual-rail full adder analysed above. This is likely to offer an advantage 
over Smith’s technique when considering the synthesis process and the interconnection/routing 
network. 
 
Figure 35 N-output TH34w2 Schematic: a) 8-output, b) 4-output, c) 2-output 
 
 
 
 
Figure 36 NCL Full Adder with 8x2 LUT Simulation Waveform 
 
 
3.6 NCL Register and Multiplier 
Section 2.1.2 described NCL register behaviour and different pipelining methodologies. The 
flexible NRLB can perform handshaking behaviour as well. The 8-output NRLB can be pre-programmed 
as a 4-bit dual-rail full-word register (Figure 37) or as a combinational logic block with 
embedded registration (bit-wise). The 4-output NRLB can be programmed as a 2-bit dual-rail register, 
and the 2-output NRLB as a 1-bit dual-rail register. 
 
Figure 37 4-Bit Dual-rail Full-word Register 
In Section 2.1.3, the data-driven register introduced by UNCLE (Figure 8) is redefined as a dual-rail 
bit-wise NCL register implemented on the 2-output NRLB (TH22) with a “Reset” function added, as 
shown in Figure 38. Apart from the A0 and A1 data-driven input ports, the other six inputs are unused. 
The outputs Z0 and Z1 are connected correspondingly. The reset logic can be programmed to be 
“Reset-to-Null”, “Reset-to-Data1” or “Reset-to-Data0”.  
Table 13 indicates the propagation delay performance of the bit-wise NCL register implemented on 
the NRLB. The NCL register has a shorter propagation delay than the NCL threshold gate 
implementation. 
 
 
Figure 38 Bit-wise NCL Register 
 
Table 13 Propagation Delay based on Input to Output Transition of Bit-wise NCL Register 
                 Output Ports 
Input Ports      Ao0 (f_q)    Ao1 (t_q)    feedback (ko) 
A0 (f_d)         90 ps        /            200 ps 
A1 (t_d)         /            105 ps       200 ps 
ack (ki)         128 ps       128 ps       230 ps 
rst              83 ps        83 ps        117 ps 
 
The N-bit dual-rail full-word multipliers have been implemented by bundling K units of 4-bit dual-rail 
full-word completion registers (N = 4×K). A non-pipelined 4-bit × 4-bit multiplier schematic with 
full-word completion registration (Figure 39) was built to validate the area performance of the NRLB 
and to draw a comparison with the LEs previously developed by Lamb and by Smith. The non-pipelined 
4×4 multiplier was built using 51 8-output NRLBs (90066 transistors). In [56], a 7-stage 
pipelined 4×4 multiplier using full-word completion achieved a higher throughput than an 
equivalent non-pipelined design. The same design can be implemented using 74 8-output NRLBs 
(130684 transistors). Figure 40 shows the 7-stage 4×4 multiplier applying bit-wise completion, which 
has 21% higher data throughput than the same 7-stage 4×4 multiplier using full-word completion in 
[56]. This latter design with bit-wise completion is slightly larger, occupying a total of 83 8-output 
NRLBs (146578 transistors).  
 
Figure 39 Non-pipelined 4x4 Multiplier Schematic (8x8 NRLB) 
 
 
Figure 40 7-Stage 4×4 Multiplier with Bit-wise Completion 
Table 14 shows comparisons between different NCL multiplier designs mapped onto six NCL-based 
reconfigurable systems. Although the designs mapped onto N-output NRLBs are more complex than those 
on Smith’s LEs in most cases, the number of blocks on the NRLB is 50% to 80% lower than for Smith’s 
LEs, which may lead to faster interconnection in the physical implementation. It is interesting to 
observe that most dual-rail full-word completion pipelined multiplier designs utilise more NRLBs than 
the bit-wise completion designs, except when mapped onto the 8-output NRLB. The bit-wise register is 
implemented most efficiently on the 2-output NRLB, so the 2-output NRLB is the optimum 
solution for implementing the 4×4 multiplier using bit-wise completion in this scenario. 
Table 14 Unsigned 4×4 Bits NCL Multipliers mapped onto Different Reconfigurable Systems 
Multiplier Architecture                             Lamb [48]   Smith [51]  Smith [51]  2-output    4-output    8-output 
                                                                without ER  with ER     NRLB        NRLB        NRLB 
Number of Blocks 
Dual-Rail Non-Pipelined                             176         141         128         65          55          51 
Dual-Rail Full-word Pipelined                       400         334         302         152         102         74 
Dual-Rail Bit-Wise Pipelined                        365         275         231         83          83          83 
2D Pipelining with Triangle Buffers [57]            /           /           /           121         121         121 
Dual-Rail Non-Pipelined Synthesised by UNCLE [38]   /           /           /           42          42          42 
Number of Transistors 
Dual-Rail Non-Pipelined                             24992       22419       21632       36010       52690       90066 
Dual-Rail Full-word Pipelined                       56800       53106       51038       84208       97716       130684 
Dual-Rail Bit-Wise Pipelined                        51830       43725       39039       45982       79514       146578 
2D Pipelining with Triangle Buffers [57]            /           /           /           67034       115918      213686 
Dual-Rail Non-Pipelined Synthesised by UNCLE [38]   /           /           /           23268       40236       74172 
Kim [57] suggests that the 2D pipelining design style can achieve 165% higher data throughput 
than non-pipelined multipliers, but consumes 265% more area (transistors). When the 2D pipelined 
multiplier is implemented on the 8-output NRLB, the transistor usage has a similar ratio of 237% 
compared with the non-pipelined multiplier mapped onto the 8-output NRLB. However, the 2-output 
NRLB is more suitable for implementing the 2D pipelining design as it reduces the usage ratio to 186%, 
a 30% improvement. When the synthesis and optimisation steps of UNCLE are applied, the 
non-pipelined multiplier mapped onto the 2-output NRLB is 35% smaller than the same case without 
UNCLE, and it is about the same size as when mapped onto Smith’s LE, as shown in Table 14. 
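
The ratios quoted in this paragraph follow directly from the transistor totals in Table 14. The sketch below recomputes them; the dictionary keys and the helper usage_ratio are hypothetical names introduced only for illustration.

```python
# Transistor totals for the 4x4 multipliers on N-output NRLBs, copied from Table 14.
transistors = {
    "non_pipelined":       {2: 36010, 4: 52690, 8: 90066},
    "2d_pipelined":        {2: 67034, 4: 115918, 8: 213686},
    "uncle_non_pipelined": {2: 23268, 4: 40236, 8: 74172},
}

def usage_ratio(design, baseline, n_outputs):
    """Transistor count of `design` relative to `baseline` on the same N-output NRLB."""
    return transistors[design][n_outputs] / transistors[baseline][n_outputs]

ratio_8 = usage_ratio("2d_pipelined", "non_pipelined", 8)
ratio_2 = usage_ratio("2d_pipelined", "non_pipelined", 2)
uncle_saving = 1 - usage_ratio("uncle_non_pipelined", "non_pipelined", 2)

print(f"2D pipelined / non-pipelined on 8-output NRLB: {ratio_8:.0%}")
print(f"2D pipelined / non-pipelined on 2-output NRLB: {ratio_2:.0%}")
print(f"UNCLE-synthesised saving on 2-output NRLB: {uncle_saving:.0%}")
```

Note that each ratio is taken against the non-pipelined design on the same N-output NRLB, so the 186% figure is relative to the 2-output baseline rather than the 8-output one.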
 
3.7 NCL Threshold Gate Verilog-AMS modules 
With the evaluation results from Section 3.4, the emulated NCL threshold gate models were built and 
simulated using the Verilog-AMS mixed-signal hardware description language. These models include 
the static power consumption, described as current flow, and the average propagation delay. 
The UNCLE toolset provides the Verilog description of all 27 fundamental NCL threshold gates for 
simulation purposes. An example of a behavioural description, in this case of a TH34w2 gate, is 
shown in Figure 41. In the Verilog-AMS module, the current drawn at the end of each transition is 
selected according to the combination of values on the input ports and is passed to the analog block, 
as shown in Figure 42. Each if-statement condition selects the current value for the corresponding 
input combination in the truth table (Table 10), and the analog block adds this current to the overall 
sum. The assignment line “assign #1.3 y = yi;” sets the average propagation delay to 130 ps in this 
example. 
Once the VAMS NCL threshold gate model library was set up, the next step was to evaluate the 
power consumption using the Cadence® Spectre® AMS simulator. The power consumption of the 
NCL-based full adder is estimated to be 94 µW (Figure 43). The static thermal power of the 
synchronous full adder block mapped onto a Stratix V FPGA device is about 90 µW, as reported by 
Intel® Quartus® simulation, which is close to the power consumption observed for the NRLB. 
  
Figure 41 TH34w2 Netlist in UNCLE Library 
always @ (a or b or c or d) 
begin 
          if  (( (a&b)  |  (a&c)  |  (a&d)  |  (b&c&d) ))  
          begin 
            yi <=  1; 
          end 
          else if (( (a==0)  &  (b==0)  &  (c==0)  &  (d==0) ))  
          begin 
            yi <=  0; 
          end 
end 
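
The set, reset and hold behaviour of the netlist in Figure 41 can be mirrored in a short Python model. This is a minimal sketch for illustration, not part of the UNCLE library: the class name and the assumed initial output of 0 are choices made here. It also checks that the Boolean set condition is the same as the weighted threshold 2a + b + c + d ≥ 3 implied by the gate name.

```python
from itertools import product

class TH34w2:
    """Behavioural model of a TH34w2 threshold gate with hysteresis (illustrative only)."""

    def __init__(self):
        self.y = 0  # assumed initial state: output de-asserted

    def update(self, a, b, c, d):
        # Set: same condition as the first branch of the netlist in Figure 41.
        if (a and b) or (a and c) or (a and d) or (b and c and d):
            self.y = 1
        # Reset: all inputs have returned to 0.
        elif not (a or b or c or d):
            self.y = 0
        # Any other combination holds the previous output.
        return self.y

def set_condition_matches_threshold():
    """Compare the Boolean set condition with 2a + b + c + d >= 3 for all 16 inputs."""
    for a, b, c, d in product((0, 1), repeat=4):
        boolean = bool((a and b) or (a and c) or (a and d) or (b and c and d))
        if boolean != (2 * a + b + c + d >= 3):
            return False
    return True

gate = TH34w2()
trace = [gate.update(*v) for v in [(0, 1, 1, 0), (1, 1, 0, 0), (0, 1, 0, 0), (0, 0, 0, 0)]]
print(trace)
```

The trace shows the gate staying low below the threshold, asserting once the threshold is met, holding while inputs are partially withdrawn, and returning to 0 only when all inputs are 0.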
 
Figure 42 TH34w2 Analog Description in VAMS Module Block 
    // Select the supply current val for the input combination present after a rising edge. 
    always @ (posedge a or posedge b or posedge c or posedge d) 
    begin   
     // One input high 
     if (a & !b & !c & !d) begin val = 21.8e-6;  end 
     else if (!a & b & !c & !d) begin val = 27.4e-6;  end 
     else if (!a & !b & c & !d) begin val = 27.4e-6;  end 
     else if (!a & !b & !c & d) begin val = 27.6e-6;  end 
 
     // Two inputs high 
     else if (a & b & !c & !d)  begin val = 21.4e-6;  end 
     else if (a & !b & c & !d)  begin val = 21.9e-6;  end 
     else if (a & !b & !c & d)  begin val = 21.9e-6;  end 
     else if (!a & b & c & !d)  begin val = 21.8e-6;  end 
     else if (!a & b & !c & d)  begin val = 21.8e-6;  end 
     else if (!a & !b & c & d)  begin val = 21.5e-6;  end 
 
     // Three inputs high 
     else if (a & b & c & !d)   begin val = 27e-6;  end 
     else if (a & b & !c & d)   begin val = 27.6e-6;  end 
     else if (a & !b & c & d)   begin val = 27.5e-6;  end 
     else if (!a & b & c & d)   begin val = 21.8e-6;  end 
 
     // All four inputs high 
     else if (a & b & c & d)    begin val = 25.4e-6;  end 
 
          // Pulse state for two time units so that current is drawn only around the transition. 
          state = 1; 
          #2 state = 0; 
    end 
  
    // Average propagation delay from the internal node yi to the output y. 
    assign #1.3 y = yi;  
 
    // Contribute val to the supply branch while state is high, with 1 ps rise and fall times. 
    analog begin 
          I(Vs)<+ transition((state ? val:0),0,1p,1p); 
    end 
 
Figure 43 NCL Full Adder Simulation with VAMS Modules 
 
3.8 Summary 
This chapter has discussed the dual-rail NRLB, which includes two main components: a dual-rail NCL 
LUT and a shift register. The NRLB is more complex and larger than previously proposed LEs, which 
have been universally single-rail blocks. The area of an NRLB is 522.43 µm² (37917 MWTUs), which 
is estimated to be 30% smaller than the configurable logic block (CLB) in an approximately 
equivalent commercial FPGA architecture (53894 MWTUs). 
The NRLB can be configured into specified NCL threshold gates via a 130-bit serial code, with this 
pre-programming/configuration phase taking 260 ns, longer than the 5 ns reported for Smith’s less 
complicated, single-rail LE. However, it should be noted that configuration time is largely irrelevant 
unless run-time reconfiguration is an objective. The NRLB has an average “Go-to-Data” transition 
time of 90 ps and an average “Go-to-Null” transition time of 119 ps, giving an average cycle of 209 ps.  
With respect to the estimated average propagation delays, the NRLB is around 24% faster than 
Smith’s LE without embedded registration and 36% faster than the one with embedded registration. 
Although the advantage will change with the actual environment, the NRLB has been shown to be 
more efficient than a single-rail LE approach. The average static power consumption of the NRLB is 
24 µW. 
NCL designs have a more straightforward implementation on the dual-rail NRLB system than on 
the other single-rail asynchronous architectures, which can partially offset the area and 
delay overheads. For example, an NCL-based full adder circuit can be implemented using only two 
NRLBs. In contrast, the same design requires six of Smith’s LEs. Using fewer 
reconfigurable blocks is likely to result in a smaller and faster circuit, depending on the design 
complexity and the critical path depth of the physical implementation. Furthermore, the NRLB system 
may result in a faster and more successful “place and route” in the architectural simulation as fewer 
soft links are needed on the routing network.  
This chapter has also investigated the impact of different output port sizes on the NRLB and shown 
that the 2-output NRLB is more suitable to implement bit-wise completion designs and/or the 
structural netlist styles generated by UNCLE. Ultimately, the NRLB system has a similar power 
performance to the commercial FPGA device when implementing the simple example designs 
evaluated here. 
In the next chapter, a customised CAD Flow will be introduced to implement Verilog RTL on the 
NRLB using all of the area and delay parameters derived in this chapter.  
  
 
  
NCL CAD Flow 
 
 
 
This chapter proposes and validates a CAD flow for both a commercial FPGA architecture and the NRLB 
architecture. The synchronous FPGA system architecture merges some of the characteristics of the 
largest capacity Stratix V FPGA device (5SGXEABN3F45C2) with those of the VTR [36] so-called 
“flagship” FPGA architecture. The asynchronous reconfigurable platform employs the area and delay 
parameters which were determined and validated in Chapter 3.  
As was shown in the methodology outline (Figure 2 in Chapter 1), two groups of standard 
benchmarks are implemented and simulated on both reconfigurable architectures via the customised 
CAD flow. Experiments on these standard benchmarks provide a rough comparison between the 
conventional FPGA system and the NRLB system, and also serve to validate the design flow.  
4.1 Experimental CAD Flow 
In order to implement VTR benchmarks and OpenCores benchmarks on both synchronous and 
asynchronous reconfigurable systems, a variant of the standard VTR CAD flow has been proposed 
and implemented. The first step is to build the FPGA architecture and the asynchronous 
reconfigurable architecture in 28nm process technology, neither of which is available in the existing 
library. The second step is to define the design flow methodology for the specific asynchronous 
reconfigurable system. Comparisons are then made of the area and timing characteristics of both 
architectures to validate the CAD flow. 
 
4.2 Synchronous VTR CAD Flow 
In the original VTR architecture description file library, only 22nm, 40nm and 130nm technology 
process devices are provided. Thus, the first step had to be the creation of an FPGA architecture 
description file at the relevant process technology node, to allow comparison against the 
characteristics of the 28nm FDSOI CMOS process technology. The new 28nm customised architecture 
description file (k6_frac_N10_mem32K_28nm.xml; see footnote 5) was generated based on VTR’s 
flagship architecture file (k6_frac_N10_mem32K_40nm.xml) and the commercial Stratix V device 
(5SGXEABN3F45C2), as shown in Figure 44. The simulation of commercial FPGA devices ran on Altera® 
Quartus® Prime v18.0 software with the highest optimisation options set. The remaining settings in 
Quartus® were kept at their defaults. The critical path delay (CPD) was taken from the timing reports 
of the “Slow 900mV 85˚C” model.  
 
Figure 44 Stratix V Devices ALM High Level Block Diagram [105] 
                                                     
5 See Appendix A for details of this XML file 
 
 
To ensure that the new 28nm description file has the best-estimated delay model, the VTR heterogeneous 
benchmarks were implemented on both the k6_frac_N10_mem32K_40nm architecture with the VTR flow 
and the Stratix IV device (EP4SGX530NF45C2) with Quartus®. The ratios between the critical path 
delays were calculated as (CPD_VTR - CPD_FPGA) / CPD_FPGA, and they are shown under the first 
heading of Table 15. The subsequent step is to validate the delay parameters in the new 28nm 
architecture description file and make sure these align with the characteristics of the 
5SGXEABN3F45C2 architecture. The ratios of the critical path delays at both technology nodes should 
be the same or close to each other. 
Overall, the Quartus® software with Stratix IV results in a lower CPD than VTR with the 
comprehensive architecture. This is entirely to be expected as Quartus® optimises the critical path by 
exploiting different LUT input delays, something that cannot be achieved by VPR as it cannot 
perform LUT rebalancing. Note also that the average input delay is used at each input port to obtain 
more stable results in VPR. Further, Stratix IV has a nearest-neighbour interconnect function that lets 
the registers feeding the carry chain bypass the general routing to improve the critical path, which is 
not described in the VTR flagship architecture. Thus, VTR results in a worse critical path delay 
than Quartus® on the same process technology [36].  
As shown in Table 15, VTR with the 40nm FPGA architecture has a CPD difference ratio ranging from 
-2% to 10211% relative to Quartus with the Stratix IV device. For VTR with the 28nm FPGA 
architecture relative to Quartus with the Stratix V device, the ratio ranges from -12% to 9671%. 
The results from VTR with the new customised 28nm architecture description file have an 
average of 55% higher CPD, close to the 60% observed for the 40nm architecture, which is acceptable.  
The second heading of Table 15, (CPD_28nm - CPD_40nm) / CPD_40nm, compares different benchmarks 
implemented on the same platform with different process technologies. The latency of most designs 
would benefit from this technology update, and it can be seen that an average of 27% shorter critical 
path is achieved by the 28nm CMOS technology process on both the Quartus and VTR platforms. 
 
 
Table 15 VTR Heterogeneous Benchmarks CPD Ratio (%) Chart 
Columns 2-3: (CPD_VTR - CPD_FPGA) / CPD_FPGA at 40nm and 28nm; columns 4-5: (CPD_28nm - CPD_40nm) / CPD_40nm for Intel_FPGA and VTR 
Circuits 40nm 28nm Intel_FPGA VTR 
bgm 112 30 20 -26 
blob_merge 61 39 -13 -25 
boundtop 46 -6 7 -31 
ch_intrinsics -2 46 -52 -29 
diffeq1 114 116 -24 -23 
diffeq2 86 88 -25 -24 
LU8PEEng 8756 9671 -34 -27 
LU32PEEng 7782 7883 -27 -26 
mcml 196 176 -16 -22 
mkDelayWorker32B 10211 3425 79 -29 
mkPktMerge 291 309 -43 -40 
mkSMAdapter4B 4082 2704 10 -26 
or1200 149 149 -28 -28 
raygentop 58 -12 19 -33 
sha 397 326 -13 -26 
stereovision0 172 159 -27 -31 
stereovision1 372 253 0 -25 
stereovision2 261 11 148 -24 
stereovision3 325 553 -49 -21 
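
The ranges quoted for Table 15 can be reproduced from its columns. The sketch below is illustrative only: the helper cpd_ratio_percent is a hypothetical name, the raw CPD values are not listed in this chapter, and only the tabulated percentages are used.

```python
from statistics import mean

def cpd_ratio_percent(cpd_a, cpd_b):
    """Relative difference (cpd_a - cpd_b) / cpd_b, expressed in percent."""
    return 100.0 * (cpd_a - cpd_b) / cpd_b

# (CPD_VTR - CPD_FPGA) / CPD_FPGA in percent, copied row by row from Table 15.
vtr_vs_quartus_40nm = [112, 61, 46, -2, 114, 86, 8756, 7782, 196, 10211,
                       291, 4082, 149, 58, 397, 172, 372, 261, 325]
vtr_vs_quartus_28nm = [30, 39, -6, 46, 116, 88, 9671, 7883, 176, 3425,
                       309, 2704, 149, -12, 326, 159, 253, 11, 553]
# (CPD_28nm - CPD_40nm) / CPD_40nm in percent for the VTR flow, copied from Table 15.
vtr_28_vs_40 = [-26, -25, -31, -29, -23, -24, -27, -26, -22, -29,
                -40, -26, -28, -33, -26, -31, -25, -24, -21]

range_40nm = (min(vtr_vs_quartus_40nm), max(vtr_vs_quartus_40nm))
range_28nm = (min(vtr_vs_quartus_28nm), max(vtr_vs_quartus_28nm))
mean_vtr_shift = mean(vtr_28_vs_40)

print(f"40nm range: {range_40nm}, 28nm range: {range_28nm}")
print(f"Mean CPD change from 40nm to 28nm in VTR: {mean_vtr_shift:.0f}%")
```

A negative value in the last column indicates a shorter critical path at 28nm than at 40nm.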
 
The 28nm area parameters are much easier to adjust as they are expressed in “MWTU” in the 
VPR architecture description files, a unit that is universal to all process technologies. This means that 
the area parameters can remain at their defaults without any changes from the 40nm technology 
architecture file. 
 
Table 16 I/O Pads Parameters 
 Tdel(s) 
 40nm 28nm 
Inpad 4.24E-11 3.73E-11 
Outpad 1.39E-10 1.23E-11 
 
Table 17 Interconnection Parameters 
               Tdel (s)              R (Ω)            Cin (F)               Cout (F) 
               40nm      28nm        40nm    28nm     40nm      28nm        40nm      28nm 
Switch (MUX)   5.80E-11  5.10E-11    551     100      7.70E-16  7.00E-16    4.00E-15  3.50E-15 
Ipin_cblock    7.25E-11  6.38E-11    2231.5  2000     1.47E-15  1.30E-15    0         0 
Wires          /         /           101     90       2.25E-14  2.00E-14    /         / 
 
Table 18 Configurable Logic Block Parameters 
 Tdel (s) T_setup (s) T_clock_to_Q (s) 
 40nm 28nm 40nm 28nm 40nm 28nm 
5LUT 2.35E-10 2.06E-10 / / / / 
6LUT 2.61E-10 2.30E-10 / / / / 
Output with latch 4.50E-11 3.96E-11 / / / / 
Output without latch 2.50E-11 2.20E-11 / / / / 
Flip-Flop / / 6.60E-11 5.80E-11 1.24E-10 1.09E-10 
 
Table 16 to Table 18 list all the parameter changes to delays, resistances and 
capacitances in the new 28nm architecture file. Most of them are scaled by ratio and validated by the 
comparisons from Table 15. 
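
The scaling applied to the delay parameters can be read back from Tables 16 to 18. The sketch below computes the 28nm/40nm ratio for each delay entry; the dictionary keys are hypothetical labels, and the 0.02 tolerance used to flag deviations is an arbitrary choice made here.

```python
# (40nm, 28nm) delay values in seconds, copied as printed in Tables 16 to 18.
delays = {
    "inpad":                (4.24e-11, 3.73e-11),
    "outpad":               (1.39e-10, 1.23e-11),
    "switch_mux":           (5.80e-11, 5.10e-11),
    "ipin_cblock":          (7.25e-11, 6.38e-11),
    "lut5":                 (2.35e-10, 2.06e-10),
    "lut6":                 (2.61e-10, 2.30e-10),
    "output_with_latch":    (4.50e-11, 3.96e-11),
    "output_without_latch": (2.50e-11, 2.20e-11),
    "ff_setup":             (6.60e-11, 5.80e-11),
    "ff_clock_to_q":        (1.24e-10, 1.09e-10),
}

# Ratio of the 28nm value to the 40nm value for each parameter.
ratios = {name: t28 / t40 for name, (t40, t28) in delays.items()}

# Entries whose ratio differs noticeably from the common factor of about 0.88.
outliers = [name for name, r in ratios.items() if abs(r - 0.88) > 0.02]

for name, r in ratios.items():
    print(f"{name:22s} {r:.3f}")
print("Entries off the common scaling factor:", outliers)
```

Nearly all delay entries share a factor of about 0.88. The Outpad value as printed in Table 16 gives a ratio of about 0.09; whether this reflects the architecture file or a typesetting slip cannot be determined from the text.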
 
4.3 Asynchronous VTR CAD Flow 
As discussed in Section 1.4, the standard VTR Flow has been established to synthesise conventional 
synchronous reconfigurable systems. This is not directly compatible with the dual-rail NRLB 
architecture, and thus the CAD flow requires significant changes for this experimental setup. 
 
Figure 45 Customised VTR CAD Flow (Synchronous and Asynchronous) 
 
In the first stage of Figure 45, Cadence® Genus™ is applied to flatten the HDL design into a 
single-rail Boolean logic gate netlist. Then UNCLE [38], which was introduced in Section 2.3.3, 
synthesises the single-rail Boolean gate netlist from the previous stage into a dual-rail netlist using 
NCL threshold gates. Following this, Yosys [40] performs the NCL gate netlist mapping step with a 
library file of black-boxes in NCL threshold gate format, and then outputs the logic-level hierarchical 
circuit as a BLIF file. 
In the final stage, VPR performs “Place, Route and Analysis” on the NRLB architecture. To allow 
comparison with the statistics of the corresponding synchronous system, three NRLB architecture 
description files are generated based on their area and delay characteristics, as listed in Table 19. The 
N-output NRLB architecture description files share the same delay constants and similar block type 
descriptions. Each NRLB can be mapped with two single-rail NCL threshold gates or one dual-rail 
register gate. Depending on the number of outputs, the unused inputs can connect directly to the unused 
outputs, which then operate as a simple wire connection (i.e., TH11). 
Table 19 NCL Reconfigurable Block Architecture Description File 
           Architecture Description File (see footnote 6)   NCL Reconfigurable Logic Block Area (MWTUs) 
2-Output   ncl_reconfigurable_logic_op2.xml                 11895 
4-Output   ncl_reconfigurable_logic_op4.xml                 20569 
8-Output   ncl_reconfigurable_logic_op8.xml                 37917 
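
The stages of the asynchronous flow and the architecture files of Table 19 can be summarised as a small data structure. This is only a descriptive sketch: it does not invoke any of the tools, and the names ASYNC_FLOW, NRLB_ARCH and describe_flow are hypothetical.

```python
# Stages of the asynchronous flow as described in the text (tool, role).
ASYNC_FLOW = [
    ("Genus", "flatten the HDL design into a single-rail Boolean gate netlist"),
    ("UNCLE", "convert the single-rail netlist into a dual-rail NCL threshold gate netlist"),
    ("Yosys", "map the NCL gates onto black-box library cells and write a BLIF file"),
    ("VPR", "place, route and analyse on the NRLB architecture"),
]

# Architecture description files and block areas in MWTUs, copied from Table 19.
NRLB_ARCH = {
    2: ("ncl_reconfigurable_logic_op2.xml", 11895),
    4: ("ncl_reconfigurable_logic_op4.xml", 20569),
    8: ("ncl_reconfigurable_logic_op8.xml", 37917),
}

def describe_flow(n_outputs):
    """Return a printable summary of the flow targeting an N-output NRLB."""
    arch_file, area = NRLB_ARCH[n_outputs]
    lines = [f"{i}. {tool}: {role}" for i, (tool, role) in enumerate(ASYNC_FLOW, start=1)]
    lines.append(f"Target: {arch_file} ({area} MWTUs per NRLB)")
    return lines

for line in describe_flow(2):
    print(line)
```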
 
4.4 Synchronous vs. Asynchronous Benchmark Comparison 
Table 20 illustrates the differences in area between several synthesised benchmark circuits, 
comparing the conventional synchronous case with the mappings onto the NRLB cell with 2, 4 or 8 
outputs. It therefore shows the relative performance of the asynchronous and synchronous CAD flows 
using benchmark circuits drawn from the VTR heterogeneous benchmarks [36] and the 
IWLS 2005 benchmarks (OpenCores) [106]. It is often the case that, as a simple script-based tool, 
UNCLE completely fails to operate on larger circuits or fails to create the feedback (acknowledge) 
network. The objective here has been to select benchmarks that are sufficiently complex but not too 
large to be synthesised by UNCLE in a reasonable time.  
                                                     
6 See Appendix A for details of these XML files 
Table 20 Benchmark Area Comparisons – Synchronous vs. Asynchronous 
 Area (×10^7 MWTUs) 
Circuits FPGA NRLB(OP2) NRLB(OP4) NRLB(OP8) 
VTR Heterogeneous Benchmarks [36] 
blob_merge 2.39 33.34 49.59 91.07 
diffeq1 1.10 8.51 13.52 24.76 
diffeq2 1.02 7.90 12.67 23.00 
sha 0.94 8.90 14.00 25.75 
stereovision0 4.48 46.38 70.70 129.40 
stereovision1 11.2 91.47 133.90 244.40 
IWLS 2005 benchmarks (OpenCores) [106] 
systemc_aes 0.88 8.39 13.53 24.87 
mem_ctrl 0.98 8.85 14.33 26.35 
pci_bridge32 2.53 21.19 35.52 65.36 
aes_core 2.93 25.29 39.42 72.25 
wb_conmax 5.93 53.97 88.82 163.70 
des_perf 4.48 100.00 157.10 287.80 
 
All circuit areas in Table 20 are measured in units of MWTU, which is used to describe area in 
VPR. For example, the total logic block area of the synchronous version of the diffeq1 
circuit when mapped onto the 28nm FPGA architecture is about 1.1 × 10^7 MWTUs, 
while for the asynchronous diffeq1 circuit mapped onto the NRLB(OP2) architecture it is about 
8.51 × 10^7 MWTUs. The area ratio is therefore approximately 7.74 for these two circuits. This 
massive area difference can be explained partly by the transformation from single-rail to dual-rail 
and partly by the (lack of) optimisation by the UNCLE tool, as will be explained below. 
A comparison between the resources used on a commercial FPGA and those used in the NCL-based 
asynchronous flow targeting an NRLB(OP2) is shown in Table 21. In this case, the comparison 
is between the use of components such as 6-LUTs on the one hand and the NRLB cell on the other. 
Using the diffeq1 example again, the VTR CAD flow starts with Genus™ synthesising the 
behavioural HDL circuit into a structural Verilog description. UNCLE then converts the single-rail 
Boolean structural netlist into a dual-rail NCL structural netlist, in which each Boolean gate is 
represented by two or more NCL threshold gates (described in Table 2 of Section 2.1.3, above). 
The asynchronous design flow then maps diffeq1 to 13836 NCL threshold gates (Table 21). In 
comparison, the synchronous design flow maps the same netlist to 2706 6-LUTs, approximately 5x 
fewer. The next step clusters these 2706 6-LUTs and 193 latches into 204 CLBs on the FPGA system. 
The 13836 THxx gates and 193 registers are clustered into 7157 NRLBs on the asynchronous 
reconfigurable system. As a result, the area of diffeq1 mapped onto the 
k6_frac_N10_mem32K_28nm FPGA architecture is 204 × 53894 = 1.10 × 10^7 MWTUs, and 7157 × 
11895 = 8.51 × 10^7 MWTUs when mapped onto the ncl_reconfigurable_logic_op2 asynchronous 
reconfigurable architecture. See Appendix B for some examples of the synthesis flow operating on the 
benchmark files. 
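
The area arithmetic for diffeq1 can be reproduced as follows. The sketch also derives a simple lower bound on the number of NRLBs from the packing rule stated in Section 4.3 (two threshold gates or one register per block); treating this rule as a bound is an assumption made here, and the function names are hypothetical.

```python
import math

CLB_AREA_MWTU = 53894        # CLB area of the 28nm FPGA architecture
NRLB_OP2_AREA_MWTU = 11895   # 2-output NRLB area from Table 19

def logic_area(num_blocks, block_area_mwtu):
    """Total logic block area in MWTUs."""
    return num_blocks * block_area_mwtu

def nrlb_lower_bound(th_gates, registers):
    """Fewest NRLBs needed if every block held two threshold gates or one register."""
    return math.ceil(th_gates / 2) + registers

sync_area = logic_area(204, CLB_AREA_MWTU)
async_area = logic_area(7157, NRLB_OP2_AREA_MWTU)
area_ratio = async_area / sync_area
packing_bound = nrlb_lower_bound(13836, 193)

print(f"Synchronous diffeq1:  {sync_area:.3e} MWTUs")
print(f"Asynchronous diffeq1: {async_area:.3e} MWTUs")
print(f"Area ratio: {area_ratio:.2f}")
print(f"Packing lower bound: {packing_bound} NRLBs (7157 reported)")
```

The reported 7157 NRLBs lie slightly above this bound, which would be consistent with a small number of blocks being only partially filled during clustering.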
In addition, it can be seen from Table 21 that the number of I/O ports in the asynchronous diffeq1 
circuit has expanded to (258-1) × 2 + 3 = 517 after removing the CLOCK pin and adding the RESET, 
ackin and ackout pins. It therefore uses roughly twice the number of regular inputs and outputs 
of the synchronous design (258 I/O ports), simply because of the transformation from 
single-rail to dual-rail. 
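
The I/O expansion can be written as a one-line function. This is a sketch under the assumption stated in the text (one clock pin removed, every remaining signal doubled, three control pins added); the function name is hypothetical.

```python
def dual_rail_io(sync_io, control_pins=3):
    """I/O count after dropping the clock, doubling each signal and adding control pins."""
    return (sync_io - 1) * 2 + control_pins

# Synchronous I/O counts from Table 21 for three of the benchmarks.
for name, sync_io in [("diffeq1", 258), ("diffeq2", 162), ("sha", 74)]:
    print(name, dual_rail_io(sync_io))
```

For diffeq1, diffeq2 and sha this reproduces the asynchronous I/O counts listed in Table 21 (517, 325 and 149). Some other benchmarks in the table differ from the formula by one or two pins, so it is best read as an approximation.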
In summary, due to the expansion from single-rail to dual-rail and the poor optimisation capabilities 
of UNCLE and Yosys, the area measured for these standard benchmarks mapped onto the 
NRLB(OP2) architecture is on average ~9x larger than that of their synchronous counterparts on the 
k6_frac_N10_mem32K_28nm FPGA architecture, as shown in Table 20. Compared to the 
NRLB(OP2) reconfigurable architecture, the asynchronous designs mapped onto NRLB(OP4) and 
NRLB(OP8) have an average of 58% and 189% larger areas, respectively. These results are in line 
with the general conclusions of Section 3.5. 
Table 21 Benchmark Resource Usage 
Columns 2-5: Commercial FPGA System (synchronous); columns 6-9: NCL-based Asynchronous Reconfigurable System 
Benchmarks I/O 6-LUT Latches CLBs I/O TH_gates Registers NRLB(OP2)s 
VTR Heterogeneous Benchmarks 
blob_merge 136 5456 552 444 274 54327 552 28027 
diffeq1 258 2706 193 204 517 13836 193 7157 
diffeq2 162 2428 94 190 325 13011 96 6638 
sha 74 2063 893 174 149 13085 893 7479 
stereovision0 366 13445 11619 832 734 54391 11619 38981 
stereovision1 278 27446 11660 2074 558 128752 11756 76860 
IWLS 2005 benchmarks (OpenCores) 
systemc_aes 389 1733 670 163 777 12608 670 7052 
mem_ctrl 267 2212 1052 181 532 12593 1051 7439 
pci_bridge32 367 5368 3220 470 732 28754 3219 17817 
aes_core 388 6384 530 544 775 40931 530 21264 
wb_conmax 2546 12285 770 1100 5091 87278 770 45359 
des_perf 298 12648 8808 832 597 148592 8808 84046 
 
However, there is one exception that needs to be pointed out. While the CLB counts for the 
synchronous mappings of des_perf and stereovision0 are the same when mapped onto the 
conventional FPGA device, this is not the case for their asynchronous versions. The resources used by 
asynchronous des_perf are more than twice those of stereovision0 when both are mapped onto the NRLB 
system. This results in a 22.3x area difference between the synchronous des_perf circuit mapped onto 
the conventional FPGA system and its asynchronous counterpart on the NRLB system. This is strong 
evidence that ODIN II and ABC have much better synthesis and logic optimisation capabilities 
than UNCLE and Yosys. 
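
The area ratios discussed above can be recomputed from Table 20. The sketch below is illustrative; it reports both the median and the arithmetic mean, since the text does not state which kind of average underlies the ~9x figure.

```python
from statistics import mean, median

# Areas in units of 10^7 MWTUs, copied from Table 20: (FPGA, NRLB(OP2)).
areas = {
    "blob_merge": (2.39, 33.34), "diffeq1": (1.10, 8.51), "diffeq2": (1.02, 7.90),
    "sha": (0.94, 8.90), "stereovision0": (4.48, 46.38), "stereovision1": (11.2, 91.47),
    "systemc_aes": (0.88, 8.39), "mem_ctrl": (0.98, 8.85), "pci_bridge32": (2.53, 21.19),
    "aes_core": (2.93, 25.29), "wb_conmax": (5.93, 53.97), "des_perf": (4.48, 100.00),
}

# NRLB(OP2) area divided by FPGA area for each benchmark.
ratios = {name: nrlb / fpga for name, (fpga, nrlb) in areas.items()}
largest = max(ratios, key=ratios.get)

print(f"Median ratio: {median(ratios.values()):.1f}x")
print(f"Mean ratio:   {mean(ratios.values()):.1f}x")
print(f"Largest:      {largest} at {ratios[largest]:.1f}x")
```

The median is close to the ~9x quoted in the text, while the mean is pulled upwards by des_perf.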
Table 22 shows the latency comparisons for the benchmark circuits mapped onto the different 
reconfigurable architectures. For the asynchronous benchmarks, all combinational functions are free 
to operate at the fastest rate allowed by the technology without having events slowed by the 
worst-case cycle of a clock. The critical path delay figures generated by VPR represent the actual 
propagation delay from input to output. This differs from the synchronous case, in which the 
critical path delay is the maximum delay between any pair of latches and determines the maximum 
CLOCK frequency. Thus, the latency of a synchronous design is the synchronous critical path delay 
multiplied by the number of pipeline stages. 
For the diffeq1 circuit, the latency is about 409 ns (87 pipeline stages) when mapped onto the 
NRLB(OP2) architecture, whereas its synchronous counterpart on the conventional FPGA takes 1061 ns 
(12.2 ns × 87), so the asynchronous latency is 61% lower. Moreover, the latency on the Stratix V FPGA 
device simulated by Quartus® is 770 ns, against which the asynchronous latency is 46% lower. The 
latency in both the synchronous and asynchronous cases is directly proportional to the number of 
pipeline stages, which means that a heavily pipelined asynchronous design is likely to exhibit much 
lower latency than its synchronous counterpart. The sha benchmark is a good example as 
the synchronous latency is 5.5x higher than its asynchronous counterpart. 
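
The latency comparison for diffeq1 can be sketched as follows. The function names are hypothetical; the 12.2 ns critical path, the 87 pipeline stages and the latency values are those given in the text and in Table 22.

```python
def sync_latency_ns(critical_path_ns, pipeline_stages):
    """Synchronous latency: each pipeline stage costs one clock period set by the critical path."""
    return critical_path_ns * pipeline_stages

def reduction(async_ns, sync_ns):
    """Fraction by which the asynchronous latency is lower than the synchronous one."""
    return 1 - async_ns / sync_ns

vtr_latency = sync_latency_ns(12.2, 87)      # VTR 28nm FPGA architecture
vs_vtr = reduction(409, vtr_latency)         # NRLB(OP2) against VTR
vs_quartus = reduction(409, 770)             # NRLB(OP2) against Stratix V in Quartus
sha_excess = (11507 - 1763) / 1763           # sha: synchronous excess over asynchronous

print(f"Synchronous diffeq1 latency: {vtr_latency:.0f} ns")
print(f"Asynchronous reduction vs VTR: {vs_vtr:.1%}, vs Quartus: {vs_quartus:.1%}")
print(f"sha synchronous latency is {sha_excess:.1f}x higher")
```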
However, not all asynchronous reconfigurable circuits achieve better delay performance. 
The latencies of asynchronous aes_core (Advanced Encryption Standard IP core) and 
asynchronous des_perf (Data Encryption Standard IP core) are less favourable on the NRLB(OP2) 
architecture, being 47% and 676% slower, respectively, than their synchronous counterparts. 
Asynchronous des_perf also exhibits an enormous disadvantage in area, 
which suggests that UNCLE performs worse synthesis optimisation on circuits with embedded encoding 
and decoding functions. This type of circuit usually comes with significant memory resource 
usage, which consequently brings a massive disadvantage on the fully-reconfigurable NRLB 
architecture. Except for these two designs, the remaining designs mapped onto the NRLB(OP2) 
architecture are roughly 5% to 85% faster than on the 28nm conventional FPGA device (Table 22). 
Table 22 Benchmarks Latency 
 Latency (ns) 
Circuits FPGA NRLB(OP2) NRLB(OP4) NRLB(OP8) 
VTR Heterogeneous Benchmarks 
blob_merge 446 300 294 292 
diffeq1 1061 409 422 435 
diffeq2 1229 601 619 633 
sha 11507 1763 1778 1733 
stereovision0 212 157 183 183 
stereovision1 400 163 195 207 
IWLS 2005 benchmarks (OpenCores) 
systemcaes 1787 608 603 612 
mem_ctrl 1413 530 528 541 
pci_bridge32 1687 704 783 762 
aes_core 1749 2564 2531 2620 
wb_conmax 275 260 273 284 
des_perf 155 1206 1314 1296 
 
When comparing the latency with the different numbers of output ports on NRLB architecture, the 
average difference between each latency figure is only 3%, which is not significant. This further 
indicates that NRLB(OP2) is the better choice on medium-sized circuits when mapped by the 
customised VTR flow. The benchmark circuits mapped onto NRLB(OP2) have a smaller area with the 
same latency performance. 
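
The latency figures in Table 22 can be summarised with a short calculation. This sketch assumes that the average is taken over the designs that are faster on the NRLB(OP2) architecture, which is an interpretation made here; the variable names are hypothetical.

```python
from statistics import mean

# Latency in ns, copied from Table 22: (FPGA, NRLB(OP2)).
latency = {
    "blob_merge": (446, 300), "diffeq1": (1061, 409), "diffeq2": (1229, 601),
    "sha": (11507, 1763), "stereovision0": (212, 157), "stereovision1": (400, 163),
    "systemcaes": (1787, 608), "mem_ctrl": (1413, 530), "pci_bridge32": (1687, 704),
    "aes_core": (1749, 2564), "wb_conmax": (275, 260), "des_perf": (155, 1206),
}

# Positive values mean the asynchronous mapping has lower latency.
reductions = {name: 1 - nrlb / fpga for name, (fpga, nrlb) in latency.items()}
slower = sorted(name for name, r in reductions.items() if r < 0)
avg_faster = mean(r for r in reductions.values() if r > 0)

print("Slower on NRLB(OP2):", slower)
print(f"Average latency reduction of the remaining designs: {avg_faster:.0%}")
```

Under this interpretation, the two encryption cores are the only designs that are slower on the NRLB(OP2) architecture.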
With an average of 51% lower data transfer latency, the designs mapped onto the NRLB(OP2) 
architecture have a nine times larger resource usage area than the synchronous designs on the 
conventional FPGA architecture. UNCLE brings some uncertainty to these results with its poorer 
synthesis optimisation on the various circuits. This will be highlighted further in the next chapter 
with the comparisons between the simulated synchronous and asynchronous reconfigurable versions 
of the NoC router sub-modules. 
4.5 Summary 
In this chapter, a specific CAD flow, modified from the standard VTR CAD flow, has been built for 
synchronous designs mapped onto the FPGA system and asynchronous designs mapped onto the NRLB 
system. With this customised CAD flow and the specific architecture files, a series of comparisons 
has been drawn to validate both synchronous and asynchronous systems using the same 28nm 
process technology. VTR is an open-source framework for modelling FPGA devices. It is not as 
powerful as the Intel® Quartus® Prime CAD toolset, but it has the scalability and 
flexibility to model any reconfigurable system and is thus more appropriate in this thesis. Based on 
these results, a revised architecture description file (called k6_frac_N10_mem32K_28nm) has been 
generated from the original flagship 40nm FPGA architecture file to complement the new 
architecture description file (ncl_reconfigurable_logic_op2) that was derived from the NRLB 
architectural characteristics described in Chapter 3. 
All the benchmarks used in the experiments are medium-sized circuits, and they are translatable into 
NCL-based asynchronous designs. The results show that under the same process technology, the 
asynchronous designs mapped onto NRLBs have an average of 51% lower latency than the 
synchronous designs mapped onto FPGAs. However, the asynchronous designs are on average 9x larger 
in area than their synchronous counterparts. As described in Section 4.4, the transformation from 
single-rail to dual-rail logic doubles the circuit size, and the UNCLE toolset also optimises the 
logic less effectively when transforming the circuit into an NCL-based structural netlist. 
Furthermore, each data-driven NCL register requires a whole NRLB to implement, which wastes 
further area. The 
area and latency performance have also been compared across NRLB organisations with different 
numbers (N) of outputs, following the findings in Chapter 3. A further conclusion is that the NRLB 
architecture that best suits the UNCLE synthesis process is N = 2. 
The next chapter will build a representative system using this CAD flow on the NRLB platform. The 
overall objective is to determine if the NoC router can capitalise on the low latency characteristics of 
the NRLB to achieve similar low latency interconnection when mapped to a complete asynchronous 
reconfigurable platform.  
  
Network on Chip Router 
In Section 2.3.1, it was identified that a vital performance target for NoC Routers is low latency, and 
this might be achieved using asynchronous techniques as these tend to exhibit similar low latency 
behaviour.  In this chapter, the NoC router microarchitecture chosen for this study is mapped on both 
synchronous and asynchronous reconfigurable systems, and comparisons are made between these 
implementation strategies. It was shown previously, in Section 4.4, that the NRLB exhibits 
significantly better latency on a number of small to medium-sized benchmarks. The objective here is 
to determine if this NoC router can achieve similar low latency interconnection results when mapped 
to a complete asynchronous reconfigurable platform based on the NRLB.  
The microarchitecture of the NoC router comprises five sub-modules: input module, output module, 
crossbar, VC allocator and switch allocator. These sub-modules represent different sizes of circuits, 
and they have a range of resource requirements. Further, each of them has a different level of 
algorithmic complexity, which can cover a range of delay schemes.  Thus, the objective has been to 
use this architecture as a case study to identify the relative performance advantages or disadvantages 
of the NRLB asynchronous reconfigurable approach. 
5.1 Experimental Setup 
As mentioned in Section 1.4 on page 7, the virtual channel router microarchitecture described in [28] 
was used for these experiments. This VC router has five sub-modules of differing circuit sizes and 
algorithmic complexities, and thus they are able to demonstrate a range of area and latency behaviours. 
The baseline architectural description for the overall router can be found in Appendix B. 
Based on the customised VTR CAD Flow described in the previous chapter (see Figure 45), the 
experiments started with Verilog RTL models of the NoC router sub-modules that were synthesized 
with Cadence® Genus™. Then the synchronous structural netlist was passed through logic 
optimisation and technology mapping stages with ODIN II and ABC, respectively. UNCLE was 
employed to generate the asynchronous structural netlist, and the new netlist was technology mapped 
using Yosys in the asynchronous design flow. Finally, the size and timing statistics of both the 
synchronous and asynchronous router sub-modules were derived by VPR for later analysis. 
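
The two tool flows described above can be summarised as ordered lists of stages. The following Python sketch is only an illustrative data structure: the stage descriptions are paraphrased from the text, the names `Stage`, `SYNC_FLOW` and `ASYNC_FLOW` are hypothetical, and it is assumed (the text implies but does not state it) that UNCLE takes the Genus netlist as its input. No tool command lines or options are implied.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    step: str  # what the stage does, paraphrased from the text
    tool: str  # tool named in the text for this stage


# Synchronous flow targeting the FPGA architecture
SYNC_FLOW = [
    Stage("RTL synthesis to a structural netlist", "Cadence Genus"),
    Stage("logic optimisation", "ODIN II"),
    Stage("technology mapping", "ABC"),
    Stage("size and timing statistics", "VPR"),
]

# Asynchronous flow targeting the NRLB architecture
ASYNC_FLOW = [
    Stage("RTL synthesis to a structural netlist", "Cadence Genus"),
    Stage("generation of the asynchronous (NCL) netlist", "UNCLE"),
    Stage("technology mapping", "Yosys"),
    Stage("size and timing statistics", "VPR"),
]


def describe(flow):
    """Return the tool chain of a flow as a single readable string."""
    return " -> ".join(stage.tool for stage in flow)


def shared_tools(flow_a, flow_b):
    """Return the tools of flow_a that also appear in flow_b, in order."""
    tools_b = {stage.tool for stage in flow_b}
    return [stage.tool for stage in flow_a if stage.tool in tools_b]
```

Under these assumptions the two flows share only their entry point (Genus) and their exit point (VPR), which is why the reported area and latency figures come from the same tool even though the intermediate netlists differ.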
Four router parameters, Flit Data Width, Number of Ports, Number of Virtual Channels and Buffer 
Depth, are tuned in the RTL models to control the size and depth of the five sub-modules so that 
each effect can be analysed in isolation. In each experiment, one parameter is varied while the 
others are kept at their baseline values (Table 23). The range of router parameters that can 
present indicative area and delay for an NoC was chosen based on the studies of Abdelfattah [29]. 
Finally, the area and latency statistics generated by VPR are presented and analysed.  
Table 23 Baseline and Range of Router Parameters 
            Flit Data Width   Number of Ports   Number of Virtual       Buffer Depth on 
            (bits)                              Channels on each port   each port (Flits) 
Baseline    32                5                 2                       32 
Range       16 ~ 256          2 ~ 16            1 ~ 10                  4 ~ 64 
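
The one-parameter-at-a-time procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the scripts used in the experiments: the dictionary keys and the function name are hypothetical, and the step sizes (16 bits for Flit Data Width, 4 flits for Buffer Depth, 1 otherwise) are assumptions read from the axes of the figures in this chapter.

```python
# Baseline values from Table 23 (key names are hypothetical)
BASELINE = {"flit_width": 32, "num_ports": 5, "num_vcs": 2, "buffer_depth": 32}

# Inclusive ranges from Table 23; step sizes are assumed from the figure axes
SWEEPS = {
    "flit_width": range(16, 257, 16),
    "num_ports": range(2, 17),
    "num_vcs": range(1, 11),
    "buffer_depth": range(4, 65, 4),
}


def one_at_a_time(baseline, sweeps):
    """Yield (varied_parameter, configuration) pairs.

    Each configuration changes exactly one parameter and leaves the
    others at their baseline values.
    """
    for name, values in sweeps.items():
        for value in values:
            config = dict(baseline)
            config[name] = value
            yield name, config


runs = list(one_at_a_time(BASELINE, SWEEPS))
```

Each entry of `runs` corresponds to one synthesis and VPR run per sub-module; the baseline configuration itself appears once in every sweep.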
 
In the following sections, the areas of the synchronous NoC sub-modules and of the corresponding 
asynchronous designs, both normalised to MWTU, are shown in the line charts below (see Figure 46 
to Figure 49, Figure 54 to Figure 56, Figure 60, Figure 61, Figure 64, Figure 65, Figure 68 and 
Figure 69). Each figure presents three results: the synchronous designs in blue, the asynchronous 
designs in orange and, in grey, the area ratio of the asynchronous design to the synchronous 
design (Area_asynch / Area_synch). As discussed in Section 4.4, the area ratio is expected to be 
around 9x because of the size expansion from single-rail to dual-rail logic operation and the 
sub-optimal mapping optimisation in VTR. 
The latency statistics, in nanoseconds, of the sub-modules implemented on both reconfigurable 
systems are shown in the bar charts below (see Figure 50 to Figure 53, Figure 57 to Figure 59, 
Figure 62, Figure 63, Figure 66, Figure 67, Figure 70 and Figure 71). As with the area graphs, 
each figure presents three results: the synchronous designs in blue, the asynchronous designs in 
orange and, in grey, the latency ratio, which is the difference between the asynchronous and 
synchronous circuit latencies measured against the synchronous circuit latency, 
(Latency_asynch - Latency_synch) / Latency_synch.  
In this case, a negative latency ratio is expected, as the asynchronous style on the NCL-based 
reconfigurable system has better latency performance than the synchronous style on the FPGA 
system. In contrast, a positive ratio indicates that the asynchronous style takes longer to 
transfer the data from input to output than the synchronous style in the experiments, and 
therefore represents a disadvantage compared to the synchronous case. 
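
The two ratios defined above can be written out as a short Python sketch. The function names are hypothetical; the example values are the VC allocator latencies quoted in Section 5.5 for 11 ports (395ns asynchronous, 610ns synchronous).

```python
def area_ratio(area_asynch, area_synch):
    """Area of the asynchronous design relative to the synchronous one."""
    return area_asynch / area_synch


def latency_ratio(latency_asynch, latency_synch):
    """Relative latency difference; negative means the asynchronous design is faster."""
    return (latency_asynch - latency_synch) / latency_synch


def asynchronous_is_faster(latency_asynch, latency_synch):
    """True when the latency ratio is negative."""
    return latency_ratio(latency_asynch, latency_synch) < 0


# VC allocator with 11 ports, values in nanoseconds as quoted in Section 5.5
example_ratio = latency_ratio(395.0, 610.0)
example_label = f"{example_ratio:+.0%}"
```

With these inputs the ratio is about -0.35, which agrees with the 35% maximum difference reported for that design point.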
5.2 Input Module 
The input module is the largest component in the NoC router architecture and includes the input 
memory buffers and the hardware implementation of the routing and control algorithms. Genus™ has 
synthesised the input module into a gate-level netlist comprising groups of Boolean logic gates, 
so the memory buffers are implemented as a “soft” memory circuit in this scenario. 
(1)  Area 
 
Figure 46 Area of Input Modules with Various Flit Data Width 
[Figure 46 chart: Area (MWTU) and Ratio versus Flit Data Width (Bits), 16 to 256; series: synch, asynch, area_ratio] 
As can be seen in Figure 46 and Figure 47, the parameters Flit Data Width and Buffer Depth have a 
similar correlation to area: when data width or buffer depth increases, the areas of both 
synchronous and asynchronous input modules increase on average by 55% and 57%, respectively. This 
is entirely expected as, with each increase in data width or buffer depth, the same increment in 
logic blocks and registers is required to implement packet processing or buffering, which linearly 
increases the area. With regard to the area ratios, both graphs show some minor fluctuations, but 
the ratios mostly remain steady at around 11x.  
 
Figure 47 Area of Input Modules with Various Buffer Depth 
[Figure 47 chart: Area (MWTU) and Ratio versus Buffer Depth (Flits), 4 to 64; series: synch, asynch, area_ratio] 
 
Figure 48 Area of Input Modules with Various Numbers of Ports 
[Figure 48 chart: Area (MWTU) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, area_ratio] 
The parameter Number of Ports has the least impact on the size of the input module, as shown in 
Figure 48. The areas of both synchronous and asynchronous input modules have only increased by 
0%~2% when the number of ports is more than 5. The average area ratio also sits at about 11x. 
 
Figure 49 Area of Input Modules with Various Numbers of Virtual Channels 
[Figure 49 chart: Area (MWTU) and Ratio versus Number of VCs, 1 to 10; series: synch, asynch, area_ratio] 
When the parameter Number of Virtual Channels is varied, the asynchronous input module is on 
average 10x larger than the synchronous input module (Figure 49). With each additional VC, the 
areas of both synchronous and asynchronous input modules increase by roughly 10%. Notably, the 
area ratio drops slightly when the number of VCs is 6 or above. This is because the asynchronous 
input module grows more slowly than its synchronous counterpart, which means that fewer additional 
logic blocks are synthesised to expand the asynchronous input module from six VCs to eight. 
All area statistics indicate an average area ratio of 10x~11x between the asynchronous and 
synchronous reconfigurable input modules, which is as predicted. The results exhibit the same 
trends as the benchmark samples discussed in Section 4.4. 
(2)  Delay 
As the Flit Data Width increases, the latency ratio in Figure 50 shows a rising trend, ranging 
from -38% to 33%, with some fluctuations. The pipeline stages are generated by the design 
synthesis, and the number of stages readily affects the latency and causes these fluctuations. 
Because more logic blocks are used as the data width increases without any increase in the number 
of pipeline stages, the asynchronous input module loses its latency advantage when the data width 
is 112 bits and above. Nevertheless, the asynchronous input module performs best relative to the 
synchronous design at a data width of 32 bits, where its latency of 69ns is its second-lowest 
value and the ratio reaches its minimum of -38%. 
 
Figure 50 Latency of Input Modules with Various Flit Data Width 
In Figure 51, which shows the Input Buffer Depth variations, the average latency ratio is -22%. 
Compared with Data Width, the buffer space is the dominant area component. It also affects the 
complexity of the algorithm, as the pipeline stage delay increases slightly with buffer depth, 
which enables the asynchronous input module to retain its superiority in latency. There are two 
small dips in the latency ratio line where the buffer depths are 16 flits (2^4) and 32 flits 
(2^5). At these points, the implementation requires 4-bit and 5-bit memory addresses for the 
buffer organisations, which are mapped more efficiently than when the depth is not a power of 2. 
In these two scenarios, the asynchronous input modules are 29% and 35% faster than their 
synchronous counterparts. 
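
The relationship between buffer depth and address width can be illustrated with a short Python sketch. The function names are hypothetical, and the count of unused address codes is offered only as one plausible reason why power-of-2 depths map more efficiently; the text does not state the mechanism.

```python
def address_bits(depth):
    """Number of address bits needed to index `depth` buffer entries."""
    return max(1, (depth - 1).bit_length())


def is_power_of_two(depth):
    """True when depth is exactly 2**k for some non-negative integer k."""
    return depth > 0 and depth & (depth - 1) == 0


def unused_addresses(depth):
    """Address codes that exist in the address space but select no entry."""
    return 2 ** address_bits(depth) - depth


# Address widths for the swept buffer depths (4 to 64 flits in steps of 4)
widths = {depth: address_bits(depth) for depth in range(4, 65, 4)}
```

For depths of 16 and 32 flits the address space is fully used, whereas a depth of 24 flits also needs 5 address bits but leaves 8 codes unused.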
[Figure 50 chart: Latency (ns) and Ratio versus Flit Data Width (Bits), 16 to 256; series: synch, asynch, latency_ratio] 
 
Figure 51 Latency of Input Modules with Various Buffer Depth 
[Figure 51 chart: Latency (ns) and Ratio versus Buffer Depth (Flits), 4 to 64; series: synch, asynch, latency_ratio] 
 
Figure 52 Latency of Input Modules with Various Numbers of Ports 
[Figure 52 chart: Latency (ns) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, latency_ratio] 
The value of Number of Ports is calculated from three parameters in this router microarchitecture: 
Dimensions, Number of Adjacent Routers in Each Dimension and Number of Network Nodes Attached 
to Each Router. Dimensions is fixed at 2 to give a planar network, and the number of adjacent 
routers in each dimension is 2 (north-south and east-west). The number of ports therefore changes 
only with the number of attached network nodes. As a result, the latencies of both synchronous and 
asynchronous input modules do not increase with the port numbers. The asynchronous input module 
retains its advantage with an average of 32% lower latency than the synchronous input module, as 
shown in Figure 52.  
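
The dependence of the port count on the three parameters can be sketched as follows. The formula used here (dimensions multiplied by adjacent routers per dimension, plus attached nodes) is an assumption that is consistent with the baseline of 5 ports, but it is not quoted from the router source, and the function name is hypothetical.

```python
def number_of_ports(dimensions, adjacent_per_dimension, nodes_per_router):
    """Assumed port count: one port per neighbouring router plus one per attached node."""
    return dimensions * adjacent_per_dimension + nodes_per_router


# Planar network: 2 dimensions, 2 adjacent routers in each dimension
baseline_ports = number_of_ports(2, 2, 1)

# With the network shape fixed, only the number of attached nodes changes the port count
ports_by_nodes = {nodes: number_of_ports(2, 2, nodes) for nodes in range(1, 13)}
```

Under this assumed formula, 1 to 12 attached nodes give 5 to 16 ports. The sketch does not attempt to explain the sweep points below 5 ports.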
The behaviour with Number of Virtual Channels in Figure 53 is similar to that with Number of 
Ports (Figure 52): the asynchronous input module is, on average, 36% faster than the synchronous 
input module as the number of VCs varies. Varying the number of VCs does not change the latencies 
much, so the latency ratio remains steady.  
 
Figure 53 Latency of Input Modules with Various Numbers of Virtual Channels 
The input module mapped on the NRLB system shows an improvement of nearly 40% in propagation 
delay compared with the synchronous FPGA system. The improvement stems mostly from the complexity 
of the routing algorithms and the direct data-forwarding behaviour of NCL; however, it comes at 
the expense of the expected 10x area overhead. 
5.3 Output Module 
The output module is the smallest sub-module in the NoC router microarchitecture. It acts as a 
register that passes flits from the crossbar to the downstream router to improve the pipelining 
behaviour. In addition to forwarding packets, it also delivers feedback messages to the VC 
allocator and the switch allocator.  
[Figure 53 chart: Latency (ns) and Ratio versus Number of VCs, 1 to 10; series: synch, asynch, latency_ratio] 
 
(1)  Area 
On the FPGA systems, increases in Flit Data Width and Number of Ports do not have a significant 
impact on the size of the synchronous output module. Figure 54 shows that the synchronous output 
module has two slightly different stages: the first covers data widths from 16 to 112 bits, and 
the second covers 128 to 256 bits. In the first stage, the number of logic blocks used increases 
by one CLB with each 16-bit increment in data width; the area of the synchronous output module 
then drops slightly, owing to better VPR clustering, as the data width reaches 128 bits and 
enters the second stage. By contrast, the asynchronous output module grows steadily and linearly. 
In terms of the ratio, during the first stage from 16 to 112 bits, the asynchronous output module 
mapped on the NRLB system is about 8x larger than its synchronous counterpart mapped on the FPGA 
system, before the ratio reaches an average of 11x in the second stage.  
 
Figure 54 Area of Output Modules with Various Flit Data Width 
[Figure 54 chart: Area (MWTU) and Ratio versus Flit Data Width (Bits), 16 to 256; series: synch, asynch, area_ratio] 
In Figure 55, both synchronous and asynchronous output modules see a sudden drop in resource 
usage when the number of ports is 8, before continuing to climb. This is because Genus™ 
synthesises fewer logic gates in the output module structural netlist for 8 output ports than for 
7 output ports. Thus, fewer CLBs and NRLBs are used in the VPR clustering stage, and the areas 
decrease at this specific point. Even though the decline in resource usage occurs at the same 
point for both synchronous and asynchronous output modules, the synchronous output module has a 
larger area reduction, at 30%, than the 10% of its asynchronous counterpart. The area ratio, on 
the other hand, stays within a relatively constant range, fluctuating between 8x and 10x. 
 
Figure 55 Area of Output Modules with Various Numbers of Ports 
[Figure 55 chart: Area (MWTU) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, area_ratio] 
The Number of Virtual Channels has a more significant impact on the size of the output module. 
Figure 56 shows that the area grows rapidly with the number of VCs, faster than with the previous 
parameters. The area ratio climbs from 8x to 10x because the area of the asynchronous output 
module increases faster than that of its synchronous counterpart. 
 
Figure 56 Area of Output Modules with Various Numbers of Virtual Channels 
[Figure 56 chart: Area (MWTU) and Ratio versus Number of VCs, 1 to 10; series: sync, async, area_ratio] 
The average area ratio of the output module ranges from 8x to 10x, which is as expected. As these 
three independent parameters are varied, the areas of both synchronous and asynchronous output 
modules increase linearly in most cases. 
(2)  Delay 
The variation of Data Width does not have much impact on the complexity of the algorithms in the 
output module, which is directly reflected in its latency. The latencies of the synchronous and 
asynchronous output modules reach their lowest values of 41ns and 40ns, respectively, when Data 
Width is 64 bits, as shown in Figure 57. Overall, the two output module implementations differ 
little in latency, with an average ratio of 4%. 
 
Figure 57 Latency of Output Modules with Various Flit Data Width 
Likewise, the latency is insensitive to changes in Number of Ports. In Figure 58, the difference 
between synchronous and asynchronous output modules throughout the graph is around 2%, which is 
again negligible. The latencies of both modules fall by almost 50%, from 60ns to 32ns, when the 
number of ports is 8. More output ports can reduce the latency and increase the data throughput 
rate, but the 8-port output module has the best performance, as shown in the graph, and there are 
no other significant improvements beyond this point. 
[Figure 57 chart: Latency (ns) and Ratio versus Flit Data Width (Bits), 16 to 256; series: synch, asynch, latency_ratio] 
 
Figure 58 Latency of Output Modules with Various Numbers of Ports 
[Figure 58 chart: Latency (ns) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, latency_ratio] 
In Figure 59, the latency ratio decreases as the number of VCs increases. The ratio starts at 22%, 
which means that the latency of the asynchronous output module is higher than that of its 
synchronous counterpart, and it drops to -20% when the number of VCs is 10. More Virtual Channels 
can achieve higher data throughput, and this is reflected in the latency of the asynchronous 
output module, which becomes lower than in the synchronous case, but the area overhead is 
enormous.  
 
Figure 59 Latency of Output Modules with Various Numbers of Virtual Channels 
[Figure 59 chart: Latency (ns) and Ratio versus Number of VCs, 1 to 10; series: sync, async, latency_ratio] 
Overall, the output module is a very small design, and it has no significant impact on either the 
synchronous FPGA system or the asynchronous NRLB system. The area of the asynchronous output 
module remains in the range of 8x~12x larger than its synchronous counterpart, and the latencies 
of the synchronous and asynchronous output modules differ little.  
5.4 Crossbar Module 
The crossbar is multiplexer-based, and it transfers flits from the input ports to the output 
ports selected by the switch allocator. The crossbar module is a combinational logic block without 
any sequential register. To allow the synchronous and asynchronous crossbar performances to be 
compared, a flip-flop is manually added at the output ports of the crossbar. 
(1)  Area 
Changes in Flit Data Width are reflected only in the size of the data vectors that pass through 
the crossbar module. Therefore, both synchronous and asynchronous crossbars grow linearly with 
increasing data width, as shown in Figure 60. With every 16-bit increase in Data Width, the 
synchronous design grows by 24 CLBs and the asynchronous design by 814 NRLBs. The areas of the 
asynchronous crossbars are consistently 7.5x larger than the equivalent points on the synchronous 
curve in this graph. 
 
Figure 60 Area of Crossbars with Various Flit Data Width 
[Figure 60 chart: Area (MWTU) and Ratio versus Flit Data Width (Bits), 16 to 256; series: synch, asynch, area_ratio] 
Increasing the Number of Ports on the crossbar affects both the input ports and the output ports: 
N multiplexers, each with N input ports, are joined to construct the N-port crossbar. The SELECT 
signal vector that chooses the output ports grows as well. As a result, the area increases much 
faster than it does with data width. Although the area of the asynchronous crossbar grows 
exponentially, the area of the synchronous crossbar fluctuates, as shown in Figure 61. When mapped 
on the FPGA system, the number of LUTs grows slightly irregularly, which results in an unstable 
area increment. This is almost certainly due to the way that the FPGA mapping and placement 
algorithms partition the circuit amongst the CLB elements, taking any opportunity to pack the CLBs 
more densely. Overall, the average area ratio is around 7.6x. 
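
A first-order structural count shows why the port count affects the crossbar more strongly than the data width. The Python sketch below is not the area model used by VPR; it simply counts multiplexer data inputs for an N-port, W-bit crossbar built from N multiplexers with N inputs each, and it assumes binary-encoded SELECT signals. The function name and dictionary keys are hypothetical.

```python
def crossbar_counts(num_ports, flit_width):
    """First-order structural counts for a multiplexer-based crossbar."""
    # Bits needed to select one of num_ports inputs (binary encoding assumed)
    select_bits = max(1, (num_ports - 1).bit_length())
    return {
        # One single-bit N:1 multiplexer per output port and per flit bit
        "mux_count": num_ports * flit_width,
        # Every such multiplexer has one data input per input port
        "mux_data_inputs": num_ports * num_ports * flit_width,
        # One SELECT vector per output port
        "select_bits_total": num_ports * select_bits,
    }


baseline = crossbar_counts(num_ports=5, flit_width=32)
wider = crossbar_counts(num_ports=5, flit_width=64)
more_ports = crossbar_counts(num_ports=10, flit_width=32)
```

In this simple count, doubling the data width doubles the number of multiplexer data inputs, whereas doubling the number of ports quadruples it, which is consistent with the much steeper area growth observed for Number of Ports.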
 
Figure 61 Area of Crossbars with Various Numbers of Ports 
[Figure 61 chart: Area (MWTU) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, area_ratio] 
(2)  Delay 
As shown in Figure 62 and Figure 63 for Flit Data Width and Number of Ports, the latency ratios 
increase with the crossbar size. The crossbar is a non-pipelined design, and there are only 2~3 
gate delays in the synchronous crossbar and 3~7 gate delays in the asynchronous crossbar. In the 
architectural simulation, the wire delays in the crossbar are enormous compared with the 
insignificant gate delays. This is reflected in the latency of the asynchronous crossbar being 
around 2x higher than that of its synchronous counterpart, as there is a much larger number of 
NRLBs in the design. There is a drop when the number of ports is 11 in the second graph, which is 
because the synchronous crossbar exhibits 3 gate delays at this point and the latency ratio 
reaches its lowest point at 1.5x. 
 
Figure 62 Latency of Crossbars with Various Flit Data Width 
[Figure 62 chart: Latency (ns) and Ratio versus Flit Data Width (Bits), 16 to 256; series: synch, asynch, latency_ratio] 
 
Figure 63 Latency of Crossbars with Various Numbers of Ports 
[Figure 63 chart: Latency (ns) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, latency_ratio] 
The crossbar module is not a good sample for comparing synchronous and asynchronous 
reconfigurable systems. Each additional port costs a large number of additional CLBs or NRLBs to 
implement the corresponding multiplexers. While the maximum gate delays are only a small portion 
of the whole critical path delay in the asynchronous crossbar, the physical wire delay greatly 
impacts the latency, which gives the asynchronous crossbar 2x higher latency than its synchronous 
counterpart. 
5.5 Virtual Channel Allocator 
The VC allocator arbitrates the assignment of an output VC to each packet waiting at the router’s 
input ports; a packet must be granted a VC before it can proceed. As will be discussed in the next 
section, both the VC allocator and the switch allocator use the input-first technique to process 
all requests. Round-robin arbiters are used in the allocator design. VC allocation uses only the 
head flit request, so parameters such as data width and buffer size do not apply to these 
sub-modules. As an allocator, it is built mainly from combinational logic blocks that implement 
the computing algorithms, rather than from memory buffers. 
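
The behaviour of a round-robin arbiter can be illustrated with a small behavioural model in Python. This is a generic sketch, not the RTL of the router described in [28]; the class name and the pointer-update policy (priority moves to the requester after the most recent winner) are assumptions.

```python
class RoundRobinArbiter:
    """Behavioural model of a round-robin arbiter over a fixed set of requesters."""

    def __init__(self, num_requesters):
        self.num_requesters = num_requesters
        self.pointer = 0  # requester that has highest priority in the next arbitration

    def arbitrate(self, requests):
        """Return the index of the granted requester, or None if nobody requests.

        `requests` is a sequence of booleans, one per requester.
        """
        for offset in range(self.num_requesters):
            index = (self.pointer + offset) % self.num_requesters
            if requests[index]:
                # The winner becomes lowest priority for the next arbitration
                self.pointer = (index + 1) % self.num_requesters
                return index
        return None


arbiter = RoundRobinArbiter(4)
grants = [arbiter.arbitrate([True, False, True, False]) for _ in range(3)]
```

With requesters 0 and 2 asserting continuously, the grant alternates between them, so neither can starve the other. In hardware the search loop becomes combinational priority logic, which is why the allocators are dominated by logic blocks rather than storage.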
(1)  Area 
 
Figure 64 Area of VC Allocators with Various Numbers of Ports 
[Figure 64 chart: Area (MWTU) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, area_ratio] 
As the Number of Ports increases, the areas of both synchronous and asynchronous VC allocators 
increase exponentially, as shown in Figure 64. The synchronous VC allocator becomes larger than 
the synchronous input module with the same parameter settings when the number of ports is 9, and 
the asynchronous circuit does so when Number of Ports is 10. The average area ratio is 8x, and 
the ratio line is stable. 
Although the areas in Figure 65 for Number of Virtual Channels follow the same pattern as in 
Figure 64, the areas of both synchronous and asynchronous VC allocators rise much faster as Number 
of VCs grows. The VC allocators in both techniques are larger than the corresponding input modules 
with the same parameter settings when Number of VCs exceeds 4. The average area ratio when varying 
Number of VCs is 8x, the same as in the previous graph. 
 
Figure 65 Area of VC Allocators with Various Number of Virtual Channels 
[Figure 65 chart: Area (MWTU) and Ratio versus Number of VCs, 1 to 10; series: synch, asynch, area_ratio] 
(2)  Delay 
In the latency comparisons, both synchronous and asynchronous VC allocators become slower as 
Number of Ports increases. However, the latency of the asynchronous VC allocator is lower than 
that of its synchronous counterpart in most cases, and the gap keeps increasing, as shown by the 
latency ratio in Figure 66. The maximum difference of 35% occurs when the number of ports is 11; 
even so, the latency of the asynchronous VC allocator is 395ns and that of the synchronous VC 
allocator is 610ns, both of which are significant delays in the data transfer of the whole NoC 
router. 
 
Figure 66 Latency of VC Allocators with Various Number of Ports 
[Figure 66 chart: Latency (ns) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, latency_ratio] 
As with Number of Ports (Figure 66), the latency ratio drops as the number of VCs increases, as 
shown in Figure 67. The ratio starts at 9% and reaches its lowest value of -60% when the number 
of VCs is 8. Except when the number of VCs is 1 or 3, the asynchronous VC allocator is on average 
37% faster than its synchronous counterpart. 
 
Figure 67 Latency of VC Allocators with Various Number of Virtual Channels 
[Figure 67 chart: Latency (ns) and Ratio versus Number of VCs, 1 to 10; series: synch, asynch, latency_ratio] 
The asynchronous VC allocator mapped onto the NRLB system has significant latency advantages over 
its synchronous counterpart, and the advantage keeps growing. At the same time, the areas of both 
designs increase exponentially, which means that more logic blocks are spent implementing the 
computing logic as the algorithmic complexity grows. As this sub-module focuses on flow control 
rather than buffer storage, the complexity of the designs has a direct relationship with the 
reconfigurable system performance. 
5.6 Switch Allocator 
The switch allocator uses the same configuration as the VC allocator, and both focus on the 
routing logic algorithms. Switch allocation establishes a crossbar schedule that resolves 
conflicts between flits contending for the same output port. It allocates a time slot to each 
flit waiting at the input ports so that the crossbar can transfer it towards its destination. 
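
The input-first technique mentioned for both allocators can be sketched as a two-stage separable allocation. The Python model below is a generic illustration under stated assumptions, not the implementation in [28]: the function names are hypothetical, round-robin selection is used in both stages, and priority pointers are advanced only for requests that are finally granted.

```python
def rr_pick(requests, pointer):
    """Return the first requesting index at or after `pointer`, wrapping around."""
    count = len(requests)
    for offset in range(count):
        index = (pointer + offset) % count
        if requests[index]:
            return index
    return None


def input_first_allocate(requests, in_ptrs, out_ptrs):
    """Separable input-first allocation.

    requests[i][o] is True when input i requests output o.
    Returns a dict {input: output}; pointer lists are updated in place.
    """
    num_in, num_out = len(requests), len(requests[0])
    # Stage 1: every input port selects one of the outputs it requests
    choice = [rr_pick(requests[i], in_ptrs[i]) for i in range(num_in)]
    grants = {}
    # Stage 2: every output port selects one of the inputs that chose it
    for o in range(num_out):
        contenders = [choice[i] == o for i in range(num_in)]
        winner = rr_pick(contenders, out_ptrs[o])
        if winner is not None:
            grants[winner] = o
            in_ptrs[winner] = (o + 1) % num_out
            out_ptrs[o] = (winner + 1) % num_in
    return grants


requests = [[True, True, False], [True, False, False], [False, False, True]]
in_ptrs, out_ptrs = [0, 0, 0], [0, 0, 0]
first_cycle = input_first_allocate(requests, in_ptrs, out_ptrs)
second_cycle = input_first_allocate(requests, in_ptrs, out_ptrs)
```

In the first cycle inputs 0 and 1 both choose output 0, so only one of them is granted and output 1 stays idle; in the second cycle the rotated pointers let all three inputs be served. This conflict resolution over successive cycles is the crossbar schedule that the switch allocator produces.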
(1)  Area 
The areas of both synchronous and asynchronous switch allocators grow exponentially with Number 
of Ports, as shown in Figure 68. The asynchronous switch allocator is around 9x larger than the 
synchronous design. The switch allocator is also smaller than the VC allocator: the synchronous 
switch allocator has on average a 39% smaller area than the synchronous VC allocator mapped onto 
the FPGA system, and the asynchronous switch allocator is 32% smaller than the asynchronous VC 
allocator mapped onto the NRLB system. 
 
Figure 68 Area of Switch Allocators with Various Number of Ports 
[Figure 68 chart: Area (MWTU) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, area_ratio] 
Unlike the trend with Number of Ports (Figure 68), the areas of both synchronous and asynchronous 
switch allocators grow linearly when the Number of VCs varies, as shown in Figure 69. The area of 
the asynchronous switch allocator increases steadily at a rate of 50% with each additional VC. The 
switch allocator concentrates on processing grant and request signals rather than on VC flow 
control, as the VC allocator does. Thus, only a small number of logic blocks is added as the 
number of VCs increases. Overall, the asynchronous switch allocator is around 8x larger than its 
synchronous counterpart in this graph. 
 
Figure 69 Area of Switch Allocators with Various Number of Virtual Channels 
[Figure 69 chart: Area (MWTU) and Ratio versus Number of VCs, 1 to 10; series: synch, asynch, area_ratio] 
(2)  Delay 
In Figure 70, for Number of Ports, the latencies of both synchronous and asynchronous switch 
allocators increase dramatically with each additional port. The routing logic in the switch 
allocator is not as complex as in the VC allocator, so the asynchronous switch allocator has no 
significant latency advantage over the synchronous case. However, the latency ratio has a 
decreasing trend, which starts at 49% and finishes at -4%. When the number of ports is 9, the 
asynchronous switch allocator has its best performance, which is 9% faster than the synchronous 
design. 
As discussed for Figure 69, Number of Virtual Channels is not a key parameter in the switch 
allocator. There are only a few pipeline stages that grow with increasing VCs. The latency ratio 
drops from 35% and remains stable when the number of VCs is 5 and above, as shown in Figure 71. 
Although the latency of the asynchronous switch allocator is around 15% lower than that of its 
synchronous counterpart, we should not disregard the enormous latencies on both reconfigurable 
systems. 
 
Figure 70 Latency of Switch Allocators with Various Number of Ports 
[Figure 70 chart: Latency (ns) and Ratio versus Number of Ports, 2 to 16; series: synch, asynch, latency_ratio] 
 
Figure 71 Latency of Switch Allocators with Various Number of Virtual Channels 
[Figure 71 chart: Latency (ns) and Ratio versus Number of VCs, 1 to 10; series: synch, asynch, latency_ratio] 
Overall, the performance of the switch allocator is similar to that of the VC allocator. The area 
of the switch allocator is influenced more by the Number of Ports than by the Number of VCs. The 
asynchronous switch allocator mapped on the NRLB system does not show any apparent advantages 
when compared with the input module and the VC allocator, but it indicates that circuits with 
high complexity will have a better implementation on the NRLB system than on synchronous FPGA 
systems. 
5.7 Summary 
In this chapter, the five sub-modules making up the NoC router microarchitecture have been 
implemented and analysed on both synchronous and asynchronous reconfigurable systems, and their 
results compared. Four main parameters were varied to control the sizes and depths of the sub-modules. 
Under the baseline parameter settings (see Table 23), the input module and VC allocator are the 
largest sub-modules as they are gate-intensive designs, and the VC allocator also has the highest 
latency. The switch allocator has the second-highest latency as it manages the highly complex routing 
algorithms. The output module and crossbar have the lowest latency as they are pipelining-intensive 
circuits, and the output module is the smallest sub-module in the router. FLIT Data Width and Total 
Buffer Depth are the two parameters with the most influence on the size of the designs, while Number 
of Ports and Number of VCs have a greater effect on the complexity of the algorithms. 
The input module, the VC allocator and the switch allocator generally achieve 20% to 30% lower 
latency on the asynchronous reconfigurable system than their synchronous counterparts, and this 
advantage widens with increasing algorithmic complexity. The output module and the crossbar are 
alike in that both exhibit small gate delays, which is reflected in a similar or worse latency on 
the NRLB system compared to the implementations on the FPGA system. These two small, simple 
modules are not a fair test for comparing the FPGA system and the NRLB system, except that they 
serve to highlight the impact of the single-rail to dual-rail transformation and the inefficient 
conversion achieved by the UNCLE tool. This is indicated by the fact that, regardless of the 
complexity of the particular sub-module, the area ratio between synchronous and asynchronous 
implementations stays within the range of 9x to 12x. 
The complete asynchronous NoC router mapped onto the NRLB has 50% lower latency than the synchronous 
NoC router mapped onto the FPGA, but it comes with a ~10x area penalty. This result is close to those 
of the other medium-sized benchmarks discussed in the previous chapter. In conclusion, the NRLB 
system could be a better low-latency option than the FPGA system for implementing a NoC, but only 
where area is not an essential constraint.  
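The percentages and multiples quoted in this summary can be read as simple ratios. The following Python sketch shows one way to compute them. It assumes that the latency figure is the relative reduction with respect to the synchronous design and that the area ratio is the asynchronous area divided by the synchronous area. These definitions, the function names and the input values are illustrative assumptions and are not measurements from this work.

```python
def latency_reduction(synch_ns: float, asynch_ns: float) -> float:
    """Relative latency reduction of the asynchronous design (assumed definition).

    Positive values mean the asynchronous design is faster; negative values
    mean it is slower than the synchronous design.
    """
    if synch_ns <= 0:
        raise ValueError("synchronous latency must be positive")
    return (synch_ns - asynch_ns) / synch_ns


def area_ratio(synch_area: float, asynch_area: float) -> float:
    """Area of the asynchronous design as a multiple of the synchronous one."""
    if synch_area <= 0:
        raise ValueError("synchronous area must be positive")
    return asynch_area / synch_area


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic.
    synch_latency_ns, asynch_latency_ns = 100.0, 50.0
    synch_area_mwtu, asynch_area_mwtu = 1.0e6, 1.0e7

    reduction = latency_reduction(synch_latency_ns, asynch_latency_ns)
    ratio = area_ratio(synch_area_mwtu, asynch_area_mwtu)
    print(f"latency reduction: {reduction:.0%}")
    print(f"area ratio: {ratio:.1f}x")
```

With these hypothetical inputs the sketch reports a 50% latency reduction and a 10.0x area ratio. A negative reduction would indicate that the asynchronous mapping is slower than the synchronous one.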
  
Conclusions 
Reconfigurable systems are increasingly in demand in the modern semiconductor industry, which can 
benefit from their flexibility, responsiveness and reusability. However, the FPGA, as an example of a 
mature reconfigurable system, faces many challenges, such as higher power consumption and 
significantly higher resource usage compared to similar ASIC implementations. Another limitation is 
that the semiconductor industry has reached a bottleneck and cannot continue to escalate silicon 
complexity, which directly affects FPGA development. Asynchronous techniques may provide a 
solution by reducing the impact of the global clock on the reconfigurable system.  
This thesis has proposed an NCL-based reconfigurable system and analysed its characteristics. It has 
also proposed a suitable CAD flow based on the existing VTR flow and the UNCLE toolset. 
Architectural simulations of both the conventional FPGA system and the asynchronous reconfigurable 
system were then carried out and compared.  
As one approach to an asynchronous reconfigurable system, the dual-rail NRLB has been proposed 
and analysed in this work. The NRLB includes embedded registration and reset functions, and it is 
able to perform the 27 fundamental NCL gate functions. Compared to previous single-rail NCL LUT 
proposals, it is approximately 50% larger (1,766 transistors) and 70% faster.  
The LUT of the NRLB was first designed with 4-bit dual-rail inputs/outputs, which can be 
flexibly mapped onto most NCL-based designs. However, more than 1000 transistors are required in 
its configuration memory shift register to deliver the programming bits, which represents 81% of the 
overall NRLB area. This was considered an excessive overhead, so to reduce the area while 
maintaining the full complement of NCL gate functions, a 1-bit dual-rail output LUT was proposed. 
The NRLB(OP2) has similar latency performance to the NRLB(OP8) but is only one third of its size.  
The NRLB(OP2) uses the same or fewer transistor resources when used for small circuits such as a 
full adder and has a lower propagation delay compared to previous asynchronous reconfigurable 
works. Moreover, the NRLB(OP2) is better suited to implementing designs via the UNCLE gate library, 
which uses a combination of basic NCL gate functions and NCL registers with set and reset signals.  
UNCLE, an NCL-based synthesis toolset, is able to synthesise a synchronous Verilog RTL design into 
a dual-rail NCL structural netlist. In general, UNCLE simply converts the Boolean gates into NCL 
threshold gates, from single-rail to dual-rail, which in itself does not necessarily result in any 
performance improvement. Although UNCLE includes acknowledge network (handshaking signal) 
generation and net buffering optimisation to obey the isochronic timing requirement, the designs 
are turned into NCL logic without any algorithmic improvements or other optimisations.  
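To illustrate what a single-rail to dual-rail conversion involves, the following Python sketch models a threshold gate with hysteresis and builds a dual-rail AND function from such gates. It uses the common NCL conventions: NULL is represented by both rails low, and a THmn gate asserts its output when at least m of its n inputs are asserted and releases it only when all inputs have returned to zero. The class names and the minterm-based construction are hypothetical and serve only as a textbook-style illustration; this is not the mapping that UNCLE actually produces.

```python
class ThresholdGate:
    """THmn gate with hysteresis: set on >= m asserted inputs, reset on all-zero."""

    def __init__(self, m: int, n: int) -> None:
        self.m, self.n, self.out = m, n, 0

    def update(self, *inputs: int) -> int:
        assert len(inputs) == self.n
        asserted = sum(inputs)
        if asserted >= self.m:
            self.out = 1          # threshold reached: assert the output
        elif asserted == 0:
            self.out = 0          # all inputs at zero: release the output
        return self.out           # otherwise hold the previous value


# Dual-rail values as (rail0, rail1) pairs; (1, 1) is illegal and never used.
NULL, DATA0, DATA1 = (0, 0), (1, 0), (0, 1)


class DualRailAnd:
    """Dual-rail AND built from four TH22 minterm gates and one TH13 gate."""

    def __init__(self) -> None:
        self.minterms = [ThresholdGate(2, 2) for _ in range(4)]
        self.merge0 = ThresholdGate(1, 3)

    def update(self, a, b):
        a0, a1 = a
        b0, b1 = b
        m00 = self.minterms[0].update(a0, b0)
        m01 = self.minterms[1].update(a0, b1)
        m10 = self.minterms[2].update(a1, b0)
        m11 = self.minterms[3].update(a1, b1)
        z0 = self.merge0.update(m00, m01, m10)   # AND is 0 for these three minterms
        z1 = m11                                 # AND is 1 only when a = 1 and b = 1
        return (z0, z1)


if __name__ == "__main__":
    gate = DualRailAnd()
    print(gate.update(DATA1, NULL))    # partial input: output stays NULL
    print(gate.update(DATA1, DATA0))   # complete DATA wavefront: DATA0
    print(gate.update(NULL, DATA0))    # partial NULL: previous output is held
    print(gate.update(NULL, NULL))     # complete NULL wavefront: NULL
```

In this construction one Boolean AND gate becomes five threshold gates and every signal becomes two wires, which gives a feel for why a direct conversion increases the gate count. Practical NCL gate libraries use more compact threshold gates, so the overhead of a real netlist differs from this sketch.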
The NCL-based automatic design flow targeting the NRLB system architecture has been modified 
from the basic VTR CAD flow, which is itself based on a hypothetical FPGA device architecture for 
research purposes. A 28 nm hypothetical FPGA architecture has been derived with reference to the 
approximate performance of the Intel® Stratix V device (28 nm process technology). Synchronous 
designs are less heavily optimised by the VTR toolset and so have a worse critical path delay than 
the same designs implemented with a commercial FPGA synthesis system (e.g. Intel Quartus). The 
asynchronous reconfigurable architecture takes its size and all delay parameters from the 
transistor-level evaluation in Chapter 3, and the rest of the parameters are kept at their defaults 
from the original (so-called) “flagship” architecture. The latency and area comparisons on both 
synchronous and asynchronous reconfigurable systems were presented to answer the first research 
question.  
Can a purely asynchronous reconfigurable system implementation exhibit latency and resource 
usage comparable to conventional synchronous FPGA implementations at an equivalent 
technology node? 
The NCL-based LUT is larger than the 6-LUT in the FPGA system under an equivalent technology 
process, and each NRLB can only implement two to four NCL gates, which is not comparable to a 
CLB comprising ten 6-LUTs. Furthermore, the resource usage of the standard benchmarks mapped 
onto the NCL-based reconfigurable system is significantly higher than on the conventional FPGA 
system. This is because UNCLE doubles the I/O ports as a result of the transformation from 
single-rail to dual-rail. Further, there are two to three times more NCL gates in the NCL structural 
netlist than Boolean gates in the synchronous structural netlist, which results in a nine to ten 
times difference in total area between the asynchronous reconfigurable circuits and the 
synchronous circuits. Although there is no advantage in terms of area, the NRLB exhibits the 
general benefits of the NCL technique in that it is delay-insensitive and correct-by-construction. It 
also exhibits up to a 50% reduction in latency compared to the corresponding synchronous design, a 
characteristic which is very useful to the NoC application examined in this work. However, this 
significant improvement does not apply to all designs. It is constrained by the circuit size and the 
complexity of the computing algorithm.  
NoC has been shown in previous research to be an effective, scalable interconnect solution for FPGA 
devices, and the reconfigurable (“soft”) NoC can be implemented on existing reconfigurable fabrics 
without any architectural changes. Thus, the soft NoC router makes sufficient use of the 
reconfigurable system silicon area. The state-of-the-art VC router microarchitecture is employed for 
its better data throughput and high flexibility. It has five sub-modules of different sizes, each with 
its own algorithm and pipelining strategy, which together demonstrate a range of area and latency 
behaviours. To answer the second research question, all five modules were modelled separately using 
the customised CAD flow with four main parameter variations. 
How will the area and delay of a NoC architecture implemented on an NCL-based asynchronous 
reconfigurable system compare with those on a conventional synchronous FPGA? 
Each sub-module implemented on the NCL-based asynchronous reconfigurable system has a stable nine 
to twelve times area difference compared to the same sub-module on the conventional synchronous 
FPGA. This further indicates that the area difference does not have a direct relationship with the 
size of the design; rather, the logic optimisation and technology mapping performed by the tools in 
the CAD flow have more impact on the final size. Thus, when choosing applications to map onto the 
NRLB system, two factors need to be considered. The first is the size of the design. For example, the 
output module is the smallest component in the NoC router microarchitecture. It gains no advantage 
from the asynchronous reconfigurable system because the global clock does not have much impact on 
the design when mapped on the conventional synchronous FPGA system. The second factor is the 
complexity of the design algorithms. To illustrate this point, we can take the crossbar as an 
example. It is only responsible for flit traversal, and it is constructed using large numbers of 
multiplexer blocks. Only a few gates participate in the flit traversal, so physical wire delays make 
up a significant part of the critical path delay in the architectural-level simulation, and the 
resource usage on the NRLB system is over ten times higher. As a result, the crossbar has a much 
higher latency on the asynchronous reconfigurable system than on the conventional synchronous 
FPGA. Other than in these two cases, the NCL-based asynchronous reconfigurable system achieves 
20% to 30% lower latency for the other three NoC router sub-modules. Furthermore, the complete 
asynchronous NoC router design mapped on the NRLB system achieves 50% lower latency than the 
same design mapped on the conventional FPGA system. 
Therefore, the reconfigurable NoC designed using the NCL methodology has been shown to reduce 
latency when implemented on the NCL-based reconfigurable system, compared to the same 
application mapped onto the conventional synchronous reconfigurable system. While the NCL-based 
reconfigurable NoC uses ten times more transistors than the synchronous case, this may be acceptable 
in cases where the improvement in latency is the most critical performance consideration. This 
advantage does not apply to all designs in general, and the limitations of size and complexity should 
be considered when selecting the application. These are the significant findings of this thesis. 
As Alan Kay put it, “The best way to predict the future is to invent it.” Thus, although 
synchronous techniques are still the preferred option, low-latency asynchronous reconfigurable 
systems may well prove in the future to be a better solution to meet the growing demands of 
on-chip interconnection.  
 
6.1 Future Work 
The research in this thesis has raised some questions that would ideally be the subject of additional 
work. Many of these relate to the relationship between the hardware and the (lack of) asynchronous 
design tools. As was identified in Chapter 3, the optimum architecture of the NRLB depends strongly 
on the UNCLE synthesis process. It was found that the placer had few opportunities to use 
more than a single dual-rail output from the NRLB, and so the cell was modified to reduce the number 
of outputs. However, this was a direct result of the manner in which UNCLE creates data-driven NCL 
registers between blocks of logic, each of which then occupies an entire NRLB and wastes resources. 
As with modern FPGA tools, an essential step would be to develop a dedicated NRLB-based synthesis 
methodology to maximise the mapping efficiency of the NRLB. If this became available, it might be 
possible to further optimise the architecture of the NRLB in concert with the tool flow. 
The NCL-based reconfigurable architecture was introduced in Chapter 4 primarily to compare the 
area and latency of the synchronous and asynchronous reconfigurable systems. As the asynchronous 
reconfigurable system has a different data throughput behaviour from the conventional synchronous 
FPGA, it is hard to monitor the cycle time of asynchronous designs in the “packing, placement and 
routing” tool VPR, which was developed originally for conventional synchronous FPGA architectures. 
Thus, this work on NCL-based reconfigurable systems was evaluated only in terms of its latency from 
input to output. Furthermore, VTR provides transistor-level dynamic and static power estimation 
using signal activity and technology property files. The activity estimation tool ACE-2.0 [107], 
which generates the signal activities file, is based on techniques that are FPGA-friendly but not as 
appropriate for asynchronous systems. The lack of a technique for estimating power in asynchronous 
circuits has been known since 1994 [108]. It is probably time this gap was filled. 
The VC NoC router microarchitecture used in this thesis was developed as a synchronous design. 
Rather than using UNCLE to synthesise the RTL into a dual-rail NCL-based netlist, an NCL-based 
NoC router developed from scratch would be a more appropriate research vehicle. Very recently, 
asynchronous NoC simulators have started to emerge (e.g., [109]) that support system-level 
simulation of asynchronous networks on a chip. This type of simulator will allow much easier 
evaluation of asynchronous router architectures. 
  
Appendix A 
VTR uses an XML-based architecture description language to describe the targeted reconfigurable 
system. The customised architecture description XML files are available at: 
https://drive.google.com/open?id=1XEV1Rro7YJPIG8CAjcYyWCesf_jG48QT
• k6_frac_N10_mem32K_28nm.xml 
• ncl_reconfigurable_logic_op2.xml 
• ncl_reconfigurable_logic_op4.xml 
• ncl_reconfigurable_logic_op8.xml 
  
Appendix B 
Both the synchronous and asynchronous versions of the benchmarks and the NoC router were 
synthesised in the same way as this architectural simulation example. The files generated for the 
synthesis of the diffeq1 example are available at: 
https://drive.google.com/open?id=1yYkK5fcXm85uDjnM5f1jLX5PTBUmG49G

diffeq1_rtl.v
  Implement on FPGA architecture
    diffeq1_odin.blif
    diffeq1_abc.blif
    synch_results
  Implement on NRLB architecture
    diffeq1_ncl_uncle.v
    diffeq1_ncl_yosys.blif
    asynch_results
 
References 
[1] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Boston, 
MA: Springer US, 1999. 
[2] INTEL®. INTEL® SOC FPGAS. Available: 
https://www.intel.com/content/www/us/en/products/programmable/soc.html 
[3] Xilinx. SoCs, MPSoCs & RFSoCs. Available: 
https://www.xilinx.com/products/silicon-devices/soc.html 
[4] J. C. Lyke, C. G. Christodoulou, G. A. Vera, and A. H. Edwards, "An Introduction to 
Reconfigurable Systems," Proceedings of the IEEE, vol. 103, no. 3, pp. 291-317, 2015. 
[5] J. G. Koomey, S. Berard, M. Sanchez, and H. Wong, "Implications of Historical Trends in the 
Electrical Efficiency of Computing," IEEE Annals of the History of Computing, vol. 33, no. 3, pp. 
46-54, 2011. 
[6] (2011). International Technology Roadmap for Semiconductors. Available: 
http://www.itrs.net/Links/2011ITRS/Home2011.htm 
[7] R. R. Schaller, "Moore's law: past, present and future," IEEE Spectrum, vol. 34, no. 6, pp. 52-
59, 1997. 
[8] S. M. Trimberger, "Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA 
Technology," Proceedings of the IEEE, vol. 103, no. 3, pp. 318-331, 2015. 
[9] B. H. Calhoun, F. A. Honore, and A. Chandrakasan, "Design Methodology for Fine-Grained 
Leakage Control in MTCMOS," in International Symposium on Low Power Electronics and 
Design, Seoul, Korea, 2003, pp. 104-109: ACM Press. 
[10] P. Beckett, "Low-Power Spatial Computing using Dynamic Threshold Devices," in 
International Symposium on Circuits and Systems, ISCAS'05, Kobe, Japan, 2005, pp. 2345-
2348. 
[11] P. Beckett and S. C. Goldstein, "Why Area Might Reduce Power in Nanoscale CMOS," in 
International Symposium on Circuits and Systems, ISCAS'05, Kobe, Japan, 2005, pp. 2329-
2332. 
[12] A. Iyer and D. Marculescu, "Power and performance evaluation of globally asynchronous 
locally synchronous processors," in Proceedings 29th Annual International Symposium on 
Computer Architecture, 2002, pp. 158-168. 
[13] C. Traver, R. B. Reese, and M. A. Thornton, "Cell designs for self-timed FPGAs," in 
Proceedings 14th Annual IEEE International ASIC/SOC Conference (IEEE Cat. No.01TH8558), 
2001, pp. 175-179. 
[14] J. Teifel and R. Manohar, "An asynchronous dataflow FPGA architecture," IEEE Transactions 
on Computers, vol. 53, no. 11, pp. 1376-1392, 2004. 
[15] D. Fang, J. Teifel, and R. Manohar, "A high-performance asynchronous FPGA: test results," in 
13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines 
(FCCM'05), 2005, pp. 271-272. 
[16] R. U. R. Mocho, G. H. Sartori, R. P. Ribas, and A. I. Reis, "Asynchronous circuit design on 
reconfigurable devices," presented at the Proceedings of the 19th annual symposium on 
Integrated circuits and systems design, Ouro Preto, MG, Brazil, 2006.  
[17] M. Najibi, K. Saleh, M. Naderi, H. Pedram, and M. Sedighi, "Prototyping globally 
asynchronous locally synchronous circuits on commercial synchronous FPGAs," in 16th IEEE 
International Workshop on Rapid System Prototyping (RSP'05), 2005, pp. 63-69. 
[18] P. Dabholkar, "A null convention logic based platform for high speed low energy IP packet 
forwarding," ed: RMIT University, 2018. 
[19] C. Maxfield. (2013). The efficient implementation of asynchronous logic in COTS FPGAs. 
Available: https://www.eetimes.com/document.asp?doc_id=1280278# 
[20] P. A. Beerel, "Asynchronous circuits: an increasingly practical design solution," in 
Proceedings International Symposium on Quality Electronic Design, 2002, pp. 367-372. 
[21] I. E. Sutherland, "Micropipelines," Communications of the ACM, vol. 32, no. 6, pp. 720-738, 
1989. 
[22] G. Gopalakrishnan, "Developing micropipeline wavefront arbiters," IEEE Design & Test of 
Computers, vol. 11, no. 4, pp. 55-64, 1994. 
[23] K. M. Fant and S. A. Brandt, "Null Convention Logic System," US Patent 5,305,463, 1994. 
[24] K. M. Fant and S. A. Brandt, "NULL Convention LogicTM: a complete and consistent logic for 
asynchronous digital circuit synthesis," in Proceedings of International Conference on 
Application Specific Systems, Architectures and Processors, ASAP 96., 1996, pp. 261-273. 
[25] P. Dabholkar and P. Beckett, "A high throughput, low latency null convention logic 16×16-bit 
multiplier," in 2016 10th International Conference on Signal Processing and Communication 
Systems (ICSPCS), 2016, pp. 1-8. 
[26] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm," Computer, vol. 35, no. 
1, pp. 70-78, 2002. 
[27] K. Orthner, "Applying the Benefits of Network on a Chip Architecture to FPGA System 
Design," in "white paper," Intel® FPGA  
[28] D. U. Becker, "Efficient Microarchitecture for Network-On-Chip Routers," PhD thesis, Electrical 
Engineering, Stanford University, 2012. 
[29] M. S. Abdelfattah and V. Betz, "Networks-on-chip for FPGAs: Hard, soft or mixed?," ACM 
Transactions on Reconfigurable Technology and Systems (TRETS), vol. 7, no. 3, p. 20, 2014. 
[30] M. Palesi and M. Daneshtalab, Routing algorithms in networks-on-chip. New York: 
Springer, 2014. 
[31] M. S. Gaur, V. Laxmi, M. Zwolinski, M. Kumar, N. Gupta, and Ashish, "Network-on-chip: 
Current issues and challenges," in 2015 19th International Symposium on VLSI Design and 
Test, 2015, pp. 1-3. 
[32] K. Paramasivam, "Network on Chip and its Research Challenges," ICTACT Journal on 
Microelectronics, vol. 01, no. 2, pp. 83-87, July 2015. 
[33] R. Kamal and N. Yadav, NOC and Bus Architecture: A Comparison. 2012. 
[34] CMP. IC 28nm CMOS28FDSOI. Available: https://mycmp.fr/datasheet/ic-28nm-cmos28fdsoi 
[35] B. Vandana, "Study of floating body effect in SOI technology," International Journal of 
Modern Engineering Research, vol. 3, pp. 1817-1824, 2013. 
[36] J. Luu et al., "VTR 7.0: Next generation architecture and CAD system for FPGAs," ACM 
Transactions on Reconfigurable Technology and Systems (TRETS), vol. 7, no. 2, p. 6, 2014. 
[37] INTEL®. (2015). Stratix V Device Overview. Available: 
https://www.intel.com/content/www/us/en/programmable/documentation/sam1403476018909.html 
[38] R. B. Reese and R. A. TAYLOR. Uncle (Unified NCL Environment) [Online]. Available: 
http://my.ece.msstate.edu/faculty/reese/uncle/UNCLE.pdf 
[39] P. Jamieson, K. B. Kent, F. Gharibian, and L. Shannon, "Odin II - An Open-Source Verilog HDL 
Synthesis Tool for CAD Research," in 2010 18th IEEE Annual International Symposium on 
Field-Programmable Custom Computing Machines, 2010, pp. 149-156. 
[40] C. Wolf. Yosys Open SYnthesis Suite. Available: http://www.clifford.at/yosys/ 
[41] ABC: A System for Sequential Synthesis and Verification. Available: 
http://www.eecs.berkeley.edu/~alanmi/abc/ 
[42] V. Betz and J. Rose, "VPR: a new packing, placement and routing tool for FPGA research," 
Berlin, Heidelberg, 1997, pp. 213-222: Springer Berlin Heidelberg. 
[43] J. Luu et al., "VPR 5.0: FPGA cad and architecture exploration tools with single-driver routing, 
heterogeneity and process scaling," presented at the Proceedings of the ACM/SIGDA 
international symposium on Field programmable gate arrays, Monterey, California, USA, 
2009.  
[44] R. Gindin, I. Cidon, and I. Keidar, "NoC-Based FPGA: Architecture and Routing," in First 
International Symposium on Networks-on-Chip (NOCS'07), 2007, pp. 253-264. 
[45] R. E. Payne, "Self-Timed FPGA Systems," presented at the 5th International Workshop on 
Field Programmable Logic and Applications, 1995.  
[46] S. Hauck, S. Burns, G. Borriello, and C. Ebeling, "An FPGA for implementing asynchronous 
circuits," IEEE Design & Test of Computers, vol. 11, no. 3, p. 60, 1994. 
[47] K. Maheswaran, Implementing Self-timed Circuits in Field Programmable Gate Arrays. 
University of California, Davis, 1995. 
[48] D. R. Lamb, "Self-Timed Circuits for Adaptive Processing Systems," presented at the Military 
and Aerospace Applications of Programmable Devices and Technologies Conference, 1998.  
[49] M. Aydin and C. Traver, "Implementation of a Programmable Phased Logic Cell," presented 
at the 45th Midwest Symposium on Circuits and Systems, Vol. 2, 2002.  
[50] K. Meekins, D. Ferguson, and M. Basta, "Delay Insensitive NCL Reconfigurable Logic," 
presented at the IEEE Aerospace Conference, Vol. 4, 2002.  
[51] S. C. Smith, "Design of an FPGA Logic Element for Implementing Asynchronous NULL 
Convention Logic Circuits," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 
vol. 15, no. 6, pp. 672-683, 2007. 
[52] D. E. Muller and W. S. Bartky, A Theory of Asynchronous Circuits (no. v. 1). Univ., 1956. 
[53] I. David, R. Ginosar, and M. Yoeli, "An efficient implementation of Boolean functions as self-
timed circuits," Computers, IEEE Transactions on, vol. 41, no. 1, pp. 2-11, 1992. 
[54] D. E. Muller, "Asynchronous logics and application to information processing," in Switching 
Theory in Space Technology. Stanford, CA: Stanford University Press, 1963. 
[55] G. E. Sobelman and K. M. Fant, "CMOS circuit design of threshold gates with hysteresis," in 
IEEE International Symposium on Circuits and Systems (II), 1998, pp. 61-65. 
[56] S. C. Smith, R. F. DeMara, J. S. Yuan, M. Hagedorn, and D. Ferguson, "Delay-insensitive 
gate-level pipelining," Integration, vol. 30, no. 2, pp. 103-131, 2001. 
[57] M. M. Kim, J. Kim, and P. Beckett, "Area performance tradeoffs in NCL multipliers using two-
dimensional pipelining," in 2015 International SoC Design Conference (ISOCC), 2015, pp. 125-
126. 
[58] S. C. Smith and J. Di, "Designing Asynchronous Circuits using NULL Convention Logic (NCL)," 
Synthesis Lectures on Digital Circuits and Systems, vol. 4, no. 1, pp. 1-96, 2009. 
[59] L. D. Tran, G. I. Matthews, P. Beckett, and A. Stojcevski, "Null convention logic (NCL) based 
asynchronous design — fundamentals and recent advances," in 2017 International 
Conference on Recent Advances in Signal Processing, Telecommunications & Computing 
(SigTelCom), 2017, pp. 158-163. 
[60] D. Sokolov, A. d. Gennaro, and A. Mokhov, "Reconfigurable asynchronous pipelines: From 
formal models to silicon," in 2018 Design, Automation & Test in Europe Conference & 
Exhibition (DATE), 2018, pp. 1562-1567. 
[61] M. Ligthart, K. Fant, R. Smith, A. Taubin, and A. Kondratyev, "Asynchronous Design Using 
Commercial HDL Synthesis Tools," presented at the Proceedings of the 6th International 
Symposium on Advanced Research in Asynchronous Circuits and Systems, 2000.  
[62] A. Kondratyev and K. Lwin, "Design of asynchronous circuits by synchronous CAD tools," 
presented at the Proceedings of the 39th annual Design Automation Conference, New 
Orleans, Louisiana, USA, 2002.  
[63] M. Moreira, A. Neutzling, M. Martins, A. Reis, R. Ribas, and N. Calazans, "Semi-custom NCL 
Design with Commercial EDA Frameworks: Is it Possible?," in 2014 20th IEEE International 
Symposium on Asynchronous Circuits and Systems, 2014, pp. 53-60. 
[64] C. Traver, R. B. Reese, and M. A. Thornton, "Cell Designs for Self-Timed FPGAs," presented at 
the 14th Annual IEEE International ASIC/SOC Conference, 2001.  
[65] S. Ataei and R. Manohar, "AMC: An Asynchronous Memory Compiler," in 2019 25th IEEE 
International Symposium on Asynchronous Circuits and Systems (ASYNC), 2019, pp. 1-8. 
[66] M. S. Hromalik, K. Green, H. Philipp, M. W. Tate, and S. M. Gruner, "Asynchronous and 
synchronous implementations of the autocorrelation function for the FPGA X-ray pixel array 
detector," in Real Time Conference (RT), 2012 18th IEEE-NPSS, 2012, pp. 1-8. 
[67] W. Chen, H. Wu, S. Wei, A. He, and H. Chen, "An Asynchronous Energy-Efficient CNN 
Accelerator with Reconfigurable Architecture," in 2018 IEEE Asian Solid-State Circuits 
Conference (A-SSCC), 2018, pp. 51-54. 
[68] C. Xu, C. Wang, L. Gong, L. Jin, X. Li, and X. Zhou, "Domino: Graph Processing Services on 
Energy-Efficient Hardware Accelerator," in 2018 IEEE International Conference on Web 
Services (ICWS), 2018, pp. 274-281. 
[69] A. Kyrola, G. Blelloch, and C. Guestrin, "GraphChi: large-scale graph computation on just a 
PC," presented at the Proceedings of the 10th USENIX conference on Operating Systems 
Design and Implementation, Hollywood, CA, USA, 2012.  
[70] Q. T. Ho, J.-B. Rigaud, L. Fesquet, M. Renaudin, and R. Rolland, "Implementing Asynchronous 
Circuits on LUT Based FPGAs," in Field-Programmable Logic and Applications: Reconfigurable 
Computing Is Going Mainstream, Berlin, Heidelberg, 2002, pp. 36-46: Springer Berlin 
Heidelberg. 
[71] I. Lemberski, "LUT-oriented dual-rail quasi-delayinsensitive logic synthesis," Electronics 
Letters, vol. 50, no. 7, pp. 503-505, 2014. 
[72] I. Lemberski and A. Suponenkovs, "Asynchronous logic design targeting LUTs," in 2018 7th 
Mediterranean Conference on Embedded Computing (MECO), 2018, pp. 1-6. 
[73] I. Lemberski and A. Suponenkovs, "Asynchronous logic one-level LUT design based on partial 
acknowledgement," Microelectronics Journal, vol. 80, pp. 53-61, 2018. 
[74] I. Lemberski, A. Suponenkovs, and M. Uhanova, "LUT-Oriented Asynchronous Logic Design 
Based on Resubstitution," in 2019 14th International Conference on Design & Technology of 
Integrated Systems In Nanoscale Era (DTIS), 2019, pp. 1-4. 
[75] M. Herrera and F. Viveros, "Asynchronous 8-bit processor mapped into an FPGA device," in 
Communications and Computing (COLCOM), 2014 IEEE Colombian Conference on, 2014, pp. 
1-7. 
[76] J. Sparsø and S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective. 
Springer Publishing Company, Incorporated, 2010, p. 360. 
[77] M. M. Kim and P. Beckett, "Design Techniques for NCL-Based Asynchronous Circuits on 
Commercial FPGA," in 2014 17th Euromicro Conference on Digital System Design, 2014, pp. 
451-458. 
[78] S. C. Smith, "Design of a logic element for implementing an asynchronous FPGA," presented 
at the Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field 
programmable gate arrays, Monterey, California, USA, 2007. Available: 
https://dl.acm.org/citation.cfm?id=1216922 
[79] A. Mahram, B. Ghavami, and H. Pedram, "The impact of copy-elements on QDI asynchronous 
FPGA interconnect structure," in Design & Technology of Integrated Systems in Nanoscale 
Era, 2007. DTIS. International Conference on, 2007, pp. 87-91. 
[80] F. A. Parsan and S. C. Smith, "CMOS implementation comparison of NCL gates," in 2012 IEEE 
55th International Midwest Symposium on Circuits and Systems (MWSCAS), 2012, pp. 394-
397. 
[81] J. Wu and M. Choi, "Memristor lookup table (MLUT)-based asynchronous nanowire crossbar 
architecture," in 10th IEEE International Conference on Nanotechnology, 2010, pp. 1100-
1103. 
[82] X. Virtex, "FPGA Family," ed: Virtex-7. 
[83] W. J. Dally, "Virtual-channel flow control," IEEE Transactions on Parallel and Distributed 
Systems, vol. 3, no. 2, pp. 194-205, 1992. 
[84] G. Dimitrakopoulos, A. Psarras, and I. Seitanidis, Microarchitecture of Network-on-Chip 
Routers A Designer's Perspective. New York, NY: Springer New York : Imprint: Springer, 2015. 
[85] A. Monemi, C. Y. Ooi, and M. N. Marsono, "Low latency network-on-chip router 
microarchitecture using request masking technique," Int. J. Reconfig. Comput., vol. 2015, pp. 
2-2, 2015. 
[86] R. Sharma, V. V. Joshi, and V. M. Rohokale, "Performance of router design for Network-on-
Chip implementation," in 2012 International Conference on Communication, Information & 
Computing Technology (ICCICT), 2012, pp. 1-6. 
[87] P. M. Yaghini, A. Eghbal, S. A. Asghari, and H. Pedram, "Power comparison of an 
asynchronous and synchronous network on chip router," in Computer Conference, 2009. 
CSICC 2009. 14th International CSI, 2009, pp. 242-246. 
[88] S. E. Lee and N. Bagherzadeh, "A high level power model for Network-on-Chip (NoC) router," 
Comput. Electr. Eng., vol. 35, no. 6, pp. 837-845, 2009. 
[89] S. Badrouchi, K. Torki, R. Tourki, and A. Zitouni, "Asynchronous NoC router design," 
Journal of Computer Science, vol. 1, p. 429+, 2005. 
[90] M. K. Papamichael and J. C. Hoe, "CONNECT: re-examining conventional wisdom for 
designing nocs in the context of FPGAs," presented at the Proceedings of the ACM/SIGDA 
international symposium on Field Programmable Gate Arrays, Monterey, California, USA, 
2012.  
[91] R. M. Francis, "Exploring networks-on-chip for FPGAs," University of Cambridge, Computer 
Laboratory, 2013. 
[92] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, "A 5-GHz Mesh Interconnect for a 
Teraflops Processor," IEEE Micro, vol. 27, no. 5, pp. 51-61, 2007. 
[93] J. Bainbridge and S. Furber, "Chain: a delay-insensitive chip area interconnect," IEEE Micro, 
vol. 22, no. 5, pp. 16-23, 2002. 
[94] A. J. Martin, "The Limitations to Delay-Insensitivity in Asynchronous Circuits," in Beauty Is 
Our Business: A Birthday Salute to Edsger W. Dijkstra, W. H. J. Feijen, A. J. M. van Gasteren, D. 
Gries, and J. Misra, Eds. New York, NY: Springer New York, 1990, pp. 302-311. 
[95] A. Lines, "Asynchronous interconnect for synchronous SoC design," IEEE Micro, vol. 24, no. 1, 
pp. 32-41, 2004. 
[96] E. Beigne, F. Clermidy, P. Vivet, A. Clouard, and M. Renaudin, "An asynchronous NOC 
architecture providing low latency service and its multi-level design framework," in 11th IEEE 
International Symposium on Asynchronous Circuits and Systems, 2005, pp. 54-63. 
[97] M. Amde, T. Felicijan, A. Efthymiou, D. Edwards, and L. Lavagno, "Asynchronous on-chip 
networks," IEE Proceedings - Computers and Digital Techniques, vol. 152, no. 2, pp. 273-283, 
2005. 
[98] J. Yu, "Lookup Table Design for Null Convention Logic," RMIT University, EEET1242 Project 
Report, 2013. 
[99] M. Rajgara, "Implementation of Null Conventional Logic in COTS FPGA's," Electrical 
Engineering Masters, Wright State University, 2008. 
[100] J. Yu and P. Beckett, "A dual-rail LUT for reconfigurable logic using null convention logic," 
in Proceedings of the 24th Edition of the Great Lakes Symposium on VLSI, 
Houston, Texas, USA, 2014. 
[101] Y. Li, "Null Convention Logic Register," School of Electrical and Computer Engineering, RMIT 
University, Master project 2013. 
[102] A. Stillmaker and B. Baas, "Scaling equations for the accurate prediction of CMOS device 
performance from 180nm to 7nm," Integration, vol. 58, pp. 74-81, 2017. 
[103] F. F. Khan and A. Ye, "An evaluation on the accuracy of the minimum width transistor area 
models in ranking the layout area of FPGA architectures," in 2016 26th International 
Conference on Field Programmable Logic and Applications (FPL), 2016, pp. 1-11. 
[104] W. Zhao and Y. Cao, "New generation of predictive technology model for sub-45nm design 
exploration," in 7th International Symposium on Quality Electronic Design (ISQED'06), 2006, 
pp. 585-590. 
[105] Altera Corporation, "Stratix V device handbook," 2013. 
[106] C. Albrecht, "IWLS 2005 benchmarks," in International Workshop for Logic Synthesis (IWLS), 
2005. Available: http://www.iwls.org 
[107] J. Lamoureux and S. J. E. Wilton, "Activity Estimation for Field-Programmable Gate Arrays," 
in 2006 International Conference on Field Programmable Logic and Applications, 2006, 
pp. 1-8. 
[108] P. Kudva and V. Akella, "A technique for estimating power in asynchronous circuits," in 
Proceedings of 1994 IEEE Symposium on Advanced Research in Asynchronous Circuits and 
Systems, 1994, pp. 166-175. 
[109] S. N. Ved, S. Singh, and J. Mekie, "PANE: Pluggable Asynchronous Network-on-Chip 
Simulator," J. Emerg. Technol. Comput. Syst., vol. 15, no. 1, pp. 1-27, 2019. 
 
