Optimization of Ultra-Low Power Application-Specific Asynchronous Deep Learning Integrated Circuit Design by Sherrill, Cole
University of Arkansas, Fayetteville
ScholarWorks@UARK
Computer Science and Computer Engineering
Undergraduate Honors Theses Computer Science and Computer Engineering
5-2019
This Thesis is brought to you for free and open access by the Computer Science and Computer Engineering at ScholarWorks@UARK. It has been
accepted for inclusion in Computer Science and Computer Engineering Undergraduate Honors Theses by an authorized administrator of
ScholarWorks@UARK. For more information, please contact ccmiddle@uark.edu.
Recommended Citation
Sherrill, Cole, "Optimization of Ultra-Low Power Application-Specific Asynchronous Deep Learning Integrated Circuit Design"








A thesis submitted in partial fulfillment 
of the requirements for the award of 




















Jia Di, Ph.D. 





Patrick Parkerson, Ph.D.    Yarui Peng, Ph.D. 












 The Internet of Things (IoT) consists of all devices connected to the internet, including 
battery-powered devices like surveillance cameras and smart watches. IoT devices are often idle, 
making leakage power a crucial design constraint. Currently, there are only a few low-power 
application-specific processors for deep learning. Recently, the Trustable Logic Circuit Design 
(TruLogic) laboratory at the UofA designed an asynchronous Convolutional Neural Network 
(CNN) system. However, the original design suffered from delay-sensitivity issues that undermined its 
reliable operation. The aim of this thesis research is to modify the existing CNN circuit to achieve 
increased reliability and to optimize the improved design for low-power IoT applications. 
Simulations demonstrate that the delay-sensitivity modifications allow the CNN system to operate 
more reliably. Compared with the reliable CNN circuit, the optimized CNN circuit consumes 9.9% 
less active energy and 14% less leakage power, requires 8.24% less area, and is 1.43% faster. 
These improvements show the proposed CNN system is better suited 
























©2019 by Cole Sherrill 
















This research was supported by a University of Arkansas Honors College Research Grant. I 
would like to thank my advisor, Dr. Jia Di, for the research opportunities provided to me while working in 
his lab and for the graduate research assistantship for my continued studies. I am grateful to my fellow 
TruLogic lab members for assisting me on this and other projects. Last but not least, I would not be 

















TABLE OF CONTENTS 
1 Introduction 
2 Background 
2.1 Convolutional Neural Network 
2.2 Asynchronous Logic 
2.3 Original Architecture 
3 Research Approach 
3.1 More Reliable Operation 
3.2 Optimization for IoT Applications 
4 Results and Analysis 
4.1 Setup 
4.2 Results 
5 Conclusion 





1 Introduction 
 Recently, deep learning has become a subject of great interest in industry. Although the 
theory has existed for more than 70 years, advances in computational power have only recently made 
its practical use possible [1]. For example, while traditional CPUs have 4 to 24 general-purpose cores, GPUs 
employ between 1000 and 4000 specialized data-processing cores [2]. The large number of cores in 
GPUs enables highly parallel execution, making GPUs ideal candidates for use in neural networks, 
where multiple neurons can be computed concurrently.  
In general, deep learning technology attempts to recognize patterns. At the moment, most 
deep learning services run as software on general-purpose processors [3]. Using 
general-purpose processors for deep learning applications can be very slow and power-hungry. For 
example, when Siri is asked a question, the answer is not computed by local hardware in the phone 
[4]. Instead, it is computed on powerful cloud-based servers, because the hardware in the phone 
lacks the necessary computational power. The power needs of general-purpose processors make 
them unrealistic for use in IoT applications. IoT consists of all devices connected to the internet, 
including surveillance cameras, distributed sensors, and smart watches [5]. IoT devices are often 
idle and not performing any useful work. This makes them very sensitive to leakage power, 
especially considering that many IoT devices are battery-powered. Despite this need, 
only a few low-power application-specific processors for deep learning currently exist.  
 This thesis research focuses on the optimization of an asynchronous Convolutional Neural 
Network (CNN) system designed in UofA’s Trustable Logic Circuit Design (TruLogic) laboratory 
for IoT applications. The original CNN circuit was designed using Multi-Threshold NULL 
Convention Logic (MTNCL), an ultra-low power asynchronous circuit design paradigm. MTNCL 
circuits have low leakage power because they are put to sleep while idle, which makes MTNCL well 
suited for IoT circuits. However, the original CNN system had delay-sensitivity issues that undermined its reliable operation. 
Additionally, the memory control logic had disproportionately high leakage power, which 
overshadowed the leakage power of the remaining circuitry. This Honors thesis research can be split 
into the modifications for more reliable operation and the optimizations for IoT applications. 
Throughout the rest of this thesis, the “original” design refers to the design with the delay-
sensitivity issues. The “reliable” design has the reliability improvements, and the “optimized” 
design has both the reliability improvements and the optimizations for IoT applications. 
2 Background 
2.1 Convolutional Neural Network 
Numerous deep learning algorithms and architectures exist [2]. One of the most popular 
algorithms is the Convolutional Neural Network (CNN). CNNs are multilayer neural networks 
modeled on the animal visual cortex. CNNs have two types of layers: convolution and pooling. 
The early convolution layers in the network extract features from the input. After each convolution 
layer, a pooling layer reduces the number of features by choosing the more important features. After 
processing, the CNN outputs the most important features from the input. Thus, CNNs are useful for 
image recognition, video analysis, and language processing. 
2.2 Asynchronous Logic 
 The original CNN system was designed using an asynchronous design paradigm called 
Multi-Threshold NULL Convention Logic (MTNCL). MTNCL utilizes multi-rail logic to achieve 
delay-insensitivity while removing the area overhead associated with fine-grained MTCMOS 
synchronous designs [6]. MTNCL utilizes a dual-rail encoding and has three legal values 
corresponding to Boolean ‘0’ (DATA0), Boolean ‘1’ (DATA1), and an absence of data (NULL). 
MTNCL circuits have one set of delay-insensitive (DI) registers at the input and one at the output. 
Additional DI registers can be integrated into the combinational logic for increased pipelining. To 
prevent overlapping DATA wavefronts, DI registers separate DATA waves with NULL waves by 
communicating with their neighboring registers using request and acknowledge signals. Completion 
Detection units combine all acknowledge signals to generate a request signal to the previous register 
stage. During NULL cycles, MTNCL gates are “at rest”, and their power-ground paths are gated to 
reduce leakage power.  
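As an illustration of the encoding described above, the following Python sketch models dual-rail signals and a completion check in software. The rail ordering (rail1, rail0), the function names, and the three-way DATA/NULL/PARTIAL result are assumptions made for this example. The sketch does not model the request and acknowledge handshaking or the sleep behavior of MTNCL gates.

```python
# Dual-rail signals modeled as (rail1, rail0) tuples; this ordering is an assumption.
NULL = (0, 0)    # absence of data
DATA0 = (0, 1)   # Boolean '0'
DATA1 = (1, 0)   # Boolean '1'
# (1, 1) is not a legal dual-rail value.


def encode_word(value, width):
    """Encode an unsigned integer as a list of dual-rail signals, LSB first."""
    return [DATA1 if (value >> i) & 1 else DATA0 for i in range(width)]


def null_word(width):
    """Return the NULL wavefront that separates two DATA wavefronts."""
    return [NULL] * width


def wavefront_state(word):
    """Classify a word the way a completion detector would see it."""
    if all(sig in (DATA0, DATA1) for sig in word):
        return "DATA"      # every signal carries a value
    if all(sig == NULL for sig in word):
        return "NULL"      # every signal has returned to NULL
    return "PARTIAL"       # wavefront is still propagating


if __name__ == "__main__":
    word = encode_word(5, 4)
    print(word, wavefront_state(word))
    print(wavefront_state(null_word(4)))
    print(wavefront_state([DATA1, NULL, DATA0, NULL]))
```

A register stage would only accept the next wavefront once the state is fully DATA or fully NULL, which is what keeps successive DATA wavefronts from overlapping.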
2.3 Original Architecture 
The original CNN system, illustrated in Figure 1, consists of three convolution layers, each 
followed by an average pooling layer. The CNN circuit’s control logic was omitted from the figure 
for clarity. Pipelining registers were added between each of the main component types in the 
diagram. The outputs of the third pooling layer go into a fully connected layer that generates the 
final outputs for the design. Each convolution layer consists of a convolution step, a bias addition 
step, and a ReLU step. The convolution layers receive one or more input feature maps along with 
multiple groups of weight sets and bias values, and the feature maps are processed in 4×4 blocks 
starting from the top left corner. After completing one 4×4 block, the design shifts the block two 
columns to the right until it reaches the right edge of the feature maps, at which point it moves down 
two rows and starts from the left again. The convolution step separates the 4×4 block into its four constituent 
3×3 blocks. For each 3×3 block, the convolution step multiply-accumulates the block by nine 
weight values elementwise to produce a single value. The second and third convolution layers 
receive multiple input feature maps and many groups of weights. Each group is used to generate a 
separate output feature map by summing the results of multiply-accumulating each input feature 
map by a corresponding set of nine weights in the group. The bias value corresponding to the group 
is added to the four sum results before they pass through the ReLU step. The ReLU step outputs the 
input value if it is positive and zero otherwise. Next, the four outputs of the ReLU units are fed into 
the average pooling unit, generating one value for one of the output feature maps. All values leaving 
the average pooling layers are written to an asynchronous memory unit to be used as inputs for the 
next convolution step, excluding the outputs of the third pooling layer. The outputs of the third 
pooling layer are fed into the fully connected layer as they are generated.  
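The block-wise processing described above can be sketched in Python for a single input feature map and a single set of weights. This is a behavioral illustration only: the function names are hypothetical, values are ordinary Python numbers rather than the fixed-width words used in the hardware, and the sketch assumes the 4×4 blocks tile the feature map exactly, as they do for a 32×32 input.

```python
def relu(x):
    """Pass positive values through and clamp everything else to zero."""
    return x if x > 0 else 0


def conv3x3(fmap, weights, row, col):
    """Multiply-accumulate the 3x3 block at (row, col) with nine weights."""
    return sum(fmap[row + i][col + j] * weights[i][j]
               for i in range(3) for j in range(3))


def process_block(fmap, weights, bias, row, col):
    """Process the 4x4 block whose top-left corner is (row, col).

    The block holds four overlapping 3x3 blocks. Each one is convolved,
    biased, and passed through ReLU, and the four results are averaged.
    """
    results = []
    for dr in (0, 1):
        for dc in (0, 1):
            acc = conv3x3(fmap, weights, row + dr, col + dc)
            results.append(relu(acc + bias))
    return sum(results) / 4


def conv_pool_layer(fmap, weights, bias):
    """Slide the 4x4 block over the map in steps of two, left to right, top to bottom."""
    size = len(fmap)
    positions = range(0, size - 3, 2)
    return [[process_block(fmap, weights, bias, r, c) for c in positions]
            for r in positions]


if __name__ == "__main__":
    image = [[1] * 32 for _ in range(32)]
    weights = [[1] * 3 for _ in range(3)]
    out = conv_pool_layer(image, weights, bias=0)
    print(len(out), len(out[0]), out[0][0])
```

In the second and third layers, the per-map convolution results would be summed across input feature maps before the bias is added. That accumulation step is omitted here.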
 
Figure 1: The Original Asynchronous CNN System 
The first convolution layer takes one 32×32 pixel image. It also receives five – 3×3 sets of 
weights and five corresponding bias values. Each 3×3 block in the 4×4 block of the input image is 
multiply-accumulated by the same 3×3 set of weights. The bias is added to each of the four results, 
and each value is processed by the ReLU units. Subsequently, the four values are averaged in the 
average pooling unit. The averaged result is written to the asynchronous memory. Next, the same 
operation is performed on the same 4×4 image block for a different 3×3 set of weights and bias 




The second convolution layer takes the five – 15×15 input feature maps, 15 groups of five –
3×3 sets of weights, and 15 corresponding bias values. The second and third convolution layers are 
where the accumulators following the convolution units come into play. A 4×4 block is taken from 
the same location in each of the five input feature maps. Every 4×4 block is matched to one of five 
3×3 sets of weights. Each 4×4 block is multiply-accumulated with its corresponding set of weights 
as in the first convolution layer, producing four values. Next, the four values from the five input 
feature maps are processed by the accumulators. As depicted in Figure 1, the five results from each 
of the four convolution units are summed separately, producing four results. One of the 15 bias 
values is added to the accumulated results before passing through the ReLU and average pooling 
unit. The averaged result is written to the memory as in the first layer. Subsequently, the same five 
– 4×4 blocks are processed with a different five – 3×3 weight set and different bias value to produce 
a value for a different output feature map. This process is repeated for each of the 15 groups of five 
– 3×3 weight sets and 15 bias values. The second convolution and pooling layers produce fifteen – 
7×7 output feature maps.  
The third convolution layer takes the fifteen – 7×7 feature maps, 20 groups of fifteen – 3×3 
weight sets and 20 bias values. This layer operates like the second layer except that there are 15 
accumulations, twenty – 3×3 output feature maps are produced, and the results are not written to the 
memory. As the third pooling layer completes, the 180 outputs are given to the fully connected 
layer, where they are multiply-accumulated by two different sets of 180 weights. Following the 180 
operations, a bias is added to both multiply-accumulation results to produce the CNN system’s final 
outputs. 
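The feature-map sizes quoted above are consistent with a 3×3 convolution that shrinks each dimension by two, followed by 2×2 average pooling that halves it. The short Python check below assumes that odd sizes are rounded up in the pooling step. This rounding rule is inferred from the stated 15×15, 7×7, and 3×3 sizes and is not described explicitly in the text.

```python
import math


def pooled_size(n):
    """Side length after a 3x3 convolution (n - 2) and 2x2 pooling.

    Rounding up is an assumption chosen to match the sizes given in the text.
    """
    return math.ceil((n - 2) / 2)


sizes = [32]                      # input image is 32x32
for _ in range(3):                # three convolution/pooling layer pairs
    sizes.append(pooled_size(sizes[-1]))

output_maps_layer3 = 20           # twenty 3x3 output feature maps
fc_inputs = output_maps_layer3 * sizes[-1] ** 2

if __name__ == "__main__":
    print(sizes, fc_inputs)
```

The final product, twenty maps of 3×3 values, matches the 180 inputs to the fully connected layer.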
The blocks depicted in Figure 1 are fairly self-explanatory except for the convolution units. 
The convolution units were responsible for multiply-accumulating a 3×3 block of an input feature 
map by a 3×3 set of weights. To accomplish this, the convolution units contained nine – 8×8 
Baugh-Wooley Multipliers and eight – 16-bit ripple-carry adders. The structure of the convolution 
units is provided in Figure 2. As mentioned earlier, Figure 1 does not include the control logic for 
the CNN circuit. The control logic can be split into three different components that handled the 
memory, accumulators, and multiply-accumulate units in the fully-connected layer. 
Figure 2: Convolution Unit Structure 
The memory control logic was by far the most complex component of the CNN system. The 
memory control logic consisted of an accumulator for the number of read operations, the read 
control logic, the write control logic, and a Finite State Machine (FSM) that handled the overall 
operation. The read control logic and write control logic were composed of four major components. 
The first of these was an FSM that controlled the overall operation of the read or write. The second 
contained multiple accumulators to keep track of the portions of the input feature maps being read 
in the read control logic and the output write locations in the write control logic. The third contained 
multiplexers that selected the values added to change the position in the input feature maps and 
output feature maps. The last component compared the second component’s accumulated values 
and the read/write addresses to enable transitions to the second and third convolution layers.  
The accumulator control logic consisted of a counter for the total number of accumulations, 
a counter for the five accumulations in the second convolution layer and the 15 accumulations in the 
third convolution layer, and an FSM to reset the accumulators and to allow the accumulated results 
to continue to the ReLU step. The multiply-accumulate control logic was composed of a counter for 
the number of operations and a comparator to check for the completion of the 180 operations. 
Overall, the total number of distinct sequential designs in the CNN system was 21. 
3 Research Approach 
3.1 More Reliable Operation 
 The initial effort of this thesis research focuses on fixing the delay-sensitivity issues in the 
original CNN system. The first modifications targeted the sequential components in the CNN 
system. All the sequential designs were created by instantiating the registers and manually 
connecting the acknowledge and request signals. This technique is very error-prone. After studying 
an FSM designed previously, a generic MTNCL FSM was constructed to replace the existing ones. 
The structure of the new FSM is depicted in Figure 3. The memory control logic was the first 
component to receive the new FSM. In total, the memory control logic contained 11 sequential 
components. After the original FSMs were replaced, some of the remaining sleep signals were found 
to be connected in an unusual manner. The sleep connections were revised to more closely match a 
standard MTNCL pipeline. After making these changes and a few others, the memory control logic 
functioned as intended. The same changes were applied to the accumulators, the fully-connected 
multiply-accumulate units (MACs) and the remaining control logic. 
Figure 3: Generic MTNCL FSM 
3.2 Optimization for IoT Applications 
 In addition to the reliability modifications, this thesis research aims to optimize the CNN 
circuit for IoT applications. Three optimizations were developed and implemented in the VHDL 
files. The first and most significant optimization revolved around the Baugh-Wooley Multipliers. 
As addressed earlier, the CNN system processes input feature maps in 4×4 blocks, and the 4×4 
blocks have four – 3×3 convolution blocks. Since the convolution units have one 8×8 Baugh-
Wooley Multiplier for each element of the 3×3 block, the convolution circuitry has a total of 36 – 
8×8 Baugh-Wooley Multipliers. The outputs of the convolution units are truncated to eight bits, so 
the upper three bits of the 16-bit multiplier results do not need to be calculated. The first 
optimization removed the unnecessary logic from the Baugh-Wooley Multipliers and the 
associated ripple-carry adders. Figure 4 illustrates the structure of the 8×8 Baugh-Wooley 





















Figure 4: Baugh-Wooley Multiplier with Optimization 
The second optimization focused on a subset of the completion detection trees within the 
design. Figure 5 details the structure of an unoptimized 16-bit completion detection tree. These 
components are responsible for checking whether a group of dual-rail signals are all DATA or all 
NULL. The first stage of some of the completion detection trees utilized MTNCL TH12 gates (Z = 
A + B). The optimization involved replacing the TH12 gates with MTNCL TH24comp gates (Z = 
AC + AD + BC + BD). This change approximately halved the number of outputs from the first 
stage, thereby reducing the number of gates in subsequent stages. In some cases, it may have 
eliminated a stage from the completion detection tree, reducing overall latency. The optimization 
was applied to two 32-bit trees and one 64-bit tree. Because this optimization was applied while working on 
the delay-sensitivity problems, it is already part of the reliable design and did not contribute to the 
improvement of the optimized design over the reliable design. 
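To make the first-stage change concrete, the Python sketch below evaluates the two gate equations quoted above on dual-rail inputs. It models only the Boolean set functions and ignores the sleep input of the MTNCL gates. The helper names and the (rail1, rail0) tuple layout are assumptions made for this example.

```python
from itertools import product


def th12(a, b):
    """TH12: Z = A + B. Asserted when one dual-rail signal is DATA."""
    return a | b


def th24comp(a, b, c, d):
    """TH24comp: Z = AC + AD + BC + BD. Asserted when both signals are DATA."""
    return (a & c) | (a & d) | (b & c) | (b & d)


def first_stage_th12(word):
    """One TH12 gate per dual-rail signal."""
    return [th12(r1, r0) for r1, r0 in word]


def first_stage_th24comp(word):
    """One TH24comp gate per pair of dual-rail signals (even width assumed)."""
    return [th24comp(*word[i], *word[i + 1]) for i in range(0, len(word), 2)]


# TH24comp matches the AND of two TH12 outputs for every input combination.
equivalent = all(
    th24comp(a, b, c, d) == (th12(a, b) & th12(c, d))
    for a, b, c, d in product((0, 1), repeat=4)
)

if __name__ == "__main__":
    word16 = [(1, 0)] * 16        # sixteen dual-rail signals, all DATA1
    print(equivalent)
    print(len(first_stage_th12(word16)), len(first_stage_th24comp(word16)))
```

Because each TH24comp gate covers two dual-rail signals, a 16-bit word leaves the first stage with 8 outputs instead of 16, which is the halving described in the text.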


















































Figure 6: Optimized 16-bit Completion Detection Tree 
Originally, the MTNCL registers and completion detection units in the CNN system were 
resettable. Each sequential design requires one input register and three state registers. All the 
registers except one of the three state registers reset to NULL. Each reset-to-NULL register has a 
completion component that resets to ‘1’, so while reset is asserted, all reset-to-NULL registers are 
slept by their corresponding completion component. In this case, the sleep signal can replace the 
reset signal. This optimization involved replacing all reset-to-NULL registers with non-resettable 
registers. Because the non-resettable gates have a smaller layout area than the resettable gates they 
replaced, this optimization reduced the area of the design. Leakage power may have been reduced as well.  
In addition to these central optimizations, a few minor optimizations were performed. 
Originally, the read and write control logic had components preventing their addresses from 
reaching the asynchronous memory until the convolution units were ready for DATA or until the 
pooling unit result was DATA, respectively. Since the memory completes its operation when the 
read/write signal transitions to DATA, these unnecessary components were removed. Lastly, the 
gates used in the MTNCL 2-to-1 multiplexer were replaced with fewer, more complex gates. 
4 Results and Analysis 
4.1 Setup 
 After implementing the optimizations, measurements concerning active energy, leakage 
power, area, and speed were collected for the reliable and optimized designs for comparison. The 
original CNN circuit was taped out in the GLOBALFOUNDRIES 8RF 130nm bulk CMOS process. 
To allow fair comparison to the original memory control logic before the reliability modifications or 
optimizations, the same cell libraries were used for obtaining the active energy, leakage power, and 
area data. All cells utilized LVT and RVT transistors. All the non-buffer cells were “A” sized to 
meet a 130ps rise/fall time on a 5 femtofarad (fF) load capacitance. Buffers B, C, D, and E were 
sized for the same rise/fall time for the following load capacitances, respectively: 10fF, 20fF, 40fF, 
and 80fF.  
The original CNN system required I/O logic to interface with the single-rail asynchronous 
memory and shift registers to receive inputs due to I/O pad limitation. In this thesis research, no I/O 




To obtain active energy and leakage power, the CNN system was broken into its constituent 
pieces. This included the remaining I/O logic (single-rail to dual-rail converters, sixteen – 8-bit 2-
to-1 MUXes, and two Boolean components responsible for checking whether the final outputs were 
DATA), the convolution units, the accumulators and their control logic, bias adders, ReLU units, 
average pooling unit, memory control logic, MACs and control, TH22_enable, and pipelining 
registers. The following five steps were followed for each of the designs: 1) obtain a flattened gate-level 
netlist using Synopsys Design Compiler; 2) reformat the netlist using Python and Java; 3) 
buffer the netlist using a Java program; 4) reformat the buffered netlist using Java; and 5) 
import the netlist into Cadence Virtuoso. The imported designs required individual Verilog-AMS 
testbenches for the active energy measurements.  
The sequential designs received inputs based on their request and acknowledge signals. The 
combinational designs were given inputs based on a fixed time interval. The active energy of each 
design was measured over 10 DATA waves and 10 NULL waves. Given the dual-rail logic of 
MTNCL, the active energy is far less dependent on the data pattern than in single-rail Boolean logic. 
However, each design received random inputs to obtain the best average data possible. Integrating 
the current through the VDD pin over the 10 DATA/NULL waves provided the charge. Dividing 
the measurement by 10 (number of DATA/NULL waves) and multiplying by the 1.2V supply 
voltage provided the average active energy per operation. The single-rail to dual-rail converters and 
registers were instantiated at various widths throughout the design. The single- to dual-rail 
converters were measured with a width of 25 and the registers with a width of 32. Dividing the 
resulting measurement by the width approximated the active energy per width per operation. The 
total number of operations was calculated by multiplying the width of each instance by the 
instance’s number of operations and summing the results. Multiplying the total number of 
operations by the active energy per width per operation provided the active energy for one input 
image.  
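The active energy procedure above reduces to a few lines of arithmetic, sketched here in Python. The function names are hypothetical, and the example charge is made up purely to exercise the formula. The per-operation value for the convolution unit is taken from Table 1, and its unit is assumed to be femtojoules because that is what makes the product agree with the 846.3 nJ total.

```python
VDD = 1.2      # supply voltage in volts, from the text
N_WAVES = 10   # DATA/NULL wave pairs per measurement, from the text


def energy_per_operation(charge_coulombs, n_waves=N_WAVES, vdd=VDD):
    """Average active energy per operation in joules: E = Q * VDD / N."""
    return charge_coulombs * vdd / n_waves


def energy_per_width(energy_per_op, measured_width):
    """Normalize a multi-bit measurement to one unit of width."""
    return energy_per_op / measured_width


def total_energy(energy_per_op, total_operations):
    """Active energy for one input image."""
    return energy_per_op * total_operations


# Hypothetical integrated charge of 1 pC over the ten DATA/NULL waves.
example_energy = energy_per_operation(1e-12)

# Convolution unit row of Table 1: 28210.0 (assumed fJ) per operation,
# 30,000 operations per image.
conv_total_joules = total_energy(28210.0e-15, 30_000)

if __name__ == "__main__":
    print(example_energy)
    print(conv_total_joules)
```

For the converters and registers, the measured value would first be divided by the measured width (25 or 32) with `energy_per_width` before being multiplied by the width-weighted operation count.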
The leakage power was obtained using the same imported designs in Cadence Virtuoso. 
Instead of using a Verilog-AMS testbench, the data inputs were tied to ground and the sleep signal 
to VDD. Measuring the current through the VDD pin over 1ms, multiplying the value by 1.2V and 
by 1000 (to obtain the leakage energy over one second) provided the leakage power. Similar to 
active energy, the measured leakage power was divided by 25 for the single- to dual-rail converters 
and 32 for the register to get the leakage power per width. These values were multiplied by the sum 
of the instance widths to get the total leakage power for the design in question. Summing the 
leakage power for all designs provided the leakage power for the entire CNN circuit. 
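A minimal Python sketch of the leakage calculation follows. The conversion function reflects one reading of the procedure above, in which the charge drawn over the 1 ms window is multiplied by the supply voltage and scaled to one second. The function name and the example charge are assumptions. The per-instance values are copied from Table 3 so that the summation step can be checked against the reported total.

```python
VDD = 1.2  # supply voltage in volts, from the text


def leakage_power_watts(charge_coulombs, window_seconds=1e-3, vdd=VDD):
    """Energy drawn over the measurement window divided by the window length."""
    return charge_coulombs * vdd / window_seconds


# (component, instances, leakage per instance in nW), copied from Table 3.
TABLE_3 = [
    ("Single- to Dual-Rail Converter", 65, 6.77),
    ("8-bit 2-to-1 MUX", 16, 5.24),
    ("Output Data Checker", 2, 14.45),
    ("Convolution Unit", 4, 1332.00),
    ("Accumulators and Control", 1, 1714.00),
    ("Bias Adder", 4, 7.44),
    ("ReLU Unit", 4, 7.00),
    ("Pooling Unit", 1, 23.30),
    ("Memory Control (Read and Write)", 1, 3734.00),
    ("MACs and Control", 1, 1164.00),
    ("FC Bias Adders", 2, 14.89),
    ("TH22_enable", 1, 2.49),
    ("Registers", 184, 1.86),
]

# Multiply each per-instance value by its count and sum over all components.
total_nw = sum(count * per_instance for _, count, per_instance in TABLE_3)

# Hypothetical charge of 1 pC drawn during the 1 ms window.
example_power = leakage_power_watts(1e-12)

if __name__ == "__main__":
    print(round(total_nw, 2))
    print(example_power)
```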
 The speed data was collected using Mentor ModelSim. Every component in the CNN system 
was designed structurally, so the delays for the fundamental NCL, MTNCL, and Boolean gates used 
along with the delays for the behavioral memory model were required. The average delays for most 
of the gates were measured earlier in the CNN project. The remaining gate delays were measured 
using Cadence Virtuoso. The delays in the behavioral memory model were those provided by the 
memory design team. The speed of the reliable and optimized designs was measured using the time 
the designs required to produce outputs. 
 Obtaining the area data began by measuring the area of the layouts for all gates used in the 
CNN circuit. Next, the entire design was flattened using Design Compiler. A program counted the 
instances of each gate, multiplied the count by the area of the gate, and summed the total areas for 






4.2 Results 
 Table 1 shows the active energy data for the reliable design. The values in the Total Active 
Energy column are those required to process an entire image. It is no surprise that the convolution 
units required the greatest active energy at 846 nJ and 80% of the total. The convolution units likely 
required more gates than any other component. Furthermore, excluding the simple components, the 
convolution units had the largest number of operations. The memory control logic for read 
operations had the next largest active energy at 72.4 nJ and 6.9% of the total active energy 
consumption. The write operations required about a fifth of the energy of the read operations. The 
write operations used less circuitry than the read operations, and there were over three times more 
read operations than write operations. While the MACs and their control had the highest active 
energy per operation, the total required active energy was 0.6% of the total due to the 180 





























Component                           Instances  Operations  Energy/Op. (fJ)  Total Active Energy (nJ)
Single- to Dual-Rail Converter             65     206,716             53.6                      11.1
8-bit 2-to-1 MUX                           16     120,000             98.1                      11.8
Output Data Checker                         2           2            123.7                       0.0
Convolution Unit                            4      30,000          28210.0                     846.3
Accumulators and Control                    1       7,500           8681.0                      65.1
Bias Adder                                  4       8,160            206.3                       1.7
ReLU Unit                                   4       8,160             83.1                       0.7
Pooling Unit                                1       2,040            669.7                       1.4
Memory Control (Read)                       1       6,375          11363.0                      72.4
Memory Control (Write)                      1       1,860           8069.0                      15.0
MACs and Control                            1         180          36280.0                       6.5
FC Bias Adders                              2           2            412.6                       0.0
TH22_enable                                 1       2,040             96.9                       0.2
Registers                                 184     724,800             26.5                      19.2
Total                                                                                         1051.4
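Several of the operation counts in Table 1 can be reproduced from the layer geometry described in Section 2.3. The Python sketch below shows one such reconstruction. The interpretation behind each formula (for example, that only the second and third layers read their inputs through the memory control logic) is inferred from the numbers and is not stated in the text.

```python
# (output positions per map, output feature maps, input feature maps)
LAYERS = [
    (15 * 15, 5, 1),    # first convolution/pooling layer
    (7 * 7, 15, 5),     # second layer
    (3 * 3, 20, 15),    # third layer
]

# One 4x4 block per output position, output map, and input map.
blocks = sum(p * g * m for p, g, m in LAYERS)

conv_ops = 4 * blocks                                   # four convolution units
accumulator_ops = blocks
pooling_ops = sum(p * g for p, g, _ in LAYERS)          # one pooled value per output
bias_ops = 4 * pooling_ops                              # four bias adders
write_ops = sum(p * g for p, g, _ in LAYERS[:2])        # third layer is not written
read_ops = sum(p * g * m for p, g, m in LAYERS[1:])     # assumed: layers 2 and 3 only
mac_ops = 20 * 3 * 3                                    # fully connected inputs

if __name__ == "__main__":
    print(conv_ops, accumulator_ops, pooling_ops, bias_ops)
    print(write_ops, read_ops, mac_ops)
```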
 
 Table 2 provides the same data for the optimized design. The values affected by the 
optimizations are highlighted. The optimized convolution units required a total of 768 nJ, a 9% 
reduction. The total energy for the accumulators and their control logic improved by 17%. The 
optimized memory control logic required a total of 80.7 nJ, a 7.7% improvement. All of the 
improved designs except the convolution units were aided by the optimization to the reset-to-NULL 
registers. While the registers in the reliable design required 19.2 nJ, the optimized registers required 
11.9 nJ, a 38% improvement. Overall, the optimized design required 947.6 nJ to process an entire 




















Component                           Instances  Operations  Energy/Op. (fJ)  Total Active Energy (nJ)
Single- to Dual-Rail Converter             65     206,716             53.6                      11.1
8-bit 2-to-1 MUX                           16     120,000             98.1                      11.8
Output Data Checker                         2           2            123.7                       0.0
Convolution Unit                            4      30,000          25610.0                     768.3
Accumulators and Control                    1       7,500           7197.0                      54.0
Bias Adder                                  4       8,160            206.3                       1.7
ReLU Unit                                   4       8,160             83.1                       0.7
Pooling Unit                                1       2,040            669.7                       1.4
Memory Control (Read)                       1       6,375          10574.0                      67.4
Memory Control (Write)                      1       1,860           7173.0                      13.3
MACs and Control                            1         180          32860.0                       5.9
FC Bias Adders                              2           2            412.6                       0.0
TH22_enable                                 1       2,040             96.9                       0.2
Registers                                 172     634,800             18.7                      11.9
Total                                                                                          947.6
 
 Table 3 details the leakage power data for the reliable design. As expected, the convolution 
units exhibited the highest leakage power. The convolution units accounted for 41% of the total 
CNN circuit leakage power at 5.33 µW. The component with the next highest leakage power was 
the memory control logic with 3.73 µW and 29% of the total. During the original CNN project, 
active energy and leakage power data was collected for the original, delay-sensitive design. The 
leakage power of the memory control logic was 204 µW. Unfortunately, the leakage power of the 
memory control logic overshadowed the overall leakage power of the design. At the time, the total 
leakage power for the entire CNN circuit was 213 µW, so the memory control logic accounted for 
almost 96% of the total. The 204 µW was disproportionately large given the relative size of the 
memory control logic. As addressed earlier, the sleep signals in the original, delay-sensitive 
memory control logic were connected unconventionally. Parts of the memory control logic might 
have been left awake when they could have been slept, and this probably contributed to the 
disproportionately large leakage power. As seen in Table 3, the delay-sensitivity modifications 
reduced the leakage power of the memory control logic to 3.73 µW, a 98% reduction from the original 
design. In total, the reliable CNN circuit consumed 12.9 µW of leakage power. 





Component                           Instances  Per Instance (nW)  Total Leakage Power (nW)
Single- to Dual-Rail Converter             65               6.77                    440.05
8-bit 2-to-1 MUX                           16               5.24                     83.84
Output Data Checker                         2              14.45                     28.90
Convolution Unit                            4            1332.00                   5328.00
Accumulators and Control                    1            1714.00                   1714.00
Bias Adder                                  4               7.44                     29.76
ReLU Unit                                   4               7.00                     28.00
Pooling Unit                                1              23.30                     23.30
Memory Control (Read and Write)             1            3734.00                   3734.00
MACs and Control                            1            1164.00                   1164.00
FC Bias Adders                              2              14.89                     29.78
TH22_enable                                 1               2.49                      2.49
Registers                                 184               1.86                    342.24
Total                                                                             12948.36
 
 Table 4 contains the leakage power for the optimized design. Removing the logic from the 
convolution unit responsible for generating the upper three bits of the result reduced the leakage 
power by 9% to 4.84 µW. Comparing the components individually, the accumulators experienced 
the greatest reduction in leakage power from 1.71 µW to 1.01 µW, a 41% reduction. The sole 
optimization applying to the accumulators and their control logic was the replacement of reset-to-NULL 
registers with non-resettable registers. Comparing the Registers rows of Table 3 and Table 4, the total 
register leakage power fell by 17% (342.24 nW to 283.80 nW), while the per-instance value fell by about 
11% (1.86 nW to 1.65 nW), so the register replacement alone does not account for the 41% reduction. 
By comparing the buffered netlists used for the reliable and optimized simulations and the input 
capacitance values used to buffer the netlists, it is apparent that the optimized design required 
significantly less buffering than the reliable design. Around 75% of the resettable TH12nm_a gates 
were replaced with non-resettable TH12m_a gates since two of the three state registers and the input 
register were not reset in the optimized design. The input capacitance of the sleep pin on the 
TH12m_a gates used in the non-resettable registers was significantly lower than the input 
capacitance of the same pin on the TH12nm_a gates used in the resettable registers. Overall, the 
optimizations reduced the leakage power of the entire CNN circuit by about 14% from 12.9 µW to 
11.1 µW. 
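The percentage reductions quoted in this discussion can be recomputed from the values in Table 3 and Table 4. The Python sketch below does so with a hypothetical helper, `reduction_percent`, and lists the register figures both per instance and in total, since the two differ.

```python
def reduction_percent(before, after):
    """Relative reduction from before to after, in percent."""
    return 100 * (before - after) / before


# (reliable, optimized) leakage power in nW, copied from Table 3 and Table 4.
LEAKAGE_NW = {
    "convolution units": (5328.00, 4840.00),
    "accumulators and control": (1714.00, 1006.00),
    "registers, per instance": (1.86, 1.65),
    "registers, total": (342.24, 283.80),
    "entire CNN circuit": (12948.36, 11143.92),
}

reductions = {name: reduction_percent(before, after)
              for name, (before, after) in LEAKAGE_NW.items()}

if __name__ == "__main__":
    for name, value in reductions.items():
        print(f"{name}: {value:.1f}%")
```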






Table 4: Leakage Power for Optimized Design
                                           Leakage Power  Total Leakage
Component                       Count  Per Instance (nW)     Power (nW)
Single- to Dual-Rail Converter     65               6.77         440.05
8-bit 2-to-1 MUX                   16               5.24          83.84
Output Data Checker                 2              14.45          28.90
Convolution Unit                    4            1210.00        4840.00
Accumulators and Control            1            1006.00        1006.00
Bias Adder                          4               7.44          29.76
ReLU Unit                           4               7.00          28.00
Pooling Unit                        1              23.30          23.30
Memory Control (Read and Write)     1            3297.00        3297.00
MACs and Control                    1            1051.00        1051.00
FC Bias Adders                      2              14.89          29.78
TH22_enable                         1               2.49           2.49
Registers                         172               1.65         283.80
Total                                                          11143.92
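
The percentages quoted for Tables 3 and 4 can be re-derived from the tabulated values. The Python sketch below is only an illustrative cross-check, not part of the thesis design flow. The names `RELIABLE`, `OPTIMIZED`, `total_leakage_nw`, and `percent_reduction` are assumptions introduced here, and each entry copies the instance count and per-instance leakage power (nW) from the tables.

```python
# Illustrative cross-check of Tables 3 and 4; all names here are hypothetical.
# Values are copied from the tables as (instance count, leakage per instance in nW).
RELIABLE = {
    "Single- to Dual-Rail Converter": (65, 6.77),
    "8-bit 2-to-1 MUX": (16, 5.24),
    "Output Data Checker": (2, 14.45),
    "Convolution Unit": (4, 1332.00),
    "Accumulators and Control": (1, 1714.00),
    "Bias Adder": (4, 7.44),
    "ReLU Unit": (4, 7.00),
    "Pooling Unit": (1, 23.30),
    "Memory Control (Read and Write)": (1, 3734.00),
    "MACs and Control": (1, 1164.00),
    "FC Bias Adders": (2, 14.89),
    "TH22_enable": (1, 2.49),
    "Registers": (184, 1.86),
}

# Only the rows that differ in Table 4 are overridden.
OPTIMIZED = dict(RELIABLE)
OPTIMIZED.update({
    "Convolution Unit": (4, 1210.00),
    "Accumulators and Control": (1, 1006.00),
    "Memory Control (Read and Write)": (1, 3297.00),
    "MACs and Control": (1, 1051.00),
    "Registers": (172, 1.65),
})


def total_leakage_nw(design):
    """Sum count * per-instance leakage over all components (result in nW)."""
    return sum(count * per_instance for count, per_instance in design.values())


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    before = total_leakage_nw(RELIABLE)
    after = total_leakage_nw(OPTIMIZED)
    print(f"reliable total:  {before / 1000:.1f} uW")
    print(f"optimized total: {after / 1000:.1f} uW")
    print(f"overall reduction: {percent_reduction(before, after):.1f}%")
    for name in ("Convolution Unit", "Accumulators and Control"):
        b = RELIABLE[name][1]
        a = OPTIMIZED[name][1]
        print(f"{name}: {percent_reduction(b, a):.1f}% reduction per instance")
```

Under these assumptions, the summed totals agree with the table totals (about 12.9 µW and 11.1 µW), and the computed reductions round to the 14%, 9%, and 41% figures quoted in the text.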
 
 Table 5 comprises the area data for the reliable and optimized designs. The rows with the 
area of the cells affected through the optimizations are highlighted. The reliable design had 2,604 
TH12m_a gates, and the optimized design had 3,954. The reliable design had 1,651 TH12nm_a gates, and 
the optimized design had 169. TH12m_a is the only non-buffer cell whose count increased after the 
optimizations because the optimization to the registers replaced TH12nm_a with TH12m_a. The 
two most frequent cells in both designs were TH23m_a and TH34w2m_a. The optimizations 
reduced the count of both cells by about 9%. In general, the changes reduced the amount of logic in 
the CNN system. Also, the changes to the registers reduced the capacitive load of the reset net. This 
translated to less buffering. In Table 5, the count of each type of buffer aside from BUFFER_B 
was significantly reduced. BUFFER_B, the smallest of the buffers, was the only buffer with more 
instances in the optimized design than in the reliable design. The total area devoted to buffering in 
the reliable design was 8,625 square micrometers, and the total area devoted to buffering in the 
optimized design was 7,530 square micrometers, a 13% improvement. Overall, the area of the reliable 
design was 494,352 square micrometers, and the area of the optimized design was 453,621 square 
micrometers, an 8.24% reduction. 














Table 5: Area Data for Reliable and Optimized Designs 
Cell Name      Cell Area         Count in          Total Area in          Count in           Total Area in
                   (µm²)  Original Design  Original Design (µm²)  Optimized Design  Optimized Design (µm²)
MTNCL
TH12m_a            13.44             2604               34997.76              3954                53141.76
TH12nm_a           17.28             1651               28529.28               169                 2920.32
TH12dm_a           21.12              167                3527.04               167                 3527.04
TH13m_a            17.28               23                 397.44                23                  397.44
TH14m_a            19.20               30                 576.00                30                  576.00
TH22m_a            13.44             2791               37511.04              2639                35468.16
TH23m_a            21.12             7022              148304.64              6398               135125.76
TH23w2m_a          19.20                9                 172.80                 9                  172.80
TH24w2m_a          28.80                1                  28.80                 1                   28.80
TH24w22m_a         19.20                1                  19.20                 1                   19.20
TH24compm_a        23.04              447               10298.88               441                10160.64
TH33m_a            17.28               65                1123.20                65                 1123.20
TH33w2m_a          17.28                7                 120.96                 7                  120.96
TH34w2m_a          26.88             7018              188643.84              6394               171870.72
TH34w3m_a          19.20                3                  57.60                 3                   57.60
TH34w32m_a         19.20                1                  19.20                 1                   19.20
TH44m_a            19.20              126                2419.20               124                 2380.80
TH44w2m_a          28.80                1                  28.80                 1                   28.80
TH44w3m_a          19.20                2                  38.40                 2                   38.40
TH54w22m_a         19.20                2                  38.40                 2                   38.40
TH54w32m_a         19.20                7                 134.40                 7                  134.40
THxor0m_a          21.12              628               13263.36               628                13263.36
NCL
TH12_a             11.52                3                  34.56                 3                   34.56
TH22n_a            32.64               89                2904.96                89                 2904.96
TH22d_a            24.96               21                 524.16                21                  524.16
TH22_a             21.12                7                 147.84                 7                  147.84
TH24comp_a         28.80               12                 345.60                12                  345.60
TH33n_a            34.56               28                 967.68                28                  967.68
TH33_a             24.96                6                 149.76                 6                  149.76
TH44_a             30.72                6                 184.32                 6                  184.32
Boolean
BUFFER_B            7.68              185                1420.80               208                 1597.44
BUFFER_C            7.68              239                1835.52               210                 1612.80
BUFFER_D            9.60               52                 499.20                42                  403.20
BUFFER_E           15.36              317                4869.12               255                 3916.80
INVERT_A            5.76              351                2021.76               351                 2021.76
invm_a              7.68                2                  15.36                 2                   15.36
AND2_A             11.52              512                5898.24               512                 5898.24
OR2_A               9.60               33                 316.80                33                  316.80
MUX21_A            15.36              128                1966.08               128                 1966.08
Totals                              24597              494352.00             22979               453621.12
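
The buffer-area and total-area figures discussed for Table 5 follow from multiplying each cell's area by its count. The Python sketch below redoes that arithmetic for the four buffer cells and the table totals. It is an illustrative check rather than part of the original flow, and `BUFFERS`, `buffer_area`, and `percent_reduction` are hypothetical names introduced here.

```python
# Illustrative check of the Table 5 area figures; all names are hypothetical.
# Each entry: (cell area in um^2, count in reliable design, count in optimized design).
BUFFERS = {
    "BUFFER_B": (7.68, 185, 208),
    "BUFFER_C": (7.68, 239, 210),
    "BUFFER_D": (9.60, 52, 42),
    "BUFFER_E": (15.36, 317, 255),
}

# Totals row of Table 5, in um^2.
TOTAL_AREA_RELIABLE = 494352.00
TOTAL_AREA_OPTIMIZED = 453621.12


def buffer_area(design_index):
    """Total buffer area in um^2; design_index 0 = reliable, 1 = optimized."""
    return sum(entry[0] * entry[1 + design_index] for entry in BUFFERS.values())


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    reliable = buffer_area(0)
    optimized = buffer_area(1)
    print(f"buffer area, reliable:  {reliable:.2f} um^2")
    print(f"buffer area, optimized: {optimized:.2f} um^2")
    print(f"buffer area reduction:  {percent_reduction(reliable, optimized):.1f}%")
    saved = TOTAL_AREA_RELIABLE - TOTAL_AREA_OPTIMIZED
    print(f"total area saved: {saved:.2f} um^2 "
          f"({percent_reduction(TOTAL_AREA_RELIABLE, TOTAL_AREA_OPTIMIZED):.2f}%)")
```

Keeping both counts in one tuple per cell makes it easy to extend the same check to the other rows of Table 5 if needed.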
 
 Lastly, the reliable design required 238 microseconds to produce outputs, and the optimized 
design required 235 microseconds, a slight 1.43% improvement. The convolution stage had higher 
delays than the other stages, so the optimizations to the convolution units likely reduced the 
delay and contributed most to the improvement. 
5 Conclusion 
 The goal of this thesis research was to modify the existing CNN system for more reliable 
operation and to optimize the modified design to better suit IoT applications. The reliability 
modifications were a success, as the modified CNN system produced correct results in the 
ModelSim simulations. The optimizations were also successful, as all four of the 
measured parameters improved. The total active energy required for the optimized CNN circuit to 
process one image improved by 9.9%. The total leakage power was reduced from 12.9 µW to 
11.1 µW, a 14% improvement. The optimizations saved 40,731 square micrometers in area, an 
8.24% reduction. While less significant, the optimizations improved the speed by 1.43%. After 












