Non-Volatile Memory Adaptation in Asynchronous Microcontroller for Low Leakage Power and Fast Turn-on Time by Habimana, Jean Pierre Thierry
University of Arkansas, Fayetteville 
ScholarWorks@UARK 
Theses and Dissertations 
5-2021 
Citation 
Habimana, J. T. (2021). Non-Volatile Memory Adaptation in Asynchronous Microcontroller for Low 
Leakage Power and Fast Turn-on Time. Theses and Dissertations Retrieved from 
https://scholarworks.uark.edu/etd/3998 
 
Non-Volatile Memory Adaptation in Asynchronous Microcontroller for Low Leakage Power and 
Fast Turn-on Time 
 
 
A dissertation submitted in partial fulfillment  
of the requirements for the degree of  





Jean Pierre Thierry Habimana 
University of Arkansas 
Bachelor of Science in Computer Engineering, 2013 
University of Arkansas 




















James P. Parkerson, Ph.D.    Dale Thompson, Ph.D. 









This dissertation presents an MSP430 microcontroller implementation using Multi-
Threshold NULL Convention Logic (MTNCL) methodology combined with an asynchronous 
non-volatile magnetic random-access-memory (RAM) to achieve low leakage power and fast 
turn-on.  This asynchronous non-volatile RAM is designed with a Spin-Transfer Torque (STT) 
memory device model and CMOS transistors in a 65 nm technology. A self-timed Quasi-Delay-
Insensitive 1 KB STT RAM is designed with an MTNCL interface and handshaking protocol. A 
replica methodology is implemented to handle write operation completion detection for long 
state-switching delays of the STT memory device. The MTNCL MSP430 core is integrated with 
the STT RAM to create a fully asynchronous non-volatile microcontroller. 
The MSP430 architecture, the MTNCL design methodology, and the STT RAM’s low 
power property, along with STT RAM’s non-volatility yield multiple advantages in the MTNCL-
STT RAM system for a variety of applications. For comparison, a baseline system with the same 
MTNCL core combined with an asynchronous CMOS RAM is designed and tested. Schematic 
simulation results demonstrate that the MTNCL-CMOS RAM system presents advantages in 
execution time and active energy over the MTNCL-STT RAM system; however, the MTNCL-
STT RAM system presents unmatched advantages such as negligible leakage power, zero 

















©2021 by Jean Pierre Thierry Habimana 











To my advisor Dr. Jia Di, thank you for your guidance and mentorship throughout my 
whole academic career. Thank you for your patience and for always being available to help. To 
my mother and father, thank you for your love and sacrifices. To my brothers and sisters, thank 
you for always loving, supporting, and caring for me. To my fellow colleagues, thank you for 
your support and comradery throughout the years of studies and research.  
  
 
TABLE OF CONTENTS 
1 INTRODUCTION
1.1 Objective
1.2 Problem Statement
1.3 Design Challenges
1.4 Organization
2 BACKGROUND
2.1 MTNCL Methodology
2.2 MSP430 Microcontroller Architecture
2.3 STT Device and STT RAMs
3 ADVANCED MTNCL REGISTER FILE DESIGN: FROM A BOOLEAN ARRAY TO DUAL-RAIL OUTPUT
3.1 Motivation
3.2 Register File Array Design
3.3 Register File Output Path
3.4 Top-Level Register File Design
4 ASYNCHRONOUS SRAM DESIGN: CMOS RAM AND NON-VOLATILE STT RAM SIMILARITIES AND DIFFERENCES
4.1 CMOS and STT RAM Bit-cell Designs
4.2 Asynchronous Timing and Column Logic
4.3 Asynchronous SRAM MTNCL Interface
5 MTNCL MSP430 IMPLEMENTATION
5.1 Reduced MSP430 Instruction ISA and Addressing Modes
5.2 MSP430 MTNCL Architecture and System Data Flow
5.3 Asynchronous SRAM MTNCL Integration
6 SIMULATIONS, RESULTS, AND ANALYSIS
6.1 MTNCL Core Simulation
6.2 CMOS RAM Simulation
6.3 STT RAM Simulation
6.4 MTNCL Core and CMOS RAM System Simulation
6.5 MTNCL Core and STT RAM System Simulation
6.6 MTNCL-STT RAM System Unique Advantages
7 CONCLUSION






LIST OF FIGURES 
Figure 1: MTNCL Gate Structure.
Figure 2: MTNCL Early Completion Logic.
Figure 3. MTNCL Three Stage Pipeline.
Figure 4: STT Device State Representation.
Figure 5: MTNCL Register File Bit-cell Gate-Level Logic.
Figure 6: Complete Register File Array Top-level View.
Figure 7: Simplified Register File Design View for a Synchronous Execution Unit.
Figure 8: MTNCL Register File Output Path.
Figure 9: Completion Detection Logic for Self-timed Data Path Inputs.
Figure 10: Gate-level CMOS RAM Bit-cell Design.
Figure 11: Transistor-level CMOS RAM Bit-cell Design
Figure 12: State0 and State1 Required Switching Voltages and Times.
Figure 13: STT RAM Bit-cell Design.
Figure 14: SRAM Bit Line and Column Logic Structure.
Figure 15: CMOS RAM Column Write Circuitry.
Figure 16: Asynchronous CMOS RAM Read and Write Completion Logic.
Figure 17: STT RAM Column Write and Read Bit-line Current Flow Controlling Logic.
Figure 18: STT RAM Read and Write Bit-line Current Flow.
Figure 19: STT RAM Column Read Circuit: A Side-by-Side Comparison Between the Active and Reference Paths.
Figure 20: Block-Level Diagram for The STT RAM Replica Mechanism.
Figure 21: Asynchronous 1KB STT RAM with MTNCL Interface
Figure 22: MSP430 MTNCL Adapted Single Pipeline Stage Data Flow.
Figure 23: MSP430 MTNCL Asynchronous RAM Read Access Logic
Figure 24: MTNCL core and Program Counter Initiation Set-up.
Figure 25: MTNCL Execution Time, Energy and Power Comparison Across Various Instructions.
Figure 26: CMOS and STT RAM Read and Write Time, Energy, Power, and Leakage Comparison.
Figure 27: MTNCL-STT RAM and MTNCL-CMOS RAM Execution Time and Energy Comparison.
Figure 28: MTNCL core to Asynchronous SRAM Interface-Memory Power Failure Interrupt
Figure 29: MTNCL-STT RAM System Memory Power Failure Simulation Results.
Figure 30: MTNCL-STT RAM System Resuming from Memory Power Failure Interruption.
Figure 31: Core and Memory Energy Dissipation Comparison for Different Memory Interruption




The focus of this dissertation is to accomplish two primary objectives: (1) develop a low-
leakage microcontroller by integrating the MSP430 architecture, the Multi-Threshold NULL 
Convention Logic (MTNCL) design methodology, and a non-volatile Spin-Transfer Torque 
(STT) random-access-memory (RAM) and (2) explore advantages yielded by the MTNCL and 
STT RAM combination. 
1.2 Problem Statement 
Process technology advancements allow complementary metal-oxide-semiconductor (CMOS)
devices to scale continually to smaller sizes, making static power a more dominant factor in
the power consumption of integrated circuit (IC) chips, especially digital ICs, which
incorporate an enormous number of transistors. As device sizes shrink, supply voltages and
therefore dynamic power are reduced, while smaller geometries increase leakage current.
Thus, leakage may comprise up to 50% of the total system power for ICs in 65 nm process
nodes and below [1].
Moreover, as the demand for high-performance systems increases, the die area occupied
by static random-access-memory (SRAM) designs such as caches and look-up tables
expands accordingly. Hence, SRAM designs have become the predominant source of leakage in
high-performance systems such as microprocessors [2]. For example, last-level caches contribute
63% of the total system leakage, which is around 30% of the total system power, for Penryn,
Intel's 45 nm next-generation Core processor [2]. To address this issue, many leakage reduction
techniques have been proposed. Most low-nanoscale process technologies include devices with a
high dielectric constant to reduce the gate-oxide leakage component [3], and sleep transistors are
used in SRAM cells to power-gate the cell transistors during idle states [4][2]. Although these
techniques achieve certain degrees of success in leakage reduction, they come with performance
costs, and sub-threshold leakage is predicted to keep growing as transistor sizes shrink [5].
Spin-Transfer Torque (STT) RAM technology has emerged as a potential alternative to 
CMOS RAMs and a solution to the leakage problem [6][7][8]. STT RAMs are built using STT 
devices also referred to as spintronics devices or STT MTJ (magnetic tunnel junction) devices. 
STT devices are composed of two magnetic layers whose relative magnetic field orientations 
constitute a state variable that can represent data. STT RAMs present various advantages over 
CMOS RAMs including zero-leakage power, non-volatility, and practically unlimited endurance 
[9]. The zero-leakage advantage is of critical importance to this dissertation. A zero-leakage
STT RAM combined with a low-leakage MTNCL core helps achieve the first objective
of this dissertation: developing a low-leakage microcontroller. Meanwhile, the non-volatility 
advantage is used to create special applications that would otherwise be impossible. For 
example, it is demonstrated that the MSP430 microcontroller implemented in MTNCL and using 
an STT RAM for data memory can handle a memory power failure with negligible power and 
zero delay overheads. Furthermore, this system provides a standby mode with zero-delay wakeup 
time using only 1.51% of the total system power. By combining these zero-leakage power 
memory and ultra-low-power design methodologies, demonstrable advantages are attained. 
1.3 Design Challenges 
Adoption of STT memory devices for asynchronous RAM is the central challenge of this 
dissertation work. Synchronous design methodologies continue to dominate the IC industry and 
academic research, and as a result, there is a deficiency of asynchronous non-volatile RAM 
designs despite recent interest in magnetic RAMs. Fundamental differences between STT 
memory devices and CMOS RAM read and write processes make read and write completion 
detection for asynchronous STT RAM an important challenge. First, bit-line comparison 
techniques used in asynchronous CMOS RAMs [10][11] are not applicable for asynchronous 
STT RAMs due to STT memory device read and write mechanisms. Second, the STT memory 
device model used in this dissertation work requires an extended write time compared to 65 nm 
CMOS logic propagation delay. As a result, conventional delay elements such as inverter chains 
are not a good fit for delay estimation concerning STT RAM read and write completion 
detection.  
A replica methodology introduced in [12] is adopted to resolve the STT RAM read and
write completion detection issue. The principle of this methodology is that the data stored in the
replica cell, that is, the state of its STT device, is known. For STT RAM write completion
detection, the replica cell is read repeatedly while the write operation is still underway. The data
being written to the replica cell differs from the stored data, so the read returns the old data
until the replica write completes, which signals that the write operation is complete in the array
as well. For read completion detection, a replica cell storing a ‘1’ is used. The read operation
starts with the output preset to ‘0’, and completion is detected when the output changes to ‘1’.
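The write-completion principle described above can be illustrated with a small behavioral sketch. This is not the circuit from [12] or from this dissertation: the `ReplicaCell` class, the `switch_delay` parameter, and the step-based polling loop are hypothetical stand-ins for the analog switching behavior, chosen only to show how repeated reads of a cell with known contents reveal when a long write has finished.

```python
from dataclasses import dataclass


@dataclass
class ReplicaCell:
    """Toy replica cell (hypothetical): its state flips only after
    `switch_delay` consecutive time steps of write current."""
    state: int          # known stored value, 0 or 1
    switch_delay: int   # assumed switching delay, in polling steps
    _elapsed: int = 0

    def drive_write(self, value: int) -> None:
        """Apply one more time step of write current toward `value`."""
        if value == self.state:
            return
        self._elapsed += 1
        if self._elapsed >= self.switch_delay:
            self.state = value
            self._elapsed = 0

    def read(self) -> int:
        return self.state


def write_with_replica(replica: ReplicaCell, max_steps: int = 1000) -> int:
    """Write the opposite of the known replica value and poll reads until
    the read data changes; return the number of steps the write took."""
    old = replica.read()
    new = 1 - old
    for step in range(1, max_steps + 1):
        replica.drive_write(new)
        if replica.read() != old:   # old data no longer returned: write done
            return step
    raise TimeoutError("replica cell never switched")
```

In this sketch the returned step count plays the role of the write-done signal: the array write is assumed to finish no later than the replica write, which is the premise of the replica approach.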
In addition to read and write timing, the asynchronous STT RAM must be designed with 
an MTNCL interface so that it can integrate with an MTNCL core.  Input and output MTNCL 
registers and completion logic blocks are added, and the asynchronous STT RAM becomes a 
standalone element of an MTNCL pipeline. The interface design and integration are detailed in 
Section 4.3 and Section 5.3. 
1.4 Organization 
This dissertation comprises seven chapters. Chapter 2 presents background material
on the MTNCL design methodology, the MSP430 architecture, and non-volatile magnetic RAMs.
Chapter 3 outlines an advanced methodology for MTNCL register file design. Chapter 4 
illustrates the design and implementation of asynchronous CMOS and STT RAMs. Chapter 5 
describes the MSP430 MTNCL implementation and asynchronous RAM integration. Simulation 




2.1 MTNCL Methodology 
MTNCL is a low-power asynchronous design methodology derived from NULL 
Convention Logic (NCL) [13][14] and the Multi-Threshold CMOS (MTCMOS) technique [15]. 
MTNCL shares NCL’s dual-rail encoding scheme where two wires are used to represent one bit 
[13][14]. This encoding allows NCL and MTNCL signals to be in one of these three states: 
NULL, DATA0, and DATA1. DATA0 corresponds to Boolean logic zero; DATA1 to Boolean 
logic one; and the NULL state is the transition state in which the circuit waits for new data. The
NULL state occurs when both rails are de-asserted, the DATA0 state when only rail0 is asserted,
and the DATA1 state when only rail1 is asserted. Table 1 summarizes the NCL and MTNCL
dual-rail signal encoding scheme.
Table 1: NCL/MTNCL Dual-Rail Encoding Scheme. 
D0 D1 State 
0 0 NULL 
0 1 DATA1 
1 0 DATA0 
1 1 Illegal 
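
As a minimal illustration of Table 1, the following sketch converts between Boolean values and dual-rail pairs. The function names and the use of `None` to stand for NULL are illustrative choices, not part of the NCL/MTNCL methodology.

```python
def encode(bit):
    """Return the (D0, D1) rail pair for a Boolean bit; None means NULL."""
    if bit is None:
        return (0, 0)                    # NULL: both rails de-asserted
    return (0, 1) if bit else (1, 0)     # DATA1 asserts D1, DATA0 asserts D0


def decode(d0, d1):
    """Return 0, 1, or None (NULL) for a rail pair; (1, 1) is illegal."""
    if d0 and d1:
        raise ValueError("illegal dual-rail state: both rails asserted")
    if not d0 and not d1:
        return None
    return 1 if d1 else 0
```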
 
Due to the lack of a global timing signal, NCL implements a handshaking protocol where 
adjacent pipeline stages communicate through feedback signals to control data flow. An NCL 
block is in a NULL state when all signals are NULL; and it is in a DATA state when all signals 
are DATA. DATA waves are separated by NULL states. For DATA to propagate from one stage 
to the next, two conditions must be met: (1) all signals must be DATA, and (2) the subsequent 
stage must be ready to receive DATA. The same principle is followed for NULL propagation. A 
ko signal is used for feedback from one stage to its preceding stage. When ko is asserted, it is 
interpreted as a request-for-data (rfd), and when it is de-asserted, a request-for-NULL (rfn). In 
addition, a completion logic block is used to check all stage signals for DATA or NULL 
completion: a completion signal is asserted when all inputs are DATA, and de-asserted when all 
inputs are NULL. To ensure that a NULL wave propagates properly in NCL blocks, NCL gates 
implement a hysteresis property by which once an NCL gate output is asserted it remains so until 
all inputs are de-asserted [14]. However, extra logic is required to implement the hysteresis 
property, which makes NCL gates considerably bigger than their synchronous counterparts
[13][14]. 
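The hysteresis property can be sketched behaviorally as follows. The `ThresholdGate` class is a hypothetical software model, assuming the usual threshold-gate behavior in which the output is asserted once at least `threshold` inputs are asserted and, as stated above, stays asserted until all inputs are de-asserted. It says nothing about the transistor-level implementation.

```python
class ThresholdGate:
    """Behavioral model of an NCL threshold gate with hysteresis."""

    def __init__(self, threshold, n_inputs):
        if not 1 <= threshold <= n_inputs:
            raise ValueError("threshold must be between 1 and n_inputs")
        self.threshold = threshold
        self.n_inputs = n_inputs
        self.out = 0

    def update(self, inputs):
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        asserted = sum(1 for x in inputs if x)
        if asserted >= self.threshold:
            self.out = 1          # set condition met
        elif asserted == 0:
            self.out = 0          # reset only when every input is de-asserted
        # otherwise hold the previous output (hysteresis)
        return self.out
```

For a two-input gate with threshold 2, the sequence of inputs (1,1), (1,0), (0,0), (1,0) produces outputs 1, 1, 0, 0: the second value is held by hysteresis, and the last stays low because the set condition is not met.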
MTNCL adopts NCL’s data flow control with one critical difference. MTNCL implements 
a sleep mechanism where a sleep signal is used to propagate a NULL wave. MTNCL gates are 
designed with a sleep input such that the gate output is ‘0’ when the sleep signal is asserted. This 
way, an MTNCL block is put in a NULL state by asserting its sleep signal. As a result, MTNCL 
gates do not need to implement NCL’s hysteresis property to ensure input completeness with 
respect to NULL. Hence, MTNCL gates are implemented with fewer transistors, and MTNCL 
circuits are much smaller and faster than NCL counterparts [15]. Figure 1 depicts the structure of 
an MTNCL gate. 
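By contrast, a slept gate needs no memory of its previous output. The function below is a hedged behavioral sketch only, with hypothetical names; it assumes a simple threshold function for the awake gate and models the sleep input as forcing the output to ‘0’, as described above.

```python
def mtncl_gate(inputs, threshold, sleep):
    """Behavioral MTNCL gate sketch: sleep forces the output to 0;
    otherwise the output is a plain threshold function with no hysteresis."""
    if sleep:
        return 0
    asserted = sum(1 for x in inputs if x)
    return 1 if asserted >= threshold else 0
```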
To achieve leakage power reductions, MTNCL adopts the use of multi-threshold
transistors from the MTCMOS power-gating technique. In MTCMOS logic blocks, low threshold
voltage (low-Vt) transistors implement the block functionality, and high threshold
voltage (high-Vt) transistors power-gate the block during idle times [16]. Although
MTCMOS achieves significant leakage reductions, it suffers from three drawbacks. First,
power-gate transistor sizing is a tradeoff between speed and area overhead: if these transistors
are too small, the limited current slows the circuit, and if they are too big, area is wasted.
Second, storage elements lose data when power is cut off. Third, sleep signal generation is
complex and costs both performance and design time. MTNCL eliminates all three drawbacks. The
early completion mechanism provides the sleep signal, low-Vt and high-Vt transistors are
incorporated into the MTNCL gate logic, and data is propagated through registers such that it is
not lost during sleep cycles [15]. Figure 1 illustrates the placement of low-Vt and high-Vt transistors in an










Figure 1: MTNCL Gate Structure. 
  
2.1.1 MTNCL Completion Logic 
In [15] multiple techniques to manage MTNCL data flow are presented. MTNCL designs
in this dissertation follow the fixed early completion input incomplete (FECII) methodology,
where a stage’s completion is dictated by the status of both the preceding and the subsequent
stages. Early completion logic is used, as illustrated in Figure 3. Each stage’s ko signal is
connected to the preceding stage’s early completion logic block, where it is interpreted as a
request for NULL or DATA, and to the subsequent stage, where it serves as the sleep input of the
early completion logic block. When ko is ‘1’, a NULL wave is passed to the next stage, so the
next stage’s early completion logic is put to sleep; when it is ‘0’, a DATA wave is passed, and
the next stage’s early completion logic block wakes so that DATA completion can be detected.
Therefore, a stage’s ko signal is ‘0’ when all its input signals are DATA and the next stage
requests DATA, and it is ‘1’ when all its input signals are NULL and the next stage requests
NULL. The implementation of





















Figure 2: MTNCL Early Completion Logic. 
The sleepIn signal in Figure 2 is connected to the preceding stage’s ko, and ki connects
to the next stage’s ko, as further illustrated in Figure 3. For ko to be ‘1’, the following must
happen: first sleepIn rises, which sleeps the andtree and de-asserts comp_signal. If ki is ‘0’,
the TH22n gate output falls, and ko rises. For ko to be ‘0’, sleepIn falls, then INPUT becomes
DATA, and comp_signal rises. If ki is ‘1’, the TH22n gate output rises, and ko falls.
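The signal sequence described above can be captured in a short behavioral sketch. Several details here are assumptions rather than facts from Figure 2: the andtree is modeled as an AND over a per-bit OR of the two rails, the TH22n gate is modeled as a two-input hysteresis gate whose output is inverted to form ko, and the reset state is taken to be ko = ‘1’. The class and argument names are hypothetical.

```python
class EarlyCompletion:
    """Behavioral sketch of an MTNCL early completion block."""

    def __init__(self):
        self._th22 = 0   # assumed reset state, which gives ko = 1

    def update(self, sleep_in, inputs, ki):
        """inputs is a list of (rail0, rail1) pairs; returns ko."""
        if sleep_in:
            comp_signal = 0   # a slept andtree outputs 0
        else:
            # DATA is complete when every bit has one rail asserted
            comp_signal = int(all(r0 or r1 for r0, r1 in inputs))
        if comp_signal and ki:
            self._th22 = 1    # both inputs asserted: output rises
        elif not comp_signal and not ki:
            self._th22 = 0    # both inputs de-asserted: output falls
        # otherwise the gate holds its value (hysteresis)
        return 1 - self._th22  # ko is the inverted gate output
```

Stepping the model through a DATA wave followed by a NULL wave shows ko falling only after every input bit is DATA while ki is ‘1’, and rising again only after the block is slept and ki has dropped to ‘0’.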
  
2.1.2 MTNCL Data Propagation 
Figure 3 depicts an MTNCL three stage pipeline and highlights early completion logic 


























Figure 3. MTNCL Three Stage Pipeline. 
MTNCL wavefront propagation works as follows: 
• At reset (RST = ‘1’), every stage’s ko signal is ‘1’, forcing all registers and
combinational logic to output NULL;
• When DATA is presented at the input register, stage0 ko becomes ‘0’, waking the
stage1 combinational logic and early completion logic blocks while requesting a
NULL input;
• The stage1 completion logic block detects when the corresponding combinational
logic output is DATA, and ko1 (connected to ki0 and sleep2) becomes ‘0’. Then,
the stage1 register wakes and passes DATA to the stage2 combinational logic, and ki0
requests NULL from stage0;
• If the input is already NULL, stage0 ko becomes ‘1’ and stage1 receives NULL
right away, but stage1 ko remains ‘0’ until the stage2 register receives DATA;
• The pipeline content becomes: stage0 DATA, stage1 NULL, stage2 DATA, and
stage3 NULL, which is how NULL/DATA cycles alternate in a full pipeline.
The data propagation flow detailed above exemplifies MTNCL’s delay-insensitivity. In
contrast to synchronous circuits, where a clock signal dictates when data is passed from one
stage to the next, MTNCL data propagation depends only on the status of the data being
processed and the status of the subsequent stage. The early completion logic checks all stage
input signals, and a completion signal is not asserted until all signals are DATA. This ensures
that correct data is always sent regardless of variations in processing and propagation delays.
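The alternation of DATA and NULL waves in a full pipeline can be reproduced with a highly abstracted token model. This sketch ignores sleep signals, gates, and completion trees entirely. It assumes only the two propagation conditions stated earlier: a stage accepts the wave offered by its predecessor when that wave differs from what the stage currently holds, and when its successor already holds the same wave as the stage, meaning the successor has consumed the current wave and is ready for the next. The function names and the stalled-sink scenario are illustrative.

```python
def step(stages, source, sink):
    """One update of every stage based on the previous pipeline state."""
    new_stages = list(stages)
    for i, current in enumerate(stages):
        predecessor = source if i == 0 else stages[i - 1]
        successor = sink if i == len(stages) - 1 else stages[i + 1]
        # accept the next wave only if it is new and the successor is ready
        if predecessor != current and successor == current:
            new_stages[i] = predecessor
    return new_stages


def fill_against_stalled_sink(n_stages, n_steps):
    """Feed alternating DATA/NULL waves into a pipeline whose consumer
    never acknowledges, and return the resulting stage contents."""
    stages = ["NULL"] * n_stages
    source, sink = "DATA", "NULL"    # the sink stays at NULL (stalled)
    for _ in range(n_steps):
        updated = step(stages, source, sink)
        if updated[0] != stages[0]:  # stage0 consumed the offered wave
            source = "NULL" if source == "DATA" else "DATA"
        stages = updated
    return stages
```

With three stages and a stalled consumer, the model settles at DATA, NULL, DATA, matching the alternating pattern described for a full pipeline.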
2.2 MSP430 Microcontroller Architecture 
The MSP430 is a low-power microcontroller architecture developed by Texas 
Instruments, Inc. It is the basis of hundreds of microcontroller designs aimed at various
applications. The MSP430 architecture supports up to 14 peripherals, making it an important 
design for embedded systems [17]. It achieves low power consumption by integrating multiple 
clocks used for different power modes and by leveraging its rich instruction set architecture.  
2.2.1 MSP430 Instruction Set Architecture (ISA) 
The MSP430 core is a 16-bit reduced instruction set computer with 27 instructions and 24
emulated instructions. The 27 instructions fall into three categories: 7 single-operand
instructions, 8 jump instructions, and 12 double-operand instructions. Emulated
instructions add considerable value to the ISA without adding much hardware: the MSP430
architecture takes advantage of available information and easily integrated hardware to add 24
emulated instructions to its ISA. For example, the MSP430 ALU does not include an inversion
function, but an invert instruction is executed by XORing the register content with the
constant -1 (x“FFFF”).
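The inversion example can be checked with a few lines of arithmetic. The helper names below are illustrative; the only facts used are the 16-bit word size and the XOR with x“FFFF” described above.

```python
MASK16 = 0xFFFF  # 16-bit word mask; also the constant -1 in two's complement


def xor16(dst, src):
    """16-bit XOR as a double-operand ALU operation would compute it."""
    return (dst ^ src) & MASK16


def inv16(value):
    """Emulated bitwise inversion: XOR the operand with the constant -1."""
    return xor16(value, MASK16)
```

For instance, `inv16(0x00FF)` evaluates to `0xFF00`, and applying `inv16` twice returns the original word.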
2.2.2 MSP430 Addressing Modes 
In addition to its elaborate ISA, the MSP430 architecture uses 7 addressing modes for the
source operand and 4 for the destination operand to access the complete address space. Source
and destination operands can come from one or a combination of the following
components: register file, data memory, program memory, and the constant generator. Source
and destination addressing modes are encoded within the instructions: source addressing (AS)
with a 2-bit code, and destination addressing (AD) with a 1-bit code. Thus, the same instruction
can be executed differently by changing the addressing mode.
For example, executing an addition instruction (ADD R4, R5) with AS = “00” and AD =
‘0’ means that the source operand comes from the register file, and the operation result is written
to the register file. This is register direct addressing. Executing the same instruction (ADD R4,
R5) but changing AS to “10” and AD to ‘1’ means the source operand comes from the data
memory at the address specified by R5, and the operation result is written into data memory at
the address specified by R4. This is register indirect addressing. Changing AS to “11” results in
register indirect auto-increment addressing, so the source operand comes from data memory at
the address specified by R5, and R5 is incremented by two to point to the next word.
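The three source addressing cases in the example can be sketched as a small operand-fetch function. This is a simplified software model, not the MTNCL data path: the register file and data memory are plain dictionaries, the function name is hypothetical, and only the AS codes discussed above are handled.

```python
def fetch_source(as_mode, reg, regs, mem):
    """Return the source operand for AS codes "00", "10", and "11".

    regs maps register numbers to 16-bit values; mem maps word
    addresses to 16-bit values. Both are illustrative stand-ins.
    """
    if as_mode == "00":                   # register direct
        return regs[reg]
    if as_mode == "10":                   # register indirect
        return mem[regs[reg]]
    if as_mode == "11":                   # register indirect auto-increment
        value = mem[regs[reg]]
        regs[reg] = (regs[reg] + 2) & 0xFFFF   # point to the next word
        return value
    raise NotImplementedError("AS code not covered by this sketch")
```

With R5 holding x“0200”, two consecutive auto-increment fetches return the words stored at x“0200” and x“0202” and leave R5 at x“0204”.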
2.3 STT Device and STT RAMs. 
2.3.1 STT RAM and Recent Developments 
The increasing demand for ultra-low-power, high-performance systems has accelerated the
search for alternatives to existing technologies. STT devices exhibit valuable characteristics
given current challenges such as leakage power and process technology development for low
nanoscale CMOS devices [6][7][8][9]. Non-volatility, zero-leakage power, excellent integration
density, and high endurance make STT devices favorable candidates to replace traditional
CMOS memories [9]. Considering the ever-growing leakage cost of SRAM designs in
high-performance systems, STT RAM adaptation for system caches has become a common focus
in recent research. Advantages and outstanding challenges to STT RAM adoption are addressed
below.
Long write delays and high write energy are the main hindrances to STT RAM
adoption. To combat this, an energy-efficient Highly Adaptable Last Level STT-RAM cache
(HALLS) methodology is presented in [18]. This methodology trades off STT RAM
non-volatility for reduced write time and energy. STT device retention time represents how
long the device state remains unchanged. For total non-volatility, retention time is infinite.
However, for cache applications, STT RAMs are preferred mainly for their zero leakage, so
trading non-volatility for faster write times and lower energy is a benefit. Results indicated that
a second-level STT RAM cache using HALLS achieved a 60% power reduction compared to SRAM
with a latency only 1.9% higher.
In [19], a redundancy technique is presented to mitigate STT RAM write delay in a non-
volatile processor. A CMOS RAM and an STT RAM are integrated in parallel such that data is 
transferred from CMOS RAM cells directly to STT RAM cells and vice versa. The processor 
uses the CMOS RAM for read and write in active mode, and in idle mode, data is stored in the 
STT RAM while the CMOS RAM is shut down. This allows the processor to read and write at 
the CMOS RAM speed without incurring increased leakage or loss of data. 
Despite the aforementioned read and write delay mitigation techniques, STT device write
time remains the main hindrance to widespread STT RAM adoption. However, recent
developments and increasing interest are encouraging. In addition to system-level mitigation
techniques, improvements are being made at the device level as well. In [9],
techniques to reduce the energy required for state switching are described, and in [20] Toggle
Spin Transfer (TST) RAM, derived from STT RAM and integrated in 40 nm technology, reduces
the write time to 0.6 ns to achieve GHz-level write latency.
2.3.2 STT MTJ Device Model 
The STT device model used in this dissertation is established in [6].  This dissertation 
focuses on the electrical properties and how they translate to Boolean logic. Figure 4 presents 
STT device layers and how their polarization translates to two states. These states correspond to 
differing electrical resistances and voltage drops when integrated with CMOS circuitry.  The 
parallel state, State0, presents a lower electrical resistance than the anti-parallel state, State1. 
Details on read and write circuits are presented in Section 4.1.2 and Section 4.2.3. 
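As a rough illustration of how the two resistance levels translate to a Boolean value, the sketch below compares a cell resistance against a reference placed midway between the two states. The midpoint reference, the function name, and all resistance values passed to it are assumptions made for illustration; they are not taken from the device model in [6] or from the read circuit of Section 4.2.3.

```python
def sense_state(cell_ohm, r_parallel_ohm, r_antiparallel_ohm):
    """Return 0 for State0 (parallel, lower resistance) or 1 for State1
    (anti-parallel, higher resistance) using a midpoint reference."""
    if not r_parallel_ohm < r_antiparallel_ohm:
        raise ValueError("parallel resistance must be the lower one")
    reference = (r_parallel_ohm + r_antiparallel_ohm) / 2  # assumed reference
    return 1 if cell_ohm > reference else 0
```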











Figure 4: STT Device State Representation. 
  
3 ADVANCED MTNCL REGISTER FILE DESIGN: FROM A BOOLEAN ARRAY TO 
DUAL-RAIL OUTPUT   
3.1 Motivation 
The register file is one of the most complex designs to adapt to the MTNCL methodology.
The MTNCL DATA-NULL cycle mechanism conflicts with the register file’s fundamental role as
a data storage unit. An MTNCL register file design needs to receive and pass a NULL wave
without compromising stored data. Previous approaches have successfully implemented the
register file functionality in MTNCL [21], but at a heavy transistor count cost. For instance,
the register file design presented in [21] requires 68 transistors for one-bit storage. In
comparison, this is 6.8 times as many transistors as an MTNCL register cell, or twice the number
of transistors in a synchronous D flip-flop (DFF).
Fundamentally, the register file array does not need to store dual-rail signals. The dual-
rail encoding scheme serves in completion detection mechanisms. However, the register file 
array only stores complete DATA, so it does not need a completion check when it is read.
Therefore, single-rail data from a Boolean-coded register file array can be used to generate dual-
rail register file outputs. This requires an elaborate mechanism to handle propagation delays, 
register file write timing, and DATA/NULL cycle control for output signals.  
The main assumption for correct timing in changing single-rail register file array contents 
into dual-rail outputs is that a NULL wave is propagated when new data is written. This provides 
the necessary NULL wave between two DATA cycles, which emulates the regular MTNCL 
flow. When writing is complete, the register file outputs can be DATA if the next stage requests 
DATA. With a straightforward path from the register file array to the next stage, a self-timed 
Quasi-Delay-Insensitive (QDI) circuit is designed to change single-rail Boolean array outputs 
into dual-rail signals that meet MTNCL handshaking protocol requirements.  
3.2 Register File Array Design 
3.2.1 Register File Bit-cell. 
To achieve a low transistor count, the MTNCL TH12nm gate was chosen over the 
conventional DFF. The TH12nm gate costs 10 transistors against 34 for a DFF, and it meets all the
register file functionality requirements. The RST input allows the register file array to be reset 
when required by the system, and the sleep input is controlled so that the TH12nm gate only 
sleeps when the bit-cell output needs to be ‘0’. The gate-level logic of a register file bit-cell is 
shown in Figure 5. 
Figure 5: MTNCL Register File Bit-cell Gate-Level Logic. 
The bit-cell output is updated when the write enable signal (W_EN) is ‘1’. Logic one is 
written when the data input signal (DIN) and W_EN are both ‘1’. The TH12nm gate with output 
feedback keeps the output asserted once it has been asserted. Therefore, logic zero can only be 
written by asserting the sleep signal, which happens when DIN is ‘0’ and W_EN is ‘1’.   
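The write behavior described above can be summarized with a small behavioral model. The following Python sketch is only an illustration under idealized, zero-delay assumptions; the class name BitCell, the method name step, and the assumption that RST clears the cell to '0' are hypothetical and are not taken from the original design.

```python
class BitCell:
    """Behavioral sketch of the TH12nm-based register file bit-cell."""

    def __init__(self):
        self.q = 0  # stored bit (gate output, fed back to one gate input)

    def step(self, din, w_en, rst=0):
        # Sleep is asserted only when a '0' is being written.
        sleep = 1 if (w_en and not din) else 0
        if rst or sleep:
            self.q = 0  # reset (assumed to clear) or sleep forces the output low
        elif (din and w_en) or self.q:
            self.q = 1  # threshold-1 gate: set by the write, held by the feedback
        return self.q


cell = BitCell()
print(cell.step(din=1, w_en=1))  # write '1'
print(cell.step(din=0, w_en=0))  # hold: W_EN is '0', so the '1' is kept
print(cell.step(din=0, w_en=1))  # write '0' through the sleep input
```

The model makes the asymmetry explicit: a '1' is stored by the gate's own feedback, whereas a '0' can only be stored by putting the gate to sleep.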
3.2.2 Register File Word 
The MSP430 is a 16-bit word microcontroller, so a 16-bit register word was designed. All 
bit-cells in a word are written, read, or reset at the same time. Therefore, a register file word is an 
array of 16 register file bit-cells with 16-bit input and output, one-bit reset, and one-bit write 
enable inputs. 
3.2.3 Register File Array 
The MSP430 microcontroller uses an array of 16 register file words including the 
Program Counter (PC or R0) register, the Stack Pointer (SP/R1) register, the Status Register 
(SR/R2), the Constant Generator (CG/R3) register, and 12 general-purpose registers.  All these 
registers are the same in design and only differ in the way they are accessed; therefore, the 
register file array is an array of 16 register file words. A completion logic is added so that a ko 
signal can be asserted when the array data is ready for reading, and de-asserted when data is 
being written or dirty. A complete register file array design is depicted in Figure 6. 
The register file array completion logic works as follows: 
• When the system starts or when the register file is reset, RF_array_ko is ‘0’; 
• After reset, RF_array_ko rises, which means that the content is stable and can be 
read; 
• At the completion of instruction execution, the register file is ready to accept new 
data. The register file write enable signal (RF_WR_en) is asserted, and 
RF_array_ko is de-asserted (register file array content is not stable until the 
register file write operation is completed); 
• PC_ko rises, which means that the NULL wave has propagated to the outputs and 
that enough time has been allowed for the write operation to complete. Therefore, 








































[Figure 6: register file array design, including the REGISTER FILE COMPLETION LOGIC block.] 




3.3 Register File Output Path 
The output data path from the register file array to the next stage is a straightforward 
path. In a synchronous register file, only the source and destination MUXes are between the 
























Figure 7: Simplified Register File Design View for a Synchronous Execution Unit. 
For an MTNCL register file, this path needs to be modified. First, MTNCL output 
registers need to be added so that the register file is a complete element in an MTNCL pipeline. 
Second, a timing mechanism must be implemented to control when output registers hold DATA 
and when they pass a NULL wave. To meet these conditions, the MTNCL output path consists 
of 3 main components as shown in Figure 8.  
• Boolean source and destination MUXes. 
• single-rail to dual-rail buffers. 











































































Figure 8: MTNCL Register File Output Path. 
  
Delay elements are used to avoid race conditions between register file array output 
selection and MTNCL output register NULL to DATA transition. First, an appropriate 
propagation delay must be allowed for the source and destination MUXes to select the correct 
output after src_select and dst_select vectors are changed. Second, a short delay must be allowed 
for single-rail to dual-rail buffers to transition from NULL to DATA after the sleep signal falls. 
The selection of source and destination operands from the register file array to the output 
registers is explained as follows: 
1. The cycle starts with both register file inputs and outputs in a NULL state. The 
select_ko signal associated with src_select and dst_select input registers is ‘1’, so 
src_select and dst_select vectors are NULL. The single-rail to dual-rail buffer 
(DR_buffer) sleep signal input is ‘1’, which makes the register file source and 
destination operands (src_op and dst_op) NULL; 
2. Register file read process starts: src_select and dst_select inputs become DATA, 
so select_ko falls. Input to the delay element for the MUX stage changes at the 
same time as select signals for the source and destination MUXes become valid 
data.  Propagation delay through the delay element, which is an inverter chain, 
must match or exceed the source and destination MUXes’ propagation delay; 
3. sleep_d1 signal falls following select_ko after the delay element propagation time. 
When this happens, source and destination MUXes output the correct selected 
source and destination operands. The single-rail to dual-rail buffers (DR_buffer) 
sleep input is de-asserted, so their outputs become DATA. The output registers 
receive complete DATA and if ki is a ‘1’, the output src_op and dst_op become 
DATA as well; 
4. The register file write enable signal (RF_WR_en) is used to ensure that the 
DR_buffer sleeps before the register file array content is changed.  The DR_buffer 
sleep input becomes sleep_d1 ORed with RF_WR_en. Therefore, the output is 
NULL when select_ko is ‘1’ (src_select and dst_select inputs are NULL), or 
RF_WR_en is ‘1’ (register file array is being written or dirty). This DR_buffer 
sleep control ensures that newly written data is not passed to the next stage and 
helps create the NULL cycle before the next instruction data is processed; 
5. The output completion logic does not need a completion detection mechanism 
because data coming from the DR_buffer stage is immediately complete when the 
sleep input falls. The sleep_d1 signal, slightly delayed and inverted, gives the 
completion signal, and the modified early completion detection mechanism 
depicted in Figure 9 will have the same impact as the full implementation as 











Figure 9: Completion Detection Logic for Self-timed Data Path Inputs. 
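The sleep control described in step 4 and the single-rail to dual-rail conversion can be sketched behaviorally. The Python below is a minimal illustration, assuming the usual dual-rail convention in which NULL is both rails low; the function names to_dual_rail and dr_buffer and the (rail1, rail0) tuple layout are hypothetical choices made for this sketch, and delays are not modeled.

```python
NULL = (0, 0)  # dual-rail NULL: both rails low (assumed convention)


def to_dual_rail(bit, sleep):
    """Return (rail1, rail0) for one single-rail bit."""
    if sleep:
        return NULL
    return (1, 0) if bit else (0, 1)


def dr_buffer(word, sleep_d1, rf_wr_en, width=16):
    """Convert a single-rail word into a list of dual-rail bits, LSB first."""
    # The buffer sleeps when the select inputs are NULL (sleep_d1 high)
    # or while the register file array is being written (RF_WR_en high).
    sleep = 1 if (sleep_d1 or rf_wr_en) else 0
    return [to_dual_rail((word >> i) & 1, sleep) for i in range(width)]


print(dr_buffer(0b10, sleep_d1=0, rf_wr_en=0, width=2))  # DATA
print(dr_buffer(0b10, sleep_d1=0, rf_wr_en=1, width=2))  # NULL during a write
```

Because every output bit leaves NULL at the same moment the sleep input falls, the outputs are complete as a group, which is why a delayed copy of the sleep signal can stand in for a completion detector.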
  
3.4 Top-Level Register File Design 
The input side of the register file also poses some design challenges. In contrast to other 
MTNCL pipelines, register file inputs become available and are processed during different 
stages. For example, src_select and dst_select inputs become available right after the instruction 
decode stage is completed, and they need to be processed so that source and destination operands 
can be passed to the next stage. However, at this stage, other inputs are still NULL. The 
operation result to be written into the register file array does not become DATA until the 
execution stage is completed. Therefore, different inputs need to be assigned to different 
registers and completion logic blocks, so that their completion can be detected separately.  
Finally, the auto-increment functionality is implemented inside the register file. As 
explained in Section 2.2.2, the MSP430 ISA is capable of reading from memory and incrementing 
the source address within one instruction cycle. This is done by adding the constant 2 to the 
selected source operand. A 16-bit ripple carry adder (RCA) block is added inside the register file, 
and its inputs are the selected source operand and the constant 2. Its output is connected to 
completion logic, and the result is written back into the register file array, replacing the old source 
operand value. All register file entries are written simultaneously after the current instruction is 
completed to avoid a mix-up between old and new data.  
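The auto-increment adder can be illustrated with a bit-level sketch of a 16-bit ripple carry adder. This Python model is only a functional illustration, not the MTNCL implementation; the function names full_adder, ripple_carry_add, and auto_increment are hypothetical, and discarding the carry out of bit 15 is an assumption.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, width=16):
    """Add two words bit by bit, letting the carry ripple from the LSB up."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result  # the carry out of the top bit is discarded (assumed)


def auto_increment(src_op):
    """Source operand plus the constant 2, as used for auto-increment."""
    return ripple_carry_add(src_op, 2)


print(hex(auto_increment(0x0200)))
```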
Detailed register file functions in each execution stage are presented below. 
3.4.1 Instruction Fetch Register File Function 
Instruction execution starts with a new program memory address, which comes from the 
register file. A new PC output is given after each instruction execution completion, which is 
marked by the Write Back stage. When data is being written into the register file array, the 
RF_array_ko signal is de-asserted to indicate that the array content is dirty. The output register 
sleep signal is asserted, and its outputs become NULL. When RF_array_ko rises again, the 
register file outputs the next program memory address, and the Instruction Fetch operation starts.  
3.4.2 Operand Fetch Register File Function 
During the Operand Fetch stage, src_select and dst_select vectors are passed to the 
register file. The source and destination MUXes select the correct source and destination 
operands, and after an appropriate propagation delay src_op and dst_op outputs become DATA. 
For direct addressing mode, these operands are passed to the ALU for execution; for indirect 
addressing modes, the source operand is used as the address to the data memory, which returns the 
actual source operand used during the Execution stage. 
3.4.3 Write Back Register File Function 
All register file entries that need to be updated wait for the Write Back stage. Writing into 
the register file array occurs after all inputs are DATA, and the instruction is fully executed such 
that no more data needs to be read from the array. Up to four entries could be updated in one 
instruction cycle: the PC register is updated in every cycle; the Status Register is updated if any 
of the status bits has changed; the instruction execution result is written to the selected 
destination; and the source operand is incremented for an autoincrement instruction. Writing into 
the array de-asserts RF_array_ko marking the end of the current instruction cycle. 
More details on the system-level data flow and register file integration are given in 
Section 5.2. 
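The simultaneous update of up to four register file entries during Write Back can be sketched as follows. This Python model is a hedged illustration: the function name write_back, its parameters, and the rule that the instruction result takes priority when two updates target the same register are assumptions made for the sketch, not details confirmed by the text.

```python
PC, SP, SR = 0, 1, 2  # register indices R0, R1, R2


def write_back(regs, next_pc, new_sr=None, dst=None, result=None,
               autoinc_src=None):
    """Return a new 16-entry register list with all updates applied at once."""
    updates = {PC: next_pc}  # the PC is written in every cycle
    if new_sr is not None and new_sr != regs[SR]:
        updates[SR] = new_sr  # only when a status bit has changed
    if autoinc_src is not None:
        updates[autoinc_src] = (regs[autoinc_src] + 2) & 0xFFFF
    if dst is not None:
        updates[dst] = result & 0xFFFF  # assumed to win over other updates
    new_regs = list(regs)  # old values stay readable until all updates land
    for index, value in updates.items():
        new_regs[index] = value
    return new_regs


regs = [0] * 16
regs[4] = 0x0200
after = write_back(regs, next_pc=0x4402, new_sr=0x0001,
                   dst=5, result=0x1234, autoinc_src=4)
print([hex(v) for v in after[:6]])
```

Computing every new value from the old register contents before committing any of them mirrors the reason given above for writing all entries together: no update can observe a partially written array.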
4 ASYNCHRONOUS SRAM DESIGN: CMOS RAM AND NON-VOLATILE STT RAM 
SIMILARITIES AND DIFFERENCES  
Asynchronous SRAM designs are uncommon because read and write timing without a 
clock signal poses significant design challenges. A few approaches have been implemented for 
asynchronous CMOS SRAMs. In [10], a QDI SRAM is presented in which read and write detection 
circuitry is added to each bit-line to indicate when the operation is completed. In [12], a 
replica methodology is used where a reference bit-line identical to the main array bit-lines is 
added. The data stored in the reference bit-line is known, which makes read completion 
detection easier. It is assumed that reading from the main array takes the same time as 
reading from the replica bit-line; therefore, read completion detection in the replica signals 
completion in the main array.  
On the other hand, the asynchronous non-volatile STT RAM presented in this dissertation 
work is the first of its kind. The first reason is that magnetic RAM technologies are considerably 
newer than CMOS SRAMs, and they have not been researched as much, especially in the 
asynchronous design area. The second reason is that STT RAMs present completely different 
challenges. The meaning of logic zero and logic one is different from the CMOS meaning, and 
special read and write circuitry is required to interpret STT device states as CMOS logic 
zero or one and vice versa. As explained in Section 2.3, State0 and State1 (Parallel and 
Anti-Parallel states) differ by less than 100 mV for a read circuit operating with a 1.2 V nominal 
voltage. This makes the read and write detection techniques used in [10][11] inapplicable for STT 
RAM completion detection. How STT RAM differs from CMOS RAM and the solutions that allow 
asynchronous STT RAM implementation are presented from bit-cell level read and write to 
top-level self-timing and the dual-rail MTNCL interface. 
4.1 CMOS and STT RAM Bit-cell Designs 
4.1.1 CMOS RAM Bit-cell 
The simplest and most common SRAM bit-cell is adopted for the CMOS RAM. This 
design consists of six transistors. Four of them make up the two data-holding inverters while the 
other two serve as Word Line enablers. The CMOS bit-cell gate-level and transistor-level 



















Figure 11: Transistor-level CMOS RAM Bit-cell Design 
4.1.1.1 CMOS bit-cell read and write. 
The CMOS RAM bit-cell read works as follows: 
1. Start with BLT and BLC pre-charged to VDD; 
2. Turn on the selected Word Line and column MUXes; 
3. Turn off pre-charge transistors; 
4. Selected bit-cell inverters drive BLT and BLC to VDD or GND depending on 
stored data; 
5. Column sense amplifiers detect voltage differences, and the output is latched; 
6. Turn off the selected Word Line and column MUXes. 
 
The CMOS RAM bit-cell write works as follows: 
1. Start with BLT and BLC pre-charged to VDD; 
2. Discharge BLT or BLC depending on the data being written; 
3. Turn on the selected Word Line and column MUXes; 
4. If the data to be written is different from the stored data, the line connected to 
GND overpowers the PMOS device in the bit-cell inverter, and a ‘1’ or a ‘0’ is 
written depending on which line is connected to GND; 
5. Turn off the selected Word Line and column MUXes. 
4.1.1.2 CMOS Bit-cell Transistor Sizing 
Transistor sizing plays a critical role in CMOS RAM read and write processes. The PMOS 
devices (P1 and P2 in Figure 11) need to be the smallest. The WL enable transistors need to be 
wider than the PMOS transistors, but narrower than the NMOS devices (N1 and N2 in Figure 11). 
As a result, WL enable transistors can overpower PMOS transistors to write a ‘0’ to either one of 
the inverters, but a ‘1’ cannot be accidentally written when BLT and BLC are pre-charged to VDD 
because WL transistors cannot overpower the NMOS devices in the inverter. Therefore, writing 
a ‘1’ to the BLT side inverter is done by writing a ‘0’ to the BLC side inverter, and letting the 
BLC inverter drive the BLT inverter output to ‘1’. For this dissertation work, bit-cell transistors 
are sized as 1 × W for P1 and P2, 2 × W for the WL transistors, and 3 × W for N1 and N2, where W 
equals 120 nm. The transistor channel length is the same for all transistors, i.e., 60 nm. 
4.1.2 STT RAM Bit-cell 
The STT RAM bit-cell storage element consists of one compact non-volatile device. Its 
storage capability comes from the fact that the magneto-electrical properties of an STT MTJ 
structure can be changed to allow two possible states. The Parallel state also referred to as State0 
is when the magnetic field of the device’s free layer aligns with that of the fixed layer to allow 
the least electrical resistance. The Anti-Parallel state or State1 is when magnetic fields of the free 
and fixed layers are in opposite directions, which constitutes a structure of higher electrical 
resistance.  
A write operation for an STT device consists of changing its state. State0 is changed to 
State1 by applying a positive current from one terminal to the other, and State1 is changed to 
State0 by applying a negative current. A state change requires that a certain critical current be 
maintained for a predetermined time. For instance, the model used in this dissertation work requires 
a minimum of 284 mV across its terminals to produce a 45.68 μA current to switch from State0 to 
State1. Due to the higher resistance, switching from State1 requires a higher voltage: the minimum 
voltage required to switch from State1 to State0 is 543 mV, which produces a 45.54 μA current. The time 
required for the switch depends on the voltage across the STT device terminals. The higher the 
voltage, the higher the current through the device and, therefore, the faster the state is changed. 
Table 2 shows the required switching time as a function of voltage and the device initial state. 
Table 2: Minimum Voltage for STT Device State Switch and Corresponding Switching Time. 
Switching From    [T1, T2] Voltage (V)    Current (μA)    Switching Time (ns) 
P / State0         0.284                   45.68           9.9 
P / State0         1                       160.85          1.1 
AP / State1       -0.543                  -45.54           9.9 
AP / State1       -1                      -114.89          1.65 
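As a quick cross-check of Table 2, dividing each voltage by the corresponding current gives the effective device resistance at that operating point. The short Python sketch below only re-uses the table values; the names TABLE_2 and effective_resistance are introduced here for illustration.

```python
# (initial state, voltage in V, current in A, switching time in ns), from Table 2
TABLE_2 = [
    ("P/State0", 0.284, 45.68e-6, 9.9),
    ("P/State0", 1.0, 160.85e-6, 1.1),
    ("AP/State1", -0.543, -45.54e-6, 9.9),
    ("AP/State1", -1.0, -114.89e-6, 1.65),
]


def effective_resistance(voltage, current):
    """Ohm's law at one operating point: R = V / I."""
    return voltage / current


for state, v, i, t_ns in TABLE_2:
    r = effective_resistance(v, i)
    print(f"{state:10s} {v:+.3f} V  {r / 1e3:6.2f} kOhm  {t_ns} ns")
```

With these numbers, the State0 rows both work out to roughly 6.2 kΩ, while the State1 rows give roughly 11.9 kΩ at the minimum switching voltage and roughly 8.7 kΩ at 1 V, which is consistent with State1 being the higher-resistance state.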
 
The histogram in Figure 12 further illustrates the relationship between the voltage across 
the device, its initial state, and the switching time.  
 
Figure 12: State0 and State1 Required Switching Voltages and Times. 
4.1.2.1 STT RAM Bit-cell Structure 
One of the most favorable aspects of STT devices is their easy integration into CMOS 
logic. For transistor-level simulations, STT devices can be used with CMOS logic, and their 
electrical properties are recognized by most CAD simulation tools. As a result, a schematic-level 
STT RAM bit-cell can be designed as shown in Figure 13. In comparison to the CMOS RAM 
bit-cell, the data-holding inverters are replaced by the STT memory device, while the WL enable 
transistors are replaced by one access transistor. The access transistor plays the Word Line 
enabler role by allowing the current to go through the STT device when the bit-cell is selected 




















































Figure 13: STT RAM Bit-cell Design. 
4.1.2.2 STT RAM Bit-cell Read and Write. 
The main challenge for read and write in STT devices, or non-volatile magnetic memory 
cells in general, is that the Parallel and Anti-Parallel states do not necessarily correspond to 
ground or supply voltages. As shown in Table 2, for instance, both State0 and State1 are 
conductive, and they only differ in how much current goes through the device when the same 
voltage is applied. Reading from an STT device is done by applying a positive current and 
detecting the device’s state by comparing the current to a reference current. However, to 
implement read and write circuitry that meets SRAM functionality, control transistors are 
added to the current path, so the STT device current and voltage depend on several variables.  
Therefore, reading or writing to a magnetic memory device is a more complex process than 
reading or writing to a CMOS RAM bit-cell. 
To switch from State0 to State1, the current flows from BLT to BLC, and to switch from 
State1 to State0, the current flows from BLC to BLT. This implies that BLT is connected to 
power and BLC to ground to write State1, and vice versa to write State0. To reduce switching 
delays, the voltage across the memory device needs to be maximized. In other words, the 
resistance from VDD to the device and from the device to GND must be minimized; therefore, the 
read and write circuitry for every bit-line is complex enough to control the current direction 
depending on the data being written, but simple enough to keep a minimal resistance in the BLT 
and BLC paths.  
The read logic adds a level of complexity to the equation. The CMOS RAM read is done 
by comparing BLT and BLC voltages, which are driven by the bit-cell to VDD and GND 
respectively when the data is a ‘1’ or vice versa when the data is a ‘0’. On the other hand, 
reading from a magnetic memory device is done by applying a positive current and comparing 
the voltage drop across the device to a reference voltage. Transistor-level read and write 
circuitries are explained in Section 4.2.3. 
4.2 Asynchronous Timing and Column Logic 
4.2.1 SRAM Structure 
Figure 14 shows the SRAM bit-line and column read and write structure. Each bit line is 
connected to a pre-charge block and a column MUX. The outputs of the MUX, SLT and SLC, 



























Figure 14: SRAM Bit Line and Column Logic Structure. 
While each bit-line is connected to pre-charge and column MUX blocks, the write circuit 
and the sense amplifier are used for multiple bit lines depending on the number of bits dedicated 
to the column MUX select signals. For instance, the SRAM in this dissertation work uses three 
of the address bits for the column decoder, which means that one in every 8 bit-lines is selected. 
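For a 16-bit input/output, the SRAM array needs to have 8 × 16 = 128 bit-lines. For the 1 KB 
SRAM, the number of addressable 16-bit words is given by Equation 1. 
\frac{1\ \text{KB}}{16\ \text{bits}} = \frac{1024 \times 8\ \text{bits}}{16\ \text{bits}} = 512 = 2^{9}      (1) 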
Equation 1 shows that nine bits are required to address 16-bit words in a 1 KB SRAM. 
Therefore, three bits are used for column MUX selection, and six bits are used for Word Line 
selection.  
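The address arithmetic above can be checked with a few lines of Python. The sketch assumes that the three column-select bits are the low-order address bits and the six Word Line bits are the high-order bits; the text does not state which bits are used for which decoder, so this ordering and the function name split_address are illustrative assumptions.

```python
WORD_BITS = 16
CAPACITY_BITS = 1024 * 8                  # 1 KB
NUM_WORDS = CAPACITY_BITS // WORD_BITS    # 512 words
ADDR_BITS = NUM_WORDS.bit_length() - 1    # 9 address bits
COL_BITS = 3                              # column MUX select bits
ROW_BITS = ADDR_BITS - COL_BITS           # 6 Word Line select bits
BIT_LINES = (1 << COL_BITS) * WORD_BITS   # 8 x 16 = 128 bit-lines


def split_address(addr):
    """Split a word address into (row, column) indices."""
    if not 0 <= addr < NUM_WORDS:
        raise ValueError("address out of range")
    column = addr & ((1 << COL_BITS) - 1)  # assumed: low bits pick the column
    row = addr >> COL_BITS                 # assumed: high bits pick the Word Line
    return row, column


print(NUM_WORDS, ADDR_BITS, ROW_BITS, BIT_LINES)
print(split_address(0b101010011))
```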
4.2.2 CMOS RAM Column Circuitries and Read and Write Timing 
4.2.2.1 Pre-charge signal timing 
The main challenge in designing an asynchronous SRAM is the timing of control signals. 
The pre-charge block uses a pre-charge or PC input to connect BLT and BLC lines to VDD 
during the pre-charge stage. PMOS devices are used because they pass the full VDD level to the 
bit-lines; therefore, PC is ‘0’ during pre-charge and ‘1’ during read or write. To keep 
the SRAM in standby mode, bit-lines are pre-charged during NULL cycles so that reads or writes 
can be performed as fast as possible when data is presented. PC timing works as follows: 
• After a read or write is finished, the SRAM receives a NULL wave, and the PC 
signal is driven to ‘0’ to allow pre-charge; 
• After DATA is presented for read or write, the PC signal is ‘1’ to allow the read 
or write, but an appropriate delay must be allowed so that access transistors on the 
selected Word Line are turned on while BLT and BLC lines are still fully 
charged. 
4.2.2.2 CMOS SRAM write circuitry. 






Figure 15: CMOS RAM Column Write Circuitry. 
The write enable signal (W_EN) is used so that SLT and SLC lines are not discharged to 
ground during a read. For a write operation, with both SLT and SLC (selected BLT and BLC) 
pre-charged to VDD, the data input (DIN) decides which line is discharged to ground: SLT is 
discharged to ground, and SLC stays charged to VDD when DIN is ‘0’; and SLC is discharged 
and SLT stays charged to VDD when DIN is ‘1’. The data written to the selected bit-cell depends 
on which line is discharged as explained in Section 4.1.1.1. 
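The column write behavior just described reduces to a small truth table. The Python sketch below models logic levels only (1 for VDD, 0 for GND) and ignores timing; the function name write_driver is hypothetical.

```python
def write_driver(slt, slc, w_en, din):
    """Return the (SLT, SLC) levels after the column write circuit acts."""
    if not w_en:
        return slt, slc  # read: neither line is pulled to ground
    if din:
        return slt, 0    # DIN = '1': SLC is discharged, SLT stays at VDD
    return 0, slc        # DIN = '0': SLT is discharged, SLC stays at VDD


# Both lines start pre-charged to VDD.
print(write_driver(1, 1, w_en=1, din=1))
print(write_driver(1, 1, w_en=1, din=0))
print(write_driver(1, 1, w_en=0, din=1))
```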
4.2.2.3 Read and Write completion detection. 
Without a clock signal, asynchronous SRAM needs a mechanism to indicate that an 
operation is completed. For the write operation, a completion signal is needed to feed the 
operation status back to the core, and for a read operation, a completion signal is needed so that data 
can be latched and passed to the core at the right time. In a CMOS RAM, the pre-charge stage 
that takes place before read and write operations is leveraged to derive a simple mechanism for 
read and write completion detection.  
A read or write operation starts with each BLT and BLC in the array charged to VDD. 
After the selected Word Line enable transistors are turned on and pre-charging is stopped, BLT 
and BLC lines are connected to selected bit-cell inverters. For a write operation, selected BLT or 
BLC depending on data being written is driven to GND by the column write circuit. Unselected 
bit-lines on the other hand have both BLT and BLC lines charged to VDD until one of them is 
driven to GND by the inverter in the bit-cell. Due to the bit-line capacitance, it takes longer to 
discharge BLT or BLC lines through a bit-cell inverter than it takes a bit-line to overpower the 
PMOS device and change the stored data. This means that the selected bit-line writes the new 
data before unselected bit-lines are discharged. Therefore, it is concluded that a write completion 
is detected when unselected bit-lines are discharged. 
For read operations, BLT or BLC for each bit-line starts discharging through the selected 
bit-cell inverter when Word Line enable transistors are turned on. A read completion is detected 
when a certain voltage difference threshold between BLT and BLC lines is reached. Sense 
amplifiers used to compare selected BLT and BLC voltages are sensitive to the smallest voltage 
differences. In contrast, a Boolean gate requires that the voltage falls significantly before logic 
zero is detected. Therefore, using Boolean gates for the completion detection logic guarantees a 








Figure 16: Asynchronous CMOS RAM Read and Write Completion Logic. 
The completion detection logic works as follows: 
1. The DATA cycle starts with BLT and BLC lines charged to VDD, which makes 
NAND gates output ‘0’; 
2. A completion is detected when one of the lines is discharged (one of the NAND 
gate inputs becomes ‘0’), so both NAND gate outputs are ‘1’; 
3. For a fully delay-insensitive design, completion detection circuitries would have 
been added to each bit-line. For a QDI design, it is assumed that if the operation is 
completed for one bit-line, it is completed for all. However, the completion 
detection presented in Figure 16 checks two consecutive bit-lines to guarantee 
that the bit-line being checked is not selected for write. 
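The completion detection steps above can be expressed as a small logic sketch. The Python below assumes that the two NAND outputs are combined with an AND function, which matches the statement that completion requires both NAND outputs to be '1'; the exact combining gate is an assumption, and the function names are hypothetical.

```python
def nand(a, b):
    return 0 if (a and b) else 1


def completion(blt_a, blc_a, blt_b, blc_b):
    """Completion signal from two consecutive bit-lines (a and b)."""
    done_a = nand(blt_a, blc_a)  # '1' once BLT or BLC of bit-line a is low
    done_b = nand(blt_b, blc_b)  # '1' once BLT or BLC of bit-line b is low
    return done_a & done_b       # assumed AND: both bit-lines must be done


print(completion(1, 1, 1, 1))  # pre-charged: not complete
print(completion(0, 1, 1, 1))  # only one bit-line discharged: not complete
print(completion(0, 1, 1, 0))  # both discharged: complete
```

With an 8-to-1 column MUX, two consecutive bit-lines cannot both be selected, so requiring both to discharge means at least one unselected bit-line has been discharged by its bit-cell, which is the write completion condition described in Section 4.2.2.3.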
4.2.3 STT RAM Timing and Column Logic 
While the overall RAM structure is the same for the asynchronous CMOS and STT RAM 
designs, control signals and read/write circuitries are different. STT device read and write 
operations require a significant amount of current flow; therefore, a dynamic current flow from 
VDD to GND is required in contrast to the pre-charge technique used in CMOS RAM designs.   
Given that STT RAM designs do not need bit-line pre-charging, the read and write 
completion signal is the only timing-critical signal that must be designed carefully. Device-level 
simulations have shown that the write time under a 1.2 V nominal voltage might be as high as 
7 ns. This poses a timing issue that goes beyond common delay measuring techniques. To put this 
problem into perspective, for a 65 nm process, one inverter delay is 13 ps. This means that if an 
inverter chain delay circuit were used to detect STT device write completion, around 550 
inverters would need to be used. The area and power overhead resulting from such circuitry would 
be a big disadvantage, but more importantly, delay variations across such a long chain would 
result in a strong dependence on process variations. To address this problem, a replica 
methodology using reference STT devices to detect the write completion was adopted. 
Transistor-level read and write circuitries as well as the replica mechanism are detailed in this 
section. 
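The inverter-chain estimate quoted above follows from a single division, sketched here in Python. The constant and function names are introduced for illustration, and no design margin is included.

```python
import math

WRITE_TIME_PS = 7_000    # worst-case write time quoted in the text (7 ns)
INVERTER_DELAY_PS = 13   # one inverter delay in the 65 nm process


def inverters_needed(delay_ps, stage_delay_ps):
    """Smallest number of inverter stages whose total delay covers delay_ps."""
    return math.ceil(delay_ps / stage_delay_ps)


print(inverters_needed(WRITE_TIME_PS, INVERTER_DELAY_PS))  # 539
```

The result of about 540 stages before any margin is in line with the figure of around 550 inverters given in the text.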
4.2.3.1 STT RAM read and write. 
STT RAM column logic follows the same block level structure as that of the CMOS 
RAM presented in Figure 14. The difference between the two designs resides in the transistor-
level implementation and functionality of each block. Like the CMOS RAM design, the column 
MUX block is a pass gate that is ON when the select signal is ‘1’.  The write circuitry in the 
CMOS RAM design is replaced by a Write_and_Read circuit depicted in Figure 17. The 
Write_and_Read block uses a more elaborate logic than the CMOS write circuitry due to the 
complex nature of read and write processes in magnetic memory devices. The diagram in Figure 
18 shows the current direction depending on whether a read or a write operation is performed, 










[Figure 17 legend: current paths labeled “Write or Read” and “Write”; NMOS devices in black, PMOS devices in purple.] 
 
Figure 17: STT RAM Column Write and Read Bit-line Current Flow Controlling Logic. 
  
Instances labeled RRW1, Rmux1, RWL, Rmux0, and RRW0 in Figure 18 represent, respectively, the 
resistances of the Write_and_Read and MUX blocks on the SLT line, the Word Line enable 
transistor, and the MUX and Write_and_Read blocks on the SLC line. RP and RAP are the 
resistances of the STT device in State0 and State1, respectively. The total path resistances for 
State0 and State1 are given by Equation 2 and Equation 3. 




















Figure 18: STT RAM Read and Write Bit-line Current Flow. 
  
TotR_{P} = R_{RW1} + R_{mux1} + R_{WL} + R_{P} + R_{mux0} + R_{RW0}    (2) 
TotR_{AP} = R_{RW1} + R_{mux1} + R_{WL} + R_{AP} + R_{mux0} + R_{RW0}   (3) 
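The current to write a ‘1’ or to switch from State0 to State1 (IW1) and the current to write 
a ‘0’ or to switch from State1 to State0 (IW0) are given by Equation 4 and Equation 5. 
I_{W1} = \frac{1.2\,V}{TotR_{P}}      (4) 
I_{W0} = \frac{1.2\,V}{TotR_{AP}}      (5) 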
It should be noted that in the circuitry depicted in Figure 18, the WL NMOS transistor 
placement violates the symmetry between the positive and negative current paths. The WL 
transistor is placed between the power source and the STT device for the positive current, but 
between the device and ground for the negative current. This results in a high NMOS device 
resistance for the positive current when its source terminal voltage is closer to VDD. For the 
negative current, the WL transistor is on the other side of the STT device, and its source terminal 
voltage is closer to GND, which makes it less resistive. The increased WL NMOS resistance 
while switching from State0 to State1 results in IW0 being greater than IW1 despite RAP being 
higher than RP. As a result, for this STT RAM implementation, switching from State0 to State1 
takes longer than switching from State1 to State0, in contrast with the stand-alone STT device 
simulation. 
As explained in Section 4.1.2, the time required for the STT device state change decreases as 
the write current increases. Consequently, transistor sizing in the Write_and_Read and MUX 
blocks is critical. Transistors in these blocks need to be wide enough to minimize RRW1, Rmux1, Rmux0, and 
RRW0, but not so wide that they become the limiting factor for layout.  
The Write_and_Read block controls IW1, IW0, and Iread  current flow as follows: 
• To achieve bidirectional active current flow control, both SLT and SLC lines 
have a path to VDD and GND; 
• The Read-or-Write enable (RW) signal and its complement RW_B are used to 
control the bottom half transistors of the Write_and_Read block. All bottom 
transistors are ON when RW is ‘1’ to allow current flow for read or write; 
• The top transistors are ON depending on the direction in which current needs to 
flow. SLT_ON and SLC_ON signals are mutually exclusive such that SLT is 
connected to VDD and SLC to GND when SLT_ON is asserted, and SLC is 
connected to VDD and SLT to GND when SLC_ON is asserted; 
• A custom logic block was designed to compute SLT_ON and SLC_ON signals 
such that SLT_ON is asserted to write a ‘1’ (switch to State1) or to read, and 
SLC_ON is asserted to write a ‘0’ (switch to State0). 
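The direction control described in the list above can be sketched as a truth function. The Python below is only an illustration: the text does not list the inputs of the custom logic block, so using RW, W_EN, and DIN as inputs, and leaving both outputs low when RW is '0', are assumptions; the function name direction_control is hypothetical.

```python
def direction_control(rw, w_en, din):
    """Return (SLT_ON, SLC_ON) for one column."""
    if not rw:
        return 0, 0      # idle: no current path enabled (assumed)
    if not w_en:
        return 1, 0      # read: positive current, SLT connected to VDD
    if din:
        return 1, 0      # write '1' (switch to State1): SLT to VDD, SLC to GND
    return 0, 1          # write '0' (switch to State0): SLC to VDD, SLT to GND


for rw in (0, 1):
    for w_en in (0, 1):
        for din in (0, 1):
            print(rw, w_en, din, direction_control(rw, w_en, din))
```

Whatever the input combination, the two outputs are never asserted together, which reflects the mutual exclusivity required of SLT_ON and SLC_ON.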
4.2.3.2 Detailed STT RAM Read Process 





































Figure 19: STT RAM Column Read Circuit: A Side-by-Side Comparison Between the Active and 
Reference Paths. 
In contrast to the CMOS RAM read mechanism, STT RAM read logic cannot compare 
SLT and SLC lines to read the STT device state. A read operation consists of applying a positive 
current from the SLT to the SLC line; therefore, the voltage on the SLT line is always higher 
than that of the SLC line during a read process regardless of the STT device state. To detect the 
state, a reference STT device of a known state is used. The reference device is in State1, and it is 
never changed. As a result, the resistance (Rref) in the reference path and current (Iref) flowing 
through the reference device during read are given by Equation 6 and Equation 7. 




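R_{ref} = R_{RWref} + R_{mux1} + R_{WL} + R_{AP} + R_{mux0} + R_{RW0}      (6) 
I_{ref} = \frac{1.2\,V}{R_{ref}}       (7) 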
Both state0 and state1 are read the same way, with a positive current, thus the WL 
NMOS resistance is approximately the same when the STT device is in state0 or state1.  For read 
operations the difference in total path resistance is solely attributed to Rp and RAP. Assuming 
RRWref  equals RRW1  and considering STT device State0 resistance RP which is less than State1 
resistance RAP, Equations 2, 3, and 6 are summarized as: 
𝑇𝑜𝑡𝑅𝑃 < 𝑇𝑜𝑡𝑅𝐴𝑃 = 𝑅𝑟𝑒𝑓      (8) 
Consider IR0 and IR1, currents when reading state0 and state1 respectively, Equation (8) 
translates into: 
𝐼𝑅0 > 𝐼𝑅1 = 𝐼𝑟𝑒𝑓      (9) 
However, (9) does not help with the read process if the reference and State1 current are 
the same. To fix this, transistor sizing in the Write_and_Read circuit for the reference path is 
changed to make RRWref  higher than RRW1. Equation 8 and Equation 9 change into 
𝑇𝑜𝑡𝑅𝑃 < 𝑇𝑜𝑡𝑅𝐴𝑃 < 𝑅𝑟𝑒𝑓      (10) 
𝐼𝑅0 > 𝐼𝑅1 > 𝐼𝑅𝑒𝑓      (11) 
A sense amplifier compares the active line SLT to the reference path SLT (SLT_ref). The 
voltage on the SLT line when reading State0 (SLT_0), and the voltage on the SLT line when 
reading State1 (SLT_1) are given by Equation 12 and Equation 13, while the voltage on the 
reference SLT line (SLT_ref) is given by Equation 14.  
$SLT\_0 = 1.2\,V - (I_{R0} \cdot R_{RW1})$    (12)
$SLT\_1 = 1.2\,V - (I_{R1} \cdot R_{RW1})$    (13)
$SLT\_ref = 1.2\,V - (I_{ref} \cdot R_{RWref})$    (14)
Transistors in the reference Write_and_Read block are sized such that the voltage on the 
reference line is higher than that of the active line while reading ‘0’, but lower when reading ‘1’, 
as shown by Equation 15.  
$SLT\_0 < SLT\_ref < SLT\_1$    (15)
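The sizing condition in Equation 15 can be checked numerically. The sketch below is only an illustration: it assumes a simple series-resistance model of each read path (supply, Write_and_Read resistance, STT device, WL NMOS), and every resistance value in it is hypothetical rather than taken from the design.

```python
# Minimal sketch of the read-voltage window in Equations 12-15.
# Assumption: each path is a series chain
#   VDD -> R_RW -> SLT node -> STT device -> WL NMOS -> ground.
# All resistance values below are hypothetical.

VDD = 1.2          # supply voltage in volts, as used in Equations 12-14
R_P = 2000.0       # State0 device resistance in ohms (hypothetical)
R_AP = 4000.0      # State1 device resistance in ohms (hypothetical)
R_WL = 1000.0      # WL NMOS on-resistance in ohms (hypothetical)
R_RW1 = 1000.0     # active-path Write_and_Read resistance in ohms (hypothetical)
R_RWREF = 1300.0   # reference-path resistance, sized above R_RW1 (hypothetical)


def slt_voltage(r_rw: float, r_device: float) -> float:
    """SLT node voltage: VDD minus the drop across the Write_and_Read resistance."""
    current = VDD / (r_rw + r_device + R_WL)
    return VDD - current * r_rw


slt_0 = slt_voltage(R_RW1, R_P)        # Equation 12
slt_1 = slt_voltage(R_RW1, R_AP)       # Equation 13
slt_ref = slt_voltage(R_RWREF, R_AP)   # Equation 14, reference device fixed in State1

print(f"SLT_0={slt_0:.3f} V  SLT_ref={slt_ref:.3f} V  SLT_1={slt_1:.3f} V")
print("Equation 15 holds:", slt_0 < slt_ref < slt_1)
```

Under this simplified model, raising R_RWREF lowers SLT_ref, so the reference resistance has to stay inside a window: large enough to separate SLT_ref from SLT_1, but not so large that SLT_ref drops below SLT_0.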










































Figure 20: Block-Level Diagram for The STT RAM Replica Mechanism. 
  
The replica circuit depicted in Figure 20 works as follows.
As discussed in the previous section, writing a ‘1’ takes longer than writing a ‘0’. Because the replica
must indicate that the write operation is complete in the main array, its write completion detection
must account for the worst-case scenario. Thus, a ‘1’ must be written in the replica cell at the
same time as the main array write occurs. When the replica completes writing a ‘1’, it is assumed
that writing ‘0’ or ‘1’ in the main array is complete. However, the replica methodology requires
that each write change the data stored in the replica cell. Therefore, 2 STT devices are used such that for
every write operation, a ‘0’ is written to the cell that is storing a ‘1’, and a ‘1’ is written to the
cell that is storing a ‘0’. A 1-bit counter block is used to keep track of the 2 STT devices. As
shown in Figure 20, the input data of the 2 STT devices are the counter output and its
complement, so each input alternates between ‘0’ and ‘1’ from one write to the next. Notice that the counter is
incremented only by the write enable (W_EN) signal. This ensures that the counter does not change
for read operations.
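The alternating behavior of the two replica devices can be sketched behaviorally. The class and method names below are hypothetical, and the sketch assumes that the counter toggles on W_EN before the new data is applied; it models only the data values, not the device physics or timing.

```python
# Behavioral sketch of the two-device replica driven by a 1-bit counter.
# Assumption: the counter toggles on each W_EN pulse before data is applied.

class ReplicaPair:
    def __init__(self) -> None:
        self.counter = 0   # 1-bit counter output
        self.cell_a = 0    # device driven by the counter output
        self.cell_b = 1    # device driven by the counter complement

    def write(self) -> str:
        """One write cycle (W_EN pulse). Returns the device receiving a '1'."""
        self.counter ^= 1                           # incremented only by W_EN
        new_a, new_b = self.counter, self.counter ^ 1
        # Each device receives the opposite of what it stores, so both switch.
        assert new_a != self.cell_a and new_b != self.cell_b
        self.cell_a, self.cell_b = new_a, new_b
        return "A" if new_a == 1 else "B"           # MUX picks the '1' write

    def read(self) -> None:
        """Reads leave the counter, and therefore both devices, unchanged."""


pair = ReplicaPair()
selected = [pair.write() for _ in range(4)]
print(selected)
```

In this sketch, every write finds one device that must switch to ‘1’, which is the worst-case transition the replica is meant to time.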
The “write and read blocks” labeled in Figure 20 represent the whole read and write
logic, including the sense amplifier. As the write operation proceeds, the sense amplifiers track both
devices’ states. When the write operation is completed, the sense amplifier output changes. A
MUX block is used to select the output from the device where ‘1’ is being written. However, for
the sense amplifier to update its output, its enable signal (SENSE_EN) needs to rise at the right
time. A short delay element is used to toggle the SENSE_EN signal continuously until the sense
amplifier output changes. The delay used for SENSE_EN determines the precision of the replica completion
detection. If a short delay is used, multiple reads are performed at short intervals,
and write completion is detected soon after it occurs. On the other hand, when a longer delay
is used, the sense amplifier output is updated less frequently, and write completion is not
detected immediately.
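The trade-off between the SENSE_EN delay and detection precision can be illustrated with a small calculation. The sketch below assumes that SENSE_EN rises at integer multiples of a fixed period counted from the start of the write; the switching time and periods are hypothetical values, not measurements from the design.

```python
import math

# Sketch of write-completion detection by periodic SENSE_EN polling.
# Assumption: SENSE_EN rising edges occur at multiples of sense_period_ns
# measured from the start of the write. All numbers are hypothetical.


def detection_time(write_done_ns: float, sense_period_ns: float) -> float:
    """Time of the first SENSE_EN rising edge at or after write completion."""
    edges = math.ceil(write_done_ns / sense_period_ns)
    return edges * sense_period_ns


WRITE_DONE_NS = 6.3  # hypothetical time at which the replica device switches

for period in (0.25, 1.0, 4.0):
    detected = detection_time(WRITE_DONE_NS, period)
    lag = detected - WRITE_DONE_NS
    print(f"period={period} ns: detected at {detected} ns (lag {lag:.2f} ns)")
```

A shorter period reduces the lag between the actual completion and its detection, at the cost of more sense amplifier activations per write.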
In addition to write completion detection, the replica block is responsible for the timing of the main
array SENSE_EN signal during reads. The read delay of STT devices is not as problematic as their
write delay. Regular delay elements could be used for read completion detection, but the
mechanism implemented to detect write completion also detects read completion at no extra cost.
After the read address and enable signals are passed to the main array, the replica SENSE_EN signal
rises to read from the selected STT device. The replica read time is primarily dictated by the
delay before the replica SENSE_EN rises, since STT devices alone can be read instantaneously.
For this dissertation work, the replica SENSE_EN signal toggles at the same frequency for read
and write completion detection, which does not optimize STT RAM write time. After the replica
sense amplifier updates its output, it is assumed that the main array outputs are ready to be
latched as well; thus, the main array SENSE_EN signal is asserted.

















































Figure 21: Asynchronous 1KB STT RAM with MTNCL Interface 
As illustrated in Figure 21, input and output MTNCL registers as well as completion logic
blocks are added to the asynchronous STT RAM. This interface ensures that all inputs have been
received before the internal self-timing mechanism is activated. DMEM_ko, DMEM_slpIn, and
DMEM_ki, as depicted in Figure 21, serve their regular roles in the MTNCL handshaking
protocol. The input stage registration and control signals indicate when the internal RAM
self-timing process should start. During a NULL cycle, internal memory signals are disabled to
automatically put the memory in an idle state. Conversely, when DATA is presented (input
data and address for a write operation, or address for a read operation), DMEM_ko falls, which
signals the memory control block to start the operation. The memory control logic asserts the write
enable (W_EN), read enable (R_EN), and read or write enable (RW_EN) signals accordingly.
At the output stage, the DMEM_ki signal comes from the core; it is the ko signal
associated with the stage waiting for the memory output. When a memory operation is finished
and the output DATA is passed to the core, the enable signals are de-asserted to avoid unnecessary
power dissipation. The output registration holds DATA until DMEM_ki is de-asserted (the
core sends a request-for-NULL).
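The input-side behavior described above can be summarized as a small truth-table sketch. The function name is hypothetical, and two points are assumptions rather than statements from the design: RW_EN is modeled as the OR of W_EN and R_EN (inferred from its name), and DMEM_ko is modeled as ‘1’ during a NULL cycle (a request for DATA).

```python
# Sketch of the memory control decisions at the MTNCL input stage.
# Assumptions: RW_EN = W_EN OR R_EN, and DMEM_ko is '1' while the input is NULL.


def memory_control(data_present: bool, is_write: bool) -> dict:
    """Return the modeled handshake and enable signals for one cycle."""
    if not data_present:
        # NULL cycle: internal signals disabled, memory idle, requesting DATA.
        return {"DMEM_ko": 1, "W_EN": 0, "R_EN": 0, "RW_EN": 0}
    w_en = 1 if is_write else 0
    r_en = 0 if is_write else 1
    # DATA received: DMEM_ko falls and the requested operation is enabled.
    return {"DMEM_ko": 0, "W_EN": w_en, "R_EN": r_en, "RW_EN": w_en | r_en}


for present, write in ((False, False), (True, True), (True, False)):
    print(present, write, memory_control(present, write))
```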
  
5 MTNCL MSP430 IMPLEMENTATION 
5.1 Reduced MSP430 Instruction ISA and Addressing Modes 
As discussed in Section 2.2.1, MSP430 is an advanced architecture with a considerable 
number of instructions and execution modes. The MSP430 MTNCL core architecture 
implemented in this dissertation work is limited to instructions and execution modes relevant to 
the objectives of the dissertation. The implemented architecture includes all main components 
required for the full MSP430 ISA, but some control logic elements are omitted.  
Out of 27 total instructions, 16 are fully implemented and tested, 9 are supported by the
implemented architecture, and 2 require architectural enhancement:
• All double-operand instructions are fully implemented;
• 4 of 7 single-operand instructions are fully implemented;
• Jump instructions: the implemented architecture was designed to handle jump
instructions, but the implementation was not completed. All jump instructions can
be fully implemented with minor changes to the current design;
• CALL (subroutine call) and RETI (return from interrupt) instructions: not
supported.
The implemented architecture supports 9 of the 12 addressing modes. All 6 constant
addressing modes are fully implemented, as are the register direct, register indirect, and indirect
autoincrement modes. The indexed, immediate, and absolute addressing modes are not
implemented.
5.2 MSP430 MTNCL Architecture and System Data Flow 
5.2.1 MSP430 Architecture 
The MSP430, being a low-power microcontroller, implements a reduced instruction set
computer (RISC) ISA with a simple data path. In this dissertation, no high-speed or performance
enhancement techniques are explored. The implemented architecture is modeled on the basic
MIPS (Microprocessor without Interlocked Pipelined Stages) architecture, with 5 simple instruction execution stages:
instruction fetch, instruction decode, operand fetch, execution, and write back. Furthermore, a
single pipeline stage is implemented, which helps with data dependency control. As illustrated
by the flow chart in Figure 22, an instruction must be fully executed before the next one is 
fetched. This makes the architecture easily adaptable to MTNCL data propagation flow where 
instruction execution stages are translated into MTNCL pipeline stages. To elaborate, each 
instruction execution stage is implemented as a complete MTNCL pipeline element with input 
and output registers. This clear hierarchy helps with design debug and enhancement, which can 









































Figure 22: MSP430 MTNCL Adapted Single Pipeline Stage Data Flow. 
  
5.2.2 MSP430 MTNCL Data Flow 
The system data flow follows the 5 execution stages as detailed below: 
• Instruction fetch: Instruction fetch starts with a program counter output from the
register file, which is sent to the program memory. The program memory is
accessed, and an instruction is retrieved and sent back to the core;
• Instruction decode: Instruction decode is implemented with one combinational
logic block. It uses the 16-bit instruction code to deduce all process parameters,
including source and destination operand select vectors (register file array
addresses), instruction type, addressing mode, and execution mode (byte or
word execution);
• Operand fetch: during operand fetch source and destination operands are 
determined depending on the instruction type and addressing mode. Source and 
destination select vectors from the instruction decode are used to read source and 
destination operands from the register file. If the instruction uses an indirect 
source addressing mode, the source operand comes from the data memory. The 
register file source operand output is then used as address to the data memory; and 
the memory output is sent as source operand to the execution stage; 
• Execution: All implemented instructions use the ALU block during the execution 
stage; hence, the ALU and the register file are the 2 central blocks of the 
implemented MSP430 core. The ALU implements blocks such as the adder, the 
logic unit, the rotate unit, the sign extend unit, and the swap byte unit. A sleep 
select mechanism is used to (1) minimize active power dissipation, and (2) 
simplify output selection logic. Thus, for each instruction, only the required 
component is activated; 
• Write back stage: During the write back stage, instruction execution results are 
saved. For a destination indirect addressing, the execution result is written to the 
data memory. However, even when the ALU result is written to data memory, the 
register file still needs to be updated. The PC increment logic is implemented 
outside of the register file, so the incremented PC value needs to be written to the 
register file in every instruction cycle. Also, the status register is updated every 
time an instruction modifies one of the status bits. However, the register file 
implementation requires that data is only written to the register file array after an 
instruction is fully executed. Handshaking signals from the data memory MTNCL 
interface help meet this requirement with little control logic overhead. 
The flow chart diagram in Figure 22 summarizes the stage-by-stage MSP430 MTNCL 
system execution flow. 
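To make the decode stage concrete, the sketch below extracts the fields of a double-operand instruction word. The field layout follows the standard MSP430 double-operand format (opcode, source register, Ad, B/W, As, destination register); the function and type names are hypothetical, constant-generator handling is omitted, and the code is a software illustration, not the combinational decoder implemented in this work.

```python
from typing import NamedTuple

# Sketch of field extraction for a 16-bit MSP430 double-operand instruction.
# Constant-generator cases (R2/R3 as source) are deliberately not handled.


class DoubleOperand(NamedTuple):
    opcode: int     # bits 15..12, instruction type
    src_reg: int    # bits 11..8, source select vector
    ad: int         # bit 7, destination addressing mode
    byte_mode: int  # bit 6, 1 = byte execution, 0 = word execution
    as_mode: int    # bits 5..4, source addressing mode
    dst_reg: int    # bits 3..0, destination select vector


AS_NAMES = {
    0: "register direct",
    1: "indexed",
    2: "register indirect",
    3: "indirect autoincrement",
}


def decode_double_operand(word: int) -> DoubleOperand:
    """Split a 16-bit instruction word into its double-operand fields."""
    return DoubleOperand(
        opcode=(word >> 12) & 0xF,
        src_reg=(word >> 8) & 0xF,
        ad=(word >> 7) & 0x1,
        byte_mode=(word >> 6) & 0x1,
        as_mode=(word >> 4) & 0x3,
        dst_reg=word & 0xF,
    )


insn = decode_double_operand(0x5526)  # encodes ADD @R5, R6
print(insn, AS_NAMES[insn.as_mode])
```

For the example word, the source addressing mode decodes as register indirect, which is the case in which the operand fetch stage reads the source operand from the data memory.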
5.3 Asynchronous SRAM MTNCL Integration 
Memory read and write delays are the limiting factor with respect to instruction execution 
time for both synchronous and asynchronous systems. Most synchronous system architectures,
including the MSP430, integrate multiple clocks for different power modes. Also, different clock
frequencies are used for different system blocks depending on targeted execution times or power 
budgets. This means that for a synchronous system, the core is not limited by memory access 

















































Figure 23: MSP430 MTNCL Asynchronous RAM Read Access Logic 
On the other hand, an MTNCL pipeline propagation delay is the sum of all stages’ 
delays. Therefore, to achieve the same behavior implemented in synchronous systems, the 
asynchronous SRAM needs to be integrated such that its delay only affects the pipeline when a 
memory access is required. This implementation is exemplified in Figure 23 illustrating the 
SRAM integration with respect to data memory read during the operand fetch stage. 
The control logic shown in Figure 23 focuses only on the read process control flow. To
accommodate the write operation, the actual implementation requires more attention to
handshaking signal control. For instance, the DMEM_slpout signal, which is the data memory
output stage ko, falls (1) when a read operation is completed during the operand fetch stage, and
(2) when the write operation is finished during the write back stage. Also, the memory address input
is the register file source operand when reading from the memory, or the register file destination
operand when writing to the memory. Therefore, in addition to the blocks represented in Figure
23, a MUX block is added for read and write address selection, and the handshaking signal control
accounts for memory accesses in both the operand fetch and write back stages.
The asynchronous data memory is integrated with respect to read operation in the 
following steps:  
• At the output end of the data memory, the execution stage input register receives 
the source operand from the register file when executing a source direct 
addressing instruction, or from the data memory when executing a source indirect 
addressing instruction. The MUX block selects the register file source operand 
output (RF_op_src) or the data memory output (DMEM_op_src) depending on 
the addressing mode. The Mem_en signal, which is asserted for indirect 
addressing mode, is used as the MUX select input to select DMEM_op_src when 
‘1’, and RF_op_src when ‘0’. The completion logic block sleep signal input, 
which is connected to the preceding stage’s ko in regular MTNCL pipeline, is 
computed with ko signals from the register file and the data memory as detailed in 
Figure 23; 
• At the input end, the data memory input register is slept unless an instruction 
requires memory access. The Mem_en_B signal is asserted unless a source 
indirect instruction is executed, which keeps the data memory input register slept; 
• The Boolean AND and OR gates as depicted in Figure 23 represent only the 
general idea as it pertains to handshaking signal control. The actual 
implementation requires attention to propagation delays, thus MTNCL and NCL 
gates are used. 
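The read-side selection described in the steps above can be sketched in software. The function name and the dictionary used as data memory are hypothetical. The sketch covers only the MUX selection and the sleeping of the memory input register; the ko combination logic of Figure 23 is not modeled.

```python
from typing import Dict, Tuple

# Sketch of source operand selection during operand fetch.
# mem_en = 1 for a source indirect instruction, 0 for source direct.


def fetch_source_operand(
    mem_en: int, rf_op_src: int, data_memory: Dict[int, int]
) -> Tuple[int, bool]:
    """Return (source operand, whether the memory input register stays slept)."""
    mem_en_b = mem_en ^ 1              # Mem_en_B is the complement of Mem_en
    dmem_input_slept = mem_en_b == 1
    if dmem_input_slept:
        # Direct addressing: the MUX selects RF_op_src and memory is untouched.
        return rf_op_src, True
    # Indirect addressing: the register value is used as the memory address,
    # and the MUX selects DMEM_op_src.
    dmem_op_src = data_memory[rf_op_src]
    return dmem_op_src, False


memory = {0x0200: 0x1234}
print(fetch_source_operand(0, 0x0200, memory))
print(fetch_source_operand(1, 0x0200, memory))
```

Because the memory input register is slept for direct addressing, the memory delay is added to the pipeline only when an instruction actually needs a memory access.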
  
6 SIMULATIONS, RESULTS, AND ANALYSIS 
As mentioned in previous sections, MTNCL methodology and magnetic RAMs are both 
known for their low leakage power, and this dissertation work aims to evaluate advantages 
resulting from combining the two methodologies. To demonstrate the MTNCL-STT RAM 
system advantages, first, the MTNCL core, the CMOS RAM, and the STT RAM designs are 
simulated individually. Later, the MTNCL-CMOS RAM and the MTNCL-STT RAM systems 
are simulated under the same conditions. Device-level schematic simulations for functionality 
and energy measurements are performed using the Ultrasim simulator in Cadence Analog Design 
Environment. Results are compared in terms of execution time, active energy, and leakage 
power. Finally, unique advantages of the MTNCL-STT RAM system are presented and 
explained. 
6.1 MTNCL Core Simulation 
For the stand-alone MTNCL core simulation, only those instructions using register file 
operands were included. A program memory was used to provide test instructions, so the core-
program memory interface is tested as well.  Simulations start with a Verilog-A testbench 
writing instructions into the program memory. This emulates the MSP430 functionality where a 
serial debug interface (SDI) accesses the core registers and memory interface so that an outside 
computer can be used for system debugging. For simplification, the Verilog-A testbench is 
directly connected to the program memory. The program memory used for this test is the same 
asynchronous CMOS RAM design used for data memory in the MTNCL-CMOS RAM system, 
but a wrapper was added to communicate with the Verilog-A testbench during the initiation stage 
and communicate back and forth with the MTNCL core during execution. The Verilog-A 
testbench keeps the core RST signal asserted and CPU_en de-asserted while it writes instructions 
into the program memory.  Once test instructions have been written into the program memory, 
the core is activated, and the execution begins. Figure 24 shows the MTNCL core, the program 

































Figure 24: MTNCL core and Program Counter Initiation Set-up. 
  
Table 3: MTNCL Core Stand-alone Simulation Results. 
Instruction Energy (J) Time (ns) Power (W) 
ADD 1.18E-11 10.85 1.09E-03 
SUB 1.15E-11 10.04 1.15E-03 
XOR 1.17E-11 9.91 1.19E-03 
OR 1.15E-11 9.89 1.17E-03 
AND 1.16E-11 10.01 1.16E-03 
RRA 1.15E-11 9.73 1.19E-03 
AVERAGE 1.16E-11 10.07 1.16E-03 
 
Energy, execution time, and active power dissipation results for executed instructions are 
presented in Table 3. The balanced execution time is a result of the identical data path followed 
by all executed instructions. Instructions involving the adder component (ADD and SUB) 
dissipate slightly higher energy. Logic instructions (XOR, OR, AND) use slightly higher energy 
than the Arithmetic Rotate instruction. This is because the Rotate block implementation requires 
the smallest number of gates amongst all ALU components. Tested instructions data paths only 
differ by the ALU component they use; as a result, execution time, energy, and power are 
effectively the same as illustrated by the histogram in Figure 25. 
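The power column of Table 3 is consistent with dividing each instruction's energy by its execution time. The short sketch below recomputes it from the tabulated energy and time values; small differences from the reported power are expected because the table entries are rounded.

```python
# Recompute average active power from the Table 3 energy and time columns.
# Each entry is (energy in J, time in ns, reported power in W).

TABLE_3 = {
    "ADD": (1.18e-11, 10.85, 1.09e-3),
    "SUB": (1.15e-11, 10.04, 1.15e-3),
    "XOR": (1.17e-11, 9.91, 1.19e-3),
    "OR": (1.15e-11, 9.89, 1.17e-3),
    "AND": (1.16e-11, 10.01, 1.16e-3),
    "RRA": (1.15e-11, 9.73, 1.19e-3),
}


def average_power_w(energy_j: float, time_ns: float) -> float:
    """Average power in watts: energy divided by execution time."""
    return energy_j / (time_ns * 1e-9)


for name, (energy_j, time_ns, reported_w) in TABLE_3.items():
    computed_w = average_power_w(energy_j, time_ns)
    print(f"{name}: computed {computed_w:.3e} W, reported {reported_w:.2e} W")
```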
 
Figure 25: MTNCL Execution Time, Energy and Power Comparison Across Various 
Instructions. 
6.1.1 MTNCL Leakage Power 
Results showed that the MTNCL core presents low leakage power as expected.
Operational leakage power, which is the power measured during a NULL cycle immediately
followed by a DATA cycle, is measured to be 2.42E-05 W, which is 2.09% of the average active
power. The idle mode leakage, which is the power measured following a significant time of
inactivity, fluctuates between negative and positive values such that the average leakage power is practically
negligible: the measured average is 2.16E-07 W, which is 0.02% of the average active power.









[Figure 25 chart: Time (ns), Energy (pJ), and Power (100 uW) for the ADD, XOR, OR, AND, SUB, and RRA instructions.]
6.2 CMOS RAM Simulation 
A stand-alone CMOS RAM simulation is performed.  A Verilog-A testbench emulating
an MTNCL core is used to perform multiple read and write operations. Averaged read and write
times, energy, and power results are summarized in Table 4.
Table 4: CMOS RAM Read and Write Time, Energy, and Power.
Operation Time (ns) Energy (J) Power (W) 
Write 1.217 1.41E-12 1.15E-03 
Read 1.225 1.44E-12 1.18E-03 
 
Read and write operations take practically the same time and consume
approximately the same energy. The leakage power of 7.52E-04 W, 63% of the active power, is
high. However, for an SRAM that is kept in standby mode or pre-charged, high
leakage power is expected.
6.3 STT RAM Simulation 
The STT RAM design is tested under the same conditions as the CMOS RAM. Results 
are presented in Table 5. 
Table 5: STT RAM Read and Write Time, Energy, and Power. 
Operation Time (ns) Energy (J) Power (W) 
Write 7.577 1.47E-11 1.94E-03 
Read 1.924 3.85E-12 2.00E-03 
 
In comparison to the CMOS RAM design, the STT RAM design presents considerably
higher write time and energy. However, simulation results highlight the leakage
advantages of the STT RAM design. STT RAM leakage power falls quickly right after a read or a write. The measured
leakage power of 2.39E-06 W, which is 0.32% of that of the CMOS RAM design, is due only to
CMOS logic, and it is equivalent to 0.37% of the system active power dissipation, which is
extremely low. The diagram in Figure 26 illustrates the CMOS RAM design’s advantages in read
and write times and energy, but the STT RAM leakage is negligible compared to CMOS RAM
leakage.
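The trade-off between the two memories can be quantified directly from Tables 4 and 5 and the reported leakage figures. The sketch below computes the time, energy, and leakage ratios. The last quantity, a break-even idle time, is an illustrative derived figure that is not reported in this work: it assumes that the leakage powers stay constant while idle and compares them against the extra energy of a single STT RAM write.

```python
# Ratios between the STT RAM and CMOS RAM results (Tables 4 and 5).

CMOS = {"write_ns": 1.217, "write_j": 1.41e-12,
        "read_ns": 1.225, "read_j": 1.44e-12, "leak_w": 7.52e-4}
STT = {"write_ns": 7.577, "write_j": 1.47e-11,
       "read_ns": 1.924, "read_j": 3.85e-12, "leak_w": 2.39e-6}

write_time_ratio = STT["write_ns"] / CMOS["write_ns"]
write_energy_ratio = STT["write_j"] / CMOS["write_j"]
read_time_ratio = STT["read_ns"] / CMOS["read_ns"]
leak_percent = 100.0 * STT["leak_w"] / CMOS["leak_w"]

# Illustrative assumption: constant leakage power during idle time.
extra_write_j = STT["write_j"] - CMOS["write_j"]
leak_saving_w = CMOS["leak_w"] - STT["leak_w"]
break_even_idle_ns = extra_write_j / leak_saving_w * 1e9

print(f"write time x{write_time_ratio:.2f}, write energy x{write_energy_ratio:.2f}")
print(f"read time x{read_time_ratio:.2f}, leakage {leak_percent:.2f}% of CMOS")
print(f"idle time offsetting one STT write: {break_even_idle_ns:.1f} ns")
```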
 
Figure 26: CMOS and STT RAM Read and Write Time, Energy, Power, and Leakage 
Comparison. 
6.4 MTNCL Core and CMOS RAM System Simulation 
Table 6 shows simulation results for the MTNCL-CMOS RAM system. Instructions 
using the register file (RF) for source and destination operands require the same energy and 










[Figure 26 chart: CMOS RAM versus STT-RAM read time (ns), write time (ns), energy (pJ), power (mW), and leakage power (100 uW).]
total energy increases considerably due to CMOS RAM leakage. Instructions accessing the 
memory for read or write take longer to execute due to the added memory access time, and as 
expected, instructions that read and write to memory in the same cycle require the longest 
execution time. Table 6 presents execution time, core and data memory energy, and the total 
energy in relation to operands source and destination.  
Table 6: MTNCL-CMOS RAM System Execution Time and Energy Simulation Results.
Instruction  Source  Destination  Time (ns)  Core Energy (J)  Memory Energy (J)  Total Energy (J)
ADD          RF      RF           10.858     1.188E-11        8.175E-12          2.005E-11
XOR          RF      RF           9.9173     1.17E-11         7.465E-12          1.916E-11
OR           RF      Memory       12.215     1.232E-11        8.785E-12          2.111E-11
SUB          Memory  RF           12.649     1.327E-11        1.234E-11          2.561E-11
ADD          Memory  Memory       15.446     1.234E-11        1.496E-11          2.73E-11
 
6.5 MTNCL Core and STT RAM System Simulation 
The MTNCL-STT RAM system is tested under the same conditions as the MTNCL-CMOS
RAM system. Execution time, core and memory energy, and the total energy are
summarized in Table 7. As in the MTNCL-CMOS RAM system, execution time is not affected
by memory read and write latencies when all operands come from the register file. In addition,
due to the low leakage power of the STT RAM, the total system energy dissipation when data memory is
not used is practically the same as the MTNCL core energy dissipation executing the same
instructions in stand-alone mode, which is an important advantage over the MTNCL-CMOS
RAM system. The MTNCL-STT RAM system presents longer execution times for instructions
requiring a memory access. This is expected considering STT RAM read and write times. For
example, an instruction that reads and writes to memory takes around 22.5 ns compared to 15.4 ns
for the MTNCL-CMOS RAM system. The difference between the two execution times matches the
difference between the STT RAM and CMOS RAM read and write times.
Results in Table 7 also show that the core energy remains almost unchanged when the STT
RAM write delay extends the execution time considerably. This is an advantage due to the MTNCL
low leakage property. Despite a 90% increase in execution time for an instruction that writes to
memory rather than the register file, the MTNCL core energy dissipation only increases by 10%.
Table 7: MTNCL-STT RAM Execution Time and Energy Simulation Results.
Instruction  Source  Destination  Time (ns)  Core Energy (J)  Memory Energy (J)  Total Energy (J)
ADD          RF      RF           10.85      1.18E-11         1.35E-15           1.18E-11
XOR          RF      RF           9.91       1.17E-11         1.24E-15           1.17E-11
OR           RF      Memory       18.61      1.25E-11         2.4E-11            3.65E-11
SUB          Memory  RF           13.37      1.33E-11         9.50E-12           2.28E-11
ADD          Memory  Memory       22.57      1.31E-11         2.86E-11           4.17E-11
 
The histogram in Figure 27 summarizes execution time, core energy, and total system 
energy comparisons between the MTNCL-CMOS RAM and the MTNCL-STT RAM systems. 




Figure 27: MTNCL-STT RAM and MTNCL-CMOS RAM Execution Time and Energy 
Comparison. 
1. Execution time is the same for both systems for instructions using the Register File
for source and destination operands.
Reason: MTNCL QDI methodology allows the two SRAMs to be 
swapped without any changes to the core data path or interface. 
2. The core energy is approximately the same for the same instruction types on both 
systems despite big differences in execution times. 
Reason: MTNCL core low leakage allows a minimal energy overhead for 
the MTNCL-STT RAM system that requires a significantly longer execution 
time. 
3. The total energy dissipated by the MTNCL-CMOS RAM system to execute 











[Figure 27 chart: Time (ns), Core Energy, and Total Energy for the MTNCL-CMOS RAM and MTNCL-STT RAM systems.]
alone core energy as well as the MTNCL-STT RAM system energy executing the 
same instructions.  
Reason: CMOS RAM leakage power adds a significant cost to the system 
total energy dissipation. This is the biggest concern for CMOS SRAM designs in 
most applications. 
4. The MTNCL-STT RAM system presents the longest execution time and highest 
energy dissipation especially for “read and write” instructions where 46% and 
53% overheads are incurred in execution time and energy dissipation, 
respectively.  
Reason:  STT devices require high currents and long state switching times, 
which results in high write latency and high read and write energy. 
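The overheads quoted in item 4 can be reproduced from the “ADD Memory Memory” rows of Tables 6 and 7, and the execution time difference can be compared against the stand-alone memory delays in Tables 4 and 5. The sketch below performs this arithmetic; variable names are chosen for illustration.

```python
# Overheads of the MTNCL-STT RAM system for a "read and write" instruction,
# using the ADD Memory Memory rows of Tables 6 and 7.

CMOS_TIME_NS, CMOS_TOTAL_J = 15.446, 2.73e-11   # Table 6
STT_TIME_NS, STT_TOTAL_J = 22.57, 4.17e-11      # Table 7

time_overhead = STT_TIME_NS / CMOS_TIME_NS - 1.0
energy_overhead = STT_TOTAL_J / CMOS_TOTAL_J - 1.0

# Stand-alone read + write delays from Tables 4 and 5.
cmos_access_ns = 1.217 + 1.225
stt_access_ns = 7.577 + 1.924

system_delta_ns = STT_TIME_NS - CMOS_TIME_NS
memory_delta_ns = stt_access_ns - cmos_access_ns

print(f"time overhead: {time_overhead:.1%}, energy overhead: {energy_overhead:.1%}")
print(f"system delta {system_delta_ns:.2f} ns vs memory delta {memory_delta_ns:.2f} ns")
```

The two deltas agree to within a tenth of a nanosecond, which supports the observation that the longer execution time comes from the memory access itself.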
6.6 MTNCL-STT RAM System Unique Advantages  
While the analysis in the previous sections focuses on the delay and energy comparisons
between the MTNCL-STT RAM and MTNCL-CMOS RAM systems, this section focuses on
applications that are only possible with the MTNCL-STT RAM system. The MTNCL
handshaking protocol for QDI designs combined with the non-volatility property of STT RAMs























Figure 28: MTNCL core to Asynchronous SRAM Interface-Memory Power Failure Interrupt 
Given a self-timed STT RAM with an MTNCL interface, data propagation from the MTNCL
core to the STT RAM follows the MTNCL protocol, where DATA is only passed to the next stage
when ko is ‘1’.  As illustrated in Figure 28, the DMEM_ko signal comes from the data memory and
serves as ko to the core. When memory power is cut off, every signal in the STT RAM
design falls, including DMEM_ko, which translates into a request for NULL to the core. When
the core executes an instruction requiring memory access while the STT RAM is off, the core is
forced to wait for the STT RAM to recover and assert its ko signal before the execution can
continue. Since STT RAM is a non-volatile memory, its content is intact when power is restored,
and its control signals resume in a NULL state, which asserts ko to request DATA from the core.
In this way, the MTNCL-STT RAM system handles a memory power failure with zero design
and implementation overhead.
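The stall-and-resume behavior can be illustrated with a small behavioral model. The class and function names are hypothetical, time is reduced to abstract steps, and the handshake is collapsed to a single ko check; the sketch only shows that the core waits while ko is ‘0’ and that stored content survives the power loss.

```python
# Behavioral sketch of a memory power failure handled through the ko signal.

class SttRamModel:
    def __init__(self) -> None:
        self.powered = True
        self.cells = {}  # non-volatile: contents are kept when power is off

    @property
    def ko(self) -> int:
        """ko falls with every other signal when power is cut; once powered,
        the interface is in a NULL state and asserts ko to request DATA."""
        return 1 if self.powered else 0


def core_write(ram: SttRamModel, address: int, value: int,
               restore_step: int, max_steps: int = 1000) -> int:
    """Attempt a write; return the number of steps the core spent stalled."""
    for step in range(max_steps):
        if step == restore_step:
            ram.powered = True           # memory power is restored
        if ram.ko == 1:                  # DATA is only passed when ko is '1'
            ram.cells[address] = value
            return step
    raise TimeoutError("memory power was never restored")


ram = SttRamModel()
ram.cells[0x10] = 7        # content written before the power failure
ram.powered = False        # power is cut right before the next access
stalled_steps = core_write(ram, 0x20, 9, restore_step=5)
print(stalled_steps, ram.cells)
```

No dedicated interrupt or recovery logic appears in the sketch: the stall follows from the same ko condition that governs every other transfer.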
The described scenario is implemented and simulated. The power supply to the STT
RAM block is set to 0 V right before the MTNCL core sends an address to memory. A short and a
long interruption were simulated to analyze how the core leakage power progresses with time.
As expected, the core halts the execution while waiting for memory feedback and resumes after
memory power is restored. The halted instruction resumes correctly, and the core continues with
the next instruction from the program memory. Simulation results for the 2882 ns execution
interruption are presented in Figure 29. At 117 ns, the STT RAM power supply is set to 0 V; thus, the STT
RAM output signals DMEM_ko and DMEM_slpo fall. DMEM_ki is asserted to request DATA
from the data memory, but the data memory output is NULL despite DMEM_slpo being ‘0’.  At
the data memory input side, DMEM_slpIn is asserted, and it remains asserted for the whole
interruption time because, with DMEM_ko being ‘0’, DATA cannot be sent to memory.
DMEM_cen (data memory chip enable signal) remains de-asserted as well due to DMEM_ko being
‘0’.
Simulation results in Figure 30 show the MTNCL-STT RAM system resuming execution
after the STT RAM power supply is set to 1.2 V. At 3000 ns, power is restored, and subsequently
DMEM_ko rises, DMEM_slpIn falls, and DMEM_cen rises. DMEM_ko falls right after
DMEM_slpIn falls as expected, and DMEM_slpo falls a few nanoseconds later, which means
the STT RAM output to the core is DATA. The core finishes the execution, the cycle is complete,




Figure 29: MTNCL-STT RAM System Memory Power Failure Simulation Results. 
 
Figure 30: MTNCL-STT RAM System Resuming from Memory Power Failure Interruption. 
The core and DMEM energy for the whole execution time of the uninterrupted execution, 
the short interruption, and the long interruption are recorded in Table 8. Figure 31 further 
clarifies the core and memory energy variations for the uninterrupted execution, short and long 
interruptions. Even though the core energy increases with the interruption time, the increase in 
energy dissipation is practically negligible compared to the interruption time. For instance, the 
added short interruption is 15 times the uninterrupted execution, but the core energy dissipation 
only increases by 32%. The long interruption is 215 times the uninterrupted execution, but the 
core energy overhead is only 1.75 times the energy for the uninterrupted execution. The total 
memory energy is the same for the short and long interruption, and it is 52% higher than the 
uninterrupted execution energy consumption. The difference is spent shutting down and restoring
the control signal circuitry.
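The interruption-to-execution ratios quoted above can be reproduced with a short calculation. The uninterrupted execution time of 13.37 ns used below is an assumption: it is the Table 7 time of the instruction whose core and memory energies match the uninterrupted entry, not a value stated alongside the interruption results.

```python
# Ratio of interruption time to uninterrupted execution time.
# Assumption: the uninterrupted execution takes 13.37 ns (Table 7, SUB Memory RF).

UNINTERRUPTED_NS = 13.37
SHORT_INTERRUPTION_NS = 198.0
LONG_INTERRUPTION_NS = 2882.0

short_ratio = SHORT_INTERRUPTION_NS / UNINTERRUPTED_NS
long_ratio = LONG_INTERRUPTION_NS / UNINTERRUPTED_NS

print(f"short interruption: {short_ratio:.2f} times the uninterrupted execution")
print(f"long interruption: {long_ratio:.2f} times the uninterrupted execution")
```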
















1.33E-11 9.50E-12 0 0
Short Interruption (198ns) 1.58E-11 1.42E-11 0.32 14.83
Long Interruption (2882ns)




Figure 31: Core and Memory Energy Dissipation Comparison for Different Memory 
Interruption Times 
The memory power failure scenario is one example of the MTNCL-STT RAM system 
fast turn-on advantages. Results show that the STT memory device non-volatility property can be 
leveraged with little overhead incurred. This can be extended to a large variety of applications. 
For instance, a low power mode can be implemented such that the core goes into a standby state 
when the system power goes below a given threshold. Simulation results show that the core 
standby power dissipation, 2.79E-05W, is only 2.8% of the active core power and 1.51% of the 
average total system active power. This means that the MTNCL-STT RAM system power can be
reduced to as low as 1.51% of its average active power, and the system will resume normal execution with zero delay when full















[Figure 31 chart: Core Energy and DMEM Energy across different execution times due to memory power failure.]
7 CONCLUSION 
This dissertation explores three different technologies and creates a system that leverages 
their advantages. The MTNCL-STT RAM system incorporates the MTNCL design methodology 
with the MSP430 microcontroller architecture and STT RAM technology. The MTNCL design 
methodology is chosen for its low leakage power and flexible timing requirements; the MSP430 
microcontroller is chosen for its low power architecture; and the STT RAM technology is chosen 
for its zero-leakage and non-volatility properties. As a result, the MTNCL-STT RAM is a non-
volatile low leakage Quasi-Delay-Insensitive system.  
The MTNCL-STT RAM system is compared to another system that implements the same 
functionality but using a CMOS volatile RAM. Simulation results show that the MTNCL-STT 
RAM fulfills expected goals. The system presents low leakage power advantages across various 
modes and applications. Results show that the memory leakage contribution to the total system 
energy is negligible for the MTNCL-STT RAM system. In contrast, memory leakage is the main 
concern for the system with the volatile CMOS RAM in which for some instructions, leakage 
energy accounts for up to 40% of total system energy. Furthermore, the combination of a non-
volatile memory with MTNCL yields unique advantages in applications such as the memory 
power failure handling and low power standby mode that are only possible with the MTNCL-
STT RAM system.  
However, the STT memory device model used in this dissertation requires high currents 
and long delays to switch states. This translates into long memory write times and high active 
energy. The MTNCL-CMOS RAM system presents better execution time and better active 
energy compared to the MTNCL-STT RAM system. For future work, a more up-to-date
STT memory device model should be used.
8 REFERENCES
[1] N. S. Kim et al., "Leakage current: Moore's law meets static power," in Computer, vol. 
36, no. 12, pp. 68-75, Dec. 2003, doi: 10.1109/MC.2003.1250885. 
[2] S. Li, K. Chen, J. H. Ahn, J. B. Brockman and N. P. Jouppi, "CACTI-P: Architecture-
level modeling for SRAM-based structures with advanced leakage reduction techniques," 
2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San 
Jose, CA, USA, 2011, pp. 694-701, doi: 10.1109/ICCAD.2011.6105405. 
[3] Varghese George et al., "Penryn: 45-nm next generation Intel® core™ 2 processor," 
2007 IEEE Asian Solid-State Circuits Conference, Jeju, Korea (South), 2007, pp. 14-17, 
doi: 10.1109/ASSCC.2007.4425784. 
[4] S. Rusu, S. Tam, H. Muljono, D. Ayers and J. Chang, "A Dual-Core Multi-Threaded 
Xeon Processor with 16MB L3 Cache," 2006 IEEE International Solid State Circuits 
Conference - Digest of Technical Papers, San Francisco, CA, USA, 2006, pp. 315-324, 
doi: 10.1109/ISSCC.2006.1696062. 
[5] Semiconductor Industries Association, "International Technology Roadmap for
Semiconductors (ITRS) / Model for Assessment of CMOS Technologies and Roadmaps
(MASTAR)," http://www.itrs.net/.
[6] Y. Zhang et al., "Compact Modeling of Perpendicular-Anisotropy CoFeB/MgO Magnetic 
Tunnel Junctions," in IEEE Transactions on Electron Devices, vol. 59, no. 3, pp. 819-
826, March 2012, doi: 10.1109/TED.2011.2178416. 
[7] E. Kültürsay, M. Kandemir, A. Sivasubramaniam and O. Mutlu, "Evaluating STT-RAM 
as an energy-efficient main memory alternative," 2013 IEEE International Symposium on 
Performance Analysis of Systems and Software (ISPASS), Austin, TX, USA, 2013, pp. 
256-267, doi: 10.1109/ISPASS.2013.6557176. 
[8] X. Fong et al., "Spin-Transfer Torque Devices for Logic and Memory: Prospects and 
Perspectives," in IEEE Transactions on Computer-Aided Design of Integrated Circuits 
and Systems, vol. 35, no. 1, pp. 1-22, Jan. 2016, doi: 10.1109/TCAD.2015.2481793. 
[9] X. Fong, Y. Kim, R. Venkatesan, S. H. Choday, A. Raghunathan and K. Roy, "Spin-
Transfer Torque Memories: Devices, Circuits, and Systems," in Proceedings of the IEEE, 
vol. 104, no. 7, pp. 1449-1488, July 2016, doi: 10.1109/JPROC.2016.2521712. 
[10] J. Chen, K. Chong, B. Gwee and J. S. Chang, "An ultra-low power asynchronous quasi-
delay-insensitive (QDI) sub-threshold memory with bit-interleaving and completion 
detection," Proceedings of the 8th IEEE International NEWCAS Conference 2010, 
Montreal, QC, Canada, 2010, pp. 117-120, doi: 10.1109/NEWCAS.2010.5603713. 
[11] J. Dama and A. Lines, "GHz Asynchronous SRAM in 65nm," 2009 15th IEEE 
Symposium on Asynchronous Circuits and Systems, Chapel Hill, NC, USA, 2009, pp. 
85-94, doi: 10.1109/ASYNC.2009.23. 
[12] C. D. C. Arandilla and J. A. R. Madamba, "Comparison of Replica Bitline Technique and 
Chain Delay Technique as Read Timing Control for Low-Power Asynchronous 
SRAM," 2011 Fifth Asia Modelling Symposium, Manila, Philippines, 2011, pp. 275-278, 
doi: 10.1109/AMS.2011.58. 
[13] K. M. Fant and S. A. Brandt, "Null Convention Logic: A Complete and Consistent Logic 
for Asynchronous Digital Circuit Synthesis," International Conference on Application 
Specific Systems, Architectures, and Processors, pp. 261-273, 1996.  
[14] S. C. Smith and J. Di, Designing Asynchronous Circuits using NULL Convention Logic 
(NCL), Morgan & Claypool Publishers, 2009. 
[15] J. Di and S. C. Smith, "Ultra-Low Power Multi-Threshold Asynchronous Circuit 
Design," United States Patent 7,977,972 B2, July 2011. 
[16] J. T. Kao and A. P. Chandrakasan, "Dual-threshold voltage techniques for low-power 
digital circuits," in IEEE Journal of Solid-State Circuits, vol. 35, no. 7, pp. 1009-1018, 
July 2000, doi: 10.1109/4.848210. 
[17] O. Girard, "openMSP430 Documentation," 2012. [Online]. 
[18] K. Kuan and T. Adegbija, "HALLS: An Energy-Efficient Highly Adaptable Last Level 
STT-RAM Cache for Multicore Systems," in IEEE Transactions on Computers, vol. 68, 
no. 11, pp. 1623-1634, 1 Nov. 2019, doi: 10.1109/TC.2019.2918153. 
[19] Y. Liu et al., "4.7 A 65nm ReRAM-enabled nonvolatile processor with 6× reduction in 
restore time and 4× higher clock frequency using adaptive data retention and self-write-
termination nonvolatile logic," 2016 IEEE International Solid-State Circuits Conference 
(ISSCC), San Francisco, CA, USA, 2016, pp. 84-86, doi: 10.1109/ISSCC.2016.7417918. 
[20] Z. Wang et al., "Proposal of Toggle Spin Torques Magnetic RAM for Ultrafast 
Computing," in IEEE Electron Device Letters, vol. 40, no. 5, pp. 726-729, May 2019, 
doi: 10.1109/LED.2019.2907063. 
[21] M. Hinds, "Design and Analysis of an Asynchronous Microcontroller," Theses and 
Dissertations, 1664, 2016. http://scholarworks.uark.edu/etd/1664 
 
 
 
