Energy Efficient Neocortex-Inspired Systems with On-Device Learning by Zyarah, Abdullah M
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
9-2020 
Energy Efficient Neocortex-Inspired Systems with On-Device 
Learning 
Abdullah M. Zyarah 
amz6011@rit.edu 
Recommended Citation 
Zyarah, Abdullah M., "Energy Efficient Neocortex-Inspired Systems with On-Device Learning" (2020). 
Thesis. Rochester Institute of Technology. Accessed from 
This Dissertation is brought to you for free and open access by RIT Scholar Works. It has been accepted for 
inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact 
ritscholarworks@rit.edu. 
Energy Efficient Neocortex-Inspired
Systems with On-Device Learning
by
Abdullah M. Zyarah
A dissertation submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
in Engineering
Department of Computer Engineering
Kate Gleason College of Engineering








Committee Approval: We, the undersigned committee members, certify that
we have advised and/or supervised the candidate on the work described in this
dissertation. We further certify that we have reviewed the dissertation manuscript
and approve it in partial fulfillment of the requirements of the degree of Doctor of
Philosophy in Engineering.
Dr. Dhireesha Kudithipudi, Advisor Date
Professor, Department of Computer Engineering, RIT
Dr. Santosh Kurinec, Committee Member Date
Professor, Department of Electrical and Microelectronics Engineering, RIT
Dr. Cory Merkel, Committee Member Date
Assistant Professor, Department of Computer Engineering, RIT
Dr. Haibo He, Committee Member Date
Professor, Department of Electrical, Computer, and Biomedical Engineering, URI
Certified By:
Dr. Edward Hensel Date
Professor, Director, PhD in Engineering, RIT
ABSTRACT
Degree: Doctor of Philosophy
Author’s name: Abdullah M. Zyarah
Advisor’s name: Prof. Dr. Dhireesha Kudithipudi
Dissertation title: Energy Efficient Neocortex-Inspired Systems with On-Device
Learning
Shifting compute workloads from the cloud toward edge devices can significantly
improve the overall latency for inference and learning. At the same time, this paradigm
shift exacerbates the resource constraints on edge devices. Neuromorphic
computing architectures, inspired by neural processes, are natural substrates
for edge devices. They offer co-located memory, in-situ training, energy efficiency,
high memory density, and compute capacity in a small form factor. Owing to
these features, in the recent past, there has been a rapid proliferation of hybrid
CMOS/Memristor neuromorphic computing systems. However, most of these
systems offer limited plasticity, target either spatial or temporal input streams,
and have not been demonstrated on large-scale heterogeneous tasks. There is a critical
knowledge gap in designing scalable neuromorphic systems that can support hybrid
plasticity for spatio-temporal input streams on edge devices.
This research proposes Pyragrid, a low-latency and energy-efficient neuromorphic
computing system for processing spatio-temporal information natively on the
edge. Pyragrid is a full-scale custom hybrid CMOS/Memristor architecture with
analog computational modules and an underlying digital communication scheme.
Pyragrid is designed for hierarchical temporal memory, a biomimetic sequence
memory algorithm inspired by the neocortex. It features a novel synthetic synapses
representation that enables dynamic synaptic pathways with reduced memory usage
and interconnects. The dynamic growth of the synaptic pathways is emulated
through the physical behavior of the memristor device, while the synaptic modulation
is enabled through a custom training scheme optimized for area and power.
Pyragrid features data reuse, in-memory computing, and event-driven sparse local
computing to reduce data movement by ≈ 44× and maximize system throughput
and power efficiency by ≈ 3× and ≈ 161× over a custom CMOS digital design. The
innate sparsity in Pyragrid results in overall robustness to noise and device failure,
particularly when processing visual input and predicting time series sequences.
Porting the proposed system to edge devices can enhance their computational
capability, response time, and battery life.
Dedicated to the stars that have enlightened my past, and are still
enlightening my present and future,
my father and mother,
you have always been my role models and a source of inspiration,
and my lovely wife,
your unlimited support, encouragement, and love
no words can describe it.
Acknowledgments
Praise be to Allah, the Lord of All Creations.
Here I am writing the last pages of my dissertation and all the great memories
start to pop up into my head. My feelings are mixed up!! Part of me is happy
and excited to start a whole new journey where I can teach and share with the next
generation of youth what I have learned. The other part of me is sad to leave my
RIT family headed by my advisor and the Neuromorphic AI lab members.
First and foremost, I would like to thank my advisor, Dr. Dhireesha Kudithipudi,
for all the guidance, generous support, and constructive feedback during my PhD
work. I am also grateful to the committee members: Dr. Cory Merkel, Dr. Haibo
He, Dr. Santosh Kurinec, and Dr. Clark Hochgraf for your time, feedback, and
valuable advice.
Many thanks go to RIT faculty and staff, specifically, Mr. Mark Indovina, Dr.
Cory Merkel, Dr. Marcin Lukowiak, and Richard Tolleson for your extraordinary
support. Many thanks also go to Dr. Edward Hensel and Rebecca Ziebarth for
making the PhD journey wonderful. Without Dr. Hensel’s wisdom and Rebecca’s
exceptional support, the PhD journey would have been much more difficult.
I would also like to express my gratitude to my colleagues, faculty members at
the Department of Electrical Engineering, University of Baghdad. Your assistance
and continuous encouragement made me stronger and gave me the courage to face
numerous challenges.
It has been an honor to work with great collaborators and researchers. Many thanks
go to Kevin Gomez, Zoran Jandric, Aditya Jain (Seagate Technology), Dr. Garrett
Rose and his group (University of Tennessee), and Dr. Nate Cady and his group
(SUNY Albany). I also thank all the members of the Neuromorphic AI lab: Sayed
Hamed Fatemi, Nicholas Soures, Anurag Daram, Tej Pandit, Vedant Karia, Humza
Syed, Fatima Tuz Zohora, Zachariah Carmichael, and Luke Boudreau. Also to
my friends: Ahmed Alnabhan, Nazar Al-Wattar, Qutaiba Saleh, Saif Algraiti, and
Mohamed Maafa. I was so fortunate to be surrounded by such smart, passionate,
and helpful researchers. Thank you so much, guys, for always being there for me.
Finally, gratitude and sincerest thanks go to my parents, brothers, and sisters, I
am indebted to you for your love, encouragement, and support; my lovely wife
for your patience, motivation, and exceptional support; my two little angels, my

Contents
List of Tables
List of Figures
Frequently Used Symbols
1 Introduction
1.1 Motivation
1.2 Contributions and Thesis Outline
2 Background and Literature Review
2.1 Overview of Neuromorphic Systems
2.1.1 CMOS Neuromorphic Systems
2.1.2 Hybrid Neuromorphic Systems
2.2 Network Topologies
2.2.1 Multi-layer Neural Network: Architecture and Learning
2.2.2 Hierarchical Temporal Memory: Architecture and Learning
2.3 Summary
3 In-Situ Training for Multi-layer Feedforward Networks
3.1 Multi-layer Neural Networks
3.1.1 Network Topology: Extreme Learning Machine
3.1.2 Learning Rule: Stochastic Gradient Descent
3.2 Design Methodology
3.2.1 Learning Rule
3.2.2 Writing Circuit: Ziksa
3.3 System Design and Analysis
3.3.1 Memristive layer
3.3.2 Error Computing Unit
3.3.3 Training system
3.4 Chip Design and Fabrication
3.5 Experimental Results
3.5.1 Binomial and Multinomial Data Classification
3.5.2 Analysis of Stuck-at-Faults
3.5.3 Network Resilience
3.5.4 Power Consumption
3.6 Summary
4 Hierarchical Temporal Memory
4.1 Overview of the HTM Algorithm
4.2 HTM Encoder
4.3 HTM Region
4.3.1 Spatial Pooler
4.3.2 Temporal Memory
4.4 HTM Classifiers
4.4.1 SDR classifier
4.5 Summary
5 Design Methodology of HTM
5.1 HTM Synapse Modeling
5.2 Proximal Synapses Formation
5.2.1 Memristive Crossbar
5.2.2 Dynamic Memristive Crossbar
5.3 Distal Synapses Formation
5.3.1 Address Event Representation
5.3.2 Synthetic Synapses Representation
5.4 Homeostasis and Neurogenesis Plasticity Mechanisms
5.5 Summary
6 Pyragrid: HTM Neuromorphic SoC
6.1 HTM System Design and Implementation
6.2 HTM Region
6.2.1 HTM Mini-Column
6.2.1.1 Winner-take-all Circuit
6.2.1.2 Mini-column Training
6.2.2 HTM Cell
6.3 Synthetic Synapses Representation
6.4 SDR Classifier
6.5 Experimental Setup
6.5.1 HTM Network Setup
6.5.2 Memristor Device Parameters
6.6 Summary
7 HTM Results and Discussion
7.1 Spatial Tasks
7.1.1 Image Recognition
7.1.2 Noise Robustness
7.2 Temporal Tasks
7.2.1 Time-Series Prediction
7.2.2 Noise Robustness
7.3 System Evaluation
7.3.1 Latency
7.3.2 Network Reliability and LifeSpan
7.3.3 Device Failure and Network Robustness
7.3.4 Power Analysis
7.3.4.1 Power Consumption
7.3.4.2 Energy-Delay Product and Power Distribution
7.3.4.3 Discussion
7.4 Summary
8 Conclusions and Future Work
Appendices
A HTM Pseudo Code
A.1 Spatial Pooler
A.2 Temporal Memory
B Circuits and Device Sizes
C HTM on Edge Devices
C.1 Smartphones Power Budget and Distribution

List of Tables
3.1 ELM network topologies trained with SGD with various levels of
simplification.
3.2 Comparison of the SGD-based training system with different
simplification levels for a network size of 100×10.
3.3 Driving signals of the local control units and Ziksa to enable
bidirectional weight adjustment.
3.4 Classification accuracy (mean ± standard deviation) of binomial
and multinomial datasets using ELM and SoftELM networks. The
networks are trained using SGD and the accuracy is reported for
both the software version (baseline) and the hardware behavioral
model.
3.5 Total power consumption of the individual components, building
units, and the full size network.
5.1 A comparison of various memristor window functions based on the
desired HTM synapse features and the evaluation metrics described
in [1, 2].
6.1 HTM-SW parameters when benchmarking the network on visual
tasks (MNIST and YaleFaces) and time series prediction.
6.2 The device parameters used in the mini-column and cell designs.
7.1 Summary of image recognition accuracy for MNIST dataset using
HTM and other algorithms. Here, SP stands for the spatial pooler,
and nc denotes the number of mini-columns used by the network.
7.2 Summary of image recognition accuracy for Yalefaces dataset using
HTM and other algorithms. Here, SP stands for the spatial pooler;
nc denotes the number of mini-columns used by the network.
7.3 HTM network evaluation while predicting the next 2-5 steps ahead
in time of various time series benchmarks. We indicate the averaged
MAPE ± standard deviation over 5 trials.
7.4 Latency of the core components in the HTM system (nc = 961 and
nm = 4) estimated for the digital and mixed-signal designs while
processing the Hot-Gym dataset. The digital system is clocked at
100 MHz, while the mixed-signal system is dual clocked; 8 MHz is
used as a system clock and 128 MHz is used to clock the cells' LFSRs.
7.5 A comparison of the proposed HTM system with the previous work.
One may note that these implementations are on different substrates,
so this table offers a high-level reference template for HTM
hardware rather than an absolute comparison.
B.1 Transistor sizes of the comparator.
B.2 Transistor sizes of the Op-Amp.
C.1 Average power consumption by smartphones’ CPUs and GPUs when
used for compute-intensive tasks such as running machine learning
algorithms or video games.
List of Figures
1.1 Core design flow in this dissertation: it starts with abstracting the
salient structural and algorithmic properties of the biological neocortex
using HTM, progresses to mapping the HTM to a neuromorphic
chip, and ends by deploying the chip on resource-constrained platforms.
1.2 High-level architecture of the proposed neuromorphic system with
three core units: data encoders, HTM algorithm, and classifiers. The
encoder transfers the input data into a binary distributed representation.
The HTM algorithm learns spatial information and temporal
sequences, while the SDR classifier transfers the HTM output into a
meaningful data format.
2.1 Example of state-of-the-art digital and mixed-signal neuromorphic
chips developed by industry and academia [3].
2.2 (a) The structure of a memristor device along with its symbol.
The device is composed of HfO2 and TiN sandwiched by nanowires
of platinum. (b) The continual
increment/decrement of the current flowing in the memristor as
its resistance decreases/increases when a positive/negative voltage
is applied across it [4].
3.1 High-level representation of a three-layer feedforward network: a
buffer layer (input), feature extraction layer (hidden), and classification
layer (output).
3.2 (a) 2×2 memristive crossbar driven by a Ziksa unit. (b) and (c)
Ziksa units for adjusting current-threshold and voltage-threshold
memristor devices integrated into the crossbar structure. (d) Ziksa
unit with pass transistors used during the training to achieve precise
memristor tuning.
3.3 Architecture of a single layer network consisting of the memristive
layer, training system, and error computing unit. The memristive
layer core units are the neuron circuits and their associated synaptic
connections. The training system is composed of driving transistors
(Tr+ and Tr−), column local controllers (CLCs), and row local
controllers (RLCs), while the error computing unit has a subtractor,
absolute value circuit, and sign detection circuits.
3.4 A single layer network including neuron circuits and their memristive
synaptic connections modeled using the semi-trained crossbar.
3.5 Absolute value circuit to compute the absolute value of the error signal.
3.6 (Left) Input and output signals of the absolute value circuit. (Right)
Input and output characteristic of the absolute value circuit
implemented in IBM 65 nm process.
3.7 The training system as integrated in one column of the memristive
crossbar. The system has Ziksa driving transistors (Tr+ and
Tr−) controlled by the row local controller (left) and column local
controller (right).
3.8 Physical layout of the fabricated training circuit including Ziksa
writing circuit.
3.9 Training circuit including Ziksa writing circuit after fabrication.
3.10 (Top) Classification accuracy versus rate of device failure for different
topologies: (a) ELM‡ and (b) ELM†. (Bottom) Output layer weight
distributions for the corresponding networks in (top) when the
network experiences 50% stuck-off faults.
3.11 (Top) Classification accuracy versus rate of device failure for different
topologies: (a) SELM‡ and (b) SELM†. (Bottom) Output layer weight
distributions for the corresponding networks in (top) when the
network experiences 50% stuck-off faults.
3.12 Network recovery for different topologies after experiencing 50%
device stuck-off faults at the 10th epoch.
3.13 Classification accuracy as a function of increasing number of hidden
neurons when experiencing 30% device faults for (a) ELM† and (b)
SELM†.
3.14 Power consumption distribution for ELM (ELM†) and SoftELM
(SELM†).
4.1 The biological neocortex structure including the regions and the
building blocks (neurons), and their correspondence in the HTM
network.
4.2 High-level architecture of the HTM algorithm with three core units:
data encoders, HTM network, and classifiers. The encoder transforms
the input data into binary representation. The HTM network
learns spatial information and captures temporal transitions, while
the classifiers map the HTM output to the corresponding class labels
and identify anomalies.
4.3 (bottom) An input data stream representing power consumption
in a gym captured at every hour. (top) SDR representation of the
bottom signal as it is encoded by the HTM random distributed
scalar encoder.
4.4 Spatial pooler initialization phase. (left) Depicts the process of
selecting mini-columns’ receptive fields, where proximal connections
are formed. (right) The growth level of the individual proximal
connection as defined by the permanence is randomly set.
4.5 Spatial pooler overlap and inhibition phase. (left) Demonstrating
overlap score computing and nominating the mini-columns
(highlighted with bold black borders) for input representation. (right)
Inhibition process and selecting the active mini-columns (highlighted
in light-blue) based on the desired level of sparsity.
4.6 Spatial pooler learning phase. It involves updating the strength
(permanence) of the proximal connections for the active mini-columns
only and according to the Hebbian rule.
4.7 Learning sequences (‘3-4-5’) presented to an HTM region for the first
time. Thus, the region experiences bursting operation and formation
of distal connections between the cells.
4.8 After learning, the cells that were previously set into the predictive
state are chosen to represent the current input.
5.1 (a) Characteristic curves of the proposed Z-window function to
model the HTM synapses’ behavior. (b) and (c) The linkage to
the linear drift model and scalability features, respectively. (d) The
non-symmetrical behavior feature.
5.2 The conductance change (left) and hysteresis characteristic curve
(right) of the memristor while driving it with a positive pulse signal.
5.3 The conductance change (left) and hysteresis characteristic curve
(right) of the memristor while driving it with a negative pulse signal.
5.4 Fitting the memristor model to the physical device behavior while
modulating the device conductance with a train of pulses.
5.5 Mini-column receptive fields modeled by a sparsely connected memristor
crossbar implemented using (a) blocking memristor (b) predefined
mini-columns regional connections.
5.6 (a) LFSR used to generate a global RF. (b) LFSR with partially
used registers (red-base) to generate the local RF.
5.7 The address-event representation between two cores, sender and
receiver. When a neuron from the sender chip fires, its address gets
encoded and sent over the data bus indicating an event has occurred.
Whenever the receiver chip gets the address, its decoder generates a
spike in the corresponding location [5].
5.8 High-level description of the formation of synaptic pathways using
synthetic synapses representation, where the solid lines represent
successfully formed pathways, while the dotted line indicates an
attempt to form a connection with inactive cells that did not
succeed.
5.9 The accumulated activity of a select set of mini-columns along with
the corresponding boosting level recorded after various iterations.
5.10 The density of the potential synapses as linked to an input space
with activity centered to the middle region when the neurogenesis
mechanism is (a) disabled and (b) enabled.
6.1 High-level architecture of the HTM system composed of an encoder,
SDR classifier, and an HTM region. The encoder and SDR classifier
are utilized to convert sensory information to SDR representation and
to transform HTM neuronal activities into conventional data format,
respectively. The HTM region comprises √nc × √nc mini-columns
with nm cells each to process spatial and temporal information. The
data flow and communication within the region is facilitated through
the control units, and the communication arbiter and selector.
6.2 The circuit diagram of a mini-column in an HTM region. It consists of
a peripheral unit in which the proximal connections are generated and
connected to input space, a proximal unit to store the connections’
strength, and a WTA cell to enable the mini-columns to compete
for input representation.
6.3 The impact of the synaptic permanence (denoted as Permanence#)
modulation on the mini-column overlap score as the proximal synapses
(denoted as synapse#) receive feed-forward input.
6.4 Current conveyor circuit proposed by [6] to implement a WTA function.
6.5 (a) WTA circuit (nk number of cells) with local excitatory feedback.
6.6 Simulation results of four randomly picked cells [5-6-7-10] from the
proposed WTA circuit while identifying the winning classes. For the
shown waveforms, the expected output labels are: [x-7-7-x-7-x-10-6-
x-10-...], where x indicates other classes (not shown here).
6.7 (a) The training circuit of the proximal synaptic connections in
an HTM mini-column, (b) A waveform diagram demonstrating the
operation of the training circuit during the testing period (shaded
in light gray) and training period.
6.8 The possible scenarios for the current sneak paths when a DFF is
buffered with (a) A NOT gate, (b) A Tri-state buffer to drive the
proximal connection memristors.
6.9 The circuit diagram depicting the HTM cell with synaptogenesis
unit, which can generate or prune distal segments; a distal dendritic
segment to hold the permanence values of the distal connections;
and current comparators to evaluate the distal segment activation
level and consequently the cell status (predictive or unpredictive).
6.10 The matching probability between a distal segment address generated
by LFSRs and the address of the active cells in the previous time
step for various segment sizes.
6.11 Competitive circuit that enables the cells within one mini-column to
interact with each other when a massive firing activity takes place
in the mini-column.
6.12 Waveform diagram illustrating the competitive circuit for the cell to
select the best matching cell.
6.13 SSR arbiter circuit consisting of buffers to store the simultaneous
requests from the winning mini-columns, a series of nMOS pass
transistors to monitor the status of the individual mini-column’s
requests, and a feedback circuit to clear mini-column requests once
served.
6.14 Waveform diagram demonstrating a part of the SSR operation while
processing several concurrent requests sent from several mini-columns
located within the same row of the HTM region.
6.15 The SDR classifier schematic, which is mainly composed of a softmax
unit (memristive crossbar + winner-takes-all circuit) to classify the
HTM neuronal activities, time buffer to store data over time, and
sequential digital comparators to recognize novel inputs, which helps
in maintaining input data records.
7.1 Samples from the classification benchmarks used to evaluate HTM
system performance for visual information processing. (left)
Handwritten digits from MNIST dataset. (right) Yalefaces dataset.
7.2 Recognition accuracy of MNIST dataset classified with SDR classifier
and SP+SDR classifier in the presence of a noise level ranging
between 0% and 10%.
7.3 Noise robustness of the SP+SDR classifier when presenting MNIST
dataset as a stream of data mediated by noisy information.
7.4 A snapshot of (a) the power consumption of the Hot-Gym dataset
recorded every hour over approximately 4 days, (b) the taxi demand
in New York City estimated at every hour, (c) the daily minimum
temperature in Melbourne, Australia, (d) the number of successful
observed sunspots for 230 years.
7.5 MAPE for predicting the power consumption in a gym for the next
2 and 5 hours using HTM software (HTM-SW) and HTM hardware
(HTM-HW) models.
7.6 A sample of the Hot-Gym dataset before and after superimposing
Gaussian noise with Bernoulli distribution with probability of 0.5.
The SNR is 19.74 dB.
7.7 MAPE while predicting the power consumption in a gym for the
next two hours in the presence of various degrees of uniform noise
(left) and Gaussian noise (right).
7.8 The impact of injecting internal noise on HTM system performance.
In separate experiments, the noise is injected into the encoder,
spatial pooler, and temporal memory outputs, respectively. Then,
the noise is collectively added to the entire network at the same time.
7.9 Latency of the digital and mixed-signal HTM designs as a function of
the network size, given by the number of mini-columns (the number
of cells per mini-column is 4).
7.10 Elasticity (lifespan) of the overall HTM mini-columns in the ideal
and real-world scenarios.
7.11 The impact of the HTM network scaling (number of mini-columns
nc) on the elasticity (lifespan).
7.12 The MAPE (averaged over 5 runs) of the HTM-HW predicting two
steps ahead in time for the Hot-Gym dataset while experiencing
various types of stuck-at faults.
7.13 The total power consumption of the developed HTM system as it
processes and predicts time-series data from Hot-Gym dataset.
7.14 The average power consumption of the proximal segments (left) and
distal segments (right) involved in training.
7.15 The average power consumption of all the proximal segments (left)
and distal segments (right) while computing the overlap scores.
7.16 Contour of energy-delay-product for the developed HTM system as
a function of the network size.
7.17 The distribution of the power consumption among the building
entities of the developed HTM system during the training and
testing modes.
7.18 Normalized power consumption of Pyragrid with respect to GPU [7],
CPU, state-of-the-art digital custom designs (green bars), and
memristor-based analog and mixed-signal designs (red bar).
B.1 Transistor-level schematic of the comparator. The comparator circuit
uses additional buffers and positive feedback. The buffers are used to
speed up signal propagation, whereas the positive feedback endows
the comparator with the capability to handle external/internal noise.
Additional transistors (T10 and T9) are utilized to achieve low resolution.
B.2 Transistor-level schematic of 3-stage Op-Amp with output stage of
class B to provide low-output impedance.
C.1 Power consumption breakdown of smartphones based on regular
daily use as estimated by [8].
C.2 Power consumption breakdown of smartphone (Nexus-6) while playing
video games.
C.3 The estimated battery life of Nexus-6 smartphone after porting HTM
system (various sizes), assuming HTM has the capability to reduce
the power consumption of CPU and GPU by 25% when used for
compute-intensive operations.
C.4 The estimated battery life of Nexus-6 smartphone versus the reduction
percentage in the CPU and GPU power consumption when
using the developed HTM system.
Frequently Used Symbols
αth Overlap score threshold or minOverlap
ρ̄ Connected distal synapses array
ρ̄p Connected proximal synapses array
ā Mini-column average activity
  Output layer weight array in the ELM network
η Desired level of sparsity
  Synapse adaptation pace factor
Λ Spatial pooler output
ω Hidden layer weight array in the ELM network
  Region predictive cells (array)
ρ Distal synaptic permanence array
ρp Proximal synaptic permanence array
  Learning rate (stochastic gradient descent)
α⃗ Mini-columns overlap score
α̃⃗ Nominated mini-columns with high overlap scores
ξ Inhibition radius
A Region cells’ activity
b Mini-column boosting factor
Dij Group of distal segments for a given cell
Ed Memristor device endurance
Gmem Memristor device conductance
Goff Conductance of the memristor undoped region
Gon Conductance of the memristor doped region
Lr Successful learning rounds for a given network
nc Total number of mini-columns in an HTM region
nm Number of cells in each mini-column
nn Total number of input samples
nw Number of winning mini-columns after inhibition
nx Feed-forward input vector length
ny Number of class labels recognized by the SDR classifier
nsd Maximum number of synapses per distal segment
nsp Proximal connections count in each mini-column
P+ Distal synaptic permanence increment
P− Distal synaptic permanence decrement
Pp+ Proximal synaptic permanence increment
Pp− Proximal synaptic permanence decrement
Pth Synaptic permanence connection threshold
S Distal synaptic connections array
Sp Proximal synaptic connections array
Sth Distal segment activation threshold




Motivated by the ubiquitous internet connectivity and the proliferation of smart
devices, Artificial Intelligence-on-edge (AI-on-edge) has become a new paradigm
that is driving a new generation of computer architectures and systems. Unlike
Artificial Intelligence-on-cloud (AI-on-cloud), which is capable of executing complex
deep learning tasks on large server racks away from the data source, AI-on-edge
processes data locally on the device. Hence it offers: (i) real-time AI with extremely
low latency; (ii) optional network connectivity, which supports mission-critical scenarios
where AI must be performed without network access; and (iii) data privacy, since users
need not upload private data to the cloud in exchange for in-cloud AI services
[9]. Examples such as traffic monitoring using mobile devices, smart healthcare
monitoring using wearable sensors, and precision agriculture using drones also
present a case for rich processing capability on the edge [10, 11, 12].
Enabling AI-on-edge is known to be challenging, especially when using neural
networks. Key limitations to deploying neural networks on edge devices are compute
resources, memory bandwidth, and power budget. For instance, porting a deep
neural network, ResNet-152, requires 22.6 GOPS to process a 224 × 224 image and
220MB of memory to store the network parameters (≈ 60M parameters) [13, 14].
This is added to the high latency and power consumption which result from the
CHAPTER 1. INTRODUCTION 2
fact that most of the computing systems in edge devices are built based on the
classical von Neumann architectures, where the compute and memory units are
physically separated [15]. For such an architecture, a first-order approximation of
the power consumption when using ResNet-152 is estimated to be 38.4W for
memory access alone (dynamic random access memory (DRAM) running at 1kHz),
which is far beyond the power budget of edge devices. However, the above issues
can be addressed by developing full custom neuromorphic systems with co-located
memory and compute units.
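The 38.4W figure can be reproduced with back-of-the-envelope arithmetic. The sketch below rests on assumptions that are not stated in the text: every one of the ≈ 60M parameters is fetched from DRAM once per inference, each fetch costs the 640pJ quoted later in this chapter for an off-chip access to 45nm DRAM, and "1kHz" means 1,000 inferences per second.

```python
# First-order DRAM power estimate for ResNet-152 (illustrative assumptions).
NUM_PARAMETERS = 60e6          # ~60M parameters [13, 14]
ENERGY_PER_ACCESS_J = 640e-12  # 640 pJ per off-chip DRAM access in 45nm [13]
INFERENCE_RATE_HZ = 1e3        # assumed: one full parameter sweep per inference, 1k inferences/s


def dram_power_watts(num_params, energy_per_access_j, rate_hz):
    """Power = accesses per inference * energy per access * inferences per second."""
    energy_per_inference_j = num_params * energy_per_access_j
    return energy_per_inference_j * rate_hz


power_w = dram_power_watts(NUM_PARAMETERS, ENERGY_PER_ACCESS_J, INFERENCE_RATE_HZ)
print(f"Estimated DRAM access power: {power_w:.1f} W")
```

Under these assumptions the estimate comes to 38.4W, which matches the value given above.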
Neuromorphic systems, which are neurobiologically inspired, have emerged as a
promising candidate for the next generation of computing systems as they offer
state-of-the-art performance in complex real-world tasks, such as pattern recognition,
sequence prediction, and novelty and anomaly detection [16, 13]. Besides the high
performance on complex tasks, neuromorphic systems offer reduced computational
time within an optimal power budget. However, several of the current generation
production-scale neuromorphic systems rely solely on pure complementary
metal-oxide-semiconductor (CMOS) technology. In the future, these systems will have to
confront CMOS scaling challenges beyond 5nm, which translate to increased
parasitic resistance and capacitance effects, stronger layout effects, edge placement
errors [17], and electrostatic dominance [18, 19], as elucidated by several research
teams. This calls for embracing new materials and devices that exhibit physical
behavior analogous to that observed in the biological brain.
In 2008, a successful physical implementation of a synapse-like device called the
memristor was reported by Strukov et al. [20]. The memristor is a non-volatile memory
element [21] that consumes low energy [22, 23], has a small footprint compared to
transistors, and can be integrated in high-density crossbar structures [4]. A key
advantage of the crossbar structure is that it enables performing computationally
intensive operations (multiply-accumulate) in neuromorphic systems concurrently
on-memory while consuming a small amount of power compared to conventional
implementations [24, 25]. Furthermore, it enables on-chip storage and eliminates
the need for off-chip storage, which itself consumes a prohibitive amount of energy.
For instance, an off-chip access to DRAM implemented in a 45nm process takes 640pJ,
which is two orders of magnitude more than a 32b multiplication operation
implemented in a similar process [13]. On one hand, the advent of the memristor device
has been a watershed in neuromorphic systems as it enables deploying
them on energy-constrained platforms. On the other hand, it raises challenges
spanning from the system level down to the circuit level. At the system level, one
must trade off the network computational speed, consumed power, and network
throughput. Designing an optimal system requires comprehensive knowledge about
the neural network algorithm and its targeted platform. At the circuit level,
developing computational neuronal units with dynamic synaptic connections and
enabling in-situ training create other challenges.
This dissertation addresses these challenges by studying a biologically inspired algo-
rithm known as hierarchical temporal memory (HTM), with the aim of deploying
this algorithm on edge devices and energy-constrained platforms (see Figure 1.1).
Primarily, this work sheds light on two main aspects. First, it studies the challenges
of in-situ training in memristor-based crossbar networks for HTM1. Recent
literature shows in-situ training models, but they lack co-design between the learning
algorithm and the architecture. This leads to suboptimal implementations that
miss key features such as the network's plasticity. For example, in the case of HTM,
the genesis and pruning of potential synapses throughout the network's lifetime is
an integral part of learning. This work addresses this issue by proposing synthetic
synapses representation (SSA) to virtually formulate the physical synaptic pathways
in complex networks for dynamic reconfiguration.
1 The in-situ training in memristor-based crossbars is also studied within the context of multi-layer
random projection networks, which can be used to recognize HTM neuronal activities.
Second, this work investigates new architectures for deploying hierarchical networks
like HTM on edge devices. Specifically, a low-latency and energy-efficient
CMOS/Memristor hybrid architecture, namely Pyragrid, is proposed for the entire
HTM network. The proposed design is evaluated and validated against different
benchmark tasks ranging from image classification to time-series data prediction.
It is important to mention here that in certain scenarios deploying a custom ASIC
(application specific integrated circuit) might be a formidable challenge. Therefore,
a reconfigurable system-on-chip (SoC) style architecture is also proposed to enable
HTM deployment on off-the-shelf edge devices.
Throughout this work, the following central questions will be explored: "How to
design SoC with plasticity as a key feature throughout the design abstraction? What













Figure 1.1: Core design flow in this dissertation: starts with abstracting the
salient structural and algorithmic properties of the biological neocortex using HTM,
progresses to mapping the HTM to a neuromorphic chip, and ends by deploying the
chip on resource-constrained platforms.
1.1 Motivation
Inspired by the desire to develop a unified computational platform where spatio-
temporal information is processed continuously in real-time within an optimal
power budget, this work aims at developing a neuromorphic computing system
with a core algorithm modeled by HTM for edge devices. Biologically inspired
systems, such as HTM [26, 27], have demonstrated strong capability in processing
spatial and temporal information with a high degree of plasticity while learning
models of the world. HTM also exhibits natural compatibility for continuous online
learning [28], noise and fault tolerance [29], and low power consumption achieved
through sparse neuronal activity [30, 31]. These properties make the algorithm
attractive for a wide range of applications such as visual object recognition and
classification [32, 33], prediction of data streams [34], natural language processing
and anomaly detection [35].
Despite the fact that HTM is an attractive algorithm, deploying the algorithm
on edge devices tends to be a challenging process for several reasons. The first
is associated with the learning in HTM. HTM is inspired by the information
processing ability of biological brains which are naturally endowed with a high
degree of plasticity and learn continuously through their lifetime. The plasticity
mechanisms in biological brains are responsible for network evolution and continuous
learning. Example mechanisms include neuroplasticity (structural plasticity and
homeostatic intrinsic plasticity) and synaptic plasticity. Structural plasticity relates
to physical changes in dendrites (synaptogenesis) or replacing the non-functional
neurons with new ones (neurogenesis). Homeostatic intrinsic plasticity regulates
neuron firing activities [36]. Usually, the neuron activities lead to an alteration
in the strength of the synaptic connection between neurons (synaptic efficacy),
which is known as synaptic plasticity [37]. Unfortunately, adopting this level of
plasticity and learning in neuromorphic systems is a daunting task due to the
sheer number of dynamic interconnects that are continuously updated. Therefore,
most of today’s available neuromorphic systems only take advantage of synaptic
plasticity which is implemented off-chip on a host computer or cloud (referred to
as off-device or off-chip training) [38, 39]. This results in systems that have limited
learning capability and limited applicability to real-world scenarios, where new
information has to be learnt on the fly.
A few research groups address the challenges of incorporating synaptogenesis using
virtual synapses realized through address-event representation (AER), developed
by Mahowald in 1992 [5]. AER takes advantage of sparse neuronal events to bridge
the connections between neurons through a shared data bus. AER has been
shown to work effectively for one-to-one inter-chip communication, but it is
not deemed effective for intra-chip communication, especially with sparse dynamic
connections, where sparse convergent and divergent connections are required [40].
The complex network connectivity is addressed by the enhanced AER proposed by
Goldberg et al. [41]. The enhanced AER uses look-up tables (LUTs) to describe the
synaptic connectivity between two sets of neuronal arrays. However, this approach
constrains the speed of operation of the neuronal circuit and may require significant
memory to store neuronal connections (e.g., enhanced AER demands ≈ 72MB of memory
for an HTM network with 4k neurons).
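The idea of LUT-based address-event routing can be sketched in a few lines. The code below is an illustration only and is not the scheme of [41]: the table layout, the function names `route_events` and `rewire`, and the example weights are all hypothetical. It shows why a table lookup supports convergent and divergent connections, and why changing connectivity amounts to rewriting a table entry.

```python
from collections import defaultdict

# Hypothetical connectivity LUT: source address -> list of (target address, weight).
# One source with several targets is a divergent connection; several sources
# sharing one target is a convergent connection.
connectivity_lut = {
    0: [(2, 0.5), (3, 1.0)],
    1: [(3, 0.25)],
}


def route_events(spike_addresses, lut):
    """Deliver each address event on the shared bus to its targets via table lookup."""
    synaptic_input = defaultdict(float)
    for source in spike_addresses:  # only active neurons place their address on the bus
        for target, weight in lut.get(source, []):
            synaptic_input[target] += weight
    return dict(synaptic_input)


def rewire(lut, source, old_target, new_target, weight):
    """Change connectivity by editing the table; no physical wire is altered."""
    kept = [(t, w) for t, w in lut.get(source, []) if t != old_target]
    lut[source] = kept + [(new_target, weight)]


print(route_events([0, 1], connectivity_lut))
```

Every connection needs a table entry, so the memory cost grows with the number of synapses, and each event needs at least one lookup before it is delivered. Both are consistent with the limitations noted above.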
When it comes to synaptic plasticity, a few research groups suggest using hybrid
learning, where the majority of synaptic tuning is offloaded to the cloud and any
minor tweaking is performed in-situ [42]. This approach might improve the learning
capability based on past inputs but will suffer from a learning lag, as it does not
account for new incoming data. Additionally, there are several applications
where hybrid or off-chip learning is not feasible due to the lack of access
to the cloud, location sensitivity, or high latency [9]. One example is a
drone sent on a rescue mission to remote places, where hybrid learning would require
continuous access to the cloud to make critical decisions and to learn continually
from the changing environment. Other examples include implanted medical devices
such as pacemakers, which do not tolerate wireless signal transmission as it may
disrupt the heart operation, and autonomous vehicles, which cannot afford even a
millisecond of latency in decision making to operate safely.
The reasons stated above motivate us to study in-situ training to support not only
the synaptic strength modulation, but also the formation of synaptic pathways in
real-time. Key points include:
• In-situ training is practical and reliable as it offers inbuilt plasticity.
• Roundtrip communication to the cloud introduces unacceptable latency, espe-
cially for real-time applications. In-situ training improves network response
time and performance.
• In-situ training ensures better privacy and security as there is no need to
transfer any sensitive data to the cloud for processing.
• In-situ training is robust as it accounts for the inherent device variability
during the learning process. It also addresses the issues associated with the
memristor’s undesired conductance change with time.
The second major challenge with deploying the HTM algorithm on edge devices is
associated with the architecture of the computational platforms. Running the HTM
algorithm on conventional computing systems incurs several challenges. Firstly,
HTM demands high computational power that cannot be fulfilled by conventional
von Neumann architectures. This is because the HTM innate architecture, which
is composed of thousands of neuronal circuits, requires a high level of parallelism in
information processing. Running HTM on platforms with conventional architecture
can lead to a severe throughput drop and high power consumption. For instance,
running the HTM algorithm (only the spatial pooler) along with the sparse distributed
representation (SDR) classifier on a macOS machine with an Intel i7 processor
clocked at 2.6GHz to classify hand-written digits of the modified national institute
of standards and technology (MNIST) dataset [43] leads to a 325× drop in network
throughput as compared to an ASIC with co-localized memory and processing
units, and consumes 1.23W of DRAM power and 26.34W of CPU power2.
Secondly, the HTM algorithm is memory intensive due to the high number of
synaptic connections which are continuously adjusted during the learning process.
For instance, an HTM network with 4k neurons can have ≈ 1083k synaptic
connections. This requires ≈ 4.12MB of memory just to store the weights. When
processing an input data stream, ≈ 47k connections are used for computations in
each time step. This costs 30.14µJ of energy, if a 45nm DRAM is used to store all
the weights.
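These figures can be checked approximately. The sketch below assumes 32-bit weights, binary megabytes, and one DRAM access of 640pJ per active connection (the 45nm figure cited earlier in this chapter). None of these assumptions is stated explicitly in the text.

```python
NUM_SYNAPSES = 1083e3            # ~1083k synaptic connections (from the text)
ACTIVE_SYNAPSES_PER_STEP = 47e3  # ~47k connections used per time step (from the text)
BYTES_PER_WEIGHT = 4             # assumption: 32-bit weights
ENERGY_PER_ACCESS_J = 640e-12    # 640 pJ per 45nm DRAM access [13]

# Storage for all weights, expressed in binary megabytes (MiB).
weight_memory_mib = NUM_SYNAPSES * BYTES_PER_WEIGHT / 2**20

# Energy per time step if every active connection costs one DRAM access.
energy_per_step_uj = ACTIVE_SYNAPSES_PER_STEP * ENERGY_PER_ACCESS_J * 1e6

print(f"Weight storage: {weight_memory_mib:.2f} MiB")
print(f"Energy per time step: {energy_per_step_uj:.2f} uJ")
```

With the rounded counts, this gives roughly 4.13 MiB and 30.08µJ, close to the quoted ≈ 4.12MB and 30.14µJ. The small differences are what one would expect from the rounding of the connection counts.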
Thirdly, the limited memory bandwidth availability in von Neumann architecture
introduces undesired latency and degrades network performance. For instance, in
the HTM example, limited access to the memory leads to ≈ 1.449ms time delay
each time the network processes an input (computed based on the DRAM read and
write latency of 40ns and 23ns, respectively [44]). The latency further increases as
the network size scales up.
Finally, the HTM algorithm is known to have an unbalanced workload at the neuronal
level and arbitrary memory access due to the sparse neuronal activities. Thereby,
mapping the algorithm to von Neumann-based computational platforms that provide
the necessary parallelism, such as GPUs, fails to provide satisfactory
performance [7]. Besides the unsatisfactory performance, GPUs usually demand a
large power budget as well.
2 These measurements were taken using Intel Power Gadget software while running the HTM spatial
pooler on a MacBook Pro machine.
All the above reasons motivate us to develop specialized hardware to run the
HTM algorithm efficiently and affordably on edge devices. We introduce Pyragrid,
a custom hybrid CMOS/Memristor mixed-signal neuromorphic system with co-located
memory and processing units. The proposed design supports a wide range of































Figure 1.2: High-level architecture of the proposed neuromorphic system with three
core units: data encoders, HTM algorithm, and classifiers. The encoder transfers
the input data into a binary distributed representation. The HTM algorithm learns
spatial information and temporal sequences, while the SDR classifier transfers the
HTM output into a meaningful data format.
1.2 Contributions and Thesis Outline
The primary objective of this work is to develop a low latency and energy efficient
neuromorphic computing system capable of processing spatio-temporal information
for edge devices and energy constrained platforms (see Figure 1.2). The proposed
system is based on the HTM algorithm, thereby it inherently offers critical properties
for sequence learning such as high resiliency and noise tolerance. The proposed
system is validated on a range of applications such as image recognition and
time-series data prediction. The key contributions of this work are:
• A training system for memristor-based neuromorphic systems, offering in-situ
learning with minimal resource and power requirements.
• A novel design of the HTM primary computational units (mini-column and
cell); unique features include high computational speed, fast training, power
efficiency, and support for various plasticity mechanisms.
• A neuromorphic architecture of the HTM spatial pooler, temporal memory,
and sparse distributed representation classifier.
• A holistic SoC architecture, equipped with a novel communication scheme,
that can enable efficient and affordable spatio-temporal information processing
on edge devices.
The rest of this dissertation is organized as follows: Chapter 2 presents an overview of
the existing state-of-the-art neuromorphic systems developed in academia and industry.
It also surveys challenges associated with porting various neural network algorithms
(e.g. multi-layer neural networks and hierarchical temporal memory) on edge devices
while enabling on-device learning. Chapter 3 discusses enabling in-situ training in
multi-layer neural networks and the challenges associated with that. The proposed
training system, including the memristor writing scheme, is also introduced in this
chapter (Chapter 3 content is based on Zyarah et al. [45, 46, 47]). In Chapter 4,
an overview of the HTM network structure and its algorithmic properties is
provided, including the mathematical modeling (Chapter 4 content is partially
based on Zyarah et al. [31]). Chapter 5 provides detail about the various design
considerations required to develop a hybrid neuromorphic system for the HTM
algorithm (Chapter 5 content is based on Soures and Zyarah et al. [48] and Zyarah
et al. [31]). Chapter 6 describes Pyragrid, the proposed neuromorphic architecture
of the HTM system. A novel design of the HTM primary building blocks
(mini-columns and cells) and synthetic synapses representation, a novel communication
scheme to enable HTM neuronal interaction, are introduced (Chapter 6 content
is based on Zyarah et al. [49]). Chapter 7 demonstrates benchmarking the
proposed HTM neuromorphic system on image classification and time series data
prediction. The proposed system is also evaluated for different metrics such as
latency, lifespan, power efficiency, and robustness (Chapter 7 content is based on




Background and Literature Review
Edge devices and energy constrained platforms such as smartphones, implantable
medical devices, and unmanned aerial vehicles (UAV) are becoming pervasive.
It is estimated that the number of edge device shipments will reach 2.6 billion
annually by 2025 [51], while AI tasks on edge devices will grow from 6% in 2017 to
43% in 2023 [52]. Deploying neuromorphic systems that offer in-situ training and
online spatio-temporal information processing on edge devices has the potential to
improve their computational capabilities with orders of magnitude lower energy
usage than conventional von Neumann architectures. In this chapter, an overview
of the existing neuromorphic systems and their limitations will be presented with
emphasis on two network topologies: multi-layer neural network and hierarchical
temporal memory.
2.1 Overview of Neuromorphic Systems
Over the course of the last decade, there has been a profound shift in the re-
search, where biologically-inspired computing systems are being actively studied
to address the demand for energy efficient intelligent devices. State-of-the-art
digital neuromorphic systems have shown remarkable success in disparate tasks
CHAPTER 2. BACKGROUND AND LITERATURE REVIEW 13
such as pattern recognition and natural language processing. For example, IBM has
developed TrueNorth [38] (shown in Figure 2.1-(a)), a spiking neuromorphic processor
equipped with 4096 cores, one million neurons, and 250 million synapses. Currently,
this processor can perform inference on-chip whereas the training is performed
off-chip. Performing off-chip training limits TrueNorth's learning capability and
applicability to rapid decision-making tasks. Another spiking processor, Loihi
from Intel [53], comprises ≈ 130k neurons and 130 million synapses. Loihi offers
in-situ training, with reduced network capacity on-device, and is attractive for edge
devices. Mixed-signal neuromorphic systems are also witnessing similar success
to their digital counterparts. Typically, these systems closely emulate
the analog processing in the brain and leverage its inherent statistical behavior
to develop energy efficient systems. A pertinent example is the Neurogrid board
(see Figure 2.1-(c)) developed by Stanford University [39]. The Neurogrid uses
analog components to emulate the neuronal units, which communicate through
AER, a digital communication scheme. Presently, it comprises 1 million neurons
and 6 billion synapses, trained off-chip. Another mixed-signal neuromorphic chip
is HICANN, developed by the Heidelberg University BrainScaleS project [54]. The
HICANN chip has 512 membrane circuits and 128k synapses trained on-chip, and
it offers lower latency than Neurogrid, but is less area and energy efficient. Other
example systems include SpiNNaker [55], DeepSouth [56], and SyNAPSE [57].
2.1.1 CMOS Neuromorphic Systems
Several of the above-mentioned neuromorphic systems are based on CMOS
technology. While CMOS technology has consistently demonstrated reduced transistor
size and power dissipation per transistor [58], there are a few challenges associated
with pure CMOS neuromorphic systems beyond 5nm, such as worse









Figure 2.1: Example of state-of-the-art digital and mixed-signal neuromorphic
chips developed by industry and academia [3].
parasitic resistance and capacitance, stronger layout effect, edge placement er-
rors [17], and electrostatic dominance [18]. Therefore, researchers are exploring
hybrid technologies, new logic and memory devices, that can support a new class
of computing architectures. In that vein, hybrid CMOS/Memristor technology
has been shown to address the bottlenecks associated with memory density, low
switching energy, multi-level state representation, and low resource synaptic and
neuronal devices [59, 60, 15].
There are a few first generation memristor-based neuromorphic systems that have
been developed over the past few years, such as the memristor-based ROLLS
neuromorphic chip [61], mrDANNA [62], and the distributed memristor-based
accelerator in [63]. These systems have been demonstrated solely on spatial tasks
in spiking and conventional networks. In some instances researchers have combined
CMOS/Memristor architectures with other device models as well, such as the
photonic computing devices [64]. All these approaches are orthogonal to this
research and offer a foundation for building a large-scale neuromorphic system that
can perform spatio-temporal tasks.
2.1.2 Hybrid Neuromorphic Systems
Hybrid neuromorphic systems, where the computational primitives are based on
CMOS/Memristor technologies, represent a radical shift in computing paradigm
away from the von Neumann architecture. The hybrid neuromorphic systems, as alluded
to earlier, have co-located memory and processing units. Co-locating
the memory and processing units not only addresses the challenges associated with
the latency and energy of data movement, but also affords in-memory computing,
massive parallelism, and high computational efficiency. These features make hybrid
neuromorphic systems efficient when deployed on
edge devices and energy-constrained platforms to perform real-world tasks such as
image classification, natural language processing, and anomaly detection [16, 65].
Typically, hybrid neuromorphic systems have a massive number of parameters
stored in non-volatile memory such as memristor devices. A memristor is a
two-terminal synapse-like nanoscale resistive memory. The term was coined by Leon
Chua in 1971 [66], and the device received renewed interest when it was fabricated
by HP Labs in 2008 [20]. The device has a small footprint compared to transistors,
consumes low switching energy [20, 67], and can be integrated into high-density
crossbar structures [16]. The crossbar structure enables performing vital operations
in neuromemristive systems (multiply-accumulate) with significant enhancement in
computational speed and energy efficiency.
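To make the multiply-accumulate claim concrete, the sketch below models an ideal crossbar. Input voltages are applied to the rows, each memristor contributes a current V·G by Ohm's law, and the currents in each column add up by Kirchhoff's current law. The function name `crossbar_mac` and the conductance values are hypothetical, and the model ignores wire resistance, sneak paths, and device non-linearity.

```python
def crossbar_mac(voltages, conductances):
    """Column currents of an ideal crossbar: I_j = sum_i V_i * G[i][j]."""
    num_rows = len(conductances)
    num_cols = len(conductances[0])
    if len(voltages) != num_rows:
        raise ValueError("one input voltage is required per crossbar row")
    return [
        sum(voltages[i] * conductances[i][j] for i in range(num_rows))
        for j in range(num_cols)
    ]


# Hypothetical 3x2 crossbar: conductances in siemens, inputs in volts.
G = [
    [1e-6, 5e-6],
    [2e-6, 1e-6],
    [4e-6, 3e-6],
]
V = [0.2, 0.0, 0.1]

currents = crossbar_mac(V, G)
print(currents)  # one accumulated current (in amperes) per column
```

In hardware all column currents settle at once, so the whole matrix-vector product takes a single read operation. This loop-based version only mirrors the arithmetic.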
The prototypical structure of the memristor device1, particularly RRAM, is composed
of two platinum (Pt) nano-wires sandwiching two pieces of doped and
undoped insulating material such as titanium, tantalum, or hafnium oxide. The
undoped region offers high resistivity compared to the doped one, which is doped
with oxygen vacancies.
1 Memristor as a term refers to a range of memory devices such as resistive random access
memory (RRAM), phase change memory (PCM), and spin transfer torque magnetoresistive
random access memory (STT-MRAM). In this work, our use is limited to RRAM devices as they
offer a high resistance range, high endurance, and fast switching, and, most importantly, use
materials that are common in semiconductor manufacturing [68].
Briefly, an RRAM device works as follows: upon applying
a voltage across the memristor or allowing a current to flow through it, the
oxygen vacancies start migrating either towards the undoped region, leading to
a decrease in device resistivity, or in the opposite direction, in which case the
device resistivity increases. Figure 2.2-(a) illustrates the memristor device
symbol and its structure, where D denotes the device thickness, and w is the
thickness of the doped region. Although D is a fixed value, w can be
altered by moving the oxygen vacancies. Figure 2.2-(b) illustrates the variation in
memristor resistance when the vacancies move in both directions. During the five
positive cycles, the oxygen vacancies move towards the undoped region, resulting
in a decrease in device resistance and an increase in the current flowing through it,
whereas in the five negative cycles, the opposite occurs [4]. The relationship between
the current flowing through the memristor, i(t), and the dopant movement as
defined by the boundary (boundary is the line separating the doped and undoped
regions) position is given by Equation (2.1), whereas the memristor conductivity
(Gmem) as a function of the boundary position can be defined by Equation (2.2)2.
Here, µ is the dopant mobility, and GON and GOFF denote the conductivity of the
doped and undoped regions, respectively.
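For reference, the linear ion-drift model introduced with the 2008 device [20] is commonly written as shown below. This is a sketch of that standard model expressed through resistances, with the assumed shorthand $R_{ON} = 1/G_{ON}$ and $R_{OFF} = 1/G_{OFF}$. It is not necessarily identical in form to Equations (2.1) and (2.2).

```latex
\frac{dw(t)}{dt} = \mu \, \frac{R_{ON}}{D} \, i(t),
\qquad
G_{mem}(w) = \left[ R_{ON}\,\frac{w}{D} + R_{OFF}\left(1 - \frac{w}{D}\right) \right]^{-1}
```

In this form the boundary position w moves in proportion to the charge that has passed through the device. The conductance interpolates between that of a fully doped device (w = D) and a fully undoped one (w = 0).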



















2 Equation (2.2) may not generalize to all memristor devices. Some memristor devices exhibit
unique characteristics and require different equations for modeling, as we will see in Chapter 5.
Figure 2.2: (a) The structure of a memristor device along with its symbol. The
device is composed of HfO2 and TiN sandwiched by nano-wires of platinum. (b)
The continual increment/decrement of the current flowing
in the memristor as its resistance decreases/increases when a positive/negative
voltage is applied across it [4].
It is important to mention here that memristor devices, on one hand, offer
numerous attractive features. On the other hand, memristors have limited endurance
and device yield, and are highly prone to defects due to process variations [42, 69].
There are various types of defects that can occur in a memristor device such as
ageing fault, endurance degradation fault, switching delay fault, and stuck-at fault.
The main cause of ageing faults is excessive memristor device switching (in the
range of millions of cycles) between the low resistance state (LRS) and high resistance state
(HRS). This may damage the memristor electrode layer and eventually
lead to the memristor being stuck in LRS [70]. The endurance degradation fault
refers to a memristor’s failure to switch to the maximum or minimum possible
resistance level. This fault is primarily attributed to recrystallization of the oxide
material or forming short conductive filaments [71]. The switching delay fault
here describes the variation in switching speed of the memristor devices within
the crossbar and this may occur due to doping concentration disparity during the
fabrication process. Lastly, the stuck-at-faults can occur either due to excessive
doping or lack of doping during the fabrication process. Stuck-at-faults can be
categorized into three types: stuck-at, stuck-on, and stuck-off. The stuck-at repli-
CHAPTER 2. BACKGROUND AND LITERATURE REVIEW 18
cates memristor device failure right after the forming process. The resistance of
such devices follows a random Gaussian distribution with range defined by the
physical device resistance. For stuck-on and stuck-off, the memristor remains stuck
at maximum and minimum conductance, respectively. In this work, we will mostly
place emphasis on stuck-at-faults as they are more common [72] and have higher
impact on network performance.
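The three stuck-at fault types described above can be illustrated with a small fault-injection sketch. It is a minimal illustration rather than the simulation setup used in this work: the function name `inject_stuck_faults`, the equal split between fault types, the Gaussian mean and spread, and all resistance values are assumptions chosen for the example.

```python
import numpy as np

def inject_stuck_faults(R, r_min, r_max, fault_rate, rng):
    """Return a faulty copy of the resistance matrix R and the fault mask.

    All parameters are hypothetical; the text does not prescribe them.
    """
    R_faulty = R.copy()
    faulty = rng.random(R.shape) < fault_rate
    # Assumption: the three fault types are equally likely.
    kind = rng.integers(0, 3, size=R.shape)

    # Stuck-at: random Gaussian resistance limited to the physical range.
    # The mean and standard deviation below are illustrative assumptions.
    mean = 0.5 * (r_min + r_max)
    std = (r_max - r_min) / 6.0
    random_r = np.clip(rng.normal(mean, std, size=R.shape), r_min, r_max)

    stuck_at = faulty & (kind == 0)
    R_faulty[stuck_at] = random_r[stuck_at]
    R_faulty[faulty & (kind == 1)] = r_min  # stuck-on: maximum conductance
    R_faulty[faulty & (kind == 2)] = r_max  # stuck-off: minimum conductance
    return R_faulty, faulty

rng = np.random.default_rng(0)
R = np.full((8, 4), 50e3)  # healthy devices at an arbitrary 50 kOhm
R_faulty, mask = inject_stuck_faults(R, r_min=10e3, r_max=100e3,
                                     fault_rate=0.1, rng=rng)
```

Working in the resistance domain keeps the stuck-at case aligned with the description above, while stuck-on and stuck-off map to the minimum and maximum resistance, respectively.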
2.2 Network Topologies
2.2.1 Multi-layer Neural Network: Architecture and Learn-
ing
The general structure of a feed-forward multi-layer neural network is defined by
several consecutive layers of neuron stacks, namely input, hidden, and output
layers. Each of these layers, except the input, performs a series of transformations
that change the similarities between input cases [73]. Typically, each layer in the
network is meshed with the successive one with weighted synaptic connections.
During network training, the weighted synaptic connections are adjusted according
to a predefined learning rule to enable network convergence to a function of interest.
Despite the fact that porting multi-layer neural networks to edge devices has
the potential to facilitate various tasks such as feature extraction and function
approximation, training these networks is still a daunting problem. Typically,
neural networks have a massive number of parameters that need to be fine-tuned
during the training process. Transferring all these parameters to the cloud to do
ex-situ training is not practical for a myriad of end-user devices especially those
with limited resources as it demands high communication bandwidth and a large
power budget. Thus, performing in-situ training is favourable. There are various
in-situ training systems that have been developed and studied for different learning
algorithms and neural networks. The majority of these in-situ training systems
are designed to enable network training in online fashion with a core learning
model based on gradient descent optimization algorithm applied together with
backpropagation.
Examples of networks in which in-situ training systems are solely based on stochastic
gradient descent (SGD) include a memristor-based single-layer perceptron that classifies
synthetic patterns of the letters "X" and "T", which was proposed by Alibart et
al. in 2013 [16]. The proposed design is trained using ex-situ and in-situ methods
with the perceptron training rule, but adequate details about the training system
are not provided. In 2014, Chabi et al. proposed an in-situ supervised training circuit for
a high-density neural crossbar [74]. The circuit realizes a Boolean version of the
delta rule using binary memristor logic gates. Then, a brain-state-in-a-box model
trained in-situ using SGD was presented by Hu et al. in 2014 [75]. The authors
simplify the training procedure by considering only the sign of the input
and the sign of the error. The proposed design is demonstrated on
a character image dataset in the presence of noise, the impact of which is alleviated by
performing on-device training. However, not enough details are provided regarding
the sign detection circuit and the writing circuit.
Examples of networks trained with gradient descent and backpropagation include
the memristor-based multi-layer neural network presented by Soudry et al. in 2015 [76].
The authors used two transistors and one memristor (2T1M) to emulate the synaptic
weights and adjust their values, which makes the total number of transistors in
the crossbar scale linearly with the number of memristors. Also, the authors did
not provide enough details about the circuit used to evaluate network performance,
which is expected to require ADCs, digital-to-analog converters (DACs), multipliers,
and buffers. Hasan et al. proposed an in-situ training system for a memristor-based
multi-layer neural network in 2016 [77]. The proposed training system implements
backpropagation in two versions: exact and stochastic. The authors find that
stochastic backpropagation takes longer to converge but is more power and
area efficient, as it does not require any ADCs/DACs, multipliers, or look-up
tables. However, their training system still relies heavily on Op-Amps, adding
overhead to the network power and area budget. In 2018, Li et al. proposed a
multi-layer memristive neural network with hybrid training realized in S/W and H/W
models. In this work, the authors exploited the pass transistors associated with
each memristor in the crossbar to improve the linearity of the memristor conductance
change and enhance the learning process [78]. However, performing hybrid training
through a host computer makes the proposed training system less attractive for
edge devices and energy-constrained platforms. Giacomin et al. then proposed a
dot product engine based on memristors and a single-ended XOR sense circuit in
2018 [79]. The proposed design is integrated into the ISAAC chip to accelerate the
training process and reduce power consumption, whereas the memristor training
is performed using a circuit similar to Ziksa [45]. Later, in 2019, Greenberg-Toledo
et al. explored supporting stochastic gradient descent with momentum to speed up
network convergence [80]. The authors propose two design approaches to support
the momentum. The first relies on external memory, which is not recommended
as it demands an accurate read/write of each memristor in the network crossbar to
update it according to the momentum value. The second approach adds
an additional memristor to each synapse to store the accumulated history of the
previous updates that occurred during training. The second approach increases
the synapse complexity (3T2M) but is considerably more efficient in terms of power and
speed than the first approach. Other examples of training systems for
multi-layer neural networks are reported in [81, 82, 83, 75].
2.2.2 Hierarchical Temporal Memory: Architecture and
Learning
Hierarchical temporal memory (HTM) is a machine learning algorithm theoretically
formulated by Jeff Hawkins in 2005 [26]. HTM attempts to capture the structure
and the key functional properties of the human neocortex and thereby its ability to
learn challenging cognitive tasks. HTM is structured from ascending hierarchical
regions of cellular layers that enable the network to demonstrate strong capability
in processing spatial and temporal information with a high degree of plasticity
while learning models of the world. Thus, it is used in a myriad of applications
including visual object recognition and classification [32, 33], prediction of data
streams [34], natural language processing, and anomaly detection [35]. Although
HTM is an attractive algorithm, it demands high computational power
and memory bandwidth that cannot be fulfilled by conventional von Neumann
architectures such as CPUs and GPUs. This has motivated several research groups to
develop specialized custom hardware designs to run the HTM algorithm efficiently
and affordably, targeting edge devices.
Deploying the HTM algorithm on edge and embedded devices can enable pattern
recognition, real-time prediction, and anomaly detection tasks. A few
research groups have studied digital and mixed-signal architectures for HTM.
However, HTM has been continually evolving, and most of the published
architectures focus on earlier, deprecated versions of the algorithm. The first wave
of architectures was published circa 2007 and focused on the first generation of
the algorithm (Zeta). In 2007, Kenneth et al. realized HTM-Zeta on an FPGA for
image recognition [32]. The model has 81 parallel computational nodes arranged
hierarchically in 3 layers and offers a 148x speedup over its software counterpart. A
Verilog implementation of the single fundamental unit in HTM-Zeta, a node, was
proposed by Pavan et al. in 2013 [84]. The proposed node architecture can be used
to realize the HTM-Zeta spatial pooler, but it cannot feasibly be applied to the
HTM sequence memory. The second generation of architectures was investigated
circa 2015. Zyarah et al. [85] presented a scalable design with 100 mini-columns,
which was demonstrated for classification with an SVM. The authors also proposed a
temporal memory design for prediction [86, 50]. In 2016, a nonvolatile memory
based HTM spatial pooler implementation was presented by Streat et al. [87],
considering the physical constraints of commodity flash memory. Meanwhile,
Weifu Li et al. proposed a full architecture of the HTM algorithm including both
spatial and temporal aspects [88]. The proposed design has 400 mini-columns (2
cells in each mini-column) connected in point-to-point format to the HTM input
space, which eventually causes a column to be active even when there is only
insignificant activity (noise) in the input space. Furthermore, this implementation
does not account for potential synapses. Later, a memristor-based
implementation of the HTM spatial pooler was proposed by James et al. [89]. Although
the proposed design is power efficient, it lacks reconfigurability, which is important
for learning and making predictions. The same group also presented a full HTM
architecture in 2018 [90]. However, the temporal aspect of the implementation
does not match that described in HTM sequence memory; it depends on the class
map concept, which matches stored patterns with test patterns. Recently, Truong
et al. presented a memristor-based crossbar to model the spatial pooler of the
HTM algorithm [91]. However, because HTM is dominated by
dynamic sparse connections, using the traditional crossbar structure leads to dark
spots (unused regions) in the crossbar.
2.3 Summary
In this chapter, an overview of the state-of-the-art neuromorphic systems
developed by industry and academia is presented. The distinct features of each chip,
including possible application domains such as pattern recognition and natural
language processing, are highlighted. We show that, in response to market demand and
technology limitations, there is a shift toward developing
CMOS/Memristor hybrid systems as an alternative to existing pure CMOS
neuromorphic systems. The CMOS/Memristor hybrid systems offer energy- and
resource-efficient intelligent platforms. When it comes to network topologies, the
high-level architecture and the previous implementations of the multi-layer
network and the hierarchical temporal memory algorithms, along with their training
techniques, are discussed. One of the key aspects that may hinder porting multi-layer
networks to edge devices is the lack of a simple and effective on-device training
system. In the case of the HTM, porting the algorithm to edge devices demands a
specialized custom design. To the best of our knowledge, there is no full custom
mixed-signal design of the HTM algorithm in the literature with an underlying digital
communication scheme and analog computational modules. Such a design
aggregates the necessary reconfigurability, low energy-delay product, and robust




In-Situ Training for Multi-layer Feedforward
Networks
The exponential growth of edge devices coupled with the limited bandwidth of
the mobile networks sheds light on the necessity for enabling in-situ training [92].
Enabling in-situ training enhances data privacy, increases system reliability, and
reduces decision latency. This chapter presents an in-situ training system for
memristor-based neuromorphic systems. The proposed training system enables
on-device learning with minimal resource and power requirements, owing to its
simplified training circuitry and writing scheme. The writing scheme, named Ziksa,
supports tuning current-threshold and voltage-threshold memristors. The proposed
training system is used to train a class of multi-layer neural networks known as
extreme learning machine (ELM) with core learning modeled by stochastic gradient
descent. Using stochastic gradient descent enables online learning and offers
additional degrees of freedom to address the impact of learning rule simplification
on network performance.
This chapter also investigates the ELM network’s capability to restore its per-
formance via plasticity mechanisms such as increasing the population of hidden
neurons. System level simulation is conducted to evaluate the network performance
CHAPTER 3. IN-SITU TRAINING FOR FEEDFORWARD NETWORK 25
for different classification tasks. The impact of memristor device variability and
device failures (stuck-at faults) on network performance is also studied.
3.1 Multi-layer Neural Networks
3.1.1 Network Topology: Extreme Learning Machine
The general architecture of a multi-layer feedforward neural network is shown
in Figure 3.1. The network is mainly composed of neuron stacks grouped in
consecutive layers, namely input, hidden, and output. The input layer is dedicated
to buffering the input prior to presenting it to the network. In the hidden layer,
the input data is projected to a high-dimensional space so that the most important
and relevant features are stochastically extracted [93, 94]. The output of the
hidden layer is usually classified by the subsequent output layer. Typically, the
consecutive layers in the network communicate in one direction (in the case of
feedforward networks) through weighted synaptic connections. The connection
strengths (or weights) are continuously updated during network training to enable
network convergence to a function of interest.
There are several learning rules that can be adopted to adjust the synaptic weights
in the multi-layer neural networks such as resilient backpropagation (Rprop) [95],
adaptive moment estimation (Adam) [96], and stochastic gradient descent (SGD).
In this work, SGD is chosen because of its simplicity and minimal use of compute
resources. SGD is used to train the ELM algorithm, which is well known for its fast
training as it requires training the output layer weights only, while the hidden
layer weights are randomly initialized and left unchanged during the training phase.
Typically, ELM is trained with the normal equation. Given an input data array
$X \in \mathbb{R}^{n_x \times n_n}$ and its associated class label array¹ $Y \in \mathbb{R}^{n_k \times n_n}_{\{0,1\}}$, the output layer
weights ($\beta$) can be found using Equation (3.2), where $H$ and $\omega$ are the hidden layer
activation array and its weights, respectively. $n_n$ and $n_x$ are the number of input
feature vectors and their length, $f$ is the neuron non-linear activation function, and
$n_k$ denotes the number of class labels.

$$H = f(X\omega) \tag{3.1}$$
$$\beta = (H^{T}H)^{-1}H^{T}Y \tag{3.2}$$

Figure 3.1: High level representation of a three-layer feedforward network: a buffer
layer (input), feature extraction layer (hidden), and classification layer (output).
Using the normal equation to find the output layer's optimal
weights is computationally fast, as it does not involve any iterative training.
However, for a large dataset, it requires a massive amount of memory. For instance, training
the ELM network² on the MNIST dataset requires at least 26GB of memory. This
makes running the ELM on resource-constrained platforms impractical. For this
reason, it is recommended to adopt iterative learning approaches, such as
SGD, which slow down the learning process but are more convenient for the target
platforms (edge devices).
1 The expected output labels are encoded in one-hot format, where the location indexed by the
class label is assigned '1' and all other locations are assigned '0'.
2 The ELM network size used in this experiment is 784×250×10.
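The closed-form training route of Equations (3.1) and (3.2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the experimental setup of this work: the toy two-class data, the sigmoid activation, the layer sizes, and the choice to store one sample per row (so the arrays are transposed relative to the text) are all assumptions. The call to `np.linalg.lstsq` solves the same least-squares problem as the normal equation without forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data set (illustrative only); one sample per row.
n_per_class, n_x, n_h, n_k = 100, 2, 20, 2
X = np.vstack([rng.normal(-2.0, 1.0, (n_per_class, n_x)),
               rng.normal(+2.0, 1.0, (n_per_class, n_x))])
labels = np.repeat([0, 1], n_per_class)
Y = np.eye(n_k)[labels]  # one-hot class labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer weights: randomly initialized and never trained.
omega = rng.normal(size=(n_x, n_h))
H = sigmoid(X @ omega)  # hidden layer activations, Equation (3.1)

# Output layer weights: least-squares solution of H @ beta = Y,
# i.e. the solution targeted by the normal equation, Equation (3.2).
beta, *_ = np.linalg.lstsq(H, Y, rcond=None)

accuracy = np.mean(np.argmax(H @ beta, axis=1) == labels)
```

Note that the whole activation array `H` must be held in memory at once, which is the cost that motivates the iterative alternative.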
In this work, two topologies of ELM, trained with SGD, are explored. The first
topology is the conventional ELM, whose output layer consists of neurons with
non-linear activation functions. The second topology is an ELM network with a
softmax output layer, named SoftELM. Replacing the output layer in ELM with a
softmax is a risky step because it may lead to a divide-by-zero condition, and thus a
network crash, when the outputs of the hidden layer (or the output layer weights) are
dominated by negative terms. To circumvent this issue, one may add a small
constant value³ to the denominator of the softmax equation.
Given an input vector, h, the output of the jth output neuron
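The divide-by-zero safeguard described above can be illustrated with a short sketch. It is a hedged illustration only: the function name `soft_output`, the variable `z` (the weighted sums feeding the output neurons), and the test values are assumptions, and the exact form of the softmax equation used in this work is not reproduced here. The default constant follows the 0.001 value stated in this chapter.

```python
import numpy as np

def soft_output(z, eps=1e-3):
    """Softmax with a small constant added to the denominator.

    z   : weighted sums of the output neurons (hypothetical name)
    eps : small constant, 0.001 as stated in the text
    """
    e = np.exp(z)
    return e / (e.sum() + eps)

# Typical inputs: the result stays very close to a standard softmax.
t_typical = soft_output(np.array([1.0, 2.0, 3.0]))

# Strongly negative inputs: every exponential underflows to zero, so a
# plain softmax would evaluate 0/0, while this version stays finite.
t_negative = soft_output(np.array([-800.0, -900.0, -1000.0]))
```

One consequence of this design choice is that the outputs no longer sum exactly to one, although the deviation is negligible whenever the exponentials are not all close to zero.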








3.1.2 Learning Rule: Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a natural fit as a learning algorithm for the ELM
network when it is trained iteratively. SGD is well known for its effectiveness in training
various machine learning algorithms such as support vector machines (SVMs) [97]
and spiking neural networks (SNNs) [98]. In the context of ELM, using SGD to
update the output layer weights can be described as follows: given a synaptic
weight, $\beta_{ji}$, connecting the jth output neuron with the ith input neuron, for a given
input $h_i$ and an expected output $y_j$, the weight update can be computed using
Equation (3.4), where $\alpha$ is the learning rate.

$$\Delta\beta_{ji} = \alpha \times \underbrace{(t_j - y_j) \times h_i}_{=\,\text{gradient}} \tag{3.4}$$

3 The small constant added to the softmax denominator is set to 0.001 in this work.
Despite the simplicity of the above equation relative to Equation (3.2), computing
the error for every iteration and multiplying it by the corresponding
input demands non-trivial hardware resources (multipliers and subtractors). One
way to simplify the learning rule in Equation (3.4) is to consider only the sign
of the error and the input while learning, as suggested by [75] and explored further
for ELM networks in our previous work [47]. Though such an approach simplifies
the learning rule significantly, it slows down network convergence and degrades
its performance level (classification accuracy) by ≈ 2-8%. Thus, to reduce the
impact of considering only the signs while maintaining minimum resource usage
for the training system, we consider the error value and the input sign to update
the weight, as in Equation (3.5). This enhances ELM network performance and
significantly improves SoftELM performance, as will be discussed in Section 3.5.

$$\Delta\beta_{ji} = \alpha \times (t_j - y_j) \times \mathrm{sign}(h_i), \qquad \mathrm{sign}(h_i) = \begin{cases} 1 & h_i > 0 \\ -1 & h_i < 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.5}$$
In this work, the learning mechanisms in both ELM and SoftELM will be explored.
Additionally, the impact of simplifying the learning rule i.e. SGD on network
performance will be studied. For the rest of the chapter, ELM‡ and SELM‡ will
refer to an ELM and SoftELM trained with simplified SGD which considers the
sign of the error and sign of the input rather than their actual values, while ELM†
and SELM† will refer to the same networks trained with semi-simplified SGD which
considers the error value and input sign for learning (see Table 3.1).
Table 3.1: ELM network topologies trained with SGD with various levels of
simplification.

Network topology   Input         Error         Output
ELM‡               sign(input)   sign(error)   sigmoidal activation function
ELM†               sign(input)   error value   sigmoidal activation function
SELM‡              sign(input)   sign(error)   softmax
SELM†              sign(input)   error value   softmax
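The three levels of simplification listed in Table 3.1 can be compared side by side with a small sketch. It is a minimal illustration under stated assumptions: the function name `delta_beta`, the mode labels, and the numeric values are hypothetical, and the error is taken as (t − y), following the convention of the update equation in this chapter. How the resulting update is applied to the weights is left outside the sketch.

```python
import numpy as np

def delta_beta(t, y, h, lr, mode):
    """Weight update matrix (outputs x inputs) for one training sample.

    mode = "baseline"   : error value * input value
    mode = "semi"       : error value * sign(input)   (ELM-dagger style)
    mode = "simplified" : sign(error) * sign(input)   (ELM-double-dagger style)
    """
    error = t - y
    if mode == "baseline":
        return lr * np.outer(error, h)
    if mode == "semi":
        return lr * np.outer(error, np.sign(h))
    if mode == "simplified":
        return lr * np.outer(np.sign(error), np.sign(h))
    raise ValueError(f"unknown mode: {mode}")

t = np.array([0.8, 0.2])          # network response (illustrative)
y = np.array([1.0, 0.0])          # expected one-hot label
h = np.array([0.5, -0.25, 0.0])   # hidden layer outputs (illustrative)

updates = {m: delta_beta(t, y, h, lr=0.1, mode=m)
           for m in ("baseline", "semi", "simplified")}
```

In the simplified mode every non-zero update has the same magnitude (the learning rate), which is why it needs no multipliers but gives up information about how large the error is.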
3.2 Design Methodology
3.2.1 Learning Rule
SGD enables multi-layer neural networks to be trained in an online fashion.
The training starts by evaluating the network response (t) to a particular input
vector (h) with respect to the expected class labels (y). According to Equation (3.4),
this is done by finding the net difference between t and y (the error). In neuromemristive
designs, tj and yj are typically analog values and their difference can be computed
using at least two Op-Amps. When the Op-Amps are implemented in a 65nm
technology process, each Op-Amp consumes more than 17.5µW. For a network
with 100 output neurons, this costs at least 3.5mW of power. To keep
the training system power efficient, it is recommended to evaluate the output
layer neurons individually (time-multiplexing). However, this requires a power-efficient
sample-and-hold circuit to store the neuron outputs. We adopted the
sample-and-hold circuit proposed in [99], which consumes only 10nW. After
storing the neuron outputs in the sample-and-hold circuits, the error of the individual
neurons can be evaluated sequentially. Once an error term is computed, it is
multiplied by the neuron inputs through an analog multiplier and the combined
effect is used to update the corresponding memristor resistances.
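The power figures quoted above can be checked with a short calculation. The parallel estimate reproduces the 3.5mW figure from the text. The time-multiplexed estimate is only a rough lower bound under the assumption that a single two-Op-Amp subtractor is shared by all neurons and that each neuron needs one sample-and-hold circuit; it ignores the multiplexer, the multiplier, and any control logic.

```python
OPAMP_POWER_UW = 17.5        # per Op-Amp, 65nm process (from the text)
OPAMPS_PER_SUBTRACTOR = 2    # "at least two Op-Amps" per error computation
SAMPLE_HOLD_POWER_UW = 0.01  # 10 nW per sample-and-hold circuit [99]
N_OUTPUT_NEURONS = 100

# One subtractor per output neuron, all operating in parallel.
parallel_uw = N_OUTPUT_NEURONS * OPAMPS_PER_SUBTRACTOR * OPAMP_POWER_UW

# Assumed time-multiplexed scheme: one shared subtractor plus one
# sample-and-hold circuit per output neuron.
shared_uw = (OPAMPS_PER_SUBTRACTOR * OPAMP_POWER_UW
             + N_OUTPUT_NEURONS * SAMPLE_HOLD_POWER_UW)

print(f"parallel: {parallel_uw / 1000:.2f} mW")
print(f"time-multiplexed (assumed scheme): {shared_uw:.2f} uW")
```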
More power savings can be achieved by simplifying the learning rule in
Equation (3.4), as discussed earlier. This involves considering only the sign of the error
and the input while learning, as suggested by [75]. To closely observe the impact
of considering the signs, we commence by considering the error value and the sign
of the input (semi-simplified SGD). The error term is computed using Op-Amps
as discussed earlier. Then, the combined effect with the input sign, which can
be found by using either a comparator or a logic inverter, is used to update the
memristor resistance. Further simplification is achieved by considering the sign of
the error and the sign of the input (simplified SGD). Here, the common approach
to evaluating the error term's sign is to use a comparator to compare tj
and yj. However, the comparator can only indicate whether tj is larger or smaller than yj;
it cannot indicate equality between them. Therefore, we developed a
winner-take-all (WTA) circuit [31] (discussed in Chapter 6) that covers all
comparison cases and consumes 3× less power than the baseline comparator
circuit. Once the error term sign is evaluated, it is multiplied by the input sign
using an XOR logic gate to estimate $\Delta\beta_{ji}$. After $\Delta\beta_{ji}$ is estimated, the update
of the synaptic weights is carried out using a writing circuit named Ziksa. Table 3.2
summarizes the estimated resources, power consumption, and area of adopting the
aforementioned approaches in simplifying the SGD.
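The reason an XOR gate is sufficient to multiply two signs can be seen from a small behavioural sketch. The encoding below (sign bit 1 for negative values, and a separate zero check standing in for the equality case that the WTA circuit detects) is an assumption made for illustration; it is not the circuit-level encoding used in this work.

```python
def sign_bit(x):
    """Assumed encoding: 1 for negative values, 0 otherwise."""
    return 1 if x < 0 else 0

def update_direction(error, h):
    """Direction of the weight update under simplified SGD: -1, 0, or +1."""
    # No update when the error is zero (equality case) or the input is zero.
    if error == 0 or h == 0:
        return 0
    # XOR of the two sign bits is 1 exactly when the signs differ,
    # which is when the product of the two values is negative.
    return -1 if sign_bit(error) ^ sign_bit(h) else 1

# Compare against the sign of the actual product for a few values.
for error in (-0.3, 0.0, 0.4):
    for h in (-1.0, 0.0, 2.0):
        product = error * h
        expected = (product > 0) - (product < 0)
        assert update_direction(error, h) == expected
```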
Table 3.2: Comparison of the SGD-based training system with different simplification
levels for a network size of 100×10.

                        Baseline SGD              Semi-Simplified SGD       Simplified SGD
Storage                 Sample-and-Hold + Mux     Sample-and-Hold + Mux     2-bit SIPO shift register
Error evaluation        Multiplier & subtractor   Subtractor                2-cell WTA
Neuron error            Analog                    Analog                    2-bit digital
Absolute value circuit  Needed                    Needed                    Not needed
ADCs and DACs           Needed                    Not needed                Not needed
Performance             Recommended for small     Recommended for small     Not recommended for
                        and large datasets        and large datasets        small datasets
Power consumption       4565µW                    80µW                      13.41µW
Area                    100007.81µm²              8371.08µm²                8367.56µm²

3.2.2 Writing Circuit: Ziksa
The proposed writing circuitry, Ziksa [45], has four main transistors (H-Bridge
configuration) connected to the memristor crossbar terminals as shown in Figure 3.2-a.
During the training session, these transistors apply proper voltages to the
memristors to change their conductance. Because the memristors
share the crossbar grids, their conductance cannot be easily updated in a parallel
fashion. Thus, it is common to train memristor crossbars column-wise [77]. Training
memristor crossbars column-wise requires isolating the crossbar columns that are
not involved in the training so that their memristors are not accidentally tuned.
There are various approaches to blocking the columns that are not involved in the
training. One of these approaches, which we proposed in [45], involves adding
blocking transistors to the H-Bridge transistor pairs connected to the crossbar
columns (T5 in Figure 3.2-c). The blocking transistors connect the columns that
are not involved in training to Vb so that the potential across the memristors in the
untrained columns is always less than the device threshold. Here, |Vb| > |Vtr| − |Vth|,
where Vtr is the training voltage and Vth refers to the memristor threshold voltage.
The above writing circuit is simple and area efficient, but not necessarily power
efficient, as the columns that are not involved in the training still sink current. One
may trade off between the power consumption and area of this circuitry by using a
memristor crossbar with pass transistors, i.e. a 1T1M crossbar. Typically, such a
crossbar is 10× larger in area than the 0T1M crossbar [100], but it offers attractive
features. The pass transistors of the 1T1M crossbar can be used to block the memristors that
are not involved in the training, so that negligible power is consumed by the
untrained columns, unlike in the previous approach. Also, the pass transistors in the
crossbar can be exploited during training to perform more precise tuning. When
Ziksa circuits are used to tune the memristor devices, the pass transistors can work
as control channels to limit the amount of current passing through the memristors
under training (shown in Figure 3.2-d). In this work, the pass transistors are
controlled by the absolute value of the SGD error term.
There are two versions of the writing circuit developed in this work: Ziksa-VM [45, 46]
and Ziksa-CM [47]. Ziksa-VM operates on voltage-threshold memristor devices
integrated into a crossbar structure. Given the writing circuit shown in Figure 3.2-c
connected to a 2×2 crossbar, the circuit operates as follows: if the first column
is selected for training (its T5 is set to be OFF), the second column is blocked (its
T5 is connected to Vb). Because two memristors within the same column cannot be
updated in two different directions simultaneously, two update steps are needed to
adjust the memristors in each column. During the first step, the memristors whose
resistance needs to be decreased are adjusted. This is done by directing the current
to flow from +Tr towards -Tr, i.e. T1 and T4 are set to be ON, while T2 and T3
are OFF. During the second step, the opposite takes place: T2 and T3 are turned
ON, while T1 and T4 are OFF. This directs the current to flow from -Tr towards +Tr,
changing the memristors whose resistance needs to be increased.

Figure 3.2: (a) 2×2 memristive crossbar driven by a Ziksa unit. (b) and (c) Ziksa units
for adjusting current-threshold and voltage-threshold memristor devices integrated
into the crossbar structure. (d) Ziksa unit with pass transistors used during
training to achieve precise memristor tuning.
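The column-wise, two-step update sequence of Ziksa-VM described above can be summarized as a behavioural schedule. The sketch below is not a circuit model: the function names, the dictionary layout, and the example voltages are hypothetical, while the transistor labels (T1 to T4), the blocking of untrained columns, and the blocking-voltage condition follow the description in this section.

```python
def blocking_voltage_ok(v_b, v_tr, v_th):
    """Check the blocking condition |Vb| > |Vtr| - |Vth| from the text."""
    return abs(v_b) > abs(v_tr) - abs(v_th)

def ziksa_vm_schedule(directions):
    """Build the two-step update schedule for every crossbar column.

    directions[row][col] is -1 to decrease a memristor's resistance,
    +1 to increase it, and 0 to leave it unchanged.
    """
    n_rows, n_cols = len(directions), len(directions[0])
    steps = []
    for col in range(n_cols):
        # Columns not under training are tied to Vb through their T5.
        blocked = [c for c in range(n_cols) if c != col]
        decrease = [r for r in range(n_rows) if directions[r][col] < 0]
        increase = [r for r in range(n_rows) if directions[r][col] > 0]
        # Step 1: T1 and T4 ON, resistance of selected devices decreases.
        steps.append({"column": col, "blocked": blocked,
                      "on": ("T1", "T4"), "rows": decrease})
        # Step 2: T2 and T3 ON, resistance of selected devices increases.
        steps.append({"column": col, "blocked": blocked,
                      "on": ("T2", "T3"), "rows": increase})
    return steps

# 2x2 example: update directions for each memristor (illustrative).
schedule = ziksa_vm_schedule([[-1, +1],
                              [+1, 0]])
```

Each column therefore costs two write steps regardless of how many of its memristors change, so the training time of one sample scales with the number of columns rather than the number of devices.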
The same concept applies to the second writing circuit, Ziksa-CM, which operates
on current-threshold memristor devices integrated in the crossbar structure. Here,
the H-Bridge transistors are connected to two current mirrors to limit the amount of
current flowing in the memristor, which consequently ensures consistent adjustment
of the memristor resistance. By allowing the current to flow in both directions, the
memristor resistance can be increased or decreased. Usually, the amount of current is
limited to ≈ I− when the memristor resistance increases, and to ≈ I+ when it
decreases. A unique feature of Ziksa-CM is that it allows tuning multiple memristors
connected in series, as they draw the same tuning current. This is beneficial for
developing stochastic synapses and for stacking low resistance memristors to build
memristors with a high conductance range.
3.3 System Design and Analysis
The system architecture of ELM⁴ consists of the input, hidden, and output
layers. As discussed earlier, the hidden and output layers have the same
structure except that the output layer requires additional units to evaluate network
performance and to carry out the training. Therefore, in this section we
place emphasis on the output layer, but the same applies to the hidden layer⁵.
The architecture that emulates the output layer in the ELM network is depicted
in Figure 3.3. It is mainly composed of three units: a memristive layer, an error
computing unit, and a training system. The memristive layer models a single layer of
neurons in the ELM network and is dedicated to carrying out the input-weight
vector-matrix multiplication and applying non-linearity to the resulting weighted sum.
The error computing unit is used to evaluate network performance and compute
the final network error. The final network error is later used by the training system
to adjust the synaptic weights so that the network converges to an optimal solution. In
the following subsections, each unit in the system architecture is discussed in
more detail.
4 The SoftELM has a similar system architecture to that of ELM, but the output neurons in the
SoftELM are designed to carry out a softmax operation. A winner-take-all circuit proposed in our
previous work can perform such a task [31].
5 The pre-formed memristors in a crossbar have random conductance with a Gaussian
distribution [22]. This randomness can be leveraged to initialize the hidden and output layer weights.



































S&H: Sample and hold
Figure 3.3: Architecture of a single layer network consisting of the memristive
layer, training system, and error computing unit. The memristive layer core units
are the neuron circuits and their associated synaptic connections. The training
system is composed of driving transistors (Tr+ and Tr−), column local controllers
(CLCs), and row local controllers (RLCs), while the error computing unit has a
subtractor, absolute value circuit, and sign detection circuits.
3.3.1 Memristive layer
The core functional units of the memristive layer are the neuron circuits and their
associated synaptic connections. The neurons are implemented using Op-Amps in
an inverting configuration to achieve a non-linear activation function⁶, whereas their
synaptic connection weights are mapped onto a crossbar structure. Leveraging the
crossbar structure here enables efficient vector-matrix multiplication of the input
features and their corresponding synaptic weights, with significant enhancement in
computational speed and energy efficiency.
6 The neuron circuits in Figure 3.4 can be configured to obtain tanh or sigmoid activation functions.
Figure 3.4 illustrates the memristive layer with $n_k$ output neurons. Each neuron is
stimulated by the preceding layer of neurons, which are connected to the crossbar
rows through a set of weighted synaptic connections. Typically, every synaptic
connection weight is mapped to two memristors, where the net difference between
their conductances ($G^{+}$ and $G^{-}$) results in a bi-polar weight representation (as
in Equation (3.6)). Thus, it is common to use two crossbars to model the set of
synaptic weights.

$$G^{+} - G^{-} = \begin{cases} \text{Positive value}, & \text{if } G^{+} > G^{-} \\ 0, & \text{if } G^{+} = G^{-} \\ \text{Negative value}, & \text{if } G^{+} < G^{-} \end{cases} \tag{3.6}$$
Mapping the synaptic weights to two crossbar arrays complicates
the learning process, as two crossbars need to be trained rather than one. Moreover,
a consistent change in the memristors on the positive and negative crossbars must
be sustained to ensure network convergence. Therefore, this research suggests a
different approach (called the semi-trained crossbar) to realize the synaptic weights [47].
The concept behind the semi-trained crossbar is that one fixed and one tunable
memristor are used to model each bi-polar weight. In the crossbar structure, this
can be modeled by an $n_h \times n_k$ crossbar array of tunable memristors and a fixed set
of reference memristors mapped to an $n_h \times 1$ crossbar. Hence, the net difference
between any tunable memristor and the corresponding fixed memristor results
in a bi-polar weight representation. Thus, when the input vector h is presented to
the network in the form of voltage, an output vector t is produced, in which each
element (tj) is given by Equation (3.7), assuming Vs=0.





















In Equation (3.7), f is the jth neuron activation function, which is a non-linear
activation function for ELM and a softmax function in the case of SoftELM. Mji = 1/Gji
refers to the resistance of the memristor connecting the jth output neuron with the
ith input neuron, Mri = 1/Gri is the reference memristor resistance of the ith row,
and Rf denotes the inverting Op-Amp feedback resistor. Here the output of each
neuron is stored in a sample-and-hold circuit [99] so that it can be used later to
compute the neuron error during training.














Figure 3.4: A single layer network including neuron circuits and their memristive
synaptic connections modeled using the semi-trained crossbar.
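The semi-trained mapping can be illustrated numerically. The sketch below assumes that each effective weight takes the form w_ji = Rf (1/Mji − 1/Mri). This form is an assumption: it is consistent with the symbol definitions above and with the weight limits of about ±4.5 quoted in this chapter for Rf = 2 MΩ, Ron = 200 kΩ, Roff = 2 MΩ, and Mr = 364 kΩ, but the op-amp sign convention is not modeled. The function names are hypothetical.

```python
import numpy as np

R_F = 2e6      # feedback resistor (ohms), value quoted in the text
R_ON = 200e3   # minimum memristor resistance (ohms)
R_OFF = 2e6    # maximum memristor resistance (ohms)
M_REF = 364e3  # fixed reference memristor (ohms)

def effective_weights(m_tunable, m_ref=M_REF, r_f=R_F):
    """Assumed mapping: w = Rf * (1/M_tunable - 1/M_ref)."""
    m_tunable = np.asarray(m_tunable, dtype=float)
    return r_f * (1.0 / m_tunable - 1.0 / m_ref)

def layer_output(h, m_tunable, activation=np.tanh):
    """h: (n_h,) input voltages; m_tunable: (n_h, n_k) tunable resistances."""
    w = effective_weights(m_tunable)                     # (n_h, n_k) bipolar weights
    return activation(np.asarray(h, dtype=float) @ w)    # (n_k,) neuron outputs

w_max = float(effective_weights(R_ON))    # lowest resistance, most positive weight
w_min = float(effective_weights(R_OFF))   # highest resistance, most negative weight
```

Under this assumed mapping, a tunable memristor equal to the reference yields a zero weight, so a single trained device per synapse is enough to reach both signs.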
3.3.2 Error Computing Unit
The error computing unit is used to evaluate the network response (t) to a particular
input vector (h) with respect to the expected class labels (y). The error computing
unit consists of a subtractor, a sign detection circuit, and an absolute value circuit.
The subtractor and the sign detection circuit can be implemented using an Op-Amp
and a comparator (or inverter), whereas the absolute value circuit, whose
output characteristic is shown in Figure 3.6, is implemented using two Op-Amps
(single positive supply) configured as shown in Figure 3.5 [101]. To minimize
resources, the error computing unit is time-multiplexed among all output neurons.
During the training phase, in which the memristor crossbar is trained sequentially,
one column at a time, only the error corresponding to the column under training is
computed. For instance, if the jth column is selected for training, its neuron output
stored in the jth sample-and-hold circuit is relayed to the subtractor unit via the
analog multiplexer. Meanwhile, the testbench supplies the expected class label
to the subtractor so that the neuron error is computed. The computed error is
then relayed to the absolute value circuit, which produces abs(tj − yj), and to the
sign detection circuit. The absolute value circuit converts the error to a positive range
to drive the pass transistors. The sign detection circuit detects the sign of the error
signal, which, in combination with the sign of the input feature, determines whether a
memristor's resistance should be increased or decreased.
Figure 3.5: Absolute value circuit to compute the absolute of the error signal.
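The behavior of the error computing unit can be summarized in a few lines. This is a behavioral sketch only, not a circuit model: the function names are hypothetical, and the sign is returned as +1, 0, or −1 rather than as a logic level, since the mapping of error polarity to the High and Low levels of S(Er) is not stated here.

```python
def error_unit(t_j, y_j):
    """Subtractor, sign detection, and absolute value stages for one neuron."""
    err = t_j - y_j                     # subtractor output
    sign = (err > 0) - (err < 0)        # sign detection: +1, 0, or -1
    return abs(err), sign               # magnitude drives the pass transistors

def time_multiplexed_errors(sampled_outputs, labels):
    """Reuse the single error unit for each column, one column at a time."""
    return [error_unit(t, y) for t, y in zip(sampled_outputs, labels)]
```

Because only one column is trained at a time, a single error unit can be shared, which is the resource saving described above.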
3.3.3 Training system
This unit is responsible for adjusting the memristive synaptic weights in the crossbar
structure with the ultimate goal of network convergence. Figure 3.7 illustrates the



























Figure 3.6: (Left) Input and output signals to the absolute value circuit. (Right)
Input and output characteristic of the absolute value circuit implemented in IBM
65nm process.
integration of the training system (excluding the main controller) with one column
of the memristive crossbar. The system consists of a writing circuit, Ziksa-VM
(Tr+ and Tr−), and local row and column controllers. Ziksa⁷ is used to apply the
proper programming voltage across the column memristors under training, while
the local controllers drive Ziksa according to the computed gradient. Training
the jth neuron weights is performed by applying the output of the absolute value
circuit, i.e., the error magnitude, to the pass transistors, while allowing the current to
flow in either direction by using Ziksa's transistors. The configuration
of Ziksa is determined after the row and column controllers receive the following signals:
En, ColEn, S(Er), S(hi), and Polar. En is used to activate the network training
phase, whereas ColEn selects the column under training. S(Er) is the sign of
the error signal and S(hi) is the sign of the input feature. The Polar signal indicates
whether the training cycle is positive or negative. Typically, training each column
in the crossbar is performed in two cycles. During the positive cycle (Polar =
High), the memristors whose resistance needs to be decreased are adjusted, while
the memristors whose resistance needs to be increased are adjusted in the negative
⁷ One may add another pass transistor between the Tr+ output terminal and the memristors stacked
in a column to achieve a symmetrical writing circuit.

















Figure 3.7: The training system as integrated in one column of the memristive
crossbar. The system has Ziksa driving transistors (Tr+ and Tr−) controlled by
the row local controller (left) and column local controller (right).
cycle (Polar = Low). Table 3.3⁸ lists the driving signals of the local control units
and Ziksa that enable bi-directional weight adjustment.
Table 3.3: Driving signals of the local control units and Ziksa to enable bi-directional
weight adjustment.
∼En     Polar   S(Er)   S(h)    Pr      Nr      ColEn   Pc      Nc      ΔM
Highᵃ   Low     Lowᵃ    Low     High    Low     High    Low     Low     N.C.
High    Low     Low     High    High    High    High    Low     Low     ↑
High    Low     High    Low     High    High    High    Low     Low     ↑
High    Low     High    High    High    Low     High    Low     Low     N.C.
High    High    Low     Low     Low     Low     High    High    High    ↓
High    High    Low     High    High    Low     High    High    High    N.C.
High    High    High    Low     High    Low     High    High    High    N.C.
High    High    High    High    Low     Low     High    High    High    ↓
Low     x       x       x       High    Low     Low     High    Low     N.C.ᵇ
ᵃ The high and low states indicate the logic levels '1' and '0', respectively.
ᵇ N.C.: No change.
⁸ No-change states that occur when the input or error signals are zero are not included in the table.
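The update direction encoded in Table 3.3 reduces to a small truth function. The sketch below reads the last column of the table as the direction of the memristor resistance change, which agrees with the statement that resistance is decreased in the positive cycle and increased in the negative cycle. It treats the first column as an active-high training enable, which is an assumption, and the function name is hypothetical. Logic levels are passed as booleans (True = High).

```python
def resistance_update(enable, polar, s_er, s_h):
    """Return 'increase', 'decrease', or None (no change) for one memristor."""
    if not enable:
        return None                      # training disabled: no change
    if not polar and s_er != s_h:
        return "increase"                # negative cycle, signs differ
    if polar and s_er == s_h:
        return "decrease"                # positive cycle, signs agree
    return None                          # handled in the other cycle

# Each memristor is visited in both cycles; exactly one of them updates it.
both_cycles = [resistance_update(True, p, True, False) for p in (True, False)]
```

Splitting the update into two cycles means that every memristor in the selected column sees at most one programming pulse direction per cycle.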
3.4 Chip Design and Fabrication
The first generation of the ELM network design, including the training system and the
writing and control circuits, is fully custom-designed in Cadence Virtuoso using the
IBM 65nm technology node. A bottom-up approach is followed in the design
process. Every primitive, such as the standard logic gates, registers, and comparator,
is built at the transistor level and extensively verified. Then, more complex
building blocks such as counters, accumulators, and neuron circuits are designed.
The design is laid out manually, with every component carefully placed to
reduce the overall die area. All large transistors are replaced with multiple-finger
transistors, and several layout optimization techniques such as common-centroid
and interdigitated placement are adopted to achieve better transistor matching and to
reduce parasitics. The main building blocks are verified separately with design rule
check (DRC) and layout versus schematic (LVS), and all of them pass with no errors.
Figure 3.8 illustrates the physical layout⁹ of the control circuitry and error
computing unit, which are used to enable training of a small (4×4) network.
Since the training is performed column-wise, resource sharing is adopted to minimize
the area and power consumption. For instance, the column controller is shared by
all columns, and encoders are used to select the column under training. The circuit-level
designs of the components used to build the neurons and other computational
blocks are provided in Appendix B. Figure 3.9 shows the physical layout after the
fabrication process. It is important to mention here that during the first round of
fabrication, only the writing scheme, including the control circuitry, is fabricated
due to the limited die area and number of I/Os. In future rounds, the chip
⁹ Prior to the fabrication process, the circuit parasitics are extracted and a post-layout simulation
is performed.







Figure 3.8: Physical layout of the fabricated training circuit including Ziksa writing
circuit.
components modeling ELM or other networks will be placed in a single die.
Figure 3.9: Training circuit including Ziksa writing circuit after fabrication.
3.5 Experimental Results
3.5.1 Binomial and Multinomial Data Classification
The classification accuracy of various ELM network topologies trained with SGD is
evaluated using different benchmarks. The benchmarks are selected to cover diverse
data sizes and dimensionalities, such as Iris (small size and low dimensionality) [102],
StatLog-Image Segmentation and Breast Cancer Wisconsin-Diagnostic (moderate
size and low dimensionality) [103], and MNIST [104] and Fashion-MNIST [105] (large
size and high dimensionality) [106]. Table 3.4 presents the accuracy of the ELM
and SoftELM for both the baseline version¹⁰ and the hardware behavioral model¹¹,
averaged over 10 runs for small and moderate size datasets and 5 runs for large
datasets. The results show that the baseline SoftELM always performs better
than ELM, with ≈1-2% improvement in accuracy. In the case of the hardware
model, it is found that when only the signs of the error and input are considered
for learning (ELM‡ and SoftELM‡), the network can experience degradation in
accuracy that can reach up to ≈4% for ELM and ≈33% for SoftELM. In contrast,
the semi-simplified versions of both topologies, ELM† and SoftELM†, exhibit more
stability (small standard deviation) and performance comparable to the baselines.
Table 3.4: Classification accuracy (mean ± standard deviation) of binomial and
multinomial datasets using ELM and SoftELM networks. The networks are trained
using SGD and the accuracy is reported for both the software version (baseline)
and the hardware behavioral model.
                          # Hidden   Baseline-SW                   Hardware
Dataset                   Neurons    ELM            SoftELM        ELM‡           SoftELM‡       ELM†           SoftELM†
MNISTᵃ                    180        96.05 ± 0.1    96.33 ± 0.07   93.66 ± 0.4    89.32 ± 0.58   94.94 ± 0.24   95.88 ± 0.36
Fashion-MNISTᵃ            180        81.12 ± 0.25   82.32 ± 0.54   78.69 ± 0.61   72.9 ± 0.85    80.36 ± 0.41   80.83 ± 0.31
IRIS                      40         94.86 ± 3.38   96.36 ± 1.44   91.52 ± 3.59   89.33 ± 2.45   94.32 ± 2.23   95.79 ± 2.67
StatLog-Image Segment     50         90.36 ± 0.11   92.15 ± 1.07   82.7 ± 3.43    62.7 ± 3.76    86.7 ± 2.19    87.64 ± 0.75
Breast Cancer Wisconsin   50         96.36 ± 1.51   97.76 ± 1.51   95.29 ± 0.9    61.69 ± 2.21   95.56 ± 1.48   96.44 ± 1.18
ᵃ All MNIST and Fashion-MNIST images are preprocessed by a HOG filter prior to being
presented to ELM or SoftELM.
¹⁰ The baseline version is the software model without hardware constraints.
¹¹ The hardware behavioral model of ELM and SoftELM incorporates the hardware constraints
and memristor device non-idealities, such as 10% memristor device variability (which is equivalent
to a variable learning rate) and 10% cycle-to-cycle variability. The memristor model is fitted to
the physical device in [107, 108].
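The levels of learning-rule simplification compared in Table 3.4 can be sketched as variants of a delta-rule update on the output-layer weights. The "sign" mode follows the description of ELM‡ and SoftELM‡ (only the signs of the error and input are used). The "semi" mode is an assumption about the semi-simplified (†) variant, namely that it keeps the error magnitude and uses only the sign of the input, in line with the training system applying the error magnitude to the pass transistors. The learning rate `lr`, the `mode` names, and the omission of the activation derivative are illustrative choices, not details taken from the source.

```python
import numpy as np

def sgd_update(w, h, err, lr=0.01, mode="full"):
    """w: (n_h, n_k) weights; h: (n_h,) hidden outputs; err: (n_k,) values of t - y."""
    h = np.asarray(h, dtype=float)
    err = np.asarray(err, dtype=float)
    if mode == "full":        # software baseline: full-precision delta rule
        grad = np.outer(h, err)
    elif mode == "semi":      # assumed dagger variant: input sign, full error
        grad = np.outer(np.sign(h), err)
    elif mode == "sign":      # double-dagger variant: signs only
        grad = np.outer(np.sign(h), np.sign(err))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return w - lr * grad
```

In the sign-only mode every update has the same magnitude `lr`, so the step size no longer shrinks as the error approaches zero, which is one plausible reason for the lower stability reported for the ‡ variants.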
3.5.2 Analysis of Stuck-at-Faults
We study the network's resiliency to device faults, specifically investigating the
impact of stuck-at faults while classifying MNIST images. Two scenarios are
considered. In the first, the faults already exist at the beginning of the training
process; this sort of fault usually occurs due to fabrication issues. In the second,
the faults are introduced after several epochs, which is equivalent to experiencing
endurance failure. Such analysis provides insight into the network's robustness to
device failure in general and the possibility of recovery after the network experiences
a device fault. Figure 3.10-(Top) and Figure 3.11-(Top) show the network accuracy versus
the rate of device failure for stuck-at, stuck-on, and stuck-off faults. The results
show a slight improvement in ELM† accuracy over ELM‡. SELM‡
exhibits severe degradation in accuracy compared to the other networks (ELM‡,
ELM†, SELM†), whereas SELM† manifests high tolerance to device failure, with
accuracy of more than 82% in the presence of 50% faults. Figure 3.10-(Bottom)
and Figure 3.11-(Bottom) illustrate the weight distribution of each of the above
networks when experiencing 50% stuck-off faults in the output layer weights. It
can be seen that the severe drop in SELM‡ accuracy is attributed to weight saturation
at either extreme¹². This makes neurons respond excessively to any presented input,
which eventually leads to more confusion at the output. In contrast, SELM†
seems to possess better capability in tuning the unaffected weights in a manner
that minimizes network error and improves its performance.
¹² The weights in the hardware behavioral model are limited to between -4.5 and 4.5 for the
following parameters: Rf = 2 MΩ, Ron = 200 kΩ, Roff = 2 MΩ, and Mr = 364 kΩ.
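A stuck-at fault experiment of this kind can be emulated in a behavioral model by pinning a random fraction of the weights and excluding them from later updates. The sketch below is illustrative only: the function names are hypothetical, and mapping stuck-off devices to the most negative weight (high resistance) and stuck-on devices to the most positive weight is an assumption about how the faults translate to weights, using the ±4.5 limits quoted for the hardware behavioral model.

```python
import numpy as np

W_MIN, W_MAX = -4.5, 4.5   # weight limits of the hardware behavioral model

def inject_stuck_faults(w, rate, kind="off", rng=None):
    """Pin a random fraction `rate` of the weights; return (weights, fault mask)."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.array(w, dtype=float)                      # work on a copy
    n_faulty = int(round(rate * w.size))
    idx = rng.choice(w.size, size=n_faulty, replace=False)
    mask = np.zeros(w.size, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(w.shape)
    # Assumption: stuck-off pins the weight low, stuck-on pins it high.
    w[mask] = W_MIN if kind == "off" else W_MAX
    return w, mask

def masked_update(w, delta, mask):
    """Update healthy weights (clipped to the limits); faulty ones stay pinned."""
    return np.where(mask, w, np.clip(w + delta, W_MIN, W_MAX))
```

Calling `inject_stuck_faults` before the first epoch emulates fabrication faults, while calling it after several epochs emulates endurance failure.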
Figure 3.10: (Top) Classification accuracy versus rate of device failure for different
topologies (a) ELM‡, and (b) ELM†. (Bottom) Output layer weight distributions
for the corresponding networks in (top) when the network experiences 50% stuck-off
faults.
To evaluate the network's capability to recover after experiencing device failure,
we performed this experiment with 50% stuck-off faults. Figure 3.12 demonstrates
the recovery of various network topologies trained on MNIST when the faults are
introduced at the 10th epoch. Although all the networks suffer severe
degradation in accuracy, all except SELM‡ manifest the capability to recover.
Furthermore, SELM† still converges to the highest accuracy among the
network topologies.
Figure 3.11: (Top) Classification accuracy versus rate of device failure for different
topologies (a) SELM‡, (b) SELM†. (Bottom) Output layer weight distributions for
the corresponding networks in (top) when the network experiences 50% stuck-off
faults.



















Figure 3.12: Network recovery for different topologies after experiencing 50% device
stuck-off faults at the 10th epoch.
3.5.3 Network Resilience
The impact of device failure on network performance can be mitigated by increasing
the number of neurons in the hidden layer, as investigated by Romero et al. [42].
In this study, the ELM† and SELM† networks with 150 neurons in the hidden
layer are trained on the MNIST dataset while experiencing 30% device failure.
Then, we increase the number of hidden neurons in steps of 50 neurons per
experiment. Figure 3.13 shows the network performance as a function of the
number of hidden neurons. Increasing the number of hidden
neurons results in weight redundancy, which improves the network's ability to
restore its performance to the baseline (the network without damaged devices).
Around 250 hidden neurons are needed to regain the accuracy of the ELM†
network, whereas 150 neurons are enough to restore the performance of SELM†.
Figure 3.13: Classification accuracy as a function of increasing number of hidden
neurons when experiencing 30% device faults for (a) ELM† and (b) SELM†.
3.5.4 Power Consumption
The total power consumption of the ELM† and SELM† is estimated for a small
network, 4×10×3, which can be used to classify the Iris dataset. The network
is custom-designed using IBM 65nm CMOS technology and
Cadence tools. Table 3.5 lists the power consumption of the individual
components, building units, and full-size network for both ELM† and SELM†.
Recall that a full-size network primarily consists of memristive crossbars and
neuron circuits. The average power consumption of the memristive crossbar is
measured for the Iris dataset in the worst-case scenario, in which every memristor
is at its minimum resistance, 200 kΩ. This results in power consumption of 27.63 µW and
22.27 µW for the hidden layer and output layer crossbars, respectively. Here, it can
be observed that although the power consumption of the memristive crossbar is low,
the ELM† output layer power consumption (92.33 µW) is still high compared
to that of the SELM† (57.16 µW). The reason is the use of Op-Amps to realize
the neuron circuits. In contrast, the SELM† output layer consumes ≈1.6× less
power than the ELM† output layer, as it utilizes a winner-take-all (WTA) circuit
to implement the output layer neurons and realize the softmax operation. Using
WTA suppresses the current in losing neurons while keeping it maximum in the
winners. Consequently, lower power consumption is achieved even when the circuit
scales up in terms of the number of neurons.
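The figures quoted above can be cross-checked with a few lines of arithmetic on the values listed in Table 3.5. The variable names are illustrative. Reading the remainder of each network total as power spent outside the two layers (error unit, training system, controllers), and comparing the SoftELM output layer against the sum of its crossbar and the WTA circuit, are interpretations made here rather than breakdowns stated in the source.

```python
# All values in microwatts, copied from Table 3.5.
hidden_layer = 220.6
output_elm, output_softelm = 92.33, 57.16
total_elm, total_softelm = 383.24, 348.06
crossbar_output, wta = 22.27, 34.9

# Output-layer power of ELM relative to SoftELM.
ratio = output_elm / output_softelm

# Power not accounted for by the hidden and output layers.
other_elm = total_elm - (hidden_layer + output_elm)
other_softelm = total_softelm - (hidden_layer + output_softelm)

# Output crossbar plus WTA, for comparison with the SoftELM output layer.
softelm_output_estimate = crossbar_output + wta

print(f"ELM / SoftELM output-layer power: {ratio:.2f}x")
print(f"Outside the layers: ELM {other_elm:.2f} uW, SoftELM {other_softelm:.2f} uW")
print(f"Crossbar + WTA: {softelm_output_estimate:.2f} uW")
```

With the tabulated values, the ratio evaluates to about 1.62, roughly 70.3 µW of each network total lies outside the two layers, and the crossbar plus WTA sum (57.17 µW) closely matches the listed SoftELM output layer power.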
The results also show that the power consumption of the error unit is dominated
by the absolute value circuit, as two single-rail Op-Amps are utilized to compute
the absolute value of the error signal. However, time multiplexing among the neurons
results in significant power savings. In this instance of the training system, the
reduced complexity of the local controllers leads to low power consumption. This is
advantageous as the number of local controllers scales linearly with the crossbar








[Figure legend: Neuron Circuitry, Training System, Crossbars]
Figure 3.14: Power consumption distribution for ELM (ELM†) and SoftELM
(SELM†).
size. Figure 3.14 illustrates the distribution of power consumption for both ELM†
and SELM†. Here we see that most of the network power is actually devoted to
the neurons and intermediate peripheral circuitry rather than the training system.
Further reduction in power consumption of neurons may be achieved if slower
Op-Amps are used (based on the task).
3.6 Summary
In this chapter, an in-situ training system for memristor-based multi-layer neural
networks is presented. The training system is studied within the context of
conventional ELM and SoftELM networks, with the core learning rule represented by
stochastic gradient descent (SGD). Using SGD not only enables online learning,
but also provides additional freedom to simplify the learning rule. Simplifying the
learning rule may gracefully degrade the network performance, but can result in
orders of magnitude of power saving. The proposed architecture (trained using SGD
with various levels of simplification) is benchmarked across different classification
tasks and in the presence of memristor device variability and device failure. The
Table 3.5: Total power consumption of the individual components, building units,
and the full size network.
Component                      Specs                            Power consumption
Op-Amp                         3-stage, PMOS-input              17.54 µW
Comparator                     4-stage + internal hysteresis    21.63 µW
WTA-1000                       Voltage-mode, 1000 cells         34.9 µW
Sample & Holding [99]          -                                10 nW
Sign circuit                   Inverter-based                   1.7 µW
Sign circuit                   Comparator-based                 35.17 µW
Subtractor circuit             -                                15.06 µW
Absolute value circuit         -                                52.98 µW
Row local controller           -                                0.04 µW
Column local controller        -                                0.036 µW
Main controller                -                                0.243 µW
Semi-Crossbar (hidden layer)   Size = 4×10+4ᵃ                   27.63 µW
Semi-Crossbar (output layer)   Size = 10×3+10                   22.27 µW
Hidden Layer                   Layer size = 4×10+4              220.6 µW
Output Layer (ELM)             Layer size = 10×3+10             92.33 µW
Output Layer (SoftELM)         Layer size = 10×3+10             57.16 µW
ELM Network (Total)            Network size = 4×10×3            383.24 µW
SoftELM Network (Total)        Network size = 4×10×3            348.06 µW
ᵃ The added term refers to the size of the reference memristor column in the crossbar.
experimental results show that SoftELM overall yields better performance and
resiliency as compared to conventional ELM. When we investigated networks’
capability to regain their performance via increasing the number of hidden neurons,





Hierarchical temporal memory (HTM) [26, 27] is a theoretical framework that aims
to explain how spatial and temporal sensory information is processed by neuronal
cells, which are grouped into cellular layers stacked in a hierarchical structure. In
its early stages, HTM, which was proposed by Jeff Hawkins in his book On Intelligence
(2005) [26], was a theory that resulted from repackaging and reinterpreting earlier
theories such as the memory-prediction theory and Bayesian neural networks (BNNs).
Later, the algorithm was radically modified with strong reliance on neuroscience to
closely capture the structural and algorithmic properties of the neocortex. HTM has
demonstrated high capability in processing spatial and temporal information with
a high degree of plasticity while learning models of the world. HTM also exhibits
natural compatibility with continuous online learning (there are no
distinct training and testing phases) [28], noise and fault tolerance [29], and low
power consumption, which is achieved through sparse neuronal activity [30, 46].
These properties make the algorithm attractive for a wide range of applications
such as visual object recognition and classification [32, 109, 33], prediction of data
streams [34], natural language processing, and anomaly detection [110, 35]. In
this chapter, an overview of the HTM network and its algorithmic properties is
provided. The processes of encoding sensory information before presenting it to the
CHAPTER 4. HIERARCHICAL TEMPORAL MEMORY 53
HTM network and translating HTM output into a conventional data format are
discussed as well.
4.1 Overview of the HTM Algorithm
HTM is a sequence memory algorithm that aims at emulating the structure and
foundational principles of the neocortex. Inspired by the neocortex, HTM is
structured from ascending hierarchical regions of cellular layers, shown in Figure 4.1,
that enable the network to capture spatial and temporal patterns. The cells in HTM
are a simplified model of the common excitatory neurons in the neocortex, known
as the pyramidal neurons. Similar to pyramidal neurons, HTM cells have hundreds
of synaptic connections that enable them to recognize independent patterns of
cellular activities. The cell's synaptic connections are assigned to three integration
zones, namely proximal, basal, and apical [111, 26]¹. Each zone is composed
of either one proximal segment or several basal or apical segments. A segment,
either proximal or distal (basal and apical), comprises multiple synapses to capture
the cellular activities of the space to which it is linked. The proximal dendritic
segment defines the cell’s receptive field in the input space (feed-forward input) and
sufficient activities detected on the proximal dendrites lead to the generation of a
somatic action potential. The basal and apical dendritic segments hold the synaptic
connections with nearby cells and other cells in higher levels in the hierarchy.
Therefore, the basal and apical segments are dedicated to observing contextual and
feedback inputs. It is important to note that the activities detected on the basal
and apical dendrites lead to N-Methyl-D-Aspartate (NMDA) spikes. The NMDA
spikes slightly depolarize the cell without generating an action potential, giving
the cell a competitive advantage in responding to future input [29].
¹ A cell in HTM typically has one proximal segment and multiple distal (basal and apical) segments.
Figure 4.1: The biological neocortex structure including the regions and the building
blocks (neurons), and their correspondence in the HTM network.
The cells in each HTM region are arranged in a columnar organization called
a mini-column. In a given mini-column, cells share the same proximal synaptic
connections, i.e., they share the same feed-forward receptive field and are stimulated
by the same input. Basal segments, on the other hand, allow for interaction
among cells within the same region so that such cells can learn and recall sequences.
During learning, the strength of the synaptic connections, which is defined by a positive
scalar value called permanence, is adjusted. This process occurs in an online
fashion, which enables the algorithm to learn not only the spatial features of the
input, but also the temporal correlation between them [30].
Figure 4.2 shows a high-level diagram of the HTM network equipped with a data
encoder and multiple classifiers. The encoder transforms sensory information into a
binary representation, while the classifiers map the HTM output to the corresponding
class labels (SDR classifier) and identify anomalies (anomaly classifier). Although
the HTM network, in theory, has several hierarchical levels, this
aspect has not yet been studied thoroughly. Thus, this work will place emphasis on
a single HTM region, which is equivalent to realizing the primary sensory region in































Figure 4.2: High-level architecture of the HTM algorithm with three core units:
data encoders, HTM network, and classifiers. The encoder transforms the input
data into binary representation. The HTM network learns spatial information and
captures temporal transitions, while the classifiers map the HTM output to the
corresponding class labels and identify anomalies.
the supra-granular layers of the neocortex. In the following sections, each part
of the HTM algorithm will be discussed.
4.2 HTM Encoder
The HTM encoders are responsible for converting sensory information into sparse
distributed representations (SDRs)². The SDR is the underlying data
structure of HTM. It enables the algorithm to distinguish the common features between
inputs [112], learn sequences, and make simultaneous predictions [113]. However,
² Encoding sensory information into SDRs is preferred when working with HTM.
However, in some encoders, such as date and scalar encoders, the sparsity and distribution
features are missing.
the SDR representations here refer to large binary vectors that possess key aspects
of neocortical representations: sparsity and distribution. Sparsity indicates that
the number of active bits should always be less than the number of inactive
bits (typically, in HTM the percentage of active bits is 2-10% of the vector length).
Each active bit conveys some semantic characteristics of the encoded data,
such that two different SDRs with overlapping active bits imply some sort of
correlation between these SDRs. Distribution³ refers to the fact that the active bits
should not be clustered in one particular region. Rather, they should be randomly
scattered. When designing an encoder for an HTM network, one has to make sure to meet
the following criteria [114]: 1) Inputs that share some similarities should result
in SDR representations with overlapping bits. 2) The same input should always result in
the same SDR output representation. 3) The output vector length should be the
same for all inputs. 4) The sparsity level for all the generated SDRs should also be
the same.
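The sparsity and overlap properties described above are easy to measure on binary vectors. The helpers below are a minimal sketch with hypothetical names. The random SDR generator is only a stand-in used to produce test vectors; it is not an HTM encoder, and the vector length of 1024 with 40 active bits is an arbitrary choice that falls within the 2-10% range mentioned in the text.

```python
import numpy as np

def sparsity(sdr):
    """Fraction of active bits in a binary vector."""
    return np.count_nonzero(sdr) / sdr.size

def overlap(a, b):
    """Number of bit positions that are active in both SDRs."""
    return int(np.count_nonzero(np.logical_and(a, b)))

def random_sdr(n, n_active, rng):
    """Stand-in SDR: n_active randomly scattered active bits out of n."""
    sdr = np.zeros(n, dtype=np.uint8)
    sdr[rng.choice(n, size=n_active, replace=False)] = 1
    return sdr

rng = np.random.default_rng(0)
a = random_sdr(1024, 40, rng)
b = random_sdr(1024, 40, rng)
shared = overlap(a, b)   # small for unrelated random SDRs, large for similar inputs
```

Two unrelated random SDRs at this sparsity share few active bits on average, so a large overlap between encoder outputs is a meaningful indication that the inputs are related.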
Over the last decade, Numenta, the company
that Jeff Hawkins and his collaborators started in 2005, has developed several
encoders that enhance the way HTM interacts with the environment. Examples of
these encoders are: i) the random distributed scalar encoder, which converts any scalar
value, captured for example in streaming data, into an SDR representation; ii)
the category encoder and the geospatial encoder, which are used to encode categorical
information and GPS coordinates into SDR representations, respectively. While
working with the HTM network, our use of encoders was limited to the random
distributed scalar encoder. Figure 4.3 illustrates an example of encoding time series
data into SDR representations using the aforementioned encoder.
³ Distribution also means that the information is encoded in more than one active bit.




Figure 4.3: (bottom) An input data stream representing power consumption in a
gym captured at every hour. (top) SDR representation of the bottom signal as it
is encoded by the HTM random distributed scalar encoder.
4.3 HTM Region
The region in the HTM is responsible for learning, storing, and recalling information.
Generally, the region is structured from vertically aligned cells, arranged into
two-dimensional planes stacked hierarchically. To simplify the description of the
HTM region and its mathematical formulation, the following assumptions regarding
the HTM structure will be made: 1) Only a single region is considered, since the
hierarchical structure in HTM has not yet been explored thoroughly. 2) The HTM
region will be flattened⁴ to have a 1D⁵ shape of size 1 × nc, where nc is the total
number of mini-columns. The same assumption will be applied to the input space.
Given an HTM region, there are two core operations which capture the spatial and
⁴ Flattening the region does not alter the functionality of the algorithm and does not have a major
impact on its performance.
⁵ A flattened HTM region can be seen as 1D (mini-column level description) or 2D (cell level
description).
temporal information of a given input, namely the spatial pooler and temporal
memory, discussed in the following subsections.
4.3.1 Spatial Pooler
In HTM, learning the spatial patterns in spatial or sequential data is performed
by the spatial pooler. When an input is presented to the network, it gets encoded
into a set of sparsely distributed active mini-columns using a combination of
competitive Hebbian learning rules and homeostasis [115]. The sparse activation of
mini-columns represents the core feature that grants the HTM algorithm appealing
properties, such as distinguishing the common features between inputs [112],
learning sequences, and making simultaneous predictions [113]. Generally, each
mini-column is connected to a unique subset of the input space using a set of
proximal synaptic connections. When the synapses are active and connected to a
reasonable number of active bits in the input space, the proximal dendritic segment
becomes active. The activation of the proximal dendritic segment nominates
that mini-column to compete with its neighboring mini-columns to represent the
input. Using the k-winner-take-all (k-WTA) computation principle, the mini-columns
with the most overlap between active synapses and active inputs inhibit their
neighbors and become active (winners). The output of the spatial pooler is a
binary vector, which represents the joint activity of all mini-columns in the HTM
region in response to the current input. This binary vector is also known as an
SDR vector. The operation of the spatial pooler can be divided into three distinct
phases: initialization, overlap and inhibition, and learning.
During the initialization phase, which occurs only once, all the parameters of
the region are initialized, including the mini-columns' connections to the input space,
synapse permanences, and boosting factors. Let Sp be an nx × nc array which holds
all the synaptic connections that link the nc mini-columns with the nx-dimensional input
space. Now, let nsp be the maximum number of potential synapses associated with
each mini-column; it is defined by the non-zero elements in $\vec{s}_p$ ($\vec{s}_p$ is a row vector
in Sp), whose indexes are generated by a pseudo-random number generator. See
Equation (4.1) and Equation (4.2).
\[
S_{ind} \sim \text{rand.pseudo}, \quad \text{where } S_{ind} \in \mathbb{N}^{n_{sp} \times n_c}_{\{1,\, n_x\}} \tag{4.1}
\]














Figure 4.4: Spatial pooler initialization phase. (left) Depicts the process of selecting
mini-columns’ receptive fields, where proximal connections are formed. (right) The
growth level of the individual proximal connection as defined by the permanence is
randomly set.
Similarly, let ρp be an nx × nc array that describes the permanence of the potential
synapses in Sp. The permanence value describes the growth level of a synapse
and ranges between 0 and 1. Synapses with permanence values less than Pth are
considered unconnected. Initially, the permanence values are randomly
initialized from a uniform distribution (Equation (4.3)). After initializing the
synaptic connections, the boosting factor for each mini-column is set to a
scalar value of one. See Equation (4.4).
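\[
\rho_p[S_{ind}] \sim \text{rand.uniform}[0, 1], \quad \text{where } \rho_p \in \mathbb{R}^{n_x \times n_c} \tag{4.3}
\]
\[
\vec{b} \in \mathbb{R}^{1 \times n_c}, \quad \text{where } \forall j:\ b[j] = 1 \tag{4.4}
\]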
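The initialization phase can be sketched with NumPy as follows. The function and argument names are hypothetical. Marking the selected receptive-field entries of Sp with ones is an assumption about how the index array of Equation (4.1) populates Sp, and the indexes here are 0-based rather than the 1-based range used in the text.

```python
import numpy as np

def init_spatial_pooler(n_x, n_c, n_sp, rng):
    """Return (S_p, rho_p, b) for n_c mini-columns over an n_x-bit input space."""
    s_p = np.zeros((n_x, n_c), dtype=np.uint8)   # potential-synapse indicator array
    rho_p = np.zeros((n_x, n_c))                 # synapse permanences
    for j in range(n_c):
        # Receptive field of mini-column j: n_sp distinct input indexes (Eq. 4.1).
        s_ind = rng.choice(n_x, size=n_sp, replace=False)
        s_p[s_ind, j] = 1
        # Uniformly random initial permanences on the potential synapses (Eq. 4.3).
        rho_p[s_ind, j] = rng.uniform(0.0, 1.0, size=n_sp)
    b = np.ones(n_c)                             # boosting factors (Eq. 4.4)
    return s_p, rho_p, b

rng = np.random.default_rng(42)
s_p, rho_p, b = init_spatial_pooler(n_x=64, n_c=16, n_sp=20, rng=rng)
```

Each column of `s_p` then holds exactly `n_sp` potential synapses, and permanences are non-zero only where a potential synapse exists.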
The initialization phase is followed by the overlap and inhibition phase, in which the
feed-forward input is collectively represented by a subset of active mini-columns,
namely the winning mini-columns. The selection of winning mini-columns occurs after
determining the activation level of each mini-column, called the overlap score (α).
The mini-columns' overlap scores for a given region are computed by counting each
mini-column's active synapses that are associated with active bits in the input space.
Mathematically, this is achieved by performing a dot product between the
feed-forward input vector ($\vec{x}_t \in \mathbb{R}^{n_x \times 1}_{\{0,1\}}$) at time t and the active synapses array, as
in Equation (4.6), where the active synapses array is the result of an element-wise
multiplication (denoted as ⊙) between Sp and ρ̄p. Here, $\vec{b}$ denotes the boosting
factor that regulates mini-column activities (Equation (4.7)). ρ̄p is a binary permanence
array that indicates the status of each potential synapse (Equation (4.5)), where '1' indicates
a connected synapse and '0' an unconnected synapse.
⇢̄p = I(⇢p   Pth) (4.5)
~↵t = ~xt.transpose · (Sp   ⇢̄p) (4.6)
~↵t = ~b  ~↵t (4.7)
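Equations (4.5) to (4.7) can be traced with a small worked example. The sketch below is illustrative only; the function name `overlap_scores` and the toy values are assumptions, and nested lists stand in for the arrays $S_p$ and $\rho_p$.

```python
def overlap_scores(x, S_p, rho_p, b, p_th):
    """Boosted overlap score of every mini-column for a binary input x."""
    n_x, n_c = len(S_p), len(S_p[0])
    alpha = []
    for j in range(n_c):
        score = 0
        for i in range(n_x):
            # Eq. (4.5): a synapse is connected when its permanence reaches P_th.
            connected = 1 if rho_p[i][j] >= p_th else 0
            # Eq. (4.6): count connected potential synapses on active input bits.
            score += x[i] * S_p[i][j] * connected
        # Eq. (4.7): scale the raw overlap by the boosting factor.
        alpha.append(b[j] * score)
    return alpha

# Toy region: 4 input bits, 2 mini-columns (all values are hypothetical).
S_p = [[1, 0], [1, 1], [0, 1], [1, 1]]
rho_p = [[0.6, 0.0], [0.9, 0.2], [0.0, 0.7], [0.3, 0.5]]
alpha = overlap_scores(x=[1, 0, 1, 1], S_p=S_p, rho_p=rho_p, b=[1.0, 2.0], p_th=0.5)
print(alpha)  # [1.0, 4.0]
```

In this example the first mini-column has one connected synapse on an active bit, while the second has two and a boosting factor of 2, which gives the scores 1.0 and 4.0.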
Upon the completion of computing the overlap scores, each mini-column overlap
score gets evaluated by comparing it to a threshold, known as minOverlap ($\alpha_{th}$),
see Equation (4.8). The resulting vector ($\vec{\tilde{\alpha}}_t$) is an indicator vector representing the
nominated mini-columns with high overlap scores. The nominated mini-columns
compete against each other within a radius defined by $\xi$ to represent the feed-forward
input. Based on the mini-column overlap scores and desired level of sparsity
($\eta$), $n_w$ number of mini-columns will be selected to represent the input, where kmax
is a function that implements k-winner-take-all which returns the top $n_w$ elements
within $\xi$.
$\vec{\tilde{\alpha}}_t = I(\vec{\alpha}_t \geq \alpha_{th})$ (4.8)
$\vec{\Lambda}_t = \text{kmax}(\vec{\tilde{\alpha}}_t, \eta, \xi)$ (4.9)
Figure 4.5: Spatial pooler overlap and inhibition phase. (left) Demonstrating
overlap score computing and nominating the mini-columns (highlighted with bold
black borders) for input representation. (right) Inhibition process and selecting
the active mini-columns (highlighted in light-blue) based on the desired level of
sparsity.
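The nomination and inhibition steps of Equations (4.8) and (4.9) can be sketched as follows. This is a simplified illustration: it assumes global inhibition (the radius $\xi$ covers the whole region), takes $n_w$ as the rounded product of the sparsity and the number of mini-columns, and ranks nominated mini-columns by their overlap scores. The function name `select_winners` and the tie-breaking rule are assumptions of the sketch.

```python
def select_winners(alpha, alpha_th, eta):
    """Return a binary vector marking the winning mini-columns."""
    n_c = len(alpha)
    # Number of winners implied by the desired sparsity (assumed rounding rule).
    n_w = max(1, round(eta * n_c))
    # Eq. (4.8): nominate mini-columns whose overlap reaches minOverlap.
    nominated = [j for j in range(n_c) if alpha[j] >= alpha_th]
    # Eq. (4.9): k-winner-take-all over the nominated set; the sort is stable,
    # so ties are resolved in favor of the lower column index.
    nominated.sort(key=lambda j: alpha[j], reverse=True)
    winners = set(nominated[:n_w])
    return [1 if j in winners else 0 for j in range(n_c)]

print(select_winners([3, 0, 5, 1, 4, 2, 0, 1], alpha_th=2, eta=0.25))
# [0, 0, 1, 0, 1, 0, 0, 0]
```

A local inhibition radius would apply the same ranking inside each neighborhood instead of across the full region.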
After determining the winning mini-columns and generating the final spatial pooler
output ($\vec{\Lambda}_t \in \mathbb{R}^{1 \times n_c}_{\{0,1\}}$), the learning phase starts to update the permanence values of
the mini-columns’ synapses as necessary, i.e. only the synapses of the active mini-columns
are updated. The approach followed in updating the permanence of the
synapses is based on the Hebbian rule [116]. The rule implies that the connection
of synapses to active bits must be strengthened, increasing their permanence by
$P^+_p$, while the connection of synapses to inactive bits will be weakened, decreasing
their permanence by $P^-_p$, where $\Delta\rho_p$ is the change in the permanence array for all












After adjusting the synapses’ permanence, the boosting factor is updated to regulate
the activities of the mini-columns, where $\bar{a}_t$ indicates the mini-column time-averaged
activity level over the last T inputs, and $\beta$ is a positive constant controlling the
adaptation pace [115].
~̄at =

















Figure 4.6: Spatial pooler learning phase. It involves updating the strength
(permanence) of the proximal connections for the active mini-columns only and
according to Hebbian’s rule.
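The Hebbian update of the proximal synapses described in this phase can be sketched as below. It is a minimal illustration: the function name `update_permanences` and the parameter names `p_inc` and `p_dec` (standing for $P^+_p$ and $P^-_p$) are assumptions, and the clamping to the range 0 to 1 follows the stated permanence range rather than an explicit equation.

```python
def update_permanences(x, S_p, rho_p, winners, p_inc, p_dec):
    """Strengthen or weaken the potential synapses of winning mini-columns in place."""
    n_x, n_c = len(S_p), len(S_p[0])
    for j in range(n_c):
        if not winners[j]:
            continue  # only active mini-columns learn
        for i in range(n_x):
            if not S_p[i][j]:
                continue  # not a potential synapse of this mini-column
            # Synapses on active bits are reinforced, the others are weakened.
            delta = p_inc if x[i] else -p_dec
            # Keep the permanence inside its valid range.
            rho_p[i][j] = min(1.0, max(0.0, rho_p[i][j] + delta))
    return rho_p

# Toy values (hypothetical): only mini-column 1 won the competition.
S_p = [[1, 0], [1, 1], [0, 1], [1, 1]]
rho_p = [[0.6, 0.0], [0.9, 0.2], [0.0, 0.7], [0.3, 0.5]]
update_permanences([1, 0, 1, 1], S_p, rho_p, winners=[0, 1], p_inc=0.1, p_dec=0.05)
```

After the call, the synapses of mini-column 1 on active bits have grown by `p_inc`, its synapse on the inactive bit has shrunk by `p_dec`, and mini-column 0 is unchanged.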
4.3.2 Temporal Memory
The temporal memory in the HTM is mainly dedicated to learning time-based
sequences and making predictions. The temporal memory operates at the cells’
level, specifically, the cells of the winning mini-columns. When a mini-column
becomes active, at least one of its cells is selected to be active to represent the
input contextually. This usually depends on whether the cells within the winning
mini-columns are predicting the incoming input. If a winning mini-column has a
predictive cell, that cell becomes active and inhibits other cells within the same
mini-column from being active. Otherwise, the joint activation of all cells within
the mini-column represents the input and this is known as massive neuron firing
or bursting. However, once a cell is activated, it forms lateral connections with
the cells that were active in the previous time step. Patterns recognized by lateral
connections lead to a slight depolarization of the cell soma (predictive state),
subsequently predicting the upcoming events. Typically, the lateral connections
are grouped into distal segments. A cell in HTM can have more than one distal
segment and this grants the cells the capability to predict more unique temporal
patterns. The operation of the temporal memory can be divided into three phases:
mini-columns evaluation, prediction, and learning phase [50].
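The cell-selection rule described above (a predictive cell suppresses its neighbors, otherwise the whole mini-column bursts) can be sketched as follows. The function name `activate_cells` and the list-of-lists layout (rows are cells, columns are mini-columns) are assumptions of this illustration.

```python
def activate_cells(winning_columns, prev_predictive):
    """Return the binary cell-activity array for the current time step."""
    n_m, n_c = len(prev_predictive), len(prev_predictive[0])
    active = [[0] * n_c for _ in range(n_m)]
    for j in range(n_c):
        if not winning_columns[j]:
            continue  # cells of non-winning mini-columns stay inactive
        predicted = [i for i in range(n_m) if prev_predictive[i][j]]
        # Predicted cells fire alone; with no prediction, every cell bursts.
        for i in (predicted if predicted else range(n_m)):
            active[i][j] = 1
    return active

# Two cells per mini-column, three mini-columns; columns 0 and 1 won.
prev_predictive = [[0, 0, 0],
                   [1, 0, 0]]
print(activate_cells([1, 1, 0], prev_predictive))
# [[0, 1, 0], [1, 1, 0]]
```

Mini-column 0 had a predictive cell, so only that cell becomes active, while mini-column 1 bursts because none of its cells anticipated the input.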
During the mini-columns evaluation phase, the active cells within the winning
mini-columns are selected to represent the input within its context. Let $n_m$ be
the number of cells in each mini-column, and $A^t \in \mathbb{R}^{n_m \times n_c}_{\{0,1\}}$ be a binary array that
represents the region cells’ activity, where ‘1’ indicates an active cell and ‘0’ an
inactive one. Similarly, let $\Pi^t$ also be a binary array that has the same size as $A^t$, where the
active bits in $\Pi^t$ refer to the predictive cells. The ith cell within the jth mini-column
is set to be active if $\vec{\Lambda}^t_j = 1$ and the cell was in the predictive state in the previous
time step, i.e. $\Pi^{t-1}_{ij} = 1$. Otherwise, bursting (all cells within the jth mini-column






















In the second phase of the temporal memory, prediction, the status of the cells for
the next time step is evaluated. This is done via observing the distal segments’
activation level. Let $D_{ij}$ represent a group of distal segments that belong to the
ith cell within the jth mini-column, where a segment in $D_{ij}$ indexed by d is a
binary matrix of size $n_m \times n_c$. If $\bar{\rho}^d_{ij}$ is the connected distal synapses within the
dth segment, and $\bar{S}^d_{ij}$ holds its distal connections that are connected to active bits





greater than the segments’ activation threshold, $S_{th}$. Otherwise, the segment is set
to a matching state if it has at least one synapse connected to an active cell in $A^t$.
Once the status of the distal segments is determined, the cells with active distal
segments are set to be in the predictive state, see Equation (4.17). It is important
to mention here that occasionally cells in HTM may incorrectly predict patterns.
In such scenarios, these cells need to have their synaptic strength reduced to lower
the likelihood of incorrect prediction. After evaluating the cells’ segments, their








$\bar{S}^d_{ij} = A^t \times S^d_{ij}$ (4.16)
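The prediction phase can be illustrated with a compact sketch. It assumes a sparse representation that is not in the original text: each segment is a dictionary mapping a presynaptic cell, written as an `(i, j)` tuple, to its permanence. The names `segment_state` and `predictive_cells`, and the label "inactive" for segments that are neither active nor matching, are assumptions of the example.

```python
def segment_state(segment, active_cells, p_th, s_th):
    """Classify one distal segment as 'active', 'matching' or 'inactive'."""
    # Connected synapses (permanence >= p_th) that point at currently active cells.
    hits = sum(1 for cell, perm in segment.items()
               if perm >= p_th and cell in active_cells)
    if hits > s_th:
        return "active"
    # A segment matches when at least one synapse reaches an active cell.
    if any(cell in active_cells for cell in segment):
        return "matching"
    return "inactive"

def predictive_cells(segments_by_cell, active_cells, p_th, s_th):
    """Cells owning at least one active segment enter the predictive state."""
    return {cell for cell, segments in segments_by_cell.items()
            if any(segment_state(s, active_cells, p_th, s_th) == "active"
                   for s in segments)}

segments_by_cell = {
    (0, 1): [{(0, 0): 0.8, (1, 0): 0.6, (1, 2): 0.2}],
    (1, 1): [{(0, 0): 0.3}],
}
print(predictive_cells(segments_by_cell, {(0, 0), (1, 0)}, p_th=0.5, s_th=1))
# {(0, 1)}
```

Here cell (0, 1) has two connected synapses on active cells, which exceeds the threshold of 1, whereas the single weak synapse of cell (1, 1) only produces a matching segment.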














As aforementioned, the learning in HTM follows Hebbian’s rule and it is applied
solely to active cells. Given $a^t_{ij} \in A^t$, where $a^t_{ij} = 1$ and the cell has an active segment, $D^d_{ij}$,
then all the synaptic connections that are laterally connected to active cells in the
previous time step are potentiated, while those that are connected to inactive cells
are depressed. This implies that the permanence of the distal synaptic connections,
$\rho^d_{ij}$, is increased by $P^+$ when it is connected to an active cell, otherwise, it is
decreased by $P^-$, as in Equation (4.18), where $S^d_{ij}$ is an array that holds the distal
connections of the dth segment of the ith cell within the jth mini-column.
 ⇢d
ij
=  (At 1   Sd
ij
)  P  (4.18)
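The distal learning rule stated in the text (potentiate synapses on previously active cells, depress the rest) can be sketched as follows. The dictionary-based segment, the function name `reinforce_segment`, and the clamping of permanences to the range 0 to 1 are assumptions made for illustration; the sketch follows the verbal rule rather than the exact algebraic form of Equation (4.18).

```python
def reinforce_segment(segment, prev_active_cells, p_inc, p_dec):
    """Apply the Hebbian update to one active distal segment, in place."""
    for cell in segment:
        # Potentiate synapses whose presynaptic cell was active at t-1,
        # depress all other synapses of the segment.
        delta = p_inc if cell in prev_active_cells else -p_dec
        segment[cell] = min(1.0, max(0.0, segment[cell] + delta))
    return segment

segment = {(0, 0): 0.5, (1, 2): 0.05}
reinforce_segment(segment, prev_active_cells={(0, 0)}, p_inc=0.1, p_dec=0.1)
```

After the call, the synapse from cell (0, 0) has grown to about 0.6, while the synapse from cell (1, 2) has been driven to the lower bound of 0.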
Now, in order to demonstrate the operation of the temporal memory in an example,
we will consider two different scenarios that cover every aspect of the temporal
memory phase. In the first scenario, we will assume that we are presenting the
sequence ‘3-4-5’ for the first time to an HTM region (see Figure 4.7). Let’s
also assume that we are at the point of presenting number ‘4’, which has already
been encoded using sparse distributed active mini-columns during the spatial
pooling phase (columns highlighted in light blue). During the temporal memory
phase, the individual active cells within the active mini-columns that represent the
same input contextually need to be selected. Since number ‘4’ has been presented
for the first time to the HTM region, a bursting operation will take place and
all the cells within the active mini-columns are set to be active. Then, one cell
per active mini-column is picked as a winning cell (light blue cells with black
border) to learn this particular input. Number ‘5’ is then presented to the network
and the same scenario is repeated. However, right after selecting the active cells
that represent number ‘5’, distal connections between the current winning cells
(representing number ‘5’) and the previous active cells (representing number ‘4’)
are formed. These distal connections grant the HTM cells the capability to predict
















Figure 4.7: Learning sequences (‘3-4-5’) presented to an HTM region for the first
time. Thus, the region experiences bursting operation and formation of distal
connections between the cells.
In the second scenario, we will assume that the network is familiar with the sequence,
i.e. the network has already seen the sequence ‘3-4-5’. Thus, when presenting number
‘4’ and selecting the active mini-columns during the spatial pooling, rather than
having bursting, a particular set of cells within the active mini-columns is selected
to represent the input. Usually, the selected active cells were in the predictive state.
Activating the cells that represent number ‘4’ in turn sets the cells
that represent ‘5’ to the predictive state (highlighted in green, see Figure 4.8). If
the network then receives number ‘5’ and the mini-columns with predictive cells are
chosen, their predictive cells are turned into active states. Then, the strengths of the
previously formed distal connections are updated according to Hebbian’s rule.











Figure 4.8: After learning, the cells that were previously set into the predictive state
are chosen to represent the current input.
4.4 HTM Classifiers
The HTM network can be equipped with several classifiers that translate neuronal
activities, generated by the region, into a meaningful data format. The two widely
used classifiers are the SDR classifier, which can be used in classification and
prediction tasks, and the anomaly classifier. Here, our emphasis will be confined to
the SDR classifier as no anomaly tasks are reported in this dissertation.
4.4.1 SDR classifier
The SDR classifier recognizes the SDR combinations as generated by the HTM
region and generates the predicted class labels. It is based on a softmax function
with $n_y$ units, where $n_y$ denotes the number of class labels that need to be recognized.
All the SDR classifier units are interconnected with the HTM region in a dense
manner through weighted connections, $\varpi \in \mathbb{R}^{n_{mc} \times n_y}$, where $n_{mc} = n_c \times n_m$ is the
total number of cells in an HTM region. Initially, the weighted connections of
the classifier are randomly initialized with uniform distribution and then tuned
according to the SGD, given in Equation (4.19). $\varpi_{i,k}$ is the weight of the connection
between the kth classifier unit and ith HTM cell⁶, the update is scaled by the learning rate, $\vec{A}^t_i$
denotes the ith cell activity within the flattened array $A^t$, and $\hat{y}_k$ and $y_k$ are the predicted
and expected class labels, respectively.
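A softmax classifier trained with SGD on binary cell activities can be sketched as below. The update used here, weight minus learning rate times cell activity times (predicted minus expected label), is the standard cross-entropy gradient for a softmax layer and is an assumption of this sketch rather than a transcription of Equation (4.19). The function names, the learning rate `lr`, and the toy patterns are hypothetical.

```python
import math
import random

def softmax(z):
    """Numerically stable softmax of a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(weights, cells):
    """Class probabilities for a flattened binary cell-activity vector."""
    n_y = len(weights[0])
    z = [sum(weights[i][k] * cells[i] for i in range(len(cells))) for k in range(n_y)]
    return softmax(z)

def sgd_step(weights, cells, label, lr):
    """One SGD update of the weights for a single (cells, label) sample."""
    y_hat = predict(weights, cells)
    for i, a in enumerate(cells):
        if not a:
            continue  # inactive cells contribute no gradient
        for k in range(len(y_hat)):
            y = 1.0 if k == label else 0.0
            weights[i][k] -= lr * (y_hat[k] - y) * a
    return y_hat

rng = random.Random(0)
# Uniformly initialized weights: 4 cells by 2 classes.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(4)]
samples = [([1, 1, 0, 0], 0), ([0, 0, 1, 1], 1)]
for _ in range(50):
    for cells, label in samples:
        sgd_step(weights, cells, label, lr=0.5)
```

After training, `predict(weights, [1, 1, 0, 0])` assigns most of the probability mass to class 0, and the second pattern to class 1.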










In this chapter, all the key functional units that can be developed to build a
self-sustained biologically-inspired system are discussed. This includes the encoder
to convert sensory information (e.g. visual, auditory, tactile) into high-dimensional
sparse distributed representations, the HTM algorithm to capture spatial information
and to learn sequences, and the SDR classifier to translate the set of neuronal
activities generated by HTM into a meaningful data format. A detailed description of
the HTM region and its core operations, spatial pooling and temporal memory, is
provided. Further information regarding the spatial pooler and temporal memory
algorithms is provided in Appendix-A.
⁶ When HTM is used to process spatial information, the output of the spatial pooler, i.e. the
mini-column activities, is fed to the SDR classifier rather than the activities of the HTM cells.
Chapter 5
Design Methodology of HTM
HTM is a fairly complicated algorithm as it attempts to capture most of the
operations observed in the neocortex. Mapping the algorithm to
a hybrid CMOS/Memristor design requires various design considerations. For
instance, emulating the synaptic behavior of HTM using memristors turns out
to be challenging. This is because the synapses in HTM are binary in nature,
i.e. they exhibit the same properties if they are above the permanence threshold
regardless of the synapse’s growth level and vice versa. Capturing this behaviour
requires us to develop a special window function, called Z-window function, that
facilitates the fitting of the memristor model to the physical device. The other
design consideration involves forming and pruning the synaptic pathways and
incorporating various plasticity mechanisms. Thus, various approaches for forming
and pruning the feed-forward (proximal) and lateral (distal and apical) synaptic
pathways are discussed. This chapter also explores incorporating various plasticity
mechanisms such as neurogenesis and homeostatic intrinsic plasticity to strengthen
the robustness and performance of the HTM network. As alluded to earlier, the
HTM network, in theory, has several hierarchical levels; this aspect has not yet been
studied thoroughly. This work is therefore confined to studying and implementing
only one level/region in HTM. In the following section, modeling of every aspect of
the region will be discussed.
5.1 HTM Synapse Modeling
The HTM cells have a large number of synaptic connections allowing them to
detect the pattern of activities occurring in the input space and within the region.
Each synapse’s level of growth is defined by its permanence value. Typically, the
permanence value ranges between ‘0-1’, where ‘0’ indicates the absence of the
synaptic connection with a likelihood to form one and ‘1’ indicates the full growth
of the synaptic connection [29]. When the permanence value exceeds the threshold,
the synapse provides a low-impedance path to the input and vice versa when the
permanence value is below the threshold. However, HTM synapses are binary in
nature in the sense that if two synapses’ permanences exceed the threshold, they
exhibit the same properties regardless of their connection strength. Even so,
a synapse with a higher permanence is harder to forget. In this
research, memristor devices are chosen to emulate the synaptic connections in the
HTM as they exhibit low energy consumption, small footprints, high integration
density, and non-volatility.
The memristor device is emulated using a representative Verilog-A model named
VTEAM [108]. The memristor, essentially, is described with two variables: w and
D, which define the state variable of the device and its thickness. Changing the state
of the device, i.e. its conductance value ($G_{mem}$), is considered to have an analog
nature. Thus, it is gradual and bounded between the memristor’s high conductance
state (HCS $\equiv G_{on}$) and low conductance state (LCS $\equiv G_{off}$). The change in the
memristor is a function of the voltage applied across the device or the current
through it. This chapter mainly focuses on the voltage-driven memristors whose
conductance change can be described by Equation (5.1) and Equation (5.2) [108].
$k_{off}$, $k_{on}$, $\alpha_{off}$, and $\alpha_{on}$ are constants, $v_{off}$ and $v_{on}$ are the memristor threshold




















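$\frac{dw}{dt} = \begin{cases} k_{off} \left( \frac{v}{v_{off}} - 1 \right)^{\alpha_{off}} \cdot f(w), & 0 < v_{off} < v \\ k_{on} \left( \frac{v}{v_{on}} - 1 \right)^{\alpha_{on}} \cdot f(w), & v < v_{on} < 0 \end{cases}$ (5.2)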
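The threshold behavior of a voltage-driven VTEAM-style state equation can be sketched numerically. This is an illustration under several assumptions: the state is normalized so that D = 1, all parameter values in `PARAMS` are hypothetical rather than fitted device values, and the window is the well-known Joglekar polynomial used purely as a placeholder (it is not the Z-window function proposed in this chapter).

```python
def joglekar_window(x, p=2):
    """Placeholder window f(x) = 1 - (2x - 1)^(2p), with x = w / D in [0, 1]."""
    return 1.0 - (2.0 * x - 1.0) ** (2 * p)

def vteam_dwdt(v, w, prm, window=joglekar_window):
    """Rate of change of the state variable for an applied voltage v."""
    x = w / prm["D"]
    if v > prm["v_off"]:   # above the positive threshold
        return prm["k_off"] * (v / prm["v_off"] - 1.0) ** prm["alpha_off"] * window(x)
    if v < prm["v_on"]:    # below the negative threshold
        return prm["k_on"] * (v / prm["v_on"] - 1.0) ** prm["alpha_on"] * window(x)
    return 0.0             # between the thresholds the state does not move

def euler_step(v, w, dt, prm):
    """Advance the state by one explicit Euler step, clamped to [0, D]."""
    w_next = w + dt * vteam_dwdt(v, w, prm)
    return min(prm["D"], max(0.0, w_next))

# Hypothetical, normalized parameters chosen only to exercise the equations.
PARAMS = {"D": 1.0, "k_off": 1.0, "k_on": -1.0,
          "alpha_off": 1, "alpha_on": 1, "v_off": 0.5, "v_on": -0.5}

w = 0.5
for _ in range(1000):
    w = euler_step(1.2, w, dt=0.01, prm=PARAMS)
```

With these placeholder values a sub-threshold voltage such as 0.3 leaves the state untouched, while a 1.2 pulse moves it toward the upper boundary, where the placeholder window slows the drift to zero.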
In order to use the memristor device to emulate the HTM synaptic connections,
we need a memristor that manifests a slight drift when it moves from the
boundary toward the mid-point of the device, and as it approaches the mid-point,
the drift should be accelerated. In 2017, Jiang et al. proposed a memristor device to
implement the k-nearest neighbour algorithm that exhibits the properties required
for the HTM [118]. However, modeling this device for circuit simulation requires a
special window function so that it exhibits the aforementioned properties. To the
best of our knowledge, there is no memristor window function that captures this
exponential attribute of HTM synapses. Thus, we developed a window function,
called the Z-window function. The Z-window function has built-in control parameters
for adjusting its characteristics and it takes into account the memristor device
boundary conditions. Furthermore, it possesses all the attributes of an effective window
function such as providing a linkage with the linear dopant drift model, scaling the
window function upward and downward [1, 2], and modeling the non-symmetrical
behavior of some memristor devices. It is important to mention here that our
description of the Z-window function will be limited to the second version only. The
first version, described in [31], offers the advantage of circumventing the boundary
lock problem, but this is at the expense of losing the window function continuity.
Figure 5.1: (a) Characteristic curves of the proposed Z-window function to model
the HTM synapses’ behavior. (b) and (c) The linkage to the linear drift model and
scalability features, respectively. (d) The non-symmetrical behavior feature.
The proposed window function is given in Equation (5.3), where ⌧,  , k, and p1
are constants that control the shape of the window function, sliding level (over
the x-axis), scalability, and falling slope as it approaches either ends of the device
terminals, respectively. Figure 5.1 illustrates the window function characteristic
curves. Figure 5.2 and Figure 5.3 show the memristor model hysteresis characteristic
1Nominal parameters used to achieve most of the plots in Figure 5.1 are: ⌧=200,  =0.5, k=1,
and p=4.
Figure 5.2: The conductance change (left) and hysteresis characteristic curve (right)
of the memristor while driving it with a positive pulse signal.
Figure 5.3: The conductance change (left) and hysteresis characteristic curve (right)
of the memristor while driving it with a negative pulse signal.















In an attempt to compare the proposed function with the existing memristor window functions,
the desired HTM synapse features in addition to the evaluation metrics defined
in [1] and [2] are used. Table 5.1 illustrates the attributes of each window function.
It can be noticed that the proposed Z-window functions (Z-V1 and Z-V2) possess
most of the desired attributes in addition to the accelerated drift in the state
variable which is an essential feature to enable capturing HTM synapse behaviour.
² A pulse wave signal with an amplitude of ±1.2 V and a frequency of 20 kHz is used to obtain the
hysteresis plots.
Table 5.1: A comparison of various memristor window functions based on the
desired HTM synapse features and the evaluation metrics described in [1, 2].
Metric: Joglekar [119], Prodromakis [1], Zha [2], Z-V1 [31], Z-V2 [49]
Resolve boundary effect: ✗ ✓ ✓ ✓
Linkage with linear dopant drift: ✓ ✓ ✓ ✓
Scalability ($0 \leq f_{max} < k$): ✗ Limited ✓ ✓
Dopant movement acceleration: ✗ ✗ ✗ ✓ ✓
Flexibility (control parameters): ✓ ✓ ✓ ✓ ✓
Continuity: ✓ ✓ ✗ ✗ ✓
Resolve boundary lock: ✗ ✗ ✓ ✓ ✗
Figure 5.4 illustrates the experimental behavior of the adopted memristor physical
device as a function of the applied pulse, fitted to the memristor model used.
Here, it can be observed that the memristor has minor changes in conductance
level on either side of the permanence threshold (highlighted in green), while the
changes are extreme in the middle. To some extent, this captures the binary nature
of the ideal synapse in HTM. It is important to mention here that in order to
optimize the HTM system performance and maintain low power consumption,
the following assumptions were made: 1) the memristor device exhibits semi-
symmetrical behavior when switching from low/high conductance to high/low; 2)
the memristor device offers fast switching speed and high conductance range.
Figure 5.4: Fitting the memristor model to the physical device behavior while
modulating the device conductance with a train of pulses.
5.2 Proximal Synapses Formation
The receptive field (RF) defines a sub-region in the input space to which a mini-
column’s proximal synaptic connections are tapped. This section discusses the
various approaches of realizing the RFs of the HTM mini-columns. It also highlights
the advantages of each approach and their constraints and feasibility in realizing a
large-scale neuromorphic chip for the HTM algorithm.
5.2.1 Memristive Crossbar
The memristive crossbar is mainly composed of perpendicular metal nanowires
sandwiching memory elements (memristors) [120]. The memristive crossbar offers
several advantages such as enabling the integration of a large number of memory
elements within a compact area and allowing highly-parallel vector-matrix compu-
tations. As most neural networks are dominated by vector-matrix multiplications,
this makes the memristive crossbar a natural fit for such networks. However,
the memristor crossbar structure is mainly beneficial for densely connected neural
networks. When it comes to sparsely connected networks such as HTM, using
the crossbar would only be possible by randomly disconnecting devices or setting
them to a high impedance state³. Although both these approaches may result
in a sparsely connected crossbar, they still model the network inefficiently. This is because
disconnecting devices requires a special burning process, whereas setting them to a
high impedance will not result in perfect current blocking. Having said that,
one research group has explored the high impedance method to fulfill a part
of HTM’s requirements [91]. The authors suggest using a crossbar in which each
column models an HTM mini-column and the rows represent the mini-column’s
synaptic connections which are connected to the input space. For a given crossbar,
the adjacent columns have to maintain a certain level of overlap in the input
space. Figure 5.5-(a) shows an example of adjacent mini-columns with two proximal
connections each, connected to a 4x1 input space (a slice of the presented 4x4 image).
Here, it can be noted that the mini-columns C1 and C2 share the input x3 but not
x2. Although this method results in partially sparse connections and
enables high-speed computation in the HTM, it has several limitations.
First, the range of overlap that can be achieved among
neighboring mini-columns is limited because more overlap space implies more unused regions
in the crossbar (called the "dark-spot" in the rest of the dissertation). Second,
it leads to current sneak paths as the memristors cannot be programmed to zero
conductance. Lastly, it lacks reconfigurability, which is the most important feature
in the HTM.
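The imperfect current blocking of the high-impedance method can be made concrete with a small numerical sketch of a crossbar column read-out. The conductance values `G_ON` and `G_OFF`, the 0.1 V read voltage, and the function name `column_currents` are hypothetical; the connectivity follows the Figure 5.5-(a) description in which C1 taps x2 and x3 while C2 taps x3 and x4.

```python
def column_currents(voltages, conductances):
    """Current collected by each crossbar column: I_j = sum_i V_i * G_ij."""
    n_rows, n_cols = len(conductances), len(conductances[0])
    return [sum(voltages[i] * conductances[i][j] for i in range(n_rows))
            for j in range(n_cols)]

G_ON, G_OFF = 1e-4, 1e-6          # hypothetical conductances in siemens
mask = [[0, 0],                   # x1: tapped by neither mini-column
        [1, 0],                   # x2: C1 only
        [1, 1],                   # x3: shared by C1 and C2
        [0, 1]]                   # x4: C2 only

# Ideal sparse crossbar: blocked devices conduct nothing.
ideal = [[G_ON * m for m in row] for row in mask]
# High-impedance method: blocked devices still conduct G_OFF.
leaky = [[G_ON if m else G_OFF for m in row] for row in mask]

v = [0.1, 0.1, 0.1, 0.1]          # read voltage on every input row
i_ideal = column_currents(v, ideal)
i_leaky = column_currents(v, leaky)
```

Every column of the leaky crossbar collects a small extra current from its two blocked devices, and this error grows with the number of unused rows, which is one way to see why the approach scales poorly for sparse receptive fields.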
The other possible approach to achieve sparsely connected crossbars is based
on changing their structure. Instead of using the regular perpendicular cross
connections, a regional space for each column is defined such that its connections
can be tapped, as shown in Figure 5.5-(b). However, such an approach may have
its own challenges during the fabrication process and the same current sneak paths
issue.
Figure 5.5: Mini-column receptive fields modeled by a sparsely connected memristor
crossbar implemented using (a) blocking memristor (b) predefined mini-columns
regional connections.
³ There is another approach proposed in [121] to map a sparse matrix to the crossbar. It is based
on decomposing the sparse matrix into small sub-blocks mapped separately to the crossbar.
The sub-blocks with all-zero elements are excluded from the mapping process, which
reduces the crossbar size. However, such a process requires continuous matrix manipulation and
is infeasible for networks with on-chip training. Thus, this approach is not considered here.
In general, the biggest challenge of adopting the crossbar approach in order to
establish the receptive field of each mini-column is the integration between the
HTM region and the input space. Using the crossbar structure in the ways described
above involves establishing hundreds of connections to the input space. This makes
the HTM architecture dominated by the interconnects, which eventually leads to
undesired noise, scaling limitations, and more power consumption. Furthermore,
these connections are rigid in nature and lack reconfigurability, which is an essential
feature to develop an HTM network on chip.
5.2.2 Dynamic Memristive Crossbar
The principal concept of this approach is based on using a linear feedback shift
register (LFSR) and a memristor crossbar as a single entity to enable crossbar
end-terminal reconfigurability. Because the columns in the crossbar
share the rows, full reconfigurability can only be achieved when the columns are
separated to be one-dimensional arrays, where each column models a mini-column
in HTM. Each column is assigned its own dedicated LFSR which is initialized
by the mini-column index in the HTM region. The RF that is generated by the
LFSR can either be local or global. In the global RF, all the registers of the LFSR,
shown in Figure 5.6-(a), are used to generate random numbers such that the entire
input space can be seen by the mini-columns. Given a mini-column, nsp number of
potential synapses can be generated by its LFSR to link it with nsp locations in the
input space. In the case of the local RF, the LFSR registers are used in a partial
manner. Some of them will be used to generate the random numbers whereas
the rest are dedicated to providing address shifting. Figure 5.6-(b) illustrates the
concept of the partially used LFSR. The registers with a colored base represent the
ones that will generate the synapse addresses while the rest are used for shifting.
For instance, if an 8-bit LFSR is loaded with a seed of 200, random integer numbers
ranged between 192-207 can be achieved if only the 4 least significant bits (LSB)
of the LFSR are used.
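The local receptive field example above (an 8-bit register seeded with 200 in which only the 4 LSBs are shifted) can be sketched as follows. The feedback taps, the function names, and the split into a fixed upper nibble and a shifting lower nibble are assumptions of this illustration; the hardware tap configuration is not specified in the text.

```python
def lfsr4_step(state):
    """One step of a 4-bit Fibonacci LFSR fed by the XOR of its two top bits.

    This tap choice cycles through all 15 non-zero 4-bit states.
    """
    bit = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | bit) & 0xF

def local_rf_addresses(seed, n_sp):
    """Generate n_sp synapse addresses inside the 16-address window of the seed."""
    base = seed & 0xF0      # upper 4 bits stay fixed and select the window
    state = seed & 0x0F     # lower 4 bits are shifted (must be non-zero)
    addresses = []
    for _ in range(n_sp):
        state = lfsr4_step(state)
        addresses.append(base | state)
    return addresses

print(local_rf_addresses(200, 5))  # [193, 194, 196, 201, 195]
```

Because a plain XOR-feedback register never reaches the all-zero state, this particular sketch visits 193 to 207 and skips 192 itself; covering the full 192 to 207 range would need a different feedback arrangement, which is outside what the sketch assumes.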
It turns out that this approach of generating the RF of HTM mini-columns is more
expensive in terms of resource utilization and latency in comparison to the rigid
memristive crossbar discussed previously in Section 5.2.1. It is, however, feasible and more
realistic when it comes to scalability because there is no restriction related to the
crossbar size, or the number of interconnects being used. Furthermore, it satisfies
an essential requirement for the HTM which involves providing a reconfigurable
interconnect that enables implementing topologies of the HTM RF, both local
and global. It also facilitates the communication of the HTM network with the
environment and reduces the physical interconnects.
Figure 5.6: (a) LFSR used to generate a global RF. (b) LFSR with partially used
registers (red-base) to generate the local RF.
5.3 Distal Synapses Formation
The cells in the HTM network interact with each other during the temporal memory
phase. This interaction is essential to enable the network to predict the upcoming
events. As alluded to earlier, the cells’ interaction is enabled through the distal
segments which are established and evolved while learning temporal information.
In hardware, this translates into thousands of interconnects that are continuously
changing in their conductivity level and locations. Because interconnects
in VLSI systems are rigid and do not support this level of reconfigurability, memory
units can be used to virtually formulate these connections and to describe their
strength as in [41, 85]. Although such an approach is effective as it endows the
network with the necessary dynamics to learn spatial and temporal information,
it does not suit edge devices which have stringent area and energy constraints.
In the following subsections, two communication schemes for virtual synaptic
description are discussed. The first is the well-known address event representation
and the second is the proposed alternative approach, namely synthetic synapses
representation.
5.3.1 Address Event Representation
Address event representation (AER) is a real-time communication scheme developed
by Mahowald [5] to overcome the limitations of CMOS fan-ins/outs and the
interconnect reconfigurability issue in neuromorphic chips. AER takes advantage
of sparse neuronal activity and high-bandwidth of very large-scale integration
(VLSI) to enable time-multiplexed communication. Hence, it reduces the number of
connections between sending and receiving neuronal arrays from $n$ to $\log_2 n$ [122].
The way AER works is shown in Figure 5.7 [5]. Given two chips, sender and receiver,
AER produces a unique digital address for every neuron whenever it spikes. This
address is transmitted over a data bus to the receiver chip, where the decoder
produces a spike in the corresponding location.
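The reduction from n wires to about $\log_2 n$ address lines can be illustrated with a toy encoder and decoder. The function names and the use of bit strings to stand for words on the shared data bus are assumptions of this sketch; handshaking and arbitration between simultaneous spikes are left out.

```python
def address_width(n):
    """Number of address lines needed to name n neurons (ceil(log2 n))."""
    return max(1, (n - 1).bit_length())

def encode_events(spikes):
    """Sender side: emit one binary address per spiking neuron."""
    width = address_width(len(spikes))
    return [format(i, f"0{width}b") for i, s in enumerate(spikes) if s]

def decode_events(addresses, n):
    """Receiver side: regenerate a spike at every decoded address."""
    spikes = [0] * n
    for address in addresses:
        spikes[int(address, 2)] = 1
    return spikes

spikes = [0] * 16
spikes[3] = spikes[10] = 1
bus_words = encode_events(spikes)          # ['0011', '1010']
assert decode_events(bus_words, 16) == spikes
```

For 16 neurons the bus needs only 4 address lines, and because neuronal activity is sparse, only a few words are transmitted per time step.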
Figure 5.7: The address-event representation between two cores, sender and receiver.
When a neuron from the sender chip fires, its address gets encoded and sent over
the data bus indicating an event has occurred. Whenever the receiver chip gets the
address, its decoder generates a spike in the corresponding location [5].
It turns out that AER is considered an effective approach for point-to-point
connections, but not for complex networks with sparse connections [41]. The
complex network connectivity is solved through the enhanced AER proposed by
Goldberg et al. [41]. The enhanced AER uses look-up tables (LUTs) to describe
the connectivity network between two sets of neuronal arrays. The LUT contains
the sender address, destination address, and the probability of connectivity. Thus,
complex networks, even sparse ones, can easily be implemented. However, the
enhanced AER demands a large amount of memory and this makes it unsuitable
for power and area constrained devices.
5.3.2 Synthetic Synapses Representation
Synthetic synapses representation (SSR) is a communication scheme which leverages
the LFSRs to describe the sparse connections among neurons. Using the LFSRs
eliminates the need for memory-based address description as the addresses between
neurons are generated rather than stored. This can result in a considerable
reduction in the network area and power consumption. Figure 5.8 provides a high-level
description of the SSR when it is used to form distal synaptic pathways. Given a
4×4 HTM region with one cell in each mini-column, an active cell can have distal
connections formed randomly with other cells in the region. However, since the
hits occur randomly, there is always a likelihood of forming undesired connections.
Thus, special protocols and encoding/decoding processes are necessary to mitigate
the possibility of forming undesired connections. In the next chapter, extensive
details about the SSR operation and its advantage over AER will be provided.

















Figure 5.8: High-level description of the formation of synaptic pathways using
synthetic synapses representation, where the solid lines represent successfully
formed pathways, while the dotted line indicates an attempt to form connection
with inactive cells which did not go successfully.
5.4 Homeostasis and Neurogenesis Plasticity Mechanisms
Homeostasis is an essential mechanism in biologically inspired networks. It prevents
neurons from becoming hyperactive by regulating their threshold for generating
a somatic action potential [36], named minOverlap in HTM theory. The concept
of homeostasis in HTM does not involve regulating minOverlap directly.
Rather, it excites the action potential of relatively low-active neurons
by multiplying the action potential by a positive scalar value called a boosting
factor (see Figure 5.9). This results in an effect similar to that of regulating the
minOverlap value. Boosting in HTM is used to ensure an equal likelihood for
mini-columns to represent the spatial inputs in SDR form. It is applied by
stimulating the mini-columns that have not been active over a predefined time
period, i.e., that are not frequently active with respect to their neighboring mini-columns.
Consequently, low-active mini-columns have a better chance of representing the
feed-forward input in the future.
CHAPTER 5. DESIGN METHODOLOGY OF HTM 83
Figure 5.9: The accumulated activity of a select set of mini-columns along with
the corresponding boosting level recorded after various iterations.
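A minimal software sketch of the boosting idea is given below. The text does not state the boosting rule in this section, so the linear rule, the `min_fraction` and `max_boost` parameters, and the function name `boost_factors` are hypothetical. The sketch only shows the qualitative behavior: mini-columns whose recent activity falls well below that of their most active neighbor receive a factor greater than one, while sufficiently active mini-columns are left unchanged.

```python
def boost_factors(activity, min_fraction=0.5, max_boost=3.0):
    """Return one boosting factor per mini-column.

    activity     : recent activity count (or duty cycle) of each mini-column
                   in a neighborhood.
    min_fraction : hypothetical fraction of the highest activity below which
                   a mini-column is considered low-active.
    max_boost    : hypothetical factor applied to a completely silent column.
    """
    target = min_fraction * max(activity)
    if target == 0:
        # Nothing has been active yet, so there is nothing to compare against.
        return [1.0] * len(activity)
    factors = []
    for a in activity:
        # 0 for columns at or above the target, 1 for silent columns.
        deficit = min(max((target - a) / target, 0.0), 1.0)
        factors.append(1.0 + (max_boost - 1.0) * deficit)
    return factors


def boosted_overlap(overlap, factors):
    """Multiply each overlap score by its boosting factor."""
    return [o * f for o, f in zip(overlap, factors)]


if __name__ == "__main__":
    factors = boost_factors([0.0, 0.1, 0.4])
    print(factors)
    print(boosted_overlap([4, 4, 4], factors))
```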
Boosting is effective when the information in the input space follows a
uniform statistical distribution. Unfortunately,
this requirement is not guaranteed, especially for visual applications, unless a custom
encoder is used to process all the inputs. MNIST images are an example of a
non-uniform distribution of information in the input space. Such
non-uniformity leaves several mini-columns rarely active or completely
inactive, and even boosting would not cut down the number of inactive
mini-columns. Therefore, as a possible solution to this issue, this work
suggests applying the neurogenesis mechanism to the HTM. Neurogenesis is a
structural plasticity mechanism in which ‘dead’ neurons are replaced with ‘new’
neurons to enhance the network's computational capabilities [123]. Just as in homeostasis,
neurogenesis can be applied by tracking the recent activity of each mini-column over a
predefined period of time and comparing it with that of its neighbors. The mini-columns that
were not frequently active are considered ‘dead’ neurons and should be replaced with
new ones. For a given ‘dead’ neuron, this is achieved by replacing its connections
with new, randomly initialized connections to different locations
in the input space. Hence, the mini-columns' proximal connections start shifting
toward the most active regions in the input space while maintaining a low number
of connections to rarely active regions. Figure 5.10 demonstrates the influence of
neurogenesis on the density of synaptic connections in the input space. It can
be seen that when the activity in the input space is concentrated in the mid-region
and neurogenesis is disabled, the connections on the sides are not involved in any
computations, which leads to non-robust sparse representations. In contrast, when
neurogenesis is enabled, the synaptic connections move toward the most
active spots in the input space and form better representations.
Figure 5.10: The density of the potential synapses as linked to an input space
with activity centered in the middle region when the neurogenesis mechanism is (a)
disabled and (b) enabled.
Implementing the neurogenesis mechanism in hardware presents several challenges
due to the lack of reconfigurability in the interconnects that model the synaptic
connections. Thus, we developed the concept of synthetic synapses to enable
reconfigurability in the interconnects and neurogenesis in HTM neuromorphic
systems. With this approach, when a given jth mini-column is ‘dead’ and replaced by a
‘new’ mini-column (denoted j_new), all the connections of the jth mini-column are removed
and replaced by new connections assigned to different locations in the input space, and
the strengths of the new connections are, again, randomly initialized.
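The replacement step can be sketched in software as follows. This is only an illustration of the mechanism described above: the data layout (one dictionary per mini-column mapping an input index to a permanence), the `min_fraction` criterion for declaring a mini-column ‘dead’, and the function name `neurogenesis` are assumptions, not the hardware procedure.

```python
import random


def neurogenesis(columns, activity, input_size, n_synapses,
                 min_fraction=0.1, rng=None):
    """Replace the proximal connections of 'dead' mini-columns in place.

    columns      : list of dicts, one per mini-column, mapping an input-space
                   index to a permanence value in [0, 1).
    activity     : recent activity count of each mini-column.
    min_fraction : hypothetical threshold; a mini-column is 'dead' when its
                   activity is below this fraction of the most active one.
    Returns the indices of the mini-columns that were replaced.
    """
    rng = rng or random.Random(0)
    threshold = min_fraction * max(activity)
    replaced = []
    for j, a in enumerate(activity):
        if a < threshold:
            # New connections land on different, randomly chosen input bits,
            # and their strengths are randomly re-initialized.
            new_inputs = rng.sample(range(input_size), n_synapses)
            columns[j] = {i: rng.random() for i in new_inputs}
            replaced.append(j)
    return replaced


if __name__ == "__main__":
    cols = [{0: 0.6, 1: 0.4}, {2: 0.5, 3: 0.5}, {4: 0.3, 5: 0.7}]
    print(neurogenesis(cols, activity=[10, 0, 7], input_size=64, n_synapses=2))
    print(cols[1])
```

Repeating this step over time lets the connections drift toward the input regions that actually carry activity, which is the effect shown in Figure 5.10.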
5.5 Summary
In this chapter, several design challenges associated with realizing the HTM network
in hardware using CMOS/Memristor hybrid technology are highlighted, along with
various possible solutions. A memristor device that exhibits the essential features of
HTM synapses is chosen. A new window function, named the Z-window function, is developed
to enable fitting the adopted Verilog-A model to the physical device. A
dynamic memristive crossbar and synthetic synapses representation are presented
to enable virtual formation of proximal and distal synaptic connections. Further-




Pyragrid: HTM Neuromorphic SoC
Chapter 5 described the challenges associated with mapping the HTM algorithm to
a hybrid CMOS/Memristor design and incorporating various plasticity mechanisms.
This chapter focuses on designing and implementing an HTM
system that can learn and predict from spatial and streaming
inputs. Such a system holds significant promise for pervasive edge computing and
its applications.
The general architecture of the proposed system is composed of mixed-signal
computational units and an underlying digital communication scheme. The computational
units have co-localized processing and memory units to provide high computational
power and real-time processing. Learning is performed in situ to enable
fast response time and continuous adaptation, while data processing is sparse in
nature, which results in a significant reduction in the overall system power
consumption. The computational units collaboratively interact and communicate to process
spatial and temporal information. The communication is facilitated through
the proposed synthetic synapses representation (SSR), which allows virtual
formation and pruning of physical synaptic connections and reduces the fan-out/fan-in
overhead. Using SSR also reduces the memory usage and the physical interconnects,
which are considered a major source of power consumption and a bottleneck that
CHAPTER 6. PYRAGRID: HTM NEUROMORPHIC SOC 87
hampers network scaling.
6.1 HTM System Design and Implementation
Figure 6.1 demonstrates the high-level architecture of the developed HTM system,
which is mainly composed of an encoder, an SDR classifier, and HTM regions.
The encoder and the SDR classifier are used to convert input sensory information
to SDR representations and transfer the HTM output neuronal activities into
a conventional data format, respectively. The HTM regions are responsible for




nc2 mini-columns with nm cells each, a main control unit (MCU), an arbiter
and selector. The MCU is dedicated to controlling data flow and to generating
the necessary control signals, while the arbiter and selector are responsible for
regulating data sharing among cells within the region. Here, the interaction among
cells is based on the SSR as the cells’ activity is sparse in nature, approximately
4.2%.
At a high level, the system works as follows: once the MCU establishes a connection
with the data encoder through a handshake protocol, it starts
receiving the encoded packets. The received packets are routed through the
H-Tree to all the region’s mini-columns. The H-Tree structure is used to
reduce the parasitic capacitance and to minimize the power consumption [124]
of the developed system. There are two H-Trees: one is a digital bus
(34 bits wide; 1 + log2(n_mc) lines are used by the cells, where n_mc = n_c × n_m)
1Unlike the mathematical description of the HTM, which assumed 2D representation of the region
for simplicity, in the hardware design, we consider a 3D architecture of the region to cut down
the resources and to simplify the communication scheme considerably.
2The number of mini-columns assumed in this work is always of 2k format, where k is an integer
number.

































































Figure 6.1: High-level architecture of the HTM system composed of an encoder,
SDR classifier, and an HTM region. The encoder and SDR classifier are utilized
to convert sensory information to SDR representation and to transform HTM
neuronal activities into conventional data format, respectively. The HTM region
comprises √n_c × √n_c mini-columns with n_m cells each to process spatial and
temporal information. The data flow and communication within the region is
facilitated through the control units, and the communication arbiter and selector.
driven by the MCU and the cells to share data. The other one (not shown in
Figure 6.1) is an analog line to enable mini-columns to compete against each other
for input representation. When the winning mini-columns and then the cells are
selected, the arbiter and selector are used to broadcast the information about
the current/previous active cells and their locations in the region so that lateral
connections are formed and future predictions are made. Once the output of the
HTM region is generated, it is relayed to the SDR classifier to generate the final
network output. In the following section, more details about each core unit of the
HTM region are presented, while the communication scheme and the SDR classifier
are discussed in separate sections.
6.2 HTM Region
6.2.1 HTM Mini-Column
A mini-column in the HTM, which encapsulates cells sharing proximal segments,
enables the network to capture spatial features. Figure 6.2 depicts the architecture
of the HTM mini-column. The mini-column is modeled by three units named:
peripheral unit, proximal unit, and WTA cell. The peripheral unit models the
part of the mini-column in which the proximal connections are generated and
connected to the input space. The proximal unit and WTA cell hold the proximal
connection permanences and a contesting unit that enables each mini-column to
compete with its neighbors for the input representation, respectively. The input
to the mini-column is generated by the HTM encoder. The encoder converts each
input into a high-dimensional binary vector sliced into small patches to minimize
data movement and the required storage units. Each patch is presented sequentially to
the mini-column and stored in the Addr_Reg. Meanwhile, the LFSR generates a
random number indexing the observed pattern activities in the feed-forward inputs.
An input SDR of size n_x × 1 (see footnote 3) is presented sequentially to the network in the
form of patches, where each patch is a 31-bit row vector. When the input patch
is stored in the Addr_Reg and the LFSR generates an address for a location in
the received patch, a matching score is stored in the synapses’ registers, which are
modeled by an n_sp × 1 serial-in-parallel-out shift register (called SIPO_Reg). Once
all inputs are received, the output of the SIPO_Reg is presented to the word-line
3 n_x, in this work, ranges from 527 to 961.
































Figure 6.2: The circuit diagram of a mini-column in an HTM region. It consists of
a peripheral unit in which the proximal connections are generated and connected
to input space, a proximal unit to store the connections’ strength, and a WTA cell
to enable the mini-columns to compete for input representation.
of an n_sp × 1 memristive crossbar where the proximal synapses’ permanences are stored.
The input voltages to the crossbar are converted into currents through
the memristors, and the output is sensed at the crossbar bit-line. The output of
the crossbar, which encodes the mini-column overlap score as a current, is then
boosted. Boosting is done by means of a sense memristor (M_s ≡ 1/g_s). The
boosting factor is inversely proportional to the conductance of M_s; thus, decreasing
g_s increases the boosting factor and vice versa. The output at this










where g_ij (≡ permanence ρ_pij) indicates the conductance of the ith memristor within the
jth mini-column, and V_i is the ith input voltage. V_αj denotes the jth mini-column
overlap score. Once the overlap score is computed, its value, which
is sampled by the sense memristor, is presented to a WTA circuit. The WTA
performs a k_max operation (see footnote 4) on V_αj, ∀j, followed by thresholding, to generate the





\begin{cases} 1, & V_{qj} > V_{th}, \text{ where } V_{qj} = f(V_{\alpha j}) \\ 0, & \text{otherwise} \end{cases} \qquad (6.2)
The input to the WTA circuit must be at least 0.2V for its cell
to be activated. This requirement implicitly realizes the concept of minOverlap in
HTM, which implies that a mini-column’s overlap score should be large enough to
compete against other mini-columns to represent the input. The output of
the WTA cell indicates the mini-column’s status, where logic ‘1’ refers to a winner.
Selecting the winners is followed by the learning process in which each winning
mini-column’s proximal synapses are adjusted in response to the stimulated feed-
forward input and according to Hebbian learning rules. Then, the mini-columns’
status is relayed to their associated cells to start the next phase, temporal memory.
Although the cells are encapsulated within the mini-columns and are considered a
part of them, for the sake of clarity and simplicity we dealt with them separately.
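The winner selection described above (a k_max operation followed by thresholding, with a minimum input of 0.2V acting as minOverlap) can be mimicked in software as shown below. This is a behavioral sketch only, not a model of the analog WTA circuit; the function name `select_winners` and the example voltages are hypothetical.

```python
def select_winners(overlap_v, k, v_min=0.2):
    """Return a 0/1 status per mini-column.

    overlap_v : overlap score of each mini-column, expressed as a voltage.
    k         : number of mini-columns allowed to win (output sparsity).
    v_min     : minimum input needed to activate a WTA cell; 0.2 V is the
                value quoted in the text and plays the role of minOverlap.
    """
    # Rank the mini-columns by overlap, highest first (the k_max step).
    ranked = sorted(range(len(overlap_v)),
                    key=lambda j: overlap_v[j], reverse=True)
    # Keep the top k, but only those strong enough to compete.
    winners = {j for j in ranked[:k] if overlap_v[j] >= v_min}
    return [1 if j in winners else 0 for j in range(len(overlap_v))]


if __name__ == "__main__":
    scores = [0.50, 0.10, 0.30, 0.15]
    print(select_winners(scores, k=2))
    print(select_winners(scores, k=3))
```

With k = 3 the third-ranked mini-column (0.15 V) is still rejected, because it does not reach the minimum input even though it falls inside the top k.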
Figure 6.3 demonstrates the process of computing the overlap score and tuning the
proximal synaptic connections for a given mini-column while receiving feed-forward
input. Since the mini-column has a large number of proximal connections, for the
purpose of demonstration, we randomly picked only two. The changes in proximal
connections’ permanences for both HTM-SW and HTM-HW models are shown in
Figure, respectively. Here, it can be observed that any changes in the synapses’
permanences below the permanence threshold, Pth, in the HTM-SW model has
no impact on the overlap score, unlike the HW model where there is no explicit
4 The function f (≡ k_max) is computed using the winner-take-all circuit explained in
Section 6.2.1.1.
Figure 6.3: The impact of the synaptic permanence (denoted as Permanence#.)
modulation on the mini-column overlap score as the proximal synapses (denoted as
synapse#) receive feed-forward input.
threshold blocking the memristors from contributing to the overlap score value.
Furthermore, the change in the HTM-HW model synaptic permanence (memristors’
conductances) tends to be non-linear as compared to the HTM-SW counterpart.
However, selecting a memristor device with high conductance range and switching
dynamics as required by the HTM theory made the synapses with high conductance
states dominate the changes in the overlap level. This eventually results in almost
analogous overlap score5 variation for both the SW and HW models.
6.2.1.1 Winner-take-all Circuit
The WTA cells are utilized as a part of the mini-column circuit to select the
winners in each local (or global) cluster. They are also used in the SDR classifier to
find the probability distribution of the network output and to identify the winning
class labels. Figure 6.5 depicts the WTA circuit (a variant of the circuit proposed
in [6] and shown in Figure 6.4), which models a simple local competitive algorithm
that is naturally imposed through Kirchhoff’s current law (KCL). Each branch
in the circuit has an NMOS transistor (T1) to capture the input signal (V_α
in the mini-column circuit and V_s in the SDR classifier) of one competitor. The
competitors interact with each other through the shared node V_c. When inputs
are presented to the circuit, the potential of V_c follows the input with the highest
voltage and tends to turn off all the other transistors. The cell conveying most
of the bias current, I_c, is identified as the winner. Given that all the transistors
operate in the subthreshold regime, applying an input voltage V_Gj at the gate of the
transistor in the jth branch results in a current I_j, which can be approximated by
Equation (6.3) [125]:






where I_o is the zero-bias current for the given device, W/L is the transistor channel
width-to-length ratio, and U_T and n indicate the thermal voltage and the subthreshold
slope coefficient, respectively. For the given circuit with n_k branches, according
to KCL, the branch currents should sum to I_c, as given by Equation (6.4). By
5The overlap scores for the HTM-HW and HTM-SW models are not reported up to scale for the
purpose of comparison.







Figure 6.4: Current conveyor circuit proposed by [6] to implement a WTA function.
using Equation (6.3) and Equation (6.4), we can solve for the current flowing in































Figure 6.5: (a) WTA circuit (nk number of cells) with local excitatory feedback.
Recall that the output of the HTM is a voltage and is represented in a binary
sparse form. Thus, we designed the WTA to be a voltage mode circuit. In order
to maintain the same normalized exponential relationship between the input and
output (described in Equation (6.5)), the current in each branch is sent to a current
comparator via a current mirror formed by T3 and T4, as shown in Figure 6.5. The
mirrored current is compared to a fixed reference current resulting in a voltage









where λ′ is the channel-length modulation and β′ is the transconductance parameter.
Ai and Vth denote the current mirror gain between T3 and T4 and the transistor
threshold voltage, respectively. By substituting Equation (6.5) in Equation (6.6),
Vqj and the output node are calculated as in Equation (6.7) and Equation (6.8):
















\begin{cases} 1, & V_{qj} > V_{th}, \text{ where } V_{qj} = f(V_{Gj}) \\ 0, & \text{otherwise} \end{cases} \qquad (6.8)
Because the factor 2A_i / (λ′_5 β′_5 (V_GS5 − V_th5)^2) is approximately constant,
Equation (6.7) indicates that the output voltage Vqj for branch j has a normalized
exponential relationship with the input VGj . Such relation has a unique benefit
for the WTA circuit because it maximizes the difference between the inputs. It
generously rewards the input with the highest value and punishes the losing ones.
Most of the power is consumed by the winning cells, which are few
in number compared to the losing cells. Figure 6.6 demonstrates the WTA circuit
operation while performing a softmax operation on a set of neuronal activities (after
carrying out a weighted sum operation). In the Cadence circuit simulation tool, the
data are imported using a piece-wise source and presented to the WTA circuit
to identify the winning class at each point in time. The figure depicts the input
signals to four randomly picked cells and their corresponding digital outputs. Each
input signal is formed by a piece-wise operation connecting the discrete testing
points over 100 consecutive samples. The output of the softmax is compared to the
labels associated with each input sample, and a perfect match is achieved. However,
the resulting digital output signal does not have a uniform pulse period, i.e., some
pulses are longer than others. This is due to the irregular nature of the
piece-wise signals.
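The normalized exponential (softmax) behavior of the WTA branches can be explored numerically with the sketch below. The exact equations of this section did not survive in the text, so the sketch assumes the usual softmax form, in which the bias current I_c is shared among the branches in proportion to exp(V_Gj / (n U_T)). The values of n, U_T, the bias current, and the reference current, as well as the function names, are illustrative assumptions rather than design values.

```python
import math


def branch_currents(v_gate, i_bias, n=1.5, u_t=0.0258):
    """Share the bias current among branches with a normalized exponential.

    Assumed form: I_j = I_c * exp(V_Gj / (n*U_T)) / sum_k exp(V_Gk / (n*U_T)).
    n (subthreshold slope coefficient) and U_T (thermal voltage, in volts)
    are typical textbook values, not values taken from this work.
    """
    scale = n * u_t
    v_max = max(v_gate)  # subtract the maximum for numerical stability
    weights = [math.exp((v - v_max) / scale) for v in v_gate]
    total = sum(weights)
    return [i_bias * w / total for w in weights]


def wta_outputs(v_gate, i_bias, i_ref, n=1.5, u_t=0.0258):
    """Binary outputs: a branch wins when its current exceeds the reference
    current of the comparator, so more than one winner is possible."""
    currents = branch_currents(v_gate, i_bias, n, u_t)
    return [1 if i > i_ref else 0 for i in currents]


if __name__ == "__main__":
    v = [0.30, 0.35, 0.32]                       # gate voltages in volts
    print(branch_currents(v, i_bias=1e-6))
    print(wta_outputs(v, i_bias=1e-6, i_ref=0.4e-6))  # single winner
    print(wta_outputs(v, i_bias=1e-6, i_ref=0.2e-6))  # two winners
```

A difference of a few tens of millivolts between inputs already moves most of the bias current to the strongest branch, and lowering the reference current admits additional winners, which is how the output sparsity can be controlled.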
It is important to note that, unlike most other WTA circuits in the literature [126,
127, 128], all outputs are buffered to provide enough driving capability when
transmitting signals across long distances. Also, few of the previous WTA circuits
are endowed with a hysteresis mechanism to increase network stability and prevent
the selection of a potential winner unless it is strong. Because
the hysteresis is achieved via local excitatory feedback, some of these circuits
require a reset process for any competition, as in [129]. In the proposed WTA circuit,
the hysteresis characteristic is introduced via the positive feedback formed by the
transistor Tf. Additionally, having a current comparator improves the stability
further, as it imposes a threshold current that needs to be crossed to switch a cell’s
status. The other advantage of using the current comparator is that it enables
more than one winner, which is a desirable feature, especially in HTM, as it allows
controlling the sparsity level of the network output.
Figure 6.6: Simulation results of four randomly picked cells [5-6-7-10] from the
proposed WTA circuit while identifying the winning classes. For the shown waveforms,
the expected output labels are: [x-7-7-x-7-x-10-6-x-10-...], where x indicates other
classes (not shown here).
6.2.1.2 Mini-column Training
Learning in HTM is performed in an online fashion and involves modulating
the synaptic permanence of the winning mini-columns only. As mentioned earlier, the
proximal synaptic connections of each mini-column are emulated by a memristive
crossbar. Therefore, the training of these connections can be performed simultaneously. By using
Ziksa, the writing scheme discussed in Chapter 3, training each mini-column’s
synaptic connections can be performed in two clock cycles. After computing the





























Figure 6.7: (a) The training circuit of the proximal synaptic connections in an
HTM mini-column, (b) A waveform diagram demonstrating the operation of the
training circuit during the testing period (shaded in light gray) and training period.
mini-column overlap scores, the synaptic connections that are connected to active
bits in the input space have their D flip-flop (DFF) set high, and vice versa
for the synapses connected to inactive bits. All the DFF outputs are buffered
with a modified NOT gate that generates a logic-level output during normal
operation and a training voltage during the learning phase. When the TrEn signal
is asserted (active-low), the positive terminals of the memristors are connected
to VTr if the output of the DFF is high and to GND otherwise. The other terminal of
the memristor is controlled by Tr1 and Tr2. During the first cycle of training,
Tr1 is set to ON by Tune+. If the DFF output is low, the resulting voltage drop across
the memristors that need to be adjusted exceeds the threshold, leading to an
increase in their resistance. During the second clock cycle, the same procedure is
applied but in the opposite manner.
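At the behavioral level, the two-cycle training of a winning mini-column can be sketched as two passes over disjoint groups of synapses, selected by the DFF bits. The sketch assumes the standard HTM Hebbian rule (synapses on active input bits are strengthened, synapses on inactive bits are weakened), normalized conductances in [0, 1], and fixed step sizes; the step values and the function name `train_winner` are hypothetical. Because the two groups are disjoint, the order of the two passes does not affect the result in this software model, so the sketch makes no claim about which group the circuit updates in which clock cycle.

```python
def train_winner(conductance, dff, step_up=0.05, step_down=0.02,
                 g_min=0.0, g_max=1.0):
    """Return the updated conductances of one winning mini-column.

    conductance : normalized memristor conductances (permanences).
    dff         : 1 if the synapse saw an active input bit, else 0.
    step_up / step_down : hypothetical increment and decrement sizes.
    """
    g = list(conductance)
    # Pass A: synapses whose DFF is high are strengthened.
    for i, bit in enumerate(dff):
        if bit:
            g[i] = min(g[i] + step_up, g_max)
    # Pass B: synapses whose DFF is low are weakened.
    for i, bit in enumerate(dff):
        if not bit:
            g[i] = max(g[i] - step_down, g_min)
    return g


if __name__ == "__main__":
    print(train_winner([0.50, 0.50, 0.99, 0.01], dff=[1, 0, 1, 0]))
```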
The downside of using the DFF inverters in conjunction with the Ziksa unit
is that the network suffers from the sneak path issue, especially during the
learning phase. However, this issue can be overcome by buffering the DFF output
with a tri-state buffer rather than an inverter gate. Using a tri-state buffer
allows the memristors that are not involved in the training process (see footnote 6) to float
such that they do not draw any current. Figure 6.8 covers the possible scenarios



















Figure 6.8: The possible scenarios for the current sneak paths when a DFF
is buffered with (a) A NOT gate, (b) A Tri-state buffer to drive the proximal
connection memristors.
6.2.2 HTM Cell
The cells in HTM enable the network to capture temporal patterns, to model
the input representations within their context, and to predict upcoming events.
The HTM cell circuit developed in this work is composed of a synaptogenesis
unit (see footnote 7), a distal segments unit, and current comparators, as shown in Figure 6.9. The
synaptogenesis unit is responsible for forming and pruning distal synaptic pathways
with the previously active cells. The distal segments unit holds the permanence
values, which describe the growth level of the individual distal synaptic pathways,
6 The memristors that are not involved in the training are those that need to be decremented
during the first clock cycle or those that need to be incremented in the second clock cycle of
training.
7 One may share the synaptogenesis unit among multiple cells of the same mini-column to
cut down resources and reduce power consumption, but at the expense of increased latency.
while the current comparators are utilized to evaluate the segments activation level
and to determine their states (active or matching) accordingly.
Initially, the cells start with no distal synapses. Once the HTM algorithm begins
processing the incoming patterns, the distal synapses start forming in the
synaptogenesis unit. Consider an HTM region arranged in a 3D space, where the
x and y axes index the mini-columns in the region and the z axis indexes the
cell. When the region receives an input, a population of cells within the region is
activated; in this context, this population is referred to as A^t_{3D}. If
a^t_{xyz} ∈ A^t_{3D}, where a^t_{xyz} is an active cell located at xyz, a^t_{xyz} will
form connections with the active cells in A^{t−1}_{3D}. Let us assume that the number of
active cells in A^{t−1}_{3D} is 4.2% of n_c. Then, if n_c = 961, ≈ 40 cells will be active
in each time step, assuming no bursting takes place. The active cell at time t establishes
connections with the 40 cells (ideal scenario) that were active at t − 1 by forming a
distal segment. A cell in HTM can have around 10 or more distal segments, which enables the
network to recognize temporal patterns and to learn the transitions between them.
Recall that forming and pruning distal connections in hardware requires
highly dynamic interconnects, which are lacking in most of the existing platforms,
especially ASIC designs; hence, the virtual description of the synapses has become a
common approach [5, 85]. However, describing the synapses virtually, in most cases,
demands high memory usage to store the sender/receiver addresses. For instance,
in the HTM context (assuming there are 961 mini-columns in the region with 4 cells
each), if the address of each cell is represented with 12 bits and the
distal connection permanence is represented with 16 bits, having 10 segments per cell with
60 distal connections each costs 16.8kb of memory per cell and 64.57Mb
for the entire network. Let us assume that the addresses and the permanences are
stored in a DRAM implemented in a 45nm process. If each 32-bit
off-chip memory access costs 640pJ [13], having 40 active cells at each time step leads
to a total energy consumption of 15.36µJ (first-order approximation). Running the
system at 8MHz can result in a power consumption of 122.88W just to access the
memory, which is a prohibitive amount of power, especially for edge devices with
a limited power budget.
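The first-order estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the two modeling assumptions, stated here because the text does not spell them out, are that each stored synapse (12-bit address plus 16-bit permanence) costs one 32-bit memory access and that one time step is processed per 8MHz clock cycle. The function name `distal_memory_budget` is hypothetical.

```python
def distal_memory_budget(n_columns=961, cells_per_column=4,
                         segments_per_cell=10, synapses_per_segment=60,
                         address_bits=12, permanence_bits=16,
                         active_cells=40, energy_per_access_j=640e-12,
                         clock_hz=8e6):
    """First-order memory, energy, and power cost of storing distal synapses."""
    synapses_per_cell = segments_per_cell * synapses_per_segment
    bits_per_cell = synapses_per_cell * (address_bits + permanence_bits)
    network_bits = bits_per_cell * n_columns * cells_per_column
    # Assumption: one 32-bit access per synapse of every active cell.
    accesses_per_step = active_cells * synapses_per_cell
    energy_j = accesses_per_step * energy_per_access_j
    # Assumption: one time step per clock cycle.
    power_w = energy_j * clock_hz
    return {
        "bits_per_cell": bits_per_cell,
        "network_bits": network_bits,
        "energy_j": energy_j,
        "power_w": power_w,
    }


if __name__ == "__main__":
    budget = distal_memory_budget()
    print(budget["bits_per_cell"] / 1e3, "kb per cell")
    print(budget["network_bits"] / 1e6, "Mb for the network")
    print(budget["energy_j"] * 1e6, "uJ per time step")
    print(budget["power_w"], "W at the given clock")
```

The network total evaluates to 64,579,200 bits, which corresponds to the 64.57Mb quoted above when truncated to two decimals.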
One possible solution to the above challenge is to reduce the memory usage
in each cell. This can be done by modeling the synaptic permanence using
analog memristors and leveraging the randomness in forming the distal synaptic
connections to generate the addresses rather than storing them. A possible approach
is to generate the distal segment addresses using LFSRs. To
demonstrate this, let us assume that the cell c242 is currently active and trying to
establish a connection with another cell, c333, which was active in the previous time
step. The cell c242 receives a packet that holds the location of c333 in 3D space, in this
example 333. Upon receiving the address, the cell c242 starts the matching
process, in which it identifies whether a distal connection with the cell c333 can be
established. The matching process starts by enabling the
X-LFSR to generate 16 addresses within one clock cycle (see footnote 8). The same applies to
the Y-LFSR. While the LFSRs generate their random values, the cell translates
any matches between the generated random numbers and the received Cartesian
locations into flags stored in 4-bit registers, which are later decoded by X-DMUX
and Y-DMUX. Here, a match means that a distal connection is established between
the two cells. It is important to mention that this approach
makes the process of forming distal connections probabilistic, while in the HTM
network it is deterministic. However, in HTM, the cells that are currently active
form connections with a subset (typically 50%) of the cells that were active in the
previous time step, and in our design this is achieved naturally through the adopted
probabilistic approach. Now, in order to estimate the likelihood of matching
8 The cells’ LFSRs are clocked at 128MHz, while the system clock is 8MHz.





















































































































Figure 6.9: The circuit diagram depicting the HTM cell with synaptogenesis unit,
which can generate or prune distal segments; a distal dendritic segment to hold the
permanence values of the distal connections; and current comparators to evaluate
the distal segment activation level and consequently the cell status (predictive or
unpredictive).
between distal segment addresses (randomly generated) and the addresses of the
active cells, Equation (6.9) can be used, where n_sd is the maximum number of
synapses in a distal segment. Let the distal segment size for a given cell be 256.
Given 961 mini-columns with 40 active at each time step, there is a 0.847 likelihood
that at least 20% of the generated random addresses match those of the previously
active cells. This likelihood can be significantly increased beyond 0.95 when the

















Figure 6.10: The matching probability between a distal segment address generated
by LFSRs and the address of the active cells in the previous time step for various
segment sizes.
After the matching process finishes and the X-DMUX and Y-DMUX are activated,
all the possible combinations of the 16 X-addresses and 16 Y-addresses are formed
by the AND gate array. A logic ‘1’ at the output of an AND gate, say
gate number 5, may indicate an active cell at the location (x=1, y=5). The
output of the AND gate enables the corresponding ‘green’ 2-bit register to load the
9Increasing the distal segment size costs more cycles to generate more random addresses and
additional memristor devices for each new added synapse.
Z-address, which represents the cell’s distal synapse that is currently connected
to another cell that was active at time t − 1, whereas the previously formed distal synapses
are stored in the ‘blue’ 2-bit register. Once the registers are loaded,
they are compared and the results are relayed to the distal segment memristors
(only when evaluating the cellular activities detected by the distal segment). For the
distal segments unit, this cell architecture leverages the union property of the
SDR representation to considerably reduce the cell architecture complexity. The
main concept behind the union property is storing several patterns using one
representation. This translates into having one universal distal segment for
each cell rather than several. The universal segment grows as the cell
learns more temporal information. It is important to mention that merging
the segments can increase the possibility of false triggering of cell segments and
incorrect predictions. However, this is less likely to happen if we limit the number
of patterns (M) a segment can learn, while setting the number of mini-columns and
cells to be large enough. For instance, in this work, we used 961 mini-columns with
4 cells each. If we store 30 patterns in a segment and set the matching threshold
for any two given patterns to 5, then according to [130], the probability of a false match
(P_fm) is 6.408 × 10^−14, as calculated using Equation (6.10).




The output current collected at the distal segment bit-line is received by the
current comparator unit. The current is then mirrored and compared with
two reference currents: the active threshold and the learning threshold. If the segment
current exceeds the active threshold, the segment is set to the active
state and consequently the cell state changes to predictive for the next time step.
In contrast, a current below the active threshold but above the learning





















Figure 6.11: Competitive circuit that enables the cells within one mini-column
to interact with each other when a massive firing activity takes place in the
mini-column.
threshold marks the cell as a matching cell. A matching cell has a high probability
to be selected to represent the input when bursting takes place. It is important to
mention here that the prior discussed operations are carried out within the cells,
but running the temporal memory successfully also requires the cells within the
mini-columns to interact with each other to identify whether bursting is necessary.
If bursting takes place in a mini-column, all the cells within the mini-column are set
to be active and one cell is selected to learn the current input pattern. Typically,
this is done either by selecting the best matching cell or least used cell. The
former occurs only when a cell has a sufficient number of potential synapses that
are connected to active cells in the previous time step, i.e. a cell has a matching
segment. Choosing the best matching segment involves selecting the cell with the
highest matching level (distal current). This implies mirroring all cells’ output
CHAPTER 6. PYRAGRID: HTM NEUROMORPHIC SOC 106
Figure 6.12: Waveform diagram illustrating the competitive circuit for the cell to
select the best matching cell.
current to another unit, namely the competitive circuit (a modified current based
winner-take-all circuit originally proposed in [126]), so that the cell with the highest
output current is chosen (see Figure 6.11). In the case when there are no matching
segments, the least used cell is chosen as a winning cell. Selecting the least used
cells is done via selecting the cells with the smallest number of distal segments.
Since this implementation deals with one universal merged segment, a counter
in the cell is used to monitor the flags of added segments and consequently the
number of merged segments in each universal one.
Figure 6.12 demonstrates the operation of the cells’ competitive circuit. Here, three
cells are competing to select the best matching cells. Two scenarios are considered.
In the first (interval 0-10 µs), all cells have high overlapping current (all MFlags =
’0’) so that they are in competition. Since cell1 has the highest overlapping current,
it is selected as a winner. In the second scenario (interval 30-40 µs), cell1 has less
current than the ‘Active Threshold’ (Iactive); for this reason it is excluded from
the competition. This is accomplished via switching T11 to the ON state, and this
eventually blocks the current going to the WTA circuit.
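The selection rule described above can be summarised behaviourally as follows. This is a software sketch rather than a model of the analog circuit: the function name, the argument names, and the numeric currents are illustrative assumptions, and the single match_threshold stands in for whichever reference current gates entry to the competition.

```python
def select_learning_cell(currents, segment_counts, match_threshold):
    """Pick the cell that learns when a mini-column bursts.

    currents        -- distal segment current per cell (arbitrary units)
    segment_counts  -- merged-segment counter per cell
    match_threshold -- minimum current for a cell to enter the competition
    """
    # Cells below the threshold are blocked from the winner-take-all stage.
    competitors = [i for i, c in enumerate(currents) if c >= match_threshold]
    if competitors:
        # Best matching cell: the highest distal current wins.
        return max(competitors, key=lambda i: currents[i])
    # No matching segment: fall back to the least used cell.
    return min(range(len(currents)), key=lambda i: segment_counts[i])

print(select_learning_cell([0.9, 0.6, 0.4], [3, 1, 2], 0.3))  # all cells compete
print(select_learning_cell([0.2, 0.6, 0.4], [3, 1, 2], 0.3))  # first cell excluded
print(select_learning_cell([0.1, 0.1, 0.1], [3, 1, 2], 0.3))  # least used cell
```

The three calls mirror the cases discussed in the text: every cell competing, one cell excluded from the competition, and the fallback to the least used cell.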
6.3 Synthetic Synapses Representation
The HTM cells’ interaction is made possible by the continuously evolved lateral
connections. Forming lateral connections physically in VLSI systems tends to be
impossible with the existing technology due to their continuous change in location
and strength. Thus, approaches such as enhanced AER and SSR can be used
to virtually formulate these connections and to describe their growth level. The
enhanced AER approach is not attractive for edge devices with limited resources
as it heavily depends on memory units (see Chapter 5 for more details). Thus,
we are presenting the SSR communication scheme that heavily relies on random
generators and memristor devices rather than conventional memory. This results
in significant savings in terms of resources and energy consumption.
Two aspects associated with the SSR are addressed in this work: forming synaptic
connections using LFSRs (discussed earlier in section 6.2.2) and controlling the data
transfer among cells through regulating the access to the H-Tree bus. Considering
the same HTM system with At 1 active cells in the previous time step and At active
cells in the current time step, during the temporal memory phase, every cell in the
network with sufficiently strong connections to A_{t-1} cells can be depolarized for the next
time step and become predictive. The challenge here is how to transfer the A_{t-1}
cells’ addresses to all other cells in the network efficiently. Let all the mini-columns
with active cells at time t − 1 place a request at the input of the outgoing tri-state
gates (see Figure 6.1). Then, each set of tri-states belonging to the same row
are activated simultaneously through the selector. When a row is selected, all
its tri-state buffers associated with the mini-columns are activated, allowing the
mini-columns to send requests to the arbiter and to receive acknowledgements. The
arbiter circuit is shown in Figure 6.13. It comprises buffers, a series of nMOS pass
transistors, and a feedback circuit. The buffers are used to store the simultaneous
requests from the selected mini-columns. The series of pass transistors are used to
monitor the status of the individual mini-column requests, whereas the feedback
circuit is used to acknowledge the mini-columns after their requests are processed.
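The queuing behaviour of the arbiter can be sketched in software as shown next. This is a behavioural illustration under the assumption of a fixed priority order (lowest mini-column index first); the function name and the example row are hypothetical, and no circuit timing is modelled.

```python
def serve_row(requests):
    """Yield 1-based mini-column indices in the order they are acknowledged.

    requests -- list of booleans latched from one selected row
    """
    latched = list(requests)          # buffers hold the simultaneous requests
    for idx, pending in enumerate(latched):
        if pending:
            latched[idx] = False      # the acknowledgement clears the request
            yield idx + 1             # the mini-column broadcasts its active cells

# A row of five mini-columns in which columns 2, 3, and 5 hold requests.
order = list(serve_row([False, True, True, False, True]))
print(order)  # [2, 3, 5]
```

Each request waits until all higher-priority requests in the same row have been served, which is the chained behaviour provided by the series pass transistors.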
In Figure 6.14, a waveform diagram illustrates the operation of the arbiter, selector,
and other units in the developed system while processing information sent from a
row with 5 mini-columns. Initially, all the winning mini-columns’ (in this example:
2, 3, and 5) requests are directed toward the arbiter and stored in the buffers (latch
or DFF). When latch-3, for instance, receives Req3, it waits in a queue until Req2
is served. Once Req2 is served, the voltage drop at T2 drain will be high. This
will trigger the feedback circuit to send ack3 signal to mini-column 3, which in
turn clears its request and broadcasts the address of its active cell(s). Serving
the requests of all the active cells in the HTM network, given the adopted setup,









⇤[i][j] + 1) (6.11)
Figure 6.13: SSR arbiter circuit consisting of buffers to store the simultaneous
requests from the winning mini-columns, a series of nMOS pass transistors to
monitor the status of the individual mini-column’s requests, and a feedback circuit
to clear mini-column requests once served.
Recall that the SSR conveys the same concepts of the AER and the enhanced AER,
but it is designed to serve intra-chip communication while offering the following
advantages:
• In AER, the neuron potential duration must be ≈500 times more than the
event duration for transmission to time-multiplex the transmission channel [5].
There is no need for such a constraint in the SSR.
• The enhanced AER demands memory units on both sides, sender and receiver,
to hold neuron addresses that are virtually connected (connecting 32x32 cells
requires 11Mb RAM [41]). For a sparse network like the HTM, this is very
overwhelming in terms of memory usage. However, in the SSR, the addresses
are generated rather than stored. This serves two advantages: smaller storage
units are used and random selection is achieved.
• The SSR is synchronous and its capacity, the maximum rate of sample
transmission (considering the worst case scenario and the adopted network
architecture), is 4MSamples/sec. In the AER case, its capacity for SNN with
approximately the same network size is 2.5MSamples/sec [131].
• The SSR uses a priority arbiter, which applies a queuing mechanism to access
Figure 6.14: Waveform diagram demonstrating a part of the SSR operation while
processing several concurrent requests sent from several mini-columns located
within the same row of the HTM region.
the H-Tree (or channel) bus, whereas AER utilizes an arbitration mechanism
to access the channel. The latter is known to lengthen the communication
cycle period and reduce channel capacity [131].
• The AER is deemed an effective approach for inter-chip communication,
where neuronal information is communicated by means of encoded events. At
the targeted destination, the encoded events are typically decoded and routed
to the proper accessible neurons. The encoder size here is highly dependent on
the number of neurons, whereas in the SSR, the decoding process complexity
is defined by the number of synapses associated with the targeted neurons.
This property is extremely beneficial for sparse networks like HTM.
• The enhanced AER offers better flexibility in updating the synaptic connec-
tions individually. The opposite is true for the SSR, in which changing the
seeds of LFSRs enables the cell to form a new set of synaptic connections.
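The idea of generating synaptic addresses rather than storing them can be illustrated with a small software LFSR. The sketch below uses a textbook 16-bit Fibonacci LFSR (taps 16, 14, 13, 11); the register width, the taps, the seeds, and the modulo mapping onto 961 × 4 cell addresses are assumptions made for illustration and are not taken from the hardware design.

```python
def lfsr16(seed):
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11); yields successive states."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state

def synapse_addresses(seed, count, n_cells):
    """Derive `count` presynaptic cell addresses from a seed instead of storing them."""
    gen = lfsr16(seed)
    # The modulo mapping is a simplification; it can repeat addresses.
    return [next(gen) % n_cells for _ in range(count)]

a = synapse_addresses(0xACE1, 8, 961 * 4)
b = synapse_addresses(0xACE1, 8, 961 * 4)
c = synapse_addresses(0x1234, 8, 961 * 4)
print(a == b, a != c)
```

Re-running with the same seed regenerates the same connection set, so only the seed needs to be kept; changing the seed yields a different set, which is why connections are replaced as a group rather than individually.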
6.4 SDR Classifier
The SDR classifier in the HTM is designed using a softmax unit (a memristive
crossbar integrated with a winner-takes-all circuit), time buffers, and sequential
digital comparators. The softmax unit generates the probability distribution for a
given joint neuronal activity produced by the HTM region and maps the inputs to
the corresponding class labels. The time buffers hold the Cartesian coordinates
of the neuronal activities in the HTM region, whereas the digital comparators are
used to identify novel inputs presented to the network. Figure 6.15 depicts the
schematic of the SDR classifier when used for visual data classification and time
series prediction. In time series applications, when the combined neuronal activities
and the digitized analog inputs are presented to the classifier, they get buffered.
However, due to the time difference in the presentation (the neuronal activities
are presented after completing the spatial pooling and temporal memory phases),
the digitized input is processed first. The digitized input goes through a series of
comparisons to identify whether it is a novel input or not. The comparison here
is made in a sequential manner, bit-by-bit, by using XOR logic gates. Whenever
two inputs are not identical, the output of the XOR, for the non-identical bits, will
be high leading to loading the corresponding DFF with ’1’, otherwise, the DFF is


















Figure 6.15: The SDR classifier schematic, which is mainly composed of a softmax
unit (memristive crossbar + winner-takes-all circuit) to classify the HTM neuronal
activities, time buffer to store data over time, and sequential digital comparators
to recognize novel inputs, which helps in maintaining input data records.
disabled. Once the comparison process is completed, the output is passed to the
priority arbiter. The priority arbiter units with active inputs pick one of the empty
registers (ŷi) to load the digitized novel input. Overwriting the previously stored
inputs in the ŷi registers is obviated through a feedback signal (CLRi); CLRi=’1’
indicates the register that needs to be overwritten is filled. The SDR classifier
will then stay in standby until the neuronal activities are received from the HTM
region. Once the HTM output is generated, it relays through the time buffer and
to the softmax unit, where the softmax function is applied and the probability
distribution of the outputs is generated. The output of the softmax unit with the
highest probability enables the corresponding ŷk register to release its output.
As alluded to earlier, in this work, the SDR classifier is used for visual data classification
and time series prediction. Most of the previously described components of
the SDR classifier are used in prediction applications. In the case of classification,
the use is confined to the memristive crossbar and winner-takes-all circuit.
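The functional role of the softmax unit and the XOR-based novelty check can be sketched in software as follows. This is a minimal illustration, assuming ideal weights and integer-coded inputs; the function names and the example weights are hypothetical and do not model the memristive crossbar or its non-idealities.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(sdr, weights):
    """sdr: list of 0/1 activities; weights: one weight row per class."""
    scores = [sum(w * x for w, x in zip(row, sdr)) for row in weights]
    probs = softmax(scores)
    winner = max(range(len(probs)), key=probs.__getitem__)  # winner-takes-all
    return winner, probs

def is_novel(value, stored_values):
    """A value is novel if its bitwise XOR with every stored value is non-zero."""
    return all((value ^ s) != 0 for s in stored_values)

winner, probs = classify([1, 0, 1], [[0.1, 0.2, 0.1], [0.9, 0.0, 0.8]])
print(winner, is_novel(0b1111, [0b1010, 0b0001]))
```

In the prediction setting, the index returned by the winner-takes-all step would select which stored input register releases its value.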
6.5 Experimental Setup
6.5.1 HTM Network Setup
In order to set the optimal network parameters for the utilized verification bench-
marks, the particle swarm optimization (PSO) [132] algorithm is used. The
algorithm is integrated with the software model of the HTM and the search space
is defined within a range that meets the hardware constraints. The search space of
the optimal hyper-parameters is explored using 50 particles randomly initialized
within the predefined range, and the algorithm runs over 100 iterations. The evalu-
ation for any given set of hyper-parameters is performed using the SDR classifier,
the highest accuracy of which represents the optimal point. For three separate
runs and different benchmarks, we run PSO to get the optimal hyper-parameters
that result in the highest recognition accuracy. The hyper-parameters that are
included in the search space are: the number of winning mini-columns which
impacts the network sparsity level, MinOverlap which influences the sparsity level
Table 6.1: HTM-SW parameters when benchmarking the network on visual tasks
(MNIST and YaleFaces) and time series prediction.
Parameter                             Range       MNIST    YaleFaces   Time-Series
Number of winning mini-columns        5-40        40       16          40
MinOverlap                            1-25        3        20          1
Proximal permanence threshold         0-0.8       0.52     0.5         0.5
Proximal permanence increment (P+)    0-0.2       0.01     0.1         0.05
Proximal permanence decrement (P−)    -(0-0.2)    -0.01    -0.15       -0.08
Proximal segment size                 10-500      31       250         128
Distal permanence threshold           0-0.8       -        -           0.5
Distal permanence increment (P+)      0-0.2       -        -           0.1
Distal permanence decrement (P−)      -(0-0.2)    -        -           -0.1
Distal segment size                   100-512     -        -           120
and noise robustness, permanence parameters which determine the learning and
forgetting rate, and the proximal and distal segment sizes. These control each
mini-column overlap level with the input space, and the number of distinct patterns
a cell can learn. Table 6.1 lists the hyper-parameters’ search space and the optimal
values for the spatial-temporal tasks performed by the developed system. It is
important to mention here that most of the aforementioned optimal parameters
are used for HTM-SW and HTM-HW, with some exceptions related to synaptic
connections count and their growth level rate. These are set in a manner that
ensures a reasonable network performance while maintaining low latency and high
power efficiency.
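The search procedure can be sketched with a basic particle swarm optimizer. The particle count and iteration count follow the text; the inertia and acceleration coefficients, the position clamping, and the toy objective (a stand-in for one minus the classifier accuracy) are assumptions, since evaluating the real objective would require training and scoring the HTM model.

```python
import random

def pso(objective, bounds, n_particles=50, n_iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `objective` over the box `bounds` = [(lo, hi), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Clamp to the search range (the hardware-imposed limits in the text).
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val

# Toy objective over two hypothetical hyper-parameters (threshold, MinOverlap).
toy = lambda p: (p[0] - 0.5) ** 2 + (p[1] - 20.0) ** 2
best, err = pso(toy, bounds=[(0.0, 0.8), (1.0, 25.0)])
print(best, err)
```

Integer-valued hyper-parameters such as MinOverlap would need to be rounded before evaluation; that step is omitted here.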
6.5.2 Memristor Device Parameters
The non-linear Verilog-A memristor model, VTEAM [108], combined with the
developed Z-window function, is fit to the physical memristor device by Jiang
et al. [118], as discussed earlier in Chapter 5. However, in order to achieve a
reasonable performance of the developed HTM system while maintaining low power
consumption, the following assumptions were made:
• High conductance range is chosen to fulfill the design constraints and to
ensure proper operation.
• The memristor device exhibits semi-symmetrical behavior when switching
from low/high conductance to high/low.
• The number of distinct transitions of the memristor resistance is set to be
the same when the device is used in mini-columns and cells.
• The memristor device offers fast switching speed.
Table 6.2: The device parameters used in the mini-column and cell designs.
Parameter                      Value [mini-column]    Value [cell]
Proximal memristor range       150 kΩ - 10 MΩ         150 kΩ - 10 MΩ
Memristor threshold            ±0.95 V                ±0.95 V
No. of switching pulses (a)    51                     51
Training voltage               1.1 V                  1.1 V
Sense memristor range          20 kΩ - 80 kΩ          -
(a) The number of pulses required to transition from Gon/Goff to Goff/Gon.
6.6 Summary
In this chapter, the process of mapping the HTM algorithm to a hybrid CMOS-
Memristor mixed-signal design is presented. The proposed system is supported
by various plasticity mechanisms and has the capability to process spatial and
temporal information. The overall architecture design process and operation are
explained. Novel designs of the core mixed-signal computational blocks (mini-
columns and cells) are described in extensive detail. The interaction among the
computational blocks is facilitated through the proposed communication scheme,
synthetic synapses representation (SSR). SSR turns out to be not only an effective
communication scheme for sparse intra-chip communication, but it also offers
several advantages over the address event representation, such as high capacity,
low usage of memory, and resource efficiency. The network parameters that ensure
optimal performance are chosen with the help of particle swarm optimization (PSO)
algorithm, whereas the memristor physical device is emulated using the VTEAM
model and the developed Z-window function.
Chapter 7
HTM Results and Discussion
The HTM algorithm is known to have a unified computational platform suited for
a myriad of applications such as medical diagnosis [34], image classification [115],
stock market prediction, and anomaly detection [35]. To this end, the capability of
the proposed hybrid CMOS/Memristor mixed-signal HTM system (Pyragrid) in
performing various spatial and temporal tasks is explored. For the spatial tasks,
the HTM system is evaluated for visual data classification on two benchmarks:
MNIST (hand-written digits dataset) and Yalefaces (grayscale images of facial
expressions). For the temporal tasks, the HTM system is used to predict future
events based on the current input and previously acquired knowledge. Several
time series benchmarks with various sizes and sampling rates are used for this
task including Hot-Gym (hourly power consumption in a gym), NYC-Taxi-Demand
(hourly demand for taxis in NYC), etc.
From the hardware perspective, our ultimate goal is to port our system on resource
constrained edge devices. Thus, in this chapter we also investigate the efficiency
and limitations of the proposed design in terms of latency, lifespan, power efficiency,
and robustness to noise and device failure. Two HTM models are used for metrics
evaluation. The first is a golden model (HTM-SW) that runs the HTM system
without any constraints. This model is used to find the optimal network performance
CHAPTER 7. HTM RESULTS AND DISCUSSION 118
for a given task. The second model (HTM-HW) is an emulation of the hardware
design under predefined circuit constraints. The HTM-HW model facilitates
creating large scale networks and accelerates the simulation time.
7.1 Spatial Tasks
As alluded to earlier, the HTM system is equipped with all the units required
to perform spatial and temporal tasks. However, for spatial tasks like image
recognition, only spatial pooling (SP) and classifying of SDR representations will
be carried out.
7.1.1 Image Recognition
The HTM system performance is evaluated on the image recognition task which is
conducted using several benchmarks including MNIST1 and Yalefaces2 datasets
(see Figure 7.1). For MNIST, all the images are resized from their original size,
28x28, to 31x31 pixels. Then, all images are binarized by thresholding prior to
introducing them to the HTM. The same processes are applied for Yalefaces, but
here the images are cropped using the OpenCV face detection Python library prior
to the resizing. The binarization here is performed using adaptive thresholding to
preserve most image details during the conversion process.
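The MNIST preprocessing steps can be sketched as follows. The interpolation method and the threshold value are not stated in the text, so nearest-neighbour resizing and a mid-range global threshold are assumptions here; the adaptive thresholding used for Yalefaces is not shown.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list of pixel values."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def binarize(img, threshold=128):
    """Global thresholding: pixels at or above `threshold` become 1, others 0."""
    return [[1 if px >= threshold else 0 for px in row] for row in img]

# Synthetic 28x28 grayscale gradient standing in for an MNIST digit.
digit = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]
binary = binarize(resize_nearest(digit, 31, 31))
print(len(binary), len(binary[0]))
```

The 31x31 output provides 961 binary pixels, the same count as the 961 mini-columns quoted earlier for the network.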
In separate experiments, the data is introduced to the HTM as training and
1MNIST is the standard benchmark for hand-written images. It has grayscale images of 28x28
pixels associated with 10 classes for numbers from 0 to 9. The images are split into 60,000
training examples and 10,000 testing examples.
2Yalefaces dataset contains 165 grayscale images corresponding to 15 subjects, 11 images each.
The images are taken under different conditions and variations including illumination effects,
facial expression, etc. In this work, the set is randomly split into training and testing examples,
where the training examples contain 8 samples from each subject and 3 samples are used for
testing.
Figure 7.1: Samples from the classification benchmarks used to evaluate HTM
system performance for visual information processing. (left) Hand-written digits
from MNIST dataset. (right) Yalefaces dataset.
testing sets. When the training set is introduced, the HTM mini-columns learn
the feed-forward input (images) in an unsupervised fashion. The output of the
HTM region, which is the joint activity of mini-columns, is then relayed to the
SDR classifier. The SDR classifier, which is trained in a supervised fashion using
stochastic gradient descent, maps the SDR representation of the input images to
the corresponding class labels. When the testing set is presented to the network,
the same procedure is repeated but the learning is disabled in both the HTM region
and SDR classifier. Table 7.1 demonstrates the HTM system’s ability in classifying
the presented images with testing accuracy of 90.33± 0.17%3 for MNIST. In the
case of Yalefaces, due to the limited available training samples, the same training
set is presented to the system several times and the resulting accuracy for testing,
averaged over 10 runs, is 86.86± 3.82% (see Table 7.2).
In an attempt to compare our results with previous implementations of the HTM,
for MNIST, it is found that although our network is smaller in size, it still offers a
comparable accuracy to other implementations. In the case of our previous work,
in which 100 mini-columns are used, the high accuracy is mainly attributed to
the use of high-performance classifier, support vector machine (SVM). When it
comes to other sparsity classifiers such as the locally competitive algorithm (LCA),
3Classifying unbinarized MNIST images using the HTM software model can increase the test
accuracy to 94.93% (for 2025 mini-columns) and 93.41% (for 961 mini-columns).
Table 7.1: Summary of image recognition accuracy for MNIST dataset using HTM
and other algorithms. Here, SP stands for the spatial pooler, and nc denotes the
number of mini-columns used by the network.
Work                     nc      Classifier                 Accuracy (%) ± STD
F-HTM [87]               784     SP+SVM (Linear kernel)     91.98
Memristive-LCA [133]     300     LCA                        90.0
Digital-HTM [50]         100     SP+SVM (RBF kernel)        91.16
Crossbar HTM [91] (a)    1024    SP+X                       ≈90.5
This work                484     SP+SDR classifier          90.33 ± 0.17 (b)
(a) In [91], 95% classification accuracy is reported for using 4096 mini-columns, but it is
not mentioned if this is for a hardware implementation. Furthermore, the authors
did not mention the type of classifier used with the HTM; for this reason, we denote
it by X.
(b) The high-level simulation model that we developed to model the HTM network has
accounted for memristor cycle-to-cycle variability and device-to-device variability.
The cycle-to-cycle variability here is confined to device resistance range which is
emulated as a variation in the weight range, while the device-to-device variability
(write variation) is modeled by adding noise to the learning rule.
Table 7.2: Summary of image recognition accuracy for Yalefaces dataset using
HTM and other algorithms. Here, SP stands for the spatial pooler; nc denotes the
number of mini-columns used by the network.
Work                      nc      Classifier            Accuracy (%) ± STD
Memristor HTM [89] (b)    -       SP+XOR classifier     86.67
Smooth-MFA [134] (a)      -       S-MFA                 81.1
This work                 1024    SP+SDR classifier     86.86 ± 3.82 (c)
(a) Software implementation.
(b) In [89], the SP parameters to achieve the aforementioned accuracy are not included.
Also, the reported average accuracy is when the network is tested separately only
on emotions, light conditions, and facial expressions portions of the dataset.
(c) The high-level simulation model that we developed to model the HTM network has
accounted for memristor cycle-to-cycle variability and device-to-device variability.
The cycle-to-cycle variability here is confined to device resistance range which is
emulated as a variation in the weight range, while the device-to-device variability
(write variation) is modeled by adding noise to the learning rule.
SP+SDR classifier still outperforms this classifier by ≈ 0.33% margin. For Yalefaces,
SP+SDR classifier outperforms the smooth-marginal fisher analysis (S-MFA) and
offers higher average accuracy than the HTM implementation in [89] which did not
consider the entire dataset during the training and testing.
7.1.2 Noise Robustness
In order to quantify the noise robustness of the HTM system for image recognition
applications, two experiments are performed. The first involves classifying MNIST
images in the presence of noise and the second one involves interrupting the training
process by injecting random SDRs. For the first experiment, the HTM system
is trained on clean training MNIST images and tested with noisy test images.
The noise here is added by flipping the image pixels randomly. The noise level
is defined by the percentage of the flipped pixels in an image. For a noise level
ranging between 0% and 10%, both the SDR classifier (≡ softmax classifier) and
the SP+SDR classifier are tested separately. Figure 7.2 demonstrates the drop in
recognition accuracy as both classifiers are tested with corrupted MNIST images
with various noise levels. It can be observed that the SP+SDR classifier was able
to handle the noise with a graceful degradation in accuracy in comparison to the
SDR classifier whose accuracy dropped to 37.7%, when 10% noise is added to the
images.
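The pixel-flipping noise used in the first experiment can be sketched as follows. The helper name and the choice to flip an exact count of distinct pixels (rather than flipping each pixel independently with some probability) are assumptions.

```python
import random

def flip_pixels(bits, noise_level, rng):
    """Return a copy of a flat binary image with round(noise_level * len) pixels flipped."""
    n_flip = round(noise_level * len(bits))
    noisy = list(bits)
    # Sample distinct positions so that no pixel is flipped twice.
    for idx in rng.sample(range(len(bits)), n_flip):
        noisy[idx] ^= 1
    return noisy

clean = [0] * (31 * 31)                      # blank 31x31 binary image
noisy = flip_pixels(clean, 0.10, random.Random(0))
print(sum(noisy))                            # number of flipped pixels
```

With a 31x31 image and a 10% noise level, 96 of the 961 pixels are flipped.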
Figure 7.2: Recognition accuracy of MNIST dataset classified with SDR classifier
and SP+SDR classifier in the presence of a noise level ranging between 0% and
10%.
Figure 7.3: Noise robustness of the SP+SDR classifier when presenting MNIST
dataset as a stream of data mediated by noisy information.
In the second experiment, the HTM region is used to generate the SDR representations
of MNIST training and testing sets. Then, the SDR representations are
presented to the SDR classifier during the training and testing processes, where
one cycle of training and testing is considered as one epoch. MNIST SDRs are
presented to the SDR classifier for 100 epochs mediated by noise injection. The
noise here consists of a set of random SDR vectors with a sparsity level similar
to that in MNIST SDR vectors as generated by the SP. 10,000 noise vectors are
generated and injected in parts after the SDR classifier settles to a reasonable
accuracy level (≈ 90%). Between epochs 50-55, 1000 random SDR vectors are
presented to the SDR classifier, between epochs 65-75, 5000 random SDR vectors
are used, and between epochs 80-95, 10,000 random vectors are used. Figure 7.3
illustrates the drop in recognition accuracy when the data streams are replaced by
a stream of noisy vectors and the fast recovery after the noisy vectors are removed.
The fast recovery here is attributed to training the SDR classifier on sparse inputs,
which reduces the likelihood of adjusting the critical connections (i.e. weights).
Consequently, the degradation in classifier performance that remains after the
noise vectors are removed is almost negligible.
7.2 Temporal Tasks
7.2.1 Time-Series Prediction
The prediction accuracy of the proposed HTM system is evaluated in terms of the
mean absolute percentage error (MAPE) using real-world streaming data. Given
an input dataset of length nn, where each data point presented to the HTM system
at time t is represented by yt, while the corresponding predicted value is given by









Figure 7.4-(a) shows a snapshot of the Hot-Gym dataset [135], the power consump-
tion in a gym, over a small period. The power consumption is recorded at every
hour for 4 months (total samples count = 4390). Here, the HTM system is used to
predict the power consumption for the next 2 and 5 hours. Initially, the software
model, HTM-SW, is used in the prediction. Then, the same prediction is made
using the HTM-HW model. Figure 7.5 shows the accumulated MAPE recorded at
every 250 samples. It can be seen that the initial value of the MAPE is really high,
but over time it decreases as the network learns patterns and uses the acquired
knowledge to make valid predictions in the future. However, the overall MAPE
of the software version, assuming the first 500 samples presented to the network
are dedicated to learning, is calculated to be 0.154 ± 0.0014 (0.171 ± 0.002 for
5-step prediction), while the hardware equivalent is 0.174 ± 0.002 (0.205 ± 0.0046
for 5-step prediction). This degradation may be attributed to the asymmetrical
characteristics of the memristor devices leading to disparity in the network learning
Figure 7.4: A snapshot of (a) the power consumption of the Hot-Gym dataset
recorded every hour over approximately 4 days, (b) the taxi demand in New York
City estimated at every hour, (c) The daily minimum temperature in Melbourne,
Australia, (d) The number of successful observed sunspots for 230 years.
and forgetting rate. Applying the union property to the distal segment and forming
its distal synapses using LFSRs might have negative consequences as well, especially
when making higher order predictions.
Table 7.3 illustrates the HTM system evaluation using other time series benchmarks
including NYC-Taxi4 [117, 136], daily temperature5 [137], and monthly observed
sunspots6 [138], while predicting the next 2 and 5 steps in advance. Here, one
may observe that the overall network MAPE of the HTM-SW and HTM-HW
models shows marginal differences for most benchmarks. However, the MAPE is
relatively low for large datasets such as NYC-Taxi, reflecting network capability in
4The passenger demand for New York city taxis recorded at every hour (dataset size = 10,321).
The dataset is publicly available by the New York City transportation authority.
5Daily minimum temperatures in Melbourne, Australia (1981-1990). The dataset size has 3,605
samples, and it is smoothed with moving window average (window size = 5) to reduce the noise.
6Monthly count of the number of the observed sunspots for about 235 years (1749 - 1983). Total
number of samples is 2820, and it is smoothed with moving window average as well.
Figure 7.5: MAPE for predicting the power consumption in a gym for the next
2 and 5 hours using HTM software (HTM-SW) and HTM hardware (HTM-HW)
models.
learning more temporal patterns and making valid predictions. For small datasets
with observable trends (e.g. daily temperature dataset), the MAPE is also low
as the network takes into account the seasonal effects while making predictions.
In the cases where the network has a high MAPE score, especially when making
advance predictions (5 steps or more) as in the sunspots benchmark, the network
partially failed in making accurate prediction due to the high level of noise and the
inconsistent fluctuations which make the processed data less stationary.
Table 7.3: HTM network evaluation while predicting the next 2-5 steps ahead
in time of various time series benchmarks. We indicate the averaged MAPE ±
standard deviation over 5 trials.
Benchmark     MAPE 2-Step                          MAPE 5-Step
              HTM-SW            HTM-HW             HTM-SW            HTM-HW
Hot-Gym       0.154 ± 0.0014    0.170 ± 0.0019     0.171 ± 0.002     0.202 ± 0.0017
NYC-Taxi      0.092 ± 0.0005    0.0996 ± 0.0014    0.149 ± 0.0026    0.156 ± 0.0084
Daily-Temp    0.106 ± 0.001     0.120 ± 0.0005     0.165 ± 0.0012    0.177 ± 0.0013
Sunspot       0.151 ± 0.0018    0.157 ± 0.0023     0.268 ± 0.0005    0.268 ± 0.0033
It is important to mention here that the HTM system is seeing each input only
once and bringing the error level down usually takes several online training cycles.
Since measuring the overall MAPE takes into account the earlier predictions (after
presenting 500 samples), its values are expected to be relatively high.
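The way the overall MAPE is accumulated after a learning period can be sketched as follows. The normalised form sum|y − ŷ| / sum|y| is an assumption (the conventional per-sample form would instead divide each error by its own y_t), and the helper names and the 500-sample warm-up default are illustrative.

```python
def mape(actual, predicted, warmup=500):
    """Normalised absolute error, sum|y - y_hat| / sum|y|, ignoring the first `warmup` samples."""
    y = actual[warmup:]
    y_hat = predicted[warmup:]
    num = sum(abs(a - p) for a, p in zip(y, y_hat))
    den = sum(abs(a) for a in y)
    return num / den

def align_k_step(series, predictions, k):
    """Pair the prediction made at time t with the value observed at time t + k."""
    return series[k:], predictions[:len(series) - k]

actual = [10.0] * 600
predicted = [0.0] * 500 + [9.0] * 100   # poor early predictions, then a 10% error
print(mape(actual, predicted))           # early samples are excluded by the warm-up
```

Because every sample after the warm-up contributes to the sum, errors made shortly after the learning period remain in the overall score, which is why the reported values stay relatively high.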
7.2.2 Noise Robustness
Two experiments are considered to evaluate the robustness of the HTM system to
noise while processing temporal information. The first considers the case where the
time-series data presented to the HTM network encoder is already corrupted with
noise. The goal here is to observe the network’s capability to handle noisy input
(external noise) within an acceptable range of signal-to-noise ratio (SNR). In the
second, the noise is introduced after the data is encoded, and the aim is to assess
the network’s robustness to internal noise that may occur due to unexpected device
failure (e.g. dead cells) or inefficient encoding. In both experiments, the Hot-Gym
dataset is used.
Figure 7.6: A sample of the Hot-Gym dataset before and after superimposing
Gaussian noise injected according to a Bernoulli distribution with a probability
of 0.5. The SNR is 19.74 dB.
In the first experiment, the Hot-Gym dataset is superimposed with varying
degrees of uniform or Gaussian noise. The amount of noise added to the input data
is defined by the voltage SNR, and the noise is injected based on the Bernoulli
distribution with a probability of 0.3 and 0.5, respectively. Figure 7.6 shows a
sample of the dataset before and after superimposing Gaussian noise (SNR=19.74 dB,
P=0.5 for the Bernoulli distribution). Figure 7.7 depicts the variation in the
HTM-HW MAPE as a function of the SNR level measured in dB. The region not shaded
in light red represents the acceptable SNR range where the network performance
should not be severely degraded. Regardless of the noise type, it can be seen that
the HTM-HW MAPE is reduced by ≈ 4.7%, reflecting the network’s capability to
handle external noise within the recommended range. For comparison, in the
HTM-SW, where no hardware constraints are considered, the range of the MAPE for
the same degrees of uniform/Gaussian noise (P=0.5) is [0.157±0.001
- 0.221±0.007]/[0.162±0.001 - 0.279±0.003]⁷.
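One way to superimpose noise at a prescribed voltage SNR on a Bernoulli-selected subset of samples is sketched below. It is only an illustration under stated assumptions: the SNR is taken as 20·log10 of the ratio of RMS amplitudes, the noise is rescaled after masking so that the injected noise meets the target exactly, and the names `add_noise` and `measured_snr_db` are hypothetical. The sine-wave input is a stand-in and is not the Hot-Gym data.

```python
import math
import random


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


def measured_snr_db(signal, noise):
    """Voltage (amplitude) SNR in dB."""
    return 20.0 * math.log10(rms(signal) / rms(noise))


def add_noise(signal, target_snr_db, p, kind="gaussian", seed=0):
    """Corrupt each sample with probability p, then scale to the target SNR."""
    rng = random.Random(seed)
    if kind == "gaussian":
        draw = lambda: rng.gauss(0.0, 1.0)
    else:
        draw = lambda: rng.uniform(-1.0, 1.0)
    # Bernoulli(p) mask: a sample receives noise only when the coin flip succeeds.
    noise = [draw() if rng.random() < p else 0.0 for _ in signal]
    # Rescale so that 20*log10(rms(signal) / rms(noise)) equals target_snr_db.
    scale = rms(signal) / (10.0 ** (target_snr_db / 20.0)) / rms(noise)
    noise = [scale * n for n in noise]
    return [s + n for s, n in zip(signal, noise)], noise


signal = [10.0 + 5.0 * math.sin(2.0 * math.pi * t / 24.0) for t in range(1000)]
noisy, noise = add_noise(signal, target_snr_db=19.74, p=0.5)
print(round(measured_snr_db(signal, noise), 2))
```

Scaling after masking means that a smaller Bernoulli probability concentrates the same noise energy in fewer, larger disturbances; whether the dissertation scales before or after masking is not stated here.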
⁷ For reference, the same experiments are performed for long short-term memory
(LSTM) with no hardware constraints. A stacked LSTM structure with a sliding
window of 100 samples is used. The number of epochs, i.e. the number of times the
same sliding window is presented to the network for training, is set to 10. The
approximate range of the measured MAPE for the same uniform/Gaussian noise level
as in the HTM is [0.186±0.011 - 0.249±0.005]/[0.169±0.014 - 0.285±0.003]. Thus,
HTM outperforms LSTM with marginal differences in the measured MAPE, and HTM is
superior in terms of latency.
Figure 7.7: MAPE while predicting the power consumption in a gym for the next
two hours in the presence of various degrees of uniform noise (left) and Gaussian
noise (right).
In the second experiment, we use the same dataset, but the noise is injected
internally within the network. Specifically, the noise is injected at the encoder,
spatial pooler, and temporal memory outputs. Initially, the noise injection is
confined to individual components to see how the noise propagates downstream
and changes the network’s decision and eventually its performance. Then, the
noise is collectively added to every part of the network at the same time. Since
the underlying data structure in the HTM system is the binary SDR, the noise here
is injected by flipping bits, where 10% of noise means (0.1 × nw) of the active
and inactive bits in a given SDR representation are flipped. For instance, if an
SDR representation (length = 512 bits, number of active bits = 35) is generated by
an encoder, 10% of noise causes a relocation of 20% of the active bits. Figure 7.8
demonstrates the HTM network performance after injecting the noise. The following
can be observed:
• Noise injected at the early stages of the network, such as the encoder,
degrades the network performance more severely. This is because the noise
added to the encoder output can negatively impact the spatial representation
of the input as generated by the spatial pooler, leading to less robust
prediction and sequence learning by the temporal memory.
• The limited sparsity in the early stages as compared to downstream
components (encoder output sparsity = 7%, temporal memory output sparsity =
1-4%) leads to more disruption in the SDR representations and more loss of
information.
• Unlike in spatial tasks, the dependencies among the time series patterns
get corrupted by the added noise and the network error accumulates over
time.
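The bit-flip noise used in the second experiment can be pictured with the short sketch below. It assumes that an equal number of active and inactive bits are flipped, so the sparsity of the SDR is preserved and each flipped pair relocates one active bit; the helper name `flip_sdr_bits` is hypothetical. With 35 active bits, flipping 7 of them corresponds to the 20% relocation quoted in the example above.

```python
import random


def flip_sdr_bits(sdr, n_flip, seed=0):
    """Turn n_flip active bits off and n_flip inactive bits on."""
    rng = random.Random(seed)
    active = [i for i, bit in enumerate(sdr) if bit == 1]
    inactive = [i for i, bit in enumerate(sdr) if bit == 0]
    noisy = list(sdr)
    for i in rng.sample(active, n_flip):
        noisy[i] = 0
    for i in rng.sample(inactive, n_flip):
        noisy[i] = 1
    return noisy


# 512-bit SDR with 35 active bits, as in the encoder example.
rng = random.Random(42)
on_bits = set(rng.sample(range(512), 35))
sdr = [1 if i in on_bits else 0 for i in range(512)]
noisy = flip_sdr_bits(sdr, n_flip=7)

overlap = sum(a & b for a, b in zip(sdr, noisy))
print(sum(noisy), overlap)
```

The overlap between the clean and the noisy SDR is what downstream components actually see, which is why the same number of flipped bits is more harmful in a denser representation with fewer redundant active bits.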
Figure 7.8: The impact of injecting internal noise on HTM system performance. In
separate experiments, the noise is injected into the encoder, spatial pooler, and
temporal memory outputs, respectively. Then, the noise is collectively added to
the entire network at the same time.
7.3 System Evaluation
The metrics utilized to evaluate the proposed HTM system include system latency,
system reliability and lifespan, system robustness to device failure, and power
consumption. The dataset used during the evaluation is the Hot-Gym dataset.
7.3.1 Latency
The latency is measured as the time required for the HTM system to process an
SDR input generated by the encoder. In this context, HTM processes SDR inputs
of the Hot-Gym dataset, where each input is encoded as a 527-bit binary vector.
The spatial pooler and temporal memory phases here are performed simultaneously⁸
and in a pipelined fashion to minimize the latency, which is estimated to be
11.64 µs. Figure 7.9 shows the latency of the CMOS digital HTM (system clk =
100MHz) and the proposed mixed-signal HTM (system clk = 8MHz) as a function of
the network size, given by the number of mini-columns⁹. One can notice that the
latency of the digital HTM is always higher than that of its mixed-signal
counterpart. This can be attributed to several reasons. The first is the need for
the initialization phase¹⁰ in the digital HTM design to set the synaptic
connections’ permanences, particularly those of the proximal synapses, prior to
receiving any input. The initialization of the synaptic connections’ permanence is
achieved for free in the mixed-signal design, as the memristors after the formation
process have random conductance with a Gaussian distribution [22]. Second, in the
digital design, tuning the synaptic connections, proximal or distal, is performed
simultaneously at the cell and mini-column levels, but within them it is sequential
because the permanence values are stored in distributed memory, where the
read/write operations take several clock cycles. In the mixed-signal design, on the
contrary, the tuning process is performed concurrently even within the mini-columns
or cells and usually takes two clock cycles. Finally, in the digital HTM, the
winning mini-columns that represent the input are decided in a sequential fashion
to cut down the resource cost and power consumption. This in turn translates to
longer latency that is proportional to the number of mini-columns. In the
mixed-signal design, we utilized a WTA circuit (discussed in Chapter 6), which
processes all the inputs concurrently.
⁸ Spatial pooler and temporal memory operate simultaneously when the H-Tree bus is
exploited by either of them.
Table 7.4 shows the latency of the core building blocks in the HTM system for both
digital and mixed-signal designs. It can be seen that although the mixed-signal
design is running ≈ 12× slower than the digital counterpart, in most cases, we
have an improvement in the computational speed. The only scenario in which the
computational speed is reduced is when the SDR classifier is used for prediction.
⁹ The number of cells per mini-column is set to 4. The distal segment size in each
cell increases with the network size. The rate of increase is 20 and the maximum
distal segment size a cell can have is 140. In the mixed-signal HTM, the distal
segment size is fixed and set to 256.
¹⁰ The initialization process imposes additional latency only when we run the
network for the first time.
Figure 7.9: Latency of the digital and mixed-signal HTM designs as a function of
the network size, given by the number of mini-columns (the number of cells per
mini-column is 4).
This is because of the limited bandwidth for transferring the active cells’ indices
from the HTM region to the SDR classifier¹¹. One may mitigate the bandwidth
bottleneck effect by transferring the active cells’ status rather than the indices of
the active ones directly to the SDR classifier (similar to what we did in the spatial
pooler+SDR classifier [31]), but this will be at the cost of increasing the number of
interconnects and eventually the power consumption.
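The relationship between clock rate, cycle count, and speedup in this comparison can be checked with a few lines of arithmetic. The sketch below uses a subset of the latencies reported in Table 7.4 and the two system clocks (100 MHz digital, 8 MHz mixed-signal); the helper names are hypothetical, and the cycle counts are simply latency multiplied by the system clock, which ignores the separate 128 MHz LFSR clock of the mixed-signal design.

```python
DIGITAL_CLK_HZ = 100e6
MIXED_CLK_HZ = 8e6

# (module, digital latency in µs, mixed-signal latency in µs), from Table 7.4
LATENCY_US = [
    ("Spatial pooler", 30.65, 5.25),
    ("Temporal memory", 40.25, 7.070),
    ("HTM region", 40.25, 11.625),
    ("SDR classifier [Prediction]", 12.66, 15.125),
    ("HTM region + SDR classifier [Prediction]", 52.91, 26.75),
]


def speedup(digital_us, mixed_us):
    """Ratio of digital latency to mixed-signal latency (above 1 means faster)."""
    return digital_us / mixed_us


def cycles(latency_us, clk_hz):
    """Latency expressed in system clock cycles."""
    return latency_us * 1e-6 * clk_hz


print(f"clock ratio: {DIGITAL_CLK_HZ / MIXED_CLK_HZ:.1f}x")
for name, digital, mixed in LATENCY_US:
    print(f"{name}: {speedup(digital, mixed):.2f}x, "
          f"{cycles(digital, DIGITAL_CLK_HZ):.0f} vs "
          f"{cycles(mixed, MIXED_CLK_HZ):.1f} cycles")
```

Seen this way, the mixed-signal design wins because it needs far fewer clock cycles per input, which more than offsets its slower clock everywhere except in the bandwidth-limited prediction path of the SDR classifier.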
¹¹ In the mixed-signal design, the active cells’ indices are transferred at a rate
≈ 12× slower than in the digital counterpart.
Table 7.4: Latency of the core components in the HTM system (nc = 961 and
nm = 4) estimated for the digital and mixed-signal designs while processing the
Hot-Gym dataset. The digital system is clocked at 100MHz, while the mixed-signal
system is dual clocked; 8MHz is used as a system clock and 128MHz is used to
clock the cells’ LFSRs.
Module                                           Digital Design   Mixed-Signal Design   Improvement
Spatial pooler (µs)                              30.65            5.25                  5.83×
Temporal memory (µs)                             40.25            7.070                 5.69×
HTM region (µs)                                  40.25            11.625                3.462×
SDR classifierᵃ (µs)                             1.82             0.125                 14.52×
SDR classifierᵇ (µs) [Prediction]                12.66            15.125                0.837×
Spatial pooler + SDR classifierᵃ (µs)            32.47            5.375                 6.04×
HTM region + SDR classifier [Prediction] (µs)    52.91            26.75                 1.977×
ᵃ When the SDR classifier is used for classification, it receives the status of each
mini-column rather than the index of the active ones, i.e. no encoding or decoding
processes are necessary.
ᵇ The reduction in the latency of the SDR classifier (as a standalone unit) when used
for prediction is 7.12×.
7.3.2 Network Reliability and Lifespan
The memristor device write endurance, which is the number of times a memory
cell can be overwritten successfully, turns out to be a crucial factor in determining
network sustainability for learning. Memristor devices, particularly oxide-based
devices, have a typical endurance in the range of 10^6 to 10^12 write cycles [139].
This low endurance reduces the network reliability for online learning and
continuous adaptation, especially when the network is densely connected and all
neurons need to be updated continuously. For the HTM network, this is not the case,
as cells/mini-columns activity is sparse in nature and the learning is confined only
to the active ones. This feature endows the network with longer elasticity (lifespan)
in comparison to other networks. In order to estimate the elasticity of mini-columns
in the HTM network, we need to estimate their successful training rounds and
likelihood of activation, as given by Equation (7.2), where Lr indicates the number
of successful learning rounds, and Ed is the memristor device endurance. One may
notice that here we consider the likelihood of activating mini-columns rather than
cells. The mini-columns activation is chosen for this analysis because the proximal





In the ideal scenario, mini-columns in the HTM network are activated with equal
likelihood by patterns detected at the proximal segments. Thus, the number
of successful learning rounds that can be made, given Ed = 10^9, nc = 961, and
nw = 40, is 240 × 10^8. This is equivalent to ≈ 8 years of successful continuous
learning performed at a rate of one round every ≈ 10 ms. Comparatively, this is 24×
more than a conventional network with no sparse activities, and X¹² times more than
the SNN¹³. Although SNNs are asynchronous and sparse in nature, a neuron in the SNN
usually fires, and its synaptic connections are tuned, multiple times while
processing a single input. This is because each input is stochastically encoded as a
stream of spikes.
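The ideal-case figures quoted above can be reproduced with a short calculation. The sketch assumes that the number of learning rounds equals the device endurance divided by the activation likelihood of a mini-column, nw / nc; this form is inferred from the quoted numbers (Ed = 10^9, nc = 961, nw = 40 giving ≈ 240 × 10^8 rounds and a 24× gain) and is not a transcription of Equation (7.2). The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def learning_rounds(endurance, n_columns, n_winners):
    """Rounds until a uniformly used mini-column exhausts its write endurance."""
    activation_likelihood = n_winners / n_columns
    return endurance / activation_likelihood


def lifespan_years(rounds, period_s):
    """Wall-clock lifespan when one learning round is performed every period_s."""
    return rounds * period_s / SECONDS_PER_YEAR


rounds = learning_rounds(endurance=1e9, n_columns=961, n_winners=40)
print(f"{rounds:.4e} learning rounds")
print(f"{lifespan_years(rounds, 10e-3):.1f} years at one round per 10 ms")
print(f"{rounds / 1e9:.1f}x the rounds of a network that updates every device")
```

Under these assumptions the calculation gives about 7.6 years, which the text rounds to ≈ 8 years, and the gain over a dense network is simply nc / nw.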
The above comparison assumes that the mini-columns’ activation is perfectly
regularized by incorporating the homeostatic plasticity mechanism (or boosting).
In real-world scenarios, this is not the case, because the mini-columns’ activation
is highly affected by feed-forward input statistics. Figure 7.10 is an example
demonstrating an estimation of the developed system elasticity (lifespan) for the
Hot-Gym dataset. Here, we see that after year 4, a gradual loss in mini-columns’
elasticity starts to occur. Even after 8 years of work, ≈ 309 mini-columns are still
elastic and have the capability to acquire new information. However, the overall
network performance at that time would be limited. Figure 7.11 illustrates the
capability of extending the lifespan of the HTM network via increasing the number
of mini-columns while maintaining the same rate of performing the training process
(≈ 10 ms). It can be observed that doubling the number of mini-columns in the HTM
region can extend the network lifespan up to ≈ 17 years.
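The gradual loss of elasticity under uneven activation can be illustrated with a toy model. In the sketch below, each mini-column is given its own activation likelihood and is counted as elastic while its expected number of updates stays below the device endurance. The power-law activation profile and the names `skewed_profile` and `elastic_columns` are assumptions made purely for illustration; they are not derived from the Hot-Gym statistics and are not expected to reproduce Figure 7.10.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def skewed_profile(n_columns, n_winners, skew):
    """Hypothetical profile: likelihood proportional to rank**(-skew)."""
    raw = [(rank + 1) ** (-skew) for rank in range(n_columns)]
    total = sum(raw)
    # Scale so the expected number of winners per round is n_winners (capped at 1).
    return [min(1.0, n_winners * r / total) for r in raw]


def elastic_columns(p_active, endurance, period_s, years):
    """Count mini-columns whose expected update count is below the endurance."""
    rounds = years * SECONDS_PER_YEAR / period_s
    return sum(1 for p in p_active if p * rounds < endurance)


uniform = skewed_profile(961, 40, skew=0.0)   # ideal, perfectly regularized case
skewed = skewed_profile(961, 40, skew=0.5)    # frequently used columns wear out first

for years in (2, 4, 6, 8):
    print(years,
          elastic_columns(uniform, 1e9, 10e-3, years),
          elastic_columns(skewed, 1e9, 10e-3, years))
```

In the uniform case all mini-columns wear out together, whereas in the skewed case the busiest ones are exhausted early while rarely used ones remain elastic well beyond the ideal lifespan, which is the qualitative behaviour described above.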
¹² X is not specified here because it is highly affected by the input and the
encoding approach.
¹³ SNNs are usually trained with spike-timing-dependent plasticity (STDP) rules.
STDP requires neurons to be tuned based on the time difference between the pre- and
post-synaptic neuron spike generation.
Figure 7.10: Elasticity (lifespan) of the overall HTM mini-columns in the ideal and
real-world scenarios.
Figure 7.11: The impact of the HTM network scaling (number of mini-columns nc)
on the elasticity (lifespan).
7.3.3 Device Failure and Network Robustness
There are various types of memristor defects that may affect network performance,
and they usually occur due to process variation [69, 42]. Examples of device defects
are ageing faults, endurance degradation faults, switching delay faults, and stuck-at
faults [70, 71] (discussed earlier in Chapter 2). Here, we focus on the stuck-at
fault as it is ubiquitous and has a high impact on network performance [71]. Two
types of stuck-at faults are studied. The first investigates the impact of stuck-on
(high conductance state) faults on HTM system performance while making two-step-ahead
predictions for the Hot-Gym dataset. The second focuses on the stuck-off (low
conductance state) effect. Figure 7.12 illustrates the MAPE, averaged over 5 runs,
of the HTM-HW prediction as a function of the faulty¹⁴ device percentage for the
aforementioned cases. It can be seen that the stuck-off fault has a marginal positive
impact on the network performance as it leads to an increase in the network sparsity
level. In contrast, the stuck-on fault increases the MAPE by 1.54%, and the increase
can reach 4.54% when the fault percentage is 30%. This degradation in performance
arises from the fact that the SDR classifier is implemented using a softmax classifier
with weighted synapses realized using a memristive crossbar. Having 10% of stuck-on
faults in a crossbar means that, on average, every row and column in the crossbar has
55 and 344 defective devices, respectively. This eventually makes the softmax
classifier output nodes fire excessively and unable to distinguish various pattern
activities. During the fault analysis, it is also found that applying the fault solely
to the mini-columns’ proximal synapses results in a marginal change in the system
performance. This is because each input sample presented to the HTM is spatially
represented by a small population of active mini-columns, and a slight change in the
representation pattern, which may result from the fault, has very low impact.
Furthermore, using the k-winner (inhibition) and boosting mechanisms mitigates
the changes that may occur in spatial patterns as generated by the HTM region.
¹⁴ The fault is applied to the proximal connections and SDR classifier weights.
Figure 7.12: The MAPE (averaged over 5 runs) of the HTM-HW predicting two
steps ahead in time for the Hot-Gym dataset while experiencing various types of
stuck-at faults.
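A stuck-at fault study of this kind can be set up by pinning a random fraction of crossbar devices to a fixed conductance. The sketch below shows one way to do this; the function name `inject_stuck_at`, the crossbar dimensions, and the conductance values are hypothetical placeholders rather than parameters of the developed system.

```python
import random

G_ON = 1e-4    # hypothetical stuck-on (high conductance) value, in siemens
G_OFF = 1e-6   # hypothetical stuck-off (low conductance) value, in siemens


def inject_stuck_at(conductances, fault_fraction, stuck_value, seed=0):
    """Return a copy of the crossbar with a random fraction of devices pinned."""
    rng = random.Random(seed)
    rows, cols = len(conductances), len(conductances[0])
    n_faulty = round(fault_fraction * rows * cols)
    faulty = rng.sample(range(rows * cols), n_faulty)
    damaged = [row[:] for row in conductances]
    for idx in faulty:
        damaged[idx // cols][idx % cols] = stuck_value
    return damaged, set(faulty)


# Small illustrative crossbar: 20 rows x 30 columns of healthy devices.
healthy = [[5e-5 for _ in range(30)] for _ in range(20)]
stuck_on, faulty = inject_stuck_at(healthy, fault_fraction=0.10, stuck_value=G_ON)

per_row = [sum(1 for g in row if g == G_ON) for row in stuck_on]
print(len(faulty), sum(per_row) / len(per_row))
```

With uniformly random placement, the expected number of defective devices in a row is the fault fraction times the row length, which is the kind of reasoning behind the per-row and per-column defect counts quoted above.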
7.3.4 Power Analysis
7.3.4.1 Power Consumption
The average total power consumption of the developed HTM system (excluding
the SDR classifier¹⁵) while predicting time-series data from the Hot-Gym dataset
is estimated to be 28.94 mW, and 29.38 mW¹⁶ when online learning is enabled.
The higher power consumption during training is due to the use of a high voltage
(memristor training voltage ≈ 1.1 V) and extra clock cycles to modulate the
memristor devices. Figure 7.13 demonstrates the estimated total power consumption
over time. Initially, 17.18 mW is consumed while transferring the input SDRs through
the H-Tree¹⁷ to the mini-columns and establishing the proximal synaptic connections,
which take place simultaneously. The power then abruptly increases due to the
activation of the proximal segments to compute the mini-column overlap scores.
Once the winning mini-columns are selected, the spatial pooler learning phase
starts, in which the memristors associated with proximal synapses are modulated.
Meanwhile, the prior active cells’ addresses are routed to each cell in the winning
mini-columns to compute their overlap score (the overlap score used for the segment
matching check). This gives rise to activating the distal segments and another
abrupt increase in the power consumption (at time ≈ 10.5 µs). However, because
computing the cells’ overlap scores is confined only to the cells within the
winning mini-columns while other cells are disabled through clock-gating, the
increase in power consumption is smaller than the one that occurred while computing
the overlap scores of the mini-columns.
¹⁵ The average power consumption of the SDR classifier is ≈ 3.35 mW.
¹⁶ The approach used to estimate the power consumption is described in our previous
work [31].
¹⁷ The H-Tree structure might be buffered with full-swing and reduced-swing buffers,
proposed in [124], to minimize the power consumption further.
Figure 7.13: The total power consumption of the developed HTM system as it
processes and predicts time-series data from Hot-Gym dataset.
After computing the cells’ overlap scores, the cells of the winning mini-columns
locally compete to represent the input contextually. The selected active cells
form the lateral connections with the neighboring cells and tune their distal
connections accordingly. One may observe from the previous discussion that
tuning and computing the overlap scores turn out to be the most power-hungry
operations, as there are more than 45.15k synapses involved in the network
computations. Figure 7.14 demonstrates the averaged power consumption of the
proximal and distal segments involved in training. At run time, the proximal
segments are initialized with random permanence (≈ 50% of the synapses are
connected). After the network starts learning the feed-forward connections, the
number of connected proximal synapses is reduced to 47%. This explains the
exponential reduction in the power consumption, which later settles to ≈ 2.05 mW. In
contrast to proximal segments, the distal segments start with no formed connections;
thereby the average power consumption for distal segments starts at almost
zero. However, as the network starts learning sequences, new distal connections
are formed and the power starts to increase.
Figure 7.14: The average power consumption of the proximal segments (left) and
distal segments (right) involved in training.
During overlap computing, the average power consumption of the proximal segments
follows the same pattern, but it is significantly higher (see Figure 7.15). This is
because all the proximal segments of the HTM system are involved in computing
the mini-columns’ overlap scores. Unlike the proximal segments, the distal segments
still consume a small amount of power as only the segments associated with active
mini-columns are involved in computations. It is important to mention here that
there are various approaches that can be adopted to cut down the power consumed
by the proximal or distal segments. One possible way is to modify the network
structure or to split the power-hungry computations (e.g. computing the overlap
scores) into multiple stages at the mini-column or cell level, but this will be at
the expense of increasing the overall network latency.
Figure 7.15: The average power consumption of all the proximal segments (left)
and distal segments (right) while computing the overlap scores.
7.3.4.2 Energy-Delay Product and Power Distribution
The energy-delay product of the developed HTM system is estimated as a function
of the network size in terms of the number of mini-columns and cells. It is found
that the energy-delay product increases more sharply with the number of cells
than with the number of mini-columns due to the cell’s complexity. Figure 7.16
depicts the contour plot of the energy-delay product, which can be used to pick
the optimal network architecture for a given power consumption and latency
requirement.
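For a single operating point, the energy per processed input and the energy-delay product follow directly from average power and latency. The sketch below pairs the average power with online learning enabled (29.38 mW, excluding the SDR classifier) with the estimated latency per SDR input (11.64 µs). Treating these two reported figures as one operating point is an assumption made for illustration, and the function names are hypothetical.

```python
def energy_joules(power_w, latency_s):
    """Energy spent per processed input: E = P * t."""
    return power_w * latency_s


def energy_delay_product(power_w, latency_s):
    """EDP = E * t = P * t**2."""
    return energy_joules(power_w, latency_s) * latency_s


POWER_W = 29.38e-3     # average total power with online learning enabled
LATENCY_S = 11.64e-6   # estimated latency per SDR input

print(f"energy per input: {energy_joules(POWER_W, LATENCY_S) * 1e9:.1f} nJ")
print(f"energy-delay product: {energy_delay_product(POWER_W, LATENCY_S):.3e} J*s")
```

Because the delay enters the product twice, a change in network size that lengthens the latency penalizes the energy-delay product more strongly than an equal relative increase in power.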
Figure 7.17 shows the distribution of the power consumption among the different
entities of the proposed HTM system during the training and testing modes. It
shows that in the HTM-Test mode, most of the power consumption is devoted to the
HTM cells as they are more complicated and have a large number of synaptic
connections. During the training mode, HTM-Train, the cells and mini-columns
draw additional power to modulate their synaptic connections. On the contrary, the
MCU and other units (arbiter and selector, excluding the H-Tree) consume a small
fraction of the total power as they are less complex and have limited memory usage.
Figure 7.16: Contour of energy-delay-product for the developed HTM system as a
function of the network size.
Figure 7.17: The distribution of the power consumption among the building entities
of the developed HTM system during the training and testing modes.
7.3.4.3 Discussion
In an endeavour to compare our work with previous HTM implementations in
literature (see Table 7.5), we found that performing relative comparisons is a
challenging process due to the lack of similarity in network architectures, technology
nodes, operating frequency, etc. Thus, we attempt to bring all networks to the
same size in terms of the mini-columns and cell count. Also, we hypothesize that
the size of the networks can be scaled linearly and the same is applied to their
power consumption.
Figure 7.18: Normalized power consumption of Pyragrid with respect to GPU [7],
CPU, state-of-the-art digital custom designs (green bars), and memristor-based
analog and mixed-signal designs (red bar).
Starting with Krestinskaya et al. [140]¹⁸, here we scaled only the number of
mini-columns and the single pixel processing elements (total = 961 × 1), as detailed
information about the distal segments and their sizes is not reported, and this
results in a 17.45× improvement (see Figure 7.18¹⁹). In the case of fully CMOS
digital designs, a 161.37× improvement is achieved when compared to our previous
work in [50], and 31.75× and 22.29× when compared to the work done by Li Weifu et
al. [88, 7]. In contrast to other previous works, the power consumption reported in
[88] does not consider the register files, which are usually the most power-hungry
components in the design. In the case of [7], it is unclear if the register files’
power consumption is included. It is important to mention here that, in most cases,
the overall networks’ synaptic connections have not been included in the
aforementioned scaling process as there is no clear approach to estimate the power
consumption for the individual synaptic primitives. However, since our design uses
more synaptic connections, equating our design with previous works in terms of the
synaptic connection count may result in further improvement in power consumption.
¹⁸ The authors in this paper also consider linear scaling for the network size and
the power consumption.
¹⁹ An improvement of orders of magnitude in power efficiency is also observed when
comparing Pyragrid with CPU and GPU implementations.
Table 7.5: A comparison of the proposed HTM system with the previous work. One
may note that these implementations are on different substrates, so this table
offers a high-level reference template for HTM hardware rather than an absolute
comparison.
Algorithm                 Memristive HTM [140]   PIM HTM [50]                  PE-1 HTM [88]   PE-2 HTM [7]        This work
Task                      Classification         Classification & Prediction   Prediction      Image Recognition   Classification & Prediction
Operating frequency       -                      100MHz                        100MHz          100MHz              Dual, 8-128MHz
Proximal segment size     9                      16                            1               40                  31
Distal segments × size    -                      5x10                          -               12x16               Shared, 256
Total power consumed      13.34mWᵇ               417mW                         516mWᵃ          4.1W                29.38mW
Dataset                   AR, TIMIT, & ORL       MNIST                         MNIST           KTH                 MNIST & Hot-Gym
Mini-columns × cells      25×Xᶜ                  100×3                         400×2           2048×32             961×4
Latency (ms)              -                      0.0057                        0.0045          6.04                0.0116
Technology node           TSMC 180nm             TSMC 65nm                     Nangate 45nm    GF 65nm             IBM 65nm
ᵃ In [88], the power consumption is reported for a single processing element (PE)
without considering the register files. Thus, we linearly scaled the power for an HTM
network of size 400 × 2.
ᵇ In this reference, the temporal memory power is reported for single pixel processing.
This value is multiplied by the total number of mini-columns to estimate the total
power of an HTM region with 25 mini-columns with one cell each.
ᶜ X denotes an unknown number of cells.
7.4 Summary
In this chapter, the proposed hybrid CMOS/Memristor mixed-signal architecture
of the HTM network (Pyragrid) is evaluated. The high-level behavioral model of
the architecture is verified for image classification and time-series data prediction.
It is found that the performance (classification and prediction) of the hardware
model is marginally lower than that of its software counterpart.
This degradation is mainly attributed to the memristor devices’ non-idealities and
the use of the synthetic synapses representation. The proposed architecture is also
evaluated for latency and lifespan. We found that the mixed-signal implementation
is ≈ 2× faster than the pure CMOS implementation and is less affected by
network scaling, while the network elasticity (lifespan) can be more than 8 years,
assuming that learning occurs every 10 ms. When it comes to network robustness,
it is observed that the HTM network is robust to device failure, but this is not the
case for its SDR classifier, which is impacted by stuck-on faults. Furthermore, it is
observed that the power consumption in the proposed architecture is dominated by
the cells, particularly the proximal and distal segments. Thus, in our design, we
strive to limit their use, thereby reducing the average total power consumption of
the network to 29.38 mW.
Chapter 8
Conclusions and Future Work
The challenges associated with enabling on-device training and processing spatial-
temporal information locally on edge devices without or with minimum cloud
support are addressed in this work. The first challenge, which involves enabling
on-device training, is addressed by developing a training system with a writing
scheme (Ziksa) optimized for area and power consumption. The proposed training
system leverages time-multiplexing, resource sharing, and simplified learning rules
to endow the resource-limited devices with the capability to be trained locally
without the need for cloud communication. The proposed training system is verified
with random projection networks (ELM) and biologically inspired algorithms (HTM
and SNNs). Analysis of the developed training system demonstrates order(s) of
magnitude reduction in power consumption as compared to conventional training
approaches.
Enabling on-device spatial and temporal information processing is addressed
through developing a hybrid CMOS/Memristor mixed-signal architecture of the
HTM algorithm (Pyragrid). The proposed architecture offers several advantages
over previous digital and analog designs as it incorporates several plasticity
mechanisms, such as synaptogenesis and neurogenesis, that endow the network with a
high degree of plasticity with life-long learning and minimal energy dissipation. All
the plasticity mechanisms are enabled locally on-device, and made possible by the
proposed training system and the synthetic synapses representation, a novel digital
communication scheme. Furthermore, the proposed architecture offers real-time
processing which turns out to be less affected by network scale, unlike pure CMOS
digital designs. Besides the real-time processing, orders of magnitude reduction
in power consumption are achieved via leveraging network sparsity, limiting the
activities of the power-hungry units, and using low-power techniques such as clock
gating.
The utility of the proposed systems is demonstrated on a range of tasks including
image classification and time series data prediction. Although HTM exhibits only
reasonable accuracy in classification tasks, in time series prediction it achieves
a performance comparable to state-of-the-art networks. However, what
can make the HTM favorable is the ability to perform multiple tasks using the
same framework without radical changes in network structure, the capability to
carry out online learning in the presence of noise and device failure, and the ability
to handle branching sequences.
Future work may involve further exploration of the following:
On-device training: Future work regarding on-device training may involve
exploring the concept of stacking (3D architecture) and tiling of the memristive
crossbar. Tiling implies dividing the crossbar into small tiles to reduce the sneak
path currents and the IR effect during the training process. Although the tiling
process may incur an increase in resources, this increase can be limited solely to the
writing scheme, Ziksa, which is extremely simple. Speeding up the training can
be another important topic to explore. The current design considers column-wise
training. Future designs may explore the possibility of training more than one
column or the whole crossbar at the same time. While this may bring us back to
the high power consumption issue during the training process, using approaches
such as structured gradient sparsification can limit the write operation to a small
number of memristors sparsely distributed across the crossbar. Leveraging the
sparsity and the reduction in gradient computing can result in significant power
savings and a speed-up in the training procedure.
HTM: In this work, the implementation of the HTM algorithm is limited to
one processing region (no hierarchy) and one learning rule (classical Hebbian
rule). Future directions may involve exploring the hierarchical structure of the
HTM regions and studying various approaches to incorporating hybrid learning
(supervised and unsupervised) mechanisms. Including the hierarchy and hybrid
learning attributes may speed up the learning process, facilitate handling more
complex tasks, and enable bi-directional data flow. The bi-directional data flow is
essential to enable hybrid learning where the supervised aspect of the learning is
driven by information regarding "success," "reward," "punishment," and "novelty".
Another direction may focus on the synaptic complexity, for instance using a more
complex synapse model rather than binary synapses. This may address challenges
associated with catastrophic forgetting (spatial information) in the HTM. I believe
exploring these directions may help in paving the way for an HTM algorithm that







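ALGORITHM 1: HTM-Spatial Pooling
Input: $\vec{x}_t \in \mathbb{R}^{n_x \times 1}_{\{0,1\}}$, where $\vec{x}_t \subset X_t$ and $X_t \in \mathbb{R}^{n_x \times n_n}_{\{0,1\}}$;
Output: $\vec{\Lambda}_t \in \mathbb{R}^{1 \times n_c}_{\{0,1\}}$;   /* $n_c$: Number of mini-columns */
1  // Initialization:   /* $n_x$: Input vector length */
2  $S_{ind} \sim$ rand.pseudo, where $S_{ind} \in \mathbb{N}^{n_{sp} \times n_c}_{\{1,n_x\}}$;
3  $S_p[S_{ind}] \leftarrow 1$, where $S \in \mathbb{R}^{n_x \times n_c}$;
4  $\rho_p[S_{ind}] \sim$ rand.uniform[0,1], where $\rho \in \mathbb{R}^{n_x \times n_c}$;
5  $\vec{b}_t \in \mathbb{R}^{1 \times n_c}$, where $\forall\, b_t[j] = 1$;
6  repeat
7      // Overlap and Inhibition:
8      $\bar{\rho}_p \leftarrow I(\rho_p \geq P_{th})$;
9      $\vec{\alpha}_t \leftarrow \vec{b}_t \odot [\vec{x}_t^{\,T} \cdot (S_p \odot \bar{\rho}_p)]$;
10     $\vec{\tilde{\alpha}}_t \leftarrow I(\vec{\alpha}_t \geq \alpha_{th})$;
11     $\vec{\Lambda}_t \leftarrow kmax(\vec{\tilde{\alpha}}_t, \eta, \xi)$;   /* kmax: k-WTA function */
12     // Learning:
13     if Learning == ’Enable’ then
14         $\Delta\rho_p \leftarrow \vec{\Lambda}_t \odot S_p \odot (\lambda\,\vec{x}_t - P^{-}_p)$;
15         $\vec{b}_t \leftarrow e^{-\beta(\bar{a}_t - \langle a_t \rangle)}$;
16     end
17 until $t > n_n$;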
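The overlap, inhibition, and learning steps of a spatial pooler can be sketched in plain Python as follows. This is a simplified illustration of the standard HTM formulation, not a transcription of Algorithm 1: it keeps the overlap magnitudes after thresholding so that the k-winner step can rank them, uses global inhibition, and applies a fixed permanence increment and decrement. All parameter names and default values (`p_inc`, `p_dec`, `beta`, `duty_rate`, and so on) are assumptions.

```python
import math
import random


def init_pooler(n_x, n_c, n_sp, seed=0):
    """Give each mini-column n_sp potential synapses with random permanences."""
    rng = random.Random(seed)
    return {
        "potential": [rng.sample(range(n_x), n_sp) for _ in range(n_c)],
        "perm": [[rng.random() for _ in range(n_sp)] for _ in range(n_c)],
        "boost": [1.0] * n_c,
        "duty": [0.0] * n_c,   # running average of each column's activity
    }


def spatial_pool(sp, x, k, p_th=0.5, overlap_th=1.0, learn=True,
                 p_inc=0.05, p_dec=0.02, beta=1.0, duty_rate=0.01):
    n_c = len(sp["perm"])
    # Overlap: connected synapses (permanence >= p_th) that see an active bit.
    overlap = []
    for j in range(n_c):
        raw = sum(1 for i, p in zip(sp["potential"][j], sp["perm"][j])
                  if p >= p_th and x[i] == 1)
        boosted = sp["boost"][j] * raw
        overlap.append(boosted if boosted >= overlap_th else 0.0)
    # Inhibition: global k-winner-take-all over columns with non-zero overlap.
    ranked = sorted(range(n_c), key=lambda j: overlap[j], reverse=True)
    winners = [j for j in ranked[:k] if overlap[j] > 0.0]
    active = [1 if j in winners else 0 for j in range(n_c)]
    if learn:
        # Hebbian update on winners: reinforce synapses on active input bits.
        for j in winners:
            perm = sp["perm"][j]
            for s, i in enumerate(sp["potential"][j]):
                step = p_inc if x[i] == 1 else -p_dec
                perm[s] = min(1.0, max(0.0, perm[s] + step))
        # Boosting: favour columns that have been less active than average.
        for j in range(n_c):
            sp["duty"][j] += duty_rate * (active[j] - sp["duty"][j])
        mean_duty = sum(sp["duty"]) / n_c
        sp["boost"] = [math.exp(-beta * (d - mean_duty)) for d in sp["duty"]]
    return active


rng = random.Random(1)
n_x, n_c = 64, 32
sp = init_pooler(n_x, n_c, n_sp=16)
on_bits = set(rng.sample(range(n_x), 20))
x = [1 if i in on_bits else 0 for i in range(n_x)]
for _ in range(5):
    active = spatial_pool(sp, x, k=4)
print(sum(active), "active mini-columns out of", n_c)
```

Learning touches only the winning mini-columns, which is the sparsity property the lifespan analysis in Chapter 7 relies on.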
APPENDIX A. HTM PSEUDO CODE 149
A.2 Temporal Memory
ALGORITHM 2: HTM-Temporal Memory
Input: $\vec{\Lambda}_t \in \mathbb{R}^{1 \times n_c}_{\{0,1\}}$; /* $n_c$: Number of columns */
Output: $A_t \in \mathbb{R}^{n_m \times n_c}_{\{0,1\}}$; /* $n_m$: Number of cells */
1 zeros_cnt = 0;
2 repeat
3 # Phase-1: Mini-columns evaluation:
4 for $j \leftarrow 1$ to $n_c$ do
5 if $\vec{\Lambda}_t[j] == 1$ then
6 for $i \leftarrow 1$ to $n_m$ do
7 if $\Pi_{t-1}[i, j] == 1$ then
8 $A_t[i, j] \leftarrow 1$;
9 else
10 zeros_cnt $\leftarrow$ zeros_cnt + 1;
11 end
12 if zeros_cnt == $n_m$ then
13 $A_t[i, j] \leftarrow 1, \forall i$;
14 zeros_cnt = 0;
15 end
16 # Phase-2: Prediction:
17 for $j \leftarrow 1$ to $n_c$ do
18 for $i \leftarrow 1$ to $n_m$ do
19 if $\vec{\Lambda}_t[j] == 0$ and $\Pi_{t-1}[i, j] == 1$ then
20 for $d \leftarrow 1$ to $n_d$ do
21 if D[i, j][d].MatchingSegment then
22  ⇢[i, j][d] (At 1   S[i, j][d])⇥ P 10 ;
23 end
24 else if ~⇤t[j] == 0 then
25 for d 1 to nd do
26 ⇢̄[i, j][d] I(⇢[i, j][d]   Pth) ;
27 S̄[i, j][d] At   S[i, j][d] ;
28 ↵t  ||S̄[i, j][d] · ⇢̄[i, j][d]||1 ;
29 if ↵t   Dth then
30 D[i, j][d].ActiveSegment 1;
31  t[i, j] 1 ;
32 else if ||At · Sd
ij
||1 > 0 then




ALGORITHM 3: HTM-Temporal Memory (Cont.)
37 # Phase-3: Learning:
38 for $j \leftarrow 1$ to $n_c$ do
39 if $\vec{\Lambda}_t[j] == 1$ then
40 for $i \leftarrow 1$ to $n_m$ do
41 if $A_t[i, j] == 1$ then
42 for $d \leftarrow 1$ to $n_d$ do
43 if D[i, j][d].ActiveSegment == 1 then




48 until $t > n_n$;
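Phase 1 of the temporal memory (mini-column evaluation) can be sketched as follows. This is a hedged illustration only: the function name `activate_cells` and the array layout (cells by columns) are assumptions, and the prediction and learning phases are not covered. In each winning mini-column, the cells that were predicted at the previous time step become active; if no cell was predicted, every cell in that mini-column is activated (bursting).

```python
import numpy as np


def activate_cells(winners, predicted_prev):
    """Return the active-cell matrix A_t of shape (n_m, n_c).

    winners        : (n_c,) binary vector of winning mini-columns
    predicted_prev : (n_m, n_c) binary matrix of cells predicted at t-1
    """
    n_m, n_c = predicted_prev.shape
    active = np.zeros((n_m, n_c), dtype=np.uint8)
    for j in np.flatnonzero(winners):
        predicted = predicted_prev[:, j] == 1
        if predicted.any():
            active[predicted, j] = 1   # only the correctly predicted cells fire
        else:
            active[:, j] = 1           # no prediction: burst the whole mini-column
    return active


winners = np.array([1, 1, 0])
predicted_prev = np.array([[1, 0, 1],
                           [0, 0, 0]])
A_t = activate_cells(winners, predicted_prev)
```

In this small example the first mini-column activates its single predicted cell, the second bursts, and the third stays silent because it did not win the inhibition step.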
Appendix B














Figure B.1: Transistor-level schematic of the comparator. The comparator circuit
uses additional buffers and positive feedback. The buffers are used to speed up
signal propagation, whereas the positive feedback endows the comparator with the
capability to handle external/internal noise. Additional transistors (T10 and T9)









Figure B.2: Transistor-level schematic of the 3-stage op-amp with a class-B
output stage to provide low output impedance.
APPENDIX B. CIRCUITS AND DEVICE SIZES 152
Table B.1: Transistor sizes of the comparator.
Transistor Label      Total Width   Finger Width   Finger Length
T4 (Vref input)       8 µm          1 µm           0.06 µm
T3 (In input)         8 µm          1 µm           0.06 µm
T2 (Diff-mid)         7 µm          1 µm           0.06 µm
T1 (Current Mir.)     0.5 µm        0.5 µm         0.06 µm
T6 (Positive FB)      3.6 µm        1.2 µm         0.06 µm
T7 (Positive FB)      3.6 µm        1.2 µm         0.06 µm
T5 (Active Load)      2 µm          1 µm           0.06 µm
T8 (Active Load)      2 µm          1 µm           0.06 µm
T11 (Amp. Stage)      0.20 µm       0.20 µm        0.06 µm
T12 (Amp. Stage)      2 µm          1 µm           0.06 µm
T13 (Buffer 1)        0.20 µm       0.20 µm        0.06 µm
T14 (Buffer 1)        0.50 µm       0.50 µm        0.06 µm
T15 (Buffer 2)        10 µm         1 µm           0.06 µm
T16 (Buffer 2)        4 µm          1 µm           0.06 µm
Table B.2: Transistor sizes of the Op-Amp.
Transistor Label      Total Width   Finger Width   Finger Length
T3 (+ve input)        8 µm          1 µm           0.5 µm
T4 (-ve input)        8 µm          1 µm           0.5 µm
T2 (Diff-mid)         2 µm          1 µm           0.5 µm
T1 (Current Mir.)     2 µm          1 µm           0.5 µm
T5 (Active Load)      4 µm          1 µm           0.5 µm
T6 (Active Load)      4 µm          1 µm           0.5 µm
T7 (Amp. Stage)       2 µm          1 µm           0.5 µm
T8 (Amp. Stage)       8 µm          1 µm           0.5 µm
T9 (Output Stage)     25 µm         1 µm           0.5 µm
T10 (Output Stage)    25 µm         1 µm           0.5 µm
Cx                    5 µm          -              5 µm
Appendix C
HTM on Edge Devices
Recent advances in machine learning are tightly coupled with a major increase in
the complexity and storage requirements of the algorithms. Given the limited
resources of edge devices, it is advised to train and test machine learning
networks in the cloud. Although this approach gives mobile applications the
necessary support, it demands continuous network access. It also raises concerns
related to data privacy, battery life, and response time. Efforts to simplify
machine learning algorithms so that they run on edge devices have achieved
limited success, as on-device inference still costs orders of magnitude more
delay and energy than cloud-based inference [141]. This is attributed to the fact
that the simplified algorithms still run on general-purpose edge hardware such as
CPUs and GPUs. These concerns and challenges motivate us to develop
custom-designed neuromorphic systems that run machine learning algorithms
successfully and effectively on edge devices. In this appendix, we investigate
the possibility of porting the developed HTM system, Pyragrid, to edge devices,
specifically smartphones. To quantify the impact of the HTM system on battery
life, we analyze the battery life of a smartphone hosting HTM networks of
various sizes.
C.1 Smartphones Power Budget and Distribution
Smartphones are equipped with lithium-ion batteries of limited size and power
budget due to stringent constraints on device size and weight. The typical
battery capacity of smartphones ranges between 7 Wh and 19 Wh¹ [142, 143].
Managing the distribution of this budget in an optimal manner is necessary to
extend the battery life of these devices. However, this demands an extensive
analysis of the power distribution among the different smartphone components.
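Since smartphone batteries are usually rated in mAh while the budget here is given in watt-hours, the conversion can be sketched as below. The helper names are hypothetical; the 3.7 V nominal voltage is the value assumed in this appendix, and the 3220 mAh figure is the Nexus-6 capacity quoted in Section C.2.

```python
def mah_to_wh(capacity_mah: float, voltage_v: float = 3.7) -> float:
    """Convert a battery rating in mAh to watt-hours at a nominal voltage."""
    return capacity_mah * voltage_v / 1000.0


def wh_to_mah(energy_wh: float, voltage_v: float = 3.7) -> float:
    """Convert watt-hours back to a mAh rating at the same nominal voltage."""
    return energy_wh * 1000.0 / voltage_v


if __name__ == "__main__":
    # Endpoints of the 7 Wh to 19 Wh range expressed as mAh ratings.
    for wh in (7.0, 19.0):
        print(f"{wh:4.1f} Wh = {wh_to_mah(wh):.0f} mAh")
    # Nexus-6 battery capacity expressed in watt-hours.
    print(f"3220 mAh = {mah_to_wh(3220):.2f} Wh")
```

The conversion is only as accurate as the nominal-voltage assumption, since the terminal voltage of a lithium-ion cell varies over the discharge cycle.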
Figure C.1: Power consumption breakdown of smartphones based on regular daily
use as estimated by [8].
A few research groups have investigated the power distribution in smartphones
[144, 145, 8]. For instance, in 2010, Carroll et al. profiled the power
consumption of the Openmoko Neo Freerunner, HTC Dream (G1), and Google Nexus One
(N1) smartphones, measuring power at the component level on real hardware [8].
According to Figure C.1, most of the power during regular operation is consumed
by the wireless network (GSM), followed by the CPU and graphics, which have
almost the same power consumption. As the functionality of smartphones extends
beyond communication to video games, web browsing, and navigation, the power
distribution changes considerably. The demand for more computing power has led
to the incorporation of GPUs and SoCs in mobile devices, with most of the power
budget directed toward them. Figure C.2 shows the power consumption breakdown of
a modern smartphone (Nexus-6), as estimated by [146], while running video games,
which are considered among the most power-hungry applications [144].
¹ The battery budget of smartphones is usually reported in mAh. Here, it is
reported in watt-hours, assuming a typical battery voltage of 3.7 V.
Figure C.2: Power consumption breakdown of smartphone (Nexus-6) while playing
video games.
The new trend of supporting smartphone applications with machine learning
algorithms may further increase the power budget allocated to computation and
communication. This support can be provided either through a hosted cloud or
locally on the device itself. The hosted cloud is an attractive option if:
• The data upload/download processes are not performed through mobile
base stations (the average signaling duration, which is basically the time
from establishing to releasing the radio resource control connection, is high,
estimated to be 41/50 seconds to upload/download 2.5MB of data [142]).
• There are not enough compute resources to run machine learning algorithms
locally on the device.
• The task performed by the cloud-hosted machine learning algorithm is not
mission critical, i.e., it is not a life-saving task where losing the network
connection can lead to severe loss.
• Data privacy and response time are not high priorities for the user.
In other scenarios, running machine learning algorithms locally is favorable,
but only when specialized neuromorphic chips are available. Having neuromorphic
chips on smartphones enables offloading the cognitive computations from the CPU
and GPU. This can result in a significant saving in power budget and,
consequently, longer smartphone battery life.
It is important to mention here that the proliferation of edge devices,
especially smartphones, will start to put serious pressure on communication
networks and data centers in the near future. Keeping pace by adding network
base stations and expanding data centers, which means orders of magnitude more
energy measured in gigawatt-hours [142], may not provide a long-term solution.
This is due to the continuous high rate of increase in data generation outside
the cloud. According to Cisco, data generation outside the cloud is expected to
reach up to 850 ZB by 2021. Even if only 10% of this data requires storage or
cloud processing, the resulting 85 ZB is still 4.2x more than the global data
center traffic [147].
C.2 HTM on Smartphones
Porting the Pyragrid HTM system to smartphones can not only save a significant
share of the power budget, but can also expand smartphone applications in
healthcare, security, and many other fields. The power saving would come from
running compute-intensive workloads, such as visual information processing,
time-series prediction, and anomaly detection, on the HTM system rather than on
the device hardware, particularly the CPU and GPU.
To quantitatively analyze the saving in power consumption, we consider the
Nexus-5 and Nexus-6 smartphones. The Nexus-5 is equipped with a CPU (2.26 GHz
quad-core Krait 400) and a GPU (Adreno 330, 450 MHz), and its battery capacity
is 2300 mAh. The average power consumption while running a convolutional neural
network (AlexNet [148]) for object recognition is 2962.58 mW, with estimated CPU
and GPU utilization of 22.2% and 27.14%, respectively [141]. When running video
games, the average power consumption is estimated to be 3470 mW. The Nexus-6,
the following generation of Nexus devices, has a CPU (Qualcomm 2.7 GHz quad-core
Krait 450) and a GPU (Adreno 420), and its battery capacity is 3220 mAh. The
average power consumption of the device while playing video games in normal mode
is 2997 mW, approximately 75% of which is consumed by the CPU and GPU.
To investigate how the developed HTM system can contribute to extending the
device battery life and reducing the compute load, we make the following
assumptions:
• The HTM system has the capability to offload part of the CPU and GPU
computations.
• According to [150] and based on the estimated power consumption given in
Table C.1, the average power consumed by the CPU and GPU when performing
compute-intensive tasks is 2247 mW.
Table C.1: Average power consumption of smartphone CPUs and GPUs when used for
compute-intensive tasks such as running machine learning algorithms or video
games.
                                   Device Resource Consumption
Device             Task            CPU [mW]   GPU [mW]   Battery [mW]
Nexus-5 [141]a     CNN             -          -          2962
Nexus-5            Video games     -          -          3470
Galaxy S3 [149]    Video games     230        767        2400
Nexus-6 [146]b     Video games     1408.59    839.16     2997
a In [141], the authors report only the utilization percentages of the smartphone
CPU and GPU, which equal 22.2% and 27.14%, respectively.
b The utilization percentages of the CPU and GPU are 47% and 28%, respectively.
Figure C.3 shows the estimated gain in Nexus-6 battery life if the HTM system is
successfully ported to the device and carries out part of the compute-intensive
tasks (here, we assume that HTM reduces the power consumption of the CPU and GPU
by 25% during compute-intensive operations). Using HTM can extend the battery
life by 48-53 minutes with a small network (256 mini-columns), and by 34-50
minutes with a larger network (2048 mini-columns). The change in battery life
appears to be linear once the number of mini-columns exceeds 750. This is due to
the saturation of the distal segment size (the maximum distal segment size in
HTM is 250) and of the number of winning mini-columns, which is set to 40 for
network sizes between 768 and 2048 mini-columns.
Figure C.3: The estimated battery life of the Nexus-6 smartphone after porting
the HTM system (various sizes), assuming HTM reduces the power consumption of
the CPU and GPU by 25% during compute-intensive operations.
Because it is hard to predict which tasks can be offloaded from the CPU and GPU
to the HTM system, estimating the saving in power consumption is challenging.
Thus, in Figure C.4, we cover a wide range of scenarios in which the reduction
in power consumption is assumed to range between 10% and 50%. Such a reduction
in CPU and GPU power consumption can extend the battery life of the Nexus-6
smartphone by 16 to 136 minutes.
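The battery-life estimate behind these scenarios can be approximated with a short calculation. The sketch below is a first-order model under stated assumptions, not the procedure used to produce Figures C.3 and C.4: the function names are hypothetical, the battery is treated as an ideal 3.7 V energy store, and the HTM system's own power draw (`htm_mw`) is a hypothetical parameter set to zero because its value for each network size is not given in this appendix. The Nexus-6 figures (3220 mAh, 2997 mW total, 1408.59 mW CPU, 839.16 mW GPU) are taken from Table C.1 and the text.

```python
def battery_life_min(capacity_mah, avg_power_mw, voltage_v=3.7):
    """Battery life in minutes for a constant average power draw."""
    energy_mwh = capacity_mah * voltage_v          # stored energy in mWh
    return energy_mwh / avg_power_mw * 60.0


def life_extension_min(capacity_mah, total_mw, cpu_gpu_mw, reduction,
                       htm_mw=0.0, voltage_v=3.7):
    """Extra minutes gained when CPU+GPU power drops by `reduction` (0..1)."""
    baseline = battery_life_min(capacity_mah, total_mw, voltage_v)
    new_power = total_mw - reduction * cpu_gpu_mw + htm_mw  # htm_mw is assumed
    return battery_life_min(capacity_mah, new_power, voltage_v) - baseline


NEXUS6_MAH = 3220.0
NEXUS6_TOTAL_MW = 2997.0
NEXUS6_CPU_GPU_MW = 1408.59 + 839.16

if __name__ == "__main__":
    for reduction in (0.10, 0.25, 0.50):
        gain = life_extension_min(NEXUS6_MAH, NEXUS6_TOTAL_MW,
                                  NEXUS6_CPU_GPU_MW, reduction)
        print(f"{reduction:.0%} reduction -> +{gain:.1f} min")
```

With `htm_mw` left at zero, the model gives somewhat larger extensions than the ranges quoted above, which is the expected direction when the power consumed by the HTM system itself is ignored.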
Figure C.4: The estimated battery life of Nexus-6 smartphone versus the reduction




[1] T. Prodromakis, B. P. Peh, C. Papavassiliou, and C. Toumazou, “A versatile
memristor model with nonlinear dopant kinetics,” IEEE transactions on
electron devices, vol. 58, no. 9, pp. 3099–3105, 2011.
[2] J. Zha, H. Huang, and Y. Liu, “A novel window function for memristor model
with application in programming analog circuits,” IEEE Transactions on
Circuits and Systems II: Express Briefs, vol. 63, no. 5, pp. 423–427, 2016.
[3] J. Hsu, “How IBM got brainlike efficiency from the TrueNorth chip,” IEEE
Spectrum, vol. 51, no. 10, pp. 17–19, 2014.
[4] S. H. Jo, T. Chang, I. Ebong, B. B. Bhadviya, P. Mazumder, and W. Lu,
“Nanoscale memristor device as synapse in neuromorphic systems,” Nano
letters, vol. 10, no. 4, pp. 1297–1301, 2010.
[5] M. Mahowald, “VLSI analogs of neuronal visual processing: a synthesis of
form and function,” Ph.D. dissertation, California Institute of Technology,
1992.
[6] A. G. Andreou, K. A. Boahen, P. O. Pouliquen, A. Pavasovic, R. E. Jenkins,
and K. Strohbehn, “Current-mode subthreshold MOS circuits for analog VLSI
neural systems,” IEEE Transactions on neural networks, vol. 2, no. 2, pp.
205–213, 1991.
[7] W. Li, “Design of hardware accelerators for hierarchical temporal memory
and convolutional neural network.” Ph.D. dissertation, North Carolina State
University, 2019.
[8] A. Carroll, G. Heiser et al., “An analysis of power consumption in a smart-
phone.” in USENIX annual technical conference, vol. 14. Boston, MA, 2010,
pp. 21–21.
[9] R. Rawassizadeh, T. J. Pierson, R. Peterson, and D. Kotz, “Nocloud: Explor-
ing network disconnection through on-device data analysis,” IEEE Pervasive
Computing, vol. 17, no. 1, pp. 64–74, 2018.
BIBLIOGRAPHY 161
[10] O. Krestinskaya, A. P. James, and L. O. Chua, “Neuromemristive circuits
for edge computing: A review,” IEEE transactions on neural networks and
learning systems, 2019.
[11] A. Disney, J. Reynolds, C. D. Schuman, A. Klibisz, A. Young, and J. S.
Plank, “DANNA: A neuromorphic software ecosystem,” Biologically Inspired
Cognitive Architectures, vol. 17, pp. 49–56, 2016.
[12] P. A. van der Made, M. Elkhatib, and N. Y. Oros, “Low power neuromor-
phic voice activation system and method,” Aug. 10 2017, US Patent App.
15/425,861.
[13] S. Han, “Efficient methods and hardware for deep learning,” Ph.D. dissertation,
Stanford University, 2017.
[14] A. Boulch, “Reducing parameter number in residual networks by sharing
weights,” Pattern Recognition Letters, vol. 103, pp. 53–59, 2018.
[15] Y. Li, Z. Wang, R. Midya, Q. Xia, and J. J. Yang, “Review of memristor
devices in neuromorphic computing: materials sciences and device challenges,”
Journal of Physics D: Applied Physics, vol. 51, no. 50, p. 503002, 2018.
[16] F. Alibart, E. Zamanidoost, and D. B. Strukov, “Pattern classification by
memristive crossbar circuits using ex situ and in situ training,” Nature
communications, vol. 4, 2013.
[17] A. L. Loke, D. Yang, T. T. Wee, J. L. Holland, P. Isakanian, K. Rim, S. Yang,
J. S. Schneider, G. Nallapati, S. Dundigal et al., “Analog/mixed-signal design
challenges in 7-nm CMOS and beyond,” in 2018 IEEE Custom Integrated
Circuits Conference (CICC). IEEE, 2018, pp. 1–8.
[18] A. P. Jacob, R. Xie, M. G. Sung, L. Liebmann, R. T. Lee, and B. Taylor,
“Scaling challenges for advanced CMOS devices,” International Journal of
High Speed Electronics and Systems, vol. 26, no. 01n02, p. 1740001, 2017.
[19] B. Chakrabarti, M. A. Lastras-Montaño, G. Adam, M. Prezioso, B. Hoskins,
M. Payvand, A. Madhavan, A. Ghofrani, L. Theogarajan, K.-T. Cheng
et al., “A multiply-add engine with monolithically integrated 3D memristor
crossbar/CMOS hybrid circuit,” Scientific reports, vol. 7, p. 42429, 2017.
[20] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The missing
memristor found,” Nature, vol. 453, no. 7191, p. 80, 2008.
[21] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and R. S.
Williams, “Memristive switches enable stateful logic operations via material
implication,” Nature, vol. 464, no. 7290, pp. 873–876, 2010.
[22] M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K. K. Likharev, and
D. B. Strukov, “Training and operation of an integrated neuromorphic network
based on metal-oxide memristors,” Nature, vol. 521, no. 7550, pp. 61–64,
2015.
[23] C. Merkel, “Current-mode memristor crossbars for neuromorphic computing,”
in Proceedings of the 7th Annual Neuro-inspired Computational Elements
Workshop, 2019, pp. 1–6.
[24] G. S. Snider, “Spike-timing-dependent learning in memristive nanodevices,”
in Nanoscale Architectures, 2008. NANOARCH 2008. IEEE International
Symposium on. IEEE, 2008, pp. 85–92.
[25] T. M. Taha, R. Hasan, and C. Yakopcic, “Memristor crossbar based multicore
neuromorphic processors,” in System-on-Chip Conference (SOCC), 2014 27th
IEEE International. IEEE, 2014, pp. 383–389.
[26] J. Hawkins and S. Blakeslee, On intelligence: How a new understanding of
the brain will lead to the creation of truly intelligent machines. Macmillan,
2005.
[27] J. Hawkins, D. George, and J. Niemasik, “Sequence memory for prediction,
inference and behaviour,” Philosophical Transactions of the Royal Society of
London B: Biological Sciences, vol. 364, no. 1521, pp. 1203–1209, 2009.
[28] D. George and J. Hawkins, “Towards a mathematical theory of cortical
micro-circuits,” PLoS computational biology, vol. 5, no. 10, p. e1000532, 2009.
[29] J. Hawkins and S. Ahmad, “Why neurons have thousands of synapses, a
theory of sequence memory in neocortex,” Frontiers in neural circuits, vol. 10,
p. 23, 2016.
[30] D. E. Padilla-Baez, “Analysis and spiking implementation of the hierarchi-
cal temporal memory model for pattern and sequence recognition,” Ph.D.
dissertation, University of South Australia, 2015.
[31] A. M. Zyarah and D. Kudithipudi, “Neuromemristive architecture of HTM
with on-device learning and neurogenesis,” ACM Journal on Emerging Tech-
nologies in Computing Systems (JETC), vol. 15, no. 3, p. 24, 2019.
[32] K. L. Rice, T. M. Taha, and C. N. Vutsinas, “Hardware acceleration of
image recognition through a visual cortex model,” Optics & Laser Technology,
vol. 40, no. 6, pp. 795–802, 2008.
[33] J. Xing, T. Wang, Y. Leng, and J. Fu, “A bio-inspired olfactory model using
hierarchical temporal memory,” in Biomedical Engineering and Informatics
(BMEI), 2012 5th International Conference on. IEEE, 2012, pp. 923–927.
[34] N. O. El-Ganainy, I. Balasingham, P. S. Halvorsen, and L. A. Rosseland,
“On the performance of hierarchical temporal memory predictions of medical
streams in real time,” in 2019 13th International Symposium on Medical
Information and Communication Technology (ISMICT). IEEE, 2019, pp.
1–6.
[35] A. Lavin and S. Ahmad, “Evaluating real-time anomaly detection algorithms–
the numenta anomaly benchmark,” in Machine Learning and Applications
(ICMLA), 2015 IEEE 14th International Conference on. IEEE, 2015, pp.
38–44.
[36] W. Zhang and D. J. Linden, “The other side of the engram: experience-driven
changes in neuronal intrinsic excitability,” Nature Reviews Neuroscience,
vol. 4, no. 11, p. 885, 2003.
[37] A. Citri and R. C. Malenka, “Synaptic plasticity: multiple forms, functions,
and mechanisms,” Neuropsychopharmacology, vol. 33, no. 1, p. 18, 2008.
[38] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla,
N. Imam, Y. Nakamura, P. Datta, G.-J. Nam et al., “Truenorth: Design and
tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[39] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran,
J.-M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, and K. Boahen,
“Neurogrid: A mixed-analog-digital multichip system for large-scale neural
simulations,” Proceedings of the IEEE, vol. 102, no. 5, pp. 699–716, 2014.
[40] D. H. Goldberg, G. Cauwenberghs, and A. G. Andreou, “Probabilistic synaptic
weighting in a reconfigurable network of VLSI integrate-and-fire neurons,”
Neural Networks, vol. 14, no. 6-7, pp. 781–793, 2001.
[41] ——, “Probabilistic synaptic weighting in a reconfigurable network of VLSI
integrate-and-fire neurons,” Neural Networks, vol. 14, no. 6-7, pp. 781–793,
2001.
[42] L. P. Romero, S. Ambrogio, M. Giordano, G. Cristiano, M. Bodini,
P. Narayanan, H. Tsai, R. M. Shelby, and G. W. Burr, “Training fully con-
nected networks with resistive memories: Impact of device failures,” Faraday
Discussions, vol. 213, pp. 371–391, 2019.
[43] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,”
AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2,
2010.
[44] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and
O. Mutlu, “Adaptive-latency DRAM: Optimizing DRAM timing for the
common-case,” in 2015 IEEE 21st International Symposium on High Perfor-
mance Computer Architecture (HPCA). IEEE, 2015, pp. 489–501.
[45] A. M. Zyarah, N. Soures, L. Hays, R. B. Jacobs-Gedrim, S. Agarwal,
M. Marinella, and D. Kudithipudi, “Ziksa: On-chip learning accelerator
with memristor crossbars for multilevel neural networks,” in Circuits and
Systems (ISCAS), 2017 IEEE International Symposium on. IEEE, 2017,
pp. 1–4.
[46] A. M. Zyarah and D. Kudithipudi, “Neuromemristive multi-layer random
projection network with on-device learning,” in 2019 International Joint
Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[47] ——, “Semi-trained memristive crossbar computing engine with in situ learn-
ing accelerator,” ACM Journal on Emerging Technologies in Computing
Systems (JETC), vol. 14, no. 4, p. 43, 2018.
[48] N. Soures, A. Zyarah, K. D. Carlson, J. B. Aimone, and D. Kudithipudi,
“How neural plasticity boosts performance of spiking neural networks.” Sandia
National Lab.(SNL-NM), Albuquerque, NM (United States), Tech. Rep.,
2017.
[49] A. M. Zyarah, K. Gomez, and D. Kudithipudi, “Neuromorphic system for spa-
tial and temporal information processing,” IEEE Transactions on Computers,
pp. 1–14, 2020.
[50] A. M. Zyarah and D. Kudithipudi, “Neuromorphic architecture for the hi-
erarchical temporal memory,” IEEE Transactions on Emerging Topics in




[52] “Growth in AI tasks on edge devices,” https://www.forbes.com/sites/intelai/
2018/09/21/the-paradigm-changing-effects-of-ai-innovation-at-the-edge/
#6316535a4d92.
[53] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday,
G. Dimou, P. Joshi, N. Imam, S. Jain et al., “Loihi: a neuromorphic manycore
processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
[54] J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner, “A
wafer-scale neuromorphic hardware system for large-scale neural modeling,” in
Proceedings of 2010 IEEE International Symposium on Circuits and Systems.
IEEE, 2010, pp. 1947–1950.
[55] M. M. Khan, D. R. Lester, L. A. Plana, A. Rast, X. Jin, E. Painkras, and
S. B. Furber, “SpiNNaker: mapping neural networks onto a massively-parallel
chip multiprocessor,” in 2008 IEEE International Joint Conference on Neural
Networks (IEEE World Congress on Computational Intelligence). IEEE, 2008,
pp. 2849–2856.
[56] R. Wang and A. van Schaik, “Breaking Liebig’s law: an advanced multipurpose




[58] “International roadmap for devices and systems,” 2017.
[59] V. Saxena, X. Wu, I. Srivastava, and K. Zhu, “Towards neuromorphic learning
machines using emerging memory devices with brain-like energy efficiency,”
Journal of Low Power Electronics and Applications, vol. 8, no. 4, p. 34, 2018.
[60] D. Fan, M. Sharad, A. Sengupta, and K. Roy, “Hierarchical temporal memory
based on spin-neurons and resistive memory for energy-efficient brain-inspired
computing,” IEEE transactions on neural networks and learning systems,
vol. 27, no. 9, pp. 1907–1919, 2016.
[61] G. Indiveri, B. Linares-Barranco, R. Legenstein, G. Deligeorgis, and T. Pro-
dromakis, “Integration of nanoscale memristor synapses in neuromorphic
computing architectures,” Nanotechnology, vol. 24, no. 38, p. 384010, 2013.
[62] C. D. Schuman, J. D. Birdwell, and M. Dean, “Neuroscience-inspired
dynamic architectures,” in Proceedings of the 2014 Biomedical Sciences and
Engineering Conference. IEEE, 2014, pp. 1–4.
[63] H. Yu, L. Ni, and H. Huang, “Distributed in-memory computing on bi-
nary memristor-crossbar for machine learning,” in Advances in Memristors,
Memristive Devices and Systems. Springer, 2017, pp. 275–304.
[64] I. Chakraborty, G. Saha, and K. Roy, “Photonic in-memory computing
primitive for spiking neural networks using phase-change materials,” Physical
Review Applied, vol. 11, no. 1, p. 014063, 2019.
[65] Q. Chen, Q. Qiu, H. Li, and Q. Wu, “A neuromorphic architecture for anomaly
detection in autonomous large-area traffic monitoring,” in Proceedings of the
International Conference on Computer-Aided Design. IEEE Press, 2013, pp.
202–205.
[66] L. Chua, “Memristor-the missing circuit element,” IEEE Transactions on
circuit theory, vol. 18, no. 5, pp. 507–519, 1971.
[67] Y. Ho, G. M. Huang, and P. Li, “Nonvolatile memristor memory: device char-
acteristics and design implications,” in Proceedings of the 2009 International
Conference on Computer-Aided Design. ACM, 2009, pp. 485–490.
[68] S. Salahuddin, K. Ni, and S. Datta, “The era of hyper-scaling in electronics,”
Nature Electronics, vol. 1, no. 8, pp. 442–450, 2018.
[69] S. Kannan, N. Karimi, R. Karri, and O. Sinanoglu, “Detection, diagnosis,
and repair of faults in memristor-based memories,” in 2014 IEEE 32nd VLSI
Test Symposium (VTS). IEEE, 2014, pp. 1–6.
[70] S. Kumar, Z. Wang, X. Huang, N. Kumari, N. Davila, J. P. Strachan, D. Vine,
A. D. Kilcoyne, Y. Nishi, and R. S. Williams, “Oxygen migration during
resistance switching and failure of hafnium oxide memristors,” Applied Physics
Letters, vol. 110, no. 10, p. 103503, 2017.
[71] V. Ravi and S. Prabaharan, “Fault tolerant adaptive write schemes for im-
proving endurance and reliability of memristor memories,” AEU-International
Journal of Electronics and Communications, vol. 94, pp. 392–406, 2018.
[72] J. Xu, Y. Huan, K. Yang, Y. Zhan, Z. Zou, and L.-R. Zheng, “Optimized
near-zero quantization method for flexible memristor based neural network,”
IEEE Access, 2018.
[73] G. Hinton, “Neural networks for machine learning, online subject, lecture
notes,” 2013.
[74] D. Chabi, Z. Wang, W. Zhao, and J.-O. Klein, “On-chip supervised learning
rule for ultra high density neural crossbar using memristor for synapse and
neuron,” in Proceedings of the 2014 IEEE/ACM International Symposium on
Nanoscale Architectures. ACM, 2014, pp. 7–12.
[75] M. Hu, H. Li, Y. Chen, Q. Wu, G. S. Rose, and R. W. Linderman, “Memris-
tor crossbar-based neuromorphic computing system: A case study,” IEEE
transactions on neural networks and learning systems, vol. 25, no. 10, pp.
1864–1878, 2014.
[76] D. Soudry, D. Di Castro, A. Gal, A. Kolodny, and S. Kvatinsky, “Memristor-
based multilayer neural networks with online gradient descent training,” IEEE
transactions on neural networks and learning systems, vol. 26, no. 10, pp.
2408–2421, 2015.
[77] M. R. Hasan, “Memristor based low power high throughput circuits and
systems design,” Ph.D. dissertation, University of Dayton, 2016.
[78] C. Li, D. Belkin, Y. Li, P. Yan, M. Hu, N. Ge, H. Jiang, E. Montgomery,
P. Lin, Z. Wang et al., “Efficient and self-adaptive in-situ learning in multilayer
memristor neural networks,” Nature Communications, vol. 9, no. 1, p. 2385,
2018.
[79] E. Giacomin, T. Greenberg-Toledo, S. Kvatinsky, and P.-E. Gaillardon, “A
robust digital RRAM-based convolutional block for low-power image process-
ing and learning applications,” IEEE Transactions on Circuits and Systems
I: Regular Papers, vol. 66, no. 2, pp. 643–654, 2018.
[80] T. Greenberg-Toledo, R. Mazor, A. Haj-Ali, and S. Kvatinsky, “Supporting
the momentum training algorithm using a memristor-based synapse,” IEEE
Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 4, pp.
1571–1583, 2019.
[81] C. Merkel and D. Kudithipudi, “Method and apparatus for training memristive
learning systems,” Nov. 24 2016, US Patent App. 15/159,949.
[82] M. V. Nair, L. K. Muller, and G. Indiveri, “A differential memristive synapse
circuit for on-line learning in neuromorphic computing systems,” Nano Futures,
vol. 1, no. 3, p. 035003, 2017.
[83] H. Manem and G. S. Rose, “A read-monitored write circuit for 1T1M multi-
level memristor memories,” in 2011 IEEE International Symposium of Circuits
and Systems (ISCAS). IEEE, 2011, pp. 2938–2941.
[84] P. Vyas and M. Zaveri, “Verilog implementation of a node of hierarchical
temporal memory,” Asian Journal of Computer Science & Information Tech-
nology, vol. 3, no. 7, 2013.
[85] A. M. Zyarah and D. Kudithipudi, “Reconfigurable hardware architecture
of the spatial pooler for hierarchical temporal memory,” in System-on-Chip
Conference (SOCC), 2015 28th IEEE International. IEEE, 2015, pp. 143–
153.
[86] A. M. Zyarah, “Design and analysis of a reconfigurable hierarchical temporal
memory architecture,” Master’s thesis, Rochester Institute of Technology,
2015.
[87] L. Streat, D. Kudithipudi, and K. Gomez, “Non-volatile hierarchical temporal
memory: Hardware for spatial pooling,” arXiv preprint arXiv:1611.02792,
2016.
[88] W. Li and P. Franzon, “Hardware implementation of hierarchical temporal
memory algorithm,” in System-on-Chip Conference (SOCC), 2016 29th IEEE
International. IEEE, 2016, pp. 133–138.
[89] A. P. James, I. Fedorova, T. Ibrayev, and D. Kudithipudi, “HTM spatial
pooler with memristor crossbar circuits for sparse biometric recognition,”
IEEE Transactions on Biomedical Circuits and Systems, 2017.
[90] O. Krestinskaya, T. Ibrayev, and A. P. James, “Hierarchical temporal mem-
ory features with memristor logic circuits for pattern recognition,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
vol. 37, no. 6, pp. 1143–1156, 2018.
[91] S. N. Truong, K. Van Pham, and K.-S. Min, “Spatial-pooling memristor
crossbar converting sensory information to sparse distributed representation
of cortical neurons,” IEEE Transactions on Nanotechnology, vol. 17, no. 3,
pp. 482–491, 2018.
[92] X. Wang, Y. Han, V. C. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence
of edge computing and deep learning: A comprehensive survey,” IEEE
Communications Surveys & Tutorials, 2020.
[93] J. E. Auerbach, C. Fernando, and D. Floreano, “Online extreme evolutionary
learning machines,” in Artificial Life 14: Proceedings of the Fourteenth
International Conference on the Synthesis and Simulation of Living Systems,
no. EPFL-CONF-200273. The MIT Press, 2014, pp. 465–472.
[94] A. M. Zyarah and D. Kudithipudi, “Extreme learning machine as a generaliz-
able classification engine,” in Neural Networks (IJCNN), 2017 International
Joint Conference on. IEEE, 2017, pp. 3371–3376.
[95] M. Riedmiller and H. Braun, “A direct adaptive method for faster back-
propagation learning: The RPROP algorithm,” in Proceedings of the IEEE
international conference on neural networks, vol. 1993. San Francisco, 1993,
pp. 586–591.
[96] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[97] S. K. Gonugondla, M. Kang, and N. R. Shanbhag, “A variation-tolerant
in-memory machine learning classifier via on-chip training,” IEEE Journal of
Solid-State Circuits, no. 99, pp. 1–11, 2018.
[98] P. O’Connor and M. Welling, “Deep spiking networks,” arXiv preprint
arXiv:1602.08323, 2016.
[99] M. O’Halloran and R. Sarpeshkar, “A 10-nW 12-bit accurate analog storage
cell with 10-aA leakage,” IEEE Journal of Solid-State Circuits, vol. 39, no. 11,
pp. 1985–1996, 2004.
[100] C. Yakopcic, “Memristor device modeling and circuit design for read out
integrated circuits, memory architectures, and neuromorphic systems,” Ph.D.
dissertation, University of Dayton, 2014.
[101] M. Gerstenhaber and R. Malik, “More value from your absolute value circuit:
difference amplifier enables low-power, high-performance absolute value
circuit,” Analog Dialogue, vol. 44, no. 2, pp. 7–8, 2010.
[102] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available:
http://archive.ics.uci.edu/ml
[103] D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository,”
2017. [Online]. Available: http://archive.ics.uci.edu/ml
[104] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,
pp. 2278–2324, 1998.
[105] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for
benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747,
2017.
[106] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine
for regression and multiclass classification,” IEEE Transactions on Systems,
Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 513–529,
2012.
[107] E. Yalon, A. Gavrilov, S. Cohen, D. Mistele, B. Meyler, J. Salzman, and D. Rit-
ter, “Resistive switching in HfO2 probed by a metal–insulator–semiconductor
bipolar transistor,” IEEE Electron Device Letters, vol. 33, no. 1, pp. 11–13,
2011.
[108] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, “VTEAM:
A general model for voltage-controlled memristors,” IEEE Transactions on
Circuits and Systems II: Express Briefs, vol. 62, no. 8, pp. 786–790, 2015.
[109] W. J. Melis and M. Kameyama, “A study of the different uses of colour
channels for traffic sign recognition on hierarchical temporal memory,” in
Innovative Computing, Information and Control (ICICIC), 2009 Fourth
International Conference on. IEEE, 2009, pp. 111–114.
[110] Numenta, “The science of anomaly detection (how HTM enables anomaly
detection in streaming data),” 2014.
[111] Y. Cui, C. Surpur, S. Ahmad, and J. Hawkins, “A comparative study of HTM
and other neural network models for online sequence learning with streaming
data,” in Neural Networks (IJCNN), 2016 International Joint Conference on.
IEEE, 2016, pp. 1530–1538.
[112] P. Földiak, “Forming sparse representations by local anti-Hebbian learning,”
Biological Cybernetics, vol. 64, no. 2, pp. 165–170, 1990.
[113] S. Ahmad and J. Hawkins, “Properties of sparse distributed representa-
tions and their application to hierarchical temporal memory,” arXiv preprint
arXiv:1503.07469, 2015.
[114] S. Purdy, “Encoding data for HTM systems,” arXiv preprint arXiv:1602.05925,
2016.
[115] Y. Cui, S. Ahmad, and J. Hawkins, “The HTM spatial pooler: a neocortical
algorithm for online sparse distributed coding,” Frontiers in Computational
Neuroscience, vol. 11, 2017.
[116] D. O. Hebb, The Organization of Behavior. Wiley, New York, 1949.
[117] Y. Cui, S. Ahmad, and J. Hawkins, “Continuous online sequence learning
with an unsupervised neural network model,” Neural Computation, vol. 28,
no. 11, pp. 2474–2504, 2016.
[118] Y. Jiang, J. Kang, and X. Wang, “RRAM-based parallel computing
architecture using k-nearest neighbor classification for pattern recognition,”
Scientific Reports, vol. 7, p. 45233, 2017.
[119] Y. N. Joglekar and S. J. Wolf, “The elusive memristor: properties of basic
electrical circuits,” European Journal of Physics, vol. 30, no. 4, p. 661, 2009.
[120] W. Lu, K.-H. Kim, T. Chang, and S. Gaba, “Two-terminal resistive switches
(memristors) for memory and logic applications,” in Proceedings of the 16th
Asia and South Pacific Design Automation Conference. IEEE Press, 2011,
pp. 217–223.
[121] J. Cui and Q. Qiu, “Towards memristor based accelerator for sparse ma-
trix vector multiplication,” in Circuits and Systems (ISCAS), 2016 IEEE
International Symposium on. IEEE, 2016, pp. 121–124.
[122] R. J. Vogelstein, F. Tenore, R. Philipp, M. S. Adlerstein, D. H. Goldberg, and
G. Cauwenberghs, “Spike timing-dependent plasticity in the address domain,”
in Advances in Neural Information Processing Systems, 2003, pp. 1171–1178.
[123] N. Soures, A. Zyarah, K. D. Carlson, J. B. Aimone, and D. Kudithipudi,
“How neural plasticity boosts performance of spiking neural networks.” Sandia
National Lab.(SNL-NM), Albuquerque, NM (United States), Tech. Rep.,
2017.
[124] F. H. A. Asgari and M. Sachdev, “A low-power reduced swing global clocking
methodology,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 12, no. 5, pp. 538–545, 2004.
[125] B. Razavi, Design of Analog CMOS Integrated Circuits. McGraw-Hill, 2017.
[126] J. Lazzaro, S. Ryckebusch, M. A. Mahowald, and C. A. Mead,
“Winner-take-all networks of O(n) complexity,” in Advances in Neural
Information Processing Systems, 1989, pp. 703–711.
[127] S. Ramakrishnan and J. Hasler, “Vector-matrix multiply and winner-take-all
as an analog classifier,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 22, no. 2, pp. 353–361, 2014.
[128] T. Kulej and F. Khateb, “Sub 0.5-V bulk-driven winner take all circuit based
on a new voltage follower,” Analog Integrated Circuits and Signal Processing,
vol. 90, no. 3, pp. 687–691, 2017.
[129] A. Fish, V. Milrud, and O. Yadid-Pecht, “High-speed and high-precision
current winner-take-all circuit,” IEEE Transactions on Circuits and Systems
II: Express Briefs, vol. 52, no. 3, pp. 131–135, 2005.
[130] S. Ahmad and J. Hawkins, “How do neurons operate on sparse distributed
representations? a mathematical theory of sparsity, neurons and active
dendrites,” arXiv preprint arXiv:1601.00720, 2016.
[131] K. A. Boahen, “Communicating neuronal ensembles between neuromorphic
chips,” in Neuromorphic Systems Engineering. Springer, 1998, pp. 229–259.
[132] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,”
in Micro Machine and Human Science, 1995. MHS’95., Proceedings of the
Sixth International Symposium on. IEEE, 1995, pp. 39–43.
[133] W. Woods, J. Bürger, and C. Teuscher, “Synaptic weight states in a lo-
cally competitive algorithm for neuromorphic memristive hardware,” IEEE
Transactions on Nanotechnology, vol. 14, no. 6, pp. 945–953, 2015.
[134] D. Cai, X. He, Y. Hu, J. Han, and T. Huang, “Learning a spatially smooth
subspace for face recognition,” in Computer Vision and Pattern Recognition,
2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–7.
[135] “Hot-gym: power consumed in a gym,” https://github.com/numenta/
nupic/tree/master/examples/opf/clients/hotgym/prediction/one_gym.
[136] “Passenger demand for New York City taxis,”
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
[137] R. J. Hyndman and Y. Yang, “Daily minimum temperatures in Melbourne,
Australia (1981-1990),” https://pkg.yangzhuoranyang.com/tsdl/, 2018.
[138] Andrews and Herzberg, “Monthly sunspot,” https://raw.githubusercontent.
com/jbrownlee/Datasets/master/monthly-sunspots.csv, 1985.
[139] M. Coll, J. Fontcuberta, M. Althammer, M. Bibes, H. Boschker, A. Calleja,
G. Cheng, M. Cuoco, R. Dittmann, B. Dkhil et al., “Towards oxide electronics:
a roadmap,” Applied Surface Science, vol. 482, pp. 1–93, 2019.
[140] O. Krestinskaya, T. Ibrayev, and A. P. James, “Hierarchical temporal mem-
ory features with memristor logic circuits for pattern recognition,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems,
vol. 37, no. 6, pp. 1143–1156, 2018.
[141] T. Guo, “Cloud-based or on-device: An empirical study of mobile deep
inference,” in 2018 IEEE International Conference on Cloud Engineering
(IC2E). IEEE, 2018, pp. 184–190.
[142] M. Yan, C. A. Chan, A. F. Gygax, J. Yan, L. Campbell, A. Nirmalathas,
and C. Leckie, “Modeling the total energy consumption of mobile network
services and applications,” Energies, vol. 12, no. 1, p. 184, 2019.
[143] S. Zhidkov, A. Sychev, A. Zhidkov, and A. Petrov, “On smartphone power con-
sumption in acoustic environment monitoring applications,” Applied System
Innovation, vol. 1, no. 1, p. 8, 2018.
[144] X. Chen, Y. Chen, Z. Ma, and F. C. Fernandes, “How is energy consumed in
smartphone display applications?” in Proceedings of the 14th Workshop on
Mobile Computing Systems and Applications, 2013, pp. 1–6.
[145] M. Tawalbeh, A. Eardley et al., “Studying the energy consumption in mobile
devices,” Procedia Computer Science, vol. 94, pp. 183–189, 2016.




[147] “Five things that are bigger than the internet,”
https://blogs.cisco.com/sp/five-things-that-are-bigger-than-the-internet-findings-from-this-years-global-cloud-index.
[148] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with
deep convolutional neural networks,” in Advances in Neural Information
Processing Systems, 2012, pp. 1097–1105.
[149] A. Carroll, “Understanding and reducing smartphone energy consumption,”
Ph.D. dissertation, University of New South Wales Sydney, Australia, 2017.
[150] “Understanding power usage in a smartphone,” https://qnovo.com/
understanding-power-usage-in-a-smartphone/.
