Hardware implementation of deep network accelerators towards healthcare and biomedical applications by Rahimi Azghadi, Mostafa et al.
1138 IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. 14, NO. 6, DECEMBER 2020
Hardware Implementation of Deep Network
Accelerators Towards Healthcare and
Biomedical Applications
Mostafa Rahimi Azghadi, Senior Member, IEEE, Corey Lammie, Student Member, IEEE, Jason K.
Eshraghian, Member, IEEE, Melika Payvand, Member, IEEE, Elisa Donati, Member, IEEE,
Bernabé Linares-Barranco, Fellow, IEEE, and Giacomo Indiveri, Senior Member, IEEE
Abstract—The advent of dedicated Deep Learning (DL) accel-
erators and neuromorphic processors has brought on new op-
portunities for applying both Deep and Spiking Neural Network
(SNN) algorithms to healthcare and biomedical applications at
the edge. This can facilitate the advancement of medical Internet
of Things (IoT) systems and Point of Care (PoC) devices. In this
paper, we provide a tutorial describing how various technologies
including emerging memristive devices, Field Programmable Gate
Arrays (FPGAs), and Complementary Metal Oxide Semiconductor
(CMOS) can be used to develop efficient DL accelerators to solve a
wide variety of diagnostic, pattern recognition, and signal process-
ing problems in healthcare. Furthermore, we explore how spiking
neuromorphic processors can complement their DL counterparts
for processing biomedical signals. The tutorial is augmented with
case studies of the vast literature on neural network and neuro-
morphic hardware as applied to the healthcare domain. We bench-
mark various hardware platforms by performing a sensor fusion
signal processing task combining electromyography (EMG) signals
with computer vision. Comparisons are made between dedicated
neuromorphic processors and embedded AI accelerators in terms
of inference latency and energy. Finally, we provide our analysis
of the field and share a perspective on the advantages, disadvan-
tages, challenges, and opportunities that various accelerators and
neuromorphic processors introduce to healthcare and biomedical
domains.
Manuscript received July 9, 2020; revised September 29, 2020; accepted
October 30, 2020. Date of publication November 6, 2020; date of current
version December 30, 2020. This work was supported in part by the European
Union’s Horizon 2020 ERC project NeuroAgents under Grant 724295, in part
by EU H2020 under Grants 824164 “HERMES,” 871371 “Memscales,” and
PCI2019-111826-2 “APROVIS3D,” and in part by the Ministry of Science
and Innovation of Spain under Grant PID2019-105556GB-C31 (NANOMIND)
(with support from the European Regional Development Fund). This paper was
recommended by Associate Editor Dr. Kea-Tiong Tang. (Corresponding author:
Mostafa Rahimi Azghadi.)
Mostafa Rahimi Azghadi and Corey Lammie are with the College of Science
and Engineering, James Cook University, Townsville, QLD 4811, Australia
(e-mail: mostafa.rahimiazghadi@jcu.edu.au; corey.lammie@jcu.edu.au).
Jason K. Eshraghian is with the Department of Electrical Engineering and
Computer Science, The University of Michigan, Ann Arbor, MI 48109-2122
USA (e-mail: jeshraghian@gmail.com).
Melika Payvand, Elisa Donati, and Giacomo Indiveri are with the Institute
of Neuroinformatics, University and ETH Zurich, 8092 Zürich, Switzerland
(e-mail: melika@ini.uzh.ch; elisa@ini.uzh.ch; giacomo@ini.uzh.ch).
Bernabé Linares-Barranco is with the Instituto de Microelectrónica de Sevilla
IMSE-CNM, CSIC and Universidad de Sevilla, Sevilla 41092, Spain (e-mail:
bernabe@imse-cnm.csic.es).
Color versions of one or more of the figures in this article are available online
at https://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TBCAS.2020.3036081
Index Terms—CMOS, deep neural networks, FPGA, healthcare,
medical IoT, memristor, neuromorphic hardware, point-of-care,
RRAM, spiking neural networks.
I. INTRODUCTION
ARTIFICIAL intelligence is uniquely poised to cope with the growing demands of the universal healthcare
system [1]. The healthcare industry is projected to reach over
10 trillion dollars by 2022, and the associated workload on
medical practitioners is expected to grow concurrently [2]. As
the reliability of DL improves, it has pervaded various facets of
healthcare from monitoring [3], [4], to prediction [5], diagno-
sis [6], treatment [7], and prognosis [8]. Fig. 1(a) shows how
data collected from the patient, which may be a combination of
bio-samples, medical images, temperature, movement, etc., can
be processed using a smart DL system that monitors the patient
for anomalies and/or to predict diseases. DL systems can be used
to recommend treatment options and prognosis, which further
affect monitoring and prediction in a closed-loop scenario.
The capacity of Artificial Intelligence (AI) to meet or exceed
the performance of human experts in medical-data analysis [9]–
[11] can, in part, be attributed to the continued improvement of
high-performance computing platforms such as Graphics Pro-
cessing Units (GPUs) [12] and customized Machine Learning
(ML) hardware [13]. These can now process and learn from a
large amount of multi-modal heterogeneous general and medical
data [14]. This was not readily achievable a decade ago.
While the field of DL has been growing at an astonishing
rate in terms of performance, network size, and training run
time, the development of dedicated hardware to process DL
algorithms is struggling to keep up. Concretely, the compute
loads of DL have doubled every 3.4 months since 2012. Moore’s
Law targets the doubling of compute power every 18-24 months,
and appears to be slowing down [15]. The progress in hard-
ware accelerator development currently relies on advances by
a handful of technology companies, most notably Nvidia and
its GPUs [16], [17] and Google and its Tensor Processing Units
(TPUs) [13], in addition to new startups and research groups
developing Application Specific Integrated Circuits (ASICs) for
DL training and acceleration.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
AZGHADI et al.: HARDWARE IMPLEMENTATION OF DEEP NETWORK ACCELERATORS 1139
Fig. 1. Depiction of (a) the usage of DL in a smart healthcare setting, which typically involves monitoring, prediction, diagnosis, treatment, and prognosis. The
various parts of the DL-based healthcare system can run on (b) the three levels of the IoT, i.e. edge devices, edge nodes, and the cloud. However, for healthcare
IoT and PoC processing, edge learning and inference is preferred.
While there are significant advances in tailoring deep network
models and algorithms for various healthcare and biomedical
applications [18], most computationally expensive deep
networks are trained either on GPUs or in data centers [12], [19].
The latter typically requires access to cloud computing services,
which is not only costly and power-hungry, but
also compromises data privacy. This is distinct from the effective
deployment of DL at the edge on an increasing number of
medical IoT devices [20] and PoC systems [21], as illustrated
in Fig. 1(b). Edge learning and inference enables the option
to move processing away from the cloud. This is critical for
highly sensitive medical data and offline operation. Edge-based
processing must combine compactness, low power consumption, and rapid
(high-throughput) operation at low cost to make smart health monitoring
viable and affordable for integration into human life [22].
Specialized embedded DL accelerators, such as the Nvidia
Jetson and Xavier series [23], and the Movidius Neural Compute
Stick [24], [25], have shown the promise of edge computing.
More recently, the Nvidia Clara Embedded was released as a
healthcare-specific edge accelerator. This is a computing plat-
form for edge-enabled AI on the Internet of Medical Things
(IoMT). However, embedded devices remain relatively power
hungry and costly, and many state-of-the-art algorithms far
exceed the memory bandwidth of resource-constrained devices.
They are not yet ideal learning/inference engines for ambient-
assisted precision medicine systems. There is a need for inno-
vative systems which can satisfy the stringent requirements of
healthcare edge devices to be made affordable to the community
at large scales.
To that end, in this paper we focus on the use of three different
hardware technologies to develop dedicated deep network accel-
erators which will be discussed from a biomedical and healthcare
application point-of-view. The three technologies that we cover
here are CMOS, memristors, and FPGAs. It is worth noting that,
while our focus targets edge inference engines in the biomedical
domain, the techniques and hardware advantages discussed here
are likely to be useful for efficient offline deep network learning,
or online on-chip learning. Herein, the term DL ‘accelerator’ is
used to refer to a device that is able to perform DL inference and
potentially training.
This tutorial on DL accelerators within the biomedical sphere
commences with a brief introduction to artificial and spiking
neural networks. Next, we introduce the computational demands
of DL by shedding light on why they are power- and resource-
intensive. This will justify the need for application specific
hardware platforms. After that, we discuss recent hardware ad-
vances which have led to improvements in training and inference
efficiency. These improvements ultimately guide us to viable
edge inference engine options.
After reviewing the literature on these DL accelerators, we
quantify the performance of various algorithms on different
types of DL processors. The results allow us to draw a per-
spective on the potential future of spike-based neuromorphic
processors in the biomedical signal processing domain. Based
on our analysis and perspective, we conjecture that, for edge
processing, neuromorphic computing and SNNs [26] will likely
complement DL inference engines, either through signaling
anomalies in the data or acting as ‘intelligent always-on watch-
dogs’ which continuously monitor the data being recorded, but
only activate further processing stages if and when necessary.
We expect this tutorial, review and perspective to provide
guidance on the history and future of DL accelerators, and the
potential they hold for advancing healthcare. Our contributions
are summarized as follows:
• Our paper is the first to discuss the use of three different
  emerging and established hardware technologies for
  facilitating DL acceleration, with a focus on biomedical
  applications.
• We provide tutorial sections on how one may implement
  a typical biomedical task on FPGAs or simulate it for
  deployment on memristive crossbars.
• Our paper is the first to discuss how event-based
  neuromorphic processors can complement DL accelerators for
  biomedical signal processing.
• We provide open-source code and data to enable the
  reproduction of our results.
Fig. 2. Popular Artificial Neural Network (ANN) structures. MLP/Dense/Fully Connected networks are typically well-suited for cross-sectional quantitative data, whereas
Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are optimized for sequential data. Convolutional Neural Networks (CNNs)
are equipped for both types.
The remainder of the paper is organized as follows. In Sec-
tion II, we define the technical terminology that is used through-
out this paper and cover the working principles of artificial
and spiking neural networks. We also introduce a biomedical
signal processing task for hand-gesture classification, which is
used for benchmarking the different technologies and algorithms
discussed in this paper. In Section III, we step through the
design, simulation, and implementation of Deep Neural Net-
works (DNNs) using different hardware technologies. We show
sample cases of how they have been deployed in healthcare
settings. Furthermore, we demonstrate the steps and techniques
required to simulate and implement hardware for the benchmark
hand-gesture classification task using memristive crossbars and
FPGAs.
In Section IV, we provide our perspective on the challenges
and opportunities of both DNNs and SNNs for biomedical ap-
plications and shed light on the future of spiking neuromorphic
hardware technologies in the biomedical domain. Section V
concludes the tutorial.
II. DEEP ARTIFICIAL AND SPIKING NEURAL NETWORKS
A. Nomenclature of Neural Network Architectures
Although most DNNs reported in the literature are ANNs, the term DNN
refers to any network with more than one hidden layer, independently of whether
the architecture is fully connected, convolutional, recurrent,
ANN or SNN, or of any other structure. For example, the most
widely used DNN type in image processing, i.e. a CNN, can
be physically implemented as an ANN or SNN, and in both
cases it would be ‘deep’. However, in this paper, whenever we
use the terms ‘deep,’ DL, or deep network, we refer to Deep
Artificial Neural Networks. For Deep Spiking Neural Networks,
we simply use the term SNN.
B. Deep Artificial Neural Networks
Traditional ANNs and their learning strategies that were first
developed several decades ago [27] have, in the past several
years, demonstrated unprecedented performance in a plethora
of challenging tasks which are typically associated with human
cognition. In the form of DNNs, they have been applied to medical image
diagnosis [28] and medical text processing [29].
Fig. 2 illustrates a simplified overview of the structure of some
of the most widely-used DNNs. The most conventional form
of these architectures is the Multi-Layer Perceptron (MLP).
Increasing the number of hidden layers of perceptron cells
is widely regarded as improving hierarchical feature extraction,
which is exploited in various biomedical tasks, such as seizure
detection from electroencephalography (EEG) [30], [31]. CNNs
introduce convolutional layers, which use spatial filters to en-
courage spatial invariance. CNNs often include pooling lay-
ers to downsample their outputs to reduce the search space
for subsequent convolutional layers. CNNs have been widely
used in medical and healthcare applications, as they are very
well-suited for spatially structured data. Their use in medical
image analysis [32] will form a major part of our discussions in
subsequent sections.
RNNs are another powerful network architecture recently
used both individually [33], and in combination with CNNs [34],
in biomedical applications. RNNs introduce recurrent cells with
a feedback loop, and are especially useful for processing se-
quential data such as temporal signals and time-series data,
e.g. electrocardiography (ECG) [34], and medical text [35].
The feedback loop in recurrent cells gives them a memory of
previous steps and builds a dynamic awareness of changes in
the input. The best-known type of RNN is the LSTM,
which is designed to mine patterns in data sequences using
its short-term memory of distant events stored in its memory
cells. LSTMs have been widely used for processing biomedical
signals such as ECGs [33], [36]. Although there are many other
variants of DNN architectures, we will focus on these most
commonly used types.
1) Automatic Hierarchical Feature Extraction: The above
mentioned DNNs learn intricate features in data through mul-
tiple computational layers across various levels of abstrac-
tion [37]. The fundamental advantage of DNNs is that they mine
the input data features automatically, without the need for human
knowledge in their supervised learning loop. This allows deep
networks to learn complex features by combining a hierarchy
of simpler features learned in their hidden layers [37].
2) Learning Algorithms: Learning features from data in a
DNN, e.g. the networks shown in Fig. 2, is typically achieved
by minimizing a loss function. In most cases, this is equivalent
to finding the maximum likelihood using the cross-entropy
between training data and the learned model distribution. Loss
function minimization is achieved by optimizing the network
parameters (weights and biases). This optimization process
propagates the gradient of the loss function from the final network layer backward
through all the network layers and is therefore called
backpropagation. Widely used optimization algorithms in DNNs include
Stochastic Gradient Descent (SGD) and those that use adaptive
learning rates [37].
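As a minimal illustration of loss minimization by SGD, the following Python sketch trains a single sigmoid neuron with a cross-entropy loss. The toy dataset, learning rate, epoch count, and all function names are illustrative assumptions and are not taken from this paper. With only one layer, the backward pass reduces to a single chain-rule step; a deep network repeats this step layer by layer.

```python
import math
import random


def sigmoid(z):
    """Logistic activation of a single neuron."""
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy between label y and predicted probability p."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def mean_loss(data, w, b):
    """Average cross-entropy of the neuron over a dataset of (x, y) pairs."""
    return sum(cross_entropy(y, sigmoid(w * x + b)) for x, y in data) / len(data)


def train_sgd(data, lr=0.5, epochs=200, seed=0):
    """Stochastic gradient descent: one parameter update per training sample."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)
        for x, y in samples:
            p = sigmoid(w * x + b)
            grad_z = p - y          # dLoss/dz for sigmoid + cross-entropy
            w -= lr * grad_z * x    # chain rule: dz/dw = x
            b -= lr * grad_z        # chain rule: dz/db = 1
    return w, b


# Hypothetical, linearly separable toy data: (input, label) pairs.
toy_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_sgd(toy_data)
print("loss before training:", mean_loss(toy_data, 0.0, 0.0))
print("loss after training: ", mean_loss(toy_data, w, b))
```

Adaptive-learning-rate optimizers replace the fixed `lr` with a per-parameter step size, but the gradient computation is unchanged.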
3) Backpropagation in DNNs is Computationally Expensive:
Despite the continual improvement of hardware platforms for
running and training DNNs, reducing their power consumption
is a computationally formidable task. One of the dominant
reasons is the feed-forward error backpropagation algorithm,
which depends on thousands of epochs of computationally in-
tensive Vector Matrix Multiplication (VMM) operations [27],
using huge datasets that can exceed millions of data points.
These operations, if performed on a conventional von Neumann
architecture which has separate memory and processing units,
will have a time and power complexity of order $O(N^2)$ for
multiplying a vector of length $N$ by a matrix of dimensions
$N \times N$.
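The quadratic cost can be made concrete with a short Python sketch that performs a VMM sequentially and counts the multiply-and-accumulate steps. The function name and the example sizes are assumptions chosen for illustration; the loop structure mimics a processor that executes one MAC at a time.

```python
def vmm(vector, matrix):
    """Multiply a length-N row vector by an N x M matrix, counting MACs.

    Each inner-loop iteration is one multiply-and-accumulate operation,
    as a sequential processor would execute it.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    if len(vector) != n_rows:
        raise ValueError("vector length must match the number of matrix rows")
    out = [0.0] * n_cols
    macs = 0
    for j in range(n_cols):
        acc = 0.0
        for i in range(n_rows):
            acc += vector[i] * matrix[i][j]
            macs += 1
        out[j] = acc
    return out, macs


# For a square N x N matrix the MAC count grows as N^2.
for n in (4, 8, 16):
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    _, macs = vmm([1.0] * n, identity)
    print(n, macs)
```

Doubling N quadruples the number of sequential MACs, which is why accelerators aim to execute many of them in parallel.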
In addition, an artificial neuron in DNNs calculates a sum-
of-products of its input-weight matrix pairs. For instance, a
CNN spatially structures the sum-of-products calculation into
a VMM operation. In digital logic, an adder tree can be used to
accumulate a large number of values. This, however, becomes
problematic in DNNs when one considers the sheer number
of elements that must be summed together, as each addition
requires one cycle.
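The trade-off of an adder tree can be sketched in Python as follows. The model is a simplifying assumption: every addition takes one cycle and all additions within one tree level run in parallel. Under this assumption, summing N values needs about log2(N) levels of latency but still N - 1 adders, so the hardware cost grows with the number of summed elements.

```python
def adder_tree_sum(values):
    """Sum values with a pairwise adder tree.

    Returns (total, depth, adders), where depth is the number of tree
    levels (cycles, if each level completes in one cycle) and adders is
    the total number of two-input additions performed.
    """
    level = list(values)
    if not level:
        raise ValueError("at least one value is required")
    depth = 0
    adders = 0
    while len(level) > 1:
        next_level = []
        # Add neighbouring pairs; all pairs in a level are independent.
        for i in range(0, len(level) - 1, 2):
            next_level.append(level[i] + level[i + 1])
            adders += 1
        # An unpaired last element is passed to the next level unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
        depth += 1
    return level[0], depth, adders


total, depth, adders = adder_tree_sum(range(1, 9))
print(total, depth, adders)
```

A purely sequential accumulator would instead use a single adder for N - 1 cycles; the tree trades area for latency.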
4) Transfer Learning: A major assumption when training
DNNs is that both training and test samples are drawn from
the same feature space and distribution. When the feature space
and/or distribution changes, DNNs should be retrained. Rather
than training a new model from scratch, trained parameters from
an existing model can be fixed, tuned, or adapted [38]. This
process of transfer learning can be used to greatly reduce the
computational expense of training DNNs.
In the medical imaging domain, transfer learning from natu-
ral image datasets, particularly ImageNet [39], using standard
large models and corresponding pretrained weights has become
a de-facto method to speed up training convergence and to
improve accuracy [40]. Transfer learning can also be used to
leverage personalized anatomical knowledge accumulated over
time to improve the accuracy of pre-trained CNNs for specific
patients [41], i.e., to perform patient-specific model tuning. This
is an important topic in biomedical application domains, which
will be further discussed in Section IV-F.
C. DL Accelerators
TABLE I
NUMBER OF WEIGHTS AND MULTIPLY-AND-ACCUMULATE (MAC) OPERATIONS
IN VARIOUS CNN ARCHITECTURES FOR A SINGLE IMAGE AND FOR
VIDEO PROCESSING AT 25 FRAMES PER SECOND
Fig. 3. Typical hardware technologies for DNN acceleration. In this paper we
cover the top two layers of the pyramid, which include specialized hardware
technologies for high-performance training and inference of DNNs. While the
apex is labelled RRAM, this is intended to broadly cover all programmable
non-volatile resistive switching memories e.g. CBRAM, MRAM, PCM, etc.
In Table I, we depict some popular CNN architectures, accompanied
by the total number of weights and MAC operations
that must be computed for a single image (input resolutions
of 656 × 468 for OpenPose, 224 × 224 for the rest). This
table highlights two key facts. Firstly, MACs are the dominant
operation of DNNs. Therefore, hardware implementations of
DNNs should strive to parallelize a large number of MACs to
perform effectively. Secondly, there are many predetermined
weights that must be called from memory. Reducing the energy
and time consumed by reading weights from memory provides
another opportunity to improve efficiency.
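To make the two quantities concrete, the following Python sketch counts the weights and MACs of a single convolutional layer and scales the MAC count to video at 25 frames per second. The layer dimensions are a hypothetical example (a 3 × 3 convolution from 3 to 64 channels producing a 224 × 224 output) and the function names are assumptions; the figures are not taken from Table I.

```python
def conv2d_cost(in_ch, out_ch, kernel, out_h, out_w, bias=False):
    """Return (weights, macs) for one square-kernel convolutional layer.

    Every output pixel of every output channel needs kernel*kernel*in_ch
    MACs; the weights are shared across all output pixel positions.
    """
    weights = kernel * kernel * in_ch * out_ch
    macs = weights * out_h * out_w
    if bias:
        weights += out_ch  # one bias per output channel (no extra MACs)
    return weights, macs


def video_macs_per_second(macs_per_frame, fps=25):
    """MACs per second when the same network runs on every video frame."""
    return macs_per_frame * fps


# Hypothetical layer: 3x3 kernel, 3 -> 64 channels, 224 x 224 output map.
weights, macs = conv2d_cost(in_ch=3, out_ch=64, kernel=3, out_h=224, out_w=224)
print("weights:", weights)
print("MACs per image:", macs)
print("MACs per second at 25 fps:", video_macs_per_second(macs))
```

Even this small layer reuses each of its 1,728 weights at 50,176 output positions, which illustrates why both MAC parallelism and weight reuse from local memory matter.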
Consequently, significant research has been conducted
to achieve massive parallelism and to reduce memory access
in DNN accelerators, using different hardware technologies and
platforms as depicted in Fig. 3. Although these goals target
general DL applications, they can significantly facilitate fast and
low-power smart PoC devices [21] and healthcare IoT systems.
In addition to conventional DL accelerators, there have been
significant research efforts to utilize biologically plausible SNNs
for learning and cognition [42]. Spiking neuromorphic proces-
sors have also been used for biomedical signal processing [43]–
[45]. Below, we provide a brief introduction to SNNs, which will
be discussed as a method complementary to DL accelerators
for efficient biomedical signal processing later in this paper.
We will also perform comparisons among SNNs and DNNs in
performing an EMG processing task.
Fig. 4. DNNs and SNN neuromorphic processors adopt different operation models. In DNNs, inputs are processed in batches which propagate serially.
Consequently, they require clocks for process synchronization. SNNs are asynchronous and process temporally encoded inputs independently. Time series signals,
such as the EMG signal presented in (a) can be either (b) temporally encoded using spike train encoding schemes such as [43], before being fed into (j) neuromorphic
processors, or (c) digitally sampled, before being concatenated into batches, to be fed into (d) DNNs. Similarly, photographs captured from (e) lenses can be (i)
temporally encoded into spike trains using (h) DVSs [50] or (f) digitally encoded using conventional cameras to build (g) image frames.
D. Spiking Neural Networks
SNNs are neural networks that typically use Integrate-and-
Fire neurons to dynamically process temporally varying signals
(see Fig. 4(j)). By integrating multiple spikes over time, it is
possible to reconstruct an analog value that represents the mean
firing rate of the neuron. The mean firing rate is equivalent to
the value of the activation function of ANNs. So in the mean
firing rate limit, there is an equivalence between ANNs and
SNNs. By using spikes as all-or-none digital events (Fig. 4(i)),
SNNs enable the reliable transmission of signals across long
distances in electronic systems. In addition, by introducing the
temporal dimension, these networks can efficiently encode and
process sequential data and temporally changing inputs [46].
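The rate equivalence described above can be sketched with a simple Integrate-and-Fire model in Python. The non-leaky neuron, the reset-by-subtraction rule, the unit threshold, and the discrete time step are all simplifying assumptions made for illustration and do not describe any particular neuromorphic processor.

```python
def integrate_and_fire_rate(input_current, threshold=1.0, steps=1000):
    """Mean firing rate (spikes per time step) of a non-leaky IF neuron.

    The membrane potential integrates a constant input. When it reaches
    the threshold, the neuron emits one spike and the threshold is
    subtracted from the potential.
    """
    v = 0.0
    spikes = 0
    for _ in range(steps):
        v += input_current
        if v >= threshold:
            spikes += 1
            v -= threshold  # reset by subtraction keeps the residual charge
    return spikes / steps


# The rate follows the input for positive inputs and is zero for
# negative inputs, which resembles a rectified-linear activation.
for current in (-0.5, 0.25, 0.5):
    print(current, integrate_and_fire_rate(current))
```

In this sketch the rate saturates at one spike per time step for inputs above the threshold, so the correspondence with an ANN activation holds only within that range.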
SNNs can be efficiently interfaced with event-based sensors
since they only process events as they are generated. An example
of such sensors is the Dynamic Vision Sensor (DVS), which is
an event-based camera shown in Fig. 4(h). The DVS consists of
a logarithmic photo-detector stage followed by an operational
transconductance amplifier with a capacitive-divider gain stage,
and two comparators. The ON/OFF spikes are generated every
time the difference between the current and previous value of the
input exceeds a pre-defined threshold. The sign of the difference
corresponds to the ON or OFF channel where the spike is
produced. This is different to conventional cameras (Fig. 4(f)),
which produce image frames (Fig. 4(g)). Intuitively, it makes
sense to use asynchronous event-based sensor data in asyn-
chronous SNNs, and synchronously generated frames (i.e., all
pixels are given at a regular clock interval) through synchronous
ANNs. But it is worth noting that conventional frames can be
encoded as asynchronous spikes with frequencies that vary based
on pixel intensity, and event streams can be integrated over time
into synchronously generated time-surfaces [47], [48]. Event-
based sensors have been used to process biomedical signals [43],
[49] (Fig. 4(a)), which can be encoded to spike trains (Fig. 4(b))
to be processed by SNNs or be digitally sampled (Fig. 4(c)) for
use in DNNs for learning and inference (Fig. 4(d)).
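The ON/OFF event generation described above can be sketched as a delta encoder in Python. This is a simplified software analogue under stated assumptions, not the DVS circuit or the encoding scheme of [43]: the reference value is the sample at which the last event was emitted, at most one event is produced per sample, and the logarithmic front end of the DVS is omitted.

```python
def delta_encode(samples, threshold):
    """Encode a sampled signal as a list of (time_index, polarity) events.

    An event is emitted whenever the signal differs from the value stored
    at the last event by more than the threshold. The sign of the
    difference selects the ON or OFF channel.
    """
    if not samples:
        return []
    events = []
    reference = samples[0]
    for t in range(1, len(samples)):
        diff = samples[t] - reference
        if abs(diff) > threshold:
            events.append((t, "ON" if diff > 0 else "OFF"))
            reference = samples[t]  # remember the value at the last event
    return events


# Hypothetical integer-valued signal and threshold.
signal = [0, 2, 6, 7, 1, 0]
print(delta_encode(signal, threshold=4))
```

Samples that stay within the threshold of the reference produce no events, which is the source of the sparsity that event-based processing exploits.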
E. Benchmarking on a Biomedical Signal Processing Task
In Section III we will present a use-case of bio-signal processing
where FPGA and memristive DNN accelerators are
implemented and simulated. These are later compared to equivalent
existing implementations¹ using DNN accelerators and
neuromorphic processors from [45]. To perform comparisons,
we use the same hand-gesture recognition task as in [45].
¹[Online]. Available: https://github.com/Enny1991/dvs_emg_fusion/blob/master/full_baseline.py
Tasks such as prosthesis control can be performed using EMG
signals, hand-gesture classification, or a combination of both.
Here, the adopted hand-gesture dataset [45] is a collection of
5 hand gestures recorded with two sensor modalities: muscle
activity from a Myo armband that senses EMG electrical activity
in forearm muscles, and a visual input in the form of DVS events.
Moreover, the dataset provides accompanying video captured
from a traditional frame-based camera, i.e., images from an
Active Pixel Sensor (APS) to feed DNNs. Recordings were
collected from 21 subjects including 12 males and 9 females
between the ages of 25 and 35, and were taken over three separate
sessions.
For each implementation, we compare the mean and standard
deviation of the accuracy obtained over a 3-fold cross validation,
where each fold encapsulates all recordings from a given session.
Additionally, for all implementations, we compare the energy
and time required to perform inference on a single input, as well
as the Energy-Delay Product (EDP), which is the average energy
consumption multiplied by the average inference time.
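The evaluation metrics defined above can be computed as in the following Python sketch. The function names and all numeric values are hypothetical placeholders, not results from this paper, and the use of the population standard deviation across folds is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, pstdev


def energy_delay_product(energies, delays):
    """EDP: average energy per inference times average inference time.

    With energies in joules and delays in seconds, the result is in
    joule-seconds.
    """
    return mean(energies) * mean(delays)


def summarize_folds(fold_accuracies):
    """Mean and (population) standard deviation of per-fold accuracy."""
    return mean(fold_accuracies), pstdev(fold_accuracies)


# Placeholder measurements for illustration only.
energies_j = [2.0, 4.0]        # energy per inference, in joules
delays_s = [0.5, 1.5]          # inference time, in seconds
fold_acc = [0.9, 0.8, 0.7]     # one accuracy value per session-wise fold

print("EDP (J*s):", energy_delay_product(energies_j, delays_s))
print("accuracy mean, std:", summarize_folds(fold_acc))
```

Because the EDP multiplies energy by delay, a platform must improve on both to score well, which makes it a stricter metric than either quantity alone.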
III. DNN ACCELERATORS TOWARDS HEALTHCARE AND
BIOMEDICAL APPLICATIONS
In this section, we cover the use of CMOS and memristors in
DL acceleration. We discuss how they use different strategies to
achieve two of the key DNN acceleration goals, namely MAC
parallelism and reduced memory access. We also discuss and
review FPGAs as an alternative reconfigurable DNN accelerator
platform, which has shown great promise in the healthcare and
biomedical domains.
A. CMOS DNN Accelerators
General edge-AI CMOS accelerator chips can be used for
DNN-enabled healthcare IoT and PoC systems. Therefore,
within this subsection, we first review a number of these chips
and provide examples of potential healthcare applications they
can accelerate. We will also explore some common approaches
to CMOS-driven acceleration of AI algorithms using mas-
sive MAC parallelism and reduced memory access, which are
useful for both edge-AI devices and offline data center scale
acceleration.
1) Edge-AI DNN Accelerators Suitable for Biomedical Ap-
plications: The research and market for ASICs, which focus on
a new generation of microprocessor chips dedicated entirely to
machine learning and DNNs, have rapidly expanded in recent
years. Table II shows a number of these CMOS-driven chips,
which are intended for portable applications. There are many
other examples of AI accelerator chips (for a comprehensive
survey see [51]), but here we selected several prominent examples,
which are designed specifically for DL using DNNs, RNNs, or
both. We have also included a few general purpose AI acceler-
ators from Google [52], Intel [53], and Huawei [54].
Although developed for general DNNs, the accelerators
shown in Table II can efficiently realize portable smart DL-based
healthcare IoT and PoC systems for processing image-based
(medical imaging) or dynamic sequential medical data types
(such as EEG and ECG). For instance, the table shows a few
exemplar healthcare and biomedical applications that are picked
based on the demonstrated capacity of these accelerators to run
(or train [55]) various well-known CNN architectures such as
VGG, ResNet, MobileNet, AlexNet, Inception, or RNNs such as
LSTMs, or combined CNN-RNNs. It is worth noting that most
of the available accelerators are intended for CNN inference,
while only some [56]–[58] also include recurrent connections
for RNN acceleration.
Table II shows that the total power per chip in most of these
devices is typically in the range of hundreds of mW, with a few
exceptions consuming around 10 W [53],
[54]. Such a low power budget is required to avoid large heat sinks and to satisfy
portable battery constraints. The Table also shows the com-
puting capability per unit time (column ‘Computational Power
(GOP/s)’). Regardless of power consumption, this column re-
veals the computational performance and consequently the size
of a network one can compute per unit time. It is demonstrated
that several of these chips can run large and deep CNNs such
as VGG and ResNet, which enable them to perform complex
processing tasks within a constrained edge power budget.
For instance, it has been previously shown in [60] that VGG
CNN (shown to be compatible with Cambricon-x [59]), can
successfully analyze ECog signals. Therefore, considering the
power efficiency of Cambricon-x, it can be used to implement
a portable automatic ECog analyzer for PoC diagnosis of var-
ious cardiovascular diseases [78]. Similarly, Eyeriss [61] can
run VGG-16, which is shown to be effective in diagnosing
thyroid cancer [62]. In addition, Eyeriss can run AlexNet for
several different medical imaging applications [32]. Therefore,
Eyeriss can be used as a mobile diagnostic tool that can be
integrated into or complement medical imaging systems at the
PoC. Origami [63] is another CNN accelerator chip, which can
be used for other healthcare applications based on a CNN. For
instance, [64] proposes a CNN-based ECG analysis for heart
monitoring, and [65] introduces a two-stage end-to-end CNN for
human activity recognition for elderly and rehabilitation
monitoring; Origami could run either of these to realize a smart healthcare
IoT edge device. Similarly, the CNN processor proposed in [66]
is shown to be able to run AlexNet, which can be deployed in
a PoC ultrasound image processing system [67]. Envision [68]
is another accelerator that has the capability to run large-scale
CNNs. It can also be used as an edge inference engine for a
multi-layer CNN for EEG/ECog feature extraction for epilepsy
diagnosis [69]. Neural processor [70] is another CNN accelerator
that is shown to be able to run Inception V3 CNN, which can be
used for skin cancer detection [11] at the edge. LNPU [55] is the
only CNN accelerator shown in Table II, which unlike the others
can perform both learning and inference of a deep network such
as AlexNet and VGG-16, for applications including on edge
medical imaging [32] and cancer diagnosis [62].
TABLE II
A NUMBER OF RECENT EDGE-AI CMOS CHIPS SUITABLE FOR PORTABLE HEALTHCARE AND BIOMEDICAL APPLICATIONS
Unlike the above-discussed chips that are capable of running
only CNNs, DNPU [56], Thinker [57], and UNPU [58] are
capable of accelerating both CNNs and RNNs. This feature makes
them suitable for a wider variety of edge-based biomedical
applications such as ECG analysis for BCI using a cascaded
RNN-CNN [34], PoC MRI construction from motion
ultrasounds using a long-term recurrent CNN [71], intelligent
medical consultation using a CNN-RNN [35], respiratory sound
classification in wearable devices enabled by patient-specific model
tuning using a CNN-RNN [72], or on-chip online and
personalized prediction of missing photoplethysmographic data [73].
Table II lists three general purpose AI accelerator chips, which
have been deployed for low-cost and easy-to-access skin cancer
detection using MobileNet V1 CNN [25], on edge health moni-
toring for fall detection using LSTMs [74], chest X-ray analysis
using ResNet CNN [76], long term bowel sound monitoring
and segmentation using a CNN [77], cardiovascular arrhythmia
detection from ECG using an LSTM [33], or heart rate variability
analysis from ECG signals through a bidirectional LSTM [36],
just to name a few. These general-purpose chips have the po-
tential to be used for other biomedical edge-based applications
such as robust long-term decoding in intracortical BMIs using
MLP and ELM networks in a sparse ensemble machine learning
platform [75].
In addition to the edge-AI CNN and RNN acceleration chips
or general ML chips mentioned in Table II, there have been
other works that have developed custom CMOS platforms for
biomedical applications. Examples of these CMOS designs
include [79] that has developed a 128-Channel ELM-based
neural decoder for BMI, and [80] that has implemented an
autoencoder neural network as part of a neural interface proces-
sor for brain-state classification and programmable-waveform
neurostimulation.
2) Common Approaches to CMOS-Driven DL Acceleration:
Accelerators will typically target either data center use or embed-
ded ‘edge-AI’ acceleration. Edge chips, such as those discussed
above, must operate under restrictive power budgets (e.g., within
thermal limits of 5 W) to cope with portable battery constraints.
While the scale of tasks, input dimension capacity, and clock
speeds will differ between edge-AI and modular data center
racks, both will adopt similar principles in the tasks they seek to
optimize.
Most of the accelerator chips, such as those discussed in
Table II, use similar optimization strategies involving reduced
precision arithmetic [55], [58], [66], [68] to improve computa-
tional throughput. This is typically combined with architectural-
level enhancements [56], [57], [59], [61], [70] to reduce
data movement (using in- or near-memory computing), increase
parallelism, or both. In addition, there are many other
approaches commonly used to make neural network implemen-
tations more efficient. Examples of these include tensor de-
composition, pruning, and mixed-precision data representation,
which are often integrated in hardware with in-memory and near-
memory computing. A thorough review of these approaches can
be found in [81] and [82].
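As an illustration of one of the techniques listed above, the following is a minimal sketch of unstructured magnitude pruning, written under our own assumptions: the function name `magnitude_prune`, the layer shape, and the 75% sparsity target are hypothetical and are not taken from any of the chips in Table II. It only shows how small-magnitude weights can be masked out so that the corresponding MACs may be skipped.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Illustrative sketch only (hypothetical helper, not a chip-specific method).
    Returns the pruned weights and the boolean keep-mask.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    flat = np.abs(weights).ravel()
    k = int(np.floor(sparsity * flat.size))  # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest magnitude acts as the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))          # hypothetical layer weights
w_pruned, mask = magnitude_prune(w, sparsity=0.75)
print("kept weights:", int(mask.sum()), "of", mask.size)
```

In practice, pruning is usually followed by fine-tuning to recover accuracy, and hardware benefits depend on whether the accelerator can exploit the resulting sparsity.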
Sequential and combinational logic research is largely ma-
tured, so outside of emerging memory technologies, the domi-
nant hardware benefits are brought on by optimizing data flow
and architecture. An early example is the neuFlow system-on-
chip (SoC) processor which relies on a grid of processing tiles,
each made up of a bank of processing operators and a multiplexer
based on-chip router [83]. The processing operator can serially
perform primitive computation (MUL, DIV, ADD, SUB, MAX),
or a parallelized 1D/2D convolution. The router configures data
movement between tiles to support streaming data flow graphs.
Since the development of neuFlow, over 100 startups and
companies have developed, or are developing, machine learning
accelerators. The Neural Processing Unit (NPU) [84] gener-
alizes the work from neuFlow by employing eight processing
engines which each compute a neuron response: multiplication,
accumulation, and activation. If a program could be partitioned
such that a segment of it can be calculated using MACs, then it
would be partially computed on the NPU. This made it possible
to go beyond MLP neural networks. The NPU was demonstrated
to perform Sobel edge detection and fast Fourier transforms as
well.
NVIDIA coupled their expertise in developing GPUs with
machine learning dedicated cores, namely, tensor cores, which
are aimed at demonstrating superior performance over regular
Compute Unified Device Architecture (CUDA) cores [17]. Ten-
sor cores target mixed-precision computing, with their NVIDIA
Tesla V100 GPU combining 640 tensor cores on a single unit. By
merging the parallelism of GPUs with the application-specific
nature of tensor cores, their GPUs are capable of energy-efficient
general compute workloads, as well as up to 125 trillion floating-point
operations per second (TFLOPS) of mixed-precision matrix arithmetic.
Although plenty of other notable architectures exist (see
Table II), a pattern begins to emerge, as most specialized proces-
sors rely on a series of sub-processing elements which each con-
tribute to increasing throughput of a larger processor [81], [82].
Whilst there are plenty of ways to achieve MAC parallelism,
one of the most renowned techniques is the systolic array, which is
utilized by Groq [85] and Google, amongst numerous other chip
developers. This is not a new concept: systolic architectures were
first proposed back in the late 1970s [86], [87], and have become
widely popularized since powering the hardware DeepMind
used for the AlphaGo system to defeat Lee Sedol, the world
champion of the board game Go, in March 2016. Google also
uses systolic arrays to accelerate MACs in their TPUs, just one
of many CMOS ASICs used in DNN processing [13].
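To make the systolic principle concrete, the sketch below simulates, cycle by cycle, a grid of processing elements (PEs) computing a matrix product. It assumes an output-stationary dataflow, in which each PE accumulates one output element while operands flow rightwards and downwards; this is only one of several possible dataflows and is not claimed to match the design of any particular chip. The function name `systolic_matmul` is our own.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array for A @ B.

    PE (i, j) accumulates C[i, j]. Rows of A enter from the left edge and
    columns of B from the top edge, each skewed by one cycle per row/column.
    """
    n, k = A.shape
    k2, m = B.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    acc = np.zeros((n, m))    # one accumulator per PE
    a_reg = np.zeros((n, m))  # A operand currently held by each PE
    b_reg = np.zeros((n, m))  # B operand currently held by each PE
    for t in range(n + m + k - 2):
        # Operands advance by one PE per cycle: A to the right, B downwards.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Skewed injection at the array edges (zeros outside the valid window).
        for i in range(n):
            s = t - i
            a_reg[i, 0] = A[i, s] if 0 <= s < k else 0.0
        for j in range(m):
            s = t - j
            b_reg[0, j] = B[s, j] if 0 <= s < k else 0.0
        # Every PE performs one MAC per cycle, all in parallel.
        acc += a_reg * b_reg
    return acc

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
print(np.allclose(systolic_matmul(A, B), A @ B))
```

The full product completes in n + m + k - 2 cycles, with each operand fetched from memory once and then reused as it passes through neighbouring PEs.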
B. FPGA DNNs
FPGAs are fairly low-cost reconfigurable hardware that can
be used in almost any hardware prototyping and implementation
task, significantly shortening the time-to-market of an elec-
tronic product. They also provide parallel computation, which
is essential when simultaneous data processing is required such
as processing multiple ECG channels in parallel. Furthermore,
there exists a variety of High Level Synthesis (HLS) tools and
techniques [88], [89] that facilitate FPGA prototyping without
the need to directly develop time-consuming low-level Hardware
Description Language (HDL) code [90]. These tools allow
engineers to describe their targeted hardware in high-level pro-
gramming languages such as C to synthesize them to Register
Transfer Level (RTL). The tools then offload the computational-
critical RTL to run as kernels on parallel processing platforms
such as FPGAs [91].
1) Accelerating DNNs on FPGAs: FPGAs have been previ-
ously used to realize mostly inference [89], [92], [93], and in
some cases training of DNNs with reduced-precision-data [94],
or hardware-friendly approaches [95]. For a comprehensive
review of previous FPGA-based DNN accelerators, we refer the
reader to [89].
Here, we demonstrate an example of accelerating DNNs to
benchmark the biomedical signal processing task explained in
Subsection II-E. For our acceleration, we use fixed-point param-
eter representations on a Starter Platform for OpenVINO Toolkit
FPGA using OpenCL. OpenCL [88] is an HLS framework for
writing programs that execute across heterogeneous platforms.
OpenCL specifies programming languages (based on C99 and
C++11) for programming the compute devices and Application
Programming Interfaces (APIs) to control and execute its devel-
oped kernels on the devices, where depending on the available
computation resources, an accelerator can pipeline and execute
all work items in parallel or sequentially.
Fig. 5 depicts the compilation flow we adopted. The trained
DNN PyTorch model is first converted to .prototxt and .caffemodel
files using Caffe. All weights and biases are then
converted to a fixed point representation using MATLAB’s
Fixed-point toolbox using word length and fractional bit lengths
Fig. 5. Compilation flow used to deploy an EMG classification CNN to an OpenVINO FPGA adopting fixed-point number representations using OpenCL.
defined in [96], prior to being exported as a single binary .dat
file for integration with PipeCNN, which is used to generate
the necessary RTL libraries, and to perform compilation of the
host executable and the FPGA bit-stream. We used Intel’s FPGA
SDK for OpenCL 19.1, and provide all files used during the com-
pilation shown in Fig. 5 in a publicly accessible complementary
GitHub repository.2
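The fixed-point conversion step in this flow can be sketched in a few lines. The example below is a simplified stand-in written in NumPy rather than the MATLAB toolbox used in our flow; the word length of 8 bits and fractional length of 4 bits are placeholder values chosen for illustration and are not the values defined in [96]. The rounding mode (round half to even) and saturation on overflow are likewise assumptions.

```python
import numpy as np

def to_fixed_point(x, word_length=8, frac_length=4):
    """Quantize to signed fixed point with saturation.

    Hypothetical helper: returns the integer codes that would be stored,
    where the represented value is code / 2**frac_length.
    """
    scale = 2 ** frac_length
    q_min = -(2 ** (word_length - 1))
    q_max = 2 ** (word_length - 1) - 1
    codes = np.round(np.asarray(x, dtype=np.float64) * scale)
    return np.clip(codes, q_min, q_max).astype(np.int32)

def from_fixed_point(codes, frac_length=4):
    """Recover the real values represented by fixed-point integer codes."""
    return np.asarray(codes, dtype=np.float64) / (2 ** frac_length)

weights = np.array([0.5, -0.25, 100.0, -100.0])
codes = to_fixed_point(weights)
print(codes)                    # integer codes; large values saturate
print(from_fixed_point(codes))  # values actually represented on the FPGA
```

Choosing the fractional length per layer trades dynamic range against resolution, which is why application-specific quantization settings are typically tuned against validation accuracy.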
2) FPGA-Based DNNs for Biomedical Applications: De-
spite the many FPGA-based DNN accelerators available [89],
only a few have been developed specifically for biomedical
applications such as ECG anomaly detection [97], or real-
time mass-spectrometry data analysis for cancer detection [98],
where the authors show that application-specific parameter
quantization and customized network design can result in sig-
nificant inference speed-up compared to both CPU and GPU.
In addition, the authors in [99] have developed an FPGA-based
BCI, in which an MLP is used for reconstructing ECoG signals.
In [100], the authors have implemented an EEG processing and
neurofeedback prototype on a low-power, low-cost FPGA
and then scaled it on a high-end Ultra-scale Virtex-VU9P, which
achieved 215 and 8 times better power efficiency than a CPU
and a GPU, respectively. For the EEG processing, they developed
an LSTM inference engine.
It is projected that, by leveraging specific algorithmic design
and hardware-software co-design techniques, FPGAs can pro-
vide >10 times energy-delay efficiency compared to state-of-
the-art GPUs for accelerating DL [89]. This is significant for
realizing portable and reliable healthcare applications. However,
FPGA design is not as straightforward as high-level designs
conducted for DL accelerators and requires skilled engineers and
stronger tools, such as those offered by the GPU manufacturers.
C. Memristive DNNs
To achieve the two aforementioned key DNN acceleration
goals, i.e. massive MAC parallelism and reduced memory ac-
cess, many studies have leveraged memristors [101]–[104] as
weight elements in their DNN and SNN [105], [106] architec-
tures. Memristors are often referred to as the fourth fundamental
circuit element, and can adapt their resistance (conductance)
to changes in the applied current or voltage. This is similar to
the adaptation of neural synapses to their surrounding activity
while learning. This adaptation feature is integral to the brain’s
2https://github.com/coreylammie/TBCAS-Towards-Healthcare-and-
Biomedical-Applications/blob/master/FPGA/
Fig. 6. Memristive crossbars can parallelize (a) analog MAC and (b) VMM
operations. Here, V represents the input vector, while conductances in the
crossbar represent the matrix.
in-memory processing ability, which is missing in today’s gen-
eral purpose computers. This in-situ processing can be utilized to
perform parallel MAC operations inside memory, hence, signif-
icantly improving DNN learning and inference. This is achieved
by developing memristive crossbar neuromorphic architectures,
which are projected to achieve approximately 2500-fold reduc-
tion in power and a 25-fold increase in acceleration, compared
to state-of-the-art specialized hardware such as GPUs [101].
1) Memristive Crossbars for Parallel MAC and VMM Opera-
tions: A memristive crossbar that can be fabricated using a vari-
ety of device technologies [106], [107] can perform analog MAC
operations in a single time-step (see Fig. 6(a)). This reduces the
time complexity to its minimum (O(1)), and is achieved by
carrying out multiplication at the place of memory, in a non-von
Neumann structure. Using this well-known approach, VMM can
be parallelized as demonstrated in Fig. 6(b), where the vector of
size $M$ represents input voltage signals ($[V_1, \ldots, V_M]$). These voltages
are applied to the rows of the crossbar, while the matrix (of
size $M \times N$), whose elements are represented as conductances
(resistances), is stored in the memristive components at each
cross point. Taking advantage of Ohm's law ($I = V \cdot G$) and
Kirchhoff's current law, the current summed in each crossbar column
represents one element of the resulting multiplication vector of size $N$.
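The crossbar VMM described above can be expressed as a short numerical sketch. It models an ideal crossbar only, i.e. it ignores wire resistance, sneak paths, and device non-idealities; the function name `crossbar_vmm` and the example voltages and conductances are hypothetical values chosen for illustration.

```python
import numpy as np

def crossbar_vmm(v, G):
    """Ideal crossbar read-out: column currents for row voltages v.

    v: input voltages, shape (M,), in volts.
    G: device conductances, shape (M, N), in siemens.
    """
    v = np.asarray(v, dtype=np.float64)
    G = np.asarray(G, dtype=np.float64)
    # Ohm's law at every cross point: I_ij = V_i * G_ij.
    device_currents = v[:, None] * G
    # Currents sharing a column wire add up: I_j = sum_i V_i * G_ij.
    return device_currents.sum(axis=0)

v = np.array([0.1, 0.2, 0.3])      # M = 3 input voltages
G = np.array([[1e-3, 2e-3],
              [3e-3, 4e-3],
              [5e-3, 6e-3]])       # M x N = 3 x 2 conductance matrix
I = crossbar_vmm(v, G)
print(I)                           # N = 2 column currents, in amperes
```

All M × N multiplications and the N summations happen concurrently in the analog domain, which is the source of the single-time-step behaviour noted above.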
2) Mapping Memristive Crossbars to DNN Layers:
Although implementing fully-connected DNN layers is
straightforward by mapping the weights to crossbar point
Fig. 7. Conversion process of a DNN trained in PyTorch and mapped to a Memristive DNN using MemTorch [108], to parallelize VMMs using 1T1R memristive
crossbars and to take into account memristor variability including finite number of conductance states and non-ideal RON and ROFF distributions.
memristors and having the inputs represented by input voltages,
implementing a complex CNN requires mapping techniques to
convert convolution operations to MAC operations. A popular
approach to perform this conversion is to use an unrolling
(unfolding) operation that transforms the convolution of input
feature maps and convolutional filters to MAC operations. We
have developed a software platform named MemTorch [108],
that will be introduced in subsequent sections, to perform this
mapping as well as a number of other operations, for converting
DNNs to Memristive DNNs (MDNNs). The mapping process
implemented in MemTorch is illustrated in the left panel in
Fig. 7. The figure shows that the normal input feature maps and
convolutional filters (shown in gray shaded area) are unfolded
and reshaped (shown in the cyan shaded area) to be compatible
with memristive crossbar parallel VMM operations. It is worth
noting that the convolutional filters that can be applied to
the input feature maps have a direct relationship with the
required crossbar sizes. Furthermore, the resulting hardware
size depends on the size of the input feature maps [109].
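A minimal sketch of the unrolling (unfolding) idea is given below, assuming stride 1, no padding, and a single input sample. It is a plain NumPy illustration, not the MemTorch implementation; the helper names `unfold` and `conv2d_as_vmm` are our own. Each row of the unfolded input is one receptive field, and each filter becomes one column of the weight matrix, so the convolution reduces to a sequence of VMMs.

```python
import numpy as np

def unfold(x, kh, kw):
    """Unroll x of shape (C, H, W) into rows of flattened kh x kw patches."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((out_h * out_w, c * kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols, (out_h, out_w)

def conv2d_as_vmm(x, filters):
    """Convolution (cross-correlation) expressed as matrix multiplication.

    filters has shape (F, C, kh, kw). The weight matrix that would be
    programmed into a crossbar has C*kh*kw rows and F columns.
    """
    f, c, kh, kw = filters.shape
    cols, (out_h, out_w) = unfold(x, kh, kw)
    weight_matrix = filters.reshape(f, -1).T  # (C*kh*kw, F)
    out = cols @ weight_matrix                # one VMM per output position
    return out.T.reshape(f, out_h, out_w)

rng = np.random.default_rng(2)
x = rng.normal(size=(2, 5, 5))           # hypothetical 2-channel 5x5 input
filters = rng.normal(size=(3, 2, 3, 3))  # three 3x3 filters
y = conv2d_as_vmm(x, filters)
print(y.shape)
```

The weight matrix dimensions make the relationship between filter configuration and crossbar size explicit, while the number of unfolded rows, and hence the number of VMMs, grows with the input feature map size.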
3) Peripheral Circuitry for Memristive DNNs: In addition
to the memristive devices that are used as programmable ele-
ments in MDNN architectures, various peripheral circuitry is
required to perform feed-forward error-backpropagation learn-
ing in MDNNs [103]. This extra circuitry may include: (i) a
conversion circuit to translate the input feature maps to input
voltages, which for programming memristive devices are usually
Pulse Width Modulator (PWM) circuits, (ii) current integrators
or sense amplifiers, which pass the current read from every
column of the memristive crossbar to (iii) analog to digital
converters (ADCs), which pass the converted voltage to (iv)
an activation function circuit, for forward propagation, and for
backward error propagation (v) the activation function derivative
circuit. Other circuits required in the error backpropagation path
include (vi) backpropagation values to PWM voltage generators,
(vii) backpropagation current integrators, and (viii) backpropa-
gation path ADCs. In addition, an update module that updates
network weights based on an algorithm such as SGD is required,
which is usually implemented in software. After the update, the
new weight values should be written to the memristive crossbar,
which itself requires Bit-Line (BL) and Word-line (WL) switch
matrices to address the memristors for update, as well as a circuit
to update the memristive weights. There are different approaches
to implement this circuit such as that proposed in [110], while
others may use software ex-situ training where the new weight
values are calculated in software and transferred to the physical
memristors through peripheral circuitry [103].
4) Memristive Device Nonidealities: Although ideal mem-
ristive crossbars have been projected to remarkably accelerate
DNN learning and inference and drastically reduce their power
consumption [101], [102], device imperfections observed in
experimentally fabricated memristors impose significant per-
formance degradation when the crossbar sizes are scaled up
for deployment in real-world DNN architectures, such as those
required for healthcare and biomedical applications discussed
in Subsection III-A. These imperfections include nonlinear
asymmetric and stochastic conductance (weight) update, device
temporal and spatial variations, device yield, as well as limited
on/off ratios [101]. To minimize the impact of these imperfec-
tions, specific peripheral circuitry and system-level mitigation
techniques have been used [111]. However, these techniques
add significant computation time and complexity to the system.
It is, therefore, essential to take the effect of these nonidealities
into consideration before utilizing memristive DNNs for any
healthcare and medical applications, where accuracy is critical.
In addition, there is a need for a unified tool that reliably
simulates the conversion of a pre-trained DNN to an MDNN,
while critically considering experimentally modeled device im-
perfections [108].
5) Conversion of DNN to MDNN While Considering Mem-
ristor Nonidealities: Due to the significant time and energy
required to train new large versions of DNNs for challenging
cognitive tasks, such as biomedical and healthcare data process-
ing [9], [112], the training of the algorithms is usually under-
taken in data centers [9], [13]. The pretrained DNN can then be
transferred to be used on memristive crossbars. There are several
different frameworks and tools that can be used to simulate
and facilitate this transition [113]. In a recent study, we have
developed a comprehensive tool named MemTorch, which is
an open source, general, high-level simulation platform that can
fully integrate any behavioral or experimental memristive device
model into crossbar architectures to design MDNNs [108].
Here, we utilize the benchmark biomedical signal processing
task explained in Subsection II-E to demonstrate how pretrained
DNNs can be converted to equivalent MDNNs, and how non-
ideal memristive devices can be simulated within MDNNs prior
to hardware realization. The conversion process, which can be
generalized to other biomedical models using MemTorch, is
depicted in Fig. 7.
The targeted MDNNs are constructed by converting linear
and convolutional layers from PyTorch pre-trained DNNs to
memristive equivalent layers employing 1-Transistor-1-Resistor
(1T1R) crossbars. A double-column scheme, in which two cross-
bars are used to represent positive and negative weight values, is
used to represent network weights within memristive crossbars.
The converted MDNN models are tuned using linear regression,
as described in [108]. The complete and detailed process and
the source code of the network conversion for the experiments
shown in this subsection are provided in a publicly accessible
complementary Jupyter Notebook.3
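The double-column idea can be illustrated with a simplified sketch. Here, signed weights are mapped linearly onto a pair of conductance matrices bounded by 1/ROFF and 1/RON, and the output is recovered from the difference of the two column currents using an analytically known scale factor. This is a deliberate simplification: it does not reproduce the linear-regression tuning used in MemTorch, and the function names and the linear mapping are our own assumptions.

```python
import numpy as np

def weights_to_conductances(W, r_on=100.0, r_off=2500.0):
    """Map signed weights to positive/negative conductance matrices.

    Illustrative linear mapping: positive weights are stored in G_pos,
    negative weights in G_neg, and unused devices stay at 1 / r_off.
    """
    g_on, g_off = 1.0 / r_on, 1.0 / r_off
    w_max = np.abs(W).max()
    if w_max == 0:
        raise ValueError("weight matrix must contain a non-zero entry")
    scale = (g_on - g_off) / w_max  # siemens per unit weight
    G_pos = g_off + scale * np.clip(W, 0.0, None)
    G_neg = g_off + scale * np.clip(-W, 0.0, None)
    return G_pos, G_neg, scale

def differential_vmm(v, G_pos, G_neg, scale):
    """Subtract the two column currents and rescale to weight units."""
    return (v @ G_pos - v @ G_neg) / scale

rng = np.random.default_rng(3)
W = rng.normal(size=(4, 3))   # hypothetical layer weights
v = rng.normal(size=(2, 4))   # two input vectors
G_pos, G_neg, scale = weights_to_conductances(W)
print(np.allclose(differential_vmm(v, G_pos, G_neg, scale), v @ W))
```

With ideal devices the differential read-out reproduces the original layer exactly; the interest of a simulator lies in how this equivalence degrades once non-idealities are introduced.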
During the conversion, any memristor model can be used.
For the benchmark task, a reference VTEAM model [114] is
instantiated using parameters from Pt/Hf/Ti Resistive Random
Access Memory (RRAM) devices [115], to model all memristive
devices within converted linear and convolutional layers. As
already mentioned, memristive devices have inevitable variability,
which should be taken into account when implementing
an MDNN for learning and/or inference. Also depicted in
Fig. 7 are visualizations of two non-ideal device characteristics:
the finite number of conductance states and device-to-device
variability. Using MemTorch [108], not only can we convert any
DNN to an equivalent MDNN utilizing any memristive device
model, we are also able to comprehensively investigate the effect
of various device non-idealities and variation on the performance
of a possible MDNN, before it is physically realized in hardware.
To demonstrate an example which includes variability
in our MDNN simulations, device-to-device variability is
introduced by sampling ROFF for each device from a normal
distribution with mean R̄OFF = 2,500 Ω and standard deviation 2σ,
and RON for each device from a normal distribution with mean
R̄ON = 100 Ω and standard deviation σ.
In Fig. 8, for the converted memristive MLP and CNN that
process APS hand-gesture inputs, we gradually increase σ from
0 to 500, and compare the mean test set accuracy across the
three folds. As can be observed from Fig. 8, with increasing
device-to-device variability, i.e. the variability of RON and ROFF,
the performance degradation increases across all networks. For
all simulations, RON and ROFF are bounded to be positive.
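The sampling procedure described above can be sketched as follows. The means (100 Ω and 2,500 Ω) and the σ and 2σ standard deviations follow the text; how the positivity bound is enforced is not specified here, so clipping to a small floor `r_min` is an assumption of this sketch, as are the function name and the crossbar shape.

```python
import numpy as np

def sample_device_resistances(shape, sigma, r_on_mean=100.0,
                              r_off_mean=2500.0, r_min=1.0, rng=None):
    """Draw per-device RON and ROFF values with device-to-device variability.

    RON ~ N(r_on_mean, sigma) and ROFF ~ N(r_off_mean, 2 * sigma).
    Samples are clipped to r_min so that resistances stay positive
    (the clipping floor is a hypothetical choice).
    """
    rng = np.random.default_rng() if rng is None else rng
    r_on = rng.normal(r_on_mean, sigma, size=shape)
    r_off = rng.normal(r_off_mean, 2.0 * sigma, size=shape)
    return np.maximum(r_on, r_min), np.maximum(r_off, r_min)

rng = np.random.default_rng(4)
for sigma in (0.0, 100.0, 500.0):
    r_on, r_off = sample_device_resistances((64, 64), sigma, rng=rng)
    print(sigma, float(r_on.std()), float(r_off.std()))
```

Because R̄ON is small relative to the largest σ considered, a large share of RON samples reach the bound at high variability, which is one reason accuracy degrades sharply towards the right-hand side of such a sweep.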
6) Memristive DNNs Towards Biomedical Applications:
Although some previous small-scale MDNNs have been
simulated for biomedical tasks such as cardiac arrhythmia
classification [116], or have been implemented on a physical pro-
grammable memristive array for breast cancer diagnosis [117],
3[Online]. Available: https://github.com/coreylammie/TBCAS-Towards-
Healthcare-and-Biomedical-Applications/blob/master/MemTorch.ipynb
Fig. 8. Simulation results investigating the performance of MDNNs for hand
gesture classification adopting non-ideal Pt/Hf/Ti RRAM devices. Device-to-
device variability is simulated using MemTorch [108].
there is currently no large-scale MDNN, even at the simulation-
level, which has realized any practical biomedical processing
tasks.
Similar to the recent advances in CMOS-driven DNN accelerator
chips discussed in Subsection III-A, there have been
promising partial [102] or full [103] realizations of MDNNs
in hardware, which are shown to achieve significant energy
savings compared to state-of-the-art GPUs. However, unlike
their CMOS counterparts, these implementations have been
only able to perform simple tasks such as MNIST and CIFAR
classification. This is, of course, not suitable for implementing
large-scale CNNs and RNNs, which as shown in Subsection
III-A are required for biomedical and healthcare tasks dealing
with image [32] or temporal [33] data types.
In addition, following similar optimization strategies as those
used in CMOS accelerators, [118] has simulated the use of
quantized and binarized MDNNs and their error tolerance in a
biomedical ECG processing task and has shown their potential
to achieve significant energy savings compared to full-precision
MDNNs. However, due to the many intricacies in the design
process and considering the peripheral circuitry that may offset
the benefits gained by using MDNNs, full hardware design
is required before the actual energy saving of such binarized
MDNNs can be verified.
In the next section, we provide our analysis and perspective
on the use of the three hardware technologies discussed in this
section for DL-based biomedical and healthcare applications.
We also discuss how SNN-based neuromorphic processors can
benefit edge-processing for biomedical applications.
IV. ANALYSIS AND PERSPECTIVE
The use of ANNs trained with the backpropagation learning
algorithm in the domain of healthcare and for biomedical appli-
cations such as cancer diagnosis [130] or ECG monitoring [131]
dates back to the early 1990s. These networks were typically
small-scale networks run on normal workstations. As they were
TABLE III
EXISTING HARDWARE IMPLEMENTATIONS AND HARDWARE-BASED SIMULATIONS OF DNN ACCELERATORS USED FOR HEALTHCARE AND BIOMEDICAL
APPLICATIONS, AND GENERIC SNN NEUROMORPHIC PROCESSORS UTILIZED FOR BIOMEDICAL SIGNAL PROCESSING. †SIMULATION-BASED
not deep and did not have too many parameters, they did not demand
high-performance accelerators. However, with the resurgence
of CNNs in the early 2010s followed by the rapid spread
of DNNs and large data-sets, came the need for high-speed
specialized processors. This need resulted in repurposing GPUs
and actively researching other hardware and design technologies
including ASIC CMOS chips (see Table II) and platforms [13],
memristive crossbars and in-memory computing [102], [103],
[109], and FPGA-based designs for DNN training [94], [95] and
inference [92]. Despite notable progress in deploying non-GPU
platforms for DL acceleration, similar to other data processing
tasks, biomedical and healthcare tasks have mainly relied on
standard technologies and GPUs. Currently, depending on the
size of the required DNN, its number of parameters, as well as the
available training dataset size, biomedical DL tasks are usually
“trained” on high-performance workstations with one or more
GPUs [12], [19], on customized proprietary processors such as
Google TPU [9], or on various Infrastructure-as-a-Service (IaaS)
provider platforms, including Nvidia GPU cloud, Google Cloud,
and Amazon Web Services, among others. This is mostly due
to (i) the convenience these platforms provide using high-level
languages such as Python; (ii) the availability of wide-spread and
open-source DL libraries such as TensorFlow and PyTorch; and
(iii) strong community and/or provider support in utilizing GPUs
and IaaS for training various DNN algorithms and applications.
However, DL inference can benefit from further research
and development on emerging and mature hardware and design
technologies, such as those discussed in this paper, to open
up new opportunities for deploying healthcare devices closer
to the edge, paving the way for low-power and low-cost DL
accelerators for PoC devices and healthcare IoT. Despite this
fact, hardware implementations of biomedical and healthcare
inference engines are very sparse. Table III lists a summary
of the available hardware implementations and hardware-based
simulations of DNNs used for healthcare and biomedical signal
processing applications, using the three hardware technologies
covered herein. In addition, the table shows existing biomedi-
cal signal processing tasks implemented on generic low-power
spiking neuromorphic processors.
A. CMOS Technology Has Been the Main Player for DL
Inference in the Biomedical Domain
Similarly to general-purpose GPUs, all other non-GPU DL
inference engines at present are implemented in CMOS. There-
fore, it is obvious that most of the future edge-based biomedical
platforms would rely on these inference platforms. In Table II,
we listed a number of these accelerators that are mainly devel-
oped for low-power mobile applications. However, before the
deployment of any edge-based DL accelerators for biomedical
and healthcare tasks, some challenges need to be overcome.
A non-exhaustive list of these obstacles includes: (i) the power
and resource constraints of available mobile platforms which,
despite significant improvements, are still not suitable for high-
risk medical tasks; (ii) the need to verify that a DL system
can generalize beyond the distribution it is trained and
tested on; (iii) bias that is inherent to datasets which may have
adverse impacts on classification across different populations;
(iv) confusion surrounding the liability of AI algorithms in
high-risk environments [132]; and (v) the lack of a streamlined
workflow between medical practitioners and DL. While the latter
challenges are matters of legality and policy, the former issues
highlight the fundamental need to understand where dataset
bias comes from, and how to improve our understanding of
TABLE IV
NEUROMORPHIC PLATFORMS USED FOR BIOMEDICAL SIGNAL PROCESSING
why neural networks learn the features they do, such that they
may generalize across populations in a manner that is safe for
receivers of medical care.
In addition, to make the use of any accelerators possible
for general as well as more complex biomedical applications,
the field requires strong hardware-software co-design to build
hardware that can be readily programmed for biomedical tasks.
One successful co-design is the Google TPU [13], which has
successfully been used to surpass human experts in medical
imaging tasks [9]. Google has used a similar CMOS TPU
technology to design inference engines [52], which are very
promising as edge hardware to enable mobile healthcare
applications. The main reason for this promise is the availability
of the established software platforms (such as TensorFlow Lite)
and the community support for the Google TPU.
Overall, great advancements have happened for DL accelerators
in the past several years and they are currently permeating
various aspects of our lives, from self-driving cars to smart
personal assistants. After overcoming a number of obstacles
such as those mentioned above, we may be also able to widely
integrate these DL accelerators in healthcare and biomedical
applications. However, for some medical applications such as
monitoring that requires always-on processing, we still need
systems with orders of magnitude better power efficiency, so they
can run on a simple button battery for a long time. To achieve
such systems, one possible approach is to process data only when
available and make our processing asynchronous. A promising
method to achieve such goals is the use of brain-inspired
SNN-based neuromorphic processors.
B. Towards Edge Processing for Biomedical Applications
With Neuromorphic Processors
Although most of the efforts presented in this work focused
on DNN accelerators, there are also notable efforts in the domain
of SNN processors that offer complementary advantages, such
as the potential to reduce the power consumption by multiple
orders of magnitude, and to process the data in real time. These
so-called neuromorphic processors are ideal for end-to-end pro-
cessing scenarios, e.g., in wearable devices where the streaming
input needs to be monitored in continuous time in an always-on
manner.
There are already some works using both mixed analog-
digital and digital neuromorphic platforms for biomedical tasks,
showing promising results for always-on embedded biomedical
systems. Table IV shows a summary of today’s large scale
neuromorphic processors, used for biomedical signal process-
ing. The first chip presented in this table is DYNAP-SE [133],
a multi-core mixed-signal neuromorphic implementation with
analog neural dynamics circuits and event-based asynchronous
routing and communication. The DYNAP-SE chip has been
used to implement four of the seven SNN processing systems
listed in Table III. These SNNs are used for EMG [120], [121]
and ECG [44], [119] signal processing. The DYNAP-SE was
also used to build a spiking perceptron as part of a design
to classify and detect High-Frequency Oscillations (HFO) in
human intracranial EEG [49].
In [44], [119], [120] a spiking RNN is used to integrate the
ECG/EMG patterns temporally and separate them in a linear
fashion to be classifiable with a linear read-out. A Support
Vector Machine (SVM) and a linear least-squares approximation
are used in the read-out layer in [44] and [119], reaching overall
accuracies of 91% and 95% for anomaly detection,
respectively. In [120], the timing and dynamic features of the
spiking RNN on EMG recordings were investigated for classifying
different hand gestures. In [121] the performance of
a feedforward SNN and a hardware-friendly spiking learning
algorithm for hand gesture recognition using superficial EMG
was investigated and compared to traditional machine learning
approaches, such as SVM. Results show that applying SVM on
the spiking output of the hidden layer achieved a classification
rate of 84%, and the spiking learning method achieved 74% with
a power consumption of about 0.05 mW. This was compared
to a state-of-the-art embedded system, showing that the proposed
spiking network is two orders of magnitude more power efficient
[134], [135].
The other neuromorphic platforms listed in Table IV include
digital architectures such as SpiNNaker [136], TrueNorth [137]
and Loihi [138]. SpiNNaker has been used for EMG and EEG
processing and the results show improved classification accu-
racy compared to traditional machine learning methods [122].
In [123], the authors developed a framework for decoding EEG
and LFP using CNNs. The network was first developed in Caffe
and the result was then used as a basis for building a TrueNorth-
compatible neural network. The TrueNorth-compatible network
achieved the highest classification accuracy, at approximately 76%.
In [124], [125], the authors present a low-power neuromor-
phic platform named Spike-input Extreme Learning Machine
TABLE V
COMPARISON OF CONVENTIONAL DNNS IMPLEMENTED ON VARIOUS HARDWARE PLATFORMS WITH SPIKING DNN NEUROMORPHIC SYSTEMS ON THE
BENCHMARK BIOMEDICAL SIGNAL PROCESSING TASK OF HAND GESTURE RECOGNITION FOR BOTH SINGLE SENSOR AND SENSOR FUSION, AS EXPLAINED IN
SUBSECTION II-E. THE RESULTS OF THE ACCURACY ARE REPORTED WITH MEAN AND STANDARD DEVIATION OBTAINED OVER A 3-FOLD CROSS VALIDATION.
LOIHI, EMBEDDED GPU, AND ODIN+MORPHIC IMPLEMENTATION RESULTS ARE FROM [45]. THE DNN ARCHITECTURES ADOPTED ARE AS FOLLOWS:
8C3-2P-16C3-2P-32C3-512-5 CNN. †16-128-128-5 MLP. ‡16-230-5 MLP. ∓4 × 400-210-5 MLP. ∪EMG AND APS/DVS NETWORKS ARE FUSED USING A
5-NEURON DENSE LAYER
(SELMA), which performs continuous state decoding towards
fully-implantable wireless intracortical BMI. Recently, the
benchmark hand-gesture classification introduced in Subsection
II-E, was processed and compared on two additional digital neu-
romorphic platforms, Loihi and ODIN/MorphIC [139], [140].
A spiking CNN was implemented on Loihi and a spiking MLP
was implemented on ODIN/MorphIC [45]. The results achieved
using these networks are presented in Table V.
On-chip adaptation and learning mechanisms, such as those
present in some of the neuromorphic devices listed in Table IV,
could be a game changer for personalized medicine, where
the system can adapt to each patient’s unique bio signature
and/or drift over time. However, the challenge of implementing
efficient on-chip online learning in these types of neuromorphic
architectures has not yet been solved. This challenge rests on two
main factors: locality of the weight update and weight storage.
Locality: There is a hardware constraint that the learning
information for updating the weights of any on-chip network
should be locally available to the synapse; otherwise, most of the
silicon area would be consumed by the wires required to route
the update information to it. As Hebbian learning satisfies this
requirement, most of the available on-chip learning algorithms
focus on its implementation in the form of unsupervised/semi-
supervised learning [139], [141]. However, local Hebbian-based
algorithms are limited to learning static patterns or using very
shallow networks [142]. There are also some efforts in the
direction of on-chip gradient-descent based methods, which im-
plement on-chip error-based learning algorithms where the least
mean square of a neural network cost function is minimized.
For example, the spike-based delta rule is the most common weight
update used for single-layer networks, and is the basis of
the back-propagation algorithm used in the vast majority of
current multi-layer neural networks. Single-layer mixed-signal
neuromorphic circuit implementations of the delta rule have
already been designed [143] and employed for EMG classifi-
cation [121]. Expanding this to multi-layer networks involves
non-local weight updates, which limits its on-chip implementa-
tion. Making the backpropagation algorithm local is a topic of
on-going research [144]–[146].
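To make the locality argument concrete, the following is a minimal, rate-based sketch of the delta rule for a single linear layer. It is not the spike-based circuit of [143]; the function name `delta_rule_update`, the learning rate, and the synthetic data are assumptions made for illustration only. The point it shows is that each synapse needs only the error of its own output neuron and the activity of its own input.

```python
import numpy as np

def delta_rule_update(weights, pre_activity, target, learning_rate=0.05):
    """One delta-rule step for a single linear layer (illustrative, rate-based).

    weights:      (n_out, n_in) synaptic weights
    pre_activity: (n_in,) presynaptic activity, e.g. filtered spike rates
    target:       (n_out,) desired output
    """
    output = weights @ pre_activity
    error = target - output  # one error value per output neuron
    # Synapse (i, j) only uses error[i] and pre_activity[j]: the update is local.
    return weights + learning_rate * np.outer(error, pre_activity)

# Synthetic, noise-free regression task (assumed data, not from the paper).
rng = np.random.default_rng(0)
inputs = rng.random((20, 4))
true_map = np.array([[1.0, -1.0, 0.5, 0.0],
                     [0.0, 2.0, -0.5, 1.0]])
targets = inputs @ true_map.T

def mean_squared_error(w):
    return float(np.mean((targets - inputs @ w.T) ** 2))

weights = np.zeros((2, 4))
initial_error = mean_squared_error(weights)
for _ in range(200):
    for x, t in zip(inputs, targets):
        weights = delta_rule_update(weights, x, t)
final_error = mean_squared_error(weights)
```

In a multi-layer network, the error term for a hidden synapse would depend on weights and errors in downstream layers, which is exactly the non-local information the text refers to.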
Weight storage: The holy grail of weight storage for online
on-chip learning is a memory with non-volatile properties whose
state can change linearly in an analog fashion. Non-volatile
memristive devices provide great potential for this. Therefore,
there is a large body of literature on combining the maturity of
CMOS technology with the potential of the emerging memories
to take the best out of the two worlds.
The integration of CMOS technology with that of the emerg-
ing devices has been demonstrated for non-volatile filamentary
switches [147] already at a commercial level [148]. There have
also been some efforts in combining CMOS and memristor
technologies to design supervised local error-based learning
circuits using only one network layer by exploiting the properties
of memristive devices [143], [149], [150].
Apart from the above-mentioned benefits in utilizing mem-
ristive devices for online learning in SNN-based neuromorphic
chips, as discussed in Subsection III-C, memristive devices have
also shown interesting features to improve the power consump-
tion and delay of conventional DNNs. However, as shown in
Table III, memristor-based DNNs are very sparse in the biomed-
ical domain, and existing works are largely based only on
simulation.
C. Why is the Use of MDNNs Very Limited in the
Biomedical Domain?
Currently there are very few hardware implementations of
biomedical MDNNs that make use of general programmable
memristive-CMOS, and only one programmed to construct
an MLP for cancer diagnosis. We could also find two other
memristive designs in literature for biomedical applications
(shown in Table III), but they are only simulations considering
memristive crossbars. This sparsity is despite the significant
advantages that memristors provide in MAC parallelization and
the in-memory computing paradigm, while being compatible with
CMOS technology [151]. These features make memristors ideal
candidates for DL accelerators in general, and for portable and
edge-based healthcare applications in particular, because they
have stringent device size and power consumption requirements.
To be able to use memristive devices in the biomedical domain,
though, several of their shortcomings such as limited endurance,
mismatch, and analog noise accumulation must be overcome
first. This demands further research on the materials, as well
as on the circuit and system design sides of this emerging technol-
ogy, alongside the development of supporting open-source
software [108] for MDNNs. Furthermore, investigating
the same techniques utilized in developing CMOS-based DL
accelerators such as limited precision data representation [109],
[118] and approximate computing schemes can lead to advances
in developing MDNNs and facilitate their use in biomedical
domains.
D. Why and When to Use FPGA for Biomedical DNNs?
Table III shows that FPGA is a popular hardware technology
for implementing simple DL networks such as MLPs [97]–[99],
[126] and in a few cases, more complex LSTMs and CNNs [100],
[127]–[129]. The table also shows that FPGAs are mainly used
for signal processing tasks and have not been widely used to
run complex DL architectures such as CNNs. This is mainly
because they have limited on-chip memory and low bandwidth
compared to GPUs. However, they demonstrate notable benefits
in terms of significantly shorter development time compared
to ASICs, and much lower power consumption than typical
GPUs. Besides, significant power and latency improvement
can be gained by customizing the implementation of various
components of a DNN on an FPGA, compared to running it on a
general-purpose CPU or GPU [98], [100]. For instance, in [100],
EEG signals are processed on FPGAs using two customized
hardware blocks for (i) parallelizing MAC operations and (ii)
efficient recurrent state updates, both of which are key elements
of LSTMs. This has resulted in almost an order of magnitude
higher power efficiency than GPUs. This efficiency is critical
in many edge-computing applications including DNN-based
point-of-care biomedical devices [21] and healthcare IoT [20],
[64].
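The two LSTM elements named above, the MAC operations and the recurrent state update, can be seen in a plain software reference of one LSTM time step. This is a generic NumPy sketch, not the hardware design of [100]; the stacked gate ordering, the layer sizes, and the helper `macs_per_step` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * n_hidden, n_in), U: (4 * n_hidden, n_hidden), b: (4 * n_hidden,)
    Assumed gate order in the stacked matrices: input, forget, candidate, output.
    """
    # (i) MAC-dominated part: two matrix-vector products, easy to parallelize.
    pre_activation = W @ x + U @ h_prev + b
    i, f, g, o = np.split(pre_activation, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    # (ii) Recurrent state update: purely element-wise operations.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def macs_per_step(n_in, n_hidden):
    """Multiply-accumulate operations in the two matrix-vector products."""
    return 4 * n_hidden * (n_in + n_hidden)

# Example with assumed sizes: 8 input channels, 16 hidden units.
rng = np.random.default_rng(0)
n_in, n_hidden = 8, 16
W = rng.normal(0.0, 0.1, (4 * n_hidden, n_in))
U = rng.normal(0.0, 0.1, (4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden), W, U, b)
total_macs = macs_per_step(n_in, n_hidden)
```

For these sizes the matrix-vector products account for 1536 MACs per step, while the state update needs only a few element-wise operations per hidden unit, which suggests why the two parts benefit from separate hardware blocks.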
Another benefit of FPGAs is that a customized efficient
FPGA design can be directly synthesized into an ASIC using
a nanometer-node CMOS technology to achieve even more
benefits [128], [129]. For instance, [100] has shown a nearly
100-fold energy-efficiency improvement as an ASIC in a 15-nm
CMOS technology, compared to its FPGA counterpart.
Although low-power consumption and affordable cost are
two key factors for almost any edge-computing or near-sensor
device, these are even more important for biomedical devices
such as wearables, health-monitoring systems, and PoC de-
vices. Therefore, FPGAs present an appealing solution, where
their limitations can be addressed for a customized DNN using
specific design methods such as approximate computing [95]
and limited-precision data [92], [94], depending on the cost,
required power consumption, and the acceptable accuracy of
the biomedical device.
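As an illustration of the limited-precision trade-off mentioned above, the sketch below applies uniform symmetric quantization to a weight vector and measures the resulting error. It is a generic scheme, not the specific method of [92] or [94]; the function names and the random test weights are assumptions.

```python
import numpy as np

def quantize_symmetric(weights, n_bits=8):
    """Uniform symmetric quantization of weights to signed n_bits integers."""
    q_max = 2 ** (n_bits - 1) - 1
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / q_max if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -q_max, q_max).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate real-valued weights."""
    return q.astype(np.float64) * scale

# Compare 8-bit and 4-bit representations of the same (assumed) weights.
rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
q8, scale8 = quantize_symmetric(weights, n_bits=8)
q4, scale4 = quantize_symmetric(weights, n_bits=4)
error8 = float(np.max(np.abs(weights - dequantize(q8, scale8))))
error4 = float(np.max(np.abs(weights - dequantize(q4, scale4))))
```

The worst-case error is bounded by half a quantization step, so halving the bit width from 8 to 4 enlarges the step, and hence the error bound, by roughly a factor of 18 (127/7). Whether that loss is acceptable depends on the accuracy the biomedical device must retain.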
Another class of programmable low-power devices that can be used in
biomedical applications is Field Programmable Analog Arrays
(FPAAs). These are constructed using programmable Computa-
tional Analog Blocks (CABs) and interconnects. Unlike FPGAs,
FPAAs tend to be more application driven than general purpose
as they may be current mode or voltage mode devices [152].
FPAAs have been shown to perform computation with 1000
times more power efficiency while reducing the required area
by 100 times when compared to FPGAs [153]. Therefore, they
are a promising candidate for accelerating biomedical signal
processing if machine learning algorithms such as ANNs can be
implemented using them.
In 2003, [154] explored ANNs with differential feedback, and
in 2006 [155] implemented an ANN using multi-chip FPAAs.
More recently, [156] have demonstrated that VMMs can be ef-
ficiently computed using FPAAs, which can be used to compute
linear and unrolled convolution layers within DNNs. However,
while FPAAs have been used in several biomedical applications
ranging from knee-joint rehabilitation [153] to the amplification
of various bio-electric signals [157], the implementation of an
FPAA DNN accelerator, which can be used in biomedical and
general applications, is yet to be explored.
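The claim that VMMs suffice for unrolled convolution layers can be checked in software. The sketch below unrolls image patches and kernels so that a 2-D convolution (in the cross-correlation sense used by DNNs) becomes a set of VMMs, and compares the result with a direct sliding-window computation. It assumes stride 1 and no padding; the helper names are hypothetical and unrelated to any FPAA toolchain.

```python
import numpy as np

def im2col(image, k):
    """Unroll every k x k patch of a 2-D image (stride 1, no padding) into a row."""
    h, w = image.shape
    out_h, out_w = h - k + 1, w - k + 1
    patches = np.empty((out_h * out_w, k * k))
    for r in range(out_h):
        for c in range(out_w):
            patches[r * out_w + c] = image[r:r + k, c:c + k].ravel()
    return patches, (out_h, out_w)

def conv2d_as_vmm(image, kernels):
    """kernels: (n_kernels, k, k). Returns (n_kernels, out_h, out_w)."""
    n_kernels, k, _ = kernels.shape
    patches, (out_h, out_w) = im2col(image, k)
    # One unrolled kernel per column of the weight matrix.
    weight_matrix = kernels.reshape(n_kernels, k * k).T
    out = patches @ weight_matrix  # each row of `patches` is one VMM input
    return out.T.reshape(n_kernels, out_h, out_w)

def conv2d_direct(image, kernels):
    """Reference sliding-window implementation."""
    n_kernels, k, _ = kernels.shape
    out_h, out_w = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((n_kernels, out_h, out_w))
    for n in range(n_kernels):
        for r in range(out_h):
            for c in range(out_w):
                out[n, r, c] = np.sum(image[r:r + k, c:c + k] * kernels[n])
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6))
kernels = rng.normal(size=(8, 3, 3))
vmm_result = conv2d_as_vmm(image, kernels)
direct_result = conv2d_direct(image, kernels)
```

Here an 8-kernel 3 × 3 layer reduces to a single 9 × 8 weight matrix that is reused for all 16 output positions, which is the form an analog VMM block can compute.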
E. Benchmarking EMG Processing Across Multiple DNN and
SNN Hardware Platforms
In Table V, we compare our FPGA and memristive im-
plementations to other DNN accelerators and neuromorphic
processors from [45]. In [45], the authors presented a sensor
fusion neuromorphic benchmark for hand-gesture recognition
based on EMG and an event-based camera. Two neuromorphic
platforms, Loihi [138] and ODIN+MorphIC [139], [140], were
deployed and the results were compared to traditional machine
learning baselines implemented on an embedded GPU, the
NVIDIA Jetson Nano. Loihi and ODIN+MorphIC are digital
neuromorphic platforms. Loihi is a 128-core neuromorphic chip
fabricated in a 14 nm FinFET process, designed by Intel Labs. It
implements adaptive self-modifying event-driven fine-grained
parallel computations used to implement learning and inference
with high efficiency. ODIN (Online-learning Digital spiking
Neuromorphic) is designed using 28 nm FDSOI CMOS technol-
ogy and consists of a single neurosynaptic core with 256 neurons
and 256² synapses that embed a 3-bit weight and a mapping table
bit that allows enabling or disabling Spike-timing-dependent
plasticity (STDP). MorphIC is a quad-core digital neuromorphic
processor with 2 k Leaky Integrate and Fire (LIF) neurons and
more than 2 M synapses in 65 nm CMOS technology [140].
They can be either programmed with offline-trained weights or
trained online with a stochastic version of Spike Driven Synaptic
Plasticity (SDSP).
For the spiking architectures shown in Table V, the vision
input and EMG data were individually processed using spiking
CNN and spiking MLP respectively, and fused in the last layer.
Loihi was trained using SLAYER [158], a backpropagation
framework used for evaluating the gradient of any kind of SNN.
It is a dt-based SNN backpropagation algorithm that keeps track
of the internal membrane potential of the spiking neuron and uses
it during gradient propagation. Training for both ODIN and MorphIC
was carried out in Keras with quantization-aware stochastic
gradient descent following a standard ANN-to-SNN mapping
approach.
The dataset used is described in Section II-E. It is a collec-
tion of 5 hand gestures from sign language (e.g., ILY).4 In the
comparison presented in Table V, the input and hidden layers are
followed by the ReLU activation function, and output layers
are fed through Softmax activation functions to determine class
probabilities. Dropout layers are used in all networks to avoid
over-fitting. The DNN architectures are given in the table
caption.
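As a rough illustration of this layer arrangement, the sketch below builds the 16-128-128-5 MLP listed in the Table V caption with ReLU hidden layers, optional dropout, and a Softmax output. The weight initialization, the dropout rate, and the random input are assumptions; the trained weights and the training procedure of the benchmark are not reproduced here.

```python
import numpy as np

LAYER_SIZES = [16, 128, 128, 5]  # 16-128-128-5 MLP from the Table V caption

def init_params(sizes, seed=0):
    """He-style random initialization (an assumption, not the benchmark's)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the maximum for numerical stability
    return e / e.sum()

def forward(params, x, drop_rate=0.0, rng=None):
    """ReLU hidden layers with optional inverted dropout, Softmax output."""
    a = x
    for W, b in params[:-1]:
        a = np.maximum(0.0, W @ a + b)
        if drop_rate > 0.0 and rng is not None:  # dropout is used in training only
            mask = rng.random(a.shape) >= drop_rate
            a = a * mask / (1.0 - drop_rate)
    W, b = params[-1]
    return softmax(W @ a + b)

def count_parameters(sizes):
    """Weights plus biases over all dense layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

params = init_params(LAYER_SIZES)
features = np.random.default_rng(1).random(16)  # placeholder 16-element input
class_probabilities = forward(params, features)
n_parameters = count_parameters(LAYER_SIZES)
```

This network has 19,333 trainable parameters, which gives a sense of how small the benchmark models are relative to typical computer vision networks.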
The platforms used for each system in Table V are as follows:
ODIN+MorphIC [139], [140] and Loihi [138] neuromorphic
platforms were used for spiking implementations; NVIDIA
Jetson Nano was used for all embedded GPU implementations;
OpenVINO Toolkit FPGA was used for all FPGA implementa-
tions, and MemTorch [108] was used for converting the MLP
and CNN networks to their corresponding MDNNs to determine
the test set accuracies of all memristive implementations.
From Table V, it can be observed that, when transitioning
from generalized architectures to application specific proces-
sors, more optimized processing of a subset of given tasks can
be achieved. Moving up the specificity hierarchy from GPU
to FPGA to memristive networks shows orders of magnitude
of improvement in both MLP and CNN processing, but natu-
rally at the expense of a generalizable range of tasks. While
GPUs are relatively efficient at training networks (compared to
CPUs), the impressive metrics presented by memristors (RRAM
in these simulations) are coupled with limited endurance. This is
not an issue for read-only tasks, as is the case with inference,
but training is thwarted by the thousands of epochs of weight
updates, which limits the broad use of RRAMs in training. Rather,
more exploration in alternative resistive-based technologies such
as Magnetoresistive Random Access Memory (MRAM) could
prove beneficial for tasks that demand high endurance.
4[Online]. Available: https://zenodo.org/record/3663616#.X2m5GC2cbx4.
Further implementation details can be found in [45].
After determining the test set accuracy of each MDNN using
MemTorch [108], we determined the energy required to per-
form inference on a single input, the inference time, and the
Energy-Delay Product (EDP) by adopting the metrics in [159],
for a tiled memristor architecture. All assumptions made in our
calculations are listed below. Parameters are adopted from those
given in a 1T1R 65 nm technology, where the maximum current
during inference is 3 μA per cell with a read voltage of 0.3 V. Each
cell is capable of storing 8 bits with a resistance ratio of 100,
and mapping signed weights is achieved using a dual column
representation. All convolutions are performed by unrolling the
kernels and performing VMMs, and the fully connected layers
have the fan-in weights for a single neuron assigned to one
column. Each crossbar has an aspect ratio of 256 × 64 to enable
more analog operations per ADC when compared to a 128 ×
128 array. Where there is insufficient space to map weights to
a single array, they are distributed across multiple arrays, with
their results to be added digitally. Throughput can be improved
at the expense of additional arrays for convolutional layers, by
duplicating kernels such that multiple inputs can be processed in
parallel. The number of tiles used for each network is assumed
to be the exact number required to balance the processing time
of each layer. The power consumption of each current-mode
8-bit ADC is estimated to be 2 × 10⁻⁴ W with an operating
frequency of 40 MHz (5 MHz for bit-serial operation) [159].
The ADC latency is presumed to dominate digital addition of
partial products from various tiles. The dynamic range of each
ADC has been adapted to the maximum possible range for each
column, and each ADC occupies a pair of columns.
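Two of the mapping assumptions above, the dual column representation of signed weights and the distribution of a layer across 256 × 64 crossbars, can be sketched as follows. The linear weight-to-conductance mapping, the normalized conductance unit `g_max`, and both function names are assumptions for illustration; this is neither the MemTorch API nor the exact procedure of [159], and it ignores 8-bit quantization and device non-idealities.

```python
import math
import numpy as np

ROWS, COLS = 256, 64      # crossbar dimensions stated in the text
RESISTANCE_RATIO = 100.0  # on/off resistance ratio stated in the text

def dual_column_map(weights, g_max=1.0):
    """Map signed weights of shape (n_in, n_out) to two non-negative conductance
    matrices so that the effective weight is proportional to g_pos - g_neg.

    g_max is in arbitrary normalized units (an assumption).
    """
    g_min = g_max / RESISTANCE_RATIO
    w_max = float(np.max(np.abs(weights)))
    scale = (g_max - g_min) / w_max if w_max > 0 else 0.0
    g_pos = g_min + scale * np.clip(weights, 0.0, None)
    g_neg = g_min + scale * np.clip(-weights, 0.0, None)
    return g_pos, g_neg, scale

def crossbars_needed(fan_in, n_neurons):
    """Crossbars for one dense layer: fan-in along rows, two columns per neuron."""
    row_tiles = math.ceil(fan_in / ROWS)
    col_tiles = math.ceil(2 * n_neurons / COLS)
    return row_tiles * col_tiles

weights = np.array([[0.5, -1.0],
                    [0.0, 0.25]])
g_pos, g_neg, scale = dual_column_map(weights)
recovered = (g_pos - g_neg) / scale

tiles_512_to_5 = crossbars_needed(512, 5)        # fan-in exceeds 256 rows
tiles_128_to_128 = crossbars_needed(128, 128)    # 256 columns needed
```

When the fan-in exceeds 256 rows, as in the 512-to-5 example, the partial sums from the two row tiles are the results that the text describes as being added digitally.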
The above presumptions lead to pre-silicon results that are
extremely promising for memristor arrays, as shown in Table V.
But it should be clear that these calculations were performed
for network-specific architectures, rather than a more general
application-specific use-case. That is, we assume the chip has
been designed for a given neural network model. The other
comparison benchmarks are far more generalizable, in that they
are suited to not only handle most network topologies, but are
also well-suited for training. The substantial improvement of
inference time over other methods is a result of duplicate weights
being mapped to enable higher parallelism, which is tolerable
for small architectures, but leads to prohibitively large ADC
power consumption for computer vision tasks which rely on
deep networks and millions of parameters, such as VGG-16. In
addition, the area of each ADC is estimated to be 3 × 10⁻³ mm²,
which is roughly four orders of magnitude larger than the area of each
RRAM cell (1.69 × 10⁻⁷ mm²). This disparity implies that pitch-
matching is not viable. Instead, to achieve parallelism, weights
must be duplicated across tiles which demands redundancy. This
improvement in parallelism thus comes at the cost of additional
area and power consumption. The use of memristors as synapses
in spike-based implementations may be more appropriate, so
as to reduce the ADC overhead by replacing multi-bit ADCs
with current sense amplifiers instead, and reducing the reliance
on analog current summation along resistive and capacitive
bit-lines.
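The scale of the ADC overhead follows directly from the figures quoted above. The short calculation below uses only those stated parameters; treating ADC energy per conversion as power divided by operating frequency is a simplifying assumption, and the exact accounting in [159] may differ.

```python
# Parameters as stated in the text.
READ_VOLTAGE_V = 0.3
MAX_CELL_CURRENT_A = 3e-6
ADC_POWER_W = 2e-4
ADC_FREQUENCY_HZ = 40e6
ADC_AREA_MM2 = 3e-3
CELL_AREA_MM2 = 1.69e-7
CROSSBAR_ROWS, CROSSBAR_COLS = 256, 64

# Upper bound on read power of one cell: P = V * I.
max_cell_power_w = READ_VOLTAGE_V * MAX_CELL_CURRENT_A

# Assumed estimate: energy per ADC conversion = power / conversion rate.
adc_energy_per_conversion_j = ADC_POWER_W / ADC_FREQUENCY_HZ

# Area comparison between one ADC, one cell, and one full crossbar of cells.
adc_to_cell_area_ratio = ADC_AREA_MM2 / CELL_AREA_MM2
crossbar_cell_area_mm2 = CROSSBAR_ROWS * CROSSBAR_COLS * CELL_AREA_MM2

def energy_delay_product(energy_j, delay_s):
    """EDP as the product of inference energy and inference time."""
    return energy_j * delay_s
```

Under these numbers a cell dissipates at most 0.9 μW during a read, one ADC conversion costs about 5 pJ, and a single ADC occupies about as much area as the roughly 16 thousand cells of an entire 256 × 64 array (about 2.8 × 10⁻³ mm²). This is consistent with the observation that ADCs, not the memristive cells, dominate area and power.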
Spike-based hardware shows approximately two orders of
magnitude improvement in the EDP in Table V when
compared to its GPU and FPGA counterparts, which high-
lights the prospective use of such architectures in always-on
monitoring. This is necessary for enhancing the prospect of
ambient-assisted living, which would allow medical resources to
be freed up for tasks that are not suited for automation. In general,
one would expect that data should be processed in its natural
form. For example, 2D CNNs do not discard the spatial relations
between pixels in an image. Graph networks are optimized
for connectionist data, such as the structure of proteins. By
extension, the discrete events generated by electrical impulses
such as in EMGs, EEGs and ECGs may be best suited to
SNNs. Of course, this discounts any subthreshold firing patterns
of measured neuron populations. But one possible explanation
for the suitability of spiking hardware for biological processes
stems from the natural timing of neuronal action potentials.
Individual neurons will typically not fire in excess of 100 Hz,
and the average heart rate (and correspondingly, ECG spiking
rate) will not exceed 3 Hz. There is a clear mismatch between the
clock rate of non-spiking neural network hardware, which tends
to at least be in the MHz range, and spike-driven processes. This
introduces a significant amount of wastage in processing data
when there is no new information to process (e.g., in between
heartbeats, action potentials, or neural activity).
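The size of this mismatch can be estimated with simple arithmetic. The sketch below assumes a 1 MHz clock, taken as the low end of the "MHz range" mentioned above, and uses the 100 Hz and 3 Hz event rates from the text; the function name is hypothetical.

```python
def clock_cycles_per_event(clock_hz, event_rate_hz):
    """Average number of clock cycles that elapse between two events."""
    return clock_hz / event_rate_hz

ASSUMED_CLOCK_HZ = 1e6  # low end of the MHz range (assumption)

cycles_per_neuron_spike = clock_cycles_per_event(ASSUMED_CLOCK_HZ, 100.0)
cycles_per_heartbeat = clock_cycles_per_event(ASSUMED_CLOCK_HZ, 3.0)
```

Even at this conservative clock rate, about 10⁴ cycles pass between spikes of a neuron firing at 100 Hz, and more than 3 × 10⁵ cycles pass between heartbeats at 3 Hz. A clocked processor that evaluates its network every cycle would spend nearly all of them on unchanged input, whereas event-driven hardware computes only when an event arrives.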
Nonetheless, it is clear that accuracy is compromised when
relying on EMG signals alone, based on the approximately
10% decrease of classification accuracy on the Loihi chip and
ODIN+MorphIC, compared with their GPU/FPGA counterparts. This
could be a result of spike-based training algorithms lagging
behind in maturity compared to conventional neural network
methods, or it could be an indication that critical information
is being discarded when neglecting the subthreshold signals
generated by populations of neurons. But when EMG and
DVS data are combined, the fused spiking signals reinforce
each other, giving an
approximately 4% accuracy improvement, whereas combining
non-spiking, mismatched data representations leads to marginal
improvements, or even a destructive effect (e.g., non-spiking
CNN implementation on FPGA and memristive arrays). This
may be a result of EMG and APS data taking on completely
different structures. This is a possible indication that feature
extraction from merging the same structural form of data (i.e.,
as spikes) proves to be more beneficial than combining a pair
of networks with two completely different modes of data (i.e.,
EMG signals with pixel-driven images). This allows us to draw
an important hypothesis: neural networks can benefit from a
consistent representation of data generated by various sensory
mechanisms. This is supported by biology, where all biological
interpretations are typically represented by graded or spiking
action potentials.
F. Deep Network Accelerators and
Patient-Specific Model Tuning
Given the inherent variability between patients, it is difficult
to train and deploy a single model to a large group of individ-
uals each with unique signature(s). Consequently, significant
efforts are being made to facilitate patient-specific model tuning
processes [72], [160], [161]. Patient-specific Modeling (PSM)
is the development of computational models of human or ani-
mal pathophysiology that are individualized to patient-specific
data [160].
In the DL domain, existing ANN and neuromorphic models
can be retrofitted to specific patients using transfer learning and
tuning algorithms. In this approach, the network is first trained on
a large dataset including data from various patients to acquire the
domain-specific knowledge of the targeted task. Parts of the large
network are then retrained, i.e. tuned, using patient-specific data,
to produce better performance for individual patients. This way,
the domain-specific features of the large network are transferred
to the smaller network that is retrained to learn patient-specific
features [72]. Depending on the availability of patient-specific
data, PSM can be performed online (on-chip) or offline (off-
chip).
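The tuning procedure described above can be sketched with a frozen feature layer and a small retrained output layer. Everything in this example is synthetic and assumed: random weights stand in for a generically pre-trained network, the "patient" data are random vectors with labels generated by a hidden patient-specific mapping, and plain gradient descent on a Softmax cross-entropy loss stands in for whatever tuning algorithm a real system would use.

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def extract_features(frozen_weights, x):
    """Frozen, generically trained layer: never updated during tuning."""
    return np.maximum(0.0, x @ frozen_weights.T)

def cross_entropy(head, feats, labels):
    probs = softmax_rows(feats @ head.T)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def tune_head(head, feats, labels, learning_rate=0.05, steps=300):
    """Retrain only the output layer on patient-specific data."""
    head = head.copy()
    one_hot = np.eye(head.shape[0])[labels]
    for _ in range(steps):
        probs = softmax_rows(feats @ head.T)
        gradient = (probs - one_hot).T @ feats / len(labels)
        head -= learning_rate * gradient
    return head

rng = np.random.default_rng(0)
frozen_weights = rng.normal(0.0, 0.25, (32, 16))  # stands in for pre-trained layers
generic_head = rng.normal(0.0, 0.1, (5, 32))      # stands in for the generic output layer

# Synthetic patient recordings, labelled by a hidden patient-specific mapping.
patient_x = rng.normal(size=(60, 16))
patient_feats = extract_features(frozen_weights, patient_x)
hidden_patient_head = rng.normal(size=(5, 32))
patient_y = np.argmax(patient_feats @ hidden_patient_head.T, axis=1)

loss_before = cross_entropy(generic_head, patient_feats, patient_y)
patient_head = tune_head(generic_head, patient_feats, patient_y)
loss_after = cross_entropy(patient_head, patient_feats, patient_y)
```

Only the 5 × 32 output layer is updated here, so the memory and compute needed for tuning are a small fraction of those for full training. That is the property that makes on-chip tuning plausible when patient data are scarce or must stay on the device.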
1) Online Patient-Specific Model Tuning: Considering con-
cerns surrounding the sensitive nature of individual patient data,
and the ability of some recent edge-AI CMOS chips such as
LNPU [55] to perform online training, patient-specific model
tuning can be performed online on the hardware deep learning
accelerator. To achieve this, a sufficient amount of patient data
that is fed to the accelerator over time can be gathered to indi-
vidualize the initial generic model. An accelerator that can adapt
its operation to the specific needs of a patient would be highly
beneficial, but it may require buffering of data [162], which needs
more on-chip memory and may introduce power overheads.
2) Offline Patient-Specific Model Tuning: A convenient ap-
proach to tune general models, with domain-specific knowledge,
to patient-specific data is offline off-chip transfer learning. How-
ever, unlike online tuning, the offline approach requires prior
patient data measurements, which may not be readily available.
Besides, the offline approach may require undesired remote
storage and processing of private patient data to retrain and tune
generic models.
V. CONCLUSION
The use of DL in biomedical signal processing and healthcare
promises significant utility for medical practitioners and their
patients. DNNs can be used to improve the quality of life
for chronically ill patients by enabling ambient monitoring for
abnormalities, and correspondingly can reduce the burden on
medical resources. Proper use can lead to reduced workloads
for medical practitioners who may divert their attention to
time-critical tasks that require a standard beyond what neural
networks can achieve at this point in time.
We have stepped through the use of various DL accelerators
on a disparate range of medical tasks, and shown how SNNs
may complement DNNs where hardware efficiency is the pri-
mary bottleneck for widespread integration. We have provided a
balanced view of how memristors may lead to optimal hardware
processing of both DNNs and SNNs, and have highlighted the
challenges that must be overcome before they can be adopted at
a large scale. While the focus of this tutorial and review is on
hardware implementation of various DL algorithms, the reader
should be mindful that progress in hardware is a necessary, but
insufficient, condition for successful integration of medical-AI.
Adopting medical-AI tools is clearly a challenge that demands
the collaborative attention of healthcare providers, hardware
and software engineers, data scientists, policy-makers, cogni-
tive neuroscientists, device engineers and materials scientists,
amongst other specializations. A unified approach to developing
better hardware can have pervasive impacts upon the healthcare
industry, and realize significant payoff by improving the acces-
sibility and outcomes of healthcare.
ACKNOWLEDGMENT
M. Rahimi Azghadi acknowledges a JCU Rising Start ECR
Fellowship. C. Lammie acknowledges the JCU DRTPS.
REFERENCES
[1] G. Rong, A. Mendez, E. B. Assi, B. Zhao, and M. Sawan, “Artificial intel-
ligence in healthcare: Review and prediction case studies,” Engineering,
vol. 6, no. 3, pp. 291–301, 2020.
[2] T. Arevalo, The State of Health Care Industry (2020), Policy Advice,
Apr. 2020. Accessed on: Nov. 18, 2020. [Online]. Available: https://
policyadvice.net/insurance/insights/healthcare-statistics/
[3] V. Jindal, “Integrating mobile and cloud for PPG signal selection to
monitor heart rate during intensive physical exercise,” in Proc. Int. Conf.
Mobile Softw. Eng. Syst. , May 2016, pp. 36–37.
[4] P. Sundaravadivel, K. Kesavan, L. Kesavan, S. P. Mohanty, and E.
Kougianos, “Smart-Log: A deep-learning based automated nutrition
monitoring system in the IoT,” IEEE Trans. Consum. Electron., vol. 64,
no. 3, pp. 390–398, Aug. 2018.
[5] B. Shi et al., “Prediction of occult invasive disease in ductal carcinoma
in situ using deep learning features,” J. Amer. College Radiol., vol. 15,
no. 3, pp. 527–534, 2018.
[6] X. Liu et al., “A comparison of deep learning performance against
health-care professionals in detecting diseases from medical imaging:
A systematic review and meta-analysis,” Lancet Digit. Health, vol. 1,
no. 6, pp. e271–e297, 2019.
[7] F. Liu, P. Yadav, A. M. Baschnagel, and A. B. McMillan, “MR-based
treatment planning in radiation therapy using a deep learning approach,”
J. Appl. Clin. Med. Phys., vol. 20, no. 3, pp. 105–114, 2019.
[8] W. Zhu, L. Xie, J. Han, and X. Guo, “The application of deep learn-
ing in cancer prognosis prediction,” Cancers, vol. 12, p. 603, 2020,
doi: 10.3390/cancers12030603.
[9] S. M. McKinney et al., “International evaluation of an AI system for
breast cancer screening,” Nature, vol. 577, no. 7788, pp. 89–94, 2020.
[10] A. Y. Hannun et al., “Cardiologist-level arrhythmia detection and classi-
fication in ambulatory electrocardiograms using a deep neural network,”
Nat. Med., vol. 25, no. 1, pp. 65–69, 2019, doi: 10.1038/s41591-018-
0268-3.
[11] A. Esteva et al., “Dermatologist-level classification of skin cancer with
deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[12] T. Kalaiselvi, P. Sriramakrishnan, and K. Somasundaram, “Survey of
using GPU CUDA programming model in medical image analysis,”
Informat. Med. Unlocked, vol. 9, pp. 133–144, 2017.
[13] N. Jouppi et al., “In-datacenter performance analysis of a ten-
sor processing unit,” in ACM/IEEE 44th Ann. Int. Symp. Com-
put. Architecture (ISCA), Toronto, ON, Canada, 2017, pp. 1–12,
doi: 10.1145/3079856.3080246.
[14] A. Esteva et al., “A guide to deep learning in healthcare,” Nat. Med.,
vol. 25, no. 1, pp. 24–29, 2019.
[15] R. Perrault et al., “The AI index 2019 annual report,” AI Index Steering
Committee, Human-Centered AI Institute. Stanford, CA, USA: Stanford
University, 2019.
[16] P. N. Glaskowsky, “NVIDIA’s fermi: The first complete GPU
computing architecture,” Prepared under contract with NVIDIA
Corporation, NVIDIA 2788 San Tomas Expressway Santa Clara,
CA, USA, 2009. [Online]. Available: https://www.nvidia.com/
content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA’s_Fermi-
The_First_Complete_GPU_Architecture.pdf
[17] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting
the NVIDIA volta GPU architecture via microbenchmarking,” 2018,
arXiv:1804.06826.
[18] R. Zemouri, N. Zerhouni, and D. Racoceanu, “Deep learning in the
biomedical applications: recent and future status,” Appl. Sci., vol. 9, no. 8,
2019, Art. no. 1526.
[19] E. Smistad, T. L. Falch, M. Bozorgi, A. C. Elster, and F. Lindseth,
“Medical image segmentation on GPUs–A comprehensive review,” Med.
Image Anal., vol. 20, no. 1, pp. 1–18, 2015.
[20] B. Farahani, F. Firouzi, and K. Chakrabarty, “Healthcare IoT,” in
Intelligent Internet of Things. Berlin, Germany: Springer, 2020,
pp. 515–545.
[21] Q. Xie, K. Faust, R. Van Ommeren, A. Sheikh, U. Djuric, and P. Diaman-
dis, “Deep learning for image analysis: Personalizing medicine closer to
the point of care,” Crit. Rev. Clin. Lab. Sci., vol. 56, no. 1, pp. 61–73,
2019.
[22] M. Hartmann, U. S. Hashmi, and A. Imran, “Edge computing in
smart health care systems: Review, challenges, and research direc-
tions,” Trans. Emerg. Telecommun. Technol., 2019, Art. no. e3710,
doi: 10.1002/ett.3710.
[23] I. Azimi et al., “Hich: Hierarchical fog-assisted computing architecture
for healthcare IoT,” ACM Trans. Embedded Comput. Syst. , vol. 16, no. 5 s,
pp. 1–20, 2017.
[24] K. Sethi, V. Parmar, and M. Suri, “Low-power hardware-based deep-
learning diagnostics support case study,” in Proc. IEEE Biomed. Circuits
Syst. Conf., Oct. 2018, pp. 1–4.
[25] P. Sahu, D. Yu, and H. Qin, “Apply lightweight deep learning on internet
of things for low-cost and easy-to-access skin cancer detection,” in Proc.
Med. Imag.: Imag. Informat. Healthcare, Res. Appl., vol. 10579, Houston,
TX, USA, Feb. 2018, Art. no. 1057912.
[26] E. Chicca, F. Stefanini, C. Bartolozzi, and G. Indiveri, “Neuromorphic
electronic circuits for building autonomous cognitive systems,” Proc.
IEEE, vol. 102, no. 9, pp. 1367–1388, Sep. 2014.
[27] D. E. Rumelhart, G. Hinton, and R. J. Williams, “Learning representations
by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–538,
1986.
[28] T. C. Hollon et al., “Near real-time intraoperative brain tumor diagnosis
using stimulated Raman histology and deep neural networks,” Nat. Med.,
vol. 26, no. 1, pp. 52–58, 2020.
[29] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep EHR: A survey
of recent advances in deep learning techniques for electronic health
record (EHR) analysis,” IEEE J. Biomed. Health Informat., vol. 22, no. 5,
pp. 1589–1604, Sep. 2018.
[30] M. A. Sayeed, S. P. Mohanty, E. Kougianos, and H. P. Zaveri, “Neuro-
Detect: A machine learning-based fast and accurate seizure detection
system in the IoMT,” IEEE Trans. Consum. Electron., vol. 65, no. 3,
pp. 359–368, Aug. 2019.
[31] J. Yang and M. Sawan, “From seizure detection to smart and fully
embedded seizure prediction engine: A review,” IEEE Trans. Biomed.
Circuits Syst., vol. 14, no. 5, pp. 1008–1023, Oct. 2020.
[32] N. Tajbakhsh et al., “Convolutional neural networks for medical image
analysis: full training or fine tuning?,” IEEE Trans. Med. Imag., vol. 35,
no. 5, pp. 1299–1312, May 2016.
[33] J. Gao, H. Zhang, P. Lu, and Z. Wang, “An effective LSTM recurrent
network to detect arrhythmia on imbalanced ECG dataset,” J. Healthcare
Eng., vol. 2019, 2019, Art. no. 6320651, doi: 10.1155/2019/6320651.
[34] D. Zhang et al., “Cascade and parallel convolutional recurrent neural
networks on EEG-based intention recognition for brain computer inter-
face,” in Proc. AAAI Conf. Artif. Intell., Feb. 2018, pp. 1703–1710. [On-
line]. Available: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/
paper/view/16107
[35] X. Zhou, Y. Li, and W. Liang, “CNN-RNN based intelli-
gent recommendation for online medical pre-diagnosis support,”
IEEE/ACM Trans. Comput. Biol. Bioinformat., to be published,
doi: 10.1109/TCBB.2020.2994780.
[36] J. Laitala et al., “Robust ECG R-peak detection using LSTM,” in Proc.
ACM Symp. Appl. Comput., Mar. 2020, pp. 1104–1111.
[37] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge,
MA, USA: MIT Press, 2016.
[38] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans.
Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[39] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[40] M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio, “Transfusion: Under-
standing transfer learning for medical imaging,” in Proc. Adv. Neural Inf.
Process. Syst., Dec. 2019, pp. 3347–3357.
[41] M. S. Elmahdy, T. Ahuja, U. A. van der Heide, and M. Staring, “Patient-
specific finetuning of deep learning models for adaptive radiotherapy
in prostate ct,” in Proc. IEEE 17th Int. Symp. Biomed. Imag., 2020,
pp. 577–580.
[42] G. Indiveri and S.-C. Liu, “Memory and information processing in
neuromorphic systems,” Proc. IEEE, vol. 103, no. 8, pp. 1379–1397,
Aug. 2015.
[43] F. Corradi and G. Indiveri, “A neuromorphic event-based neural record-
ing system for smart brain-machine-interfaces,” IEEE Trans. Biomed.
Circuits Syst., vol. 9, no. 5, pp. 699–709, Oct. 2015.
[44] F. Corradi et al., “ECG-based heartbeat classification in neuromor-
phic hardware,” in Proc. Int. Joint Conf. Neural Netw., Jul. 2019,
pp. 1–8.
[45] E. Ceolini et al., “Hand-gesture recognition based on EMG and event-
based camera sensor fusion: A benchmark in neuromorphic computing,”
Front. Neurosci., vol. 14, no. 520438, 2020, Art. no. 637. [Online]. Avail-
able: https://www.frontiersin.org/article/10.3389/fnins.2020.00637
[46] G. Indiveri and Y. Sandamirskaya, “The importance of space and time for
signal processing in neuromorphic agents: The challenge of developing
low-power, autonomous agents that interact with the environment,” IEEE
Signal Process. Mag., vol. 36, no. 6, pp. 16–28, Nov. 2019.
[47] J. K. Eshraghian et al., “Neuromorphic vision hybrid RRAM-CMOS
architecture,” IEEE Trans. Very Large Scale Integr. Syst., vol. 26, no. 12,
pp. 2816–2829, Dec. 2018.
[48] X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman,
“Hots: A hierarchy of event-based time-surfaces for pattern recognition,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1346–1359,
Jul. 2017.
[49] M. Sharifshazileh, K. Burelo, T. Fedele, J. Sarnthein, and G. Indiveri, “A
neuromorphic device for detecting high-frequency oscillations in human
iEEG,” in Proc. IEEE Int. Conf. Electron., Circuits Syst., Nov. 2019,
pp. 69–72.
[50] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128X128 120 dB 15 us
latency asynchronous temporal contrast vision sensor,” IEEE J. Solid-
state Circuits, vol. 43, no. 2, pp. 566–576, Feb. 2008.
[51] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J.
Kepner, “Survey and benchmarking of machine learning accelerators,”
2019, arXiv:1908.11348.
[52] Coral, “What is the Edge TPU?,” Accessed: Nov. 18, 2020. [Online].
Available: https://coral.ai/docs/edgetpu/faq/
[53] J. Hruska, “Intel details its Nervana inference and training
AI cards,” Extreme Tech., Aug. 21, 2019. [Online]. Avail-
able: https://www.extremetech.com/computing/296990-intel-nervana-
nnp-i-nnp-t-a-training-inference
[54] P. Kennedy, “Huawei Ascend 910 provides a NVIDIA AI train-
ing alternative,” Serve The Home, Aug. 25, 2019. [Online]. Avail-
able: https://www.servethehome.com/huawei-ascend-910-provides-a-
nvidia-ai-training-alternative/
[55] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, “LNPU:
A 25.3 TFLOPS/W sparse deep-neural-network learning processor
with fine-grained mixed precision of FP8-FP16,” in Proc. IEEE Int.
Solid-State Circuits Conf., San Francisco, CA, USA, Feb. 2019,
pp. 142–144.
[56] D. Shin, J. Lee, J. Lee, J. Lee, and H.-J. Yoo, “DNPU: An energy-efficient
deep-learning processor with heterogeneous multi-core architecture,”
IEEE Micro, vol. 38, no. 5, pp. 85–93, Sep./Oct. 2018.
[57] S. Yin et al., “A high energy efficient reconfigurable hybrid neural
network processor for deep learning applications,” IEEE J. Solid-State
Circuits, vol. 53, no. 4, pp. 968–982, Apr. 2018.
[58] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “UNPU:
An energy-efficient deep neural network accelerator with fully vari-
able weight bit precision,” IEEE J. Solid-State Circuits, vol. 54, no. 1,
pp. 173–185, Jan. 2019.
[59] S. Zhang et al., “Cambricon-X: An accelerator for sparse neural net-
works,” in Proc. IEEE/ACM Int. Symp. Microarchitecture, Taipei, Tai-
wan, Oct. 2016, pp. 1–12.
[60] J. Zhang et al., “A computer vision pipeline for automated determi-
nation of cardiac structure and function and detection of disease by
two-dimensional echocardiography,” 2017, arXiv:1706.07342.
[61] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
works,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138,
Jan. 2017.
[62] Q. Guan et al., “Deep convolutional neural network VGG-16 model for
differential diagnosing of papillary thyroid carcinomas in cytological
images: A pilot study,” J. Cancer, vol. 10, no. 20, 2019, Art. no. 4876.
[63] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W convolutional
network accelerator,” IEEE Trans. Circuits Syst. Video Technol., vol. 27,
no. 11, pp. 2461–2475, Nov. 2017.
[64] I. Azimi, J. Takalo-Mattila, A. Anzanpour, A. M. Rahmani, J.-P. Soininen,
and P. Liljeberg, “Empowering healthcare IoT systems with hierarchical
edge-based deep learning,” in Proc. Int. Conf. Connected Health: Appl.,
Syst. Eng. Technol., Washington, DC, USA, Sep. 2018, pp. 63–68.
[65] J. Huang, S. Lin, N. Wang, G. Dai, Y. Xie, and J. Zhou, “TSE-CNN:
A two-stage end-to-end CNN for human activity recognition,” IEEE J.
Biomed. Health Informat., vol. 24, no. 1, pp. 292–299, Jan. 2020.
[66] B. Moons and M. Verhelst, “An energy-efficient precision-scalable Con-
vNet processor in 40-nm CMOS,” IEEE J. Solid-State Circuits, vol. 52,
no. 4, pp. 903–914, Apr. 2017.
[67] M. Blaivas and L. Blaivas, “Are all deep learning architectures alike
for point-of-care ultrasound?: Evidence from a cardiac image classifi-
cation model suggests otherwise,” J. Ultrasound Med., vol. 39, no. 6,
pp. 1187–1194, 2020. [Online]. Available: https://onlinelibrary.wiley.
com/doi/abs/10.1002/jum.15206
[68] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “Envision:
A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-
frequency-scalable Convolutional Neural Network processor in 28 nm
FDSOI,” in Proc. IEEE Int. Solid-State Circuits Conf., San Francisco,
CA, USA, Feb. 2017, pp. 246–247.
[69] M.-P. Hosseini, T. X. Tran, D. Pompili, K. Elisevich, and H. Soltanian-
Zadeh, “Deep learning with edge computing for localization of epilep-
togenicity using multimodal RS-fMRI and EEG big data,” in Proc.
IEEE Int. Conf. Autonomic Comput., Columbus, OH, USA, Jul. 2017,
pp. 83–92.
[70] J. Song et al., “An 11.5TOPS/W 1024-MAC butterfly structure dual-core
sparsity-aware neural processing unit in 8 nm flagship mobile SoC,” in
Proc. IEEE Int. Solid-State Circuits Conf., San Francisco, CA, USA,
Feb. 2019, pp. 130–132.
[71] F. Preiswerk, C.-C. Cheng, J. Luo, and B. Madore, “Synthesizing dynamic
MRI using long-term recurrent convolutional networks,” in Proc. Int.
Workshop Mach. Learn. Med. Imag., Sep. 2018, pp. 89–97.
[72] J. Acharya and A. Basu, “Deep neural network for respiratory sound
classification in wearable devices enabled by patient specific model
tuning,” IEEE Trans. Biomed. Circuits Syst., vol. 14, no. 3, pp. 535–544,
Jun. 2020.
[73] M. S. Roy, B. Roy, R. Gupta, and K. D. Sharma, “On-device reliabil-
ity assessment and prediction of missing photoplethysmographic data
using deep neural networks,” IEEE Trans. Biomed. Circuits Syst., to be
published, doi: 10.1109/TBCAS.2020.3028935.
[74] J. P. Queralta, T. N. Gia, H. Tenhunen, and T. Westerlund, “Edge-AI in
LoRa-based health monitoring: Fall detection system with fog computing
and LSTM recurrent neural networks,” in Proc. Int. Conf. Telecommun.
Signal Process., 2019, pp. 601–604.
[75] S. Shaikh, R. So, T. Sibindi, C. Libedinsky, and A. Basu, “Sparse
ensemble machine learning to improve robustness of long-term decoding
in iBMIs,” IEEE Trans. Neural Syst. Rehabil. Eng., vol. 28, no. 2,
pp. 380–389, Feb. 2020.
[76] I. M. Baltruschat, H. Nickisch, M. Grass, T. Knopp, and A. Saalbach,
“Comparison of deep learning approaches for multi-label chest x-ray
classification,” Sci. Rep., vol. 9, no. 1, pp. 1–10, 2019.
[77] K. Zhao, H. Jiang, Z. Wang, P. Chen, B. Zhu, and X. Duan, “Long-
term bowel sound monitoring and segmentation by wearable devices
and convolutional neural networks,” IEEE Trans. Biomed. Circuits Syst.,
vol. 14, no. 5, pp. 985–996, Oct. 2020.
[78] G. Zamzmi, L.-Y. Hsu, W. Li, V. Sachdev, and S. Antani, “Harnessing
machine intelligence in automatic echocardiogram analysis: Current
status, limitations, and future directions,” IEEE Rev. Biomed. Eng., to
be published, doi: 10.1109/RBME.2020.2988295.
[79] Y. Chen, E. Yao, and A. Basu, “A 128-channel extreme learning machine-
based neural decoder for brain machine interfaces,” IEEE Trans. Biomed.
Circuits Syst., vol. 10, no. 3, pp. 679–692, Jun. 2016.
[80] G. O’Leary, D. M. Groppe, T. A. Valiante, N. Verma, and R. Genov,
“NURIP: Neural interface processor for brain-state classification and
programmable-waveform neurostimulation,” IEEE J. Solid-State Cir-
cuits, vol. 53, no. 11, pp. 3150–3162, Nov. 2018.
[81] L. Deng, G. Li, S. Han, L. Shi, and Y. Xie, “Model compression and
hardware acceleration for neural networks: A comprehensive survey,”
Proc. IEEE, vol. 108, no. 4, pp. 485–532, Apr. 2020.
[82] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing
of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105,
no. 12, pp. 2295–2329, Dec. 2017.
[83] P.-H. Pham, D. Jelaca, C. Farabet, B. Martini, Y. LeCun, and E. Cu-
lurciello, “NeuFlow: Dataflow vision processing system-on-a-chip,” in
Proc. IEEE Int. Midwest Symp. Circuits Syst., Aug. 2012, pp. 1044–1047.
[84] A. Putnam et al., “A reconfigurable fabric for accelerating large-scale dat-
acenter services,” in Proc. ACM/IEEE Int. Symp. Comput. Architecture,
Jun. 2014, pp. 13–24.
[85] D. Abts et al., “Think fast: A Tensor Streaming Processor (TSP)
for accelerating deep learning workloads,” in Proc. ACM/IEEE
47th Ann. Int. Symp. Comput. Architect., 2020, pp. 3347–3357,
doi: 10.1109/ISCA45697.2020.00023.
[86] H. Kung and C. E. Leiserson, “Systolic arrays (for VLSI),” in Proc. Sparse
Matrix, 1979, vol. 1, pp. 256–282.
[87] H.-T. Kung, “Why systolic architectures?,” Computer, vol. 15, no. 1,
pp. 37–46, 1982, doi: 10.1109/MC.1982.1653825.
[88] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A parallel programming
standard for heterogeneous computing systems,” Comput. Sci. Eng.,
vol. 12, no. 3, pp. 66–73, 2010.
[89] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “A survey of FPGA-
based neural network inference accelerator,” ACM Trans. Reconfigurable
Technol. Syst., vol. 12, no. 1, pp. 1–26, 2019.
[90] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang,
“High-level synthesis for FPGAs: From prototyping to deployment,”
IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 30, no. 4,
pp. 473–491, Apr. 2011.
[91] C. Lammie, W. Xiang, and M. R. Azghadi, “Accelerating deterministic
and stochastic binarized neural networks on FPGAs using OpenCL,” in
Proc. IEEE Int. Midwest Symp. Circuits Syst., Aug. 2019, pp. 626–629.
[92] C. Lammie, A. Olsen, T. Carrick, and M. R. Azghadi, “Low-power and
high-speed deep FPGA inference engines for weed classification at the
edge,” IEEE Access, vol. 7, pp. 51 171–51 184, 2019.
[93] M. Carreras, G. Deriu, L. Raffo, L. Benini, and P. Meloni, “Optimizing
temporal convolutional network inference on FPGA-based accelerators,”
2020, arXiv:2005.03775.
[94] C. Lammie, W. Xiang, and M. R. Azghadi, “Training progressively bi-
narizing deep networks using FPGAs,” in Proc. IEEE Int. Symp. Circuits
Syst., 2020, pp. 1–5.
[95] C. Lammie and M. R. Azghadi, “Stochastic computing for low-power and
high-speed deep learning on FPGA,” in Proc. IEEE Int. Symp. Circuits
Syst., Sapporo, Japan, May 2019, pp. 1–5.
[96] D. Wang, K. Xu, and D. Jiang, “PipeCNN: An OpenCL-based open-
source FPGA accelerator for convolution neural networks,” in Proc. Int.
Conf. Field Programmable Technol., Dec. 2017, pp. 279–282.
[97] M. Wess, P. S. Manoj, and A. Jantsch, “Neural network based ECG
anomaly detection on FPGA and trade-off analysis,” in Proc. IEEE Int.
Symp. Circuits Syst., May 2017, pp. 1–4.
[98] F. Xing, Y. Xie, X. Shi, P. Chen, Z. Zhang, and L. Yang, “Towards pixel-
to-pixel deep nucleus detection in microscopy images,” BMC Bioinf.,
vol. 20, no. 1, pp. 1–16, 2019.
[99] R. R. Shrivastwa, V. Pudi, and A. Chattopadhyay, “An FPGA-based
brain computer interfacing using compressive sensing and machine
learning,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Jul. 2018,
pp. 726–731.
[100] Z. Chen, A. Howe, H. T. Blair, and J. Cong, “CLINK: Compact LSTM
inference kernel for energy efficient neurofeedback devices,” in Proc. Int.
Symp. Low Power Electron. Des., Jul. 2018, pp. 1–6.
[101] G. Burr et al., “Large-scale neural networks implemented with non-
volatile memory as the synaptic weight element: Comparative perfor-
mance analysis (accuracy, speed, and power),” in Proc. IEEE Int. Electron
Devices Meeting, Washington, DC, USA, Dec. 2015, pp. 4.4.1–4.4.4.
[102] S. Ambrogio et al., “Equivalent-accuracy accelerated neural-network
training using analogue memory,” Nature, vol. 558, no. 7708, pp. 60–67,
Jun. 2018, doi: 10.1038/s41586-018-0180-5.
[103] P. Yao et al., “Fully hardware-implemented memristor convolutional
neural network,” Nature, vol. 577, no. 7792, pp. 641–646, 2020.
[104] J. K. Eshraghian, S.-M. Kang, S. Baek, G. Orchard, H. H.-C. Iu, and W.
Lei, “Analog weights in ReRAM DNN accelerators,” in Proc. IEEE Int. Conf.
Artif. Intell. Circuits Syst., 2019, pp. 267–271.
[105] M. R. Azghadi, B. Linares-Barranco, D. Abbott, and P. H. Leong, “A
hybrid CMOS-memristor neuromorphic synapse,” IEEE Trans. Biomed.
Circuits Syst., vol. 11, no. 2, pp. 434–445, Apr. 2017.
[106] M. R. Azghadi et al., “Complementary metal-oxide semiconductor and
memristive hardware for neuromorphic computing,” Adv. Intell. Syst.,
vol. 2, no. 5, 2020, Art. no. 1900189.
[107] Q. Xia and J. J. Yang, “Memristive crossbar arrays for brain-inspired
computing,” Nat. Mater., vol. 18, no. 4, pp. 309–323, 2019.
[108] C. Lammie, W. Xiang, B. Linares-Barranco, and M. R. Azghadi, “Mem-
Torch: An open-source simulation framework for memristive deep learn-
ing systems,” 2020, arXiv:2004.10971.
[109] C. Lammie, O. Krestinskaya, A. James, and M. R. Azghadi, “Variation-
aware binarized memristive networks,” in Proc. Int. Conf. Electron.,
Circuits Syst., Genova, Italy, Nov. 2019, pp. 490–493.
[110] O. Krestinskaya, K. N. Salama, and A. P. James, “Learning in memristive
neural network architectures using analog backpropagation circuits,”
IEEE Trans. Circuits Syst. I: Regular Papers, vol. 66, no. 2, pp. 719–732,
Feb. 2019.
[111] S. Yu, P.-Y. Chen, Y. Cao, L. Xia, Y. Wang, and H. Wu, “Scaling-up
resistive synaptic arrays for neuro-inspired architecture: Challenges and
prospect,” in Proc. IEEE Int. Electron Devices Meeting, Washington, DC,
USA, Dec. 2015, pp. 17.3.1–17.3.4.
[112] N. Bien et al., “Deep-learning-assisted diagnosis for knee magnetic
resonance imaging: development and retrospective validation of MRNet,”
PLoS Med., vol. 15, no. 11, 2018, Paper e1002699.
[113] A. Ankit et al., “PUMA: A programmable ultra-efficient memristor-based
accelerator for machine learning inference,” 2019. [Online]. Available:
http://arxiv.org/abs/1901.10351
[114] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, “VTEAM:
A general model for voltage-controlled memristors,” IEEE Trans. Cir-
cuits Syst. II: Express Briefs, vol. 62, no. 8, pp. 786–790, Aug. 2015.
[115] E. Yalon et al., “Resistive switching in HfO2 probed by a metal–insulator–
semiconductor bipolar transistor,” IEEE Electron Device Lett., vol. 33,
no. 1, pp. 11–13, Jan. 2012.
[116] A. M. Hassan, A. F. Khalaf, K. S. Sayed, H. H. Li, and Y. Chen, “Real-time
cardiac arrhythmia classification using memristor neuromorphic comput-
ing system,” in Proc. Int. Conf. IEEE Eng. Med. Biol. Soc., Jul. 2018,
pp. 2567–2570.
[117] F. Cai et al., “A fully integrated reprogrammable memristor–CMOS
system for efficient multiply–accumulate operations,” Nat. Electron.,
vol. 2, no. 7, pp. 290–299, 2019.
[118] T. Hirtzlin et al., “Digital biologically plausible implementation of bina-
rized neural networks with differential hafnium oxide resistive memory
arrays,” Front. Neurosci., vol. 13, 2020, Art. no. 1383. [Online]. Avail-
able: https://www.frontiersin.org/article/10.3389/fnins.2019.01383
[119] F. C. Bauer, D. R. Muir, and G. Indiveri, “Real-time ultra-low power
ECG anomaly detection using an event-driven neuromorphic processor,”
IEEE Trans. Biomed. Circuits Syst., vol. 13, no. 6, pp. 1575–1582,
Dec. 2019.
[120] E. Donati et al., “Processing EMG signals using reservoir computing on
an event-based neuromorphic system,” in Proc. IEEE Biomed. Circuits
Syst. Conf., Oct. 2018, pp. 1–4.
[121] E. Donati, M. Payvand, N. Risi, R. Krause, and G. Indiveri, “Discrimina-
tion of EMG signals using a neuromorphic implementation of a spiking
neural network,” IEEE Trans. Biomed. Circuits Syst., vol. 13, no. 5,
pp. 795–803, Oct. 2019.
[122] J. Behrenbeck et al., “Classification and regression of spatio-temporal
signals using NeuCube and its realization on SpiNNaker neuromorphic
hardware,” J. Neural Eng., vol. 16, no. 2, 2019, Paper 026014.
[123] E. Nurse, B. S. Mashford, A. J. Yepes, I. Kiral-Kornek, S. Harrer, and
D. R. Freestone, “Decoding EEG and LFP signals using deep learning:
Heading TrueNorth,” in Proc. ACM Int. Conf. Comput. Front., May 2016,
pp. 259–266.
[124] S. Shaikh, R. So, T. Sibindi, C. Libedinsky, and A. Basu, “Real-
time closed loop neural decoding on a neuromorphic chip,” in Proc.
IEEE/EMBS Int. Conf. Neural Eng., San Francisco, CA, USA, Mar. 2019,
pp. 670–673.
[125] S. Shaikh, R. So, T. Sibindi, C. Libedinsky, and A. Basu, “Towards intelli-
gent intracortical BMI (i2BMI): Low-power neuromorphic decoders that
outperform Kalman filters,” IEEE Trans. Biomed. Circuits Syst., vol. 13,
no. 6, pp. 1615–1624, Dec. 2019.
[126] P. Škoda, T. Lipić, Srp, B. M. Rogina, K. Skala, and F. Vajda, “Imple-
mentation framework for artificial neural networks on FPGA,” in Proc.
Int. Conv., May 2011, pp. 274–278.
[127] C. Heelan, A. V. Nurmikko, and W. Truccolo, “FPGA implementation of
deep-learning recurrent neural networks with sub-millisecond real-time
latency for BCI-decoding of large-scale neural sensors (10⁴ nodes),” in
Proc. Int. Conf. IEEE Eng. Med. Biol. Soc., Jul. 2018, pp. 1070–1073.
[128] L. G. Rocha et al., “Binary CorNET: Accelerator for HR estimation
from wrist-PPG,” IEEE Trans. Biomed. Circuits Syst., vol. 14, no. 4,
pp. 715–726, Aug. 2020.
[129] A. Jafari, A. Ganesan, C. S. K. Thalisetty, V. Sivasubramanian, T.
Oates, and T. Mohsenin, “SensorNet: A scalable and low-power deep
convolutional neural network for multimodal data classification,” IEEE
Trans. Circuits Syst. I: Regular Papers, vol. 66, no. 1, pp. 274–287,
Jan. 2019.
[130] L. Ohno-Machado and D. Bialek, “Diagnosing breast cancer from fnas:
variable relevance in neural network and logistic regression models,”
Stud. Health Technol. Informat., vol. 52, pp. 537–540, 1998.
[131] Y. Ku, W. Tompkins, and Q. Xue, “Artificial neural network for ECG
arrhythmia monitoring,” in Proc. IEEE Int. Joint Conf. Neural Netw.,
Jun. 1992, vol. 2, pp. 987–992.
[132] J. K. Eshraghian, “Human ownership of artificial creativity,” Nat. Mach.
Intell., vol. 2, pp. 157–160, 2020, doi: 10.1038/s42256-020-0161-x.
[133] S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri, “A scalable multicore
architecture with heterogeneous memory structures for dynamic neu-
romorphic asynchronous processors (DYNAPs),” IEEE Trans. Biomed.
Circuits Syst., vol. 12, no. 1, pp. 106–122, Feb. 2018.
[134] S. Benatti et al., “A versatile embedded platform for EMG acquisition and
gesture recognition,” IEEE Trans. Biomed. Circuits Syst., vol. 9, no. 5,
pp. 620–630, Oct. 2015.
[135] F. Montagna, A. Rahimi, S. Benatti, D. Rossi, and L. Benini, “PULP-HD:
Accelerating brain-inspired high-dimensional computing on a parallel
ultra-low power platform,” in Proc. ACM/ESDA/IEEE Des. Automat.
Conf., Jun. 2018, pp. 1–6.
[136] S. B. Furber et al., “Overview of the SpiNNaker system archi-
tecture,” IEEE Trans. Comput., vol. 62, no. 12, pp. 2454–2467,
Dec. 2013.
[137] P. Merolla and K. Boahen, “A recurrent model of orientation maps with
simple and complex cells,” in Proc. Adv. Neural Inf. Process. Syst. 17
(NIPS), Vancouver, Canada, Dec. 2004, pp. 1995–2002.
[138] M. Davies et al., “Loihi: A neuromorphic manycore processor with on-
chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, Jan./Feb. 2018.
[139] C. Frenkel, M. Lefebvre, J.-D. Legat, and D. Bol, “A 0.086-mm²
12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking
neuromorphic processor in 28-nm CMOS,” IEEE Trans. Biomed. Circuits
Syst., vol. 13, no. 1, pp. 145–158, Feb. 2019.
[140] C. Frenkel, J.-D. Legat, and D. Bol, “MorphIC: A 65-nm
738k-Synapse/mm² quad-core binary-weight digital neuromorphic
processor with stochastic spike-driven online learning,” IEEE Trans.
Biomed. Circuits Syst., vol. 13, no. 5, pp. 999–1010, Oct. 2019.
[141] N. Qiao et al., “A reconfigurable on-line learning spiking neuromorphic
processor comprising 256 neurons and 128 K synapses,” Front. Neurosci.,
vol. 9, 2015, Art. no. 141. [Online]. Available: https://www.frontiersin.
org/article/10.3389/fnins.2015.00141
[142] M. R. Azghadi, S. Moradi, D. B. Fasnacht, M. S. Ozdas, and G. Indiveri,
“Programmable spike-timing-dependent plasticity learning circuits in
neuromorphic VLSI architectures,” ACM J. Emerg. Technol. Comput.
Syst., vol. 12, no. 2, Sep. 2015, doi: 10.1145/2658998.
[143] M. Payvand and G. Indiveri, “Spike-based plasticity circuits for always-
on on-line learning in neuromorphic systems,” in Proc. IEEE Int. Symp.
Circuits Syst., Sapporo, Japan, May 2019, pp. 1–5.
[144] J. Kaiser, H. Mostafa, and E. Neftci, “Synaptic plasticity dynamics for
deep continuous local learning (DECOLLE),” Front. Neurosci., vol. 14,
2020, Art. no. 424. [Online]. Available: https://www.frontiersin.org/
article/10.3389/fnins.2020.00424
[145] G. Bellec et al., “Eligibility traces provide a data-inspired alternative to
backpropagation through time,” in Proc. Workshop 33rd Conf. Neural
Inf. Process. Syst., Vancouver, Canada, Dec. 2019.
[146] J. Sacramento, R. P. Costa, Y. Bengio, and W. Senn, “Dendritic
cortical microcircuits approximate the backpropagation algorithm,” in
Proc. Conf. Neural Inf. Process. Syst., Montreal, Canada, Dec. 2018,
pp. 8721–8732.
[147] A. Valentian et al., “Fully integrated spiking neural network with analog
neurons and RRAM synapses,” in Proc. IEEE Int. Electron Devices
Meeting, San Francisco, CA, USA, Dec. 2019, pp. 14.13.1–14.13.4.
[148] Y. Hayakawa et al., “Highly reliable TaOx ReRAM with centralized
filament for 28-nm embedded application,” in Proc. VLSI Technol., 2015,
pp. T14–T15.
[149] T. Dalgaty et al., “Hybrid neuromorphic circuits exploiting non-
conventional properties of RRAM for massively parallel local plasticity
mechanisms,” APL Mater., vol. 7, no. 8, 2019, Art. no. 081125.
[150] M. Payvand, Y. Demirag, T. Dalgaty, E. Vianello, and G. Indiveri, “Analog
weight updates with compliance current modulation of binary ReRAMs
for on-chip learning,” in Proc. IEEE Int. Symp. Circuits Syst., 2020,
pp. 1–5.
[151] E. Chicca and G. Indiveri, “A recipe for creating ideal hybrid memristive-
cmos neuromorphic processing systems,” Appl. Phys. Lett., vol. 116,
no. 12, 2020, Art. no. 120501.
[152] T. S. Hall, C. M. Twigg, P. Hasler, and D. V. Anderson, “Application
performance of elements in a floating-gate FPAA,” in Proc. IEEE Int.
Symp. Circuits Syst., May 2004, pp. II-589.
[153] S. Shah, H. Toreyin, O. T. Inan, and J. Hasler, “Reconfigurable analog
classifier for knee-joint rehabilitation,” in Proc. IEEE Int. Conf. Eng.
Med. Biol. Soc., Aug. 2016, pp. 4784–4787.
[154] R. Manjunath and K. S. Gurumurthy, “Artificial neural networks as
building blocks of mixed signal FPGA,” in Proc. IEEE Int. Conf. Field-
Programmable Technol., Tokyo, Japan, Dec. 2003, pp. 375–378.
[155] P. Dong, G. L. Bilbro, and M.-Y. Chow, “Implementation of artificial
neural network for real time applications using field programmable
analog arrays,” in Proc. IEEE Int. Joint Conf. Neural Netw., Jul. 2006,
pp. 1518–1524.
[156] C. R. Schlottmann and P. E. Hasler, “A highly dense, low power, pro-
grammable analog vector-matrix multiplier: The FPAA implementation,”
IEEE J. Emerg. Sel. Top. Circuits Syst., vol. 1, no. 3, pp. 403–411,
Sep. 2011.
[157] A. Zbrzeski, P. Hasler, F. Kölbl, E. Syed, N. Lewis, and S. Renaud, “A
programmable Bioamplifier on FPAA for in vivo neural recording,” in
Proc. IEEE Biomed. Circuits Syst. Conf., Nov. 2010, pp. 114–117.
[158] S. B. Shrestha and G. Orchard, “SLAYER: Spike layer error reassignment
in time,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 1419–1428.
[159] Q. Wang, X. Wang, S. H. Lee, F.-H. Meng, and W. D. Lu, “A deep neural
network accelerator based on tiled RRAM architecture,” in Proc. IEEE
Int. Electron Devices Meeting, Dec. 2019, pp. 14–4.
[160] A. Shoeb, H. Edwards, J. Connolly, B. Bourgeois, T. Treves, and J. Guttag,
“Patient-specific seizure onset detection,” in Proc. IEEE Eng. Med. Biol.
Soc. Conf., 2004, pp. 419–422.
[161] M. A. B. Altaf, C. Zhang, and J. Yoo, “A 16-channel patient-specific
seizure onset and termination detection SoC with impedance-adaptive
transcranial electrical stimulator,” IEEE J. Solid-State Circuits, vol. 50,
no. 11, pp. 2728–2740, Nov. 2015.
[162] J. Yoo, L. Yan, D. El-Damak, M. A. B. Altaf, A. H. Shoeb, and A. P.
Chandrakasan, “An 8-channel scalable EEG acquisition SoC with patient-
specific seizure classification and recording processor,” IEEE J. Solid-
State Circuits, vol. 48, no. 1, pp. 214–228, Jan. 2013.
Mostafa Rahimi Azghadi (Senior Member, IEEE)
received the Ph.D. degree in electrical & electronic
engineering from The University of Adelaide,
Adelaide, Australia, earning the Doctoral Research
Medal, as well as the Adelaide University Alumni
Medal. He is currently a Senior Lecturer with the Col-
lege of Science and Engineering, James Cook Uni-
versity, Townsville, Australia, where he researches
low-power and high-performance neuromorphic ac-
celerators for neural-inspired and deep learning net-
works for a variety of applications including automa-
tion, precision agriculture, aquaculture, marine sciences, and medical imaging.
His research has attracted over $0.7 Million in funding from national and
international resources. Dr. Azghadi was the recipient of several national and
international accolades including a 2015 South Australia Science Excellence
award, a 2016 Endeavour Research Fellowship, a 2017 Queensland Young Tall
Poppy Science Award, a 2018 JCU Rising Star ECR Leader Fellowship, and a
2019 Fresh Science Queensland finalist. He is a TC member of Neural Systems
and Applications of the Circuits and Systems Society. He is an Associate Editor for
Frontiers in Neuromorphic Engineering and IEEE ACCESS.
Corey Lammie (Student Member, IEEE) received
the undergraduate degrees (Hons.) in electrical
engineering and information technology in 2018 from
James Cook University (JCU), where he is currently
working toward the Ph.D. degree in computer engineering.
His main research interests include brain-inspired
computing, and the simulation and hardware imple-
mentation of Spiking Neural Networks (SNNs) and
Artificial Neural Networks (ANNs) using ReRAM
devices and FPGAs. Mr. Lammie was the recipient
of several awards and fellowships including the in-
tensely competitive 2020–2021 IBM International Ph.D. Fellowship, a Domestic
Prestige Research Training Program Scholarship, and the 2017 Engineers Aus-
tralia CN Barton Medal Awarded for the best undergraduate engineering thesis
at JCU. He has served as Reviewer for several IEEE journals and conferences
including IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS and the IEEE
International Symposium on Circuits and Systems (ISCAS).
Jason K. Eshraghian (Member, IEEE) received the
B.Eng. degree in electrical and electronic engineering
and the bachelor’s degree in law from The University
of Western Australia, Perth, WA, Australia, in 2016,
where he also completed the Ph.D. degree. From 2015
to 2016, he was a Research Assistant with Chungbuk
National University, South Korea. He is a Postdoc-
toral Researcher with the Department of Electrical
Engineering and Computer Science, University of
Michigan in Ann Arbor. His current research interests
include neuromorphic computing and spiking neural
networks. Dr. Eshraghian is a member of the IEEE Neural Systems and Appli-
cations Committee. He was the recipient of the 2019 IEEE Very Large Scale
Integration Systems Best Paper Award, and the Best Paper Award at the 2019
IEEE Artificial Intelligence Circuits and Systems Conference for his work in
neuromorphic vision.
Melika Payvand (Member, IEEE) received the M.S.
and Ph.D. degrees in electrical and computer en-
gineering from the University of California Santa
Barbara in 2012 and 2016, respectively. She is cur-
rently a Research Scientist with the Institute of Neu-
roinformatics, University of Zurich and ETH Zurich.
Her research interests include exploiting the physics
of the computational substrate for online learning
and sensory processing. Dr. Payvand is part of the
Scientific Committee of the Capocaccia Workshop
for Neuromorphic Intelligence. She serves as a Technical
Committee member of Neural Systems, Applications, and Technologies
in the Circuits and Systems Society and as a Technical Program Committee
member for the International Symposium on Circuits and Systems (ISCAS).
She is a Guest Editor of Frontiers in Neuroscience and the winner of the
Best Neuromorph Award of the 2019 Telluride neuromorphic workshop.
Elisa Donati (Member, IEEE) received the B.Sc. and
M.Sc. degrees in biomedical engineering from the
University of Pisa, Pisa, Italy (cum laude), and the
Ph.D. degree in biorobotics from the Sant’Anna
School of Advanced Studies, Pisa, Italy, in 2016. She
is currently a Senior Scientist with the Institute of
Neuroinformatics, University of Zurich and ETHZ
where she is training as a Neuromorphic Engineer.
Her research interests include how to interface neuro-
robotics and neuromorphic engineering for building
smart and wearable biomedical devices. In particu-
lar, she is interested in designing VLSI systems for prosthetic devices, such
as adaptive neuromorphic pacemakers. Another recent application includes a
neuromorphic processor for controlling upper limb neuroprosthesis. She is
investigating how to process EMG data to extract features to produce motor
commands by using spiking neural networks. She is an Associate Editor for
Frontiers in Neuromorphic Engineering and a TC member of Neural
Systems and Applications of the Circuits and Systems Society and of the Biomedical
Circuits and Systems Society. As a member, she is part of the committee that is
organizing the 2nd IEEE International Conference on Artificial Intelligence
Circuits and Systems.
Bernabé Linares-Barranco (Fellow, IEEE) received
the B.S. degree in electronic physics and the M.S.
degree in microelectronics from the University of
Seville, Sevilla, Spain, in 1986 and 1987, respectively.
He received the first Ph.D. degree in high-frequency
OTA-C oscillator design in June 1990 from the Uni-
versity of Seville, Spain, and the second Ph.D. degree
in analog neural network design in December 1991
from Texas A&M University, College-Station, USA.
From September 1988 to August 1991, he was a
Graduate Student with the Department of Electri-
cal Engineering of Texas A&M University. Since June 1991, he has been a
Tenured Scientist with the “Instituto de Microelectrónica de Sevilla”
(IMSE-CNM-CSIC), Sevilla, Spain, which, since 2015, has been a Mixed Center between
the University of Sevilla and the Spanish Research Council (CSIC). From
September 1996 to August 1997, he was on sabbatical stay with the Department
of Electrical and Computer Engineering of the Johns Hopkins University. During
Spring 2002, he was Visiting Associate Professor with the Electrical Engineering
Department of Texas A&M University, College-Station, USA. In January 2003,
he was promoted to Tenured Researcher, and in January 2004, to Full Professor.
Since February 2018, he has been the Director of the “Instituto de Microelectrónica
de Sevilla.” He has been involved with circuit design for telecommunication
circuits, VLSI emulators of biological neurons, VLSI neural based pattern
recognition systems, hearing aids, precision circuit design for instrumentation
equipment, VLSI transistor mismatch parameters characterization, and over the
past 25 years has been deeply involved with neuromorphic spiking circuits
and systems, with strong emphasis on vision and exploiting nanoscale mem-
ristive devices for learning. He is Co-Founder of two start-ups, Prophesee SA
(www.prophesee.ai) and GrAI-Matter-Labs SAS (www.graimatterlabs.ai), both
on neuromorphic hardware.
Giacomo Indiveri (Senior Member, IEEE) received
the M.Sc. degree in electrical engineering and the
Ph.D. degree in computer science from the University
of Genoa, Italy. He is a Dual Professor with the
Faculty of Science of the University of Zurich and the
Department of Information Technology and Electrical
Engineering of ETH Zurich, Switzerland. He
is the Director of the Institute of Neuroinformatics
(INI) of the University of Zurich and ETH Zurich.
He was a Postdoctoral Research Fellow with the
Division of Biology, Caltech and with the Institute
of Neuroinformatics of the University of Zurich and ETH Zurich. His research
interests include the study of neural computation, with a particular focus on
spike-based learning and selective attention mechanisms. His research and
development activities focus on the full custom hardware implementation of
real-time sensory-motor systems using analog/digital neuromorphic circuits and
emerging memory technologies. Prof. Indiveri was awarded an ERC Starting
Grant on “Neuromorphic processors” in 2011 and an ERC Consolidator Grant
on neuromorphic cognitive agents in 2016.
