3,349 research outputs found
Energy-efficient embedded machine learning algorithms for smart sensing systems
Embedded autonomous electronic systems are required in numerous application domains such as the Internet of Things (IoT), wearable devices, and biomedical systems. Embedded electronic systems usually host sensors, and each sensor hosts multiple input channels (e.g., tactile, vision), tightly coupled to the electronic computing unit (ECU). The ECU often extracts information by employing sophisticated methods, e.g., Machine Learning. However, embedding Machine Learning algorithms poses essential challenges in terms of hardware resources and energy consumption because of: 1) the large amount of data to be processed; 2) computationally demanding methods. Exploiting the trade-off between quality requirements on the one hand and computational complexity and latency on the other could reduce the system complexity without affecting the performance. The objectives of the thesis are to develop: 1) energy-efficient arithmetic circuits outperforming state-of-the-art solutions for embedded machine learning algorithms, 2) an energy-efficient embedded electronic system for the “electronic-skin” (e-skin) application. As such, this thesis exploits two main approaches:
1) Approximate Computing (AC): In recent years, the approximate computing paradigm has become a major field of research since it can enhance the energy efficiency and performance of digital systems. AC has turned out to be a practical approach to trade accuracy for better power, latency, and size. AC targets error-resilient applications and offers promising benefits by conserving some resources. Approximate results are usually acceptable for many applications, e.g., tactile data processing, image processing, and data mining; thus, it is highly recommended to take advantage of energy reduction with minimal variation in performance. In our work, we developed two approximate multipliers: 1) the first one is called the “META” multiplier and is based on the Error Tolerant Adder (ETA), 2) the second one is called the “Approximate Baugh-Wooley (BW)” multiplier, where the approximations are implemented in the generation of the partial products. We showed that the proposed approximate arithmetic circuits could achieve a relevant reduction in power consumption and time delay of around 80.4% and 24%, respectively, with respect to the exact BW multiplier. Next, to prove the feasibility of AC in real-world applications, we explored the approximate multipliers in a case study, the e-skin application. The e-skin application is defined as multiple sensing components, including 1) structural materials, 2) signal processing, 3) data acquisition, and 4) data processing. In particular, processing the data originating from the e-skin into low- or high-level information is the main problem to be addressed by the embedded electronic system. Many studies have shown that Machine Learning is a promising approach for processing tactile data when classifying input touch modalities.
In our work, we proposed a methodology for evaluating the behavior of the system when introducing approximate arithmetic circuits in the main stages (i.e., signal and data processing stages) of the system. Based on the proposed methodology, we first implemented the approximate multipliers in the low-pass Finite Impulse Response (FIR) filter in the signal processing stage of the application. We noticed that the FIR filter based on the approximate BW multiplier (Approx-BW) outperforms state-of-the-art solutions, while respecting the trade-off between accuracy and power consumption, with an SNR degradation of 1.39 dB. Second, we implemented approximate adders and multipliers in the Coordinate Rotation Digital Computer (CORDIC) and the Singular Value Decomposition (SVD) circuits, respectively, since CORDIC and SVD account for a significant part of the computationally expensive Machine Learning algorithms employed in tactile data processing. We showed benefits of up to 21% and 19% in power reduction at the cost of less than 5% accuracy loss for the CORDIC and SVD circuits when scaling the number of approximated bits.
2) Parallel Computing Platforms (PCP): Exploiting parallel architectures for near-threshold computing based on multi-core clusters is a promising approach to improve the performance of smart sensing systems. In our work, we exploited a novel computing platform embedding a Parallel Ultra Low Power processor (PULP), called “Mr. Wolf,” for the implementation of Machine Learning (ML) algorithms for touch modalities classification. First, we tested the ML algorithms at the software level; using RGB images as a case study and a tactile dataset, we achieved accuracies of 97% and 83.5%, respectively. After validating the effectiveness of the ML algorithm at the software level, we performed the on-board classification of two touch modalities, demonstrating the promising use of Mr. Wolf for smart sensing systems. Moreover, we proposed a memory management strategy for storing the needed amount of trained tensors (i.e., 50 trained tensors for each class) in the on-chip memory. We evaluated the execution cycles for Mr. Wolf using a single core, 2 cores, and 3 cores, taking advantage of the benefits of parallelization. We presented a comparison with the popular low-power ARM Cortex-M4F microcontroller, usually employed in battery-operated devices. We showed that the ML algorithm on the proposed platform runs 3.7 times faster than on the ARM Cortex-M4F (STM32F40), consuming only 28 mW. The proposed platform achieves 15× better energy efficiency than the classification done on the STM32F40, consuming 81 mJ per classification and 150 pJ per operation.
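The accuracy-for-energy trade-off behind approximate computing can be illustrated with a small software model of an error-tolerant adder. The sketch below follows the commonly described ETA scheme (exact addition of the upper bits, carry-free addition of the lower bits, with the remaining lower bits forced to 1 as soon as both operand bits are 1). It is an assumption-laden illustration, not the circuit used in the thesis; the function name `eta_add` and the parameter `low_bits` are hypothetical.

```python
def eta_add(a: int, b: int, low_bits: int = 8) -> int:
    """Approximate sum of two non-negative integers (illustrative ETA model).

    The upper part is added exactly; the lower `low_bits` bits are combined
    without carry propagation, which is where the hardware savings would come from.
    """
    # Accurate part: upper bits, with no carry-in from the lower part.
    high = ((a >> low_bits) + (b >> low_bits)) << low_bits

    # Inaccurate part: scan the lower bits from most to least significant.
    low = 0
    for i in range(low_bits - 1, -1, -1):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        if a_bit and b_bit:
            # Both bits are 1: set this bit and every lower bit to 1, then stop.
            low |= (1 << (i + 1)) - 1
            break
        low |= (a_bit ^ b_bit) << i  # carry-free one-bit addition
    return high | low


if __name__ == "__main__":
    for x, y in [(5, 2), (12, 4), (1000, 2000)]:
        approx = eta_add(x, y, low_bits=4)
        print(x, y, "exact:", x + y, "approx:", approx, "error:", (x + y) - approx)
```

Increasing `low_bits` in this model enlarges the carry-free region, which mirrors the idea of scaling the number of approximated bits: a larger region means a potentially larger error in exchange for simpler logic.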
Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs
Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms have been, and are being, developed with the aim of accelerating this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will perform best in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations from widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance. This work was supported by the European Union's Horizon 2020 Research and Innovation Program under Marie Sklodowska-Curie Grant 749516, and in part by the Spanish Juan de la Cierva program under Grant IJCI-2017-33511. Peer reviewed. Postprint (published version).
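To make the notion of a "convolution parameter configuration" concrete, the following sketch computes the output size and the multiply-accumulate (MAC) count of a direct 2D convolution from its parameters. It only uses the standard output-size formula and does not call cuDNN; the function and parameter names are hypothetical and chosen for illustration.

```python
def conv_output_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


def direct_conv_macs(batch: int, in_channels: int, out_channels: int,
                     height: int, width: int, kernel_h: int, kernel_w: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Multiply-accumulate count of a direct (non-transformed) 2D convolution."""
    out_h = conv_output_size(height, kernel_h, stride, padding)
    out_w = conv_output_size(width, kernel_w, stride, padding)
    # Each output element needs in_channels * kernel_h * kernel_w MACs.
    return batch * out_channels * out_h * out_w * in_channels * kernel_h * kernel_w


if __name__ == "__main__":
    # Example configuration: 7x7 filters, stride 2, padding 3 on a 224x224 RGB image.
    print(conv_output_size(224, 7, stride=2, padding=3))
    print(direct_conv_macs(1, 3, 64, 224, 224, 7, 7, stride=2, padding=3))
```

The cost grows with the filter area and with the batch and channel counts, which gives some intuition for why parameters such as filter size can matter when choosing among convolution algorithms.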
Exploiting All-Programmable System on Chips for Closed-Loop Real-Time Neural Interfaces
High-density microelectrode arrays (HDMEAs) feature thousands of recording electrodes
in a single chip with an area of a few square millimeters. The obtained electrode density is
comparable to, and even higher than, the typical density of neuronal cells in cortical cultures.
Commercially available HDMEA-based acquisition systems are able to record the neural
activity from the whole array at the same time with submillisecond resolution. These devices
are a very promising tool and are increasingly used in neuroscience to tackle fundamental
questions regarding the complex dynamics of neural networks. Even if electrical or optical
stimulation is generally an available feature of such systems, they lack the capability of
creating a closed-loop between the biological neural activity and the artificial system. Stimuli
are usually sent in an open-loop manner, thus violating the inherent working basis of neural
circuits that in nature are constantly reacting to the external environment. This prevents
unraveling the real mechanisms behind the behavior of neural networks.
The primary objective of this PhD work is to overcome this limitation by creating a fully reconfigurable
processing system capable of providing real-time feedback to the ongoing
neural activity recorded with HDMEA platforms. The potential of modern heterogeneous
FPGAs has been exploited to realize the system. In particular, the Xilinx Zynq All Programmable
System on Chip (APSoC) has been used. The device features reconfigurable
logic, specialized hardwired blocks, and a dual-core ARM-based processor; the synergy of
these components makes it possible to achieve high processing performance while maintaining a high
level of flexibility and adaptivity. The developed system has been embedded in an acquisition
and stimulation setup featuring the following platforms:
• 3·Brain BioCam X, a state-of-the-art HDMEA-based acquisition platform capable of
recording in parallel from 4096 electrodes at 18 kHz per electrode.
• PlexStim™ Electrical Stimulator System, able to generate electrical stimuli with
custom waveforms to 16 different output channels.
• Texas Instruments DLP® LightCrafter™ Evaluation Module, capable of projecting
608x684 pixels images with a refresh rate of 60 Hz; it holds the function of optical
stimulation.
All the features of the system, such as band-pass filtering and spike detection of all the
recorded channels, have been validated by means of ex vivo experiments. Very low latency
has been achieved while processing the whole input data stream in real time. In the case
of electrical stimulation the total latency is below 2 ms; when optical stimuli are needed,
instead, the total latency is a little higher, being 21 ms in the worst case.
The final setup is ready to be used to infer cellular properties by means of closed-loop
experiments. As a proof of this concept, it has been successfully used for the clustering
and classification of retinal ganglion cells (RGCs) in mouse retina. For this experiment, the
light-evoked spikes from thousands of RGCs have been correctly recorded and analyzed in
real-time. Around 90% of the total clusters have been classified as ON- or OFF-type cells.
In addition to the closed-loop system, a denoising prototype has been developed. The main
idea is to exploit oversampling techniques to reduce the thermal noise recorded by HDMEA-based
acquisition systems. The prototype is capable of processing in real time all the input
signals from the BioCam X, and it is currently being tested to evaluate the performance in
terms of signal-to-noise ratio improvement.
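The reasoning behind oversampling-based denoising can be sketched numerically: if the noise is white and uncorrelated between samples (an assumption made here, not a detail given in the abstract), averaging N oversampled readings reduces the noise standard deviation by a factor of about sqrt(N). The function names below are hypothetical and the model is deliberately simplified.

```python
import random
import statistics


def oversample_and_average(true_value: float, noise_std: float,
                           factor: int, rng: random.Random) -> float:
    """Average `factor` noisy readings of the same underlying value."""
    readings = [true_value + rng.gauss(0.0, noise_std) for _ in range(factor)]
    return sum(readings) / factor


def residual_noise_std(noise_std: float, factor: int,
                       trials: int = 5000, seed: int = 0) -> float:
    """Estimate the noise standard deviation left after averaging."""
    rng = random.Random(seed)
    outputs = [oversample_and_average(0.0, noise_std, factor, rng)
               for _ in range(trials)]
    return statistics.pstdev(outputs)


if __name__ == "__main__":
    for factor in (1, 4, 16):
        # Under the white-noise assumption, expect roughly 1 / sqrt(factor).
        print(factor, round(residual_noise_std(1.0, factor), 3))
```

Under these assumptions, each fourfold increase in the oversampling factor halves the residual noise, which is the kind of signal-to-noise ratio gain such a prototype would aim to measure.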
Models for learning reverberant environments
Reverberation is present in all real-life enclosures, from our workplaces to our homes and even in places designed as auditoria, such as concert halls and theatres. We have learned to understand speech in the presence of reverberation and also to use it for aesthetics in music. This thesis investigates novel ways of enabling machines to learn the properties of reverberant acoustic environments. Training machines to classify rooms based on the effect of reverberation requires the use of data recorded in the room. The typical data for such measurements is the Acoustic Impulse Response (AIR) between the speaker and the receiver, represented as a Finite Impulse Response (FIR) filter. Its representation, however, is high-dimensional and the measurements are small in number, which limits the design and performance of deep learning algorithms. Understanding properties of the rooms relies on the analysis of the reflections that compose the AIRs and of the decay and absorption of the sound energy in the room. This thesis proposes novel methods for representing the early reflections, which are strong and sparse in nature and depend on the position of the source and the receiver. The resulting representation significantly reduces the number of coefficients needed to represent the AIR and can be combined with a stochastic model from the literature to also represent the late reflections. The use of Finite Impulse Response (FIR) for the task of classifying rooms is investigated, which provides novel results in this field. The aforementioned issues related to AIRs are highlighted through the analysis. This leads to the proposal of a data augmentation method for the training of the classifiers based on Generative Adversarial Networks (GANs), which uses existing data to create artificial AIRs, as if they were measured in real rooms. The networks learn properties of the room in the space defined by the parameters of the low-dimensional representation that is proposed in this thesis.
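Treating an AIR as an FIR filter means that the signal at the receiver is the convolution of the source signal with the AIR coefficients. The following minimal sketch shows this with a toy, hypothetical AIR (a direct path plus one weaker reflection); it is not the representation proposed in the thesis, and the function name `apply_air` is an assumption for illustration.

```python
def apply_air(dry_signal: list[float], air: list[float]) -> list[float]:
    """Convolve a dry source signal with an AIR given as FIR coefficients."""
    if not dry_signal or not air:
        return []
    out = [0.0] * (len(dry_signal) + len(air) - 1)
    for n, x in enumerate(dry_signal):
        for k, h in enumerate(air):
            # Each AIR tap adds a delayed, scaled copy of the source sample.
            out[n + k] += x * h
    return out


if __name__ == "__main__":
    toy_air = [1.0, 0.0, 0.5]        # direct path, then a reflection two samples later
    impulse = [1.0, 0.0, 0.0, 0.0]   # unit impulse as the source
    print(apply_air(impulse, toy_air))
```

A measured AIR typically has far more coefficients than this toy example, which is the high dimensionality that motivates a sparse description of the early reflections.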
Data comparison schemes for Pattern Recognition in Digital Images using Fractals
Pattern recognition in digital images is a common problem with application in
remote sensing, electron microscopy, medical imaging, seismic imaging and
astrophysics for example. Although this subject has been researched for over
twenty years there is still no general solution which can be compared with the
human cognitive system in which a pattern can be recognised subject to
arbitrary orientation and scale.
The application of Artificial Neural Networks can in principle provide a very
general solution providing suitable training schemes are implemented.
However, this approach raises some major issues in practice. First, the CPU
time required to train an ANN for a grey level or colour image can be very
large especially if the object has a complex structure with no clear geometrical
features such as those that arise in remote sensing applications. Secondly,
both the core and file space memory required to represent large images and
their associated data tasks lead to a number of problems in which the use of
virtual memory is paramount.
The primary goal of this research has been to assess methods of image data
compression for pattern recognition using a range of different compression
methods. In particular, this research has resulted in the design and
implementation of a new algorithm for general pattern recognition based on
the use of fractal image compression.
This approach has for the first time allowed the pattern recognition problem to
be solved in a way that is invariant to rotation and scale. It allows both ANNs
and correlation to be used subject to appropriate pre- and post-processing
techniques for digital image processing, an aspect for which a dedicated
programmer's workbench has been developed using X-Designer.
Field Programmable Gate Arrays (FPGAs) II
This Edited Volume, Field Programmable Gate Arrays (FPGAs) II, is a collection of reviewed and relevant research chapters offering a comprehensive overview of recent developments in the field of Computer and Information Science. The book comprises single chapters authored by various researchers and edited by an expert active in the Computer and Information Science research area. Each chapter is complete in itself, but all are united under a common research topic. This publication aims to provide a thorough overview of the latest research efforts by international authors on Computer and Information Science and to open new possible research paths for further novel developments.