Towards Neuromorphic Gradient Descent: Exact Gradients and Low-Variance Online Estimates for Spiking Neural Networks
Spiking Neural Networks (SNNs) are biologically plausible models that can run on low-power, non-von Neumann neuromorphic hardware, positioning them as promising alternatives to conventional Deep Neural Networks (DNNs) for energy-efficient edge computing and robotics. Over the past few years, the Gradient Descent (GD) and Error Backpropagation (BP) algorithms used in DNNs have inspired various training methods for SNNs. However, the non-local and backward nature of BP, combined with the inherent non-differentiability of spikes, represents a fundamental obstacle to computing gradients with SNNs directly on neuromorphic hardware. Novel approaches are therefore required to overcome the limitations of GD and BP and enable online gradient computation on neuromorphic hardware.
In this thesis, I address the limitations of GD and BP with SNNs by proposing three algorithms. First, I extend a recent method that computes exact gradients with temporally-coded SNNs by relaxing the firing constraint of temporal coding and allowing multiple spikes per neuron. My proposed method generalizes the computation of exact gradients with SNNs and improves the trade-offs between performance and other properties of spiking neurons. Next, I introduce a novel alternative to BP that computes low-variance gradient estimates in a local and online manner. Compared to other alternatives to BP, the proposed method demonstrates an improved convergence rate and increased performance with DNNs. Finally, I combine these two methods and propose an algorithm that estimates gradients with SNNs in a manner that is compatible with the constraints of neuromorphic hardware. My empirical results demonstrate the effectiveness of the resulting algorithm in training SNNs without performing BP.
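For context only, the sketch below illustrates the non-differentiability obstacle the abstract refers to and the common surrogate-gradient workaround that local, online alternatives aim to move beyond; it is not the thesis's method, and all names (`SurrogateSpike`, `lif_step`, the decay and threshold values) are illustrative assumptions.

```python
# Minimal sketch (not the thesis's method): a leaky integrate-and-fire neuron
# whose Heaviside spike function is given a surrogate derivative so that
# backpropagation can pass through the non-differentiable spike.
import torch

class SurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        return (membrane_potential > 0.0).float()   # non-differentiable spike

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Replace the zero-almost-everywhere derivative with a smooth bump.
        surrogate_grad = 1.0 / (1.0 + 10.0 * v.abs()) ** 2
        return grad_output * surrogate_grad

def lif_step(v, x, decay=0.9, threshold=1.0):
    """One discrete-time step of a leaky integrate-and-fire neuron."""
    v = decay * v + x                        # leaky integration of input current
    spike = SurrogateSpike.apply(v - threshold)
    v = v - spike * threshold                # soft reset after a spike
    return v, spike
```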
Non-invasive and non-intrusive diagnostic techniques for gas-solid fluidized beds – A review
Gas-solid fluidized-bed systems offer great advantages in terms of chemical reaction efficiency and temperature control where other chemical reactor designs fall short. For this reason, they have been widely employed in a range of industrial applications where these properties are essential. Nonetheless, the understanding of such systems and the corresponding design choices, in most cases, rely on heuristic expertise gained over the years rather than on a deep physical understanding of the phenomena taking place in fluidized beds. This is a major limiting factor when it comes to the design, scale-up, and optimization of such complex units. Fortunately, a wide array of diagnostic techniques has enabled researchers to make progress in this direction; among these, non-invasive and non-intrusive diagnostic techniques stand out because they do not affect the flow field and avoid direct contact with the medium under study. This work offers an overview of the non-invasive and non-intrusive diagnostic techniques most commonly applied to fluidized-bed systems, highlighting their capabilities in terms of the quantities they can measure, as well as the advantages and limitations of each. The latest developments and likely future trends are also presented. None of these techniques is the best option on all fronts; the goal of this work is rather to highlight what each technique has to offer and which applications it is best suited for.
Architecture and Circuit Design Optimization for Compute-In-Memory
The objective of the proposed research is to optimize computing-in-memory (CIM) design for accelerating Deep Neural Network (DNN) algorithms. As compute peripheries such as analog-to-digital converters (ADCs) introduce significant overhead in CIM inference design, the research first focuses on circuit optimization for inference acceleration and proposes a resistive random access memory (RRAM) based ADC-free in-memory compute scheme. We comprehensively explore the trade-offs among different types of ADCs and investigate a new ADC design especially suited for CIM, which performs an analog shift-add over multiple weight significance bits, improving throughput and energy efficiency under similar area constraints. Furthermore, we prototype an ADC-free CIM inference chip design with fully analog data processing between sub-arrays, which significantly improves hardware performance over conventional CIM designs and achieves near-software classification accuracy on the ImageNet and CIFAR-10/-100 datasets. Second, the research focuses on hardware support for CIM on-chip training. To maximize hardware reuse of the CIM weight-stationary dataflow, we propose CIM training architectures with a transpose weight mapping strategy. The cell design and periphery circuitry are modified to efficiently support bi-directional compute. A novel solution for signed-number multiplication is also proposed to handle the negative inputs in backpropagation. Finally, we propose an SRAM-based CIM training architecture and comprehensively explore the system-level hardware performance for DNN on-chip training based on silicon measurement results.
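As a rough illustration of the shift-add idea described above (a software analogue under an assumed unsigned bit-sliced weight model, not the paper's circuit or chip), each weight significance bit contributes a partial product that is shifted by its bit position and accumulated:

```python
# Illustrative sketch: a 4-bit weight matrix stored as binary bit-planes; each
# plane produces a partial matrix-vector product, and the planes are recombined
# by shift-add, the digital analogue of the analog shift-add described above.
import numpy as np

def bit_sliced_matvec(weights, x, n_bits=4):
    """Compute weights @ x by summing shifted per-bit partial products."""
    w = weights.astype(np.int64)
    acc = np.zeros(w.shape[0], dtype=np.int64)
    for b in range(n_bits):
        bit_plane = (w >> b) & 1      # one significance bit per cell
        partial = bit_plane @ x       # per-column accumulation (one array read)
        acc += partial << b           # shift-add recombination
    return acc

rng = np.random.default_rng(0)
W = rng.integers(0, 16, size=(8, 16))   # unsigned 4-bit weights
x = rng.integers(0, 4, size=16)
assert np.array_equal(bit_sliced_matvec(W, x), W @ x)
```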
Rank-adaptive spectral pruning of convolutional layers during training
The computational cost and memory demand of deep learning pipelines have grown
rapidly in recent years, and a variety of pruning techniques have therefore been
developed to reduce model parameters. The majority of these techniques focus on
reducing inference costs by pruning the network after a pass of full training.
A smaller number of methods address the reduction of training costs, mostly
based on compressing the network via low-rank layer factorizations. Despite
their efficiency for linear layers, these methods fail to effectively handle
convolutional filters. In this work, we propose a low-parametric training
method that factorizes the convolutions into tensor Tucker format and
adaptively prunes the Tucker ranks of the convolutional kernel during training.
Leveraging fundamental results from geometric integration theory of
differential equations on tensor manifolds, we obtain a robust training
algorithm that provably approximates the full baseline performance and
guarantees loss descent. Experiments against the full model and alternative
low-rank baselines show that the proposed method drastically reduces training
costs while achieving performance comparable to or better than the full
baseline and consistently outperforming competing low-rank approaches.
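The sketch below shows what putting a convolutional kernel into Tucker format looks like in plain NumPy, using a truncated HOSVD with fixed, assumed ranks; the paper's rank-adaptive training integrator and loss-descent guarantees are not reproduced here.

```python
# A minimal NumPy HOSVD sketch of Tucker-factorizing a convolution kernel.
import numpy as np

def unfold(t, mode):
    """Mode-n unfolding of a tensor."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def tucker_hosvd(kernel, ranks):
    """Truncated HOSVD: kernel ~= core x_0 U0 x_1 U1 x_2 U2 x_3 U3."""
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(kernel, mode), full_matrices=False)
        factors.append(u[:, :r])            # leading left singular vectors
    core = kernel
    for mode, u in enumerate(factors):
        # Contract mode `mode` of the core with the transposed factor.
        core = np.moveaxis(
            np.tensordot(u.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# Example: a (C_out, C_in, k, k) kernel compressed along the channel modes only.
kernel = np.random.randn(64, 32, 3, 3)
core, factors = tucker_hosvd(kernel, ranks=(16, 8, 3, 3))
print(core.shape)   # (16, 8, 3, 3)
```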
Reaching the Edge of the Edge: Image Analysis in Space
Satellites have become more widely available due to the reduction in size and
cost of their components. As a result, there has been an advent of smaller
organizations having the ability to deploy satellites with a variety of
data-intensive applications to run on them. One popular application is image
analysis to detect, for example, land, ice, clouds, etc. for Earth observation.
However, the resource-constrained nature of the devices deployed in satellites
creates additional challenges for this resource-intensive application.
In this paper, we present our work and lessons learned on building an Image
Processing Unit (IPU) for a satellite. We first investigate the performance of
a variety of edge devices (comparing CPU, GPU, TPU, and VPU) for
deep-learning-based image processing on satellites. Our goal is to identify
devices that can achieve accurate results and remain flexible when workloads
change, while satisfying the power and latency constraints of satellites. Our
results demonstrate that hardware accelerators such as ASICs and GPUs are
essential for meeting the latency requirements. However, state-of-the-art edge
devices with GPUs may draw too much power for deployment on a satellite. Then,
we use the findings gained from the performance analysis to guide the
development of the IPU module for an upcoming satellite mission. We detail how
to integrate such a module into an existing satellite architecture and describe
the software necessary to support the various missions that will utilize this module.
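A minimal, hypothetical harness of the kind used for such a device comparison might time per-image inference against a latency budget; the `infer` callable and the percentile-based budget check below are assumptions, not the paper's benchmarking code.

```python
# Hypothetical per-device benchmarking sketch: time per-image inference and
# check the 95th-percentile latency against a mission latency budget.
import time
import statistics

def benchmark(infer, images, latency_budget_ms, warmup=5):
    """Return per-image latency statistics and whether the budget is met."""
    for img in images[:warmup]:          # warm-up runs are excluded from timing
        infer(img)
    latencies_ms = []
    for img in images[warmup:]:
        start = time.perf_counter()
        infer(img)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    p95 = statistics.quantiles(latencies_ms, n=20)[18]   # 95th percentile
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p95_ms": p95,
        "meets_budget": p95 <= latency_budget_ms,
    }
```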
Moby: Empowering 2D Models for Efficient Point Cloud Analytics on the Edge
3D object detection plays a pivotal role in many applications, most notably
autonomous driving and robotics. These applications are commonly deployed on
edge devices to promptly interact with the environment, and often require near
real-time response. With limited computation power, it is challenging to
execute 3D detection on the edge using highly complex neural networks. Common
approaches such as offloading to the cloud induce significant latency overheads
due to the large amount of point cloud data during transmission. To resolve the
tension between wimpy edge devices and compute-intensive inference workloads,
we explore the possibility of empowering fast 2D detection to extrapolate 3D
bounding boxes. To this end, we present Moby, a novel system that demonstrates
the feasibility and potential of our approach. We design a transformation
pipeline for Moby that generates 3D bounding boxes efficiently and accurately
based on 2D detection results without running 3D detectors. Further, we devise
a frame offloading scheduler that judiciously decides when to launch the 3D
detector in the cloud to prevent errors from accumulating. Extensive
evaluations on NVIDIA Jetson TX2 with real-world autonomous driving datasets
demonstrate that Moby offers up to 91.9% latency improvement with modest
accuracy loss over the state of the art.
Comment: Accepted to ACM International Conference on Multimedia (MM) 202
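One simple way to extrapolate a 3D box from a 2D detection, sketched below under an assumed pinhole camera model, is to keep the points whose projection falls inside the 2D box and bound them; this is a generic frustum-style baseline, not Moby's actual transformation pipeline.

```python
# Frustum-style lifting of a 2D detection to an axis-aligned 3D bounding box.
import numpy as np

def lift_2d_box(points_xyz, K, box_2d):
    """points_xyz: (N, 3) points in camera coordinates; K: 3x3 intrinsics;
    box_2d: (u_min, v_min, u_max, v_max) from the 2D detector."""
    u_min, v_min, u_max, v_max = box_2d
    in_front = points_xyz[:, 2] > 0.0            # keep points in front of camera
    pts = points_xyz[in_front]
    uv = (K @ pts.T).T
    uv = uv[:, :2] / uv[:, 2:3]                  # perspective division
    mask = (uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) \
         & (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max)
    frustum_pts = pts[mask]
    if len(frustum_pts) == 0:
        return None
    # Axis-aligned 3D box: (x_min, y_min, z_min, x_max, y_max, z_max).
    return np.concatenate([frustum_pts.min(axis=0), frustum_pts.max(axis=0)])
```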
Feasibility of functional MRI on point-of-care MR platforms
Magnetic resonance imaging (MRI) has proven to be a clinically valuable tool that can produce anatomical and functional images with improved soft tissue contrast compared to other imaging modalities. There has recently been a surge in low- and mid-field scanners due to hardware developments and innovative acquisition techniques. These compact scanners are accessible, offer reduced siting requirements and can be made operational at a reduced cost.
This thesis aims to implement blood-oxygen-level-dependent (BOLD) resting-state functional MRI (fMRI) on such a mid-field point-of-care scanner. The availability of this technique can be beneficial for obtaining neurological information in cases of traumatic brain injury, stroke, epilepsy, and dementia. This technique had previously not been implemented at low- and mid-field, since both signal-to-noise ratio and contrast scale with field strength.
Studies were conducted to gauge the performance of an independent component analysis (ICA) based platform (GraphICA) in analyzing resting-state functional data, previously collected with a 3T scanner, to which noise was artificially added. This platform was used in later chapters to preprocess and perform functional connectivity studies with data from a mid-field scanner.
A single-echo gradient-echo echo-planar imaging (GE-EPI) sequence is typically used for BOLD-based fMRI. Task-based fMRI experiments were performed with this sequence to gauge the feasibility of the technique on a mid-field scanner. Once feasibility was established, the sequence was further optimized for mid-field scanners by considering all the imaging parameters.
Resting-state experiments were conducted on a mid-field scanner with an optimized single-echo GE-EPI sequence with reduced dead time. Temporal and image signal-to-noise ratios were calculated for different cortical regions. In addition, functional connectivity studies and identification of resting-state networks were performed with GraphICA, demonstrating the feasibility of resting-state fMRI at mid-field. The reliability and repeatability of the identified networks were assessed by comparing them with the networks identified from 3T data.
Resting-state experiments were also conducted with a multi-echo GE-EPI sequence to make effective use of the dead time arising from the long T2* at mid-field. Temporal signal-to-noise ratio was calculated for different cortical regions, and functional connectivity studies and identification of resting-state networks were again performed with GraphICA, further demonstrating the feasibility of resting-state fMRI at mid-field.
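A minimal sketch of the temporal SNR computation mentioned above (mean over time divided by standard deviation over time, averaged per region) is shown below; the array shapes and ROI labelling are assumptions, and the GraphICA preprocessing is not shown.

```python
# Temporal SNR per cortical region for a 4D BOLD time series.
import numpy as np

def temporal_snr(fmri_4d, roi_labels):
    """fmri_4d: (X, Y, Z, T) BOLD time series; roi_labels: (X, Y, Z) int atlas.
    Returns the mean tSNR (temporal mean / temporal std) per labelled region."""
    mean_img = fmri_4d.mean(axis=-1)
    std_img = fmri_4d.std(axis=-1)
    tsnr_img = np.divide(mean_img, std_img,
                         out=np.zeros_like(mean_img), where=std_img > 0)
    return {int(r): float(tsnr_img[roi_labels == r].mean())
            for r in np.unique(roi_labels) if r != 0}   # 0 = background
```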
Image-based Decision Support Systems: Technical Concepts, Design Knowledge, and Applications for Sustainability
Unstructured data accounts for 80-90% of all data generated, with image data contributing the largest portion. In recent years, the field of computer vision, fueled by deep learning techniques, has made significant advances in exploiting this data to generate value. However, computer vision models alone are often not sufficient for value creation. In these cases, image-based decision support systems (IB-DSSs), i.e., decision support systems that rely on images and computer vision, can be used to create value by combining human and artificial intelligence. Despite their potential, there has been little work on IB-DSSs so far.
In this thesis, we develop technical foundations and design knowledge for IB-DSSs and demonstrate the possible positive effect of IB-DSSs on environmental sustainability. The theoretical contributions of this work are based on and evaluated in a series of artifacts in practical use cases: First, we use technical experiments to demonstrate the feasibility of innovative approaches to exploit images for IB-DSSs.
We show the feasibility of deep-learning-based computer vision and identify future research opportunities based on one of our practical use cases. Building on this, we develop and evaluate a novel approach for combining human and artificial intelligence for value creation from image data. Second, we develop design knowledge that can serve as a blueprint for future IB-DSSs. We perform two design science research studies to formulate generalizable principles for purposeful design — one for IB-DSSs and one for the subclass of image-mining-based decision support systems (IM-DSSs). While IB-DSSs can provide decision support based on single images, IM-DSSs are suitable when large amounts of image data are available and required for decision-making. Third, we demonstrate the viability of applying IB-DSSs to enhance environmental sustainability by performing life cycle assessments for two practical use cases — one in which the IB-DSS enables a prolonged product lifetime and one in which the IB-DSS facilitates an improvement of manufacturing processes.
We hope this thesis will contribute to expanding the use and effectiveness of image-based decision support systems in practice and will provide directions for future research.
Estimating the Power Consumption of Heterogeneous Devices when performing AI Inference
Modern-day life is driven by electronic devices connected to the internet.
The emerging research field of the Internet-of-Things (IoT) has become popular,
just as there has been a steady increase in the number of connected devices.
Since many of these devices are utilised to perform computer vision (CV) tasks, it is
essential to understand their power consumption relative to their performance. We report the power
consumption profile and analysis of the NVIDIA Jetson Nano board while
performing object classification. We present an extensive analysis
regarding power consumption per frame and the output in frames per second using
YOLOv5 models. The results show that YOLOv5n outperforms the other YOLOv5
variants in terms of throughput (12.34 fps) and power consumption per frame
(0.154 mWh/frame).
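The per-frame figures above follow from simple arithmetic: energy per frame is the integrated power over the run divided by the number of frames processed. The sketch below uses a hypothetical power log whose values are chosen only so that the example reproduces roughly 12.3 fps and 0.154 mWh/frame; it is not the authors' measurement setup.

```python
# Convert a sampled board-power log and a frame count into fps and mWh/frame.
def per_frame_metrics(power_samples_mw, sample_interval_s, frames_processed):
    """power_samples_mw: power readings in mW taken every sample_interval_s."""
    duration_s = len(power_samples_mw) * sample_interval_s
    # Energy in mWh: sum of (power * interval) in mW*s, divided by 3600 s/h.
    energy_mwh = sum(p * sample_interval_s for p in power_samples_mw) / 3600.0
    return {
        "fps": frames_processed / duration_s,
        "mwh_per_frame": energy_mwh / frames_processed,
    }

# Hypothetical example: 60 s of 1 Hz readings at ~6.84 W while processing 740 frames
# gives roughly 12.3 fps and 0.154 mWh/frame.
print(per_frame_metrics([6840.0] * 60, 1.0, 740))
```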
ACiS: smart switches with application-level acceleration
Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster and denser CPUs, and then on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single-function (e.g., a network that transports data) to take on additional functionality (e.g., also operating on that data). More generally, integration enables compute-everywhere: not just in the CPU and accelerator, but also in the network and, more specifically, in the communication switches.
In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline, introducing composable plugins to successively add ACiS capabilities.
In the first work, we propose an inline accelerator to communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems.
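To make the notion of a user-definable collective concrete, the host-side sketch below defines a custom reduction with mpi4py; the `clipped_sum` operation is an arbitrary example, and the in-switch offload itself is hardware and is not represented here.

```python
# A user-defined MPI reduction expressed at the API level with mpi4py.
import numpy as np
from mpi4py import MPI

def clipped_sum(inbuf, outbuf, datatype):
    """Custom reduction: element-wise sum, saturated at a fixed ceiling."""
    x = np.frombuffer(inbuf, dtype=np.float64)
    y = np.frombuffer(outbuf, dtype=np.float64)
    y[:] = np.minimum(x + y, 1.0e6)

comm = MPI.COMM_WORLD
op = MPI.Op.Create(clipped_sum, commute=True)

local = np.full(4, float(comm.Get_rank()))   # each rank contributes its rank id
result = np.empty(4)
comm.Allreduce(local, result, op=op)         # the collective a switch could offload
op.Free()
```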
In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line rate. Functions requiring loops and state are addressed by this method. The proposed in-switch accelerator is based on RISC-V-compatible Coarse-Grained Reconfigurable Arrays (CGRAs).
To facilitate usability, we have developed a framework to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator.
In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity to fuse communication (collectives) with computation. Since the computation can vary across applications, ACiS support in this method must be programmable.
In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information and monitor application behavior and then use this information for control, decision-making, and coordination of nodes.
We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, when considering a Graph Convolutional Network (GCN) application as a case study, an average speedup of 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities.