49,685 research outputs found
TensorIR: An Abstraction for Automatic Tensorized Program Optimization
Deploying deep learning models on various devices has become an important
topic. The wave of hardware specialization brings a diverse set of acceleration
primitives for multi-dimensional tensor computations. These new acceleration
primitives, along with the emerging machine learning models, bring tremendous
engineering challenges. In this paper, we present TensorIR, a compiler
abstraction for optimizing programs with these tensor computation primitives.
TensorIR generalizes the loop nest representation used in existing machine
learning compilers to make tensor computation a first-class citizen.
Finally, we build an end-to-end framework on top of our abstraction to
automatically optimize deep learning models for given tensor computation
primitives. Experimental results show that TensorIR compilation automatically
uses the tensor computation primitives for given hardware backends and delivers
performance that is competitive with state-of-the-art hand-optimized systems across
platforms. Comment: Accepted to ASPLOS 202
SYSTEM-ON-A-CHIP (SOC)-BASED HARDWARE ACCELERATION FOR HUMAN ACTION RECOGNITION WITH CORE COMPONENTS
Today, the implementation of machine vision algorithms on embedded platforms or in portable systems is growing rapidly due to the demand for machine vision in daily human life. Among the applications of machine vision, human action and activity recognition has become an active research area, and market demand for integrated smart security systems is growing rapidly. Among the available approaches, embedded vision is in the top tier; however, current embedded platforms may not be able to fully exploit the potential performance of machine vision algorithms, especially under low power budgets. Complex algorithms can impose immense computation and communication demands, especially action recognition algorithms, which require various stages of preprocessing, processing and machine learning blocks that need to operate concurrently. The market demands embedded platforms that operate with a power consumption of only a few watts. Attempts have been made to improve the performance of traditional embedded approaches by adding more powerful processors; this solution may solve the computation problem but increases the power consumption. System-on-a-chip field-programmable gate arrays (SoC-FPGAs) have emerged as a major architectural approach for improving power efficiency while increasing computational performance. In a SoC-FPGA, an embedded processor and an FPGA serving as an accelerator are fabricated on the same die to simultaneously improve power consumption and performance. Still, current SoC-FPGA-based vision implementations either shy away from supporting complex and adaptive vision algorithms or operate at very limited resolutions due to the immense communication and computation demands. The aim of this research is to develop a SoC-based hardware acceleration workflow for the realization of advanced vision algorithms. Hardware acceleration can improve performance for highly complex mathematical calculations or repeated functions. 
The performance of a SoC system can thus be improved by using a hardware acceleration method to accelerate the element that incurs the highest performance overhead. The outcome of this research could be used for the implementation of various vision algorithms, such as face recognition, object detection or object tracking, on embedded platforms. The contributions of SoC-based hardware acceleration for hardware-software codesign platforms include the following: (1) development of frameworks for complex human action recognition in both 2D and 3D; (2) realization of a framework with four main implemented IPs, namely, foreground and background subtraction (foreground probability), human detection, 2D/3D point-of-interest detection and feature extraction, and OS-ELM as a machine learning algorithm for action identification; (3) use of an FPGA-based hardware acceleration method to resolve system bottlenecks and improve system performance; and (4) measurement and analysis of system specifications, such as the acceleration factor, power consumption, and resource utilization. Experimental results show that the proposed SoC-based hardware acceleration approach outperforms all recent works in terms of the acceleration factor, resource utilization and power consumption. In addition, a comparison of the accuracy of the framework that runs on the proposed embedded platform (SoC-FPGA) with the accuracy of other PC-based frameworks shows that the proposed approach outperforms most other approaches.
Neural Network Methods for Radiation Detectors and Imaging
Recent advances in image data processing through machine learning and
especially deep neural networks (DNNs) allow for new optimization and
performance-enhancement schemes for radiation detectors and imaging hardware
through data-endowed artificial intelligence. We give an overview of data
generation at photon sources, deep learning-based methods for image processing
tasks, and hardware solutions for deep learning acceleration. Most existing
deep learning approaches are trained offline, typically using large amounts of
computational resources. However, once trained, DNNs can achieve fast inference
speeds and can be deployed to edge devices. A new trend is edge computing with
less energy consumption (hundreds of watts or less) and real-time analysis
potential. While popularly used for edge computing, electronic-based hardware
accelerators ranging from general purpose processors such as central processing
units (CPUs) to application-specific integrated circuits (ASICs) are constantly
reaching performance limits in latency, energy consumption, and other physical
constraints. These limits give rise to next-generation analog neuromorphic
hardware platforms, such as optical neural networks (ONNs), for highly parallel,
low-latency, and low-energy computing to boost deep learning acceleration.
Reconfigurable Architectures for Hardware Acceleration of Machine Learning Classifiers
This thesis proposes universal coarse-grained reconfigurable computing architectures for hardware implementation of decision trees (DTs), artificial neural networks (ANNs), support vector machines (SVMs), and homogeneous and heterogeneous ensemble classifiers (HHESs). Using these universal architectures, two versions of DTs, two versions of SVMs, two versions of ANNs, and seven versions of HHES machine learning classifiers have been implemented in field-programmable gate arrays (FPGAs). 
Experimental results, based on datasets from the standard UCI Machine Learning Repository, show that FPGA implementation provides a significant improvement (1–6 orders of magnitude) in the average instance classification time, in comparison with software implementations.
FPGA-accelerated machine learning inference as a service for particle physics computing
New heterogeneous computing paradigms on dedicated hardware with increased
parallelization, such as Field Programmable Gate Arrays (FPGAs), offer exciting
solutions with large potential gains. The growing applications of machine
learning algorithms in particle physics for simulation, reconstruction, and
analysis are naturally deployed on such platforms. We demonstrate that the
acceleration of machine learning inference as a web service represents a
heterogeneous computing solution for particle physics experiments that
potentially requires minimal modification to the current computing model. As
examples, we retrain the ResNet-50 convolutional neural network to demonstrate
state-of-the-art performance for top quark jet tagging at the LHC and apply a
ResNet-50 model with transfer learning for neutrino event classification. Using
Project Brainwave by Microsoft to accelerate the ResNet-50 image classification
model, we achieve average inference times of 60 (10) milliseconds with our
experimental physics software framework using Brainwave as a cloud (edge or
on-premises) service, representing an improvement by a factor of approximately
30 (175) in model inference latency over traditional CPU inference in current
experimental hardware. A single FPGA service accessed by many CPUs achieves a
throughput of 600–700 inferences per second using an image batch of one,
comparable to large batch-size GPU throughput and significantly better than
small batch-size GPU throughput. Deployed as an edge or cloud service for the
particle physics computing model, coprocessor accelerators can have a higher
duty cycle and are potentially much more cost-effective. Comment: 16 pages, 14 figures, 2 tables
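The latency figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is not from the paper: it assumes that "improvement by a factor of N" means CPU latency divided by accelerated latency, and the function and variable names are hypothetical. Under that assumption, both reported pairs imply a CPU inference latency of a little under two seconds, so the two speedup factors appear mutually consistent.

```python
# Figures quoted in the abstract: (accelerated latency in ms, speedup over CPU).
# Both values are stated as approximate in the source.
reported = {
    "cloud": (60.0, 30.0),
    "edge": (10.0, 175.0),
}


def implied_cpu_latency_ms(accel_latency_ms: float, speedup: float) -> float:
    """CPU latency implied by an accelerated latency and its speedup factor.

    Assumes speedup = cpu_latency / accelerated_latency (an assumption,
    not a definition taken from the paper).
    """
    return accel_latency_ms * speedup


implied = {
    name: implied_cpu_latency_ms(latency, speedup)
    for name, (latency, speedup) in reported.items()
}

for name, cpu_ms in implied.items():
    print(f"{name}: implied CPU latency of about {cpu_ms / 1000:.2f} s")
```

Because the abstract gives rounded values, the small gap between the two implied CPU latencies should not be read as a measured difference.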
Algorithm Optimization and Hardware Acceleration for Machine Learning Applications on Low-energy Systems
Machine learning (ML) has been extensively employed for strategy optimization, decision making, data classification, etc. While ML has been highly successful in these fields, the increasing complexity of learning models introduces new challenges for ML system design. On the one hand, high computational complexity and memory requirements hinder the use of ML on resource-restricted terminals, such as mobile and IoT devices. On the other hand, the massive number of parameters in modern ML models places extra demands on the system's I/O speed and memory size. This dissertation investigates feasible solutions to these challenges through software-hardware co-design.
DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack
Deep Neural Networks (DNNs) are extremely computationally demanding, which
presents a large barrier to their deployment on resource-constrained devices.
Since such devices are where many emerging deep learning applications lie
(e.g., drones, vision-based medical technology), significant bodies of work
from both the machine learning and systems communities have attempted to
provide optimizations to accelerate DNNs. To help unify these two perspectives,
in this paper we combine machine learning and systems techniques within the
Deep Learning Acceleration Stack (DLAS), and demonstrate how these layers can
be tightly dependent on each other with an across-stack perturbation study. We
evaluate the impact on accuracy and inference time when varying different
parameters of DLAS across two datasets, seven popular DNN architectures, four
DNN compression techniques, three algorithmic primitives with sparse and dense
variants, untuned and auto-scheduled code generation, and four hardware
platforms. Our evaluation highlights how perturbations across DLAS parameters
can cause significant variation and across-stack interactions. The highest-level
observation from our evaluation is that the model size, accuracy, and
inference time are not guaranteed to be correlated. Overall we make 13 key
observations, including that speedups provided by compression techniques are
very hardware dependent, and that compiler auto-tuning can significantly alter
what the best algorithm to use for a given configuration is. With DLAS, we aim
to provide a reference framework to aid machine learning and systems
practitioners in reasoning about the context in which their respective DNN
acceleration solutions exist. With our evaluation strongly motivating the
need for co-design, we believe that DLAS can be a valuable concept for
exploring the next generation of co-designed accelerated deep learning
solutions.