Reconfigurable acceleration of Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences, such as natural language processing, speech recognition and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies of RNN computations can lead to system stalls, resulting in low throughput and high latency.
This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, taking into account the data dependencies and high computational cost of RNNs.
The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy that supports weight matrices with large dimensions while enabling both element-based and vector-based parallelism. The presented reconfigurable RNN designs show significant speedups over CPU, GPU and other FPGA designs.
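As a rough sketch of the idea (NumPy-style Python, not code from the thesis), the column-wise form accumulates the output as a running sum of scaled columns, so useful work can start as soon as individual elements of the previous hidden state arrive, whereas the row-wise form needs the complete vector before any dot product can finish:

import numpy as np

def mv_row_wise(W, x):
    # Row-wise: each output element is a dot product with the *entire* input,
    # so nothing can complete until x (e.g. the previous hidden state) is fully ready.
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        y[i] = np.dot(W[i, :], x)
    return y

def mv_column_wise(W, x):
    # Column-wise: the result is accumulated one scaled column at a time,
    # so the work for column j can start as soon as x[j] is known.
    y = np.zeros(W.shape[0])
    for j in range(W.shape[1]):
        y += W[:, j] * x[j]
    return y

W, x = np.random.rand(4, 3), np.random.rand(3)
assert np.allclose(mv_row_wise(W, x), mv_column_wise(W, x))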
The second contribution of this thesis is a weight reuse approach for large RNN models whose weights are stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. A performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. Promising power-efficiency improvements are achieved in addition to speedups over CPU and GPU designs.
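A minimal sketch of the weight-reuse idea (illustrative Python; the block sizes and the grouping of independent inputs are chosen here purely for exposition, not taken from the thesis): each weight block is fetched from off-chip memory once and applied to every pending input before the next block is loaded, amortizing the memory traffic:

import numpy as np

def blocked_batched_matvec(W, xs, block_rows=64, block_cols=64):
    # Fetch each weight block once and reuse it across all inputs in 'xs'
    # (e.g. independent sequences or time steps), instead of re-reading
    # the whole weight matrix for every input.
    n_out, n_in = W.shape
    ys = np.zeros((len(xs), n_out))
    for r in range(0, n_out, block_rows):
        for c in range(0, n_in, block_cols):
            W_block = W[r:r + block_rows, c:c + block_cols]   # one off-chip fetch
            for b, x in enumerate(xs):                        # reused many times
                ys[b, r:r + block_rows] += W_block @ x[c:c + block_cols]
    return ys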
The third contribution of this thesis is a low-latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances the initiation intervals of multi-layer RNN inference to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection.
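As a purely illustrative sketch (the function and numbers below are assumptions, not the thesis design), balancing initiation intervals amounts to choosing per-layer parallelism so that no single layer's interval dominates the pipeline, since the slowest stage sets the throughput of the whole multi-layer design:

import math

def balance_parallelism(layer_ops, target_ii):
    # Pick a parallelism factor per layer so each layer's initiation interval
    # (operations / parallelism) stays at or below the target; the pipeline's
    # effective II is the worst per-layer II.
    factors = [math.ceil(ops / target_ii) for ops in layer_ops]
    iis = [math.ceil(ops / f) for ops, f in zip(layer_ops, factors)]
    return factors, max(iis)

# e.g. three stacked RNN layers with different amounts of work per time step
factors, pipeline_ii = balance_parallelism([4096, 2048, 1024], target_ii=16)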
To facilitate the use of this approach, we open source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.
Dataflow Programming and Acceleration of Computationally-Intensive Algorithms
The volume of unstructured textual information continues to grow due to recent technological advancements. This has resulted in an exponential growth of information generated in various formats, including blogs, posts, social networking, and enterprise documents. Numerous Enterprise Architecture (EA) documents are also created daily, such as reports, contracts, agreements, frameworks, architecture requirements, designs, and operational guides. Processing and computing this massive amount of unstructured information necessitates substantial computing capabilities and new techniques. It is critical to manage this unstructured information through a centralized knowledge management platform. Knowledge management is the process of managing information within an organization; it involves creating, collecting, organizing, and storing information in a way that makes it easily accessible and usable. The research involved the development of a textual knowledge management system, and two use cases were considered for extracting textual knowledge from documents.

The first case study focused on the safety-critical documents of a railway enterprise. Safety is of paramount importance in the railway industry, and several EA documents, including manuals, operational procedures, and technical guidelines, contain critical information. Digitalization of these documents is essential for analysing the vast amount of textual knowledge they contain to improve the safety and security of railway operations. A case study was conducted between the University of Huddersfield and the Railway Safety Standard Board (RSSB) to analyse EA safety documents using natural language processing (NLP). A graphical user interface was developed that includes various document processing features such as semantic search, document mapping, text summarization, and visualization of key trends.

For the second case study, open-source data was utilized and textual knowledge was extracted. Several features were also developed, including kernel distribution, analysis of key trends, and sentiment analysis of words (such as unique, positive, and negative words) within the documents. Additionally, a heterogeneous framework was designed using CPUs/GPUs and FPGAs to analyse the computational performance of document mapping.
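As a simplified illustration of the word-level analysis described above (the tiny lexicons here are placeholders, not the ones used in the study), counting unique, positive and negative words per document is a straightforward pass over its tokens:

from collections import Counter
import re

# Placeholder lexicons for illustration only; a real study would use a full sentiment lexicon.
POSITIVE = {"safe", "improve", "reliable", "effective"}
NEGATIVE = {"hazard", "failure", "risk", "delay"}

def word_level_stats(text):
    # Tokenize, then count unique words and matches against each lexicon.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {
        "unique_words": len(counts),
        "positive": sum(c for w, c in counts.items() if w in POSITIVE),
        "negative": sum(c for w, c in counts.items() if w in NEGATIVE),
        "top_terms": counts.most_common(5),
    }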
LL-GNN: Low Latency Graph Neural Networks on FPGAs for High Energy Physics
This work presents a novel reconfigurable architecture for Low Latency Graph
Neural Network (LL-GNN) designs for particle detectors, delivering
unprecedented low latency performance. Incorporating FPGA-based GNNs into
particle detectors presents a unique challenge since it requires
sub-microsecond latency to deploy the networks for online event selection with
a data rate of hundreds of terabytes per second in the Level-1 triggers at the
CERN Large Hadron Collider experiments. This paper proposes a novel
outer-product based matrix multiplication approach, which is enhanced by
exploiting the structured adjacency matrix and a column-major data layout.
Moreover, a fusion step is introduced to further reduce the end-to-end design
latency by eliminating unnecessary boundaries. Furthermore, a GNN-specific
algorithm-hardware co-design approach is presented which not only finds a
design with much lower latency but also finds a high-accuracy design under
given latency constraints. To facilitate this, a customizable template for this
low latency GNN hardware architecture has been designed and open-sourced, which
enables the generation of low-latency FPGA designs with efficient resource
utilization using a high-level synthesis tool. Evaluation results show that our
FPGA implementation is up to 9.0 times faster and achieves up to 13.1 times
higher power efficiency than a GPU implementation. Compared to the previous
FPGA implementations, this work achieves 6.51 to 16.7 times lower latency.
Moreover, the latency of our FPGA design is sufficiently low to enable
deployment of GNNs in a sub-microsecond, real-time collider trigger system,
enabling it to benefit from improved accuracy. The proposed LL-GNN design
advances the next generation of trigger systems by enabling sophisticated
algorithms to process experimental data efficiently.

Comment: This paper has been accepted by ACM Transactions on Embedded Computing Systems (TECS).
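A minimal sketch of the outer-product formulation (NumPy-style Python; the structured-sparsity pruning and column-major memory layout of the actual hardware design are not modelled here): the product is built as a sum of rank-1 updates, one per column of the left operand, which is what makes a column-major layout and a structured adjacency matrix so effective:

import numpy as np

def outer_product_matmul(A, B):
    # C = A @ B as a sum of rank-1 outer products, one per column of A / row of B.
    # With a structured (sparse) adjacency matrix, many of these terms can be skipped.
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for t in range(k):
        C += np.outer(A[:, t], B[t, :])
    return C

A, B = np.random.rand(5, 3), np.random.rand(3, 4)
assert np.allclose(outer_product_matmul(A, B), A @ B)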
Efficient machine learning: models and accelerations
One of the key enablers of the recent unprecedented success of machine learning is the adoption of very large models. Modern machine learning models typically consist of multiple cascaded layers, such as deep neural networks, and contain from millions to hundreds of millions of parameters (i.e., weights) for the entire model. Larger-scale models tend to enable the extraction of more complex high-level features and therefore lead to a significant improvement in overall accuracy. On the other hand, the deep layered structure and large model sizes also increase the computational and memory requirements. In order to achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have attracted enormous interest. The first trend is acceleration, while the second is model compression. The underlying goal of both trends is to maintain model quality so that predictions remain accurate. In this thesis, we address these two problems and utilize different computing paradigms to solve real-life deep learning problems.
To explore these two domains, this thesis first presents the cogent confabulation network for the sentence completion problem. We use the Chinese language as a case study to describe our exploration of cogent confabulation based text recognition models. The exploration and optimization of the cogent confabulation based models were conducted through various comparisons, and the optimized network offered better accuracy for sentence completion. To accelerate sentence completion on a multi-processing system, we propose a parallel framework for the confabulation recall algorithm. The parallel implementation reduces runtime, improves recall accuracy by breaking the fixed evaluation order and introducing more generalization, and maintains balanced progress in status updates among all neurons. A lexicon scheduling algorithm is presented to further improve the model performance.
As deep neural networks have proven effective in many real-life applications and are increasingly deployed on low-power devices, we then investigated accelerating neural network inference using a hardware-friendly computing paradigm: stochastic computing. This approximate computing paradigm requires a small hardware footprint and achieves high energy efficiency. Applying stochastic computing to deep convolutional neural networks, we design the functional hardware blocks and optimize them jointly to minimize the accuracy loss caused by the approximation. The synthesis results show that the proposed design achieves remarkably low hardware cost and power/energy consumption.
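A minimal sketch of the stochastic computing primitive this relies on (illustrative Python, not the hardware blocks of the thesis): values in [0, 1] are encoded as the probability of a 1 in a bitstream, so a single AND gate multiplies two unipolar values, and longer streams trade latency for accuracy:

import numpy as np

def to_bitstream(p, length, rng):
    # Encode a value in [0, 1] as a random bitstream with P(bit = 1) = p.
    return rng.random(length) < p

def sc_multiply(a, b, length=4096, seed=0):
    # Unipolar stochastic multiplication: the bitwise AND of two independent
    # streams has a mean that approximates a * b.
    rng = np.random.default_rng(seed)
    return np.mean(to_bitstream(a, length, rng) & to_bitstream(b, length, rng))

print(sc_multiply(0.6, 0.5))   # close to 0.30; longer streams shrink the error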
Modern neural networks usually contain a huge number of parameters, which cannot fit into embedded devices. Compression of deep learning models, together with acceleration, therefore attracts our attention. We introduce structured-matrix-based neural networks to address this problem. The circulant matrix is one such structure: the whole matrix can be represented by a single vector, so the matrix is compressed. We further investigate a more flexible structure based on the circulant matrix, called the block-circulant matrix. It partitions a matrix into several smaller blocks and makes each submatrix circulant, so the compression ratio is controllable. With the help of Fourier-transform-based equivalent computation, deep neural network inference can be accelerated energy-efficiently on FPGAs. We also optimize the training algorithm for block-circulant-matrix-based neural networks to obtain high accuracy after compression.
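A minimal sketch of the Fourier-transform-based equivalent computation for one circulant block (NumPy-style Python; block partitioning simply repeats this per block): multiplying by a circulant matrix is a circular convolution with its defining vector, so it reduces to element-wise multiplication in the frequency domain, storing n weights instead of n^2:

import numpy as np

def circulant_matvec_fft(c, x):
    # y = C @ x for the circulant matrix whose first column is c,
    # computed as IFFT(FFT(c) * FFT(x)): O(n log n) work, n stored weights.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

n = 8
c, x = np.random.rand(n), np.random.rand(n)
C = np.column_stack([np.roll(c, j) for j in range(n)])   # explicit circulant, for checking
assert np.allclose(C @ x, circulant_matvec_fft(c, x))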
Intelligent Computing: The Latest Advances, Challenges and Future
Computing is a critical driving force in the development of human
civilization. In recent years, we have witnessed the emergence of intelligent
computing, a new computing paradigm that is reshaping traditional computing and
promoting digital revolution in the era of big data, artificial intelligence
and internet-of-things with new computing theories, architectures, methods,
systems, and applications. Intelligent computing has greatly broadened the
scope of computing, extending it from traditional computing on data to
increasingly diverse computing paradigms such as perceptual intelligence,
cognitive intelligence, autonomous intelligence, and human-computer fusion
intelligence. Intelligence and computing have undergone paths of different
evolution and development for a long time but have become increasingly
intertwined in recent years: intelligent computing is not only
intelligence-oriented but also intelligence-driven. Such cross-fertilization
has prompted the emergence and rapid advancement of intelligent computing.
Intelligent computing is still in its infancy and an abundance of innovations
in the theories, systems, and applications of intelligent computing are
expected to occur soon. We present the first comprehensive survey of literature
on intelligent computing, covering its theory fundamentals, the technological
fusion of intelligence and computing, important applications, challenges, and
future perspectives. We believe that this survey is highly timely and will
provide a comprehensive reference and cast valuable insights into intelligent
computing for academic and industrial researchers and practitioners.
Brain-Inspired Computing
This open access book constitutes revised selected papers from the 4th International Workshop on Brain-Inspired Computing, BrainComp 2019, held in Cetraro, Italy, in July 2019. The 11 papers presented in this volume were carefully reviewed and selected for inclusion in this book. They deal with research on brain atlasing, multi-scale models and simulation, HPC and data infrastructures for neuroscience, as well as artificial and natural neural architectures.
Acceleration for the many, not the few
Although specialized hardware promises orders-of-magnitude performance gains, its
uptake has been limited by how challenging it is to program. Hardware accelerators
present challenges programmers are not used to, exposing details of the hardware that
are often hidden and requiring new programming styles to use them effectively.
Existing programming models often involve learning complex and hardware-specific
APIs, using Domain Specific Languages (DSLs), or programming in customized assembly languages. These programming models for hardware accelerators present a
significant challenge to uptake: a steep, unforgiving, and untransferable learning curve.
However, programming hardware accelerators using traditional programming models
presents a challenge: mapping code not written with hardware accelerators in mind to
accelerators with restricted behaviour.
This thesis frames these challenges in the context of the acceleration equation and
presents solutions in three different contexts: regular expression accelerators,
API-programmable accelerators (with Fourier Transforms as a key case study) and
heterogeneous coarse-grained reconfigurable arrays (CGRAs). This thesis shows
that automatically morphing software written in traditional manners to fit hardware
accelerators is possible with no programmer effort and that huge potential speedups are
available.
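As a toy illustration of the kind of rewrite an API-programmable accelerator needs (Python, with np.fft.fft standing in for an accelerator's API call; this is not code from the thesis), a hand-written DFT loop must be recognized and morphed into a single call to the accelerated routine:

import numpy as np

def dft_naive(x):
    # The kind of hand-written transform loop found in existing software.
    n = len(x)
    return np.array([sum(x[t] * np.exp(-2j * np.pi * k * t / n) for t in range(n))
                     for k in range(n)])

def dft_accelerated(x):
    # The accelerator-friendly form: one call into an FFT API.
    return np.fft.fft(x)

x = np.random.rand(16)
assert np.allclose(dft_naive(x), dft_accelerated(x))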
An Overlay Architecture for Pattern Matching
Deterministic and Non-deterministic Finite Automata (DFA and NFA) comprise the fundamental unit of work for many emerging big data applications, motivating recent efforts to develop Domain-Specific Architectures (DSAs) to exploit fine-grain parallelism available in automata workloads.
This dissertation presents NAPOLY (Non-Deterministic Automata Processor OverLaY), an overlay architecture and associated software that attempt to maximally exploit on-chip memory parallelism for NFA evaluation. To avoid the upper bound on NFA size that commonly affects prior efforts, NAPOLY is optimized for runtime reconfiguration, allowing for full reconfiguration in tens of microseconds. NAPOLY is also parameterizable, allowing for the offline generation of a repertoire of overlay configurations with various trade-offs between state capacity and transition capacity.
In this dissertation, we evaluate NAPOLY on automata applications packaged in the ANMLZoo benchmarks using our proposed state mapping heuristic and an off-the-shelf SAT solver. We compare NAPOLY's performance against existing CPU and GPU implementations. The results show that NAPOLY performs best for larger benchmarks with more active states and high report frequency, outperforming the best state-of-the-art CPU and GPU implementations in 10 out of the 12 benchmarks. To the best of our knowledge, this is the first example of a runtime-reprogrammable FPGA-based automata processor overlay.
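A software sketch of the parallelism NAPOLY exploits (illustrative Python; the transition table and reporting behaviour here are toy assumptions): every active state consumes the current input symbol at once, which in hardware maps to independent state units firing in parallel and in software is a set update per symbol:

def run_nfa(transitions, start, accepting, inp):
    # transitions maps (state, symbol) -> set of next states.
    # All active states process each symbol "simultaneously"; a report is
    # emitted whenever an accepting state becomes active.
    active, reports = {start}, []
    for pos, sym in enumerate(inp):
        active = set().union(*(transitions.get((s, sym), set()) for s in active))
        if active & accepting:
            reports.append(pos)
    return reports

# Toy automaton that reports every occurrence of the pattern "ab"
trans = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
print(run_nfa(trans, start=0, accepting={2}, inp="aabab"))   # [2, 4]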
Edge-Cloud Polarization and Collaboration: A Comprehensive Survey for AI
Influenced by the great success of deep learning via cloud computing and the
rapid development of edge chips, research in artificial intelligence (AI) has
shifted to both of the computing paradigms, i.e., cloud computing and edge
computing. In recent years, we have witnessed significant progress in
developing more advanced AI models on cloud servers that surpass traditional
deep learning models owing to model innovations (e.g., Transformers, Pretrained
families), explosion of training data and soaring computing capabilities.
However, edge computing, especially edge-cloud collaborative computing, is still
in its infancy, as resource-constrained IoT scenarios allow only very limited
algorithms to be deployed. In this survey, we conduct
a systematic review for both cloud and edge AI. Specifically, we are the first
to set up the collaborative learning mechanism for cloud and edge modeling with
a thorough review of the architectures that enable such a mechanism. We also
discuss the potential and practical experience of some ongoing advanced edge AI
topics including pretraining models, graph neural networks and reinforcement
learning. Finally, we discuss the promising directions and challenges in this
field.

Comment: 20 pages, Transactions on Knowledge and Data Engineering.
Deep learning that scales: leveraging compute and data
Deep learning has revolutionized the field of artificial intelligence in the past decade. Although the development of these techniques spans over several years, the recent advent of deep learning is explained by an increased availability of data and compute that have unlocked the potential of deep neural networks. They have become ubiquitous in domains such as natural language processing, computer vision, speech processing, and control, where enough training data is available. Recent years have seen continuous progress driven by ever-growing neural networks that benefited from large amounts of data and computing power.
This thesis is motivated by the observation that scale is one of the key factors driving progress in deep learning research, and aims at devising deep learning methods that scale gracefully with the available data and compute. We narrow down this scope into two main research directions. The first of them is concerned with designing hardware-aware methods which can make the most of the computing resources in current high performance computing facilities. We then study bottlenecks preventing existing methods from scaling up as more data becomes available, providing solutions that contribute towards enabling training of more complex models.
This dissertation studies the aforementioned research questions for two different learning paradigms, each with its own algorithmic and computational characteristics. The first part of this thesis studies the paradigm where the model needs to learn from a collection of examples, extracting as much information as possible from the given data. The second part is concerned with training agents that learn by interacting with a simulated environment, which introduces unique challenges such as efficient exploration and simulation.
- …