Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications such as Machine Learning, Speech
Recognition, Artificial Intelligence, and DNA Sequencing in recent years, the
computer architecture research community is facing an explosive growth in data
scale. To achieve high efficiency in data-intensive computing, heterogeneous
accelerators targeting these emerging applications have become a major
research focus in the computer architecture domain. At present, the
implementation of heterogeneous accelerators mainly relies on heterogeneous
computing units such as Application-specific Integrated Circuit (ASIC),
Graphics Processing Unit (GPU), and Field Programmable Gate Array (FPGA). Among
the typical heterogeneous architectures above, FPGA-based reconfigurable
accelerators have two merits. First, the FPGA fabric contains a large number
of reconfigurable circuits, which can satisfy the high-performance, low-power
requirements of specific applications. Second, FPGA-based reconfigurable
architectures enable rapid prototyping and offer excellent customizability and
reconfigurability. In recent years, top-tier computer architecture conferences
have featured a growing body of acceleration work targeting FPGAs and other
reconfigurable architectures. To review this recent work, this survey takes
the latest research on reconfigurable accelerator architectures and their
algorithm applications as its basis. We compare the main research issues and
application domains, and analyze the advantages, disadvantages, and challenges
of reconfigurable accelerators. Finally, we discuss likely future directions
for accelerator architectures, hoping to provide a reference for computer
architecture researchers.
Partitioning Large Scale Deep Belief Networks Using Dropout
Deep learning methods have shown great promise in many practical
applications, ranging from speech recognition, visual object recognition, to
text processing. However, most of the current deep learning methods suffer from
scalability problems for large-scale applications, forcing researchers or users
to focus on small-scale problems with fewer parameters.
In this paper, we consider a well-known machine learning model, deep belief
networks (DBNs), which have yielded impressive classification performance on a
large number of benchmark machine learning tasks. To scale up DBNs, we propose
an approach that uses computing clusters in a distributed environment to
train large models, while the dense matrix computations within a single machine
are sped up using graphics processing units (GPUs). When training a DBN, each machine
randomly drops out a portion of neurons in each hidden layer, for each training
case, making the remaining neurons only learn to detect features that are
generally helpful for producing the correct answer. Within our approach, we
have developed four methods to combine outcomes from each machine to form a
unified model. Our preliminary experiments on the MNIST handwritten digit
database demonstrate that our approach improves upon the state-of-the-art test
error rate.
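As an illustration of the per-example dropout step described above, here is a minimal NumPy sketch; it is not the authors' implementation, and the layer size and drop probability are arbitrary:

```python
import numpy as np

def dropout_forward(h, drop_prob=0.5, rng=None):
    """Randomly silence a fraction of hidden units for one training case.
    Returns the masked activations plus the mask, so backprop can route
    gradients only through the surviving units."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(h.shape) >= drop_prob  # keep each unit with prob 1 - drop_prob
    return h * mask, mask

# One training case; in the distributed setting, each machine draws its
# own independent mask per case before the per-machine models are combined.
h = np.random.default_rng(0).standard_normal(8)
h_dropped, mask = dropout_forward(h, drop_prob=0.5)
print(h_dropped)
```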
AI Benchmark: Running Deep Neural Networks on Android Smartphones
Over the last years, the computational power of mobile devices such as
smartphones and tablets has grown dramatically, reaching the level of desktop
computers available not long ago. While standard smartphone apps are no longer
a problem for them, there is still a group of tasks that can easily challenge
even high-end devices, namely running artificial intelligence algorithms. In
this paper, we present a study of the current state of deep learning in the
Android ecosystem and describe available frameworks, programming models and the
limitations of running AI on smartphones. We give an overview of the hardware
acceleration resources available on four main mobile chipset platforms:
Qualcomm, HiSilicon, MediaTek and Samsung. Additionally, we present the
real-world performance results of different mobile SoCs collected with AI
Benchmark, covering all main existing hardware configurations.
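For readers unfamiliar with the programming models the paper surveys, the following is a minimal sketch of on-device inference using the TensorFlow Lite Python interpreter; the model file name is a placeholder, and a real Android app would use the Java or NNAPI bindings instead:

```python
import numpy as np
import tensorflow as tf

# Load a converted .tflite model (placeholder file name) and run one
# inference on dummy data matching the model's expected input shape.
interpreter = tf.lite.Interpreter(model_path="mobilenet_v1.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

dummy = np.random.random_sample(inp["shape"]).astype(np.float32)  # e.g. [1, 224, 224, 3]
interpreter.set_tensor(inp["index"], dummy)
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])
print(scores.shape)
```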
FutureMapping: The Computational Structure of Spatial AI Systems
We discuss and predict the evolution of Simultaneous Localisation and Mapping
(SLAM) into a general geometric and semantic `Spatial AI' perception capability
for intelligent embodied devices. A big gap remains between the visual
perception performance that devices such as augmented reality eyewear or
consumer robots will require and what is possible within the constraints
imposed by real products. Co-design of algorithms, processors and sensors will
be needed. We explore the computational structure of current and future Spatial
AI algorithms and consider this within the landscape of ongoing hardware
developments.
Machine Learning in Compiler Optimisation
In the last decade, machine learning based compilation has moved from an
obscure research niche to a mainstream activity. In this article, we describe
the relationship between machine learning and compiler optimisation and
introduce the main concepts of features, models, training and deployment. We
then provide a comprehensive survey and a road map for the wide variety
of different research areas. We conclude with a discussion on open issues in
the area and potential research directions. This paper provides both an
accessible introduction to the fast moving area of machine learning based
compilation and a detailed bibliography of its main achievements.
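To make the features/models/training/deployment pipeline concrete, here is a minimal sketch of a learned compiler heuristic; the feature set, labels, and use of scikit-learn are illustrative assumptions, not taken from the article:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy feature vectors: [num_loops, avg_trip_count, num_loads, num_stores].
# Labels: 1 if loop unrolling was profitable on this kernel, else 0.
# Both the features and the labels here are fabricated for illustration.
X = [
    [1, 1000, 40, 10],
    [3,    8, 12,  6],
    [2,  200, 25,  5],
    [4,    4,  8,  8],
]
y = [1, 0, 1, 0]

# Training: fit a model on measured (features, best-decision) pairs.
model = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Deployment: extract the same features from an unseen kernel, then let
# the model make the decision instead of a hand-written heuristic.
new_kernel = [[2, 500, 30, 4]]
print("unroll?", bool(model.predict(new_kernel)[0]))
```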
Automation of Processor Verification Using Recurrent Neural Networks
When considering simulation-based verification of processors, the current
trend is to generate stimuli using pseudorandom generators (PRGs), apply them
to the processor inputs and monitor the achieved coverage of its functionality
in order to determine verification completeness. Stimuli can have different
forms, for example, they can be represented by bit vectors applied to the input
ports of the processor or by programs that are loaded directly into the program
memory. In this paper, we propose a new technique that dynamically alters the
constraints for the PRG via a recurrent neural network, which receives
coverage feedback from the simulation of the design under verification. For
demonstration purposes, we used processors provided by Codasip, as their
coverage state space is reasonably large and differs across processor types.
Nevertheless, techniques presented in this paper are widely applicable. The
results of experiments show that not only the coverage closure is achieved much
sooner, but we are able to isolate a small set of stimuli with high coverage
that can be used for running regression tests.
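A minimal sketch of the coverage-feedback loop follows; the generator and simulator are trivial stand-ins, and a simple reinforcement rule replaces the recurrent neural network (all names, encodings, and numbers are hypothetical):

```python
import random

def generate_stimulus(bias):
    """Stand-in for the constrained PRG: 'bias' skews which of ten
    hypothetical instruction classes get emitted."""
    return [random.choices(range(10), weights=bias)[0] for _ in range(20)]

def simulate(stimulus):
    """Stand-in for simulating the design under verification: maps a
    stimulus to the set of coverage points it hits (hypothetical)."""
    return {(op * 13 + i) % 100 for i, op in enumerate(stimulus)}

bias = [1.0] * 10                  # start from uniform PRG constraints
covered, keepers = set(), []
while len(covered) < 95:           # loop until near coverage closure
    stim = generate_stimulus(bias)
    hits = simulate(stim)
    if hits - covered:             # stimulus reached new coverage points:
        keepers.append(stim)       # keep it for the regression suite
        # Feedback step: the paper feeds coverage to an RNN that emits new
        # PRG constraints; this toy version just reinforces helpful ops.
        for op in stim:
            bias[op] += 0.1
    covered |= hits

print(len(keepers), "stimuli isolated;", len(covered), "points covered")
```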
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies, and availability of credible
data, an area of artificial intelligence, deep learning, has emerged, and has
demonstrated its ability and effectiveness in solving complex learning problems
not possible before. In particular, convolutional neural networks (CNNs) have
demonstrated their effectiveness in image detection and recognition
applications. However, they require intensive computation and memory
bandwidth, and general-purpose CPUs fail to achieve the desired performance levels.
Consequently, hardware accelerators that use application specific integrated
circuits (ASICs), field programmable gate arrays (FPGAs), and graphic
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have been recently adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism as well as due to their energy efficiency. In this paper, we review
recent techniques for accelerating deep learning networks on FPGAs. We
highlight the key features employed by the various techniques for improving the
acceleration performance. In addition, we provide recommendations for enhancing
the utilization of FPGAs for CNNs acceleration. The techniques investigated in
this paper represent the recent trends in FPGA-based accelerators of deep
learning networks. Thus, this review is expected to direct the future advances
on efficient hardware accelerators and to be useful for deep learning
researchers.
DNNVM : End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
The convolutional neural network (CNN) has become a state-of-the-art method
for several artificial intelligence domains in recent years. The increasingly
complex CNN models are both computation-bound and I/O-bound. FPGA-based
accelerators driven by custom instruction set architecture (ISA) achieve a
balance between generality and efficiency, but much room for optimization
remains. We propose the full-stack compiler DNNVM, which is an integration of
optimizers for graphs, loops and data layouts, and an assembler, a runtime
supporter and a validation environment. The DNNVM works in the context of deep
learning frameworks and transforms CNN models into the directed acyclic graph:
XGraph. Based on XGraph, we transform the optimization challenges for both the
data layout and pipeline into graph-level problems. DNNVM enumerates all
potentially profitable fusion opportunities by a heuristic subgraph isomorphism
algorithm to leverage pipeline and data layout optimizations, and searches for
the best choice of execution strategies of the whole computing graph. On the
Xilinx ZU2 @330 MHz and ZU9 @330 MHz, naïve implementations without
optimizations already achieve performance equivalent to the state of the art
on our benchmarks, and the throughput is further improved up to 1.26x by leveraging heterogeneous
optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art
performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an
energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38
TOPs/s for ResNet50 and 1.41 TOPs/s for GoogLeNet.
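To illustrate the fusion-enumeration idea, here is a minimal sketch over a toy operator DAG; the templates, costs, and greedy chain matcher are illustrative stand-ins for DNNVM's heuristic subgraph isomorphism and strategy search, not its actual algorithm:

```python
# Toy operator DAG: node -> op type, and dataflow successors.
ops = {"c1": "conv", "r1": "relu", "p1": "pool", "c2": "conv", "r2": "relu"}
succ = {"c1": ["r1"], "r1": ["p1"], "p1": ["c2"], "c2": ["r2"], "r2": []}

# Fusion templates a backend might support, with hypothetical costs;
# a fused chain avoids writing intermediate feature maps to memory.
TEMPLATES = {("conv", "relu", "pool"): 2.0, ("conv", "relu"): 2.5}
SINGLE_COST = 2.0  # hypothetical cost of one op executed unfused

def match_chain(node):
    """Greedy linear-chain matcher: a toy stand-in for DNNVM's heuristic
    subgraph isomorphism over XGraph."""
    for template, cost in TEMPLATES.items():
        group, cur = [node], node
        for want in template[1:]:
            nxt = succ[cur]
            if len(nxt) == 1 and ops[nxt[0]] == want:
                cur = nxt[0]
                group.append(cur)
            else:
                break
        if tuple(ops[n] for n in group) == template:
            return group, cost
    return [node], SINGLE_COST

# Cover the graph in topological order, preferring fused groups.
schedule, done = [], set()
for node in ["c1", "r1", "p1", "c2", "r2"]:  # topological order
    if node not in done:
        group, cost = match_chain(node)
        schedule.append((group, cost))
        done.update(group)

print(schedule)  # [(['c1', 'r1', 'p1'], 2.0), (['c2', 'r2'], 2.5)]
```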
Optimal DNN Primitive Selection with Partitioned Boolean Quadratic Programming
Deep Neural Networks (DNNs) require very large amounts of computation both
for training and for inference when deployed in the field. Many different
algorithms have been proposed to implement the most computationally expensive
layers of DNNs. Further, each of these algorithms has a large number of
variants, which offer different trade-offs of parallelism, data locality,
memory footprint, and execution time. In addition, specific algorithms operate
much more efficiently on specialized data layouts and formats.
We state the problem of optimal primitive selection in the presence of data
format transformations, and show that it is NP-hard by demonstrating an
embedding in the Partitioned Boolean Quadratic Programming (PBQP) problem.
We propose an analytic solution via a PBQP solver, and evaluate our approach
experimentally by optimizing several popular DNNs using a library of more than
70 DNN primitives, on an embedded platform and a general purpose platform. We
show experimentally that significant gains are possible versus the state of the
art vendor libraries by using a principled analytic solution to the problem of
layout selection in the presence of data format transformations.
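To make the formulation concrete, consider the special case of a linear chain of layers, where the PBQP instance collapses to Viterbi-style dynamic programming (the general graph structure is what makes the problem hard). All primitive names, layouts, and costs below are hypothetical:

```python
# Candidate primitives per layer: name -> (runtime, output data layout).
layers = [
    {"im2col_gemm": (5.0, "NCHW"), "winograd": (3.0, "NCHWc")},
    {"direct":      (4.0, "NCHW"), "winograd": (3.5, "NCHWc")},
    {"im2col_gemm": (4.5, "NCHW"), "fft":      (3.8, "freq")},
]

def convert_cost(src_fmt, dst_fmt):
    """Edge cost of the PBQP instance: price of a data format
    transformation between consecutive layers (hypothetical flat cost)."""
    return 0.0 if src_fmt == dst_fmt else 1.0

# state = (total cost so far, current output layout, primitives chosen)
states = [(rt, fmt, [name]) for name, (rt, fmt) in layers[0].items()]
for layer in layers[1:]:
    states = [
        min((cost + convert_cost(fmt, new_fmt) + rt, new_fmt, path + [name])
            for cost, fmt, path in states)
        for name, (rt, new_fmt) in layer.items()
    ]

total, _, picks = min(states)
print(picks, total)  # ['winograd', 'winograd', 'fft'] 11.3
```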
Exploring the Vision Processing Unit as Co-processor for Inference
The success of exascale supercomputing is widely argued to depend on novel
technological breakthroughs that effectively reduce power consumption and
thermal dissipation requirements. In this work, we
consider the integration of co-processors in high-performance computing (HPC)
to enable low-power, seamless computation offloading of certain operations. In
particular, we explore the so-called Vision Processing Unit (VPU), a
highly-parallel vector processor with a power envelope of less than 1W. We
evaluate this chip during inference using a pre-trained GoogLeNet convolutional
network model and a large image dataset from the ImageNet ILSVRC challenge.
Preliminary results indicate that a multi-VPU configuration provides similar
performance compared to reference CPU and GPU implementations, while reducing
the thermal-design power (TDP) up to 8x in comparison.
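A back-of-the-envelope illustration of that comparison follows; the numbers are invented placeholders chosen to match the paper's headline ratio, not its measurements:

```python
# Hypothetical reference accelerator vs. a multi-VPU configuration at
# similar aggregate throughput; only the ratios matter here.
ref_fps, ref_tdp_w = 120.0, 64.0             # invented reference numbers
vpu_fps, vpu_tdp_w, n_vpus = 120.0, 1.0, 8   # sub-1W envelope, 8 chips

print(ref_fps / ref_tdp_w)               # ~1.9 inferences/s per watt
print(vpu_fps / (n_vpus * vpu_tdp_w))    # 15 inferences/s per watt
print(ref_tdp_w / (n_vpus * vpu_tdp_w))  # 8x TDP reduction ("up to 8x")
```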