Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the recent emergence of big data applications in machine learning, speech
recognition, artificial intelligence, and DNA sequencing, the computer
architecture research community is facing an explosive growth in data volume.
To achieve high efficiency in data-intensive computing, the study of
heterogeneous accelerators targeting these emerging applications has become a
hot topic in the computer architecture domain. At present, heterogeneous
accelerators are mainly implemented with heterogeneous computing units such as
Application-Specific Integrated Circuits (ASICs), Graphics Processing Units
(GPUs), and Field Programmable Gate Arrays (FPGAs). Among these typical
heterogeneous architectures, FPGA-based reconfigurable accelerators offer two
merits. First, an FPGA contains a large number of reconfigurable circuits,
which can satisfy the high-performance and low-power requirements of specific
applications. Second, FPGA-based reconfigurable architectures enable rapid
prototyping and feature excellent customizability and reconfigurability.
Nowadays, top-tier computer architecture conferences are seeing a wave of
acceleration work based on FPGAs and other reconfigurable architectures. To
review this recent work on reconfigurable computing accelerators, this survey
takes the latest high-level research on reconfigurable accelerator
architectures and their algorithmic applications as its basis. In this survey,
we compare hot research issues and application domains, and furthermore analyze
and illuminate the advantages, disadvantages, and challenges of reconfigurable
accelerators. Finally, we discuss future trends in accelerator architectures,
hoping to provide a reference for computer architecture researchers
Criteria and Approaches for Virtualization on Modern FPGAs
Modern field programmable gate arrays (FPGAs) can produce high performance in
a wide range of applications, and their computational capacity is becoming
abundant in personal computers. Despite this, FPGA virtualization remains an
emerging research field. Nowadays, the challenges in this area stem not only
from technical difficulties but also from ambiguous virtualization standards.
In this paper, we introduce novel criteria for FPGA virtualization and
discuss several approaches to fulfilling those criteria. In
addition, we present and describe in detail the specific FPGA virtualization
architecture that we developed on Intel Arria 10 FPGA. We evaluate our solution
with a combination of applications and microbenchmarks. The results show that
our virtualization solution provides a full abstraction of the FPGA device
from both user and developer perspectives while maintaining reasonable
performance compared to native FPGA
Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays
FPGA vendors have recently started focusing on OpenCL for FPGAs because of
its ability to leverage the parallelism inherent to heterogeneous computing
platforms. OpenCL allows programs running on a host computer to launch
accelerator kernels which can be compiled at run-time for a specific
architecture, thus enabling portability. However, the prohibitive compilation
times (specifically the FPGA place and route times) are a major stumbling block
when using OpenCL tools from FPGA vendors. The long compilation times mean that
the tools cannot effectively use just-in-time (JIT) compilation or runtime
performance scaling. Coarse-grained overlays represent a possible solution by
virtue of their coarse granularity and fast compilation. In this paper, we
present a methodology for run-time compilation of OpenCL kernels to a DSP block
based coarse-grained overlay, rather than directly to the fine-grained FPGA
fabric. The proposed methodology allows JIT compilation and on-demand
resource-aware kernel replication to better utilize available overlay
resources, raising the abstraction level while reducing compile times
significantly. We further demonstrate that this approach can even be used for
run-time compilation of OpenCL kernels on the ARM processor of the embedded
heterogeneous Zynq device.
Comment: Presented at the 3rd International Workshop on Overlay Architectures for FPGAs (OLAF 2017), arXiv:1704.0880
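The run-time flow described above, fast overlay compilation enabling JIT caching and resource-aware kernel replication, can be sketched as follows. This is a hypothetical illustration, not the authors' compiler: the class, method names, and the mock "compilation" step are all invented for the example.

```python
# Hypothetical sketch of resource-aware JIT dispatch for a coarse-grained
# overlay: compiled kernel configurations are cached after the first
# compilation, and a kernel is replicated across free overlay tiles on demand.

class OverlayJIT:
    def __init__(self, total_tiles):
        self.total_tiles = total_tiles
        self.cache = {}  # kernel source -> compiled overlay configuration

    def compile(self, src):
        # Fast overlay mapping replaces the hours-long fine-grained FPGA
        # place-and-route; here it is mocked as a trivial transformation.
        if src not in self.cache:
            self.cache[src] = f"overlay-config({src})"
        return self.cache[src]

    def launch(self, src, tiles_per_copy, free_tiles):
        cfg = self.compile(src)
        # Resource-aware replication: instantiate as many copies of the
        # kernel as the currently available overlay tiles allow.
        copies = min(free_tiles // tiles_per_copy,
                     self.total_tiles // tiles_per_copy)
        return [cfg] * copies

jit = OverlayJIT(total_tiles=16)
first = jit.launch("vec_add", tiles_per_copy=4, free_tiles=12)  # 3 replicas
second = jit.launch("vec_add", tiles_per_copy=4, free_tiles=8)  # cache hit, 2 replicas
```

The second launch skips compilation entirely, which is the property that makes JIT compilation and runtime performance scaling practical on an overlay.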
Polystore++: Accelerated Polystore System for Heterogeneous Workloads
Modern real-time business analytics consist of heterogeneous workloads (e.g.,
database queries, graph processing, and machine learning). These analytics
applications need programming environments that can capture all aspects of the
constituent workloads (including data models they work on and movement of data
across processing engines). Polystore systems suit such applications; however,
these systems currently execute on CPUs and the slowdown of Moore's Law means
they cannot meet the performance and efficiency requirements of modern
workloads. We envision Polystore++, an architecture to accelerate existing
polystore systems using hardware accelerators (e.g., FPGAs, CGRAs, and GPUs).
Polystore++ systems can achieve high performance at low power by identifying
and offloading components of a polystore system that are amenable to
acceleration using specialized hardware. Building a Polystore++ system is
challenging and introduces new research problems motivated by the use of
hardware accelerators (e.g., optimizing and mapping query plans across
heterogeneous computing units and exploiting hardware pipelining and
parallelism to improve performance). In this paper, we discuss these challenges
in detail and list possible approaches to address these problems.
Comment: 11 pages, Accepted in ICDCS 201
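The core idea above, identifying and offloading the components of a polystore that are amenable to acceleration, can be illustrated with a toy cost-based placement pass. This is a minimal sketch, not the Polystore++ implementation; the operators, engines, and cost numbers are invented assumptions.

```python
# Illustrative cost-based operator placement: each operator in a query plan
# is assigned to the processing engine with the lowest estimated cost,
# modelling the "identify and offload amenable components" idea.

COSTS = {
    # operator: {engine: estimated execution cost (arbitrary units)}
    "scan":      {"cpu": 1.0, "fpga": 0.4, "gpu": 0.8},
    "join":      {"cpu": 5.0, "fpga": 2.0, "gpu": 3.0},
    "train_ml":  {"cpu": 9.0, "fpga": 6.0, "gpu": 2.5},
    "graph_bfs": {"cpu": 4.0, "fpga": 3.5, "gpu": 1.5},
}

def place(plan):
    """Map each operator in the plan to its cheapest engine."""
    return {op: min(COSTS[op], key=COSTS[op].get) for op in plan}

plan = ["scan", "join", "train_ml", "graph_bfs"]
placement = place(plan)
# With these illustrative costs, relational operators land on the FPGA
# while the ML and graph operators are offloaded to the GPU.
```

A real system would also have to account for data-movement costs between engines, which is exactly one of the research problems the paper raises.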
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
Multi-Mode Inference Engine for Convolutional Neural Networks
During the past few years, interest in convolutional neural networks (CNNs)
has risen constantly, thanks to their excellent performance on a wide range of
recognition and classification tasks. However, they suffer from the high level
of complexity imposed by the high-dimensional convolutions in convolutional
layers. Within scenarios with limited hardware resources and tight power and
latency constraints, the high computational complexity of CNNs makes them
difficult to exploit. Hardware solutions have striven to reduce the power
consumption using low-power techniques, and to limit the processing time by
increasing the number of processing elements (PEs). While most of ASIC designs
claim a peak performance of a few hundred giga operations per seconds, their
average performance is substantially lower when applied to state-of-the-art
CNNs such as AlexNet, VGGNet and ResNet, leading to low resource utilization.
Their performance efficiency is limited to less than 55% on average, which
leads to unnecessarily high processing latency and silicon area. In this paper,
we propose a dataflow that enables both the fully-connected and the
convolutional computations to be performed for any filter/layer size using the
same PEs. We
then introduce a multi-mode inference engine (MMIE) based on the proposed
dataflow. Finally, we show that the proposed MMIE achieves a performance
efficiency of more than 84% when performing the computations of the three
renowned CNNs (i.e., AlexNet, VGGNet and ResNet), outperforming the best
architecture in the state-of-the-art in terms of energy consumption, processing
latency and silicon area
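The "performance efficiency" metric discussed above is simply achieved average throughput divided by claimed peak throughput. The following back-of-the-envelope calculation uses invented operation counts, not measurements from the paper, to reproduce the 55% and 84% figures.

```python
# Performance efficiency = achieved average throughput / claimed peak
# throughput. The operation counts below are illustrative assumptions.

def performance_efficiency(ops_done, runtime_s, peak_gops):
    achieved_gops = ops_done / runtime_s / 1e9
    return achieved_gops / peak_gops

# An accelerator claiming 200 GOPS peak that completes 1.1e11 operations
# in one second sustains only 110 GOPS, i.e. 55% efficiency ...
low = performance_efficiency(1.1e11, 1.0, 200)

# ... while one sustaining 1.68e11 ops/s reaches the 84% reported for MMIE.
high = performance_efficiency(1.68e11, 1.0, 200)
```

The gap between the two matters because silicon area and latency are provisioned for the peak, so any shortfall in average throughput is wasted hardware.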
A configurable accelerator for manycores: the Explicitly Many-Processor Approach
A new approach to designing processor accelerators is presented: a new
computing model and a special kind of accelerator with a dynamic (end-user
programmable) architecture. In the new model, a newly introduced supervisor
layer coordinates the work of the cores. Based on parallelization information
provided by the compiler, and with the help of the supervisor, a core can
outsource part of the job it received to a neighbouring core. These changes
modify the architecture and operation of computing systems in essential and
advantageous ways: computing throughput increases drastically, the efficiency
of the technological implementation (computing performance per logic gate)
improves, the non-payload overhead of using operating system services
decreases, real-time behavior improves, and connecting accelerators to the
processor becomes much simpler. Here only some details of the architecture and
operation of the processor are discussed; the rest is described elsewhere.
Comment: 12 pages, 6 figures
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies, and availability of credible
data, an area of artificial intelligence, deep learning, has emerged, and has
demonstrated its ability and effectiveness in solving complex learning problems
not possible before. In particular, convolutional neural networks (CNNs) have
demonstrated their effectiveness in image detection and recognition
applications. However, they require intensive CPU operations and memory
bandwidth that make general CPUs fail to achieve desired performance levels.
Consequently, hardware accelerators that use application specific integrated
circuits (ASICs), field programmable gate arrays (FPGAs), and graphics
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have been recently adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism as well as due to their energy efficiency. In this paper, we review
recent existing techniques for accelerating deep learning networks on FPGAs. We
highlight the key features employed by the various techniques for improving the
acceleration performance. In addition, we provide recommendations for enhancing
the utilization of FPGAs for CNN acceleration. The techniques investigated in
this paper represent the recent trends in FPGA-based accelerators of deep
learning networks. Thus, this review is expected to direct the future advances
on efficient hardware accelerators and to be useful for deep learning
researchers.
Comment: This article has been accepted for publication in IEEE Access (December 2018)
Massively Parallel Processor Architectures for Resource-aware Computing
We present a class of massively parallel processor architectures called
invasive tightly coupled processor arrays (TCPAs). The presented processor
class is a highly parameterizable template, which can be tailored before
runtime to fulfill customers' requirements such as performance, area cost, and
energy efficiency. These programmable accelerators are well suited for
domain-specific computing from the areas of signal, image, and video processing
as well as other streaming processing applications. To overcome future scaling
issues (e.g., power consumption, reliability, resource management, as well as
application parallelization and mapping), TCPAs are inherently designed to
support self-adaptivity and resource awareness at the hardware level. Here,
we follow a recently introduced resource-aware parallel computing paradigm
called invasive computing where an application can dynamically claim, execute,
and release resources. Furthermore, we show how invasive computing can be used
as an enabler for power management. Finally, we will introduce ideas on how to
realize fault-tolerant loop execution on such massively parallel architectures
through employing on-demand spatial redundancies at the processor array level.
Comment: Presented at the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281)
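The invasive-computing protocol described above, where an application dynamically claims, executes on, and releases resources, can be sketched as a three-phase API. The class and method names below are illustrative, not the actual TCPA programming interface.

```python
# Hedged sketch of the invade -> infect -> retreat resource protocol of
# invasive computing; execution on the claimed PEs is simulated sequentially.

class ProcessorArray:
    def __init__(self, num_pes):
        self.free = set(range(num_pes))

    def invade(self, n):
        """Claim up to n processing elements from the free pool."""
        claim = set(list(self.free)[:n])
        self.free -= claim
        return claim

    def infect(self, claim, kernel, data):
        """Run the kernel on the claimed PEs over strided data partitions."""
        chunks = [data[i::len(claim)] for i in range(len(claim))]
        return [kernel(chunk) for chunk in chunks]

    def retreat(self, claim):
        """Release the claimed PEs, e.g. so power management can gate them."""
        self.free |= claim

array = ProcessorArray(num_pes=8)
claim = array.invade(4)                           # claim 4 PEs
partial = array.infect(claim, sum, list(range(100)))
total = sum(partial)                              # 0 + 1 + ... + 99 = 4950
array.retreat(claim)                              # PEs returned to the pool
```

The retreat phase is what enables the power-management use mentioned in the abstract: released PEs can be clock- or power-gated until the next invasion.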
CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis
Designing and implementing efficient, provably correct parallel neural
network processing is challenging. Existing high-level parallel abstractions
like MapReduce are insufficiently expressive, while low-level tools like MPI
and Pthreads leave ML experts repeatedly solving the same design challenges.
Moreover, the diversity and large scale of the data pose a significant
challenge to constructing a flexible and high-performance implementation of
deep learning neural networks. To improve performance while maintaining
scalability, we present CNNLab, a novel deep learning framework using GPU and
FPGA-based accelerators. CNNLab provides a uniform programming model to users
so that the hardware implementation and the scheduling are invisible to the
programmers. At runtime, CNNLab leverages the trade-offs between GPU and FPGA
before offloading the tasks to the accelerators. Experimental results on the
state-of-the-art Nvidia K40 GPU and Altera DE5 FPGA board demonstrate that
CNNLab provides a universal framework with efficient support for diverse
applications without increasing the burden of the programmers. Moreover, we
analyze the detailed quantitative performance, throughput, power, energy, and
performance density for both approaches. These experimental results illuminate
the trade-offs between GPU and FPGA and provide useful practical experience for
the deep learning research community
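The runtime behavior described above, leveraging GPU/FPGA trade-offs before offloading while keeping the hardware choice invisible to the programmer, can be illustrated with a toy per-layer dispatcher. This is not CNNLab's actual scheduler; the layer types and per-layer cost estimates are invented for the example.

```python
# Illustrative runtime dispatcher: each layer goes to the GPU or the FPGA
# depending on whether raw latency or energy efficiency is the objective,
# reflecting the common finding that GPUs win on throughput while FPGAs
# win on energy. All numbers are made-up estimates.

PROFILES = {
    # layer type: (gpu_latency_ms, gpu_energy_mJ, fpga_latency_ms, fpga_energy_mJ)
    "conv": (2.0, 40.0, 5.0, 15.0),
    "pool": (0.5, 10.0, 0.6,  2.0),
    "fc":   (1.0, 20.0, 1.2,  5.0),
}

def dispatch(layers, objective="latency"):
    """Pick an accelerator per layer; the programmer never sees this choice."""
    plan = {}
    for layer in layers:
        gl, ge, fl, fe = PROFILES[layer]
        if objective == "latency":
            plan[layer] = "gpu" if gl <= fl else "fpga"
        else:  # objective == "energy"
            plan[layer] = "gpu" if ge <= fe else "fpga"
    return plan

fast = dispatch(["conv", "pool", "fc"], objective="latency")  # all on GPU
green = dispatch(["conv", "pool", "fc"], objective="energy")  # all on FPGA
```

Hiding this decision behind a uniform programming model is precisely what lets such a framework support diverse applications without burdening the programmer.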