Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the rise of big data applications such as Machine Learning, Speech
Recognition, Artificial Intelligence, and DNA Sequencing in recent years,
computer architecture research communities are facing an explosive growth in
data scale. To achieve high efficiency in data-intensive computing, studies of
heterogeneous accelerators targeting these emerging applications have become a
major topic in the computer architecture domain. At present, heterogeneous
accelerators are mainly implemented with heterogeneous computing units such as
Application-Specific Integrated Circuits (ASICs), Graphics Processing Units
(GPUs), and Field Programmable Gate Arrays (FPGAs). Among these typical
heterogeneous architectures, FPGA-based reconfigurable accelerators offer two
merits: first, the FPGA fabric contains a large number of reconfigurable
circuits that can satisfy the high-performance, low-power requirements of
specific applications; second, FPGA-based reconfigurable architectures enable
rapid prototyping and offer excellent customizability and reconfigurability.
A growing body of acceleration work based on FPGAs and other reconfigurable
architectures has recently appeared at top-tier computer architecture
conferences. To review this recent work on reconfigurable computing
accelerators, this survey takes the latest research on reconfigurable
accelerator architectures and their algorithmic applications as its basis.
We compare active research issues and application domains, and analyze the
advantages, disadvantages, and challenges of reconfigurable accelerators.
Finally, we discuss likely future directions for accelerator architectures,
hoping to provide a reference for computer architecture researchers.
Synergy: A HW/SW Framework for High Throughput CNNs on Embedded Heterogeneous SoC
Convolutional Neural Networks (CNN) have been widely deployed in diverse
application domains. There has been significant progress in accelerating both
their training and inference using high-performance GPUs, FPGAs, and custom
ASICs for datacenter-scale environments. The recent proliferation of mobile and
IoT devices has necessitated real-time, energy-efficient deep neural network
inference on embedded-class, resource-constrained platforms. In this context,
we present {\em Synergy}, an automated, hardware-software co-designed,
pipelined, high-throughput CNN inference framework on embedded heterogeneous
system-on-chip (SoC) architectures (Xilinx Zynq). {\em Synergy} leverages,
through multi-threading, all the available on-chip resources, which include
the dual-core ARM processor along with the FPGA and the NEON SIMD engines as
accelerators. Moreover, {\em Synergy} provides a unified abstraction of the
heterogeneous accelerators (FPGA and NEON) and can adapt to different network
configurations at runtime without changing the underlying hardware accelerator
architecture by balancing workload across accelerators through work-stealing.
{\em Synergy} achieves 7.3X speedup, averaged across seven CNN models, over a
well-optimized software-only solution. {\em Synergy} demonstrates substantially
better throughput and energy-efficiency than contemporary CNN implementations
on the same SoC architecture.
Comment: 34 pages, submitted to ACM Transactions on Embedded Computing Systems (TECS)
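The work-stealing balancing the abstract describes can be illustrated with a minimal C++ sketch; the queue discipline and all names here (Tile, Worker) are illustrative assumptions, not Synergy's actual implementation:

    // Minimal work-stealing sketch: each accelerator worker drains its own
    // queue of convolution tiles and steals from a peer when it runs dry.
    // All names here (Tile, Worker) are illustrative, not Synergy's API.
    #include <cstdio>
    #include <deque>
    #include <mutex>
    #include <optional>
    #include <thread>

    struct Tile { int layer, index; };  // one unit of CNN work

    class Worker {
    public:
        explicit Worker(const char* name) : name_(name) {}

        void push(Tile t) {
            std::lock_guard<std::mutex> lk(m_);
            q_.push_back(t);
        }
        // The owner pops from the front; thieves take from the back,
        // a common discipline that keeps contention low.
        std::optional<Tile> pop_front() {
            std::lock_guard<std::mutex> lk(m_);
            if (q_.empty()) return std::nullopt;
            Tile t = q_.front(); q_.pop_front(); return t;
        }
        std::optional<Tile> steal_back() {
            std::lock_guard<std::mutex> lk(m_);
            if (q_.empty()) return std::nullopt;
            Tile t = q_.back(); q_.pop_back(); return t;
        }

        void run(Worker& victim) {
            while (true) {
                auto t = pop_front();
                if (!t) t = victim.steal_back();  // idle: steal from the peer
                if (!t) break;                    // both queues empty: done
                std::printf("%s: layer %d tile %d\n", name_, t->layer, t->index);
            }
        }

    private:
        const char* name_;
        std::mutex m_;
        std::deque<Tile> q_;
    };

    int main() {
        Worker fpga("FPGA"), neon("NEON");
        for (int i = 0; i < 8; ++i) fpga.push({0, i});   // deliberately skewed split
        for (int i = 8; i < 10; ++i) neon.push({0, i});  // NEON will steal work
        std::thread a([&] { fpga.run(neon); });
        std::thread b([&] { neon.run(fpga); });
        a.join(); b.join();
    }

Because idle workers pull work rather than having it pushed to them, the split adapts at runtime to whichever accelerator is faster, which is the property the framework exploits across network configurations.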
High Level Synthesis with a Dataflow Architectural Template
In this work, we present a new approach to high level synthesis (HLS), where
high level functions are first mapped to an architectural template, before
hardware synthesis is performed. As FPGA platforms are especially suitable for
implementing streaming processing pipelines, we perform transformations on
conventional high level programs, turning them into multi-stage dataflow
engines [1]. This target template naturally overlaps slow memory data
accesses with computations and therefore has much better tolerance towards
memory subsystem latency. Using a state-of-the-art HLS tool for the actual
circuit generation, we observe up to 9x improvement in overall performance when
the dataflow architectural template is used as an intermediate compilation
target.
Comment: Presented at 2nd International Workshop on Overlay Architectures for FPGAs (OLAF 2016), arXiv:1605.0814
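In software terms, the latency tolerance of such a dataflow template comes from decoupling stages with FIFOs, so memory accesses in one stage overlap with computation in the next. A minimal sketch, assuming a bounded FIFO as the stream channel (an illustration of the principle, not the paper's actual template):

    // Two-stage dataflow sketch: the load stage keeps the FIFO full while
    // the compute stage drains it, hiding memory latency behind computation.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    template <typename T>
    class Fifo {  // bounded FIFO modeling an on-chip stream channel
    public:
        explicit Fifo(size_t depth) : depth_(depth) {}
        void write(T v) {
            std::unique_lock<std::mutex> lk(m_);
            not_full_.wait(lk, [&] { return q_.size() < depth_; });
            q_.push(std::move(v));
            not_empty_.notify_one();
        }
        T read() {
            std::unique_lock<std::mutex> lk(m_);
            not_empty_.wait(lk, [&] { return !q_.empty(); });
            T v = std::move(q_.front()); q_.pop();
            not_full_.notify_one();
            return v;
        }
    private:
        size_t depth_;
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable not_empty_, not_full_;
    };

    int main() {
        constexpr int N = 16;
        Fifo<int> stream(4);  // depth bounds how far the loader runs ahead

        std::thread load([&] {     // stage 1: slow memory reads
            for (int i = 0; i < N; ++i) stream.write(i * i);
        });
        std::thread compute([&] {  // stage 2: consumes as data arrives
            long acc = 0;
            for (int i = 0; i < N; ++i) acc += stream.read();
            std::printf("sum of squares = %ld\n", acc);
        });
        load.join(); compute.join();
    }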
Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays
FPGA vendors have recently started focusing on OpenCL for FPGAs because of
its ability to leverage the parallelism inherent to heterogeneous computing
platforms. OpenCL allows programs running on a host computer to launch
accelerator kernels which can be compiled at run-time for a specific
architecture, thus enabling portability. However, the prohibitive compilation
times (specifically the FPGA place and route times) are a major stumbling block
when using OpenCL tools from FPGA vendors. The long compilation times mean that
the tools cannot effectively use just-in-time (JIT) compilation or runtime
performance scaling. Coarse-grained overlays represent a possible solution by
virtue of their coarse granularity and fast compilation. In this paper, we
present a methodology for run-time compilation of OpenCL kernels to a DSP block
based coarse-grained overlay, rather than directly to the fine-grained FPGA
fabric. The proposed methodology allows JIT compilation and on-demand
resource-aware kernel replication to better utilize available overlay
resources, raising the abstraction level while reducing compile times
significantly. We further demonstrate that this approach can even be used for
run-time compilation of OpenCL kernels on the ARM processor of the embedded
heterogeneous Zynq device.
Comment: Presented at 3rd International Workshop on Overlay Architectures for FPGAs (OLAF 2017), arXiv:1704.0880
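The run-time compilation path the methodology builds on is the standard OpenCL one, where kernel source is handed to the driver and compiled for whatever device is present. A condensed host-side sketch using only standard OpenCL API calls (error handling mostly omitted; the overlay-specific back end is not shown):

    // Standard OpenCL run-time (JIT) compilation: kernel source is compiled
    // at run time for the device at hand. The overlay back end described in
    // the paper would sit behind this step in place of full place-and-route.
    #include <CL/cl.h>
    #include <cstdio>

    static const char* kSrc =
        "__kernel void vadd(__global const float* a, __global const float* b,\n"
        "                   __global float* c) {\n"
        "    size_t i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main() {
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

        cl_context ctx = clCreateContext(nullptr, 1, &device,
                                         nullptr, nullptr, nullptr);

        // JIT step: source -> device binary, at run time, not at build time.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc,
                                                    nullptr, nullptr);
        cl_int err = clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
        if (err != CL_SUCCESS) { std::printf("build failed: %d\n", err); return 1; }

        cl_kernel k = clCreateKernel(prog, "vadd", &err);
        std::printf("kernel compiled at run time: %s\n",
                    err == CL_SUCCESS ? "ok" : "failed");

        clReleaseKernel(k); clReleaseProgram(prog); clReleaseContext(ctx);
    }

With a vendor FPGA flow, the clBuildProgram call above can take hours; targeting a pre-placed overlay is what makes invoking it just in time feasible.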
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies and the availability of credible
data, deep learning, an area of artificial intelligence, has emerged and
demonstrated its effectiveness in solving complex learning problems that were
previously intractable. In particular, convolutional neural networks (CNNs)
have demonstrated their effectiveness in image detection and recognition
applications. However, their intensive compute and memory-bandwidth
requirements prevent general-purpose CPUs from achieving the desired
performance levels.
Consequently, hardware accelerators that use application specific integrated
circuits (ASICs), field programmable gate arrays (FPGAs), and graphic
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have been recently adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism as well as due to their energy efficiency. In this paper, we review
recent existing techniques for accelerating deep learning networks on FPGAs. We
highlight the key features employed by the various techniques for improving the
acceleration performance. In addition, we provide recommendations for enhancing
the utilization of FPGAs for CNN acceleration. The techniques investigated in
this paper represent the recent trends in FPGA-based accelerators of deep
learning networks. Thus, this review is expected to direct the future advances
on efficient hardware accelerators and to be useful for deep learning
researchers.
Comment: This article has been accepted for publication in IEEE Access (December 2018)
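A recurring key feature across the surveyed designs is exploiting the parallelism of the convolution loop nest, for example by unrolling a tile of output channels into parallel MAC units. A schematic C++ loop nest with toy dimensions and an illustrative tiling factor (not taken from any specific surveyed accelerator):

    // Schematic of the loop-level parallelism FPGA CNN accelerators exploit:
    // the inner tile of output channels maps to parallel MAC units in
    // hardware (in HLS, this is where an UNROLL pragma would be applied).
    #include <array>
    #include <cstdio>

    constexpr int IC = 4, OC = 8, W = 6, K = 3;  // toy layer dimensions
    constexpr int TILE = 4;                      // parallel MACs per cycle

    int main() {
        std::array<float, IC * W> in{};           // 1-D toy input feature map
        std::array<float, OC * IC * K> wgt{};     // weights
        std::array<float, OC * (W - K + 1)> out{};
        for (auto& v : in)  v = 1.0f;
        for (auto& v : wgt) v = 0.5f;

        const int OW = W - K + 1;
        for (int x = 0; x < OW; ++x)                     // output position
            for (int ot = 0; ot < OC; ot += TILE)        // tile of channels
                for (int oc = ot; oc < ot + TILE; ++oc)  // unrolled in hardware
                    for (int ic = 0; ic < IC; ++ic)
                        for (int k = 0; k < K; ++k)
                            out[oc * OW + x] +=
                                wgt[(oc * IC + ic) * K + k] * in[ic * W + x + k];

        std::printf("out[0] = %.1f\n", out[0]);  // 4 ic * 3 k * 0.5 = 6.0
        return 0;
    }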
Criteria and Approaches for Virtualization on Modern FPGAs
Modern field programmable gate arrays (FPGAs) can deliver high performance in
a wide range of applications, and their computational capacity is becoming
abundant in personal computers. Despite this, FPGA virtualization remains an
emerging research field, and its challenges stem not only from technical
difficulties but also from the lack of clear standards for virtualization. In
this paper, we introduce novel criteria for FPGA virtualization and discuss
several approaches to meeting those criteria. In
addition, we present and describe in detail the specific FPGA virtualization
architecture that we developed on an Intel Arria 10 FPGA. We evaluate our
solution with a combination of applications and microbenchmarks. The results
show that our virtualization solution provides a full abstraction of the FPGA
device from both user and developer perspectives while maintaining reasonable
performance compared to a native FPGA.
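As a rough illustration of what such a full abstraction could look like, the following hypothetical interface hands users isolated virtual FPGA slots while a hypervisor multiplexes the physical device; none of these names come from the paper's Arria 10 implementation:

    // Hypothetical sketch of a "full abstraction": users receive isolated
    // virtual FPGA slots and never touch the physical device directly.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <memory>
    #include <vector>

    struct Bitstream { std::vector<uint8_t> bytes; };  // region image

    class VirtualFpga {  // the only handle a user ever sees
    public:
        virtual ~VirtualFpga() = default;
        virtual bool load(const Bitstream& bs) = 0;  // program the vFPGA
        virtual void run(const void* in, void* out, size_t n) = 0;
    };

    // A trivial software-model slot standing in for a real partial-
    // reconfiguration region behind the hypervisor.
    class ModelSlot : public VirtualFpga {
        bool loaded_ = false;
    public:
        bool load(const Bitstream& bs) override {
            loaded_ = !bs.bytes.empty();
            return loaded_;
        }
        void run(const void* in, void* out, size_t n) override {
            if (loaded_)  // identity "accelerator" placeholder
                std::copy_n(static_cast<const uint8_t*>(in), n,
                            static_cast<uint8_t*>(out));
        }
    };

    class Hypervisor {  // multiplexes the physical device among users
    public:
        std::unique_ptr<VirtualFpga> allocate_slot() {
            return std::make_unique<ModelSlot>();
        }
    };

    int main() {
        Hypervisor hv;
        auto vfpga = hv.allocate_slot();  // user-side view: just a slot
        vfpga->load(Bitstream{{0x01}});
        uint8_t in[4] = {1, 2, 3, 4}, out[4] = {};
        vfpga->run(in, out, 4);
        std::printf("out[3] = %u\n", out[3]);  // prints 4
    }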
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 201
Massively Parallel Processor Architectures for Resource-aware Computing
We present a class of massively parallel processor architectures called
invasive tightly coupled processor arrays (TCPAs). The presented processor
class is a highly parameterizable template, which can be tailored before
runtime to fulfill customers' requirements such as performance, area cost, and
energy efficiency. These programmable accelerators are well suited for
domain-specific computing in the areas of signal, image, and video processing,
as well as other streaming applications. To overcome future scaling issues
(e.g., power consumption, reliability, resource management, as well as
application parallelization and mapping), TCPAs are inherently designed to
support self-adaptivity and resource awareness at the hardware level. Here,
we follow a recently introduced resource-aware parallel computing paradigm
called invasive computing where an application can dynamically claim, execute,
and release resources. Furthermore, we show how invasive computing can be used
as an enabler for power management. Finally, we will introduce ideas on how to
realize fault-tolerant loop execution on such massively parallel architectures
through employing on-demand spatial redundancies at the processor array level.
Comment: Presented at 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281)
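The claim/execute/release cycle of invasive computing can be sketched against a simple resource pool; the ResourcePool type and its methods are illustrative stand-ins, not the actual invasive-computing API:

    // Minimal sketch of the claim / execute / release cycle of invasive
    // computing over a pool of processing elements (PEs), e.g., a TCPA.
    #include <cstdio>
    #include <mutex>

    class ResourcePool {
        int free_;
        std::mutex m_;
    public:
        explicit ResourcePool(int total) : free_(total) {}
        // claim: ask for up to `wanted` PEs, receive what is available now
        int claim(int wanted) {
            std::lock_guard<std::mutex> lk(m_);
            int granted = wanted < free_ ? wanted : free_;
            free_ -= granted;
            return granted;
        }
        // release: return PEs so other applications can claim them
        void release(int n) {
            std::lock_guard<std::mutex> lk(m_);
            free_ += n;
        }
    };

    int main() {
        ResourcePool tcpa(64);     // an 8x8 processor array, for example

        int pes = tcpa.claim(32);  // 1. dynamically claim resources
        std::printf("claimed %d PEs\n", pes);

        // 2. execute: the degree of parallelism adapts to what was granted
        long sum = 0;
        for (int pe = 0; pe < pes; ++pe) sum += pe;  // stand-in for real work
        std::printf("executed on %d PEs (checksum %ld)\n", pes, sum);

        tcpa.release(pes);         // 3. release for other applications
        return 0;
    }

The key point the sketch captures is that the application negotiates for resources and adapts to the grant, rather than assuming a fixed machine size.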
Analytical Cost Metrics: Days of Future Past
As we move towards the exascale era, new architectures must be capable of
running massive computational problems efficiently. Scientists and
researchers are continuously investing in tuning the performance of
extreme-scale computational problems. These problems arise in almost all areas
of computing, ranging from big data analytics, artificial intelligence, search,
machine learning, virtual/augmented reality, computer vision, image/signal
processing to computational science and bioinformatics. With Moore's law
driving the evolution of hardware platforms towards exascale, the dominant
performance metric (time efficiency) has now expanded to also incorporate
power/energy efficiency. Therefore, the major challenge that we face in
computing systems research is: "how to solve massive-scale computational
problems in the most time/power/energy efficient manner?"
The architectures are constantly evolving, making current
performance-optimization strategies less applicable and requiring new
strategies to be invented. The solution is for new architectures, new
programming models, and applications to advance together. Doing this is,
however, extremely hard. There are too many design choices in too many
dimensions. We propose the following strategy
to solve the problem: (i) Models - Develop accurate analytical models (e.g.
execution time, energy, silicon area) to predict the cost of executing a given
program, and (ii) Complete System Design - Simultaneously optimize all the cost
models for the programs (computational problems) to obtain the most
time/area/power/energy efficient solution. Such an optimization problem evokes
the notion of codesign.
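As a toy instance of step (i), an analytical cost model can predict execution time and energy from a few hardware and program parameters; the roofline-style formulas and all constants below are illustrative assumptions, not from the paper:

    // A toy analytical cost model of the kind the authors advocate: predict
    // execution time and energy from a few hardware/program parameters.
    #include <algorithm>
    #include <cstdio>

    struct Machine {
        double peak_flops;  // FLOP/s
        double bandwidth;   // bytes/s
        double power;       // watts (average draw)
    };

    struct Program {
        double flops;       // total floating-point operations
        double bytes;       // total DRAM traffic
    };

    // time = max(compute-bound, memory-bound); energy = power * time
    double predict_time(const Machine& m, const Program& p) {
        return std::max(p.flops / m.peak_flops, p.bytes / m.bandwidth);
    }
    double predict_energy(const Machine& m, const Program& p) {
        return m.power * predict_time(m, p);
    }

    int main() {
        Machine acc{10e12, 900e9, 250.0};  // hypothetical accelerator
        Program gemm{2e12, 48e9};          // hypothetical workload

        std::printf("predicted time:   %.3f s\n", predict_time(acc, gemm));
        std::printf("predicted energy: %.1f J\n", predict_energy(acc, gemm));
        // Step (ii), codesign, would minimize such models jointly over the
        // whole design space rather than evaluating one point.
    }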
NTX: An Energy-efficient Streaming Accelerator for Floating-point Generalized Reduction Workloads in 22nm FD-SOI
Specialized coprocessors for Multiply-Accumulate (MAC) intensive workloads
such as Deep Learning are becoming widespread in SoC platforms, from GPUs to
mobile SoCs. In this paper we revisit NTX (an efficient accelerator developed
for training Deep Neural Networks at scale) as a generalized MAC and reduction
streaming engine. The architecture consists of a set of 32 bit floating-point
streaming co-processors that are loosely coupled to a RISC-V core in charge of
orchestrating data movement and computation. Post-layout results of a recent
silicon implementation in 22 nm FD-SOI technology show the accelerator's
capability to deliver up to 20 Gflop/s at 1.25 GHz and 168 mW. Based on these
results we show that a version of NTX scaled down to 14 nm can achieve a 3x
energy efficiency improvement over contemporary GPUs at 10.4x less silicon
area, and a compute performance of 1.4 Tflop/s for training large
state-of-the-art networks with full floating-point precision. An extended
evaluation of MAC-intensive kernels shows that NTX can consistently achieve up
to 87% of its peak performance across general reduction workloads beyond
machine learning. Its modular architecture enables deployment at different
scales ranging from high-performance GPU-class to low-power embedded scenarios.
Comment: 6 pages, invited paper at DATE 201
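The "generalized MAC and reduction" pattern can be sketched in software as a streaming reduction parameterized over its map and reduce operators; the sketch below is purely illustrative of the pattern, which NTX realizes in hardware:

    // Sketch of a generalized MAC/reduction stream: a multiply-accumulate
    // when op1 = * and op2 = +, but parameterizable over both operators.
    #include <cstdio>
    #include <functional>
    #include <vector>

    // Computes reduce_i op2( acc, op1(a[i], b[i]) ) over two input streams.
    template <typename T, typename Map, typename Reduce>
    T stream_reduce(const std::vector<T>& a, const std::vector<T>& b,
                    T init, Map op1, Reduce op2) {
        T acc = init;
        for (size_t i = 0; i < a.size(); ++i)
            acc = op2(acc, op1(a[i], b[i]));  // one fused op per element
        return acc;
    }

    int main() {
        std::vector<float> a(1024, 0.5f), b(1024, 2.0f);

        // Classic MAC reduction: a dot product, the kernel DNN training
        // workloads are dominated by.
        float dot = stream_reduce(a, b, 0.0f,
                                  std::multiplies<float>(),
                                  std::plus<float>());

        // Same engine, different operators: max of elementwise sums.
        float m = stream_reduce(a, b, -1e30f, std::plus<float>(),
                                [](float x, float y) { return x > y ? x : y; });

        std::printf("dot = %.1f, max = %.1f\n", dot, m);  // 1024.0, 2.5
    }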